# Active Learning for Cost-Sensitive Classification

Journal of Machine Learning Research 20 (2019) 1-50. Submitted 11/17; Revised 3/19; Published 4/19.

Akshay Krishnamurthy (akshay@cs.umass.edu), Microsoft Research, New York, NY 10011
Alekh Agarwal (alekha@microsoft.com), Microsoft Research, Redmond, WA 98052
Tzu-Kuo Huang (tkhuang@protonmail.com), Uber Advanced Technology Center, Pittsburgh, PA 15201
Hal Daumé III (hal@umiacs.umd.edu), Microsoft Research, New York, NY 10011
John Langford (jcl@microsoft.com), Microsoft Research, New York, NY 10011

Editor: Sanjoy Dasgupta

Abstract

We design an active learning algorithm for cost-sensitive multiclass classification: problems where different errors have different costs. Our algorithm, COAL, makes predictions by regressing to each label's cost and predicting the smallest. On a new example, it uses a set of regressors that perform well on past data to estimate possible costs for each label. It queries only the labels that could be the best, ignoring the sure losers. We prove COAL can be efficiently implemented for any regression family that admits squared loss optimization; it also enjoys strong guarantees with respect to predictive performance and labeling effort. We empirically compare COAL to passive learning and several active learning baselines, showing significant improvements in labeling effort and test cost on real-world datasets.

Keywords: Active Learning, Cost-sensitive Learning, Structured Prediction, Statistical Learning Theory, Oracle-based Algorithms.

1. Introduction

The field of active learning studies how to efficiently elicit relevant information so learning algorithms can make good decisions. Almost all active learning algorithms are designed for binary classification problems, leading to the natural question: How can active learning address more complex prediction problems? Multiclass and importance-weighted classification require only minor modifications, but we know of no active learning algorithms that enjoy theoretical guarantees for more complex problems.

One such problem is cost-sensitive multiclass classification (CSMC). In CSMC with K classes, passive learners receive input examples x and cost vectors $c \in \mathbb{R}^K$, where c(y) is the cost of predicting label y on x.¹ A natural design for an active CSMC learner then is to adaptively query the costs of only a (possibly empty) subset of labels on each x. Since measuring label complexity is more nuanced in CSMC (e.g., is it more expensive to query three costs on a single example or one cost on three examples?), we track both the number of examples for which at least one cost is queried and the total number of cost queries issued. The first corresponds to a fixed human effort for inspecting the example. The second captures the additional effort for judging the cost of each prediction, which depends on the number of labels queried. (By querying a label, we mean querying the cost of predicting that label given an example.) In this setup, we develop a new active learning algorithm for CSMC called Cost Overlapped Active Learning (COAL).
COAL assumes access to a set of regression functions and, when processing an example x, it uses the functions with good past performance to compute the range of possible costs that each label might take. Naturally, COAL only queries labels with a large cost range, akin to uncertainty-based approaches in active regression (Castro et al., 2005); but furthermore, it only queries labels that could possibly have the smallest cost, avoiding uncertain but surely suboptimal labels. The key algorithmic innovation is an efficient way to compute the cost range realized by good regressors. This computation, and COAL as a whole, only requires that the regression functions admit efficient squared loss optimization, in contrast with prior algorithms that require 0/1 loss optimization (Beygelzimer et al., 2009; Hanneke, 2014).

Among our results, we prove that when processing n (unlabeled) examples with K classes and a regression class with pseudo-dimension d (see Definition 1):

1. The algorithm needs to solve $O(Kn^5)$ regression problems over the function class (Corollary 4). Thus COAL runs in polynomial time for convex regression sets.

2. With no assumptions on the noise in the problem, the algorithm achieves generalization error $\tilde{O}(\sqrt{Kd/n})$ and requests $\tilde{O}(n\theta_2\sqrt{Kd})$ costs from $\tilde{O}(n\theta_1\sqrt{Kd})$ examples (Theorems 5 and 9), where $\theta_1, \theta_2$ are the disagreement coefficients (Definition 8).² The worst case offers minimal improvement over passive learning, akin to active learning for binary classification.

3. With a Massart-type noise assumption (Assumption 3), the algorithm has generalization error $O(Kd/n)$ while requesting $O(Kd(\theta_2 + K\theta_1)\log n)$ labels from $O(Kd\,\theta_1 \log n)$ examples (Corollary 6, Theorem 10). Thus under favorable conditions, COAL requests exponentially fewer labels than passive learning. We also derive generalization and label complexity bounds under a milder Tsybakov-type noise condition (Assumption 4).

Existing lower bounds from binary classification (Hanneke, 2014) suggest that our results are optimal in their dependence on n, although these lower bounds do not directly apply to our setting. We also discuss some intuitive examples highlighting the benefits of using COAL.

1. Cost here refers to prediction cost and not labeling effort or the cost of acquiring different labels.
2. $\tilde{O}(\cdot)$ suppresses logarithmic dependence on n, K, and d.

CSMC provides a more expressive language for success and failure than multiclass classification, which allows learning algorithms to make the trade-offs necessary for good performance and broadens potential applications. For example, CSMC can naturally express partial failure in hierarchical classification (Silla Jr. and Freitas, 2011). Experimentally, we show that COAL substantially outperforms the passive learning baseline, with orders of magnitude savings in the labeling effort on a number of hierarchical classification datasets (see Figure 1 for a comparison between passive learning and COAL on Reuters text categorization). CSMC also forms the basis of learning to avoid cascading failures in joint prediction tasks like structured prediction and reinforcement learning (Daumé III et al., 2009; Ross and Bagnell, 2014; Chang et al., 2015).

Figure 1: Empirical evaluation of COAL on the Reuters text categorization dataset (test cost versus number of queries, for passive learning and COAL with mellowness 1e-1, 1e-2, and 1e-3). Active learning achieves better test cost than passive learning with a factor of 16 fewer queries. See Section 7 for details.
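To make the hierarchical cost structure mentioned above concrete, here is a minimal sketch (ours, not code from the paper; the toy hierarchy and helper names are purely illustrative) of the kind of tree-distance cost vector used in the Section 7 experiments: the correct label has cost zero and every other label is charged its distance in the label tree.

```python
# A toy label hierarchy: each label maps to its parent (None for the root).
# The hierarchy and the resulting costs are purely illustrative.
PARENT = {
    "root": None,
    "sports": "root", "politics": "root",
    "soccer": "sports", "tennis": "sports",
    "elections": "politics",
}

def path_to_root(label):
    """Return the list of labels from `label` up to the root."""
    path = []
    while label is not None:
        path.append(label)
        label = PARENT[label]
    return path

def tree_distance(a, b):
    """Number of edges between two labels in the hierarchy."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = set(pa) & set(pb)
    # Distance from each label down to their lowest common ancestor.
    return pa.index(next(x for x in pa if x in common)) + \
           pb.index(next(x for x in pb if x in common))

def cost_vector(true_label, labels):
    """CSMC cost vector: zero cost for the true label, tree distance otherwise."""
    return {y: float(tree_distance(true_label, y)) for y in labels}

leaves = ["soccer", "tennis", "elections"]
print(cost_vector("soccer", leaves))
# {'soccer': 0.0, 'tennis': 2.0, 'elections': 4.0}
```

Under such a cost vector, confusing "soccer" with "tennis" is a partial failure (cost 2) while predicting "elections" is a larger one (cost 4), which is exactly the kind of graded feedback CSMC can express and plain multiclass classification cannot.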
As our second application, we consider learning to search algorithms for joint or structured prediction, which operate by a reduction to CSMC. In this reduction, evaluating the cost of a class often involves a computationally expensive rollout, so using an active learning algorithm inside such a passive joint prediction method can lead to significant computational savings. We show that using COAL within the Aggravate algorithm (Ross and Bagnell, 2014; Chang et al., 2015) reduces the number of roll-outs by a factor of 1 4 on several joint prediction tasks. Our code is publicly available as part of the Vowpal Wabbit machine learning library.3 2. Related Work Active learning is a thriving research area with many theoretical and empirical studies. We recommend the survey of Settles (2012) for an overview of more empirical research. We focus here on theoretical results. Our work falls into the framework of disagreement-based active learning, which studies general hypothesis spaces typically in an agnostic setup (see Hanneke (2014) for an excellent survey). Existing results study binary classification, while our work generalizes to CSMC, assuming that we can accurately predict costs using regression functions from our class. One difference that is natural for CSMC is that our query rule checks the range of predicted costs for a label. The other main difference is that we use a square loss oracle to search the version space. In contrast, prior work either explicitly enumerates the version space (Balcan et al., 2006; 3. http://hunch.net/~vw Krishnamurthy, Agarwal, Huang, Daum e III, Langford Zhang and Chaudhuri, 2014) or uses a 0/1 loss classification oracle for the search (Dasgupta et al., 2007; Beygelzimer et al., 2009, 2010; Huang et al., 2015). In most instantiations, the oracle solves an NP-hard problem and so does not directly lead to an efficient algorithm, although practical implementations using heuristics are still quite effective. Our approach instead uses a squared-loss regression oracle, which can be implemented efficiently via convex optimization and leads to a polynomial time algorithm. In addition to disagreement-based approaches, much research has focused on plug-in rules for active learning in binary classification, where one estimates the class-conditional regression function (Castro and Nowak, 2008; Minsker, 2012; Hanneke and Yang, 2012; Carpentier et al., 2017). Apart from Hanneke and Yang (2012), these works make smoothness assumptions and have a nonparametric flavor. Instead, Hanneke and Yang (2012) assume a calibrated surrogate loss and abstract realizable function class, which is more similar to our setting. While the details vary, our work and these prior results employ the same algorithmic recipe of maintaining an implicit version space and querying in a suitably-defined disagreement region. Our work has two notable differences: (1) our algorithm operates in an oracle computational model, only accessing the function class through square loss minimization problems, (2) our results apply to general CSMC, which exhibit significant differences from binary classification. See Subsection 6.1 for further discussion. Focusing on linear representations, Balcan et al. (2007); Balcan and Long (2013) study active learning with distributional assumptions, while the selective sampling framework from the online learning community considers adversarial assumptions (Cavallanti et al., 2011; Dekel et al., 2010; Orabona and Cesa-Bianchi, 2011; Agarwal, 2013). 
These methods use query strategies that are specialized to linear representations and do not naturally generalize to other hypothesis classes. Supervised learning oracles that solve NP-hard optimization problems in the worst case have been used in other problems including contextual bandits (Agarwal et al., 2014; Syrgkanis et al., 2016) and structured prediction (Daum e III et al., 2009). Thus we hope that our work can inspire new algorithms for these settings as well. Lastly, we mention that square loss regression has been used to estimate costs for passive CSMC (Langford and Beygelzimer, 2005), but, to our knowledge, using a square loss oracle for active CSMC is new. Advances over Krishnamurthy et al. (2017). Active learning for CSMC was introduced recently in Krishnamurthy et al. (2017) with an algorithm that also uses cost ranges to decide where to query. They compute cost ranges by using the regression oracle to perform a binary search for the maximum and minimum costs, but this computation results in a sub-optimal label complexity bound. We resolve this sub-optimality with a novel cost range computation that is inspired by the multiplicative weights technique for solving linear programs. This algorithmic improvement also requires a significantly more sophisticated statistical analysis for which we derive a novel uniform Freedman-type inequality for classes with bounded pseudo-dimension. This result may be of independent interest. Krishnamurthy et al. (2017) also introduce an online approximation for additional scalability and use this algorithm for their experiments. Our empirical results use this same online approximation and are slightly more comprehensive. Finally, we also derive general- Active Learning for Cost-Sensitive Classification ization and label complexity bounds for our algorithm in a setting inspired by Tsybakov s low noise condition (Mammen and Tsybakov, 1999; Tsybakov, 2004). Comparison with Foster et al. (2018). In a follow-up to the present paper, Foster et al. (2018) build on our work with a regression-based approach for contextual bandit learning, a problem that bears some similarities to active learning for CSMC. The results are incomparable due to the differences in setting, but it is worth discussing their techniques. As in our paper, Foster et al. (2018) maintain an implicit version space and compute maximum and minimum costs for each label, which they use to make predictions. They resolve the sub-optimality in Krishnamurthy et al. (2017) with epoching, which enables a simpler cost range computation than our multiplicative weights approach. However, epoching incurs an additional log(n) factor in the label complexity, and under low-noise conditions where the overall bound is O(polylog(n)), this yields a polynomially worse guarantee than ours. 3. Problem Setting and Notation We study cost-sensitive multiclass classification (CSMC) problems with K classes, where there is an instance space X, a label space Y = {1, . . . , K}, and a distribution D supported on X [0, 1]K.4 If (x, c) D, we refer to c as the cost-vector, where c(y) is the cost of predicting y Y. A classifier h : X Y has expected cost E(x,c) D[c(h(x))] and we aim to find a classifier with minimal expected cost. Let G {g : X 7 [0, 1]} denote a set of base regressors and let F GK denote a set of vector regressors where the yth coordinate of f F is written as f( ; y). The set of classifiers under consideration is H {hf | f F} where each f defines a classifier hf : X 7 Y by hf(x) argmin y f(x; y). 
(1) When using a set of regression functions for a classification task, it is natural to assume that the expected costs under D can be predicted by some function in the set. This motivates the following realizability assumption. Assumption 1 (Realizability) Define the Bayes-optimal regressor f , which has f (x; y) Ec[c(y)|x], x X (with D(x) > 0), y Y. We assume that f F. While f is always well defined, note that the cost itself may be noisy. In comparison with our assumption, the existence of a zero-cost classifier in H (which is often assumed in active learning) is stronger, while the existence of hf in H is weaker but has not been leveraged in active learning. We also require assumptions on the complexity of the class G for our statistical analysis. To this end, we assume that G is a compact convex subset of L (X) with finite pseudodimension, which is a natural extension of VC-dimension for real-valued predictors. Definition 1 (Pseudo-dimension) The pseudo-dimension Pdim(F) of a function class F : X R is defined as the VC-dimension of the set of threshold functions H+ {(x, ξ) 7 1{f(x) > ξ} : f F} X R {0, 1}. 4. In general, labels just serve as indices for the cost vector in CSMC, and the data distribution is over (x, c) pairs instead of (x, y) pairs as in binary and multiclass classification. Krishnamurthy, Agarwal, Huang, Daum e III, Langford Assumption 2 We assume that G is a compact convex set with Pdim(G) = d < . As an example, linear functions in some basis representation, e.g., g(x) = Pd i=1 wiφi(x), where weights wi are bounded in some norm, have pseudodimension d. In fact, our result can be stated entirely in terms of covering numbers, and we translate to pseudo-dimension using the fact that such classes have parametric covering numbers of the form (1/ε)d. Thus, our results extend to classes with nonparametric growth rates as well (e.g., Holdersmooth functions), although we focus on the parametric case for simplicity. Note that this is a significant departure from Krishnamurthy et al. (2017), which assumed that G was finite. Our assumption that G is a compact convex set introduces a computational challenging of managing this infinitely large set. To address this challenge, we follow the trend in active learning of leveraging existing algorithmic research on supervised learning (Dasgupta et al., 2007; Beygelzimer et al., 2010, 2009) and access G exclusively through a regression oracle. Given an importance-weighted dataset D = {xi, ci, wi}n i=1 where xi X, ci R, wi R+, the regression oracle computes Oracle(D) argmin g G i=1 wi(g(xi) ci)2. (2) Since we assume that G is a compact convex set it is amenable to standard convex optimization techniques, so this imposes no additional restriction. However, in the special case of linear functions, this optimization is just least squares and can be computed in closed form. Note that this is fundamentally different from prior works that use a 0/1-loss minimization oracle (Dasgupta et al., 2007; Beygelzimer et al., 2010, 2009), which involves an NP-hard optimization in most cases of interest. Remark 2 Our assumption that G is convex is only for computational tractability, as it is crucial in the efficient implementation of our query strategy, but is not required for our generalization and label complexity bounds. Unfortunately recent guarantees for learning with non-convex classes (Liang et al., 2015; Rakhlin et al., 2017) do not immediately yield efficient active learning strategies. Note also that Krishnamurthy et al. 
(2017) obtain an efficient algorithm without convexity, but this yields a suboptimal label complexity guarantee.

Given a set of examples and queried costs, we often restrict attention to regression functions that predict these costs well and assess the uncertainty in their predictions given a new example x. For a subset of regressors $G' \subseteq \mathcal{G}$, we measure uncertainty over possible cost values for x with

$$\gamma(x, G') \triangleq c_+(x, G') - c_-(x, G'), \qquad c_+(x, G') \triangleq \max_{g \in G'} g(x), \qquad c_-(x, G') \triangleq \min_{g \in G'} g(x). \qquad (3)$$

For vector regressors $F' \subseteq \mathcal{F}$, we define the cost range for a label y given x as $\gamma(x, y, F') \triangleq \gamma(x, G_{F'}(y))$, where $G_{F'}(y) \triangleq \{f(\cdot; y) \mid f \in F'\}$ are the base regressors induced by $F'$ for y. Note that since we are assuming realizability, whenever $f^\star \in F'$, the quantities $c_+(x, G_{F'}(y))$ and $c_-(x, G_{F'}(y))$ provide valid upper and lower bounds on $\mathbb{E}[c(y) \mid x]$.

To measure the labeling effort, we track the number of examples for which even a single cost is queried as well as the total number of queries. This bookkeeping captures settings where the editorial effort for inspecting an example is high but each cost requires minimal further effort, as well as those where each cost requires substantial effort. Formally, we define $Q_i(y) \in \{0, 1\}$ to be the indicator that the algorithm queries label y on the ith example and measure

$$L_1 \triangleq \sum_{i=1}^{n} \max_{y \in \mathcal{Y}} Q_i(y), \quad \text{and} \quad L_2 \triangleq \sum_{i=1}^{n} \sum_{y \in \mathcal{Y}} Q_i(y). \qquad (4)$$

4. Cost Overlapped Active Learning

The pseudocode for our algorithm, Cost Overlapped Active Learning (COAL), is given in Algorithm 1.

Algorithm 1 Cost Overlapped Active Learning (COAL)
1: Input: Regressors $\mathcal{G}$, failure probability $\delta \le 1/e$.
2: Set $\psi_i = 1/\sqrt{i}$, $\kappa = 3$, $\nu_n = 324(d \log(n) + \log(8Ke(d+1)n^2/\delta))$.
3: Set $\Delta_i = \kappa \min\{\frac{\nu_n}{i-1}, 1\}$.
4: for $i = 1, 2, \ldots, n$ do
5:   $\hat{g}_{i,y} \leftarrow \arg\min_{g \in \mathcal{G}} \hat{R}_i(g; y)$. (See (5).)
6:   Define $f_i \leftarrow \{\hat{g}_{i,y}\}_{y=1}^{K}$.
7:   (Implicitly define) $G_i(y) \leftarrow \{g \in G_{i-1}(y) \mid \hat{R}_i(g; y) \le \hat{R}_i(\hat{g}_{i,y}; y) + \Delta_i\}$.
8:   Receive new example x. $Q_i(y) \leftarrow 0$, $\forall y \in \mathcal{Y}$.
9:   for every $y \in \mathcal{Y}$ do
10:    $\hat{c}_+(y) \leftarrow$ Max Cost$((x, y), \psi_i/4)$ and $\hat{c}_-(y) \leftarrow$ Min Cost$((x, y), \psi_i/4)$.
11:  end for
12:  $Y' \leftarrow \{y \in \mathcal{Y} \mid \hat{c}_-(y) \le \min_{y' \ne y} \hat{c}_+(y')\}$.
13:  if $|Y'| > 1$ then
14:    $Q_i(y) \leftarrow 1$ if $y \in Y'$ and $\hat{c}_+(y) - \hat{c}_-(y) > \psi_i$.
15:  end if
16:  Query costs of each y with $Q_i(y) = 1$.
17: end for

Given an example x, COAL queries the costs of some of the labels y for x. These costs are chosen by (1) computing a set of good regression functions based on the past data (i.e., the version space), (2) computing the range of predictions achievable by these functions for each y, and (3) querying each y that could be the best label and has substantial uncertainty. We now detail each step.

To compute an approximate version space, we first find the regression function that minimizes the empirical risk for each label y, which at round i is:

$$\hat{R}_i(g; y) \triangleq \frac{1}{i-1} \sum_{j=1}^{i-1} (g(x_j) - c_j(y))^2\, Q_j(y). \qquad (5)$$

Recall that $Q_j(y)$ is the indicator that we query label y on the jth example. Computing the minimizer requires one oracle call. We implicitly construct the version space $G_i(y)$ in Line 7 as the surviving regressors with low square loss regret to the empirical risk minimizer. The tolerance on this regret is $\Delta_i$ at round i, which scales like $O(d/i)$, where recall that d is the pseudo-dimension of the class $\mathcal{G}$. COAL then computes the maximum and minimum costs predicted by the version space $G_i(y)$ on the new example x. Since the true expected cost is $f^\star(x; y)$ and, as we will see, $f^\star(\cdot; y) \in G_i(y)$, these quantities serve as a confidence bound for this value.
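To illustrate the query decision in Lines 12-16 of Algorithm 1, the following sketch (ours, not the paper's implementation) computes which labels COAL would query, assuming the per-label intervals are supplied by Max Cost and Min Cost; the function and variable names are hypothetical.

```python
def coal_queries(c_minus, c_plus, psi):
    """Decide which labels to query, given per-label cost intervals.

    c_minus[y], c_plus[y]: approximate min/max cost predicted by the current
    version space for label y (as returned by Min Cost / Max Cost).
    psi: cost-range threshold (psi_i = 1/sqrt(i) in Algorithm 1).
    Returns the set of labels whose cost COAL queries on this example.
    """
    labels = list(c_minus)
    # A label is non-dominated if its lowest plausible cost is no larger than
    # the smallest highest plausible cost among the *other* labels.
    non_dominated = {
        y for y in labels
        if c_minus[y] <= min(c_plus[z] for z in labels if z != y)
    }
    # If only one label survives, it is certainly the best: query nothing.
    if len(non_dominated) <= 1:
        return set()
    # Otherwise query the surviving labels whose cost is still ambiguous.
    return {y for y in non_dominated if c_plus[y] - c_minus[y] > psi}

# Example: labels 0 and 1 both could be best and are uncertain; label 2 is a
# "sure loser" (dominated) despite having a wide interval, so it is skipped.
c_minus = {0: 0.05, 1: 0.10, 2: 0.60}
c_plus  = {0: 0.40, 1: 0.35, 2: 0.95}
print(coal_queries(c_minus, c_plus, psi=0.1))   # {0, 1}
```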
The computation is done by the Max Cost and Min Cost subroutines, which produce approximations to $c_+(x, G_i(y))$ and $c_-(x, G_i(y))$ respectively (see (3)).

Finally, using the predicted costs, COAL issues (possibly zero) queries. The algorithm queries any non-dominated label that has a large cost range, where a label is non-dominated if its estimated minimum cost is smaller than the smallest maximum cost (among all other labels), and the cost range is the difference between the label's estimated maximum and minimum costs. Intuitively, COAL queries the cost of every label which cannot be ruled out as having the smallest cost on x, but only if there is sufficient ambiguity about the actual value of the cost. The idea is that labels with little disagreement do not provide much information for further reducing the version space, since by construction all regressors would suffer similar square loss. Moreover, only the labels that could be the best need to be queried at all, since the cost-sensitive performance of a hypothesis $h_f$ depends only on the label that it predicts. Hence, labels that are dominated or have small cost range need not be queried.

Similar query strategies have been used in prior works on binary and multiclass classification (Orabona and Cesa-Bianchi, 2011; Dekel et al., 2010; Agarwal, 2013), but specialized to linear representations. The key advantage of the linear case is that the set $G_i(y)$ (formally, a different set with similar properties), along with the maximum and minimum costs, has a closed-form expression, so that the algorithms are easily implemented. However, with a general set $\mathcal{G}$ and a regression oracle, computing these confidence intervals is less straightforward. We use the Max Cost and Min Cost subroutines, and discuss this aspect of our algorithm next.

4.1. Efficient Computation of Cost Range

In this section, we describe the Max Cost subroutine, which uses the regression oracle to approximate the maximum cost on label y realized by $G_i(y)$, as defined in (3). The minimum cost computation requires only minor modifications that we discuss at the end of the section.

Describing the algorithm requires some additional notation. Let $\Gamma_j \triangleq \Delta_j + \hat{R}_j(\hat{g}_{j,y}; y)$ be the right-hand side of the constraint defining the version space at round j, where $\hat{g}_{j,y}$ is the ERM at round j for label y, $\hat{R}_j(\cdot; y)$ is the risk functional, and $\Delta_j$ is the radius used in COAL. Note that this quantity can be efficiently computed since $\hat{g}_{j,y}$ can be found with a single oracle call. Due to the requirement that $g \in G_{i-1}(y)$ in the definition of $G_i(y)$, an equivalent representation is $G_i(y) = \bigcap_{j=1}^{i} \{g : \hat{R}_j(g; y) \le \Gamma_j\}$. Our approach is based on the observation that given an example x and a label y at round i, finding a function $g \in G_i(y)$ which predicts the maximum cost for the label y on x is equivalent to solving the minimization problem:

$$\underset{g \in \mathcal{G}}{\text{minimize}} \;\; (g(x) - 1)^2 \quad \text{such that} \quad \forall\, 1 \le j \le i,\;\; \hat{R}_j(g; y) \le \Gamma_j. \qquad (7)$$

Given this observation, our strategy will be to find an approximate solution to the problem (7), and it is not difficult to see that this also yields an approximate value for the maximum predicted cost on x for the label y. In Algorithm 2, we show how to efficiently solve this program using the regression oracle.

Algorithm 2 Max Cost
1: Input: (x, y), tolerance tol, (implicitly) risk functionals $\{\hat{R}_j(\cdot; y)\}_{j=1}^{i}$.
2: Compute $\hat{g}_{j,y} = \arg\min_{g \in \mathcal{G}} \hat{R}_j(g; y)$ for each j.
3: Let $\Delta_j = \kappa \min\{\frac{\nu_n}{j-1}, 1\}$ and $\Gamma_j = \hat{R}_j(\hat{g}_{j,y}; y) + \Delta_j$ for each j.
4: Initialize parameters: $c_\ell \leftarrow 0$, $c_h \leftarrow 1$, $T \leftarrow \log(i+1)\,(12/\Delta_i)^2$, $\eta \leftarrow \sqrt{\log(i+1)/T}$.
5: while $|c_\ell - c_h| > \mathrm{tol}^2/2$ do
6:   $c \leftarrow (c_\ell + c_h)/2$.
7:   $\mu^{(1)} \leftarrow \mathbf{1} \in \mathbb{R}^{i+1}$. Use MW to check feasibility of Program (9):
8:   for $t = 1, \ldots, T$ do
9:     Use the regression oracle to find
        $$g_t \leftarrow \arg\min_{g \in \mathcal{G}} \;\; \mu^{(t)}_0 (g(x) - 1)^2 + \sum_{j=1}^{i} \mu^{(t)}_j \hat{R}_j(g; y). \qquad (6)$$
10:    If the objective in (6) for $g_t$ is at least $\mu^{(t)}_0 c + \sum_{j=1}^{i} \mu^{(t)}_j \Gamma_j$, set $c_\ell \leftarrow c$ and go to Line 5.
11:    Update $\mu^{(t+1)}_j \leftarrow \mu^{(t)}_j \big(1 - \eta(\Gamma_j - \hat{R}_j(g_t; y))\big)$ for each j, and $\mu^{(t+1)}_0 \leftarrow \mu^{(t)}_0 \big(1 - \eta(c - (g_t(x) - 1)^2)\big)$.
12:  end for
13:  $c_h \leftarrow c$.
14: end while
15: Return $\hat{c}_+(y) = 1 - \sqrt{c_\ell}$.

We begin by exploiting the convexity of the set $\mathcal{G}$, meaning that we can further rewrite the optimization problem (7) as

$$\underset{P \in \Delta(\mathcal{G})}{\text{minimize}} \;\; \mathbb{E}_{g \sim P}(g(x) - 1)^2 \quad \text{such that} \quad \forall\, 1 \le j \le i,\;\; \mathbb{E}_{g \sim P}\big[\hat{R}_j(g; y)\big] \le \Gamma_j, \qquad (8)$$

where $\Delta(\mathcal{G})$ denotes the set of probability distributions over $\mathcal{G}$. The above rewriting is effectively cosmetic, as $\mathcal{G} = \mathrm{conv}(\mathcal{G})$ by the definition of convexity, but the upshot is that it results in both the objective and constraints being linear in the optimization variable P. Thus, we effectively wish to solve a linear program in P, with our computational tool being a regression oracle over the set $\mathcal{G}$. To do this, we create a series of feasibility problems, where we repeatedly guess the optimal objective value for the problem (8) and then check whether there is indeed a distribution P which satisfies all the constraints and gives the posited objective value. That is, we check

$$\exists?\; P \in \Delta(\mathcal{G}) \;\; \text{such that} \;\; \mathbb{E}_{g \sim P}(g(x) - 1)^2 \le c \;\;\text{and}\;\; \forall\, 1 \le j \le i,\;\; \mathbb{E}_{g \sim P}\,\hat{R}_j(g; y) \le \Gamma_j. \qquad (9)$$

If we find such a solution, we increase our guess for the maximum cost (i.e., decrease c), and otherwise we reduce it, and we proceed until we localize the optimal value to a small enough interval.

It remains to specify how to solve the feasibility problem (9). Noting that this is a linear feasibility problem, we jointly invoke the Multiplicative Weights (MW) algorithm and the regression oracle in order to either find an approximately feasible solution or certify the problem as infeasible. MW is an iterative algorithm that maintains weights µ over the constraints. At each iteration it (1) collapses the constraints into one by taking a linear combination weighted by µ, (2) checks feasibility of the simpler problem with a single constraint, and (3) if the simpler problem is feasible, it updates the weights using the slack of the proposed solution. Details of steps (1) and (3) are described in Algorithm 2. For step (2), the simpler problem that we must solve takes the form

$$\exists?\; P \in \Delta(\mathcal{G}) \;\; \text{such that} \;\; \mu_0\, \mathbb{E}_{g \sim P}(g(x) - 1)^2 + \sum_{j=1}^{i} \mu_j\, \mathbb{E}_{g \sim P}\,\hat{R}_j(g; y) \;\le\; \mu_0 c + \sum_{j=1}^{i} \mu_j \Gamma_j.$$

This program can be solved by a single call to the regression oracle, since all terms on the left-hand side involve square losses while the right-hand side is a constant. Thus we can efficiently implement the MW algorithm using the regression oracle. Finally, recall that the above description is for a fixed value of the objective c; approximating the maximum by a binary search over c leads to an oracle-based algorithm for computing the maximum cost. For this procedure, we have the following computational guarantee.

Theorem 3 Algorithm 2 returns an estimate $\hat{c}_+(x; y)$ such that $c_+(x; y) \le \hat{c}_+(x; y) \le c_+(x; y) + \mathrm{tol}$, and runs in polynomial time with $O(\max\{1, i^2/\nu_n^2\}\, \log(i) \log(1/\mathrm{tol})/\mathrm{tol}^4)$ calls to the regression oracle.

The minimum cost can be estimated in exactly the same way, replacing the objective $(g(x) - 1)^2$ with $(g(x) - 0)^2$ in Program (7). In COAL, we set $\mathrm{tol} = 1/\sqrt{i}$ at iteration i and have $\nu_n = O(d)$. As a consequence, we can bound the total oracle complexity after processing n examples.

Corollary 4 After processing n examples, COAL makes $O(K(d^3 + n^5/d^2))$ calls to the square loss oracle.
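Before discussing the implications, here is an illustrative sketch (ours, not the paper's implementation) of the binary search plus MW feasibility check with a squared-loss oracle. It assumes an unconstrained linear regressor class solved by weighted least squares, represents the version-space constraints generically as (index set, threshold) pairs, and uses illustrative values of T, eta, and slack clipping rather than the specific constants of Algorithm 2; all function names are ours.

```python
import numpy as np

def ls_oracle(X, targets, weights):
    """Weighted least-squares oracle for a linear regressor g(x) = w @ x.

    Returns argmin_w sum_k weights[k] * (w @ X[k] - targets[k])^2.
    (A stand-in for the abstract square-loss oracle; ignores boundedness of G.)
    """
    W = np.sqrt(np.maximum(weights, 0.0))[:, None]
    sol, *_ = np.linalg.lstsq(W * X, W[:, 0] * targets, rcond=None)
    return sol

def emp_risk(w, X, targets):
    return np.mean((X @ w - targets) ** 2)

def feasible(x, c_guess, constraints, X, costs, T=200, eta=0.05):
    """MW check: is there g with (g(x)-1)^2 <= c_guess and all risks <= Gamma_j?

    constraints: list of (idx, gamma) pairs, one per version-space constraint,
    where idx is a non-empty array of indices into (X, costs).
    Returns (True, g) for an approximately feasible g, or (False, None).
    """
    m = len(constraints)
    mu = np.ones(m + 1)                       # weight 0 is for the objective
    for _ in range(T):
        # Collapse everything into one weighted least-squares problem.
        pts = [x[None, :]]
        tgt = [np.array([1.0])]
        wts = [np.array([mu[0]])]
        for j, (idx, _gamma) in enumerate(constraints, start=1):
            pts.append(X[idx])
            tgt.append(costs[idx])
            wts.append(np.full(len(idx), mu[j] / len(idx)))
        g = ls_oracle(np.vstack(pts), np.concatenate(tgt), np.concatenate(wts))
        # Slack of each constraint at the proposed solution g.
        slack = np.empty(m + 1)
        slack[0] = c_guess - (g @ x - 1.0) ** 2
        for j, (idx, gamma) in enumerate(constraints, start=1):
            slack[j] = gamma - emp_risk(g, X[idx], costs[idx])
        if mu @ slack < 0:                    # even the combined constraint fails
            return False, None                # => the original program is infeasible
        mu *= 1.0 - eta * np.clip(slack, -1.0, 1.0)   # multiplicative update
    return True, g

def max_cost(x, constraints, X, costs, tol=0.05):
    """Binary search over c = (g(x)-1)^2; return an upper bound on the largest
    cost any surviving regressor can predict at x."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol ** 2 / 2:
        mid = (lo + hi) / 2
        ok, _ = feasible(x, mid, constraints, X, costs)
        hi, lo = (mid, lo) if ok else (hi, mid)
    return 1.0 - np.sqrt(lo)
```

The minimum-cost variant follows by replacing the target 1.0 for the point x with 0.0 and returning `np.sqrt(lo)`, mirroring the substitution described above for Program (7).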
Thus COAL can be implemented in polynomial time for any set G that admits efficient square loss optimization. Compared to Krishnamurthy et al. (2017) which required O(n2) oracle calls, the guarantee here is, at face value, worse, since the algorithm is slower. However, the algorithm enforces a much stronger constraint on the version space which leads to a much better statistical analysis, as we will discuss next. Nevertheless, these algorithms that use batch square loss optimization in an iterative or sequential fashion are too computational demanding to scale to larger problems. Our implementation alleviates this with an alternative heuristic approximation based on a sensitivity analysis of the oracle, which we detail in Section 7. 5. Generalization Analysis In this section, we derive generalization guarantees for COAL. We study three settings: one with minimal assumptions and two low-noise settings. Active Learning for Cost-Sensitive Classification Our first low-noise assumption is related to the Massart noise condition (Massart and N ed elec, 2006), which in binary classification posits that the Bayes optimal predictor is bounded away from 1/2 for all x. Our condition generalizes this to CSMC and posits that the expected cost of the best label is separated from the expected cost of all other labels. Assumption 3 A distribution D supported over (x, c) pairs satisfies the Massart noise condition with parameter τ > 0, if for all x (with D(x) > 0), f (x; y (x)) min y =y (x) f (x; y) τ, where y (x) argminy f (x; y) is the true best label for x. The Massart noise condition describes favorable prediction problems that lead to sharper generalization and label complexity bounds for COAL. We also study a milder noise assumption, inspired by the Tsybakov condition (Mammen and Tsybakov, 1999; Tsybakov, 2004), again generalized to CSMC. See also Agarwal (2013). Assumption 4 A distribution D supported over (x, c) pairs satisfies the Tsbyakov noise condition with parameters (τ0, α, β) if for all 0 τ τ0, min y =y (x) f (x; y) f (x; y (x)) τ βτ α, where y (x) argminy f (x; y). Observe that the Massart noise condition in Assumption 3 is a limiting case of the Tsybakov condition, with τ = τ0 and α . The Tsybakov condition states that it is polynomially unlikely for the cost of the best label to be close to the cost of the other labels. This condition has been used in previous work on cost-sensitive active learning (Agarwal, 2013) and is also related to the condition studied by Castro and Nowak (2008) with the translation that α = 1 κ 1, where κ [0, 1] is their noise level. Our generalization bound is stated in terms of the noise level in the problem so that they can be readily adapted to the favorable assumptions. We define the noise level using the following quantity, given any ζ > 0. min y =y (x) f (x; y) f (x; y (x)) ζ . (10) Pζ describes the probability that the expected cost of the best label is close to the expected cost of the second best label. When Pζ is small for large ζ the labels are well-separated so learning is easier. For instance, under a Massart condition Pζ = 0 for all ζ τ. We now state our generalization guarantee. Theorem 5 For any δ < 1/e, for all i [n], with probability at least 1 δ, we have Ex,c[c(hfi+1(x)) c(hf (x))] min ζ>0 ζPζ + 32Kνn where νn, fi are defined in Algorithm 1, and hfi is defined in (1). 
Krishnamurthy, Agarwal, Huang, Daum e III, Langford In the worst case, we bound Pζ by 1 and optimize for ζ to obtain an O( p Kd log(1/δ)/i) bound after i samples, where recall that d is the pseudo-dimension of G. This agrees with the standard generalization bound of O( p Pdim(F) log(1/δ)/i) for VC-type classes because F = GK has O(Kd) statistical complexity. However, since the bound captures the difficulty of the CSMC problem as measured by Pζ, we can obtain sharper results under Assumptions 3 and 4 by appropriately setting ζ. Corollary 6 Under Assumption 3, for any δ < 1/e, with probability at least 1 δ, for all i [n], we have Ex,c[c(hfi+1(x)) c(hf (x))] 32Kνn Corollary 7 Under Assumption 4, for any δ < 1/e, with probability at least 1 δ, for all 32Kνn βτ α+2 0 i n, we have Ex,c[c(hfi+1(x)) c(hf (x))] 2β 1 α+2 32Kνn Thus, Massart and Tsybakov-type conditions lead to a faster convergence rate of O(1/n) and O(n α+1 α+2 ). This agrees with the literature on active learning for classification (Massart and N ed elec, 2006) and can be viewed as a generalization to CSMC. Both generalization bounds match the optimal rates for binary classification under the analogous low-noise assumptions (Massart and N ed elec, 2006; Tsybakov, 2004). We emphasize that COAL obtains these bounds as is, without changing any parameters, and hence COAL is adaptive to favorable noise conditions. 6. Label Complexity Analysis Without distributional assumptions, the label complexity of COAL can be O(n), just as in the binary classification case, since there may always be confusing labels that force querying. In line with prior work, we introduce two disagreement coefficients that characterize favorable distributional properties. We first define a set of good classifiers, the cost-sensitive regret ball: Fcsr(r) n f F E [c(hf(x)) c(hf (x))] r o . We also recall our earlier notation γ(x, y, F) (see (3) and the subsequent discussion) for a subset F F which indicates the range of expected costs for (x, y) as predicted by the regressors corresponding to the classifiers in F. We now define the disagreement coefficients. Definition 8 (Disagreement coefficients) Define DIS(r, y) x | f, f Fcsr(r), hf(x) = y = hf (x) . Active Learning for Cost-Sensitive Classification Then the disagreement coefficients are defined as: θ1 sup ψ,r>0 r P ( y | γ(x, y, Fcsr(r)) > ψ x DIS(r, y)) , θ2 sup ψ,r>0 y P (γ(x, y, Fcsr(r)) > ψ x DIS(r, y)) . Intuitively, the conditions in both coefficients correspond to the checks on the domination and cost range of a label in Lines 12 and 14 of Algorithm 1. Specifically, when x DIS(r, y), there is confusion about whether y is the optimal label or not, and hence y is not dominated. The condition on γ(x, y, Fcsr(r)) additionally captures the fact that a small cost range provides little information, even when y is non-dominated. Collectively, the coefficients capture the probability of an example x where the good classifiers disagree on x in both predicted costs and labels. Importantly, the notion of good classifiers is via the algorithmindependent set Fcsr(r), and is only a property of F and the data distribution. The definitions are a natural adaptation from binary classification (Hanneke, 2014), where a similar disagreement region to DIS(r, y) is used. Our definition asks for confusion about the optimality of a specific label y, which provides more detailed information about the cost-structure than simply asking for any confusion among the good classifiers. 
The 1/r scaling is in agreement with previous related definitions (Hanneke, 2014), and we also scale by the cost range parameter ψ, so that the favorable settings for active learning can be concisely expressed as having θ1, θ2 bounded, as opposed to a complex function of ψ. The next three results bound the labeling effort (4), in the high noise and low noise cases respectively. The low noise assumptions enable significantly sharper bounds. Before stating the bounds, we recall that L1 corresponds to the number of examples where at least one cost is queried, while L2 is the total number of costs queried across all examples. Theorem 9 With probability at least 1 δ, the label complexity of the algorithm over n examples is at most L1 = O nθ1 p Kνn + log(1/δ) , L2 = O nθ2 p Kνn + K log(1/δ) . Theorem 10 Assume the Massart noise condition holds. With probability at least 1 δ the label complexity of the algorithm over n examples is at most L1 = O K log(n)νn τ 2 θ1 + log(1/δ) , L2 = O (KL1) Theorem 11 Assume the Tsybakov noise condition holds. With probability at least 1 δ the label complexity of the algorithm over n examples is at most α α+1 1 (Kνn) α α+2 n 2 α+2 + log(1/δ) , L2 = O (KL1) In the high-noise case, the bounds scales with nθ for the respective coefficients. In comparison, for binary classification the leading term is O (nθerror(hf )) which involves a Krishnamurthy, Agarwal, Huang, Daum e III, Langford different disagreement coefficient and which scales with the error of the optimal classifier hf (Hanneke, 2014; Huang et al., 2015). Qualitatively the bounds have similar worstcase behavior, demonstrating minimal improvement over passive learning, but by scaling with error(hf ) the binary classification bound reflects improvements on benign instances. For the special case of multiclass classification, we are able to recover the dependence on error(hf ) and the standard disagreement coefficient with a simple modification to our proof, which we discuss in detail in the next subsection. On the other hand, in both low noise cases the label complexity scales sublinearly with n. With bounded disagreement coefficients, this improves over the standard passive learning analysis where all labels are queried on n examples to achieve the generalization guarantees in Theorem 5, Corollary 6, and Corollary 7 respectively. In particular, under the Massart condition, both L1 and L2 bounds scale with θ log(n) for the respective disagreement coefficients, which is an exponential improvement over the passive learning analysis. Under the milder Tsybakov condition, the bounds scale with θ α α+1 n 2 α+2 , which improves polynomially over passive learning. These label complexity bounds agree with analogous results from binary classification (Castro and Nowak, 2008; Hanneke, 2014; Hanneke and Yang, 2015) in their dependence on n. Note that θ2 Kθ1 always and it can be much smaller, as demonstrated through an example in the next section. In such cases, only a few labels are ever queried and the L2 bound in the high noise case reflects this additional savings over passive learning. Unfortunately, in low noise conditions, we do not benefit when θ2 Kθ1. This can be resolved by letting ψi in the algorithm depend on the noise level τ, but we prefer to use the more robust choice ψi = 1/ i which still allows COAL to partially adapt to low noise and achieve low label complexity. The main improvement over Krishnamurthy et al. (2017) is demonstrated in the label complexity bounds under low noise assumptions. 
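As a small aside (our derivation, not spelled out in the paper), the claim above that $\theta_2 \le K\theta_1$ always holds follows from a union-bound-style comparison of the two probabilities in Definition 8: for any fixed r and ψ,

```latex
\sum_{y \in \mathcal{Y}} \Pr\!\left(\gamma(x, y, \mathcal{F}_{\mathrm{csr}}(r)) > \psi \;\wedge\; x \in \mathrm{DIS}(r, y)\right)
\;\le\; K \max_{y \in \mathcal{Y}} \Pr\!\left(\gamma(x, y, \mathcal{F}_{\mathrm{csr}}(r)) > \psi \;\wedge\; x \in \mathrm{DIS}(r, y)\right)
\;\le\; K \,\Pr\!\left(\exists\, y:\; \gamma(x, y, \mathcal{F}_{\mathrm{csr}}(r)) > \psi \;\wedge\; x \in \mathrm{DIS}(r, y)\right).
```

The common scaling in r and ψ from Definition 8 multiplies both sides identically, so taking the supremum over r and ψ preserves the inequality and yields $\theta_2 \le K\theta_1$.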
For example, under Massart noise, our bound has the optimal log(n)/τ 2 rate, while the bound in Krishnamurthy et al. (2017) is exponentially worse, scaling with nβ/τ 2 for β (0, 1). This improvement comes from explicitly enforcing monotonicity of the version space, so that once a regressor is eliminated it can never force COAL to query again. Algorithmically, computing the maximum and minimum costs with the monotonicity constraint is much more challenging and requires the new subroutine using MW. 6.1. Recovering Hanneke s Disagreement Coefficient In this subsection we show that in many cases we can obtain guarantees in terms of Hanneke s disagreement coefficient (Hanneke, 2014), which has been used extensively in active learning for binary classification. We also show that, for multiclass classification, the label complexity scales with the error of the optimal classifier h , a refinement on Theorem 9. The guarantees require no modifications to the algorithm and enable a precise comparison with prior results. Unfortunately, they do not apply to the general CSMC setting, so they have not been incorporated into our main theorems. We start with defining Hanneke s disagreement coefficient (Hanneke, 2014). Define the disagreement ball e F(r) {f F : P[hf(x) = hf (x)] r} and the disagreement region Active Learning for Cost-Sensitive Classification g DIS(r) {x | f, f e F(r), hf(x) = hf (x)}. The coefficient is defined as 1 r P h x g DIS(r) i . (11) This coefficient is known to be O(1) in many cases, for example when the hypothesis class consists of linear separators and the marginal distribution is uniform over the unit sphere (Hanneke, 2014, Chapter 7). In comparison with Definition 8, the two differences are that θ1, θ2 include the cost-range condition and involve the cost-sensitive regret ball Fcsr(r) rather than e F(r). As e F(r) Fcsr(r), we expect that θ1 and θ2 are typically larger than θ0, so bounds in terms of θ0 are more desirable. We now show that such guarantees are possible in many cases. The low noise case. For general CSMC, low noise conditions admit the following: Proposition 12 Under Massart noise, with probability at least 1 δ the label complexity of the algorithm over n examples is at most L1 = O log(n)νn τ 2 θ0 + log(1/δ) . Under Tsybakov noise, the label complexity is at most L1 = O θ0n 2 α+2 (log(n)νn) α α+2 + log(1/δ) . In both cases we have L2 = O(KL1). That is, for any low noise CSMC problem, COAL obtains a label complexity bound in terms of Hanneke s disagreement coefficient θ0 directly. Note that this adaptivity requires no change to the algorithm. Proposition 12 enables a precise comparison with disagreementbased active learning for binary classification. In particular, this bound matches the guarantee for CAL (Hanneke, 2014, Theorem 5.4) with the caveat that our measure of statistical complexity is the pseudodimension of the F instead of the VC-dimension of the hypothesis class. As a consequence, under low noise assumptions, COAL has favorable label complexity in all examples where θ0 is small. The high noise case. Outside of the low noise setting, we can introduce θ0 into our bounds, but only for multiclass classification, where we always have c 1 ey for some y [K]. Note that f(x; y) is now interpreted as a prediction for 1 P(y|x), so that the least cost prediction y (x) corresponds to the most likely label. We also obtain a further refinement by introducing error(hf ) E(x,c)[c(hf (x))]. 
Proposition 13 For multiclass classification, with probability at least 1 δ, the label complexity of the algorithm over n examples is at most L1 = 4θ0n error(hf ) + O θ0 Knνn error(hf ) + Kκνn log(n) + log(1/δ) . This result exploits two properties of the multiclass cost structure. First we can relate Fcsr(r) to the disagreement ball e F(r), which lets us introduce Hanneke s disagreement coefficient θ0. Second, we can bound Pζ in Theorem 5 in terms of error(hf ). Together the bound is comparable to prior results for active learning in binary classification (Hsu, 2010; Hanneke and Yang, 2012; Hanneke, 2014), with a slight generalization to the multiclass setting. Unfortunately, both of these refinements do not apply for general CSMC. Krishnamurthy, Agarwal, Huang, Daum e III, Langford Summary. In important special cases, COAL achieves label complexity bounds directly comparable with results for active learning in binary classification, scaling with θ0 and error(hf ). In such cases, whenever θ0 is bounded for which many examples are known COAL has favorable label complexity. However, in general CSMC without low-noise assumptions, we are not able to obtain a bound in terms of these quantities, and we believe a bound involving θ0 does not hold for COAL. We leave understanding natural settings where θ1 and θ2 are small, or obtaining sharper guarantees as intriguing future directions. 6.2. Three Examples We now describe three examples to give more intuition for COAL and our label complexity bounds. Even in the low noise case, our label complexity analysis does not demonstrate all of the potential benefits of our query rule. In this section we give three examples to further demonstrate these advantages. Our first example shows the benefits of using the domination criterion in querying, in addition to the cost range condition. Consider a problem under Assumption 3, where the optimal cost is predicted perfectly, the second best cost is τ worse and all the other costs are substantially worse, but with variability in the predictions. Since all classifiers predict the correct label, we get θ1 = θ2 = 0, so our label complexity bound is O(1). Intuitively, since every regressor is certain of the optimal label and its cost, we actually make zero queries. On the other hand, all of the suboptimal labels have large cost ranges, so querying based solely on a cost range criteria, as would happen with an active regression algorithm (Castro et al., 2005), leads to a large label complexity. A related example demonstrates the improvement in our query rule over more na ıve approaches where we query either no label or all labels, which is the natural generalization of query rules from multiclass classification (Agarwal, 2013). In the above example, if the best and second best labels are confused occasionally θ1 may be large, but we expect θ2 Kθ1 since no other label can be confused with the best. Thus, the L2 bound in Theorem 9 is a factor of K smaller than with a na ıve query rule since COAL only queries the best and second best labels. Unfortunately, without setting ψi as a function of the noise parameters, the bounds in the low noise cases do not reflect this behavior. The third example shows that both θ0 and θ1 yield pessimistic bounds on the label complexity of COAL in some cases. The example is more involved, so we describe it in detail. We focus on statistical issues, using a finite regressor class F. 
Note that our results on generalization and label complexity hold in this setting, replacing d log(n) with log |F|, and the algorithm can be implemented by enumerating F. Throughout this example, we use O( ) to further suppress logarithmic dependence on n. Let X {x1, . . . , x M}, Y {0, 1}, and consider functions F {f , f1, . . . , f M}. We have f (x) (1/4, 1/2), x X and fi(xi) (1/4, 0) and fi(xj) (1/4, 1) for i = j. The marginal distribution is uniform and the true expected costs are given by f so that the problem satisfies the Massart noise condition with τ = 1/4. The key to the construction is that fis have high square loss on labels that they do not predict. Observe that as P[hfi(x) = hf (x)] = 1/M and hfi(xi) = hf (xi) for all i, the probability of disagreement is 1 until all fi are eliminated. As such, we have θ0 = M. Similarly, we have E[c(hfi(x)) c(hf (x))] = 1 4M and γ(x, 1, Fcsr( 1 4M )) = 1, so θ1 = 4M. Therefore, the Active Learning for Cost-Sensitive Classification bounds in Theorem 10 and Proposition 12 are both O(M log |F|) = O(|F|). On the other hand, since (fi(xj, 1) f (xj, 1))2 = 1/4 for all i, j [M], COAL eliminates every fi once it has made a total of O(log |F|) queries to label y = 1. Thus the label complexity is actually just O(log |F|), which is exponentially better than the disagreement-based analyses. Thus, COAL can perform much better than suggested by the disagreement-based analyses, and an interesting future direction is to obtain refined guarantees for cost-sensitive active learning. 7. Experiments We now turn to an empirical evaluation of COAL. For further computational efficiency, we implemented an approximate version of COAL using: 1) a relaxed version space Gi(y) {g G | b Ri(g; y) b Ri(gi,y; y) + i}, which does not enforce monotonicity, and 2) online optimization, based on online linear least-squares regression. The algorithm processes the data in one pass, and the idea is to (1) replace gi,y, the ERM, with an approximation go i,y obtained by online updates, and (2) compute the minimum and maximum costs via a sensitivity analysis of the online update. We describe this algorithm in detail in Subsection 7.1. Then, we present our experimental results, first for simulated active learning (Subsection 7.2) and then for learning to search for joint prediction (Subsection 7.3). 7.1. Finding Cost Ranges with Online Approximation Consider the maximum and minimum costs for a fixed example x and label y at round i, all of which may be suppressed. We ignore all the constraints on the empirical square losses for the past rounds. First, define b R(g, w, c; y) b R(g; y) + w(g(x) c)2, which is the risk functional augmented with a fake example with weight w and cost c. Also define gw arg min g G b R(g, w, 0; y), gw arg min g G b R(g, w, 1; y), and recall that gi,y is the ERM given in Algorithm 1. The functional b R(g, w, c; y) has a monotonicity property that we exploit here, proved in Appendix C. Lemma 14 For any c and for w w 0, define g = argming b R(g, w, c) and g = argming b R(g, w , c). Then b R(g ) b R(g) and (g (x) c)2 (g(x) c)2. As a result, an alternative to Min Cost and Max Cost is to find w max{w | b R(gw) b R(gi,y) i}, (12) w max{w | b R(gw) b R(gi,y) i}, (13) and return gw(x) and gw(x) as the minimum and maximum costs. We use two steps of approximation here. Using the definition of gw and gw as the minimizers of b R(g, w, 1; y) and b R(g, w, 0; y) respectively, we have b R(gw) b R(gi,y) w gi,y(x)2 w gw(x)2, b R(gw) b R(gi,y) w (gi,y(x) 1)2 w (gw(x) 1)2. 
Krishnamurthy, Agarwal, Huang, Daum e III, Langford We use this upper bound in place of b R(gw) b R(gi,y) in (12) and (13). Second, we replace gi,y, gw, and gw with approximations obtained by online updates. More specifically, we replace gi,y with go i,y, the current regressor produced by all online linear least squares updates so far, and approximate the others by gw(x) go i,y(x) w s(x, 0, go i,y), gw(x) go i,y(x) + w s(x, 1, go i,y), where s(x, y, go i,y) 0 is a sensitivity value that approximates the change in prediction on x resulting from an online update to go i,y with features x and label y. The computation of this sensitivity value is governed by the actual online update where we compute the derivative of the change in the prediction as a function of the importance weight w for a hypothetical example with cost 0 or cost 1 and the same features. This is possible for essentially all online update rules on importance weighted examples, and it corresponds to taking the limit as w 0 of the change in prediction due to an update, divided by w. Since we are using linear representations, this requires only O(s) time per example, where s is the average number of non-zero features. With these two steps, we obtain approximate minimum and maximum costs using go i,y(x) wo s(x, 0, go i,y), go i,y(x) + wo s(x, 1, go i,y), wo max{w | w go i,y,(x)2 (go i,y(x) w s(x, 0, go i,y))2 i} wo max{w | w (go i,y,(x) 1)2 (go i,y(x) + w s(x, 1, go i,y) 1)2 i}. The online update guarantees that go i,y(x) [0, 1]. Since the minimum cost is lower bounded by 0, we have wo 0, go i,y(x) s(x,0,go i,y) i . Finally, because the objective w(go i,y(x))2 w(go i,y(x) w s(x, 0, go i,y))2 is increasing in w within this range (which can be seen by inspecting the derivative), we can find wo with binary search. Using the same techniques, we also obtain an approximate maximum cost. 7.2. Simulated Active Learning We performed simulated active learning experiments with three datasets. Image Net 20 and 40 are sub-trees of the Image Net hierarchy covering the 20 and 40 most frequent classes, where each example has a single zero-cost label, and the cost for an incorrect label is the tree-distance to the correct one. The feature vectors are the top layer of the Inception neural network (Szegedy et al., 2015). The third, RCV1-v2 (Lewis et al., 2004), is a multilabel textcategorization dataset, which has 103 labels, organized as a tree with a similar tree-distance cost structure as the Image Net data. Some dataset statistics are in Table 1. We compare our online version of COAL to passive online learning. We use the costsensitive one-against-all (csoaa) implementation in Vowpal Wabbit5, which performs online linear regression for each label separately. There are two tuning parameters in our implementation. First, instead of i, we set the radius of the version space to i = κνi 1 i 1 (i.e. the log(n) term in the definition of νn is replaced with log(i)) and instead tune the constant 5. 
http://hunch.net/~vw

This constant κ is an alternate mellowness parameter that controls how aggressive the query strategy is. The second parameter is the learning rate used by online linear regression.6

6. We use the default online learning algorithm in Vowpal Wabbit, which is a scale-free (Ross et al., 2013), importance-weight-invariant (Karampatziakis and Langford, 2011) form of AdaGrad (Duchi et al., 2010).

| Dataset      | K   | n    | feat | density |
|--------------|-----|------|------|---------|
| Image Net 20 | 20  | 38k  | 6k   | 21.1%   |
| Image Net 40 | 40  | 71k  | 6k   | 21.0%   |
| RCV1-v2      | 103 | 781k | 47k  | 0.16%   |

| Dataset | K  | n    | feat | len |
|---------|----|------|------|-----|
| POS     | 45 | 38k  | 40k  | 24  |
| NER     | 9  | 15k  | 15k  | 14  |
| Wiki    | 9  | 132k | 89k  | 25  |

Table 1: Dataset statistics. len is the average sequence length and density is the percentage of non-zero features.

For all experiments, we show the results obtained by the best learning rate for each mellowness on each dataset, which is tuned as follows. We randomly permute the training data 100 times and make one pass through the training set with each parameter setting. For each dataset, let perf(mel, l, q, t) denote the test performance of the algorithm using mellowness mel and learning rate l on the tth permutation of the training data under a query budget of $2^{q-1} \cdot 10 \cdot K$, $q \ge 1$. Let query(mel, l, q, t) denote the number of queries actually made. Note that query(mel, l, q, t) $< 2^{q-1} \cdot 10 \cdot K$ if the algorithm runs out of training data before reaching the qth query budget.7 To evaluate the trade-off between test performance and number of queries, we define the following performance measure:

$$\mathrm{AUC}(\mathrm{mel}, l, t) = \sum_{q=1}^{q_{\max}-1} \frac{\mathrm{perf}(\mathrm{mel}, l, q+1, t) + \mathrm{perf}(\mathrm{mel}, l, q, t)}{2}\, \log_2\!\frac{\mathrm{query}(\mathrm{mel}, l, q+1, t)}{\mathrm{query}(\mathrm{mel}, l, q, t)}, \qquad (14)$$

where $q_{\max}$ is the minimum q such that $2^{q-1} \cdot 10$ is larger than the size of the training data. This performance measure is the area under the curve of test performance against number of queries in $\log_2$ scale. A large value means the test performance quickly improves with the number of queries. The best learning rate for mellowness mel is then chosen as $l^\star(\mathrm{mel}) \triangleq \arg\max_l \operatorname{median}_{1 \le t \le 100} \mathrm{AUC}(\mathrm{mel}, l, t)$. The best learning rates for different datasets and mellowness settings are in Table 2.

7. In fact, we check the test performance only in between examples, so query(mel, l, q, t) may be larger than $2^{q-1} \cdot 10 \cdot K$ by an additive factor of K, which is negligibly small.

|                | Image Net 20 | Image Net 40 | RCV1-v2 | POS | NER | NER-wiki |
|----------------|--------------|--------------|---------|-----|-----|----------|
| passive        | 1            | 1            | 0.5     | 1.0 | 0.5 | 0.5      |
| active (1e-1)  | 0.05         | 0.1          | 0.5     | 1.0 | 0.1 | 0.5      |
| active (1e-2)  | 0.05         | 0.5          | 0.5     | 1.0 | 0.5 | 0.5      |
| active (1e-3)  | 1            | 10           | 0.5     | 10  | 0.5 | 0.5      |

Table 2: Best learning rates for each learning algorithm and each dataset.

In the top row of Figure 2, we plot, for each dataset and mellowness, the number of queries against the median test cost, along with bars extending from the 15th to 85th quantile.

Figure 2: Experiments with COAL (panels for Image Net 20, Image Net 40, and RCV1-v2; curves for passive learning and COAL with mellowness 1e-1, 1e-2, 1e-3). Top row shows test cost vs. number of queries for simulated active learning experiments. Bottom row shows test cost vs. number of examples with even a single label query for simulated active learning experiments.

Overall, COAL achieves a better trade-off between performance and queries. With a proper mellowness parameter, active learning achieves similar test cost as passive learning with a factor of 8 to 32 fewer queries. On Image Net 40 and RCV1-v2 (reproduced in Figure 1), active learning achieves better test cost with a factor of 16 fewer queries. On RCV1-v2, COAL queries like passive up to around 256k queries, since the data is very sparse, and linear regression has the property that the cost range is maximal when an example has a new unseen feature. Once COAL sees all features a few times, it queries much more efficiently than passive. These plots correspond to the label complexity L2.

In the bottom row, we plot the test error as a function of the number of examples for which at least one query was requested, for each dataset and mellowness, which experimentally corresponds to the L1 label complexity. In comparison to the top row, the improvements offered by active learning are slightly less dramatic here. This suggests that our algorithm queries just a few labels for each example, but does end up issuing at least one query on most of the examples. Nevertheless, one can still achieve test cost competitive with passive learning using a factor of 2-16 less labeling effort, as measured by L1.

We also compare COAL with two active learning baselines. Both algorithms differ from COAL only in their query rule. All Or None queries either all labels or no labels using both domination and cost-range conditions and is an adaptation of existing multiclass active learners (Agarwal, 2013). No Dom just uses the cost-range condition, inspired by active regression (Castro et al., 2005). The results for Image Net 40 and RCV1-v2 are displayed in Figure 3, where we use the AUC strategy to choose the learning rate. We choose the mellowness by visual inspection for the baselines and use 0.01 for COAL.8 On Image Net 40, the ablations provide minimal improvement over passive learning, while on RCV1-v2, All Or None does provide marginal improvement. However, on both datasets, COAL substantially outperforms both baselines and passive learning.

Figure 3: Test cost versus number of queries for COAL, in comparison with active and passive baselines (All Or None, No Dom) on the Image Net 40 and RCV1-v2 datasets. On RCV1-v2, passive learning and No Dom are nearly identical.

While not always the best, we recommend a mellowness setting of 0.01 as it achieves reasonable performance on all three datasets. This is also confirmed by the learning-to-search experiments, which we discuss next.

7.3. Learning to Search

We also experiment with COAL as the base learner in learning-to-search (Daumé III et al., 2009; Chang et al., 2015), which reduces joint prediction problems to CSMC. A joint prediction example defines a search space, where a sequence of decisions are made to generate the structured label. We focus here on sequence labeling tasks, where the input is a sentence and the output is a sequence of labels, specifically, parts of speech or named entities.
Learning-to-search solves such problems by generating the output one label at a time, conditioning on all past decisions. Since mistakes may lead to compounding errors, it is natural to represent the decision space as a CSMC problem, where the classes are the actions available (e.g., possible labels for a word) and the costs reflect the long-term loss of each choice. Intuitively, we should be able to avoid expensive computation of the long-term loss on decisions like "is the a determiner?" once we are quite sure of the answer. Similar ideas motivate adaptive sampling for structured prediction (Shi et al., 2015).

We specifically use Aggravate (Ross and Bagnell, 2014; Chang et al., 2015; Sun et al., 2017), which runs a learned policy to produce a backbone sequence of labels. For each position in the input, it then considers all possible deviation actions and executes an oracle for the rest of the sequence. The loss on this complete output is used as the cost for the deviating action. Run in this way, Aggravate requires len · K roll-outs when the input sentence has len words and each word can take one of K possible labels. Since each roll-out takes O(len) time, this can be computationally prohibitive, so we use active learning to reduce the number of roll-outs.

8. We use 0.01 for All Or None and 10^-3 for No Dom.

[Figure 4 contains three panels: test Hamming loss versus Number of Rollouts for Part-of-Speech Tagging, and test error (1 - F1) versus Number of Rollouts for Named-Entity Recognition and NER (Wikipedia), comparing Passive with COAL at mellowness 1e-1, 1e-2, and 1e-3.]

Figure 4: Learning to search experiments with COAL. Accuracy vs. number of rollouts for active and passive learning as the CSMC algorithm in learning-to-search.

We use COAL and a passive learning baseline inside Aggravate on three joint prediction datasets (statistics are in Table 1). As above, we use several mellowness values and the same AUC criterion to select the best learning rate (see Table 2). The results are in Figure 4, and again our recommended mellowness is 0.01. Overall, active learning reduces the number of roll-outs required, but the improvements vary on the three datasets. On the Wikipedia data, COAL uses a factor of 4 fewer roll-outs to achieve similar performance to passive learning and achieves substantially better test performance. A similar, but less dramatic, behavior arises on the NER task. On the other hand, COAL offers minimal improvement over passive learning on the POS-tagging task. This agrees with our theory and prior empirical results (Hsu, 2010), which show that active learning may not always improve upon passive learning.

8. Proofs

In this section we provide proofs for the main results, the oracle-complexity guarantee and the generalization and label complexity bounds. We start with some supporting results, including a new uniform Freedman-type inequality that may be of independent interest. The proof of this inequality, and the proofs for several other supporting lemmata, are deferred to the appendices.

8.1. Supporting Results

A deviation bound. For both the computational and statistical analysis of COAL, we require concentration of the square loss functional $\hat{R}_j(\cdot; y)$, uniformly over the class G.
To describe the result, we introduce the central random variable in the analysis: Mj(g; y) Qj(y) (g(xj) cj(y))2 (f (xj; y) cj(y))2 , (15) Active Learning for Cost-Sensitive Classification where (xj, cj) is the jth example and cost presented to the algorithm and Qj(y) {0, 1} is the query indicator. For simplicity we often write Mj when the dependence on g and y is clear from context. Let Ej[ ] and Varj[ ] denote the expectation and variance conditioned on all randomness up to and including round j 1. Theorem 15 Let G be a function class with Pdim(G) = d, let δ (0, 1) and define νn 324(d log(n) + log(8Ke(d + 1)n2/δ)). Then with probability at least 1 δ, the following inequalities hold simultaneously for all g G, y [K], and i < i [n]. j=i Mj(g; y) 3 j=i Ej Mj(g; y) + νn, (16) j=i Ej Mj(g; y) j=i Mj(g; y) + νn. (17) This result is a uniform Freedman-type inequality for the martingale difference sequence P i Mi Ei Mi. In general, such bounds require much stronger assumptions (e.g., sequential complexity measures (Rakhlin and Sridharan, 2017)) on G than the finite pseudo-dimension assumption that we make. However, by exploiting the structure of our particular martingale, specifically that the dependencies arise only from the query indicator, we are able to establish this type of inequality under weaker assumptions. The result may be of independent interest, but the proof, which is based on arguments from Liang et al. (2015), is quite technical and deferred to Appendix A. Note that we did not optimize the constants. The Multiplicative Weights Algorithm. We also use the standard analysis of multiplicative weights for solving linear feasibility problems. We state the result here and, for completeness, provide a proof in Appendix B. See also Arora et al. (2012); Plotkin et al. (1995) for more details. Consider a linear feasibility problem with decision variable v Rd, explicit constraints ai, v bi for i [m] and some implicit constraints v S (e.g., v is non-negative or other simple constraints). The MW algorithm either finds an approximately feasible point or certifies that the program is infeasible assuming access to an oracle that can solve a simpler feasibility problem with just one explicit constraint P i µi ai, v P i µibi for any non-negative weights µ Rm + and the implicit constraint v S. Specifically, given weights µ, the oracle either reports that the simpler problem is infeasible, or returns any feasible point v that further satisfies ai, v bi [ ρi, ρi] for parameters ρi that are known to the MW algorithm. The MW algorithm proceeds iteratively, maintaining a weight vector µ(t) Rm + over the constraints. Starting with µ(1) i = 1 for all i, at each iteration, we query the oracle with the weights µ(t) and the oracle either returns a point vt or detects infeasibility. In the latter case, we simply report infeasibility and in the former, we update the weights using the rule µ(t+1) i µ(t) i 1 ηbi ai, vt Here η is a parameter of the algorithm. The intuition is that if vt satisfies the ith constraint, then we down-weight the constraint, and conversely, we up-weight every constraint that is Krishnamurthy, Agarwal, Huang, Daum e III, Langford violated. Running the algorithm with appropriate choice of η and for enough iterations is guaranteed to approximately solve the feasibility problem. Theorem 16 (Arora et al. (2012); Plotkin et al. 
(1995)) Consider running the MW algorithm with parameter η = p log(m)/T for T iterations on a linear feasibility problem where oracle responses satisfy ai, v bi [ ρi, ρi]. If the oracle fails to find a feasible point in some iteration, then the linear program is infeasible. Otherwise the point v 1 T PT t=1 vt satisfies ai, v bi + 2ρi p log(m)/T for all i [m]. Other Lemmata. Our first lemma evaluates the conditional expectation and variance of Mj, defined in (15), which we will use heavily in the proofs. Proofs of the results stated here are deferred to Appendix C. Lemma 17 (Bounding variance of regression regret) We have for all (g, y) G Y, Ej[Mj] = Ej Qj(y)(g(xj) f (xj; y))2 , Var j [Mj] 4Ei[Mj]. The next lemma relates the cost-sensitive error to the random variables Mj. Define Fi = f GK | y, f( ; y) Gi(y) , which is the version space of vector regressors at round i. Additionally, recall that Pζ captures the noise level in the problem, defined in (10) and that ψi = 1/ i is defined in the algorithm pseudocode. Lemma 18 For all i > 0, if f Fi, then for all f Fi Ex,c[c(hf(x)) c(hf (x))] min ζ>0 ζPζ + 1 (ζ 2ψi) 2ψi + 4ψ2 i ζ + 6 Note that the lemma requires that both f and f belong to the version space Fi. For the label complexity analysis, we will need to understand the cost-sensitive performance of all f Fi, which requires a different generalization bound. Since the proof is similar to that of Theorem 5, we defer the argument to appendix. Lemma 19 Assuming the bounds in Theorem 15 hold, then for all i, Fi Fcsr(ri) where ri minζ>0 n ζPζ + 44K i The final lemma relates the query rule of COAL to a hypothetical query strategy driven by Fcsr(ri), which we will subsequently bound by the disagreement coefficients. Let us fix the round i and introduce the shorthand bγ(xi, y) = bc+(xi, y) bc (xi, y), where bc+(xi, y) and bc (xi, y) are the approximate maximum and minimum costs computed in Algorithm 1 on the ith example, which we now call xi. Moreover, let Yi be the set of non-dominated labels at round i of the algorithm, which in the pseudocode we call Y . Formally, Yi = {y | bc (xi, y) miny bc+(xi, y )}. Finally recall that for a set of vector regressors F F, we use γ(x, y, F) to denote the cost range for label y on example x witnessed by the regressors in F. Active Learning for Cost-Sensitive Classification Lemma 20 Suppose that the conclusion of Lemma 19 holds. Then for any example xi and any label y at round i, we have bγ(xi, y) γ(xi, y, Fcsr(ri)) + ψi. Further, with y i = argminy f (xi; y), yi = argminy bc+(xi, y), and yi = argminy =y i bc (xi, y), y = y i y Yi f (xi; y) f (xi; y i ) γ(xi, y, Fcsr(ri)) + γ(xi, y i , Fcsr(ri)) + ψi/2, |Yi| > 1 y i Yi f (xi; yi) f (xi; y i ) γ(xi, yi, Fcsr(ri)) + γ(xi, y i , Fcsr(ri)) + ψi/2. 8.2. Proof of Theorem 3 The proof is based on expressing the optimization problem (7) as a linear optimization in the space of distributions over G. Then, we use binary search to re-formulate this as a series of feasibility problems and apply Theorem 16 to each of these. Recall that the problem of finding the maximum cost for an (x, y) pair is equivalent to solving the program (7) in terms of the optimal g. For the problem (7), we further notice that since G is a convex set, we can instead write the minimization over g as a minimization over P (G) without changing the optimum, leading to the modified problem (8). 
Thus we have a linear program in variable P, and Algorithm 2 turns this into a feasibility problem by guessing the optimal objective value and refining the guess using binary search. For each induced feasibility problem, we use MW to certify feasibility. Let c [0, 1] be some guessed upper bound on the objective, and let us first turn to the MW component of the algorithm. The program in consideration is ? P (G) s.t. Eg P (g(xi) 1)2 c and j [i], Eg P b Rj(g; y) j. (18) This is a linear feasibility problem in the infinite dimensional variable P, with i + 1 constraints. Given a particular set of weights µ over the constraints, it is clear that we can use the regression oracle over g to compute gµ = arg min g G µ0(g(xi) 1)2 + X j [i] µj Eg P b Rj(g; y). (19) Observe that solving this simpler program provides one-sided errors. Specifically, if the objective of (19) evaluated at gµ is larger than µ0c + P j [i] µj j then there cannot be a feasible solution to problem (18), since the weights µ are all non-negative. On the other hand if gµ has small objective value it does not imply that gµ is feasible for the original constraints in (18). At this point, we would like to invoke the MW algorithm, and specifically Theorem 16, in order to find a feasible solution to (18) or to certify infeasibility. Invoking the theorem requires the ρj parameters which specify how badly gµ might violate the jth constraint. For us, ρj κ suffices since ˆRj(g; y) ˆRj(gj,y; y) [0, 1] (since gj,y is the ERM) and j κ. Since κ 2 this also suffices for the cost constraint. If at any iteration, MW detects infeasibility, then our guessed value c for the objective is too small since no function satisfies both (g(xi) 1)2 c and the empirical risk constraints in (18) simultaneously. In this case, in Line 10 of Algorithm 2, our binary search procedure Krishnamurthy, Agarwal, Huang, Daum e III, Langford increases our guess for c. On the other hand, if we apply MW for T iterations and find a feasible point in every round, then, while we do not have a point that is feasible for the original constraints in (18), we will have a distribution PT such that EPT (g(xi) 1)2 c + 2κ T and j [i], EPT b Rj(g; y) j + 2κ We will set T toward the end of the proof. If we do find an approximately feasible solution, then we reduce c and proceed with the binary search. We terminate when ch cℓ τ 2/2 and we know that problem (18) is approximately feasible with ch and infeasible with cℓ. From ch we will construct a strictly feasible point, and this will lead to a bound on the true maximum c+(x, y, Gi). Let P be the approximately feasible point found when running MW with the final value of ch. By Jensen s inequality and convexity of G, there exists a single regressor that is also approximately feasible, which we denote g. Observe that g satisfies all constraints with strict inequality, since by (20) we know that b Rj(g ; y) b Rj(gj,y; y) j/κ < j. We create a strictly feasible point gζ by mixing g with g with proportion 1 ζ and ζ for which will be in [0, 1] when we set T. Combining inequalities, we get that for any j [i] (1 ζ) b Rj( g; y) + ζ b Rj(g ; y) b Rj(gj,y; y) + (1 ζ) b Rj(gj,y; y) + j b Rj(gj,y; y) + j, and hence this mixture regressor gζ is exactly feasible. Here we use that κ 2 and that i is monotonically decreasing. 
With the pessimistic choice g (xi) = 0, the objective value for gζ is at most (gζ(xi) 1)2 (1 ζ)( g(xi) 1)2 + ζ(g (xi) 1)2 (1 ζ) cℓ+ τ 2/2 + 2κ + 4κ Thus gζ is exactly feasible and achieves the objective value above, which provides an upper bound on the maximum cost. On the other hand cℓprovides a lower bound. Our setting of T = log(i+1)(8κ2/ i)2 τ 4 ensures that that this excess term is at most τ 2, since i 1. Note that since τ [0, 1], this also ensures that ζ [0, 1]. With this choice of T, we know that cℓ (c+(y) 1)2 cℓ+ τ 2, which implies that c+(y) [1 p cℓ+ τ 2, 1 cℓ]. Since p cℓ+ τ 2 cℓ+ τ, we obtain the guarantee. Active Learning for Cost-Sensitive Classification As for the oracle complexity, since we start with cℓ= 0 and ch = 1 and terminate when ch cℓ τ 2/2, we perform O(log(1/τ 2)) iterations of binary search. Each iteration requires T = O max{1, i2/ν2 n}log(i) τ 4 rounds of MW, each of which requires exactly one oracle call. Hence the oracle complexity is O max{1, i2/ν2 n}log(i) log(1/τ) 8.3. Proof of the Generalization Bound Recall the central random variable Mj(g; y), defined in (15), which is the excess square loss for function g on label y for the jth example, if we issued a query. The idea behind the proof is to first apply Theorem 15 to argue that all the random variables Mj(g; y) concentrate uniformly over the function class G. Next for a vector regressor f, we relate the cost-sensitive risk to the excess square loss via Lemma 18. Finally, using the fact that gi,y minimizes the empirical square loss at round i, this implies a cost-sensitive risk bound for the vector regressor fi = (gi,y) at round i. First, condition on the high probability event in Theorem 15, which ensures that the empirical square losses concentrate. We first prove that f Fi for all i [n]. At round i, by (17), for each y and for any g we have j=1 Ej Mj(g; y) j=1 Mj(g; y) + νn. The first inequality here follows from the fact that Ej Mj(g; y) is a quadratic form by Lemma 17. Expanding Mj(g; y), this implies that b Ri+1(f ( ; y); y) b Ri+1(g; y) + νn Since this bound applies to all g G it proves that f Fi+1 for all i, using the definition of i and κ. Trivially, we know that f F1. Together with the fact that the losses are in [0, 1] and the definition of i, the above analysis yields b Ri(f ( ; y); y) b Ri(g; y) + i This implies that f ( ; y) strictly satisfies the inequalities defining the version space, which we used in the MW proof. We next prove that fi+1 Fj for all j [i]. Fix some label y and to simplify notation, we drop dependence on y. If gi+1 / Gt+1 for some t {0, . . . , i} then, first observe that we must have t large enough so that νn/t 1. In particular, since t+1 = κ min{1, νn/t} and we always have ˆRt+1(gi+1; y) ˆRt+1(gt+1; y) + 1 due to boundedness, we do not evict any functions until νn/t 1. For t νn, we get j=1 Mj = t b Rt+1(gi+1) b Rt+1(g ) = t b Rt+1(gi+1) b Rt+1(gt+1) + b Rt+1(gt+1) b Rt+1(g ) κνn νn = (κ 1)νn. Krishnamurthy, Agarwal, Huang, Daum e III, Langford The inequality uses the radius of the version space and the fact that by assumption gi+1 / Gt+1, so the excess empirical risk is at least t+1 = κνn/t since we are considering large t. We also use (20) on the second term. Moreover, we know that since gi+1 is the empirical square loss minimizer for label y after round i, we have Pi j=1 Mj(gi+1; y) 0. These two facts together establish that j=t+1 Mj(gi+1) (1 κ)νn. 
However, by Theorem 15 on this intermediary sum, we know that j=t+1 Ej Mj(gi+1) j=t+1 Mj(gi+1) + νn < (2 κ)νn < 0 using the definition of κ. This is a contradiction, so we must have that gi+1 Gj for all j {1, . . . , i}. The same argument applies for all y and hence we can apply Lemma 18 on all rounds to obtain i Ex,c[c(hfi+1(x)) c(hf (x))] 1 (ζ 2ψj) 2ψj + 4ψ2 j ζ + 6 y Ej [Mj(fi+1; y)] We study the four terms separately. The first one is straightforward and contributes ζPζ to the instantaneous cost sensitive regret. Using our definition of ψj = 1/ j the second term can be bounded as j=1 1 (ζ < 2ψj) 2ψj = The inequality above, Pn i=1 1 i 2 n, is well known. For the third term, using our definition of ψj gives 4ψ2 j ζ = 4 ζ (1 + log(i)). Finally, the fourth term can be bounded using (17), which reveals j=1 Ej[Mj] 2 j=1 Mj + 2νn Since for each y, Pi j=1 Mj(fi+1; y) 0 for the empirical square loss minimizer (which is what we are considering now), we get j=1 Ej[Mj(fi+1; y)] 12 Active Learning for Cost-Sensitive Classification And hence, we obtain the generalization bound Ex,c[c(x; hfi+1(x)) c(x; hf (x))] min ζ>0 ζi (4 log(i) + 16 + 12Kνn) ζPζ + 32Kνn Proof of Corollary 6. Under the Massart noise condition, set ζ = τ so that Pζ = 0 and we immediately get the result. Proof of Corollary 7. Set ζ = min τ0, 32Kνn iβ 1 α+2 , so that for i sufficiently large the second term is selected and we obtain the bound. 8.4. Proof of the Label Complexity bounds The proof for the label complexity bounds is based on first relating the version space Fi at round i to the cost-sensitive regret ball Fcsr with radius ri. In particular, the containment Fi Fcsr(ri) in Lemma 19 implies that our query strategy is more aggressive than the query strategy induced by Fcsr(ri), except for a small error introduced when computing the maximum and minimum costs. This error is accounted for by Lemma 20. Since the probability that Fcsr will issue a query is intimately related to the disagreement coefficient, this argument leads to the label complexity bounds for our algorithm. Proof of Theorem 9. Fix some round i with example xi, let Fi be the vector regressors used at round i and let Gi(y) be the corresponding regressors for label y. Let yi = argminy c c+(xi, y), y i = argminy f (xi; y), and yi = argminy =y i c c (xi, y). Assume that Lemma 19 holds. The L2 label complexity is X y Qi(y) = X y 1{|Yi| > 1 y Yi}1{bγ(xi, y) > ψi} y 1{|Yi| > 1 y Yi}1{γ(xi, y, Fcsr(ri)) > ψi/2}. For the former indicator, observe that y Yi implies that there exists a vector regressor f Fi Fcsr(ri) such that hf(xi) = y. This follows since the domination condition means that there exists g Gi(y) such that g(xi) miny maxg Gi(y ) g (xi). Since we are using a factored representation, we can take f to use g on the yth coordinate and use the maximizers for all the other coordinates. Similarly, there exists another regressor f Fi such that hf (xi) = y. Thus this indicator can be bounded by the disagreement coefficient X y 1{ f, f Fcsr(ri) | hf(xi) = y = hf (xi)}1{γ(xi, y, Fcsr(ri)) > ψi/2} y 1{x DIS(ri, y) γ(xi, y, Fcsr(ri)) ψi/2}. We will now apply Freedman s inequality on the sequence {P y Qi(y)}n i=1, which is a martingale with range K. 
Moreover, due to non-negativity, the conditional variance is at most Krishnamurthy, Agarwal, Huang, Daum e III, Langford K times the conditional mean, and in such cases, Freedman s inequality reveals that with probability at least 1 δ REX log(1/δ) + 2R log(1/δ) 2EX + 3R log(1/δ), where X is the non-negative martingale with range R and expectation EX. The last step is by the fact that 2 ab a + b. For us, Freedman s inequality implies that with probability at least 1 δ/2 y Qi(y) 2 X y Qi(y) + 3K log(2/δ) y 1{x DIS(ri, y) γ(xi, y, Fcsr(ri)) ψi/2} + 3K log(2/δ) ri ψi θ2 + 3K log(2/δ). The last step here uses the definition of the disagreement coefficient θ2. To wrap up the proof we just need to upper bound the sequence, using our choices of ψi = 1/ i, ri = 2 44K i, and i = κ min{1, νn i 1}. With simple calculations this is easily seen to be at most 88Kκνn + 3K log(2/δ). Similarly for L1 we can derive the bound i 1 ( y | γ(xi, y, Fcsr(ri)) ψi/2 x DIS(ri, y)) , and then apply Freedman s inequality to obtain that with probability at least 1 δ/2 ψi θ1 + 3 log(2/δ) 8θ1n p 88Kκνn + 3 log(2/δ). Proof of Theorem 10. Using the same notations as in the bound for the high noise case we first express the L2 label complexity as X y Qi(y) = X y 1{|Yi| > 1, y Yi}Qi(y). We need to do two things with the first part of the query indicator, so we have duplicated it here. For the second, we will use the derivation above to relate the query rule to the disagreement region. For the first, by Lemma 20, for y = y i , we can derive the bound 1{f (xi; y) f (xi; y i ) γ(xi, y, Fcsr(ri)) + γ(xi, y i , Fcsr(ri)) + ψi/2} 1{τ ψi/2 γ(xi, y, Fcsr(ri)) + γ(xi, y i , Fcsr(ri))}. Active Learning for Cost-Sensitive Classification For y = y i , we get the same bound but with yi, also by Lemma 20. Focusing on just one of these terms, say where y = y i and any round where τ ψi, we get 1{τ ψi/2 γ(xi, y, Fcsr(ri)) + γ(xi, y i , Fcsr(ri))}Qi(y) 1{τ/2 γ(xi, y, Fcsr(ri)) + γ(xi, y i , Fcsr(ri))}Qi(y) 1{τ/4 γ(xi, y, Fcsr(ri))}Qi(y) + 1{τ/4 γ(xi, y i , Fcsr(ri))}Qi(y) Di(y) + Di(y i ), where for shorthand we have defined Di(y) 1{τ/4 γ(xi, y, Fcsr(ri)) xi DIS(ri, y)}. The derivation for the first term is straightforward. We obtain the disagreement region for y i since the fact that we query y (i.e. Qi(y)) implies there is f such that hf(xi) = y, so this function witnesses disagreement to y i . The term involving y i and yi is bounded in essentially the same way, since we know that when |Yi| > 1, there exists two function f, f Fi such that hf(xi) = yi and hf (xi) = y i . In total, we can bound the L2 label complexity at any round i such that τ ψi by L2 Di( yi) + Di(y i ) + X y =y i (Di(y) + Di(y i )) KDi(y i ) + 2 X For the earlier rounds, we simply upper bound the label complexity by K. Since the range of this random variable is at most 3K, using Freedman s inequality just as in the high noise case, we get that with probability at least 1 δ/2 i=1 K1{τ ψi} + 2 KDi(y i ) + 2 X + 9K log(2/δ) K 1/τ 2 + 8 τ (Kθ1 + θ2) i=2 ri + 9K log(2/δ). The first line here is the application of Freedman s inequality. In the second, we evaluate the expectation, which we can relate to the disagreement coefficients θ1, θ2. Moreover, we use the setting ψi = 1/ i to evaluate the first term. As a technicality, we remove the index i = 1 from the second summation, since we are already accounting for queries on the first round in the first term. 
The last step is to evaluate the series, for which we use the definition of ri = minζ>0 {ζPζ + 44K i/ζ} and set ζ = τ, the Massart noise level. This gives ri = 44K i/τ. In total, we get L2 K 1/τ 2 + 352Kκνn τ 2 (Kθ1 + θ2) log(n) + 9K log(2/δ). As θ2 Kθ1 always, we drop θ2 from the above expression to obtain the stated bound. For L1 we use a very similar argument. First, by Lemmas 19 and 20 i=1 1{|Yi| > 1 y Yi, γ(xi, y, Fcsr(ri)) ψi/2}. Krishnamurthy, Agarwal, Huang, Daum e III, Langford Again by Lemma 20, we know that |Yi| > 1 y Yi f, f Fcsr(ri), hf(xi) = y = hf (xi). Moreover, one of the two classifiers can be f , and so, when τ ψi, we can deduce f (xi, y) f (xi, y i ) γ(xi, y, Fcsr(ri)) + γ(xi, y i , Fcsr(ri)) + ψi/2 τ/4 γ(xi, y, Fcsr(ri)) τ/4 γ(xi, y i , Fcsr(ri)). Combining this argument, we bound the L1 label complexity as i=1 1{τ ψi} + i=2 1{ y | xi DIS(Fcsr(ri), y) γ(xi, y, Fcsr(ri)) τ/4}. (21) Applying Freedman s inequality just as before gives L1 1/τ 2 + 2 τ θ1 + 2 log(2/δ) 1/τ 2 + 352Kκνn τ 2 θ1 log(n) + 2 log(2/δ), with probability at least 1 δ/2. Proof of Theorem 11. For the Tsybakov case, the same argument as in the Massart case gives that with probability at least 1 δ/2 L2 K 1/τ 2 + 8 τ (Kθ1 + θ2) i=2 ri + 2nβτ α + 9K log(2/δ). The main difference here is the term scaling with nτ α which arises since we do not have the deterministic bound f (xi, y) f (xi, y i ) τ as we used in the Massart case, but rather this happens except with probability βτ α (provided τ τ0). Now we must optimize ζ in the definition of ri and then τ. For ζ the optimal setting is (44K i/β) 1 α+2 which gives ri 2β 1 α+2 (44K i) α+1 α+2 . Since we want to set ζ τ0, this requires i 1 + 44Kκνn βτ α+2 0 . For these early rounds we will simply pay K in the label complexity, but this will be dominated by other higher order terms. For the later rounds, we get i=2 ri 2β 1 α+2 (44Kκνn) α+1 α+2 n X α+2 2(α + 2) (44Kκνn) α+1 α+2 (βn) 1 α+2 . This bound uses the integral approximation Pn i=2(i 1) α+1 α+2 1 + R n 1 1 x α+1 α+2 dx (α + 2)n 1 α+2 . At this point, the terms involving τ in our bound are K 1/τ 2 + 16 τ (Kθ1 + θ2)(α + 2) (44Kκνn) α+1 α+2 (βn) 1 α+2 + 2nβτ α. We set τ = (8(α + 2)(Kθ1 + θ2)) 1 α+1 (44Kκνn) 1 α+2 (βn) 1 α+2 by optimizing the second two terms which gives a final bound of L2 O (Kθ1 + θ2) α α+1 (Kνn) α α+2 n 2 α+2 + K log(1/δ) . Active Learning for Cost-Sensitive Classification This follows since the 1/τ 2 term agrees in the n dependence and is lower order in other parameters, while the unaccounted for querying in the early rounds is independent of n. The bound of course requires that τ τ0, which again requires n large enough. Note we are treating α and β as constants and we drop θ2 from the final statement. The L1 bound requires only slightly different calculations. Following the derivation for the Massart case, we get L1 1/τ 2 + 8θ1 i=2 ri + 2nβτ α + 2 log(2/δ) 1/τ 2 + 16θ1(α + 2) τ (44Kκνn) α+1 α+2 (βn) 1 α+2 + 2nβτ α + 2 log(2/δ), not counting the lower order term for the querying in the early rounds. Here we set τ = (8(α + 2)θ1) 1 α+1 (44Kκνn) 1 α+2 (βn) 1 α+2 to obtain α α+1 1 (Kνn) α α+2 n 2 α+2 + log(1/δ) . Proof of Proposition 12: Massart Case. Observe that with Massart noise, we have Fcsr(r) e F(r/τ), which implies that E1 { y | x DIS(r, y) γ(x, y, Fcsr(r)) τ/4} E1 n x g DIS(r/τ) o 1 r E1 n x g DIS(r) o . Thus we may replace θ1 with θ0 in the proof of the L1 label complexity bound above. Proof of Proposition 12: Tsybakov Case. 
The proof is identical to Theorem 11 except we must introduce the alternative coefficient θ0. To do so, define X(τ) {x X : miny =y (x) f (x, y) f (x, y (x)) τ}, and note that under the Tsybakov noise condition, we have P[x / X(τ)] βτ α for all 0 τ τ0. For such a value of τ we have P[hf(x) = hf (x)] E1{x X(τ) hf(x) = hf (x)} + P[x / X(τ)] τ Ec(hf(x)) c(hf (x)) + βτ α. We use this fact to prove that Fcsr(r) e F(2r/τ) for τ sufficiently small. This can be seen from above by noting that if f Fcsr(r) then the right hand side is r τ + βτ α, and if τ (r/β) 1 1+α the containment holds. Therefore, we have E1 { y | x DIS(r, y) γ(x, y, Fcsr(r)) τ/4} E1 { y | x DIS(r, y)} E1 n x g DIS(2r/τ) o τ sup r>0 E1 n x g DIS(r) o . Krishnamurthy, Agarwal, Huang, Daum e III, Langford Thus, provided τ (rn/β) 1 α+1 , we can replace θ1 with θ0 in the above argument. This gives L1 1/τ 2 + 4θ0 i=2 ri + 2nβτ α + 2 log(2/δ) 1/τ 2 + 8θ0(α + 2) τ (44Kκνn) α+1 α+2 (βn) 1 α+2 + 2nβτ α + 2 log(2/δ). As above, we have taken ri = 2β 1 α+2 (44K i) α+1 α+2 and approximated the sum by an integral. Since n = κνn/(n 1), we can set τ = 2 α+1 α+2 (44Kκνn) 1 α+2 (βn) 1 α+2 . This is a similar choice to what we used in the proof of Theorem 11 except that we are not incorporating θ0 into the choice of τ, and it yields a final bound of O θ0n 2 α+2 ν α α+2 n + log(1/δ) . Proof of Proposition 13. First we relate θ1 to θ0 in the multiclass case. For f Fcsr(r), we have P[hf(x) = hf (x)] P[hf(x) = y] + P[hf (x) = y] = Ec(hf(x)) c(hf (x)) 2error(hf ) + r Therefore for any r > 0 and any x, 1{ y | x DIS(r, y) γ(x, y, Fcsr(r)) ψ/2} 1{x g DIS(ri + 2error(hf ))}. Applying this argument in the L1 derivation above, we obtain i Ei1{x g DIS(ri + 2error(hf ))} + 3 log(2/δ) i (ri + 2error(hf ))θ0 + 3 log(2/δ). We now bound ri via Lemma 19. In multiclass classification, the fact that c = 1 ey for some y implies that f (x, y) is one minus the probability that the true label is y. Thus f (x, y) [0, 1], P y f (x, y) = K 1, and for any x we always have min y =y (x) f (x, y) f (x, y (x)) 1 2f (x, y (x)). Hence, we may bound Pζ, for ζ 1/2, as follows min y =y (x) f (x, y) f (x, y (x)) ζ Px D [1 2f (x, y (x)) ζ] Exf (x, y (x)) 2 1 ζ 4error(hf ). Now apply Lemma 19 with ζ = min{1/2, p 44K i/error(hf )}, and we obtain 44K ierror(hf ) + 264K i. Using the definition of i the final L1 label complexity bound is L1 4θ0n error(hf ) + O θ0 Knκνn error(hf ) + Kκνn log(n) + log(1/δ) . Active Learning for Cost-Sensitive Classification 9. Discussion This paper presents a new active learning algorithm for cost-sensitive multiclass classification. The algorithm enjoys strong theoretical guarantees on running time, generalization error, and label complexity. The main algorithmic innovation is a new way to compute the maximum and minimum costs predicted by a regression function in the version space. We also design an online algorithm inspired by our theoretical analysis that outperforms passive baselines both in CSMC and structured prediction. On a technical level, our algorithm uses a square loss oracle to search the version space and drive the query strategy. This contrasts with many recent results using argmax or 0/1loss minimization oracles for information acquisition problems like contextual bandits (Agarwal et al., 2014). As these involve NP-hard optimizations in general, an intriguing question is whether we can use a square loss oracle for other information acquisition problems. We hope to answer this question in future work. 
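To keep the overall recipe in one place, the following schematic Python sketch restates COAL's per-example query rule as described above. The max_cost and min_cost callables stand in for the regression-oracle computation of approximate cost ranges (Algorithm 2); every name here is illustrative rather than part of the released implementation.

```python
def coal_query_step(x, K, max_cost, min_cost, psi):
    """One schematic pass of COAL's query rule on example x with K labels.

    max_cost(x, y) / min_cost(x, y): approximate largest / smallest cost
        predicted for label y by regressors still in the version space
        (computed via the squared-loss oracle in the paper; here callables).
    psi: the cost-range threshold psi_i used at the current round.
    Returns the set of labels whose costs should be queried.
    """
    c_plus = {y: max_cost(x, y) for y in range(K)}
    c_minus = {y: min_cost(x, y) for y in range(K)}

    # A label is non-dominated if its smallest possible cost does not exceed
    # the best worst-case cost over all labels (the set Y_i in the paper).
    best_worst_case = min(c_plus.values())
    candidates = {y for y in range(K) if c_minus[y] <= best_worst_case}

    # If only one label can possibly be optimal, there is nothing to learn.
    if len(candidates) <= 1:
        return set()

    # Otherwise query the non-dominated labels whose cost range is still large.
    return {y for y in candidates if c_plus[y] - c_minus[y] > psi}
```

The labels eliminated by the domination check are the "sure losers" whose costs are never requested; this behavior is what separates COAL from the All Or None and No Dom baselines in Section 7.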
Acknowledgments Part of this research was completed while TKH was at Microsoft Research and AK was at University of Massachusetts, Amherst. AK thanks Chicheng Zhang for insightful conversations. AK is supported in part by NSF Award IIS-1763618. Krishnamurthy, Agarwal, Huang, Daum e III, Langford Appendix A. Proof of Theorem 15 In the proof, we mostly work with the empirical ℓ1 covering number for G. At the end, we translate to pseudo-dimension using a lemma of Haussler. Definition 21 (Covering numbers) Given class G X R, α > 0, and sample X = (x1, . . . , xn) X n, the covering number N1(α, H, X) is the minimum cardinality of a set V Rn such that for any g G, there exists a (v1, . . . , vn) V with 1 n Pn i=1 |g(xi) vi| α. Lemma 22 (Covering number and Pseudo-dimension (Haussler, 1995)) Given a hypothesis class G X R with Pdim(G) d, for any X X n we have N1(α, G, X) e(d + 1) 2e Fixing i and y, and working toward (16) we seek to bound j=1 Mj(g; y) 3 2Ej[Mj(g; y)] τ The bound on the other tail is similar. In this section, we sometimes treat the query rule as a function which maps an x to a query decision. We use the notation Qj : X {0, 1} to denote the query function used after seeing the first j 1 examples. Thus, our query indicator Qj is simply the instantiation Qj(xj). In this section, we work with an individual label y and omit the explicit dependence in all our arguments and notation. For notational convenience, we use zj = (xj, cj, Qj) and with g ( ) = f ( ; y), we define ξj = g (xj) cj and ℓ(g, x) = (g(x) g (x)). (23) Note that ξj is a centered random variable, independent of everything else. We now introduce some standard concepts from martingale theory for the proof of Theorem 15. Definition 23 (Tangent sequence) For a dependent sequence z1, . . . , zi we use z 1, . . . , z i to denote a tangent sequence, where z j|z1:j 1 d= zj|z1:j 1, and, conditioned on z1, . . . , zi, the random variables z 1, . . . , z i are independent. In fact, in our case we have z j = (x j, c j, Q j) where (x j, c j) D and Q j : X {0, 1} is identical to Qj. We use M j(g) to denote the empirical excess square loss on sample z j. Note that we will continue to use our previous notation of Ej to denote conditioning on z1, . . . , zj 1. We next introduce one more random process, and then proceed with the proof. Definition 24 (Tree process) A tree process Q is a binary tree of depth i where each node is decorated with a value from {0, 1}. For a Rademacher sequence ϵ { 1, 1}i we use Qi(ϵ) to denote the value at the node reached when applying the actions ϵ1, . . . , ϵi 1 from the root, where +1 denotes left and 1 denotes right. The proof follows a fairly standard recipe for proving uniform convergence bounds, but has many steps that all require minor modifications from standard arguments. We compartmentalize each step in various lemmata: Active Learning for Cost-Sensitive Classification 1. In Lemma 25, we introduce a ghost sample and replace the conditional expectation Ej[ ] in (22) with an empirical term evaluated on an tangent sequence. 2. In Lemma 26, we perform symmetrization and introduce Rademacher random variables and the associated tree process. 3. In Lemma 27 we control the symmetrized process for finite G. 4. In Lemma 28, we use the covering number to discretize G. We now state and prove the intermediate results. Lemma 25 (Ghost sample) Let Z = (z1, . . . , zi) be the sequence of (x, c, Q) triples and let Z = (z 1, . . . , z i) be a tangent sequence. 
Then for β0 β1 > 0 if τ 4(1+β1)2 (β0 β1) , then j=1 Mj(g) (1 + 2β0)Ej Mj(g) τ j=1 Mj(g) M j(g) 2β1Qj(x j)ℓ2(g, x j) τ j=1 (1 2β0)Ej Mj(g) Mj(g) τ j=1 M j(g) Mj(g) 2β1Qj(x j)ℓ2(g, x j) τ Proof We derive the first inequality, beginning with the right hand side and working toward a lower bound. The main idea is to condition on Z and just work with the randomness in Z . To this end, let ˆg achieve the supremum on the left hand side, and define the events j=1 Mj(ˆg) (1 + 2β0)Ej Mj(ˆg) τ i=1 M j(ˆg) + 2β1Qj(x j)ℓ2(ˆg, x j) (1 + 2β0)Ej M j(ˆg) τ/2 Starting from the right hand side, by adding and subtracting (1 + 2β0)Ej Mj(ˆg) we get j=1 Mj(g) M j(g) 2β1Qj(x j)ℓ2(g, x j) τ/2 j=1 Mj(ˆg) M j(ˆg) 2β1Qj(x j)ℓ2(ˆg, x j) τ/2 PZ,Z h E \ E i = EZ 1{E} PZ [E |Z] . Krishnamurthy, Agarwal, Huang, Daum e III, Langford Since we have defined ˆg to achieve the supremum, we know that EZ1{E} = PZ j=1 Mj(g) (1 + 2β0)Ej Mj(g) τ which is precisely the left hand side of the desired inequality. Hence we need to bound the PZ [E |Z] term. For this term, we note that PZ [E |Z] = PZ i=1 M j(ˆg) + 2β1Qj(x j)ℓ2(ˆg, x j) (1 + 2β0)Ej M j(ˆg) τ/2 | Z i=1 M j(ˆg) + 2β1Ej M j(ˆg) (1 + 2β0)Ej M j(ˆg) τ/2 | Z M j(ˆg) Ej M j(ˆg) + 2 β1 β0)Ej M j(ˆg) τ/2 | Z Here, (a) follows since Ej[M j(g) | Z] = Qj(x j)ℓ2(g, x j) for any g by Lemma 17. Since we are conditioning on Z, ˆg is also not a random function and the same equality holds when we take expectation over Z , even for ˆg. With this, we can now invoke Chebyshev s inequality: PZ [E |Z] = 1 PZ j=1 M j(ˆg) + 2β1Qj(x j)ℓ2(ˆg, x j) (1 + 2β0)Ej M j(ˆg) τ/2 1 Var h Pi j=1 M j(ˆg) + 2β1Qj(x j)ℓ(ˆg, x j) Z i τ/2 + 2(β0 β1) P i Ej[Qj(x j)ℓ2(ˆg, x j) Z] 2 . Since we are working conditional on Z, we can leverage the independence of Z (recall Definition 23) to bound the variance term. j=1 M j(g) + 2β1Qj(x j)ℓ2(ˆg, x j) j=1 Ej M j(ˆg)2 + (2β1)2Qj(x j)ℓ4(ˆg, x j) + 4β1M j(ˆg)Qj(x j)ℓ2(ˆg, x j) 4(1 + β1)2 i X j=1 Ej Qj(x j)ℓ2(ˆg, x j). Here we use that Var[X] E[X2] and then we use that ℓ2 ℓsince the loss is bounded in [0, 1] along with Lemma 17. Returning to the application of Chebyshev s inequality, if we expand the quadratic in the denominator and drop all but the cross term, we get the bound P Z[E |Z] 1 2(1 + β1)2 τ(β0 β1) 1/2, Active Learning for Cost-Sensitive Classification where the last step uses the requirement on τ. This establishes the first inequality. For the second inequality the steps are nearly identical. Let ˆg achieve the supremum on the left hand side and define j=1 (1 2β0)Ej Mj(ˆg) Mj(ˆg) τ j=1 (1 2β0)Ej M j(ˆg) M j(ˆg) + 2β1Qj(x j)ℓ2(ˆg, x j) τ/2 Using the same argument, we can lower bound the right hand side by j=1 M j(g) Mj(g) 2β1Qj(x j)ℓ2(g, x j) τ/2 EZ 1{E} PZ E Z Applying Chebyshev s inequality yields the same expression as for the other tail. Lemma 26 (Symmetrization) Using the same notation as in Lemma 25, we have j=1 Mj(g) M j(g) 2β1Qj(x j)ℓ2(g, x j) τ 2E sup Q Pϵ j=1 ϵj Qj(ϵ)((1 + β1)ℓ(g, xj)2 + 2ξjℓ(g, xj)) β1Qj(ϵ)ℓ(g, xj)2 τ/2 the same bound holds on the lower tail with (1 β1) replacing (1 + β1). Proof For this proof, we think of Qj as a binary variable that is dependent on z1, . . . , zj 1 and xj. Similarly Q j depends on z1, . . . , zj 1 and x j. Using this notation, and decomposing the square loss, we get Mj(g) = Qj (g(xj) cj)2 (g (xj) cj)2 = Qjℓ2(g, xj) + 2Qjξjℓ(g, xj). 
As such, we can write Mj(g) M j(g) 2β1Q jℓ2(g, x j) = (1 + β1)Qjℓ2(g, xj) + 2Qjξjℓ(g, xj) | {z } β1Qjℓ2(g, xj) | {z } T2,j (1 + β1)Q jℓ2(g, x j) 2Q jξ jℓ(g, x j) | {z } β1Q jℓ2(g, x j) | {z } Here we have introduce the short forms T1,j, T2,j and the primed version just to condense the derivations. Overall we must bound j=1 T1,j T2,j T 1,j T 2,j τ j=1 T1,j T2,j T 1,j T 2,j τ Krishnamurthy, Agarwal, Huang, Daum e III, Langford Observe that in the final term T1,i, T 1,i are random variables with identical conditional distribution, since there are no further dependencies and (xi, ξi, Qi) are identically distributed to (x i, ξ i, Q i). As such, we can symmetrize the ith term by introducing the Rademacher random variable ϵi { 1, +1} to obtain j=1 T1,j T2,j + ϵi(T1,i T 1,i) j=1 (T2,j + T 2,j) τ EZ,Z sup Qi,Q i Eϵi1 j=1 T1,j T2,j + ϵi(T1,i T 1,i) j=1 (T2,j + T 2,j) τ Here in the second step, we have replace the expectation over Qi, Q i with supremum, which breaks the future dependencies for the (i 1)st term. Note that while we still write EZ,Z , we are no longer taking expectation over Qi, Q i here. The important point is that since xj, ξj are all i.i.d., the only dependencies in the martingale are through Qjs and by taking supremum over Qi, Q i, swapping the role of T1,i 1 and T 1,i 1 no longer has any future effects. Thus we can symmetrize the (i 1)st term. Continuing in this way, we get EEϵi 1 sup Qi,Q i Eϵi1 j=1 T1,j T2,j + j=i 1 ϵj(T1,j T 1,j) j=1 (T2,j + T 2,j) τ sup Qj,Q j Eϵj j=1 ϵj(T1,j T 1,j) j=1 (T2,j + T 2,j) τ Here in the final expression the outer expectation is just over the variables xj, x j, ξj, ξ j and the bracket notation denotes interleaved supremum and expectation. Expanding the definitions of T1,i, T2,i, we currently have sup Qj,Q j Eϵj j=1 ϵj (1 + β1)Qjℓ2(g, xj) + 2Qjξjℓ(g, xj) j=1 β1Qjℓ2(g, xj) j=1 ϵj (1 + β1)Q jℓ2(g, x j) + 2Q jξ jℓ(g, x j) j=1 β1Q jℓ2(g, x j) τ Next we use the standard trick of splitting the supremum over g into a supremum over two functions g, g , where g optimizes the primed terms. This provides an upper bound, but moreover if we replace τ with τ/2 we can split the indicator into two and this becomes j=1 ϵj Qj (1 + β1)ℓ2(g, xj) + 2ξjℓ(g, xj) β1Qjℓ2(g, xj) τ/2 = 2E sup Q Pϵ j=1 ϵj Qj(ϵ) (1 + β1)ℓ2(g, xj) + 2ξjℓ(g, xj) β1Qj(ϵ)ℓ2(g, xj) τ/2 Active Learning for Cost-Sensitive Classification The tree process Q arises here because the interleaved supremum and expectation is equivalent to choosing a binary tree decorated with values from {0, 1} and then navigating the tree using the Rademacher random variables ϵ. The bound for the other tail is proved in the same way, except (1 + β1) is replaced by (1 β1). The next lemma is more standard, and follows from the union bound and the bound on the Rademacher moment generating function. Lemma 27 (Finite class bound) For any x1:i, ξ1:i, Q, and for finite G, we have j=1 ϵj Qj(ϵ) (1 + β1)ℓ2(g, xj) + 2ξjℓ(g, xj) β1Qj(ϵ)ℓ2(g, xj) τ |G| exp 2β1τ The same bound applies for the lower tail. Proof Applying the union bound and the Chernofftrick, we get that for any λ > 0 the LHS is bounded by g exp( τλ)Eϵ exp j=1 ϵj Qj(ϵ) (1 + β1)ℓ2(g, xj) + 2ξjℓ(g, xj) β1Qi(ϵ)ℓ2(g, xj) | {z } g exp( τλ)Eϵ1:i 1 exp Eϵi|ϵ1:i 1 exp(λT3,i). Let us examine the ith term conditional on ϵ1:i 1. 
Conditionally on ϵ1:i 1, Qi(ϵ) is no longer random, so we can apply the MGF bound for Rademacher random variables to get Eϵi|ϵ1:i 1 exp λϵi Qi(ϵ) (1 + β1)ℓ2(g, xi) + 2ξiℓ(g, xi) λβ1Qi(ϵ)ℓ2(g, xi) 2 Qi(ϵ) (1 + β1)ℓ2(g, xi) + 2ξiℓ(g, xi) 2 λβ1Qi(ϵ)ℓ2(g, xi) exp λ2Qi(ϵ)(3 + β1)2 2 ℓ2(g, xi) λβ1Qi(ϵ)ℓ2(g, xi) 1. Here the first inequality is the standard MGF bound on Rademacher random variables E exp(aϵ) exp(a2/2). In the second line we expand the square and use that ℓ, ξ [ 1, 1] to upper bound all the terms. Finally, we use the choice of λ = 2β1 (3+β1)2 . Repeating this argument from i down to 1, finishes the proof of the upper tail. The same argument applies for the lower tail, but we actually get (3 β1)2 in the denominator, which is of course upper bounded by (3 + β1)2, since β1 > 0. Krishnamurthy, Agarwal, Huang, Daum e III, Langford Lemma 28 (Discretization) Fix x1, . . . , xi and let V Ri be a cover of G at scale α on points x1, . . . , xi. Then for any Q j=1 ϵj Qj(ϵ) (1 + β1)ℓ2(g, xj) + 2ξjℓ(g, xj) β1Qj(ϵ)ℓ2(g, xj) τ j=1 ϵj Qj(ϵ) (1 + β1)ℓ2(v, xj) + 2ξjℓ(v, xj) β1Qj(ϵ)ℓ2(v, xj) τ 4i(1 + β)α The same bound holds for the lower tail with 1 β1. Here ℓ(v, xj) = (vj g (xj)). Proof Observe first that if v is the covering element for g, then we are guaranteed that j=1 |ℓ(g, xj) ℓ(v, xj)| = 1 j=1 |g(xj) vj| α, j=1 |ℓ2(g, xj) ℓ2(v, xj)| = 1 j=1 |(g(xj) vj)(g(xj) + vj 2g (xj)| 2α, since g, v, g [0, 1]. Thus, adding and subtracting the corresponding terms for v, and applying these bounds, we get a residual term of iα(2(1 + β1) + 2 + 2β1) = 4iα(1 + β1). Proof of Theorem 15. Finally we can derive the deviation bound. We first do the upper tail, Mj Ej Mj. Set β0 = 1/4, β1 = 1/8 and apply Lemmas 25 and 26 to (22). j=1 Mj(g) 3 2Ej Mj(g) τ j=1 Mj(g) M j(g) 1 4Qj(x j)ℓ2(g, x j) τ/2 4E sup Q Pϵ j=1 ϵj Qj(ϵ) 9 8ℓ2(g, xj) + 2ξjℓ(g, xj) 1 8Qj(ϵi)ℓ2(g, xj) τ Now let V (X) be the cover for G at scale α = τ 32i(9/8) = τ 36i, which makes τ/4 4i(1+β1)α = τ 8. Thus we get the bound 4EX,ξ sup Q Pϵ j=1 ϵj Qj(ϵ) 9 8ℓ2(h, xj) + 2ξjℓ(h, xj) 1 8Qj(ϵi)ℓ2(h, xj) τ 4EX|V (X)| exp 2(1/8)(τ/8) = 2 exp( 2τ/625)EXN τ 36i, G, X . This entire derivation requires that τ 4(9/8)2 (1/8)2 = 324. Active Learning for Cost-Sensitive Classification The lower tail bound is similar. By Lemmas 25 and 26, with β0 = 1/4 and β1 = 1/8, 1 2Ej Mj(g) Mj(g) τ 1 2Ej Mj(g) Mj(g) 2(1/8)Qj(x j)ℓ2(g, x j) τ 4E sup Q Pϵ j=1 ϵj Qj(ϵ) 7 8ℓ2(g, xj) + 2ξjℓ(g, xj) 1 j=1 Qi(ϵ)ℓ2(g, xj) τ 4E sup Q Pϵ j=1 ϵj Qj(ϵ) 9 8ℓ2(g, xj) + 2ξjℓ(g, xj) 1 j=1 Qi(ϵ)ℓ2(g, xj) τ This is the intermediate term we had for the upper tail, so we obtain the same bound. To wrap up the proof, apply Haussler s Lemma 22, to bound the covering number 36i, G, X e(d + 1) 72ie Finally take a union bound over all pairs of starting and ending indices i < i , all labels y, and both tails to get that the total failure probability is at most 8Ke(d + 1) exp ( 2τ/625) X The result now follows from standard approximations. Specifically we use the fact that we anyway require τ 324 to upper bound that 1/τ d term, use (i i)d nd and set the whole expression to be at most δ. Appendix B. Multiplicative Weights For completeness we prove Theorem 16 here. For this section only let q(t) p(t) be the distribution used by the algorithm at round t. If the program is feasible, then there exists a point that is also feasible against every distribution q. By contraposition, if on iteration t, the oracle reports infeasibility against q(t), then the original program must be infeasible. 
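Before turning to the case where the oracle succeeds at every iteration, a minimal sketch of the loop under analysis may help fix notation. The oracle interface below is a hypothetical stand-in: it either reports infeasibility of the averaged problem or returns a point together with its constraint slacks.

```python
import numpy as np

def multiplicative_weights(oracle, b, rho, T):
    """Schematic MW loop for a linear feasibility problem with m explicit
    constraints <a_i, v> <= b_i, in the spirit of Theorem 16.

    oracle(mu): hypothetical; returns None if the mu-weighted averaged problem
        is infeasible, else (v, slack) with slack[i] = b_i - <a_i, v> and
        slack[i] in [-rho[i], rho[i]].
    Returns None (certifying infeasibility) or the average of the iterates.
    """
    m = len(b)
    eta = np.sqrt(np.log(m) / T)          # step size from Theorem 16
    mu = np.ones(m)                       # one weight per explicit constraint
    iterates = []
    for _ in range(T):
        result = oracle(mu)
        if result is None:                # averaged problem infeasible
            return None                   # => original program infeasible
        v, slack = result
        iterates.append(v)
        # Down-weight satisfied constraints (slack >= 0), up-weight violated ones.
        mu = mu * (1.0 - eta * slack / rho)
    return np.mean(iterates, axis=0)      # approximately feasible point
```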
Now suppose the oracle always finds vt that is feasible against q(t). This implies i=1 q(t) i (bi ai, vt ). Define ℓt(i) = bi ai,vt ρi which is in [ 1, 1] by assumption. We compare this term to the corresponding term for a single constraint i. Using the standard potential-based analysis, Krishnamurthy, Agarwal, Huang, Daum e III, Langford with Φ(t) = P i p(t) i , we get i=1 p(T) i (1 ηℓT (i)) Φ(T) exp i=1 q(T) i ℓT (i) i=1 q(t) i ℓt(i) For any i, we also have t=1 (1 ηℓt(i)). Thus, taking taking logarithms and re-arranging we get i=1 q(t) i ℓt(i) log(m) t=1 log 1 1 ηℓt(i) t=1 ℓt(i) + ηT. Here we use standard approximations log(1/(1 x)) x + x2 (which holds for x 1/2) along with the fact that |ℓt(i)| 1. Using the definition of ℓt(i) we have hence proved that t=1 ai, vt Tbi + ρi log m Now with our choice of η = p log(m)/T we get the desired bound. If η 1/2 then the result is trivial by the boundedness guarantee on the oracle. Appendix C. Proofs of Lemmata Proof [Proof of Lemma 14] By the definitions, b R(g ) + w (g (x) c)2 = b R(g , w , c) b R(g) + w (g(x) c)2 = b R(g) + w(g(x) c)2 + (w w)(g(x) c)2 b R(g ) + w(g (x) c)2 + (w w)(g(x) c)2. Rearranging shows that (w w)(g (x) c)2 (w w)(g(x) c)2. Since w w, we have (g (x) c)2 (g(x) c)2, which is the second claim. For the first, the definition of g gives b R(g) + w(g(x) c)2 b R(g ) + w(g (x) c)2. Rearranging this inequality gives, b R(g ) b R(g) w((g(x) c)2 (g (x) c)2) 0, which yields the result. Active Learning for Cost-Sensitive Classification Proof [Proof of Lemma 17] We take expectation of Mj over the cost conditioned on a fixed example xj = x and a fixed query outcome Qj(y): E[Mj | xj = x, Qj(y)] = Qj(y) Ec[g(x)2 f (x; y)2 2c(y)(g(x) f (x; y)) | xj = x] = Qj(y) g(x)2 f (x; y)2 2f (x; y)(g(x) f (x; y)) = Qj(y)(g(x) f (x; y))2. The second equality is by Assumption 1, which implies E[c(y) | xj = x] = f (x; y). Taking expectation over xj and Qj(y), we have Ej[Mj] = Ej Qj(y)(g(xj) f (xj; y))2 . For the variance: Var j [Mj] Ej[M2 j ] = Ej Qj(y)(g(xj) f (xj; y))2(g(xj) + f (xj; y) 2c(y))2 4 Ej Qj(y)(g(xj) f (xj; y))2 = 4Ej[Mj]. This concludes the proof. Proof [Proof of Lemma 18] Fix some f Fi, and let ˆy = hf(x) and y = hf (x) for shorthand, but note that both depend on x. Define Sζ(x) = 1 (f (x, ˆy) f (x, y ) + ζ) , S ζ(x) = 1 min y =y f (x, y) f (x, y ) + ζ . Observe that for fixed ζ, Sζ(x)1 (ˆy = y ) S ζ(x) for all x. We can also majorize the complementary indicator to obtain the inequality SC ζ (x) f (x, ˆy) f (x, y ) We begin with the definition of realizability, which gives Ex,c[c(hf(x)) c(hf (x)] = Ex [(f (x, ˆy) f (x, y )) 1 (ˆy = y )] = Ex Sζ(x) + SC ζ (x) (f (x, ˆy) f (x, y )) 1 (ˆy = y ) ζEx S ζ(x) + Ei SC ζ (x)1 (ˆy = y ) (f (x, ˆy) f (x, y )) . The first term here is exactly the ζPζ term in the bound. We now focus on the second term, which depends on our query rule. For this we must consider three cases. Case 1. If both ˆy and y are not queried, then it must be the case that both have small cost ranges. This follows since f Fi and hf(x) = ˆy so y does not dominate ˆy. Moreover, since the cost ranges are small on both ˆy and y and since we know that f is well separated under event SC ζ (x), the relationship between ζ and ψi governs whether we make a mistake or not. Specifically, we get that SC ζ (x)1 (ˆy = y ) 1 (no query) 1 (ζ 2ψi) at round i. In other words, if we do not query and the separation is big but we make a mistake, then it must mean that the cost range threshold ψi is also big. 
Krishnamurthy, Agarwal, Huang, Daum e III, Langford Using this argument, we can bound the second term as, Ei SC ζ (x)1 (ˆy = y ) (1 Qi(ˆy))(1 Qi(y )) (f (x, ˆy) f (x, y )) Ei SC ζ (x)1 (ˆy = y ) (1 Qi(ˆy))(1 Qi(y ))2ψi Ei [1 (ζ 2ψi) 2ψi] = 1 (ζ 2ψi) 2ψi. Case 2. If both ˆy and y are queried, we can relate the second term to the square loss, Ei SC ζ (x)Qi(ˆy)Qi(y ) (f (x, ˆy) f (x, y )) ζ Ei h Qi(ˆy)Qi(y ) (f (x, ˆy) f (x, y ))2i ζ Ei h Qi(ˆy)Qi(y ) (f (x, ˆy) f(x, ˆy) + f(x, y ) f (x, y ))2i ζ Ei Qi(ˆy)(f (x, ˆy) f(x, ˆy))2 + Qi(y )(f(x, y ) f (x, y ))2 y Ei Qi(y)(f (x, y) f(x, y))2 = 2 y Ei [Mi(f; y)] . Passing from the second to third line here is justified by the fact that f (x, ˆy) f (x, y ) and f(x, ˆy) f(x, y ) so we added two non-negative quantities together. The last step uses Lemma 17. While not written, we also use the event 1 (ˆy = y ) to save a factor of 2. Case 3. The last case is if one label is queried and the other is not. Both cases here are analogous, so we do the derivation for when y(x) is queried but y (x) is not. Since in this case, y (x) is not dominated (hf(x) is never dominated provided f Fi), we know that the cost range for y (x) must be small. Using this fact, and essentially the same argument as in case 2, we get Ei SC ζ (x)Qi(ˆy)(1 Qi(y )) (f (x, ˆy) f (x, y )) ζ Ei h Qi(ˆy)(1 Qi(y )) (f (x, ˆy) f (x, y ))2i ζ Ei h Qi(ˆy) (f (x, ˆy) f(x, ˆy))2 + (1 Qi(y )) (f(x, y ) f (x, y ))2i 2ψ2 i ζ + 2 ζ Ei h Qi(ˆy) (f (x, ˆy) f(x, ˆy))2i 2ψ2 i ζ + 2 y Ei [Mi(f; y)] . We also obtain this term for the other case where y is queried but ˆy is not. To summarize, adding up the contributions from these cases (which is an over-estimate since at most one case can occur and all are non-negative), we get Ex,c[c(hf(x)) c(hf (x)] ζPζ + 1 (ζ 2ψi) 2ψi + 4ψ2 i ζ + 6 y Ei [Mi(f; y)] . This bound holds for any ζ, so it holds for the minimum. Active Learning for Cost-Sensitive Classification Proof [Proof of Lemma 19] The proof here is an easier version of the generalization bound proof for fi. First, condition on the high probability event in Theorem 15, under which we already showed that f Fi for all i [n]. Now fix some f Fi. Since by the monotonicity property defining the sets Gi, we must have f Fj for all 1 j i, we can immediately apply Lemma 18 on all rounds to bound the cost sensitive regret by j=1 min ζ>0 ζPζ + 1{ζ 2ψj}2ψj + 4ψ2 j ζ + 6 y Ej[Mj(f; y)] As in the generalization bound, the first term contributes ζPζ, the second is at most 12 ζ and the third is at most 4/ζ(1 + log(i 1)). The fourth term is slightly different. We still apply (17) to obtain j=1 Ej[Mj(f; y)] 2 i 1 j=1 Mj(f; y) + 2νn = 2 b Ri(f; y) b Ri(gi,y; y) + b Ri(gi,y; y) b Ri(f ; y) + 2νn i 1 = 2(κ + 1)νn We use this bound for each label. Putting terms together, the cost sensitive regret is ζ + 4(1 + log(i 1)) ζ + 12K(κ + 1)νn ζPζ + 44κKνn This proves containment, since this upper bounds the cost sensitive regret of every f Fi. Proof [Proof of Lemma 20] The first claim is straightforward, since Fi Fcsr(ri) and since we set the tolerance parameter in the calls to Max Cost and Min Cost to ψi/4. Specifically, bγ(xi, y) γ(xi, y, Fi) + ψi/2 γ(xi, y, Fcsr(ri)) + ψi/2. For the second claim, suppose y = y i . Then y Yi c c (xi, y) c c+(xi, yi) c c (xi, y) c c+(xi, y i ) c (xi, y, Fcsr(ri)) c+(xi, y i , Fcsr(ri)) + ψi/2 f (xi; y) γ(xi, y, Fcsr(ri)) f (xi; y i ) + γ(xi, y i , Fcsr(ri)) + ψi/2. 
This argument uses the tolerance setting ψi/4, Lemma 19 to translate between the version space and the cost-sensitive regret ball, and finally the fact that f Fcsr(ri) since it has zero cost-sensitive regret. This latter fact lets us lower (upper) bound the minimum (maximum) cost by f prediction minus (plus) the cost range. For y i we need to consider two cases. First assume y i = yi. Since by assumption |Yi| > 1, it must be the case that c c (xi, yi) c c+(xi, y i ) at which point the above derivation produces the desired implication. On the other hand, if y i = yi then c c+(xi, yi) c c+(xi, y i ), but this also implies that c c (xi, yi) c c+(xi, y i ), since minimum costs are always smaller than maximum costs, and yi is included in the search defining yi.

References

Alekh Agarwal. Selective sampling algorithms for cost-sensitive multiclass prediction. In International Conference on Machine Learning, 2013.

Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, 2014.

Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 2012.

Maria Florina Balcan and Philip M. Long. Active and passive learning of linear separators under log-concave distributions. In Conference on Learning Theory, 2013.

Maria Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In International Conference on Machine Learning, 2006.

Maria Florina Balcan, Andrei Broder, and Tong Zhang. Margin based active learning. In Conference on Learning Theory, 2007.

Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. In International Conference on Machine Learning, 2009.

Alina Beygelzimer, Daniel Hsu, John Langford, and Tong Zhang. Agnostic active learning without constraints. In Advances in Neural Information Processing Systems, 2010.

Alexandra Carpentier, Andrea Locatelli, and Samory Kpotufe. Adaptivity to noise parameters in nonparametric active learning. In Conference on Learning Theory, 2017.

R. M. Castro and R. D. Nowak. Minimax bounds for active learning. Transactions on Information Theory, 2008.

Rui Castro, Rebecca Willett, and Robert D. Nowak. Faster rates in regression via active learning. In Advances in Neural Information Processing Systems, 2005.

Giovanni Cavallanti, Nicolo Cesa-Bianchi, and Claudio Gentile. Learning noisy linear classifiers via adaptive and selective sampling. Machine Learning, 2011.

Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. Learning to search better than your teacher. In International Conference on Machine Learning, 2015.

Sanjoy Dasgupta, Daniel Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems, 2007.

Hal Daumé III, John Langford, and Daniel Marcu. Search-based structured prediction. Machine Learning, 2009.

Ofer Dekel, Claudio Gentile, and Karthik Sridharan. Robust selective sampling from single and multiple teachers. In Conference on Learning Theory, 2010.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. In Conference on Learning Theory, 2010.

Dylan Foster, Alekh Agarwal, Miroslav Dudik, Haipeng Luo, and Robert Schapire.
Practical contextual bandits with regression oracles. In International Conference on Machine Learning, 2018.

Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 2014.

Steve Hanneke and Liu Yang. Surrogate losses in passive and active learning. arXiv:1207.3772, 2012.

Steve Hanneke and Liu Yang. Minimax analysis of active learning. Journal of Machine Learning Research, 2015.

David Haussler. Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 1995.

Daniel Hsu. Algorithms for Active Learning. PhD thesis, University of California at San Diego, 2010.

Tzu-Kuo Huang, Alekh Agarwal, Daniel Hsu, John Langford, and Robert E. Schapire. Efficient and parsimonious agnostic active learning. In Advances in Neural Information Processing Systems, 2015.

Nikos Karampatziakis and John Langford. Online importance weight aware updates. In Uncertainty in Artificial Intelligence, 2011.

Akshay Krishnamurthy, Alekh Agarwal, Tzu-Kuo Huang, Hal Daumé III, and John Langford. Active learning for cost-sensitive classification. In International Conference on Machine Learning, 2017.

John Langford and Alina Beygelzimer. Sensitive error correcting output codes. In Conference on Learning Theory, 2005.

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 2004. Data available at http://www.jmlr.org/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm.

Tengyuan Liang, Alexander Rakhlin, and Karthik Sridharan. Learning with square loss: Localization through offset Rademacher complexity. In Conference on Learning Theory, 2015.

Enno Mammen and Alexandre B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 1999.

Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 2006.

Stanislav Minsker. Plug-in approach to active learning. Journal of Machine Learning Research, 2012.

Francesco Orabona and Nicolo Cesa-Bianchi. Better algorithms for selective sampling. In International Conference on Machine Learning, 2011.

Serge A. Plotkin, David B. Shmoys, and Eva Tardos. Fast approximation algorithms for fractional packing and covering problems. Mathematics of Operations Research, 1995.

Alexander Rakhlin and Karthik Sridharan. On equivalence of martingale tail bounds and deterministic regret inequalities. In Conference on Learning Theory, 2017.

Alexander Rakhlin, Karthik Sridharan, Alexandre B. Tsybakov, et al. Empirical entropy, minimax regret and minimax risk. Bernoulli, 2017.

Stephane Ross and J. Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv:1406.5979, 2014.

Stephane Ross, Paul Mineiro, and John Langford. Normalized online learning. In Uncertainty in Artificial Intelligence, 2013.

Burr Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2012.

Tianlian Shi, Jacob Steinhardt, and Percy Liang. Learning where to sample in structured prediction. In Artificial Intelligence and Statistics, 2015.

Carlos N. Silla Jr. and Alex A. Freitas. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 2011.

Wen Sun, Arun Venkatraman, Geoffrey J. Gordon, Byron Boots, and J. Andrew Bagnell.
Deeply AggreVaTeD: Differentiable imitation learning for sequential prediction. In International Conference on Machine Learning, 2017.

Vasilis Syrgkanis, Akshay Krishnamurthy, and Robert E. Schapire. Efficient algorithms for adversarial contextual learning. In International Conference on Machine Learning, 2016.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition, 2015.

Alexandre B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 2004.

Chicheng Zhang and Kamalika Chaudhuri. Beyond disagreement-based agnostic active learning. In Advances in Neural Information Processing Systems, 2014.