# Consistency Analysis for Binary Classification Revisited

Krzysztof Dembczyński¹, Wojciech Kotłowski¹, Oluwasanmi Koyejo², Nagarajan Natarajan³ (authors listed in alphabetical order)

¹Institute of Computing Science, Poznań University of Technology, Poland. ²Department of Computer Science, University of Illinois at Urbana-Champaign, USA. ³Microsoft Research, India. Correspondence to: Wojciech Kotłowski.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

**Abstract.** Statistical learning theory is at an inflection point enabled by recent advances in understanding and optimizing a wide range of metrics. Of particular interest are non-decomposable metrics such as the F-measure and the Jaccard measure, which cannot be represented as a simple average over examples. Non-decomposability is the primary source of difficulty in theoretical analysis, and interestingly has led to two distinct settings and notions of consistency. In this manuscript we analyze both settings, from statistical and algorithmic points of view, to explore the connections and to highlight differences between them for a wide range of metrics. The analysis complements previous results on this topic, clarifies common confusions around both settings, and provides guidance to the theory and practice of binary classification with complex metrics.

## 1. Introduction

Real-world applications of binary classification to complex decision problems have led to the design of a wide range of evaluation metrics (Choi & Cha, 2010). Prominent examples include area under the ROC curve (AUC) for imbalanced labels (Menon et al., 2013), F-measure for information retrieval (Lewis, 1995), and precision at the top (Kar et al., 2014; 2015; Jasinska et al., 2016). To this end, several algorithms have been proposed for optimizing many of these metrics, primarily focusing on large-scale learning, without a conscious emphasis on the statistical consequences of choosing models and their asymptotic behavior (Kar et al., 2015; Joachims, 2005). Wide use of such complex metrics has also re-invigorated research into their theoretical properties, which can then serve as a guide to practice (Koyejo et al., 2014a; Narasimhan et al., 2014a; Dembczyński et al., 2012; Waegeman et al., 2014; Natarajan et al., 2016).

Complex evaluation metrics for binary classification are best described as set metrics, or non-decomposable metrics, as, in general, the evaluation of a set of predictions cannot be decomposed into the average of individual instance evaluations. This is in contrast to decomposable metrics such as accuracy, which are defined as the empirical average of instance evaluations. This property is the primary source of difficulty in theoretical analysis, and interestingly has led to two distinct settings and notions of consistency. On one hand, Population Utility (PU) focuses on estimation, so a consistent PU classifier is one which correctly estimates the population-optimal utility as the size of the training set (equivalently, test set) increases. The PU approach has its strongest roots in classical statistical analysis, which often deals with asymptotically optimal estimation. On the other hand, Expected Test Utility (ETU) focuses on generalization; thus, a consistent ETU classifier is one which optimizes the expected prediction error over test sets of a pre-defined size.
The ETU approach has its strongest roots in statistical machine learning, which prizes generalization as the primary goal. Importantly, these distinctions are irrelevant when the metric is a linear function of the confusion matrix, e.g. (weighted) accuracy and other linear metrics. To the best of our knowledge, this dichotomy was first explicitly noted by Ye et al. (2012) in the context of the F-measure.¹ Like Ye et al. (2012), our goal is not to adjudicate the correctness of either approach, but instead to explore deep connections, and highlight significant differences, between both approaches for a wide range of metrics.

**Contributions:** We present a variety of results comparing and contrasting the PU and ETU approaches to consistent classification:

- We show that for a wide range of metrics, PU and ETU are asymptotically equivalent with respect to the size of the test set, subject to a certain p-Lipschitzness condition which is satisfied by many metrics of interest. This further implies asymptotic equivalence of the Bayes optimal classifiers (Section 3.1). Similar results were previously only known for the F-measure.
- We provide lower bounds on the difference between the PU and ETU metrics for finite test sets for certain metrics, thereby highlighting the difference between PU- and ETU-consistent classifiers with small test sets (Section 3.2).
- We analyze approximate ETU classification using low-order Taylor approximations, showing that the approximation can be computed with effectively linear complexity, yet achieves low error under standard assumptions (Section 4.1).
- We consider the effects of model misspecification and find that ETU may be more sensitive than PU, but this may be alleviated by properly calibrating the estimated probabilities (Section 4.2).
- In addition, we present experimental results using simulated and real data to evaluate our theoretical claims (Section 5).

¹Note that Ye et al. (2012) termed the two approaches Empirical Utility Maximization (EUM) and Decision Theoretic Approach (DTA), respectively. We have instead chosen the more descriptive names Population Utility (PU) and Expected Test Utility (ETU).

## 2. Preliminaries and Problem Setup

We consider the binary classification problem, where the input is a feature vector $x \in \mathcal{X}$, and the output is a label $y \in \{0, 1\}$. We assume the examples $(x, y)$ are generated i.i.d. according to $P(x, y)$. A classifier is a mapping $h \colon \mathcal{X} \to \{0, 1\}$. We let $\mathbf{1}_C$ denote the indicator function, i.e. equal to one if $C$ is satisfied, and zero otherwise. Given a distribution $P$ and a binary classifier $h$, define:

$$\mathrm{TP}(h) = P(h = 1, y = 1), \quad \mathrm{TN}(h) = P(h = 0, y = 0), \quad \mathrm{FP}(h) = P(h = 1, y = 0), \quad \mathrm{FN}(h) = P(h = 0, y = 1),$$

which are the entries of the so-called confusion matrix, namely true positives, true negatives, false positives and false negatives. In this paper, we are interested in optimizing performance metrics $\Phi(h, P)$ (we make the dependence on $P$ explicit because we will also consider the empirical version of $\Phi$) that are functions of the above four quantities. However, since the entries of the confusion matrix are interdependent, it suffices to use only three independent combinations of them. Following Natarajan et al. (2016), we parametrize $\Phi(h, P) = \Phi(u(h), v(h), p)$ by means of:

$$u(h) = \mathrm{TP}(h), \qquad v(h) = P(h = 1), \qquad p = P(y = 1).$$

As argued by Natarajan et al. (2016), any metric that is a function of the confusion matrix can be parameterized in this way.
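As a small illustration of this parameterization, the following minimal sketch (ours, not from the paper; the helper names are illustrative) maps normalized confusion-matrix entries to $(u, v, p)$ and evaluates two of the metrics listed in Table 1 below.

```python
# Minimal sketch (assumptions: normalized confusion-matrix entries summing to 1).

def confusion_to_uvp(tp, tn, fp, fn):
    """Map confusion-matrix entries to the (u, v, p) parameterization."""
    u = tp            # u = TP = P(h = 1, y = 1)
    v = tp + fp       # v = P(h = 1), the predicted-positive rate
    p = tp + fn       # p = P(y = 1), the class prior
    return u, v, p

def f_beta(u, v, p, beta=1.0):
    """F_beta written as Phi(u, v, p) = (1 + beta^2) u / (beta^2 p + v)."""
    return (1.0 + beta**2) * u / (beta**2 * p + v)

def jaccard(u, v, p):
    """Jaccard written as Phi(u, v, p) = u / (p + v - u)."""
    return u / (p + v - u)

# Example: TP = 0.3, TN = 0.5, FP = 0.1, FN = 0.1 (entries sum to 1).
u, v, p = confusion_to_uvp(0.3, 0.5, 0.1, 0.1)
print(f_beta(u, v, p), jaccard(u, v, p))   # F1 = 0.75, Jaccard = 0.6
```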
Table 1 lists popular examples of such metrics together with their explicit parameterization $\Phi(u, v, p)$.

| Metric | Definition | $\Phi(u, v, p)$ |
| --- | --- | --- |
| Accuracy | $\mathrm{TP} + \mathrm{TN}$ | $1 + 2u - v - p$ |
| AM | $\frac{\mathrm{TP}/2}{\mathrm{TP}+\mathrm{FN}} + \frac{\mathrm{TN}/2}{\mathrm{TN}+\mathrm{FP}}$ | $\frac{u + p(1 - v - p)}{2p(1-p)}$ |
| $F_\beta$ | $\frac{(1+\beta^2)\mathrm{TP}}{(1+\beta^2)\mathrm{TP} + \beta^2\mathrm{FN} + \mathrm{FP}}$ | $\frac{(1+\beta^2)u}{\beta^2 p + v}$ |
| Jaccard | $\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}+\mathrm{FN}}$ | $\frac{u}{p + v - u}$ |
| G-mean | $\sqrt{\frac{\mathrm{TP}\cdot\mathrm{TN}}{(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})}}$ | $\sqrt{\frac{u(1 - v - p + u)}{p(1-p)}}$ |
| AUC | $1 - \frac{\mathrm{FP}\cdot\mathrm{FN}}{(\mathrm{TP}+\mathrm{FN})(\mathrm{FP}+\mathrm{TN})}$ | $1 - \frac{(v-u)(p-u)}{p(1-p)}$ |

*Table 1. Examples of performance metrics.*

Throughout the paper we assume $\Phi(u, v, p)$ is bounded from above and from below.²

²In fact, for essentially all metrics used in practice it holds that $0 \le \Phi(u, v, p) \le 1$.

### 2.1. Formal Definitions of PU and ETU

**Definition 1 (Population Utility (PU)).** Given a distribution $P$ and a classifier $h$, the PU of $h$ for a performance metric $\Phi$ is defined as $\Phi(u(h), v(h), p)$. We let $h^*_{\mathrm{PU}}$ denote any maximizer of the PU,
$$h^*_{\mathrm{PU}} \in \operatorname*{argmax}_h \ \Phi(u(h), v(h), p).$$

In words, the PU is obtained by taking the value of the metric $\Phi$ evaluated at the expected confusion matrix of $h$ over $P$. Thus, one can think of the PU as evaluating the classifier $h$ on a single test set of infinite size drawn i.i.d. from $P$. In contrast, ETU evaluates the expected utility for a fixed-size test set. Formally, given a sample $S = \{(x_i, y_i)\}_{i=1}^n$ of size $n$, generated i.i.d. from $P$, we let $\hat u(h), \hat v(h), \hat p$ denote the corresponding empirical quantities:
$$\hat u(h) = \frac{1}{n}\sum_{i=1}^n h(x_i)\, y_i, \qquad \hat v(h) = \frac{1}{n}\sum_{i=1}^n h(x_i), \qquad \hat p = \frac{1}{n}\sum_{i=1}^n y_i,$$
and the empirical value of the metric $\Phi$ is then $\Phi(\hat u(h), \hat v(h), \hat p)$.

**Definition 2 (Expected Test Utility (ETU)).** Let $\mathbf{x} = (x_1, \ldots, x_n) \in \mathcal{X}^n$ be an arbitrary sequence of inputs. Given a distribution $P$ and a classifier $h$, the ETU of $h$ for a performance metric $\Phi$ conditioned on $\mathbf{x}$ is defined as³
$$\mathbb{E}_{\mathbf{y}|\mathbf{x}}\big[\Phi(\hat u(h), \hat v(h), \hat p)\big],$$
where the expectation over $\mathbf{y} = (y_1, \ldots, y_n)$ is with respect to the conditional distribution $P(y|x)$, i.i.d. over the examples. We let $h^*_{\mathrm{ETU}}(\mathbf{x})$ denote any maximizer of the ETU,
$$h^*_{\mathrm{ETU}}(\mathbf{x}) \in \operatorname*{argmax}_h \ \mathbb{E}_{\mathbf{y}|\mathbf{x}}\big[\Phi(\hat u(h), \hat v(h), \hat p)\big].$$

³The conditional expectation $\mathbf{y}|\mathbf{x}$ is defined up to a zero-measure set (over $\mathbf{x}$), but this does not create any problems as we always consider $\mathbf{x}$ being sampled from the data distribution.

One can think of ETU as evaluating the classifier $h$ on infinitely many test sets of size $n$ drawn i.i.d. from $P$. We will see (in Section 4) that the optimal predictions (in both the PU and ETU approaches) can be accurately estimated using the conditional probabilities $P(y_i|x_i)$. In practice, we first obtain an estimator of the conditional probability and then compute the optimal predictions on test data based on their conditional probability estimates.

**Remark 1.** More generally, ETU optimizes the expected utility $\mathbb{E}_{\mathbf{y},\mathbf{x}}\big[\Phi(\hat u(h), \hat v(h), \hat p)\big]$. However, it is clearly sufficient to analyze the predictions at any given $\mathbf{x}$ (Natarajan et al., 2016), as in Definition 2.

### 2.2. Well-behaved Performance Metrics

The two frameworks treat the metrics as utility measures (i.e., they are to be maximized). Further, it is reasonable to expect that $\Phi(h, P)$ is non-decreasing in the true positive and true negative rates (and indeed, virtually all performance measures used in practice behave this way). As shown by Natarajan et al. (2016), such monotonicity in the true positive and true negative rates implies another property, called TP monotonicity, which is better suited to the parameterization employed here.

**Definition 3 (TP monotonicity).** $\Phi(u, v, p)$ is said to be TP monotonic if for any $v, p$ and $u_1 > u_2$, it holds that $\Phi(u_1, v, p) > \Phi(u_2, v, p)$.

It is easy to verify that all measures in Table 1 are TP monotonic.
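To make Definitions 1 and 2 concrete, here is a minimal sketch (ours, not the paper's code) that computes the ETU of Definition 2 exactly for a tiny test set by enumerating all $2^n$ label vectors; restricting attention to "predict positive on the top-$k$ instances by $\eta$" is justified by TP monotonicity, and the $0/0$ convention for $F_1$ is an assumption of the sketch.

```python
# Minimal sketch: exact ETU by enumeration (feasible only for very small n).
from itertools import product

def etu(phi, h, eta):
    """Exact E_{y|x}[Phi(u_hat, v_hat, p_hat)] for predictions h and probabilities eta."""
    n = len(eta)
    value = 0.0
    for y in product([0, 1], repeat=n):
        prob = 1.0
        for yi, ei in zip(y, eta):                       # P(y | x) under independence
            prob *= ei if yi == 1 else 1.0 - ei
        u = sum(hi * yi for hi, yi in zip(h, y)) / n     # empirical u_hat(h)
        v = sum(h) / n                                   # empirical v_hat(h)
        p = sum(y) / n                                   # empirical p_hat
        value += prob * phi(u, v, p)
    return value

def f1(u, v, p):
    return 2.0 * u / (p + v) if (p + v) > 0 else 1.0     # 0/0 convention (assumption)

eta = [0.9, 0.6, 0.2, 0.1]                               # sorted decreasingly
for k in range(len(eta) + 1):                            # candidate top-k predictions
    h = [1] * k + [0] * (len(eta) - k)
    print(k, etu(f1, h, eta))
```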
A contribution of this work is to develop a notion of regularity for metrics that helps establish statistical connections between the two frameworks and their optimal classifiers. We call it p-Lipschitzness, defined next.

**Definition 4 (p-Lipschitzness).** $\Phi(u, v, p)$ is said to be p-Lipschitz if
$$|\Phi(u, v, p) - \Phi(u', v', p')| \le U_p|u - u'| + V_p|v - v'| + P_p|p - p'|,$$
for any feasible $u, v, p, u', v', p'$.

The Lipschitz constants $U_p, V_p, P_p$ are allowed to depend on $p$, in contrast to standard Lipschitz functions. The rationale behind p-Lipschitzness is that we want to control the change in value of the measure under small changes in its arguments. This property turns out to be essential for showing the equivalence between the ETU and PU approaches. On the other hand, if we simply used the standard definition of a Lipschitz function (with global constants), it would not be satisfied by many interesting measures. Hence, we weaken the Lipschitz property by allowing the constants to vary as functions of $p$. One can also show that the general linear-fractional performance metrics studied in (Koyejo et al., 2014a; Narasimhan et al., 2015; Kotłowski & Dembczyński, 2016) satisfy p-Lipschitzness under mild conditions (Appendix A).

**Proposition 1.** All measures in Table 1 are p-Lipschitz.

*Proof.* We only give a proof for the $F_\beta$-measure here (see Appendix A for the rest). For ease, let us denote $F_\beta(u, v, p)$ by $F_\beta$ and $F_\beta(u', v', p')$ by $F'_\beta$. Let $\Delta u = u - u'$, $\Delta v = v - v'$, $\Delta p = p - p'$. We have:
$$|F_\beta - F'_\beta| = \frac{(1+\beta^2)\,\big|u(\beta^2 p' + v') - u'(\beta^2 p + v)\big|}{(\beta^2 p + v)(\beta^2 p' + v')} = \frac{(1+\beta^2)\,\big|\Delta u\,(\beta^2 p' + v') - u'\beta^2 \Delta p - u'\Delta v\big|}{(\beta^2 p + v)(\beta^2 p' + v')}$$
$$\le \frac{1+\beta^2}{\beta^2 p + v}\left(|\Delta u| + \frac{u'\beta^2}{\beta^2 p' + v'}|\Delta p| + \frac{u'}{\beta^2 p' + v'}|\Delta v|\right).$$
Since $u' \le \min\{p', v'\}$, we have $\frac{\beta^2 u'}{\beta^2 p' + v'} \le 1$ and $\frac{u'}{\beta^2 p' + v'} \le 1$, and thus we can choose $U_p = V_p = P_p = \frac{1+\beta^2}{\beta^2 p}$. $\square$

As an example of a metric which is not p-Lipschitz, consider precision, defined as $\Phi(u, v, p) = \frac{u}{v}$. Indeed, if $v$ is close to zero, choosing $v' = 2v$, $u' = u$ and $p' = p$ gives
$$\Phi(u, v, p) - \Phi(u', v', p') = \frac{u}{v} - \frac{u}{2v} = \frac{u}{2v},$$
which can be arbitrarily large for sufficiently small $v$, while the difference $|v - v'| = v$ is small. As it turns out in Section 3.2, this pathological behavior of the precision metric is responsible for a large deviation between PU and ETU, which suggests that p-Lipschitzness is in some sense necessary for establishing connections.

## 3. Equivalence of PU and ETU

Most of the existing literature on optimizing non-decomposable classification metrics focuses on one of the two approaches in isolation. In this section, we show that the two approaches are in fact asymptotically equivalent for a range of well-behaved metrics. Informally, given a distribution $P$ and a performance metric $\Phi$, our first result is that for sufficiently large $n$, the PU of the associated $h^*_{\mathrm{ETU}}$ is arbitrarily close to that of $h^*_{\mathrm{PU}}$, and likewise, the ETU of $h^*_{\mathrm{PU}}$ is arbitrarily close to that of $h^*_{\mathrm{ETU}}$. In contrast, we also show that the PU and ETU optimal classifiers may differ for small samples.

### 3.1. Asymptotic Equivalence

The intuition behind the equivalence lies in the observation that the optimal classifiers under the two approaches exhibit a very simple, similar form, under mild assumptions on the distribution (Koyejo et al., 2014b; Narasimhan et al., 2014b; Natarajan et al., 2016). Let $\eta(x) := P(y = 1|x)$ denote the conditional probability of the positive class as a function of $x$.
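As a quick numerical illustration, the following minimal sketch assumes a synthetic distribution with $x \sim U(0,1)$ and $\eta(x) = x$ (an assumption of the sketch, not taken from the paper), and compares the closed-form PU value of $F_1$ for a thresholded classifier with a Monte Carlo estimate of its ETU value, averaging over draws of both $\mathbf{x}$ and $\mathbf{y}$ (cf. Remark 1); the gap shrinks as the test-set size $n$ grows.

```python
# Minimal sketch: PU vs. ETU for h(x) = 1[eta(x) >= tau] under eta(x) = x, x ~ U(0,1).
import numpy as np

rng = np.random.default_rng(0)
tau = 0.5

def pu_f1(tau):
    # u = E[x 1(x >= tau)] = (1 - tau^2)/2,  v = 1 - tau,  p = E[x] = 1/2
    u, v, p = (1 - tau**2) / 2, 1 - tau, 0.5
    return 2 * u / (p + v)

def etu_f1(tau, n, reps=2000):
    vals = []
    for _ in range(reps):
        x = rng.uniform(size=n)
        h = (x >= tau).astype(float)
        y = (rng.uniform(size=n) < x).astype(float)   # y_i ~ Bernoulli(eta(x_i))
        u, v, p = (h * y).mean(), h.mean(), y.mean()
        vals.append(2 * u / (p + v) if (p + v) > 0 else 1.0)
    return float(np.mean(vals))

for n in [10, 100, 1000]:
    print(n, abs(pu_f1(tau) - etu_f1(tau, n)))        # gap decreases with n
```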
The following lemma shows that for any fixed classifier $h$ that thresholds $\eta(x)$, and a sufficiently large sample size $n$, its performance measured with respect to PU and with respect to ETU are close; in particular, they differ by a term that decays as fast as $O(1/\sqrt{n})$. In fact, the result holds uniformly over all such binary classifiers.

**Lemma 1.** Let $H = \{h \mid h = \mathbf{1}_{\eta(x) \ge \tau},\ \tau \in [0, 1]\}$ be the class of thresholded binary decision functions. Let $\Phi$ be a performance metric which is p-Lipschitz. Then, with probability at least $1 - \delta$ over a random sample $S = \{(x_i, y_i)\}_{i=1}^n$ of size $n$ generated i.i.d. from $P$, it holds uniformly over all $h \in H$ that
$$\Big|\Phi\big(u(h), v(h), p\big) - \mathbb{E}_{\mathbf{y}|\mathbf{x}}\big[\Phi(\hat u(h), \hat v(h), \hat p)\big]\Big| \;\le\; L_p\sqrt{\frac{2\log\frac{n+1}{\delta}}{2n}} + \frac{L_p}{\sqrt{n}},$$
where $\hat u(h), \hat v(h), \hat p$ are the empirical quantities evaluated on $S$, and $L_p = \max\{U_p, V_p, P_p\}$.

**Remark 2.** Lemma 1 generalizes the result obtained by Ye et al. (2012) for the $F_\beta$-measure to arbitrary p-Lipschitz metrics. Furthermore, using a more careful bounding technique, we are able to get a better dependence on the sample size $n$, essentially $O(1/\sqrt{n})$ (neglecting logarithmic terms). In fact, this dependence cannot be improved any further in general (see Appendix B).

The uniform convergence result in Lemma 1 enables the first main result of this work. In particular, the convergence holds when the optimal classifiers with respect to ETU and PU are of the thresholded form, i.e. $h^*_{\mathrm{PU}} \in H$, and $h^*_{\mathrm{ETU}}(\mathbf{x}) \in H$ almost surely (with respect to the random sample of inputs $\mathbf{x}$), where $H = \{h \mid h = \mathbf{1}_{\eta(x) \ge \tau},\ \tau \in [0, 1]\}$ is the class of threshold functions on $\eta(x)$. Several recent results have shown that the optimal classifier for many popular metrics (including all metrics in Table 1) indeed has the thresholded form (Narasimhan et al., 2014a; Lewis, 1995), under a mild condition related to the continuity of the distribution of $\eta(x)$ (see the proof of Theorem 1 in Appendix B.2 for details):

**Assumption 1.** The random variable $\eta(x)$ has a density (with respect to the Lebesgue measure) on $[0, 1]$.

We are now ready to state the result. Proofs omitted in the main text are supplied in the Appendix.

**Theorem 1.** Let $\Phi$ be a performance metric that is TP monotonic and p-Lipschitz, and let $P$ be a distribution satisfying Assumption 1. Consider the ETU-optimal classifier $h^*_{\mathrm{ETU}}$ (Definition 2) and the PU-optimal classifier $h^*_{\mathrm{PU}}$ (Definition 1). Then, for any given $\epsilon$ and $\delta$, we can choose $n$ large enough (in Definition 2 of ETU) such that, with probability at least $1 - \delta$ over the random choice of the sample of inputs $\mathbf{x}$, we have
$$\Big|\Phi\big(u(h^*_{\mathrm{ETU}}(\mathbf{x})), v(h^*_{\mathrm{ETU}}(\mathbf{x})), p\big) - \Phi\big(u(h^*_{\mathrm{PU}}), v(h^*_{\mathrm{PU}}), p\big)\Big| \le \epsilon.$$
Similarly, for large enough $n$, with probability at least $1 - \delta$,
$$\Big|\mathbb{E}_{\mathbf{y}|\mathbf{x}}\big[\Phi(\hat u(h^*_{\mathrm{ETU}}(\mathbf{x})), \hat v(h^*_{\mathrm{ETU}}(\mathbf{x})), \hat p)\big] - \mathbb{E}_{\mathbf{y}|\mathbf{x}}\big[\Phi(\hat u(h^*_{\mathrm{PU}}), \hat v(h^*_{\mathrm{PU}}), \hat p)\big]\Big| \le \epsilon.$$

**Remark 3.** In essence, Theorem 1 says that, for large sample sizes, the optimum in the sense of one approach gives an accurate estimate (or a proxy) of the optimum in the sense of the other approach. Our characterization of p-Lipschitzness is key to showing the equivalence.

### 3.2. Finite Sample Regime

The aforementioned result is asymptotic; to elucidate the point, we now give an example where the optimal classifiers corresponding to PU and ETU differ. It is important to be aware of such extremities, especially when one applies a learned model to test data of modest sample size. The way we argue a lower bound is by specifying a metric and a distribution such that, on a randomly obtained test set of modest size, the gap in the empirical metric computed on the test data for the two optimal classifiers can be large.
As one is typically interested primarily in the empirical metric on a given test set, focusing on the empirical metric ensures fairness and avoids favoring either definition.

**Example.** For some constant $\alpha > 0$, consider the (adjusted) empirical precision metric defined as
$$\Phi_{\mathrm{Prec}}\big(\hat u(h(\mathbf{x})), \hat v(h(\mathbf{x})), p\big) = \frac{\hat u(h(\mathbf{x}))}{\hat v(h(\mathbf{x})) + \alpha}.$$
Note that $\Phi_{\mathrm{Prec}} \in [0, \frac{1}{1+\alpha}]$; furthermore, it is p-Lipschitz with Lipschitz constant $V_p = \frac{1}{\alpha}$ (see Definition 4). Thus, choosing a very small value of $\alpha$ implies a very high Lipschitz constant, and in turn the metric becomes less "stable". To establish the desired lower bound, we choose a small $0 < \alpha \ll 1$. Let $\{X_1, X_2, X_3\}$ denote a partition of the instance space $\mathcal{X}$, i.e. $\cup_{i=1}^3 X_i = \mathcal{X}$ and $X_i \cap X_j = \emptyset$ for any pair $i \ne j$. Consider the joint distribution $P$ defined as
$$P(y = 1 \mid x \in X_1) = 1, \qquad P(y = 1 \mid x \in X_3) = 0, \qquad P(y = 1 \mid x \in X_2) = 1 - \epsilon = 1 - \alpha, \qquad (1)$$
$$P(X_1) + P(X_3) = \epsilon^2, \qquad P(X_2) = 1 - \epsilon^2,$$
for some $\epsilon$ with $1 - \epsilon^2 > 0$, and note that the distribution is defined to depend on our choice of $\alpha$. The last line of the above set of equations says that the distribution has a small region where labels are deterministically positive or negative, but is overwhelmingly positive elsewhere.

**Theorem 2.** Let $\mathbf{x} = \{x_1, x_2, \ldots, x_n\}$ denote a set of instances drawn i.i.d. from the distribution $P$, and let $\mathbf{y} = \{y_1, y_2, \ldots, y_n\}$ denote their labels drawn from the same distribution. Then, with probability at least $1 - \epsilon^2 - \epsilon^n$,
$$\Phi_{\mathrm{Prec}}\big(\hat u(h^*_{\mathrm{ETU}}(\mathbf{x})), \hat v(h^*_{\mathrm{ETU}}(\mathbf{x})), \hat p\big) - \Phi_{\mathrm{Prec}}\big(\hat u(h^*_{\mathrm{PU}}), \hat v(h^*_{\mathrm{PU}}), \hat p\big) \ \ge\ \frac{1}{n(1 + \alpha)}.$$

## 4. Algorithms: Optimization and Conditional Probability Estimation

The characterization of the optimal classifier as a thresholding of the conditional probability yields simple and efficient PU-consistent estimators. The idea is to first obtain an estimator of the conditional probability using training data, and then search for an optimal threshold on a separate validation set (Narasimhan et al., 2014b; Koyejo et al., 2014b). The threshold search can be implemented efficiently in linear time (assuming the probabilities are pre-sorted). In contrast, although a similar thresholding characterization exists for ETU (Natarajan et al., 2016), evaluation and prediction require the computation of an expensive expectation (Definition 2). For general metrics, there is an $O(n^3)$ procedure to determine the optimal test set labeling (Jansche, 2007; Chai, 2005; Natarajan et al., 2016), and the procedure can be sped up to $O(n^2)$ in some special cases (Ye et al., 2012; Natarajan et al., 2016). Here, we consider an approximation to ETU that requires only $O(n)$ computation, yet achieves error $O(n^{-3/2})$ compared to exact optimization.

### 4.1. Approximation Algorithms

Recall that ETU seeks to find the classifier of the form
$$h^*_{\mathrm{ETU}}(\mathbf{x}) = \operatorname*{argmax}_h \ \mathbb{E}_{\mathbf{y}|\mathbf{x}}\big[\Phi(\hat u(h), \hat v(h), \hat p)\big].$$
Following (Lewis, 1995; Natarajan et al., 2016) we know that when $\Phi$ is TP monotonic, it suffices to sort the observations in decreasing order of $\eta(x)$ and assign positive labels to the top $k$ of them, for $k = 0, \ldots, n$. Unfortunately, for each $k$, we need to calculate the expected utility measure, which is time consuming, requiring $O(n^2)$ in general.

**Algorithm 1: Approximate ETU Consistent Classifier**
1. Input: $\Phi$ and sorted estimates $\eta_i$, $i = 1, 2, \ldots, n$
2. Init $s_i = 0$ for $i \in [n]$, $\tilde p = \frac{1}{n}\sum_{i=1}^n \eta_i$, $\tilde u_0 = 0$
3. Set $\Phi_0 = \Phi(0, 0, \tilde p)$
4. for $k = 1, 2, \ldots, n$ do
5. &nbsp;&nbsp;&nbsp; Set $\tilde u_k = \tilde u_{k-1} + \frac{\eta_k}{n}$, $\hat v_k = \frac{k}{n}$
6. &nbsp;&nbsp;&nbsp; Set $\Phi_k = \Phi(\tilde u_k, \hat v_k, \tilde p)$ (via Lemma 2 or 3)
7. end for
8. $k^* \leftarrow \operatorname{argmax}_{k = 0, \ldots, n} \Phi_k$
9. return $s$ such that $s_i = 1$ for $i \in [k^*]$ and $s_i = 0$ otherwise
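In code, a minimal sketch of Algorithm 1 using the zeroth-order (plug-in) value $\Phi(\tilde u_k, \hat v_k, \tilde p)$ developed in Lemma 2 below (the function names and the $0/0$ convention for $F_1$ are ours, not the paper's):

```python
# Minimal sketch of Algorithm 1 with the zeroth-order plug-in approximation.
import numpy as np

def approx_etu_predict(phi, eta):
    """eta: estimated P(y=1|x) for the n test points; returns a 0/1 prediction vector."""
    eta = np.asarray(eta, dtype=float)
    order = np.argsort(-eta)                  # sort decreasingly by eta
    eta_sorted = eta[order]
    n = len(eta)
    p_tilde = eta.mean()                      # semi-empirical p
    best_k, best_val = 0, phi(0.0, 0.0, p_tilde)
    u_tilde = 0.0
    for k in range(1, n + 1):
        u_tilde += eta_sorted[k - 1] / n      # u_tilde_k = (1/n) * sum of the top-k eta
        v_hat = k / n
        val = phi(u_tilde, v_hat, p_tilde)
        if val > best_val:
            best_k, best_val = k, val
    h = np.zeros(n, dtype=int)
    h[order[:best_k]] = 1                     # positives on the top best_k instances
    return h

f1 = lambda u, v, p: 2 * u / (p + v) if (p + v) > 0 else 1.0
print(approx_etu_predict(f1, [0.9, 0.8, 0.4, 0.2, 0.05]))   # predicts positive on the top 3 here
```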
Our goal here is to approximate this expected-utility term so that it can be computed in amortized constant time per $k$; the whole procedure can then be implemented in $O(n)$ time.

Fix a binary classifier $h \colon \mathcal{X} \to \{0, 1\}$ and the input sample $\mathbf{x} = (x_1, \ldots, x_n)$. Let $\hat u(h), \hat v(h), \hat p$ denote the empirical quantities, as defined in Section 2.1. Furthermore, we define the semi-empirical quantities
$$\tilde u(h) = \frac{1}{n}\sum_{i=1}^n h(x_i)\,\eta(x_i), \qquad \tilde p = \frac{1}{n}\sum_{i=1}^n \eta(x_i)$$
(there is no need to define $\tilde v(h)$, since $\hat v(h)$ does not depend on $\mathbf{y}$). Note that $\tilde u(h) = \mathbb{E}_{\mathbf{y}|\mathbf{x}}[\hat u(h)]$ and $\tilde p = \mathbb{E}_{\mathbf{y}|\mathbf{x}}[\hat p]$.

**Zeroth-order approximation.** Our first approximation is based on Taylor-expanding the measure up to the second order:

**Lemma 2.** If $\Phi$ is twice differentiable in $(u, p)$ and all its second-order derivatives are bounded by a constant $A$, then
$$\Big|\mathbb{E}_{\mathbf{y}|\mathbf{x}}\big[\Phi(\hat u(h), \hat v(h), \hat p)\big] - \Phi(\tilde u(h), \hat v(h), \tilde p)\Big| \le \frac{A}{2n}.$$

We note that the first-order terms vanish in the Taylor approximation (proof in the Appendix). This constitutes a simple, yet powerful method for approximating the ETU utility. Algorithm 1 outlines the resulting algorithm. As shown, the classifier can be computed in $O(n)$ time overall, assuming the data is already sorted according to $\eta(x_i)$ (otherwise, the procedure is dominated by the sorting time $O(n \log n)$). We note that Lewis (1995) proposed a similar first-order approximation, albeit without any rigorous guarantee.

**Second-order approximation.** Naturally, we can get a better approximation by Taylor-expanding the measure up to the third order.

**Lemma 3.** Assume $\Phi$ is three times differentiable in $(u, p)$ and all its third-order derivatives are bounded by a constant $B$. Let $\nabla^2_{uu}, \nabla^2_{up}, \nabla^2_{pp}$ denote the second-order derivatives of $\Phi$ evaluated at $(\tilde u(h), \hat v(h), \tilde p)$. We then have
$$\Big|\mathbb{E}_{\mathbf{y}|\mathbf{x}}\big[\Phi(\hat u(h), \hat v(h), \hat p)\big] - \Phi_{\mathrm{appr}}(h)\Big| \le \frac{B}{3n^{3/2}},$$
where
$$\Phi_{\mathrm{appr}}(h) = \Phi(\tilde u(h), \hat v(h), \tilde p) + \frac{1}{2n^2}\Big[\big(\nabla^2_{uu} + 2\nabla^2_{up}\big)\,s_u + \nabla^2_{pp}\,s_p\Big],$$
$$s_p = \sum_{i=1}^n \eta(x_i)\big(1 - \eta(x_i)\big), \qquad s_u = \sum_{i=1}^n h(x_i)\,\eta(x_i)\big(1 - \eta(x_i)\big).$$

**Theorem 3 (Consistency).** Given $n$ instances $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, sort them in decreasing order of $\eta(x_i)$. For $0 \le k \le n$, let $s^{(k)}$ denote the prediction vector with the positions corresponding to the top $k$ sorted instances set to 1, and 0 otherwise.

(a) Suppose the second-order derivatives of $\Phi$ are bounded by $A$, and let
$$h_a = \operatorname*{argmax}_{s^{(k)}} \ \Phi\big(\tilde u(s^{(k)}), \hat v(s^{(k)}), \tilde p\big).$$
Then $\Phi(h^*_{\mathrm{ETU}}) - \Phi(\tilde u(h_a), \hat v(h_a), \tilde p) \le \frac{A}{n}$.

(b) Suppose the third-order derivatives of $\Phi$ are bounded by $B$, and let
$$h_b = \operatorname*{argmax}_{s^{(k)}} \ \Phi_{\mathrm{appr}}(s^{(k)}),$$
where $\Phi_{\mathrm{appr}}(h)$ is defined in Lemma 3. Then $\Phi(h^*_{\mathrm{ETU}}) - \Phi_{\mathrm{appr}}(h_b) \le \frac{2B}{3n^{3/2}}$.

As before, the approximation can be computed in $O(n)$ total time. We could also expand the function up to orders higher than the third, and get better approximations (still with $O(n)$ computation if the order of the expansion is independent of $n$) at the cost of an even more complicated approximation formula. In experiments, we find that on real datasets with test sets of size 100 or more, even the zeroth-order approximation is highly accurate.

### 4.2. Conditional Probability Estimation and Model Misspecification

So far, we have assumed that we have access to the true conditional probability $\eta(x) = P(y = 1|x)$, and the resulting classifier is a threshold function on $\eta(x)$. In practice, one employs some probability estimation procedure and gets $\hat\eta(x)$, which we call a model.⁴ Then, one uses $\hat\eta(x)$ as if it were the true conditional probability $\eta(x)$ to obtain PU or ETU classifiers.

⁴For instance, $\hat\eta(x)$ could be obtained from logistic regression or a neural network with a soft-max function on the final layer.
Note that since $\eta(x)$ is unknown and we only have access to $\hat\eta(x)$, the best we can hope for is to choose the optimal threshold on $\hat\eta(x)$ (for PU), or to choose the optimal number $k$ of test set observations to be classified as positive after sorting them according to $\hat\eta(x)$ (for ETU). Next, we investigate these finite-sample effects in practical PU and ETU procedures. For this analysis, we treat $\hat\eta(x)$ as given and fixed, and make no other assumptions on how it was obtained. Let $\hat H = \{h \mid h = \mathbf{1}_{\hat\eta(x) \ge \tau},\ \tau \in [0, 1]\}$ denote the class of binary threshold functions on $\hat\eta(x)$.

Consider PU first, and let $h^*$ be the PU-optimal classifier from $\hat H$, i.e.
$$h^* = \operatorname*{argmax}_{h \in \hat H} \ \Phi(u(h), v(h), p).$$
In practice, however, one does not have access to $P$, and thus $u(h), v(h), p$ cannot be computed. Instead, given $\hat\eta(x)$, one uses a validation sample $S = \{(x_i, y_i)\}_{i=1}^n$ to choose a threshold on $\hat\eta(x)$ (and thus a classifier from $\hat H$), by directly optimizing the empirical version of the metric on $S$:
$$\hat h = \operatorname*{argmax}_{h \in \hat H} \ \Phi(\hat u(h), \hat v(h), \hat p).$$
We would like to assess how close $\hat h$ is to $h^*$. By following the proof of Lemma 1 (which never assumes the class $H$ is based on thresholding $\eta(x)$), it is easy to show that, with high probability,
$$\Phi(u(\hat h), v(\hat h), p) \ \ge\ \Phi(u(h^*), v(h^*), p) - \tilde O\!\left(\tfrac{1}{\sqrt{n}}\right).$$
Thus, if we have a sufficiently large validation sample at our disposal, we can set the threshold which maximizes the empirical version of the metric, and our performance is guaranteed to be $\tilde O(1/\sqrt{n})$-close to the performance of the $\Phi$-optimal classifier from $\hat H$. In other words, PU does not require knowing the true distribution in order to select the best classifier in $\hat H$; only a sufficiently large validation sample is required.

In contrast, the ETU procedure is inherently based on using $\hat\eta(x)$ as a replacement for $\eta(x)$ (which we do not know) to decide upon label assignments. Let $\mathbf{x} = (x_1, \ldots, x_n)$ be the input sample of size $n$. Assume for simplicity that the distributions of $\eta(x)$ and $\hat\eta(x)$ are continuous on $[0, 1]$, so that for any $i \ne j$, $\eta(x_i) \ne \eta(x_j)$ with probability one, and similarly for $\hat\eta$. Then, given $\mathbf{x}$ and $\hat\eta$, the ETU procedure chooses the classifier of the form
$$\hat h = \operatorname*{argmax}_{h \in \hat H} \ \mathbb{E}_{\mathbf{y} \sim \hat\eta(\mathbf{x})}\big[\Phi(\hat u(h), \hat v(h), \hat p)\big].$$
Likewise, the optimal ETU classifier in $\hat H$ is given by
$$h^* = \operatorname*{argmax}_{h \in \hat H} \ \mathbb{E}_{\mathbf{y} \sim \eta(\mathbf{x})}\big[\Phi(\hat u(h), \hat v(h), \hat p)\big],$$
i.e. by definition, the optimal classifier in the restricted class $\hat H$ involves the expectation with respect to the true $\eta$. Let us denote $\Phi_{\mathrm{ETU}}(h) = \mathbb{E}_{\mathbf{y} \sim \eta(\mathbf{x})}\big[\Phi(\hat u(h), \hat v(h), \hat p)\big]$, so that $h^*$ maximizes $\Phi_{\mathrm{ETU}}$. In the supplementary material, we show that under some mild assumptions on $\Phi$:
$$\mathbb{E}_{\mathbf{x}}\big[\Phi_{\mathrm{ETU}}(h^*) - \Phi_{\mathrm{ETU}}(\hat h)\big] \ \le\ O\!\left(\tfrac{1}{\sqrt{n}}\right) + P_p\,|p - p_{\hat\eta}| + \sup_{h \in \hat H} U_p\,\big|u(h) - u_{\hat\eta}(h)\big|,$$
where $p_{\hat\eta} = \mathbb{E}[\hat\eta(x)]$ and $u_{\hat\eta}(h) = \mathbb{E}[h(x)\hat\eta(x)]$ are the quantities corresponding to $p$ and $u(h)$, calculated by replacing the conditional probability $\eta$ with its estimate $\hat\eta$. Thus, while for the PU procedure the difference between $\hat h$ and $h^*$ diminishes as $n$ grows, this is not the case for ETU, as there are two bias terms, $|p - p_{\hat\eta}|$ and $|u(h) - u_{\hat\eta}(h)|$, which do not depend on $n$. These terms correspond to using an incorrect conditional probability $\hat\eta$ while selecting the classifier, and are present even if the sample size tends to infinity. Thus, it seems crucial for the success of the ETU procedure to have $\hat\eta$ calibrated with respect to the true distribution.
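The practical PU procedure above, by contrast, interacts with $\hat\eta$ only through the ordering it induces and scores each candidate threshold with the validation labels, which is one way to see its robustness to miscalibration. A minimal sketch of this validation-based threshold search (the helper names and the tie-breaking choice of threshold are ours; the scan is linear once the scores are sorted):

```python
# Minimal sketch: tune a threshold on eta_hat by maximizing the empirical metric
# on a labeled validation sample.
import numpy as np

def tune_threshold(phi, eta_hat, y):
    """Return (threshold, value) maximizing the empirical Phi on validation data."""
    eta_hat, y = np.asarray(eta_hat, float), np.asarray(y, float)
    order = np.argsort(-eta_hat)
    eta_s, y_s = eta_hat[order], y[order]
    n = len(y)
    p_hat = y.mean()
    best_tau, best_val = 1.0, phi(0.0, 0.0, p_hat)        # predict all negative
    tp = 0.0
    for k in range(1, n + 1):                             # predict positive on top-k
        tp += y_s[k - 1]
        val = phi(tp / n, k / n, p_hat)
        if val > best_val:
            # place the threshold halfway between the k-th and (k+1)-th sorted score
            nxt = eta_s[k] if k < n else 0.0
            best_tau, best_val = (eta_s[k - 1] + nxt) / 2, val
    return best_tau, best_val

f1 = lambda u, v, p: 2 * u / (p + v) if (p + v) > 0 else 1.0
tau, val = tune_threshold(f1, eta_hat=[0.9, 0.7, 0.55, 0.3, 0.1], y=[1, 1, 0, 1, 0])
print(tau, val)   # threshold 0.2, empirical F1 ~ 0.857 on this toy validation set
```

Together with the approximate ETU predictor sketched in Section 4.1, this is essentially the pair of procedures contrasted in the experiments of Section 5.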
A popular choice for class probability estimation is to use logistic regression. However, if the model is misspecified, which happens often in practice, the discussion above suggests that the desired ETU solution may not be achieved. Therefore, we need to learn the class probability function more carefully. Here, we consider two variants.

The first is to use the Isotron algorithm. In the case of a generalized linear model, i.e. $P(y|x) = \gamma(\langle w^*, x\rangle)$ for some unknown link function $\gamma$ and model $w^*$, Kalai & Sastry (2009) proposed a simple and elegant algorithm (see Appendix E) that alternately learns $w^*$ and the link function $\gamma$ (approximated by a piecewise linear function). It provably learns the model under certain assumptions on $P$. The model $w^*$ and the link function $\gamma$ are learned using training data, and at prediction time the link function and the scores of the training data (i.e., $x_i^\top w$) are used to calibrate the class probabilities $\eta(x)$ of test instances.

We also consider using a recalibrated logistic model, i.e., we first estimate the class probabilities via standard logistic regression, and recalibrate the probabilities by running one update of the $\gamma$ function in the Isotron algorithm (which essentially solves an isotonic regression problem via the Pool Adjacent Violators algorithm). At test time, we use the learnt $\gamma$ and the logistic model to estimate $\eta(x)$ for test instances.

## 5. Experiments

We empirically evaluate the effectiveness and accuracy of the ETU approximations introduced in Section 4.1, on synthetic as well as real datasets. We also show on several benchmark datasets that, by carefully calibrating the conditional probabilities in ETU, we can improve the classification performance.

### 5.1. Convergence of Approximations

We consider the F1 and Jaccard metrics from Table 1. We sample conditional probabilities $\eta_i$ for $n$ instances from the uniform distribution. The optimal predictions (see Definition 2) are obtained using Algorithm 1 of Natarajan et al. (2016) (which is equivalent to searching over the $2^n$ possible label vectors). Then we compute the approximate optimal predictions using the first- and second-order approximations discussed in Section 4.1. For each metric, we measure the deviation between the true and the approximate optimal values with increasing sample size in Figure 1. We observe linear convergence for the first-order approximation and quadratic convergence for the second-order approximation. This suggests that the bounds in Theorem 3 can indeed be improved for some metrics, if not in general.

*Figure 1. Convergence of approximations demonstrated on synthetic data: deviation of the approximate optimum from the exact optimum, $|F_1^{(\mathrm{app})} - F_1^*|$ (a) and $|\mathrm{Jacc}^{(\mathrm{app})} - \mathrm{Jacc}^*|$ (b), as a function of $n$, for the first- and second-order approximations.*

### 5.2. Approximations on Real Data

We report results on seven multiclass and multilabel benchmark datasets: (1) LETTERS: 16000 train, 4000 test instances; (2) SCENE: 1137 train, 1093 test; (3) YEAST: 1500 train, 917 test; (4) WEBPAGE: 6956 train, 27824 test; (5) IMAGE: 1300 train, 1010 test; (6) BREAST CANCER: 463 train, 220 test instances; (7) SPAMBASE: 3071 train, 1530 test instances.⁵ For multiclass datasets, we report results (using one-vs-all classifiers) averaged over classes (as in Natarajan et al. (2016)).

⁵See (Koyejo et al., 2014b; Ye et al., 2012) for details.
We compare the exact ETU-optimal predictions, computed using the algorithm of Natarajan et al. (2016), with the approximations. The results for the F1 and Jaccard metrics are presented in Table 2. They convincingly show that the approximations are highly accurate, and almost always indistinguishable from optimizing the true metrics, on real datasets. Note that even the first-order approximation (in fact, this is the zeroth-order approximation, as the first-order term is zero; see Section 4) achieves high accuracy, as the test set sizes are relatively large.

| DATASET | Exact F1 | Approx (first) F1 | Approx (second) F1 | Exact Jaccard | Approx (first) Jaccard | Approx (second) Jaccard |
| --- | --- | --- | --- | --- | --- | --- |
| LETTERS (26) | 0.5666 | 0.5666 | 0.5666 | 0.4273 | 0.4272 | 0.4273 |
| SCENE (6) | 0.6916 | 0.6917 | 0.6916 | 0.5374 | 0.5376 | 0.5374 |
| YEAST (14) | 0.4493 | 0.4494 | 0.4493 | 0.3242 | 0.3242 | 0.3242 |
| WEB PAGE | 0.6336 | 0.6336 | 0.6336 | 0.4637 | 0.4637 | 0.4637 |
| SPAMBASE | 0.8448 | 0.8457 | 0.8448 | 0.7314 | 0.7326 | 0.7314 |
| IMAGE | 0.8542 | 0.8542 | 0.8542 | 0.7455 | 0.7455 | 0.7455 |
| BREAST CANCER | 0.9660 | 0.9660 | 0.9660 | 0.9342 | 0.9342 | 0.9342 |

*Table 2. Comparison of ETU approximation methods on the F1 and Jaccard metrics defined in Table 1. Reported values are performance on held-out data (higher is better). For multiclass datasets (number of classes in parentheses), average performance over classes is reported. Evidently, the ETU approximations are accurate to at least 4 decimal digits in several cases.*

### 5.3. Model Misspecification

We now study how class probability estimation (CPE) and model misspecification affect the performance of the PU and ETU approaches, on the seven benchmark datasets. We compare four methods: (a) ETU with logistic regression based CPE, (b) ETU with Isotron based CPE (discussed in Section 4.2), (c) ETU with recalibrated logistic regression based CPE (discussed in Section 4.2), and (d) PU using logistic regression based CPE followed by threshold tuning on a validation set (Koyejo et al., 2014b). Additional comparisons to structured SVM (Joachims, 2005) and other classifiers are available in previously published work by others (Koyejo et al., 2014b; Natarajan et al., 2016), and are omitted here. The results are presented in Table 3.

| DATASET | Logistic F1 | Isotron F1 | Recalibrated Logistic F1 | PU F1 | Logistic Jaccard | Isotron Jaccard | Recalibrated Logistic Jaccard | PU Jaccard |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LETTERS (26) | 0.5666 | 0.5905 | 0.5722 | 0.5745 | 0.4273 | 0.4447 | 0.4304 | 0.4318 |
| SCENE (6) | 0.6916 | 0.6222 | 0.6809 | 0.4397 | 0.5374 | 0.4673 | 0.5362 | 0.4318 |
| YEAST (14) | 0.4493 | 0.4666 | 0.4629 | 0.4730 | 0.3242 | 0.3399 | 0.3414 | 0.3422 |
| WEB PAGE | 0.6336 | 0.7362 | 0.6756 | 0.6809 | 0.4637 | 0.5825 | 0.5037 | 0.5194 |
| SPAMBASE | 0.8448 | 0.7825 | 0.8905 | 0.8839 | 0.7314 | 0.5729 | 0.7837 | 0.8003 |
| IMAGE | 0.8542 | 0.8683 | 0.8569 | 0.8581 | 0.7455 | 0.7673 | 0.7704 | 0.7623 |
| BREAST CANCER | 0.9660 | 0.9669 | 0.9799 | 0.9766 | 0.9342 | 0.9359 | 0.9605 | 0.9481 |

*Table 3. Modeling conditional probabilities in ETU: logistic model vs. models calibrated using Isotron, for the F1 and Jaccard metrics defined in Table 1. Reported values are performance on held-out data (higher is better). For multiclass datasets (number of classes in parentheses), average performance over classes is reported.*

We observe that the logistic model (column 1) is insufficient for many of the datasets. The results improve in several cases using the generalized linear model estimated with Isotron (column 2).
However, there is a confounding factor: the two algorithms are very different, and the observed improvement may not necessarily be due to better CPE. To isolate this, the recalibrated logistic model results are presented in column 3. The results are in general much better than those of the standard logistic model, which suggests that it is indeed a case of model misspecification in these datasets. Finally, we present the results for the PU algorithm in column 4. We find that they closely match those of the recalibrated logistic model (except in the case of the SCENE dataset); thus, correcting for model misspecification helps demonstrate the theorized asymptotic equivalence of the PU and ETU approaches in practice.

## 6. Conclusions and Future Work

We have presented new results which elucidate the relationship between the two notions of consistency for complex binary classification metrics. Next, we plan to explore surrogates to further improve training efficiency for non-decomposable metrics. We will also extend the analysis to more complex prediction problems such as multilabel classification, where a similar dichotomy exists.

## Acknowledgments

W. Kotłowski has been supported by the Polish National Science Centre under Grant No. 2013/11/D/ST6/03050.

## References

Chai, Kian Ming Adam. Expectation of F-measures: tractable exact computation and some empirical observations of its properties. In Proceedings of the 28th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 593-594. ACM, 2005.

Choi, Seung-Seok and Cha, Sung-Hyuk. A survey of binary similarity and distance measures. Journal of Systemics, Cybernetics and Informatics, pp. 43-48, 2010.

Dembczyński, Krzysztof, Waegeman, Willem, Cheng, Weiwei, and Hüllermeier, Eyke. On label dependence and loss minimization in multi-label classification. Machine Learning, 88(1-2):5-45, 2012.

Jansche, Martin. A maximum expected utility framework for binary sequence labeling. In Annual Meeting of the Association of Computational Linguistics, volume 45, pp. 736, 2007.

Jasinska, Kalina, Dembczyński, Krzysztof, Busa-Fekete, Róbert, Pfannschmidt, Karlson, Klerx, Timo, and Hüllermeier, Eyke. Extreme F-measure maximization using sparse probability estimates. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pp. 1435-1444, 2016.

Joachims, Thorsten. A support vector method for multivariate performance measures. In Proceedings of the 22nd Intl. Conf. on Machine Learning, pp. 377-384. ACM, 2005.

Kalai, Adam and Sastry, Ravi. The Isotron algorithm: High-dimensional isotonic regression. In Conference on Learning Theory (COLT), 2009.

Kar, Purushottam, Narasimhan, Harikrishna, and Jain, Prateek. Online and stochastic gradient methods for non-decomposable loss functions. In Advances in Neural Information Processing Systems, pp. 694-702, 2014.

Kar, Purushottam, Narasimhan, Harikrishna, and Jain, Prateek. Surrogate functions for maximizing precision at the top. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 189-198, 2015.

Kotłowski, Wojciech and Dembczyński, Krzysztof. Surrogate regret bounds for generalized classification performance metrics. Machine Learning Journal, DOI 10.1007/s10994-016-5591-7, 2016.

Koyejo, Oluwasanmi, Natarajan, Nagarajan, Ravikumar, Pradeep K., and Dhillon, Inderjit S. Consistent binary classification with generalized performance metrics.
In Neural Information Processing Systems (NIPS), 2014a.

Koyejo, Oluwasanmi O., Natarajan, Nagarajan, Ravikumar, Pradeep K., and Dhillon, Inderjit S. Consistent binary classification with generalized performance metrics. In Advances in Neural Information Processing Systems, pp. 2744-2752, 2014b.

Lewis, David D. Evaluating and optimizing autonomous text classification systems. In Proceedings of the 18th Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 246-254. ACM, 1995.

Menon, Aditya, Narasimhan, Harikrishna, Agarwal, Shivani, and Chawla, Sanjay. On the statistical consistency of algorithms for binary classification under class imbalance. In Proceedings of the 30th Intl. Conf. on Machine Learning, pp. 603-611, 2013.

Mohri, Mehryar, Rostamizadeh, Afshin, and Talwalkar, Ameet. Foundations of Machine Learning. The MIT Press, 2012.

Narasimhan, Harikrishna, Vaish, Rohit, and Agarwal, Shivani. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In Neural Information Processing Systems (NIPS), 2014a.

Narasimhan, Harikrishna, Vaish, Rohit, and Agarwal, Shivani. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In Advances in Neural Information Processing Systems, pp. 1493-1501, 2014b.

Narasimhan, Harikrishna, Ramaswamy, Harish, Saha, Aadirupa, and Agarwal, Shivani. Consistent multiclass algorithms for complex performance measures. In Proceedings of the 32nd Intl. Conf. on Machine Learning, pp. 2398-2407, 2015.

Natarajan, Nagarajan, Koyejo, Oluwasanmi, Ravikumar, Pradeep, and Dhillon, Inderjit. Optimal classification with multivariate losses. In Proceedings of the 33rd International Conference on Machine Learning, pp. 1530-1538, 2016.

Waegeman, Willem, Dembczyński, Krzysztof, Jachnik, Arkadiusz, Cheng, Weiwei, and Hüllermeier, Eyke. On the Bayes-optimality of F-measure maximizers. Journal of Machine Learning Research, 15:3333-3388, 2014.

Ye, Nan, Chai, Kian Ming A., Lee, Wee Sun, and Chieu, Hai Leong. Optimizing F-measures: a tale of two approaches. In Proceedings of the Intl. Conf. on Machine Learning, 2012.