Binary Classification with Karmic, Threshold-Quasi-Concave Metrics

Bowei Yan 1, Oluwasanmi Koyejo 2, Kai Zhong 1, Pradeep Ravikumar 3

1 University of Texas at Austin, Austin, Texas, USA; 2 University of Illinois at Urbana-Champaign, Champaign, Illinois, USA; 3 Carnegie Mellon University, Pittsburgh, Pennsylvania, USA. Correspondence to: Bowei Yan.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

Complex performance measures, beyond the popular measure of accuracy, are increasingly being used in the context of binary classification. These complex performance measures are typically not even decomposable, that is, the loss evaluated on a batch of samples cannot typically be expressed as a sum or average of losses evaluated at individual samples, which in turn requires new theoretical and methodological developments beyond standard treatments of supervised learning. In this paper, we advance this understanding of binary classification for complex performance measures by identifying two key properties: a so-called Karmic property, and a more technical threshold-quasi-concavity property, which we show is milder than existing structural assumptions imposed on performance measures. Under these properties, we show that the Bayes optimal classifier is a threshold function of the conditional probability of the positive class. We then leverage this result to come up with a computationally practical plug-in classifier, via a novel threshold estimator, and further provide a novel statistical analysis of classification error with respect to complex performance measures.

1. Introduction

Binary classification, with the goal of predicting a binary response given input features, is perhaps the classical problem in machine learning, with wide-ranging applications. A key ingredient in binary classification is a performance measure that quantifies how well a given classifier fits the data. While the performance measure of accuracy has been the predominant focus of both theory and practice, it has severe limitations in many practical settings, such as imbalanced classes, and settings where different types of errors made by the classifier have different costs (Gu et al., 2009; Wallace et al., 2011). Accordingly, practitioners in applied machine learning settings such as information retrieval and medical diagnosis have developed complex performance metrics that capture important trade-offs between different types of errors; we have collated a few instances in Table 1.

A key complication with many complex classification performance metrics, such as the F-measure (Manning et al., 2008) and the Harmonic Mean (Kennedy et al., 2009), is that they cannot be decomposed into the sum or average of individual losses on each sample. Even the simple performance measure of precision (the fraction of correct positive predictions among the set of positive predictions) is not a sum of individual losses on each sample. Thus, the standard theoretical and practical treatments of supervised learning, such as standard empirical risk minimization that minimizes the empirical expectation of a loss evaluated on a single random example, are not applicable. This practical reality has motivated research into effective and efficient algorithms tailored to complex non-decomposable performance measures.
One class of approaches extends standard empirical risk minimization to this non-decomposable setting, but often relies on strong assumptions on either the form of the classifiers, such as requiring linear classifiers (Narasimhan et al., 2015a), or is restricted to specific performance measures such as the F-measure (Parambath et al., 2015). An alternative approach is the plug-in estimator, where we first derive the form of the Bayes optimal classifier, estimate the statistical quantities associated with the Bayes optimal classifier, and finally plug in the sample estimates of the population quantities to obtain the overall estimate of the Bayes optimal classifier. In particular, for many complex performance metrics, the Bayes optimal classifier is simply a thresholding of the conditional probability of the positive class (Koyejo et al., 2014; Narasimhan et al., 2014), so that the plug-in estimator requires (a) an estimate of the conditional probability, and (b) the associated threshold. Plug-in methods have been of particular interest in non-parametric functional estimation as they typically require weaker assumptions on the function class and are often easy to implement.

In this paper, we seek to advance our understanding and practice of binary classification under complex non-decomposable performance measures. We show that for a very broad class of performance measures, encompassing a large set of performance measures used in practice, the Bayes optimal classifier is simply a thresholding of the conditional probability of the response. Towards this general result, we identify two key properties that a performance measure could satisfy. The first is what we call a Karmic property, which loosely requires the performance measure to be sensitive to an increase in true positives and true negatives, and a decrease in false positives and false negatives. The second is a more technical property we call threshold-quasi-concavity, which ensures the performance measure is well-behaved around an optimal threshold. As we show, these properties are satisfied by performance metrics used in practice, and in particular, these conditions are milder than existing results that restrict either the structural form of the performance measures, or impose strong shape constraints such as particular monotonicities.

Our general result has two main consequences, which we investigate further: one algorithmic, and the other for the analysis of classification error for general performance measures. As the algorithmic consequence, we leverage the derived form of the Bayes optimal classifier, and some additional general assumptions on the performance measures, to provide a tractable algorithm to estimate the threshold, which, coupled with an estimator of the conditional probability, provides a tractable plug-in estimator of the Bayes optimal classifier. Towards the statistical analysis consequence, we provide an analysis of the excess classification error, with respect to general non-decomposable performance measures, for the general class of plug-in estimators of our class of Bayes optimal classifiers. Our analysis of classification error rates for such plug-in classifiers depends on three natural quantities: the rate of convergence for the conditional probability estimate, the rate of convergence for the threshold estimate, and a measurement of noise in the data.
For the last part, we extend margin or low-noise assumptions for binary classification with the accuracy performance measure to complex performance measures. Low-noise assumptions, proposed by Mammen et al. (1999) in the context of the accuracy performance measure, bound the noise level in the neighborhood of the Bayes optimal threshold, i.e., 1/2 for standard classification. Under such a low-noise assumption, Audibert et al. (2007) derive fast convergence rates for plug-in classification rules based on the smoothness of the conditional probability function. Similar margin assumptions have also been introduced for density level set estimation by Polonik (1995). We provide a natural extension of such a low-noise assumption, under which we provide explicit rates of convergence of classification error with respect to complex performance measures. We provide corollaries for both parametric and non-parametric instances of our general class of plug-in classifiers.

The rest of the paper is organized as follows. In Section 2 we introduce the problem and relevant notation. The characterization and properties of the Bayes optimal classifier are derived in Section 3. We discuss the algorithm for estimating the plug-in estimator in Section 4, and present the statistical convergence guarantee in Section 5. Applications of the derived rate to two special cases, the Gaussian generative model and β-Hölder class conditional probabilities, are presented in Section 6, where explicit convergence rates are provided. We conclude the paper in Section 7. Detailed proofs are deferred to the supplementary materials.

2. Problem Setup and Preliminaries

Binary classification entails predicting a binary label $Y \in \{\pm 1\}$ associated with a feature vector $X \in \mathcal{X} \subseteq \mathbb{R}^d$. A function $f: \mathcal{X} \to \{\pm 1\}$ mapping from the feature space $\mathcal{X}$ to the labels $\{\pm 1\}$ is called a binary classifier. Let $\Theta = \{f: \mathcal{X} \to \{\pm 1\}\}$ denote a set of binary classifiers. We assume $(X, Y)$ has distribution $\mathbb{P} \in \mathcal{P}$, and let $\eta(x) := \mathbb{P}(Y = 1 \mid X = x)$ denote the conditional probability of the label $Y$ given feature vector $x$. A key quantity is the confusion matrix, which consists of four population quantities: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

Definition 2.1 (Confusion Matrix). For any classifier $f: \mathbb{R}^d \to \{\pm 1\}$, its confusion matrix is defined as $C(f, \mathbb{P}) := [\mathrm{TP}(f, \mathbb{P}), \mathrm{FP}(f, \mathbb{P}), \mathrm{FN}(f, \mathbb{P}), \mathrm{TN}(f, \mathbb{P})] \in [0, 1]^4$, where $\mathrm{TP}(f, \mathbb{P}) = \mathbb{P}(Y = +1, f(X) = +1)$, $\mathrm{FP}(f, \mathbb{P}) = \mathbb{P}(Y = -1, f(X) = +1)$, $\mathrm{FN}(f, \mathbb{P}) = \mathbb{P}(Y = +1, f(X) = -1)$, and $\mathrm{TN}(f, \mathbb{P}) = \mathbb{P}(Y = -1, f(X) = -1)$.

Another key ingredient is the utility or performance measure $U: \Theta \times \mathcal{P} \to \mathbb{R}$, which measures the performance of a classifier. In this paper, we focus on complex binary classification performance measures that can be expressed as a function of the confusion matrix. Formally,
$$U(f, \mathbb{P}) = G(C(f, \mathbb{P})). \qquad (1)$$
When it is clear from context, we will drop the dependency on the distribution $\mathbb{P}$ in $C$ and $U$. The confusion-matrix functions $G$ corresponding to popular performance measures are listed in Table 1. Given the performance measure $U$, we are interested in the corresponding Bayes optimal classifier:
$$f^* = \arg\max_{f \in \Theta} U(f, \mathbb{P}). \qquad (2)$$
Given any candidate classifier $f$, we are then interested in the excess risk $U(f^*, \mathbb{P}) - U(f, \mathbb{P})$, which is the utility gap between a given classifier $f$ and the corresponding Bayes optimal.
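To make Definition 2.1 and the form $U(f, \mathbb{P}) = G(C(f, \mathbb{P}))$ concrete, here is a minimal Python sketch that computes the empirical confusion vector of a classifier from labeled samples and evaluates a metric on it. The helper name `confusion_vector` and the use of NumPy are our own illustrative choices, not part of the paper.

```python
import numpy as np

def confusion_vector(y_true, y_pred):
    """Empirical confusion vector C = [TP, FP, FN, TN] as fractions of the sample,
    for labels in {-1, +1} (Definition 2.1 with P replaced by the empirical distribution)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.mean((y_true == 1) & (y_pred == 1))
    fp = np.mean((y_true == -1) & (y_pred == 1))
    fn = np.mean((y_true == 1) & (y_pred == -1))
    tn = np.mean((y_true == -1) & (y_pred == -1))
    return np.array([tp, fp, fn, tn])

# Example: U(f, P_n) = G(C(f, P_n)) with the accuracy metric G(C) = TP + TN.
y_true = np.array([1, 1, -1, -1, 1, -1])
y_pred = np.array([1, -1, -1, 1, 1, -1])
C = confusion_vector(y_true, y_pred)
print(C, C[0] + C[3])   # confusion vector and its accuracy utility
```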
Assumption 1 (Karmic Performance Measure). The confusion-matrix function $G$ corresponding to the performance measure $U$ is Lipschitz continuous, and satisfies the condition that $\nabla G(C)^T (1, -1, -1, 1)^T \geq C_B$ for some constant $C_B > 0$.

Table 1. Examples of evaluation metrics. Notation: TPR = TP/(TP + FN); TNR = TN/(TN + FP).

| Metric | Definition | Reference | G(C) |
| --- | --- | --- | --- |
| Accuracy | TP + TN | | (1, 0, 0, 1)C |
| Arithmetic Mean (AM) | (TPR + TNR)/2 | Menon et al. (2013) | (C1/(C1 + C3) + C4/(C2 + C4))/2 |
| Youden's Index | TPR + TNR − 1 | Youden (1950) | C1/(C1 + C3) + C4/(C2 + C4) − 1 |
| Fβ | (1 + β^2)TP / ((1 + β^2)TP + β^2 FN + FP) | Van Rijsbergen (1974) | (1 + β^2, 0, 0, 0)C / (1 + β^2, 1, β^2, 0)C |
| Linear-Fractional | (a1 TP + a2 FP + a3 FN + a4 TN) / (b1 TP + b2 FP + b3 FN + b4 TN) | Koyejo et al. (2014) | aᵀC / bᵀC |
| G-Mean | √(TPR · TNR) | Daskalaki et al. (2006) | √(C1 C4 / ((C1 + C3)(C2 + C4))) |
| Q-Mean | 1 − √(((1 − TPR)^2 + (1 − TNR)^2)/2) | Kubat et al. (1997) | 1 − √(((C3/(C1 + C3))^2 + (C2/(C2 + C4))^2)/2) |
| H-Mean | 2 / (1/TPR + 1/TNR) | Kennedy et al. (2009) | 2 / ((C1 + C3)/C1 + (C2 + C4)/C4) |

We term performance measures that satisfy the condition $\nabla G(C)^T (1, -1, -1, 1)^T \geq C_B$ "Karmic measures", since it guarantees a lower bound on the sensitivity of the performance measure in the direction of increasing true positives and true negatives, and decreasing false positives and false negatives. While our Karmic assumption slightly weakens the existing monotonicity assumption used in the literature, it is worth pointing out that the analysis in (Narasimhan et al., 2014) requires not only monotonicity but also additional assumptions (Assumption B in (Narasimhan et al., 2014)). Assumption B assumes the existence and uniqueness of an optimal threshold, which turns out to be non-trivial to check. Our analysis of threshold-quasi-concavity closes this gap.

Assumption 1 is satisfied if $G$ is strictly monotonically increasing with respect to TP, TN and decreasing with respect to FP, FN. Such an assumption is natural in that one would typically prefer a classifier with higher TP for fixed TN and vice versa (Narasimhan et al., 2015b). This condition is satisfied by most metrics in common use; e.g., for the $F_1$ measure, $\nabla G(C)^T (1, -1, -1, 1)^T = \frac{4(\mathrm{FP} + \mathrm{FN})}{(2\mathrm{TP} + \mathrm{FP} + \mathrm{FN})^2}$ is strictly positive as long as the data is not fully separable.
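To illustrate how the metrics in Table 1 are expressed as functions $G$ of the confusion vector $C = [\mathrm{TP}, \mathrm{FP}, \mathrm{FN}, \mathrm{TN}]$, here is a small Python sketch transcribing a few rows of the table (it can be combined with the `confusion_vector` helper above); the function names are ours, for illustration only.

```python
import numpy as np

def g_fbeta(C, beta=1.0):  # F_beta row of Table 1
    tp, fp, fn, tn = C
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

def g_am(C):               # Arithmetic Mean (TPR + TNR) / 2
    tp, fp, fn, tn = C
    return (tp / (tp + fn) + tn / (tn + fp)) / 2.0

def g_gmean(C):            # G-Mean sqrt(TPR * TNR)
    tp, fp, fn, tn = C
    return np.sqrt(tp / (tp + fn) * tn / (tn + fp))

def g_hmean(C):            # H-Mean 2 / (1/TPR + 1/TNR)
    tp, fp, fn, tn = C
    tpr, tnr = tp / (tp + fn), tn / (tn + fp)
    return 2.0 / (1.0 / tpr + 1.0 / tnr)

C = np.array([0.35, 0.10, 0.15, 0.40])   # an example confusion vector summing to 1
print(g_fbeta(C), g_am(C), g_gmean(C), g_hmean(C))
```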
2.1. Related Work

Representation of the Bayes Optimal Classifier. The Bayes optimal classifier under the accuracy metric is classically known to be a thresholding of the conditional probability of the response, with the threshold of 1/2 (see e.g. Devroye et al. (2013)). This property of the Bayes optimal classifier having the thresholded form is called the probability thresholding principle for binary classification by Lewis (1995). Prior work has also shown that the thresholding principle holds, with a metric-dependent threshold, for more complex specific measures such as the F-measure (Jansche, 2007; Zhao et al., 2013), the Arithmetic Mean (AM) (Menon et al., 2013), linear-fractional performance metrics (Koyejo et al., 2014), and monotonic concave metrics (Narasimhan et al., 2015a).

Plug-in Classifiers for Complex Metrics. For Bayes optimal classifiers that have the thresholded form, a line of work has devised plug-in classifiers that estimate the threshold and the conditional probability of the response. For the AM metric, Menon et al. (2013) show that the threshold is simply the proportion of the positive class. For linear-fractional functions, Koyejo et al. (2014) provide an implicit characterization of the optimal threshold, but solving it requires knowledge of the optimal classifier, which is unknown in practice. As a practical estimator, Koyejo et al. (2014) propose an exhaustive search for the threshold over all data samples, and show that the resulting algorithm is consistent, but non-asymptotic convergence rates for it are not known. Narasimhan et al. (2014) also note the importance of estimating the optimal threshold, but do not provide practical algorithms. As we show in Section 3, the empirical risk as a function of the threshold is in general neither convex nor concave. Hence, care must be taken to construct an optimization algorithm that guarantees convergence to the true threshold.

Estimators Designed for Specific Utility Functions. Perhaps the most studied non-decomposable performance metric is the F-measure (Nan et al., 2012; Joachims, 2005; Zhao et al., 2013), with wide use in information retrieval and related areas, and for which researchers have developed tailored estimators (Nan et al., 2012; Joachims, 2005) as well as risk bounds for these estimators (Zhao et al., 2013). For instance, Busa-Fekete et al. (2015) propose a scalable online F-measure estimator for large-scale datasets, with a root-finding algorithm for the threshold update that exploits special properties of the F-measure. Similarly, for the Arithmetic Mean (AM) measure, Menon et al. (2013) design a consistent optimization scheme, based on a balanced classification-calibrated surrogate to AM. Unfortunately, these techniques are not easily extended to general complex performance metrics.

Algorithms for General Classification Measures. Joachims (2005) poses the classification problem as a structured prediction problem and, for linear classifiers, proposes a structural SVM solver, for which neither consistency nor explicit convergence rates are known. Kar et al. (2014) propose an online gradient descent algorithm which requires function classes that satisfy a uniform convergence property, which is difficult to verify a priori. Along similar lines, Narasimhan et al. (2015a) propose a stochastic gradient method that involves a linearization of the classification metric. Their proposed approach depends strongly on the assumption of a linear (or kernelized) classifier, and it is not obvious that the procedure can be extended to more complex non-linear function classes.

3. The Bayes Optimal Classifier Revisited

In this section, we characterize the Bayes optimal classifier for the broad class of Karmic performance measures that satisfy Assumption 1. We then show that with one additional assumption, which we call threshold-quasi-concavity, the optimal threshold can be guaranteed to be unique. This result will be crucial for the design and analysis of our computationally efficient threshold-finding procedure in Section 4.

Denote by $\mu$ the measure corresponding to the marginal distribution of $X$. The utility is Fréchet differentiable, and the Fréchet derivative of $U$ may be computed as
$$[\nabla U(f)]_x = \nabla G(C(f))^T [\nabla C(f)]_x = \frac{1}{2}\left( \nabla G(C)^T (1, -1, -1, 1)^T\, \eta(x) - \nabla G(C)^T (0, -1, 0, 1)^T \right) d\mu(x).$$
From the Karmic measure Assumption 1, we know that $\nabla G(C)^T (1, -1, -1, 1)^T > 0$. We define the Bayes critical set of $U(f, \mathbb{P})$ for any classifier $f$ as the set of instances where the utility has zero derivative:
$$A_3(f) = \left\{ x : \eta(x) = \frac{\nabla G(C)^T (0, -1, 0, 1)^T}{\nabla G(C)^T (1, -1, -1, 1)^T} \right\}.$$
For notational simplicity, we have omitted the dependency of the gradient components $g_i$ on $C(f)$. Similarly, we will use $A_3^* := A_3(f^*)$ to denote the Bayes critical set. In this paper we focus on distributions where the critical set satisfies $\mathbb{P}(A_3(f)) = 0$.
For instance, this is true for any distribution that satisfies the following assumption.

Assumption 2 (η-continuity). Let $\nu$ denote the probability measure associated with the random variable $Z = \eta(X) = \mathbb{P}(Y = 1 \mid X)$; then $\nu$ is absolutely continuous with respect to $\mu$. Furthermore, the density of $\eta(X)$, denoted by $p_\eta(\cdot)$, has full support on $[0, 1]$ and is bounded everywhere.

Absolute continuity guarantees the existence of the density of $Z$. Armed with the above assumption on the conditional probability of the response, we can then characterize the Bayes optimal classifier as follows.

Theorem 3.1 (Bayes Optimal Classifier as a Thresholding Function). Suppose that $U$ is a performance measure that satisfies Assumption 1, and that $\eta(X)$ satisfies Assumption 2. Let $f^*$ be the Bayes classifier with respect to $U$ and $C^*$ be its confusion matrix. Then, for all $x \in (A_3^*)^c$,
$$f^*(x) = \mathrm{sign}\left( \eta(x) - \frac{\nabla G(C^*)^T (0, -1, 0, 1)^T}{\nabla G(C^*)^T (1, -1, -1, 1)^T} \right). \qquad (3)$$

Threshold of the Bayes optimal classifier. For some performance measures, the optimal threshold reduces to an absolute constant; for instance, it has the value 1/2 for the accuracy measure $U(f, \mathbb{P}) = \mathrm{TP} + \mathrm{TN}$ (see e.g. (Devroye et al., 2013)). In the general case, however, the optimal threshold $\delta^*$ is a solution of the fixed point equation
$$\frac{\nabla G(C^*)^T (0, -1, 0, 1)^T}{\nabla G(C^*)^T (1, -1, -1, 1)^T} = \delta^*,$$
which is a fixed point equation due to the dependency of $C^*$ on the threshold $\delta^*$. Theorem 3.1 guarantees the existence of a solution to the above fixed point equation, but not its uniqueness. As we will show in Section 5, uniqueness can be achieved with some additional regularity assumptions.

We note that Theorem 3.1 only imposes a weak Karmic assumption on the performance measure, which, as stated in Section 2, is more general than even a simple strict monotonicity assumption. In particular, it generalizes prior work such as (Koyejo et al., 2014; Menon et al., 2013), which imposes more stringent assumptions (linear or linear-fractional form of the measures, or strong monotonicity conditions).

We next briefly discuss why the critical set is crucial. Consider for instance the example studied in Narasimhan et al. (2014): with domain $\mathcal{X} = \{x_1, x_2, x_3\}$, a corresponding probability mass function $(0.25, 0.5, 0.25)$, and the conditional probability $\eta = (0.49, 0.5, 0.51)$. Narasimhan et al. (2014) show that for this setting, and for the case of the H-mean measure, there exist at least two deterministic Bayes optima, $(-1, 1, -1)$ and $(1, -1, 1)$, which can be seen to not have a thresholded form, i.e., they cannot be expressed as a (signed) thresholding of the conditional probability. Our analysis reveals why this is the case. From the threshold expression in (3) from Theorem 3.1, the optimal threshold can be computed explicitly as $\frac{\nabla G(C^*)^T (0, -1, 0, 1)^T}{\nabla G(C^*)^T (1, -1, -1, 1)^T} = \frac{1}{2}$. Thus, the Bayes critical set $A_3^* = \{x : \eta(x) = \frac{1}{2}\} = \{x_2\}$ has measure $\mathbb{P}(X \in A_3^*) = \mathbb{P}(X = x_2) = \frac{1}{2} > 0$. It is clear that the Bayes optimal classifier may not take a thresholded form on the Bayes critical set.
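The example above is easy to check numerically. The short Python sketch below (our own illustration, not from the paper) enumerates all eight deterministic classifiers on $\{x_1, x_2, x_3\}$ and computes the H-mean of each; the two maximizers it finds label $x_2$ differently from $x_1$ and $x_3$, so neither is a thresholding of $\eta$.

```python
from itertools import product

# Distribution from the example: P(x_i) and eta(x_i) = P(Y = 1 | X = x_i).
px  = [0.25, 0.5, 0.25]
eta = [0.49, 0.5, 0.51]
pos = sum(p * e for p, e in zip(px, eta))        # P(Y = +1) = 0.5
neg = 1.0 - pos                                  # P(Y = -1) = 0.5

def hmean(f):
    """H-mean 2 / (1/TPR + 1/TNR) of a deterministic classifier f in {-1, +1}^3."""
    tp = sum(p * e for p, e, y in zip(px, eta, f) if y == +1)
    tn = sum(p * (1 - e) for p, e, y in zip(px, eta, f) if y == -1)
    tpr, tnr = tp / pos, tn / neg
    return 0.0 if tpr == 0 or tnr == 0 else 2.0 / (1.0 / tpr + 1.0 / tnr)

scores = {f: hmean(f) for f in product([-1, +1], repeat=3)}
best = max(scores.values())
print([f for f, s in scores.items() if abs(s - best) < 1e-12])
# prints [(-1, 1, -1), (1, -1, 1)]: neither labeling is monotone in eta,
# so neither H-mean optimum is a thresholding of the conditional probability.
```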
3.1. Uniqueness of the Bayes Optimal Threshold

We are interested in characterizing mild conditions on the performance measure under which the fixed point equation characterizing the Bayes optimal threshold has a unique solution, in which case $\mathbb{P}(A_3^*) = 0$ (guaranteed by the η-continuity Assumption 2). The performance measure restricted to classifiers that are threshold functions of the conditional probability can be rewritten as a function of the conditional probability $\eta$ and the threshold $\delta$.

Definition 3.1. We define $V_\eta(\delta, \mathbb{P}) := U(f_{\eta,\delta}, \mathbb{P})$ as the performance measure of the threshold classifier $f_{\eta,\delta}(x) = \mathrm{sign}(\eta(x) - \delta)$. Its arguments are the threshold $\delta$ and the distribution $\mathbb{P}$, while the subscript $\eta$ notes its dependence on the conditional probability $\eta$.

We next introduce the definition of quasi-concavity, and the assumption that $V$ is strictly quasi-concave.

Definition 3.2. A function $f: \mathcal{X} \to \mathbb{R}$ is said to be quasi-concave if for all $x, y \in \mathcal{X}$ such that $f(x) \leq f(y)$, it follows that $\langle \nabla f(x), y - x \rangle \geq 0$. We further say that $f$ is strictly quasi-concave if it is quasi-concave and its gradient only vanishes at the global optimum, i.e., $f(y) < \max_{x \in \mathcal{X}} f(x) \implies \|\nabla f(y)\| > 0$.

Quasi-concave functions have convex super-level sets; strictly quasi-concave functions are moreover unimodal, i.e., they have a unique maximal point.

Assumption 3 (Threshold-Quasi-Concavity). The threshold-classifier performance measure $V_\eta(\delta, \mathbb{P})$ is strictly quasi-concave for $\delta \in [0, 1]$.

Assumption 3 seems abstract, but it entails that the performance measure is well-behaved as a function of the threshold. Moreover, it can easily be shown to hold for performance measures in practical use. We provide a proposition showing that the assumption is satisfied for two important classes of performance measures: linear-fractional functions and concave functions.

Proposition 3.1. Suppose Assumptions 1 and 2 hold, and either (a) $G$ is twice continuously differentiable and concave, or (b) $G$ is a linear-fractional function $G(C) = \frac{a^T C}{b^T C}$ with $|b^T C| > 0$. Then $V_\eta(\delta, \mathbb{P})$ is strictly quasi-concave.

Theorem 3.2. Under Assumption 3, the fixed-point equation
$$\delta = \frac{\nabla G(C(f_\delta), \mathbb{P})^T (0, -1, 0, 1)^T}{\nabla G(C(f_\delta), \mathbb{P})^T (1, -1, -1, 1)^T}, \qquad (4)$$
where $f_\delta(x) = \mathrm{sign}(\eta(x) - \delta)$, has a unique fixed point $\delta^* \in (0, 1)$. Hence the threshold in Theorem 3.1 is uniquely defined.

Theorems 3.1 and 3.2 have two key consequences: first, we can use the representation to design plug-in estimators of the Bayes optimal classifier; second, it facilitates the statistical analysis of rates of convergence. We discuss each of these two consequences in the following sections.

4. Algorithmic Consequence: Estimation of the Threshold

Theorem 3.1 shows that for Karmic performance measures, the Bayes optimal classifier has the thresholded form as in Eq. (3), and moreover, under the threshold-quasi-concavity Assumption 3, this threshold is unique. An immediate algorithmic consequence is to focus on plug-in classifiers that separately estimate the conditional probability and the threshold. We present this plug-in classifier template in Algorithm 1. The template needs (a) an estimator for the conditional probability function $\eta(x)$, and (b) an estimator for the threshold. For the convenience of analysis, we divide the set of samples into two independent subsets: the conditional probability estimator is estimated using one subset, and the threshold is estimated using the other. In the coming subsections we discuss how to estimate the conditional probability and the threshold, respectively.

Algorithm 1 Two-step Plug-in Classifier for General Metrics
1: Input: Training sample $\{X_i, Y_i\}_{i=1}^n$, utility measure $U$, conditional probability estimator $\hat\eta$, stepsize $\alpha$.
2: Randomly split the training sample into two subsets $\{X_i^{(1)}, Y_i^{(1)}\}_{i=1}^{n_1}$ and $\{X_i^{(2)}, Y_i^{(2)}\}_{i=1}^{n_2}$;
3: Estimate $\hat\eta$ on $\{X_i^{(1)}, Y_i^{(1)}\}_{i=1}^{n_1}$;
4: Estimate $\hat\delta$ with $\{X_i^{(2)}, Y_i^{(2)}\}_{i=1}^{n_2}$;
5: Output: $\hat f(x) = \mathrm{sign}(\hat\eta(x) - \hat\delta)$.
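A minimal Python sketch of the two-step template in Algorithm 1 is given below. It assumes scikit-learn's `LogisticRegression` as one possible conditional probability estimator and expects the caller to supply a `find_threshold` routine (for instance the binary search of Algorithm 2 in Section 4.2 below); the names and structure are ours, for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def two_step_plugin(X, y, utility, find_threshold, seed=0):
    """Algorithm 1 (sketch): split the data, fit eta-hat on one half,
    estimate the threshold on the other half, and return a plug-in classifier.
    `utility` maps a confusion vector [TP, FP, FN, TN] to a score;
    `find_threshold(eta_hat_values, y, utility)` returns delta-hat."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    i1, i2 = idx[:half], idx[half:]

    # Step (a): conditional probability estimate on the first split (labels in {-1, +1}).
    model = LogisticRegression().fit(X[i1], y[i1])
    eta_hat = lambda X_new: model.predict_proba(X_new)[:, 1]

    # Step (b): threshold estimate on the second split.
    delta_hat = find_threshold(eta_hat(X[i2]), y[i2], utility)

    # Plug-in classifier f_hat(x) = sign(eta_hat(x) - delta_hat).
    return lambda X_new: np.where(eta_hat(X_new) > delta_hat, 1, -1)
```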
4.1. Estimation of the Conditional Probability Function

The estimation of the conditional probability of the response plays a crucial role in the success of Algorithm 1, but we emphasize that it is not the focus of our paper. In particular, this is a well-studied problem, and numerous methods have been proposed for both parametric and non-parametric model assumptions on the conditional probability function. In this section we briefly discuss some common estimators, and defer additional details to Section 6.

Parametric methods. In a classical paper, Ng and Jordan (2002) compare two approaches to classification: one can either estimate $P(Y)$ and $P(X \mid Y)$ first and then obtain the conditional probability by Bayes' rule (the generative approach), or directly estimate $P(Y \mid X)$ (the discriminative approach). The two approaches can also be related. In particular, if $P_{\theta_Y}(X \mid Y)$ belongs to an exponential family, we have
$$P_{\theta_Y}(X \mid Y) = h(X) \exp\left( \langle \theta_Y, \phi(X) \rangle - A(\theta_Y) \right),$$
where $\phi(X)$ is the set of sufficient statistics, $\theta_Y$ is the vector of the true canonical parameters, and $A(\theta)$ is the log-partition function. Using Bayes' rule, we then have
$$\mathbb{P}(Y = 1 \mid X) = \frac{1}{1 + \exp\left( -\langle \theta_1 - \theta_0, \phi(X) \rangle + c \right)},$$
where $c = A(\theta_0) - A(\theta_1)$. The conditional distribution can be seen to follow a logistic regression model, with the generative model sufficient statistics as the features, and the difference of the generative model parameters serving as the parameters of the discriminative model. A natural class of estimators for either the generative or discriminative models is based on maximum likelihood estimation (MLE). In Section 6, we derive the rate of convergence for the special case where the generative distributions are Gaussians with the same covariance for both classes.

Non-parametric methods. One can also estimate $\eta(x) = \mathbb{P}(Y = 1 \mid X = x)$ non-parametrically, where a common model assumption is some form of smoothness on $\eta(x)$. One popular class of smooth functions is the following.

Definition 4.1 (β-Hölder class). Let $\beta > 0$, and denote by $\lfloor \beta \rfloor$ the maximal integer that is strictly less than $\beta$. For $x \in \mathcal{X}$ and any $\lfloor \beta \rfloor$-times continuously differentiable real-valued function $\eta$ on $\mathcal{X}$, denote by $\eta_x$ its Taylor polynomial of degree $\lfloor \beta \rfloor$ at the point $x$. The β-Hölder class is defined as the set of functions that satisfy, for all $x, x' \in \mathcal{X}$, $|\eta(x') - \eta_x(x')| \leq C_\beta \|x - x'\|^\beta$. In particular, when $0 < \beta \leq 1$, this reduces to $|\eta(x) - \eta(x')| \leq C_\beta \|x - x'\|^\beta$.

We can then estimate $\eta(x)$ from this family of smooth functions via local polynomial estimators (Audibert et al., 2007), or kernel (conditional) density estimators (Jiang, 2017) with a properly chosen bandwidth.

4.2. Estimation of the Threshold

When $V_\eta$ is quasi-concave, a key consequence is that its gradient with respect to the threshold suffices to provide ascent direction information. We leverage this consequence and summarize a simple binary search algorithm based on the sign of $V'_\eta(\delta, \mathbb{P})$ in Algorithm 2.

Algorithm 2 Binary search for the optimal threshold
1: Input: Training sample $\{X_i, Y_i\}_{i=1}^n$, utility measure $U$, conditional probability estimator $\hat\eta$, tolerance $\epsilon_0$.
2: $\delta_\ell = 0$; $\delta_r = 1$;
3: while $|\delta_\ell - \delta_r| \geq \epsilon_0$ do
4:   Evaluate $s = \mathrm{sign}\big( V'_{\hat\eta}\big(\tfrac{\delta_\ell + \delta_r}{2}, \mathbb{P}_n\big) \big)$;
5:   if $s \geq 0$ then
6:     $\delta_\ell = \frac{\delta_\ell + \delta_r}{2}$;
7:   else
8:     $\delta_r = \frac{\delta_\ell + \delta_r}{2}$;
9:   end if
10: end while
11: Output: $\frac{\delta_\ell + \delta_r}{2}$
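Below is a hedged Python sketch of the binary search in Algorithm 2, matching the `find_threshold` signature assumed in the Algorithm 1 sketch above. The paper evaluates the sign of the derivative $V'_{\hat\eta}(\delta, \mathbb{P}_n)$; as a simple stand-in, this sketch approximates that sign by a symmetric finite difference of the empirical utility, which is our own simplification rather than the paper's exact procedure.

```python
import numpy as np

def empirical_utility(delta, eta_hat_values, y, utility):
    """V_eta-hat(delta, P_n): utility of sign(eta_hat - delta) on the empirical confusion vector."""
    pred = np.where(eta_hat_values > delta, 1, -1)
    tp = np.mean((y == 1) & (pred == 1))
    fp = np.mean((y == -1) & (pred == 1))
    fn = np.mean((y == 1) & (pred == -1))
    tn = np.mean((y == -1) & (pred == -1))
    return utility(np.array([tp, fp, fn, tn]))

def find_threshold(eta_hat_values, y, utility, tol=1e-3, h=1e-2):
    """Algorithm 2 (sketch): bisection on [0, 1] guided by the sign of an ascent
    direction at the midpoint; here the sign comes from a finite difference."""
    lo, hi = 0.0, 1.0
    while hi - lo >= tol:
        mid = (lo + hi) / 2.0
        slope = (empirical_utility(mid + h, eta_hat_values, y, utility)
                 - empirical_utility(mid - h, eta_hat_values, y, utility))
        if slope >= 0:
            lo = mid       # utility still increasing: move the left endpoint up
        else:
            hi = mid       # utility decreasing: move the right endpoint down
    return (lo + hi) / 2.0
```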
In the next section, we analyze the rates of convergence for the excess generalization error of the plug-in classifier learned from Algorithm 1, with the threshold estimated via Algorithm 2.

5. Statistical Analysis Consequence: Rates of Convergence

We next analyze the convergence rate of the excess utility. As we will show, the rates of convergence depend on three quantities: the noise level of the data distribution, the convergence rate of the conditional probability function, and the convergence rate of the threshold. We start by introducing some assumptions. We assume that the estimator of the conditional probability of the response satisfies the following condition.

Assumption 4. Let $S_n$ denote a sample set of size $n$, and $\eta_{S_n}$ denote the conditional probability estimator learnt from $S_n$. Then, for some absolute constants $c_1, c_2 > 0$, the conditional probability estimator satisfies
$$\sup_{S_n} \mathbb{P}\big( |\eta_{S_n}(x) - \eta(x)| \geq \epsilon \big) \leq c_1 \exp(-c_2\, a_n\, \epsilon^2) \quad \text{a.e.}$$

The convergence rate also depends on the noise in the training labels, which is typically captured via the probability mass near the Bayes optimal threshold. Here we generalize the classical margin assumption (sometimes also called the low-noise assumption) of Audibert et al. (2007), developed for the accuracy metric, to the case where the optimal threshold is not the fixed constant 1/2.

Assumption 5. For some function $C_0(\delta^*) > 0$ that depends on the threshold $\delta^*$, there exists an $\alpha \geq 0$ such that
$$\mathbb{P}_X\big( 0 < |\eta(X) - \delta^*| \leq t \big) \leq C_0(\delta^*)\, t^\alpha.$$

The assumption characterizes the behavior of the regression function in the vicinity of the optimal threshold $\delta^*$. The case $\alpha = 0$ bounds the probability by a constant potentially larger than one, and is trivially satisfied. The other extreme case $\alpha = \infty$ is most advantageous for classification, since in this case the regression function $\eta$ is bounded away from the threshold.

In cases where the threshold is not an absolute constant (such as 1/2), it has to be estimated from data. We make the following assumption on its convergence rate.

Assumption 6. Given a conditional probability estimate $\hat\eta$ learned from an independent data source, the estimator $\hat\delta_n$ of the threshold, from a sample set of size $n$, satisfies the following condition for some absolute constant $c_3 > 0$:
$$\mathbb{P}\Big( U(\mathrm{sign}(\hat\eta - \delta^*)) - U(\mathrm{sign}(\hat\eta - \hat\delta)) \geq b_n^{-1} \Big) \leq n^{-c_3}.$$

Note that Assumption 6 allows the rate $b_n$ to in turn depend on $\hat\eta$, or more specifically, on $\mathbb{E}_X|\eta(X) - \hat\eta(X)|$. Moreover, it does not necessarily require that $\hat\delta$ converge to $\delta^*$, only that their corresponding utility values be close.

Armed with these largely notational assumptions, we can now provide the rate for the overall data-splitting two-step plug-in classifier described in Algorithm 1:

Theorem 5.1. Suppose Assumptions 1 and 2 hold, and further that Assumptions 4 and 6 hold for some $\hat\eta$ and $\hat\delta$. Let $U^* = U(\mathrm{sign}(\eta - \delta^*), \mathbb{P})$ be the Bayes optimal utility. If we split the data as $n_1 = n_2 = \frac{n}{2}$, then with probability greater than $1 - n^{-c_4}$,
$$U^* - U\big(\mathrm{sign}(\hat\eta - \hat\delta), \mathbb{P}\big) \leq c_5 \max\Big\{ a_n^{-\frac{1+\alpha}{2}},\ b_n^{-1} \Big\},$$
where $c_4, c_5 > 0$ are absolute constants.
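To see how the two terms in Theorem 5.1 interact, the following worked example (ours, anticipating Section 6) plugs in the values obtained later for the Gaussian generative model, namely $a_n = n$ (Lemma 6.1), $\alpha = 1$ (Lemma 6.2), and $b_n = \tilde n / \log n$ with $\tilde n = \min\{n, a_n\}$ (Lemma 5.3):
$$U^* - U\big(\mathrm{sign}(\hat\eta - \hat\delta), \mathbb{P}\big) \;\lesssim\; \max\Big\{ a_n^{-\frac{1+\alpha}{2}},\ b_n^{-1} \Big\} \;=\; \max\Big\{ n^{-1},\ \frac{\log n}{n} \Big\} \;=\; \frac{\log n}{n},$$
which is the $O(\log n / n)$ rate stated later in Corollary 6.1; the extra $\log n$ factor comes from having to estimate the unknown threshold.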
5.1. Key Lemmas

We provide a detailed proof of the theorem in the Appendix, but give brief vignettes via some key lemmas that also provide additional insight into the statistical analysis. A key tool in the analysis of traditional binary classification is to turn the excess risk into an expectation of the absolute deviation of the conditional probability from the threshold 1/2. We show in the following lemma that a similar result holds with a general optimal threshold.

Lemma 5.1. Let $C_n$ and $C^*$ be the vectorized confusion matrices associated with $f_n = \mathrm{sign}(\eta_n - \delta^*)$ and $f^* = \mathrm{sign}(\eta - \delta^*)$ respectively, where $\delta^*$ is the threshold for the Bayes optimal classifier. Denote $C_G := \nabla G(C^*)^T (1, -1, -1, 1)$ and $C_H := \max_f \|\nabla^2 G(C(f))\|_{op}$, where $\|\cdot\|_{op}$ refers to the operator norm of a matrix. If, for some constant $c_6$, $a_n \geq c_6 \big( \frac{C_H}{C_G \min\{\delta^*, 1 - \delta^*\}} \big)^2$, then
$$G(C^*) - G(C_n) \geq \tfrac{1}{2} C_G\, \mathbb{E}\big[ |\eta - \delta^*|\, \mathbf{1}(f_n \neq f^*) \big], \qquad G(C^*) - G(C_n) \leq \tfrac{3}{2} C_G\, \mathbb{E}\big[ |\eta - \delta^*|\, \mathbf{1}(f_n \neq f^*) \big].$$

This lemma helps us control the excess utility via the error $\hat\eta - \eta$ of the conditional probability estimator. Armed with this result, and additionally using Assumption 4 on the convergence rate of the conditional probability estimator, we can then show that the excess utility converges at the rate $O\big(a_n^{-\frac{1+\alpha}{2}}\big)$.

Lemma 5.2. Suppose that Assumptions 4 and 5 are satisfied, and that the Bayes optimal classifier is $f^* = \mathrm{sign}(\eta - \delta^*)$. Then there exists a constant $c_7 > 0$, which depends on $G$ and $C(f^*)$, such that
$$U(\mathrm{sign}(\eta - \delta^*)) - U(\mathrm{sign}(\eta_n - \delta^*)) \leq c_7\, a_n^{-\frac{1+\alpha}{2}}.$$

Lemma 5.2 describes the classification error rate when the optimal threshold is known. Stitching this together with Assumption 6 on the convergence rate of the threshold estimator can then be shown to yield the statement of Theorem 5.1.

5.2. Risk Bound for the Plug-in Classifier from Algorithm 2

Prior work on threshold estimation for plug-in classifiers has ranged over brute-force search (Koyejo et al., 2014) with no rates of convergence, level-set based methods (Parambath et al., 2015) for the specific class of linear-fractional metrics, and Frank-Wolfe based methods (Narasimhan et al., 2015b) for the specific class of concave performance metrics. However, these estimators, in addition to focusing on specific performance metrics, are only able to achieve a convergence rate of $O(\max\{\mathbb{E}\|\hat\eta(X) - \eta(X)\|_1,\ 1/\sqrt{n}\})$. This entails that even when the conditional probability estimator has a fast convergence rate, the final convergence rate for these estimators will still be bounded by $O(1/\sqrt{n})$. In this section we show that our simple threshold search procedure in Algorithm 2 achieves a fast $O(1/n)$ or $O(1/a_n)$ rate of convergence by leveraging our analysis from Section 3.

Lemma 5.3. Assume that Assumptions 1, 2, and 5 hold, and that the confusion-matrix function $G$ corresponding to the performance measure $U$ satisfies the same conditions as in Proposition 3.1. Let $\hat\delta$ denote the output of Algorithm 2 with sample size $n$ and tolerance $\tau = \frac{\log n}{n}$, and let $\hat\eta$ denote a conditional probability estimator satisfying Assumption 4 obtained on an independent sample set of size $n$. Denoting $\tilde n = \min\{n, a_n\}$, the rate $b_n$ in Assumption 6 then satisfies $b_n = \frac{\tilde n}{\log n}$.

An immediate corollary gives the excess risk for the plug-in classifier.

Corollary 5.1. Suppose Assumption 3 holds. If $\tau = \frac{\log n}{n}$ and $n_1 = n_2 = \frac{n}{2}$, then there exist constants $c_8, c_9 > 0$ such that, with probability at least $1 - \min\{n, a_n\}^{-c_8}$,
$$U(f^*, \mathbb{P}) - U(\hat f, \mathbb{P}) \leq c_9 \max\Big\{ \frac{\log n}{\min\{n, a_n\}},\ a_n^{-\frac{1+\alpha}{2}} \Big\}.$$

6. Explicit Rates for Specific Conditional Probability Models

In this section, we analyze two special cases where we can obtain explicit rates of convergence for the conditional probability estimation. For the first example, we consider the Gaussian generative model.
We will show that the rate of convergence for the excess utility obtained in Theorem 5.1 is $O\big(\frac{\log n}{n}\big)$ in this case. The second example concerns nonparametric kernel estimators when the conditional probability function satisfies a certain smoothness assumption.

6.1. Gaussian Generative Model

Consider two Gaussian class-conditional distributions with the same covariance; without loss of generality, we assume the covariance matrix is the identity $I_d$ for both classes. We define an asymmetric mixture of two Gaussians indexed by the centers and mixing weights:
$$\mathbb{P}_{\mu,\kappa}:\quad \mathbb{P}(Y = 1) = \kappa,\quad \mathbb{P}(Y = 0) = 1 - \kappa,\quad X \mid Y = 1 \sim N\big(\tfrac{\mu}{2}, I_d\big),\quad X \mid Y = 0 \sim N\big(-\tfrac{\mu}{2}, I_d\big). \qquad (5)$$
As stated in Section 4, we can compute the conditional probability and show that it can be fitted with a logistic regression model. Next we present results related to the key quantities in Theorem 5.1: $a_n$ and $\alpha$.

Lemma 6.1. The model defined in Eq. (5), with the maximum likelihood estimator, satisfies Assumption 4 with $a_n = n$.

The following lemma specifies the margin assumption parameter for the above model.

Lemma 6.2. The Gaussian generative model defined as in Eq. (5) satisfies Assumption 5 with $\alpha = 1$.

Combining this result with Theorem 5.1 gives us the following corollary.

Corollary 6.1. Assume Assumptions 1-5 hold and $\mathbb{P}$ is generated from Eq. (5). Let $\hat f$ be the output of Algorithm 1 with $\hat\eta$ estimated by the MLE of logistic regression. Then, with probability tending to 1,
$$U(f^*, \mathbb{P}) - U(\hat f, \mathbb{P}) = O\Big( \frac{\log n}{n} \Big).$$

For Gaussian generative models, fast rates of $O(\frac{1}{n})$ are only known for the 0-1 loss (Li et al., 2015). The logarithm factor can be further removed under the 0-1 loss, or in other cases where the threshold is known, as one can apply Lemma 5.2 with $\alpha = 1$ and obtain exactly the same rate as in Li et al. (2015). Corollary 6.1 generalizes this result to a much broader class of utility functions, where the threshold is unknown and estimated from data.

6.2. β-Hölder Densities

When the conditional probability function belongs to the β-Hölder class as defined in Definition 4.1, we have the following lemma on the convergence rate of $\eta_n$.

Lemma 6.3. For β-Hölder conditional probability functions, there exist estimators such that Assumption 4 holds with $a_n = n^{\frac{2\beta}{2\beta + d}}$.

Examples of such estimators include local polynomial estimators (Audibert et al., 2007) and kernel (conditional) density estimators (Jiang, 2017). Combined with Theorem 5.1, we have the following corollary.

Corollary 6.2. Assume Assumptions 1-5 hold and let $\mathbb{P}$ be a distribution where $\mathbb{P}(Y = 1 \mid X)$ belongs to the β-Hölder class. With local polynomial estimators (Audibert et al., 2007) or kernel (conditional) density estimators (Jiang, 2017), we have
$$U(f^*, \mathbb{P}) - U(\hat f, \mathbb{P}) = O\Big( n^{-\frac{(\min\{\alpha, 1\} + 1)\beta}{2\beta + d}} \Big).$$

The convergence rate obtained in Corollary 6.2 is faster than $O(\frac{1}{\sqrt{n}})$ if $\beta > \max\{\frac{d}{2\alpha}, \frac{d}{2}\}$. It is worth pointing out that the fast rate is obtained via a trade-off between the parameters $\alpha$ and $\beta$: for the conditional probability function $\eta$ to be very smooth, i.e., to have a large value of $\beta$, it cannot deviate from the critical level very abruptly, and hence $\alpha$ has to be small.
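As a usage illustration tying Sections 4 and 6.1 together, the sketch below simulates data from the Gaussian generative model in Eq. (5), runs the two-step plug-in procedure with logistic regression and the binary-search threshold (reusing the `two_step_plugin`, `find_threshold`, and `confusion_vector` sketches from earlier sections), and evaluates the resulting $F_1$ utility on held-out data. The parameter choices and names are our own illustration, not an experiment from the paper.

```python
import numpy as np

def sample_gaussian_model(n, mu, kappa=0.3, seed=1):
    """Draw n points from the model in Eq. (5), with labels encoded as {-1, +1}:
    Y = +1 with probability kappa, and X | Y ~ N(+/- mu / 2, I_d)."""
    rng = np.random.default_rng(seed)
    y = np.where(rng.random(n) < kappa, 1, -1)
    X = rng.normal(size=(n, len(mu))) + np.outer(y, mu / 2.0)
    return X, y

def g_f1(C):                              # F_1 row of Table 1 (beta = 1)
    tp, fp, fn, tn = C
    return 2 * tp / (2 * tp + fn + fp)

mu = np.array([1.0, -0.5, 0.25])
X_train, y_train = sample_gaussian_model(4000, mu, seed=1)
X_test, y_test = sample_gaussian_model(4000, mu, seed=2)

f_hat = two_step_plugin(X_train, y_train, g_f1, find_threshold)   # Algorithm 1 sketch
C_test = confusion_vector(y_test, f_hat(X_test))                   # helper from Section 2
print("test F_1 utility of the plug-in classifier:", g_f1(C_test))
```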
7. Conclusion

We study Bayes optimal classification for general performance metrics. We derive the form of the Bayes optimal classifier, provide practical algorithms to estimate this Bayes optimal classifier, and provide a novel analysis of classification error with respect to general performance metrics; in particular, we show our estimators are not only consistent but have fast rates of convergence. We also provide corollaries of our general results for some special cases, such as when the inputs are drawn from a Gaussian mixture generative model, or when the conditional probability function lies in a Hölder space, explicitly proving fast rates under mild regularity conditions.

Acknowledgements

P.R. acknowledges the support of NSF via IIS-1149803, IIS-1664720, DMS-1264033, and PNC via the PNC Center for Financial Services Innovation.

References

Jean-Yves Audibert, Alexandre B. Tsybakov, et al. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608–633, 2007.

Róbert Busa-Fekete, Balázs Szörényi, Krzysztof Dembczynski, and Eyke Hüllermeier. Online F-measure optimization. In Advances in Neural Information Processing Systems, pages 595–603, 2015.

Sophia Daskalaki, Ioannis Kopanas, and Nikolaos Avouris. Evaluation of classifiers for an uneven class distribution problem. Applied Artificial Intelligence, 20(5):381–417, 2006.

Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.

Qiong Gu, Li Zhu, and Zhihua Cai. Evaluation measures of the classification performance of imbalanced data sets. In International Symposium on Intelligence Computation and Applications, pages 461–471. Springer, 2009.

Martin Jansche. A maximum expected utility framework for binary sequence labeling. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 736–743, 2007.

Heinrich Jiang. Uniform convergence rates for kernel density estimation. In International Conference on Machine Learning, pages 1694–1703, 2017.

Thorsten Joachims. A support vector method for multivariate performance measures. In Proceedings of the 22nd International Conference on Machine Learning, pages 377–384. ACM, 2005.

Purushottam Kar, Harikrishna Narasimhan, and Prateek Jain. Online and stochastic gradient methods for non-decomposable loss functions. In Advances in Neural Information Processing Systems, pages 694–702, 2014.

Kenneth Kennedy, Brian Mac Namee, and Sarah Jane Delany. Learning without default: A study of one-class classification and the low-default portfolio problem. In Artificial Intelligence and Cognitive Science, pages 174–187. Springer, 2009.

Oluwasanmi O. Koyejo, Nagarajan Natarajan, Pradeep K. Ravikumar, and Inderjit S. Dhillon. Consistent binary classification with generalized performance metrics. In Advances in Neural Information Processing Systems, pages 2744–2752, 2014.

Miroslav Kubat, Stan Matwin, et al. Addressing the curse of imbalanced training sets: one-sided selection. In ICML, volume 97, pages 179–186, Nashville, USA, 1997.

David D. Lewis. Evaluating and optimizing autonomous text classification systems. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 246–254. ACM, 1995.

Tianyang Li, Adarsh Prasad, and Pradeep K. Ravikumar. Fast classification rates for high-dimensional Gaussian generative models. In Advances in Neural Information Processing Systems, pages 1054–1062, 2015.

Enno Mammen, Alexandre B. Tsybakov, et al. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.

Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, et al. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge, 2008.

Aditya Menon, Harikrishna Narasimhan, Shivani Agarwal, and Sanjay Chawla. On the statistical consistency of algorithms for binary classification under class imbalance. In International Conference on Machine Learning, pages 603–611, 2013.
Ye Nan, Kian Ming Chai, Wee Sun Lee, and Hai Leong Chieu. Optimizing F-measure: A tale of two approaches. arXiv preprint arXiv:1206.4625, 2012.

Harikrishna Narasimhan, Rohit Vaish, and Shivani Agarwal. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In Advances in Neural Information Processing Systems, pages 1493–1501, 2014.

Harikrishna Narasimhan, Purushottam Kar, and Prateek Jain. Optimizing non-decomposable performance measures: A tale of two classes. In 32nd International Conference on Machine Learning (ICML), 2015a.

Harikrishna Narasimhan, Harish Ramaswamy, Aadirupa Saha, and Shivani Agarwal. Consistent multiclass algorithms for complex performance measures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2398–2407, 2015b.

Andrew Y. Ng and Michael I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems, pages 841–848, 2002.

Shameem A. Puthiya Parambath, Nicolas Usunier, and Yves Grandvalet. Theory of optimizing pseudolinear performance measures: Application to F-measure. arXiv preprint arXiv:1505.00199, 2015.

Wolfgang Polonik. Measuring mass concentrations and estimating density contour clusters – an excess mass approach. The Annals of Statistics, pages 855–881, 1995.

Cornelis Joost Van Rijsbergen. Foundation of evaluation. Journal of Documentation, 30(4):365–373, 1974.

Byron C. Wallace, Kevin Small, Carla E. Brodley, and Thomas A. Trikalinos. Class imbalance, redux. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pages 754–763. IEEE, 2011.

William J. Youden. Index for rating diagnostic tests. Cancer, 3(1):32–35, 1950.

Ming-Jie Zhao, Narayanan Edakunni, Adam Pocock, and Gavin Brown. Beyond Fano's inequality: bounds on the optimal F-score, BER, and cost-sensitive risk and their implications. The Journal of Machine Learning Research, 14(1):1033–1090, 2013.