Binary Classification with Karmic, Threshold-Quasi-Concave Metrics

Bowei Yan 1, Oluwasanmi Koyejo 2, Kai Zhong 1, Pradeep Ravikumar 3

1 University of Texas at Austin, Austin, Texas, USA; 2 University of Illinois at Urbana-Champaign, Champaign, Illinois, USA; 3 Carnegie Mellon University, Pittsburgh, Pennsylvania, USA. Correspondence to: Bowei Yan.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

Complex performance measures, beyond the popular measure of accuracy, are increasingly being used in the context of binary classification. These complex performance measures are typically not even decomposable, that is, the loss evaluated on a batch of samples cannot typically be expressed as a sum or average of losses evaluated at individual samples, which in turn requires new theoretical and methodological developments beyond standard treatments of supervised learning. In this paper, we advance this understanding of binary classification for complex performance measures by identifying two key properties: a so-called Karmic property, and a more technical threshold-quasi-concavity property, which we show is milder than existing structural assumptions imposed on performance measures. Under these properties, we show that the Bayes optimal classifier is a threshold function of the conditional probability of the positive class. We then leverage this result to come up with a computationally practical plug-in classifier, via a novel threshold estimator, and further provide a novel statistical analysis of classification error with respect to complex performance measures.

1. Introduction

Binary classification, with the goal of predicting a binary response given input features, is perhaps the classical problem in machine learning, with wide-ranging applications. A key ingredient in binary classification is a performance measure that quantifies how well a given classifier fits the data. While the performance measure of accuracy has been the predominant focus of both theory and practice, it has severe limitations in many practical settings, such as imbalanced classes, and settings where different types of errors made by the classifier have different costs (Gu et al., 2009; Wallace et al., 2011). Accordingly, practitioners in applied machine learning settings such as information retrieval and medical diagnosis have developed complex performance metrics that capture important trade-offs between different types of errors; we have collated a few instances in Table 1.

A key complication with many complex classification performance metrics, such as the F-measure (Manning et al., 2008) and the Harmonic Mean (Kennedy et al., 2009), is that they cannot be decomposed into the sum or average of individual losses on each sample. Even the simple performance measure of precision (the fraction of correct positive predictions among the set of positive predictions) is not a sum of individual losses on each sample. Thus, the standard theoretical and practical treatments of supervised learning, such as standard empirical risk minimization that minimizes the empirical expectation of a loss evaluated on a single random example, are not applicable. This practical reality has motivated research into effective and efficient algorithms tailored to complex non-decomposable performance measures.
One class of approaches extends standard empirical risk minimization to this non-decomposable setting, but often relies on strong assumptions on either the form of the classifiers, such as requiring linear classifiers (Narasimhan et al., 2015a), or is restricted to specific performance measures such as the F-measure (Parambath et al., 2015). An alternative approach is the plug-in estimator, where we first derive the form of the Bayes optimal classifier, estimate the statistical quantities associated with the Bayes optimal classifier, and finally plug in the sample estimates of the population quantities to obtain the overall estimate of the Bayes optimal classifier. In particular, for many complex performance metrics, the Bayes optimal classifier is simply a thresholding of the conditional probability of the positive class (Koyejo et al., 2014; Narasimhan et al., 2014), so that the plug-in estimator requires (a) an estimate of the conditional probability, and (b) the associated threshold. Plug-in methods have been of particular interest in non-parametric functional estimation as they typically require weaker assumptions on the function class and are often easy to implement.

In this paper, we seek to advance our understanding and practice of binary classification under complex non-decomposable performance measures. We show that for a very broad class of performance measures, encompassing a large set of performance measures used in practice, the Bayes optimal classifier is simply a thresholding of the conditional probability of the response. Towards this general result, we identify two key properties that a performance measure could satisfy. The first is what we call a Karmic property, which loosely requires the performance measure to be sensitive to an increase in true positives and true negatives, and a decrease in false positives and false negatives. The second is a more technical property we call threshold-quasi-concavity, which ensures the performance measure is well-behaved around an optimal threshold. As we show, these properties are satisfied by performance metrics used in practice, and in particular, these conditions are milder than existing results that restrict either the structural form of the performance measures, or impose strong shape constraints such as particular monotonicities.

Our general result has two main consequences, which we investigate further: one algorithmic, and the other for the analysis of classification error for general performance measures. As the algorithmic consequence, we leverage the derived form of the Bayes optimal classifier, and some additional general assumptions on the performance measures, to provide a tractable algorithm to estimate the threshold, which, coupled with an estimator of the conditional probability, provides a tractable plug-in estimator of the Bayes optimal classifier. Towards the statistical analysis consequence, we provide an analysis of the excess classification error, with respect to general non-decomposable performance measures, for the general class of plug-in estimators of our class of Bayes optimal classifiers. Our analysis of classification error rates for such plug-in classifiers depends on three natural quantities: the rate of convergence for the conditional probability estimate, the rate of convergence for the threshold estimate, and a measurement of noise in the data.
For the last part, we extend margin or low-noise assumptions for binary classification with the accuracy performance measure to complex performance measures. Low-noise assumptions, proposed by Mammen et al. (1999) in the context of the accuracy performance measure, bound the noise level in the neighborhood of the Bayes optimal threshold, i.e., 1/2 for standard classification. Under such a low-noise assumption, Audibert et al. (2007) derive fast convergence rates for plug-in classification rules based on the smoothness of the conditional probability function. Similar margin assumptions have also been introduced for density level set estimation by Polonik (1995). We provide a natural extension of such a low-noise assumption, under which we provide explicit rates of convergence of classification error with respect to complex performance measures. We provide corollaries for both parametric and non-parametric instances of our general class of plug-in classifiers.

The rest of the paper is organized as follows. In Section 2 we introduce the problem and relevant notation. The characterization and properties of the Bayes optimal classifier are derived in Section 3. We discuss the algorithm for estimating the plug-in estimator in Section 4, and present the statistical convergence guarantee in Section 5. Applications of the derived rate to two special cases, the Gaussian generative model and β-Hölder class conditional probabilities, are presented in Section 6, where explicit convergence rates are provided. We conclude the paper in Section 7. Detailed proofs are deferred to the supplementary materials.

2. Problem Setup and Preliminaries

Binary classification entails predicting a binary label $Y \in \{\pm 1\}$ associated with a feature vector $X \in \mathcal{X} \subseteq \mathbb{R}^d$. A function $f: \mathcal{X} \to \{\pm 1\}$ mapping from the feature space $\mathcal{X}$ to the labels $\{\pm 1\}$ is called a binary classifier. Let $\Theta = \{f: \mathcal{X} \to \{\pm 1\}\}$ denote a set of binary classifiers. We assume $(X, Y)$ has distribution $\mathbb{P} \in \mathcal{P}$, and let $\eta(x) := \mathbb{P}(Y = 1 \mid X = x)$ denote the conditional probability of the label $Y$ given feature vector $x$. A key quantity is the confusion matrix, which consists of four population quantities: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

Definition 2.1 (Confusion Matrix). For any classifier $f: \mathbb{R}^d \to \{\pm 1\}$, its confusion matrix is defined as $C(f, \mathbb{P}) := [\mathrm{TP}(f, \mathbb{P}), \mathrm{FP}(f, \mathbb{P}), \mathrm{FN}(f, \mathbb{P}), \mathrm{TN}(f, \mathbb{P})] \in [0, 1]^4$, where $\mathrm{TP}(f, \mathbb{P}) = \mathbb{P}(Y = +1, f(X) = +1)$, $\mathrm{FP}(f, \mathbb{P}) = \mathbb{P}(Y = -1, f(X) = +1)$, $\mathrm{FN}(f, \mathbb{P}) = \mathbb{P}(Y = +1, f(X) = -1)$, and $\mathrm{TN}(f, \mathbb{P}) = \mathbb{P}(Y = -1, f(X) = -1)$.

Another key ingredient is the utility or performance measure $U: \Theta \times \mathcal{P} \to \mathbb{R}$, which measures the performance of a classifier. In this paper, we focus on complex binary classification performance measures that can be expressed as a function of the confusion matrix. Formally,
$$U(f, \mathbb{P}) = G(C(f, \mathbb{P})). \qquad (1)$$
When it is clear from context, we will drop the dependency on the distribution $\mathbb{P}$ in $C$ and $U$. The confusion-matrix functions $G$ corresponding to popular performance measures are listed in Table 1. Given the performance measure $U$, we are interested in the corresponding Bayes optimal classifier:
$$f^* = \arg\max_{f \in \Theta} U(f, \mathbb{P}). \qquad (2)$$
Given any candidate classifier $f$, we are then interested in the excess risk $U(f^*, \mathbb{P}) - U(f, \mathbb{P})$, which is the utility gap between a given classifier $f$ and the corresponding Bayes optimal.
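To make Definition 2.1 and the form $U(f, \mathbb{P}) = G(C(f, \mathbb{P}))$ concrete, here is a minimal Python sketch that computes the empirical confusion vector of a classifier from labeled samples and evaluates a metric on it. The helper name `confusion_vector` and the use of NumPy are our own illustrative choices, not part of the paper.

```python
import numpy as np

def confusion_vector(y_true, y_pred):
    """Empirical confusion vector C = [TP, FP, FN, TN] as fractions of the sample,
    for labels in {-1, +1} (Definition 2.1 with P replaced by the empirical distribution)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.mean((y_true == 1) & (y_pred == 1))
    fp = np.mean((y_true == -1) & (y_pred == 1))
    fn = np.mean((y_true == 1) & (y_pred == -1))
    tn = np.mean((y_true == -1) & (y_pred == -1))
    return np.array([tp, fp, fn, tn])

# Example: U(f, P_n) = G(C(f, P_n)) with the accuracy metric G(C) = TP + TN.
y_true = np.array([1, 1, -1, -1, 1, -1])
y_pred = np.array([1, -1, -1, 1, 1, -1])
C = confusion_vector(y_true, y_pred)
print(C, C[0] + C[3])   # confusion vector and its accuracy utility
```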
Assumption 1 (Karmic Performance Measure). The confusion-matrix function $G$ corresponding to the performance measure $U$ is Lipschitz continuous, and satisfies the condition that $\nabla G(C)^T (1, -1, -1, 1)^T \geq C_B$ for some constant $C_B > 0$.

Table 1. Examples of evaluation metrics. Notation: TPR = TP/(TP + FN); TNR = TN/(TN + FP).

| Metric | Definition | Reference | G(C) |
| --- | --- | --- | --- |
| Accuracy | TP + TN | | (1, 0, 0, 1)C |
| Arithmetic Mean (AM) | (TPR + TNR)/2 | Menon et al. (2013) | (C1/(C1 + C3) + C4/(C2 + C4))/2 |
| Youden's Index | TPR + TNR − 1 | Youden (1950) | C1/(C1 + C3) + C4/(C2 + C4) − 1 |
| Fβ | (1 + β^2)TP / ((1 + β^2)TP + β^2 FN + FP) | Van Rijsbergen (1974) | (1 + β^2, 0, 0, 0)C / (1 + β^2, 1, β^2, 0)C |
| Linear-Fractional | (a1 TP + a2 FP + a3 FN + a4 TN) / (b1 TP + b2 FP + b3 FN + b4 TN) | Koyejo et al. (2014) | aᵀC / bᵀC |
| G-Mean | √(TPR · TNR) | Daskalaki et al. (2006) | √(C1 C4 / ((C1 + C3)(C2 + C4))) |
| Q-Mean | 1 − √(((1 − TPR)^2 + (1 − TNR)^2)/2) | Kubat et al. (1997) | 1 − √(((C3/(C1 + C3))^2 + (C2/(C2 + C4))^2)/2) |
| H-Mean | 2 / (1/TPR + 1/TNR) | Kennedy et al. (2009) | 2 / ((C1 + C3)/C1 + (C2 + C4)/C4) |

We term performance measures that satisfy the condition $\nabla G(C)^T (1, -1, -1, 1)^T \geq C_B$ "Karmic measures", since it guarantees a lower bound on the sensitivity of the performance measure in the direction of increasing true positives and true negatives, and decreasing false positives and false negatives. While our Karmic assumption slightly weakens the existing monotonicity assumption used in the literature, it is worth pointing out that the analysis in (Narasimhan et al., 2014) requires not only monotonicity but also additional assumptions (Assumption B in (Narasimhan et al., 2014)). Assumption B assumes the existence and uniqueness of an optimal threshold, which turns out to be non-trivial to check. Our analysis of threshold-quasi-concavity closes this gap.

Assumption 1 is satisfied if $G$ is strictly monotonically increasing with respect to TP, TN and decreasing with respect to FP, FN. Such an assumption is natural in that one would typically prefer a classifier with higher TP for fixed TN and vice versa (Narasimhan et al., 2015b). This condition is satisfied by most metrics in common use; e.g., for the $F_1$ measure, $\nabla G(C)^T (1, -1, -1, 1)^T = \frac{4(\mathrm{FP} + \mathrm{FN})}{(2\mathrm{TP} + \mathrm{FP} + \mathrm{FN})^2}$ is strictly positive as long as the data is not fully separable.
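To illustrate how the metrics in Table 1 are expressed as functions $G$ of the confusion vector $C = [\mathrm{TP}, \mathrm{FP}, \mathrm{FN}, \mathrm{TN}]$, here is a small Python sketch transcribing a few rows of the table (it can be combined with the `confusion_vector` helper above); the function names are ours, for illustration only.

```python
import numpy as np

def g_fbeta(C, beta=1.0):  # F_beta row of Table 1
    tp, fp, fn, tn = C
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

def g_am(C):               # Arithmetic Mean (TPR + TNR) / 2
    tp, fp, fn, tn = C
    return (tp / (tp + fn) + tn / (tn + fp)) / 2.0

def g_gmean(C):            # G-Mean sqrt(TPR * TNR)
    tp, fp, fn, tn = C
    return np.sqrt(tp / (tp + fn) * tn / (tn + fp))

def g_hmean(C):            # H-Mean 2 / (1/TPR + 1/TNR)
    tp, fp, fn, tn = C
    tpr, tnr = tp / (tp + fn), tn / (tn + fp)
    return 2.0 / (1.0 / tpr + 1.0 / tnr)

C = np.array([0.35, 0.10, 0.15, 0.40])   # an example confusion vector summing to 1
print(g_fbeta(C), g_am(C), g_gmean(C), g_hmean(C))
```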
2.1. Related Work

Representation of the Bayes Optimal Classifier. The Bayes optimal classifier under the accuracy metric is classically known to be a thresholding of the conditional probability of the response, with the threshold of 1/2 (see e.g. Devroye et al. (2013)). This property of the Bayes optimal classifier having the thresholded form is called the probability thresholding principle for binary classification by Lewis (1995). Prior work has also shown that the thresholding principle holds, with a metric-dependent threshold, for more complex specific measures such as the F-measure (Jansche, 2007; Zhao et al., 2013), the Arithmetic Mean (AM) (Menon et al., 2013), linear-fractional performance metrics (Koyejo et al., 2014), and monotonic concave metrics (Narasimhan et al., 2015a).

Plug-in Classifiers for Complex Metrics. For Bayes optimal classifiers that have the thresholded form, a line of work has devised plug-in classifiers that estimate the threshold and the conditional probability of the response. For the AM metric, Menon et al. (2013) show that the threshold is simply the proportion of the positive class. For linear-fractional functions, Koyejo et al. (2014) provide an implicit characterization of the optimal threshold, but solving it requires knowledge of the optimal classifier, which is unknown in practice. As a practical estimator, Koyejo et al. (2014) propose an exhaustive search for the threshold over all data samples, and show that the resulting algorithm is consistent, but non-asymptotic convergence rates for it are not known. Narasimhan et al. (2014) also note the importance of estimating the optimal threshold, but do not provide practical algorithms. As we show in Section 3, the empirical risk as a function of the threshold is in general neither convex nor concave. Hence, care must be taken to construct an optimization algorithm that guarantees convergence to the true threshold.

Estimators Designed for Specific Utility Functions. Perhaps the most studied non-decomposable performance metric is the F-measure (Nan et al., 2012; Joachims, 2005; Zhao et al., 2013), with wide use in information retrieval and related areas, and for which researchers have developed tailored estimators (Nan et al., 2012; Joachims, 2005) as well as risk bounds for these estimators (Zhao et al., 2013). For instance, Busa-Fekete et al. (2015) propose a scalable online F-measure estimator for large-scale datasets, with a root-finding algorithm for the threshold update that exploits special properties of the F-measure. Similarly, for the Arithmetic Mean (AM) measure, Menon et al. (2013) design a consistent optimization scheme, based on a balanced classification-calibrated surrogate to AM. Unfortunately, these techniques are not easily extended to general complex performance metrics.

Algorithms for General Classification Measures. Joachims (2005) poses the classification problem as a structured prediction problem and, for linear classifiers, proposes a structural SVM solver, for which neither consistency nor explicit convergence rates are known. Kar et al. (2014) propose an online gradient descent algorithm which requires function classes that satisfy a uniform convergence property, which is difficult to verify a priori. Along similar lines, Narasimhan et al. (2015a) propose a stochastic gradient method that involves a linearization of the classification metric. Their proposed approach depends strongly on the assumption of a linear (or kernelized) classifier, and it is not obvious that the procedure can be extended to more complex non-linear function classes.

3. The Bayes Optimal Classifier Revisited

In this section, we characterize the Bayes optimal classifier for the broad class of Karmic performance measures that satisfy Assumption 1. We then show that with one additional assumption, which we call threshold-quasi-concavity, the optimal threshold can be guaranteed to be unique. This result will be crucial for the design and analysis of our computationally efficient threshold-finding procedure in Section 4.

Denote by $\mu$ the measure corresponding to the marginal distribution of $X$. The utility is Fréchet differentiable, and the Fréchet derivative of $U$ may be computed as
$$[\nabla U(f)]_x = \nabla G(C(f))^T [\nabla C(f)]_x = \frac{1}{2}\left( \nabla G(C)^T (1, -1, -1, 1)^T\, \eta(x) - \nabla G(C)^T (0, -1, 0, 1)^T \right) d\mu(x).$$
From the Karmic measure Assumption 1, we know that $\nabla G(C)^T (1, -1, -1, 1)^T > 0$. We define the Bayes critical set of $U(f, \mathbb{P})$ for any classifier $f$ as the set of instances where the utility has zero derivative:
$$A_3(f) = \left\{ x : \eta(x) = \frac{\nabla G(C)^T (0, -1, 0, 1)^T}{\nabla G(C)^T (1, -1, -1, 1)^T} \right\}.$$
For notational simplicity, we have omitted the dependency of the gradient components $g_i$ on $C(f)$. Similarly, we will use $A_3^* := A_3(f^*)$ to denote the Bayes critical set. In this paper we focus on distributions where the critical set satisfies $\mathbb{P}(A_3(f)) = 0$.
For instance, this is true for any distribution that satisfies the following assumption.

Assumption 2 (η-continuity). Let $\nu$ denote the probability measure associated with the random variable $Z = \eta(X) = \mathbb{P}(Y = 1 \mid X)$; then $\nu$ is absolutely continuous with respect to $\mu$. Furthermore, the density of $\eta(X)$, denoted by $p_\eta(\cdot)$, has full support on $[0, 1]$ and is bounded everywhere.

Absolute continuity guarantees the existence of the density of $Z$. Armed with the above assumption on the conditional probability of the response, we can then characterize the Bayes optimal classifier as follows.

Theorem 3.1 (Bayes Optimal Classifier as a Thresholding Function). Suppose that $U$ is a performance measure that satisfies Assumption 1, and that $\eta(X)$ satisfies Assumption 2. Let $f^*$ be the Bayes classifier with respect to $U$ and $C^*$ be its confusion matrix. Then, for all $x \in (A_3^*)^c$,
$$f^*(x) = \mathrm{sign}\left( \eta(x) - \frac{\nabla G(C^*)^T (0, -1, 0, 1)^T}{\nabla G(C^*)^T (1, -1, -1, 1)^T} \right). \qquad (3)$$

Threshold of the Bayes optimal classifier. For some performance measures, the optimal threshold reduces to an absolute constant; for instance, it has the value 1/2 for the accuracy measure $U(f, \mathbb{P}) = \mathrm{TP} + \mathrm{TN}$ (see e.g. (Devroye et al., 2013)). In the general case, however, the optimal threshold $\delta^*$ is a solution of the fixed point equation
$$\frac{\nabla G(C^*)^T (0, -1, 0, 1)^T}{\nabla G(C^*)^T (1, -1, -1, 1)^T} = \delta^*,$$
which is a fixed point equation due to the dependency of $C^*$ on the threshold $\delta^*$. Theorem 3.1 guarantees the existence of a solution to the above fixed point equation, but not its uniqueness. As we will show in Section 5, uniqueness can be achieved with some additional regularity assumptions.

We note that Theorem 3.1 only imposes a weak Karmic assumption on the performance measure, which, as stated in Section 2, is more general than even a simple strict monotonicity assumption. In particular, it generalizes prior work such as (Koyejo et al., 2014; Menon et al., 2013), which imposes more stringent assumptions (linear or linear-fractional form of the measures, or strong monotonicity conditions).

We next briefly discuss why the critical set is crucial. Consider for instance the example studied in Narasimhan et al. (2014): with domain $\mathcal{X} = \{x_1, x_2, x_3\}$, a corresponding probability mass function $(0.25, 0.5, 0.25)$, and the conditional probability $\eta = (0.49, 0.5, 0.51)$. Narasimhan et al. (2014) show that for this setting, and for the case of the H-mean measure, there exist at least two deterministic Bayes optima, $(-1, 1, -1)$ and $(1, -1, 1)$, which can be seen to not have a thresholded form, i.e., they cannot be expressed as a (signed) thresholding of the conditional probability. Our analysis reveals why this is the case. From the threshold expression in (3) from Theorem 3.1, the optimal threshold can be computed explicitly as $\frac{\nabla G(C^*)^T (0, -1, 0, 1)^T}{\nabla G(C^*)^T (1, -1, -1, 1)^T} = \frac{1}{2}$. Thus, the Bayes critical set $A_3^* = \{x : \eta(x) = \frac{1}{2}\} = \{x_2\}$ has measure $\mathbb{P}(X \in A_3^*) = \mathbb{P}(X = x_2) = \frac{1}{2} > 0$. It is clear that the Bayes optimal classifier may not take a thresholded form on the Bayes critical set.
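The example above is easy to check numerically. The short Python sketch below (our own illustration, not from the paper) enumerates all eight deterministic classifiers on $\{x_1, x_2, x_3\}$ and computes the H-mean of each; the two maximizers it finds label $x_2$ differently from $x_1$ and $x_3$, so neither is a thresholding of $\eta$.

```python
from itertools import product

# Distribution from the example: P(x_i) and eta(x_i) = P(Y = 1 | X = x_i).
px  = [0.25, 0.5, 0.25]
eta = [0.49, 0.5, 0.51]
pos = sum(p * e for p, e in zip(px, eta))        # P(Y = +1) = 0.5
neg = 1.0 - pos                                  # P(Y = -1) = 0.5

def hmean(f):
    """H-mean 2 / (1/TPR + 1/TNR) of a deterministic classifier f in {-1, +1}^3."""
    tp = sum(p * e for p, e, y in zip(px, eta, f) if y == +1)
    tn = sum(p * (1 - e) for p, e, y in zip(px, eta, f) if y == -1)
    tpr, tnr = tp / pos, tn / neg
    return 0.0 if tpr == 0 or tnr == 0 else 2.0 / (1.0 / tpr + 1.0 / tnr)

scores = {f: hmean(f) for f in product([-1, +1], repeat=3)}
best = max(scores.values())
print([f for f, s in scores.items() if abs(s - best) < 1e-12])
# prints [(-1, 1, -1), (1, -1, 1)]: neither labeling is monotone in eta,
# so neither H-mean optimum is a thresholding of the conditional probability.
```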
3.1. Uniqueness of the Bayes Optimal Threshold

We are interested in characterizing mild conditions on the performance measure under which the fixed point equation characterizing the Bayes optimal threshold has a unique solution, in which case $\mathbb{P}(A_3^*) = 0$ (guaranteed by the η-continuity Assumption 2). The performance measure restricted to classifiers that are threshold functions of the conditional probability can be rewritten as a function of the conditional probability $\eta$ and the threshold $\delta$.

Definition 3.1. We define $V_\eta(\delta, \mathbb{P}) := U(f_{\eta,\delta}, \mathbb{P})$ as the performance measure of the threshold classifier $f_{\eta,\delta}(x) = \mathrm{sign}(\eta(x) - \delta)$. Its arguments are the threshold $\delta$ and the distribution $\mathbb{P}$, while the subscript $\eta$ notes its dependence on the conditional probability $\eta$.

We next introduce the definition of quasi-concavity, and the assumption that $V$ is strictly quasi-concave.

Definition 3.2. A function $f: \mathcal{X} \to \mathbb{R}$ is said to be quasi-concave if for all $x, y \in \mathcal{X}$ such that $f(x) \leq f(y)$, it follows that $\langle \nabla f(x), y - x \rangle \geq 0$. We further say that $f$ is strictly quasi-concave if it is quasi-concave and its gradient only vanishes at the global optimum, i.e., $f(y) < \max_{x \in \mathcal{X}} f(x) \implies \|\nabla f(y)\| > 0$.

Quasi-concave functions have convex super-level sets; strictly quasi-concave functions are moreover unimodal, i.e., they have a unique maximal point.

Assumption 3 (Threshold-Quasi-Concavity). The threshold-classifier performance measure $V_\eta(\delta, \mathbb{P})$ is strictly quasi-concave for $\delta \in [0, 1]$.

Assumption 3 seems abstract, but it entails that the performance measure is well-behaved as a function of the threshold. Moreover, it can easily be shown to hold for performance measures in practical use. We provide a proposition showing that the assumption is satisfied for two important classes of performance measures: linear-fractional functions and concave functions.

Proposition 3.1. Suppose Assumptions 1 and 2 hold, and either (a) $G$ is twice continuously differentiable and concave, or (b) $G$ is a linear-fractional function $G(C) = \frac{a^T C}{b^T C}$ with $|b^T C| > 0$. Then $V_\eta(\delta, \mathbb{P})$ is strictly quasi-concave.

Theorem 3.2. Under Assumption 3, the fixed-point equation
$$\delta = \frac{\nabla G(C(f_\delta), \mathbb{P})^T (0, -1, 0, 1)^T}{\nabla G(C(f_\delta), \mathbb{P})^T (1, -1, -1, 1)^T}, \qquad (4)$$
where $f_\delta(x) = \mathrm{sign}(\eta(x) - \delta)$, has a unique fixed point $\delta^* \in (0, 1)$. Hence the threshold in Theorem 3.1 is uniquely defined.

Theorems 3.1 and 3.2 have two key consequences: first, we can use the representation to design plug-in estimators of the Bayes optimal classifier; second, it facilitates the statistical analysis of rates of convergence. We discuss each of these two consequences in the following sections.

4. Algorithmic Consequence: Estimation of the Threshold

Theorem 3.1 shows that for Karmic performance measures, the Bayes optimal classifier has the thresholded form as in Eq. (3), and moreover, under the threshold-quasi-concavity Assumption 3, this threshold is unique. An immediate algorithmic consequence is to focus on plug-in classifiers that separately estimate the conditional probability and the threshold. We present this plug-in classifier template in Algorithm 1. The template needs (a) an estimator for the conditional probability function $\eta(x)$, and (b) an estimator for the threshold. For the convenience of analysis, we divide the set of samples into two independent subsets: the conditional probability estimator is estimated using one subset, and the threshold is estimated using the other. In the coming subsections we discuss how to estimate the conditional probability and the threshold, respectively.

Algorithm 1 Two-step Plug-in Classifier for General Metrics
1: Input: Training sample $\{X_i, Y_i\}_{i=1}^n$, utility measure $U$, conditional probability estimator $\hat\eta$, stepsize $\alpha$.
2: Randomly split the training sample into two subsets $\{X_i^{(1)}, Y_i^{(1)}\}_{i=1}^{n_1}$ and $\{X_i^{(2)}, Y_i^{(2)}\}_{i=1}^{n_2}$;
3: Estimate $\hat\eta$ on $\{X_i^{(1)}, Y_i^{(1)}\}_{i=1}^{n_1}$;
4: Estimate $\hat\delta$ with $\{X_i^{(2)}, Y_i^{(2)}\}_{i=1}^{n_2}$;
5: Output: $\hat f(x) = \mathrm{sign}(\hat\eta(x) - \hat\delta)$.
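A minimal Python sketch of the two-step template in Algorithm 1 is given below. It assumes scikit-learn's `LogisticRegression` as one possible conditional probability estimator and expects the caller to supply a `find_threshold` routine (for instance the binary search of Algorithm 2 in Section 4.2 below); the names and structure are ours, for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def two_step_plugin(X, y, utility, find_threshold, seed=0):
    """Algorithm 1 (sketch): split the data, fit eta-hat on one half,
    estimate the threshold on the other half, and return a plug-in classifier.
    `utility` maps a confusion vector [TP, FP, FN, TN] to a score;
    `find_threshold(eta_hat_values, y, utility)` returns delta-hat."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    i1, i2 = idx[:half], idx[half:]

    # Step (a): conditional probability estimate on the first split (labels in {-1, +1}).
    model = LogisticRegression().fit(X[i1], y[i1])
    eta_hat = lambda X_new: model.predict_proba(X_new)[:, 1]

    # Step (b): threshold estimate on the second split.
    delta_hat = find_threshold(eta_hat(X[i2]), y[i2], utility)

    # Plug-in classifier f_hat(x) = sign(eta_hat(x) - delta_hat).
    return lambda X_new: np.where(eta_hat(X_new) > delta_hat, 1, -1)
```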
4.1. Estimation of the Conditional Probability Function

The estimation of the conditional probability of the response plays a crucial role in the success of Algorithm 1, but we emphasize that it is not the focus of our paper. In particular, this is a well-studied problem, and numerous methods have been proposed for both parametric and non-parametric model assumptions on the conditional probability function. In this section we briefly discuss some common estimators, and defer additional details to Section 6.

Parametric methods. In a classical paper, Ng and Jordan (2002) compare two approaches to classification: one can either estimate $P(Y)$ and $P(X \mid Y)$ first and then obtain the conditional probability by Bayes' rule (the generative approach), or directly estimate $P(Y \mid X)$ (the discriminative approach). The two approaches can also be related. In particular, if $P_{\theta_Y}(X \mid Y)$ belongs to an exponential family, we have
$$P_{\theta_Y}(X \mid Y) = h(X) \exp\left( \langle \theta_Y, \phi(X) \rangle - A(\theta_Y) \right),$$
where $\phi(X)$ is the set of sufficient statistics, $\theta_Y$ is the vector of the true canonical parameters, and $A(\theta)$ is the log-partition function. Using Bayes' rule, we then have
$$\mathbb{P}(Y = 1 \mid X) = \frac{1}{1 + \exp\left( -\langle \theta_1 - \theta_0, \phi(X) \rangle + c \right)},$$
where $c = A(\theta_0) - A(\theta_1)$. The conditional distribution can be seen to follow a logistic regression model, with the generative model sufficient statistics as the features, and the difference of the generative model parameters serving as the parameters of the discriminative model. A natural class of estimators for either the generative or discriminative models is based on maximum likelihood estimation (MLE). In Section 6, we derive the rate of convergence for the special case where the generative distributions are Gaussians with the same covariance for both classes.

Non-parametric methods. One can also estimate $\eta(x) = \mathbb{P}(Y = 1 \mid X = x)$ non-parametrically, where a common model assumption is some form of smoothness on $\eta(x)$. One popular class of smooth functions is the following.

Definition 4.1 (β-Hölder class). Let $\beta > 0$, and denote by $\lfloor \beta \rfloor$ the maximal integer that is strictly less than $\beta$. For $x \in \mathcal{X}$ and any $\lfloor \beta \rfloor$-times continuously differentiable real-valued function $\eta$ on $\mathcal{X}$, denote by $\eta_x$ its Taylor polynomial of degree $\lfloor \beta \rfloor$ at the point $x$. The β-Hölder class is defined as the set of functions that satisfy, for all $x, x' \in \mathcal{X}$, $|\eta(x') - \eta_x(x')| \leq C_\beta \|x - x'\|^\beta$. In particular, when $0 < \beta \leq 1$, this reduces to $|\eta(x) - \eta(x')| \leq C_\beta \|x - x'\|^\beta$.

We can then estimate $\eta(x)$ from this family of smooth functions via local polynomial estimators (Audibert et al., 2007), or kernel (conditional) density estimators (Jiang, 2017) with a properly chosen bandwidth.

4.2. Estimation of the Threshold

When $V_\eta$ is quasi-concave, a key consequence is that its gradient with respect to the threshold suffices to provide ascent direction information. We leverage this consequence and summarize a simple binary search algorithm based on the sign of $V'_\eta(\delta, \mathbb{P})$ in Algorithm 2.

Algorithm 2 Binary search for the optimal threshold
1: Input: Training sample $\{X_i, Y_i\}_{i=1}^n$, utility measure $U$, conditional probability estimator $\hat\eta$, tolerance $\epsilon_0$.
2: $\delta_\ell = 0$; $\delta_r = 1$;
3: while $|\delta_\ell - \delta_r| \geq \epsilon_0$ do
4:   Evaluate $s = \mathrm{sign}\big( V'_{\hat\eta}\big(\tfrac{\delta_\ell + \delta_r}{2}, \mathbb{P}_n\big) \big)$;
5:   if $s \geq 0$ then
6:     $\delta_\ell = \frac{\delta_\ell + \delta_r}{2}$;
7:   else
8:     $\delta_r = \frac{\delta_\ell + \delta_r}{2}$;
9:   end if
10: end while
11: Output: $\frac{\delta_\ell + \delta_r}{2}$
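Below is a hedged Python sketch of the binary search in Algorithm 2, matching the `find_threshold` signature assumed in the Algorithm 1 sketch above. The paper evaluates the sign of the derivative $V'_{\hat\eta}(\delta, \mathbb{P}_n)$; as a simple stand-in, this sketch approximates that sign by a symmetric finite difference of the empirical utility, which is our own simplification rather than the paper's exact procedure.

```python
import numpy as np

def empirical_utility(delta, eta_hat_values, y, utility):
    """V_eta-hat(delta, P_n): utility of sign(eta_hat - delta) on the empirical confusion vector."""
    pred = np.where(eta_hat_values > delta, 1, -1)
    tp = np.mean((y == 1) & (pred == 1))
    fp = np.mean((y == -1) & (pred == 1))
    fn = np.mean((y == 1) & (pred == -1))
    tn = np.mean((y == -1) & (pred == -1))
    return utility(np.array([tp, fp, fn, tn]))

def find_threshold(eta_hat_values, y, utility, tol=1e-3, h=1e-2):
    """Algorithm 2 (sketch): bisection on [0, 1] guided by the sign of an ascent
    direction at the midpoint; here the sign comes from a finite difference."""
    lo, hi = 0.0, 1.0
    while hi - lo >= tol:
        mid = (lo + hi) / 2.0
        slope = (empirical_utility(mid + h, eta_hat_values, y, utility)
                 - empirical_utility(mid - h, eta_hat_values, y, utility))
        if slope >= 0:
            lo = mid       # utility still increasing: move the left endpoint up
        else:
            hi = mid       # utility decreasing: move the right endpoint down
    return (lo + hi) / 2.0
```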
In the next section, we analyze the rates of convergence for the excess generalization error of the plug-in classifier learned from Algorithm 1, with the threshold estimated via Algorithm 2.

5. Statistical Analysis Consequence: Rates of Convergence

We next analyze the convergence rate of the excess utility. As we will show, the rates of convergence depend on three quantities: the noise level of the data distribution, the convergence rate of the conditional probability function, and the convergence rate of the threshold. We start by introducing some assumptions. We assume that the estimator of the conditional probability of the response satisfies the following condition.

Assumption 4. Let $S_n$ denote a sample set of size $n$, and $\eta_{S_n}$ denote the conditional probability estimator learnt from $S_n$. Then, for some absolute constants $c_1, c_2 > 0$, the conditional probability estimator satisfies
$$\sup_{S_n} \mathbb{P}\big( |\eta_{S_n}(x) - \eta(x)| \geq \epsilon \big) \leq c_1 \exp(-c_2\, a_n\, \epsilon^2) \quad \text{a.e.}$$

The convergence rate also depends on the noise in the training labels, which is typically captured via the probability mass near the Bayes optimal threshold. Here we generalize the classical margin assumption (sometimes also called the low-noise assumption) of Audibert et al. (2007), developed for the accuracy metric, to the case where the optimal threshold is not the fixed constant 1/2.

Assumption 5. For some function $C_0(\delta^*) > 0$ that depends on the threshold $\delta^*$, there exists an $\alpha \geq 0$ such that
$$\mathbb{P}_X\big( 0 < |\eta(X) - \delta^*| \leq t \big) \leq C_0(\delta^*)\, t^\alpha.$$

The assumption characterizes the behavior of the regression function in the vicinity of the optimal threshold $\delta^*$. The case $\alpha = 0$ bounds the probability by a constant potentially larger than one, and is trivially satisfied. The other extreme case $\alpha = \infty$ is most advantageous for classification, since in this case the regression function $\eta$ is bounded away from the threshold.

In cases where the threshold is not an absolute constant (such as 1/2), it has to be estimated from data. We make the following assumption on its convergence rate.

Assumption 6. Given a conditional probability estimate $\hat\eta$ learned from an independent data source, the estimator $\hat\delta_n$ of the threshold, from a sample set of size $n$, satisfies the following condition for some absolute constant $c_3 > 0$:
$$\mathbb{P}\Big( U(\mathrm{sign}(\hat\eta - \delta^*)) - U(\mathrm{sign}(\hat\eta - \hat\delta)) \geq b_n^{-1} \Big) \leq n^{-c_3}.$$

Note that Assumption 6 allows the rate $b_n$ to in turn depend on $\hat\eta$, or more specifically, on $\mathbb{E}_X|\eta(X) - \hat\eta(X)|$. Moreover, it does not necessarily require that $\hat\delta$ converge to $\delta^*$, only that their corresponding utility values be close.

Armed with these largely notational assumptions, we can now provide the rate for the overall data-splitting two-step plug-in classifier described in Algorithm 1:

Theorem 5.1. Suppose Assumptions 1 and 2 hold, and further that Assumptions 4 and 6 hold for some $\hat\eta$ and $\hat\delta$. Let $U^* = U(\mathrm{sign}(\eta - \delta^*), \mathbb{P})$ be the Bayes optimal utility. If we split the data as $n_1 = n_2 = \frac{n}{2}$, then with probability greater than $1 - n^{-c_4}$,
$$U^* - U\big(\mathrm{sign}(\hat\eta - \hat\delta), \mathbb{P}\big) \leq c_5 \max\Big\{ a_n^{-\frac{1+\alpha}{2}},\ b_n^{-1} \Big\},$$
where $c_4, c_5 > 0$ are absolute constants.
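To see how the two terms in Theorem 5.1 interact, the following worked example (ours, anticipating Section 6) plugs in the values obtained later for the Gaussian generative model, namely $a_n = n$ (Lemma 6.1), $\alpha = 1$ (Lemma 6.2), and $b_n = \tilde n / \log n$ with $\tilde n = \min\{n, a_n\}$ (Lemma 5.3):
$$U^* - U\big(\mathrm{sign}(\hat\eta - \hat\delta), \mathbb{P}\big) \;\lesssim\; \max\Big\{ a_n^{-\frac{1+\alpha}{2}},\ b_n^{-1} \Big\} \;=\; \max\Big\{ n^{-1},\ \frac{\log n}{n} \Big\} \;=\; \frac{\log n}{n},$$
which is the $O(\log n / n)$ rate stated later in Corollary 6.1; the extra $\log n$ factor comes from having to estimate the unknown threshold.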
5.1. Key Lemmas

We provide a detailed proof of the theorem in the Appendix, but give brief vignettes via some key lemmas that also provide additional insight into the statistical analysis. A key tool in the analysis of traditional binary classification is to turn the excess risk into an expectation of the absolute deviation of the conditional probability from the threshold 1/2. We show in the following lemma that a similar result holds with a general optimal threshold.

Lemma 5.1. Let $C_n$ and $C^*$ be the vectorized confusion matrices associated with $f_n = \mathrm{sign}(\eta_n - \delta^*)$ and $f^* = \mathrm{sign}(\eta - \delta^*)$ respectively, where $\delta^*$ is the threshold for the Bayes optimal classifier. Denote $C_G := \nabla G(C^*)^T (1, -1, -1, 1)$ and $C_H := \max_f \|\nabla^2 G(C(f))\|_{op}$, where $\|\cdot\|_{op}$ refers to the operator norm of a matrix. If, for some constant $c_6$, $a_n \geq c_6 \big( \frac{C_H}{C_G \min\{\delta^*, 1 - \delta^*\}} \big)^2$, then
$$G(C^*) - G(C_n) \geq \tfrac{1}{2} C_G\, \mathbb{E}\big[ |\eta - \delta^*|\, \mathbf{1}(f_n \neq f^*) \big], \qquad G(C^*) - G(C_n) \leq \tfrac{3}{2} C_G\, \mathbb{E}\big[ |\eta - \delta^*|\, \mathbf{1}(f_n \neq f^*) \big].$$

This lemma helps us control the excess utility via the error $\hat\eta - \eta$ of the conditional probability estimator. Armed with this result, and additionally using Assumption 4 on the convergence rate of the conditional probability estimator, we can then show that the excess utility converges at the rate $O\big(a_n^{-\frac{1+\alpha}{2}}\big)$.

Lemma 5.2. Suppose that Assumptions 4 and 5 are satisfied, and that the Bayes optimal classifier is $f^* = \mathrm{sign}(\eta - \delta^*)$. Then there exists a constant $c_7 > 0$, which depends on $G$ and $C(f^*)$, such that
$$U(\mathrm{sign}(\eta - \delta^*)) - U(\mathrm{sign}(\eta_n - \delta^*)) \leq c_7\, a_n^{-\frac{1+\alpha}{2}}.$$

Lemma 5.2 describes the classification error rate when the optimal threshold is known. Stitching this together with Assumption 6 on the convergence rate of the threshold estimator can then be shown to yield the statement of Theorem 5.1.

5.2. Risk Bound for the Plug-in Classifier from Algorithm 2

Prior work on threshold estimation for plug-in classifiers has ranged over brute-force search (Koyejo et al., 2014) with no rates of convergence, level-set based methods (Parambath et al., 2015) for the specific class of linear-fractional metrics, and Frank-Wolfe based methods (Narasimhan et al., 2015b) for the specific class of concave performance metrics. However, these estimators, in addition to focusing on specific performance metrics, are only able to achieve a convergence rate of $O(\max\{\mathbb{E}\|\hat\eta(X) - \eta(X)\|_1,\ 1/\sqrt{n}\})$. This entails that even when the conditional probability estimator has a fast convergence rate, the final convergence rate for these estimators will still be bounded by $O(1/\sqrt{n})$. In this section we show that our simple threshold search procedure in Algorithm 2 achieves a fast $O(1/n)$ or $O(1/a_n)$ rate of convergence by leveraging our analysis from Section 3.

Lemma 5.3. Assume that Assumptions 1, 2, and 5 hold, and that the confusion-matrix function $G$ corresponding to the performance measure $U$ satisfies the same conditions as in Proposition 3.1. Let $\hat\delta$ denote the output of Algorithm 2 with sample size $n$ and tolerance $\tau = \frac{\log n}{n}$, and let $\hat\eta$ denote a conditional probability estimator satisfying Assumption 4 obtained on an independent sample set of size $n$. Denoting $\tilde n = \min\{n, a_n\}$, the rate $b_n$ in Assumption 6 then satisfies $b_n = \frac{\tilde n}{\log n}$.

An immediate corollary gives the excess risk for the plug-in classifier.

Corollary 5.1. Suppose Assumption 3 holds. If $\tau = \frac{\log n}{n}$ and $n_1 = n_2 = \frac{n}{2}$, then there exist constants $c_8, c_9 > 0$ such that, with probability at least $1 - \min\{n, a_n\}^{-c_8}$,
$$U(f^*, \mathbb{P}) - U(\hat f, \mathbb{P}) \leq c_9 \max\Big\{ \frac{\log n}{\min\{n, a_n\}},\ a_n^{-\frac{1+\alpha}{2}} \Big\}.$$

6. Explicit Rates for Specific Conditional Probability Models

In this section, we analyze two special cases where we can obtain explicit rates of convergence for the conditional probability estimation. For the first example, we consider the Gaussian generative model.
We will show that the rate of convergence for the excess utility obtained in Theorem 5.1 is $O\big(\frac{\log n}{n}\big)$ in this case. The second example concerns nonparametric kernel estimators when the conditional probability function satisfies a certain smoothness assumption.

6.1. Gaussian Generative Model

Consider two Gaussian class-conditional distributions with the same covariance; without loss of generality, we assume the covariance matrix is the identity $I_d$ for both classes. We define an asymmetric mixture of two Gaussians indexed by the centers and mixing weights:
$$\mathbb{P}_{\mu,\kappa}:\quad \mathbb{P}(Y = 1) = \kappa,\quad \mathbb{P}(Y = 0) = 1 - \kappa,\quad X \mid Y = 1 \sim N\big(\tfrac{\mu}{2}, I_d\big),\quad X \mid Y = 0 \sim N\big(-\tfrac{\mu}{2}, I_d\big). \qquad (5)$$
As stated in Section 4, we can compute the conditional probability and show that it can be fitted with a logistic regression model. Next we present results related to the key quantities in Theorem 5.1: $a_n$ and $\alpha$.

Lemma 6.1. The model defined in Eq. (5), with the maximum likelihood estimator, satisfies Assumption 4 with $a_n = n$.

The following lemma specifies the margin assumption parameter for the above model.

Lemma 6.2. The Gaussian generative model defined as in Eq. (5) satisfies Assumption 5 with $\alpha = 1$.

Combining this result with Theorem 5.1 gives us the following corollary.

Corollary 6.1. Assume Assumptions 1-5 hold and $\mathbb{P}$ is generated from Eq. (5). Let $\hat f$ be the output of Algorithm 1 with $\hat\eta$ estimated by the MLE of logistic regression. Then, with probability tending to 1,
$$U(f^*, \mathbb{P}) - U(\hat f, \mathbb{P}) = O\Big( \frac{\log n}{n} \Big).$$

For Gaussian generative models, fast rates of $O(\frac{1}{n})$ are only known for the 0-1 loss (Li et al., 2015). The logarithm factor can be further removed under the 0-1 loss, or in other cases where the threshold is known, as one can apply Lemma 5.2 with $\alpha = 1$ and obtain exactly the same rate as in Li et al. (2015). Corollary 6.1 generalizes this result to a much broader class of utility functions, where the threshold is unknown and estimated from data.

6.2. β-Hölder Densities

When the conditional probability function belongs to the β-Hölder class as defined in Definition 4.1, we have the following lemma on the convergence rate of $\eta_n$.

Lemma 6.3. For β-Hölder conditional probability functions, there exist estimators such that Assumption 4 holds with $a_n = n^{\frac{2\beta}{2\beta + d}}$.

Examples of such estimators include local polynomial estimators (Audibert et al., 2007) and kernel (conditional) density estimators (Jiang, 2017). Combined with Theorem 5.1, we have the following corollary.

Corollary 6.2. Assume Assumptions 1-5 hold and let $\mathbb{P}$ be a distribution where $\mathbb{P}(Y = 1 \mid X)$ belongs to the β-Hölder class. With local polynomial estimators (Audibert et al., 2007) or kernel (conditional) density estimators (Jiang, 2017), we have
$$U(f^*, \mathbb{P}) - U(\hat f, \mathbb{P}) = O\Big( n^{-\frac{(\min\{\alpha, 1\} + 1)\beta}{2\beta + d}} \Big).$$

The convergence rate obtained in Corollary 6.2 is faster than $O(\frac{1}{\sqrt{n}})$ if $\beta > \max\{\frac{d}{2\alpha}, \frac{d}{2}\}$. It is worth pointing out that the fast rate is obtained via a trade-off between the parameters $\alpha$ and $\beta$: for the conditional probability function $\eta$ to be very smooth, i.e., to have a large value of $\beta$, it cannot deviate from the critical level very abruptly, and hence $\alpha$ has to be small.
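As a usage illustration tying Sections 4 and 6.1 together, the sketch below simulates data from the Gaussian generative model in Eq. (5), runs the two-step plug-in procedure with logistic regression and the binary-search threshold (reusing the `two_step_plugin`, `find_threshold`, and `confusion_vector` sketches from earlier sections), and evaluates the resulting $F_1$ utility on held-out data. The parameter choices and names are our own illustration, not an experiment from the paper.

```python
import numpy as np

def sample_gaussian_model(n, mu, kappa=0.3, seed=1):
    """Draw n points from the model in Eq. (5), with labels encoded as {-1, +1}:
    Y = +1 with probability kappa, and X | Y ~ N(+/- mu / 2, I_d)."""
    rng = np.random.default_rng(seed)
    y = np.where(rng.random(n) < kappa, 1, -1)
    X = rng.normal(size=(n, len(mu))) + np.outer(y, mu / 2.0)
    return X, y

def g_f1(C):                              # F_1 row of Table 1 (beta = 1)
    tp, fp, fn, tn = C
    return 2 * tp / (2 * tp + fn + fp)

mu = np.array([1.0, -0.5, 0.25])
X_train, y_train = sample_gaussian_model(4000, mu, seed=1)
X_test, y_test = sample_gaussian_model(4000, mu, seed=2)

f_hat = two_step_plugin(X_train, y_train, g_f1, find_threshold)   # Algorithm 1 sketch
C_test = confusion_vector(y_test, f_hat(X_test))                   # helper from Section 2
print("test F_1 utility of the plug-in classifier:", g_f1(C_test))
```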
7. Conclusion

We study Bayes optimal classification for general performance metrics. We derive the form of the Bayes optimal classifier, provide practical algorithms to estimate this Bayes optimal classifier, and provide a novel analysis of classification error with respect to general performance metrics; in particular, we show our estimators are not only consistent but have fast rates of convergence. We also provide corollaries of our general results for some special cases, such as when the inputs are drawn from a Gaussian mixture generative model, or when the conditional probability function lies in a Hölder space, explicitly proving fast rates under mild regularity conditions.

Acknowledgements

P.R. acknowledges the support of NSF via IIS-1149803, IIS-1664720, DMS-1264033, and PNC via the PNC Center for Financial Services Innovation.

References

Jean-Yves Audibert, Alexandre B. Tsybakov, et al. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608–633, 2007.

Róbert Busa-Fekete, Balázs Szörényi, Krzysztof Dembczynski, and Eyke Hüllermeier. Online F-measure optimization. In Advances in Neural Information Processing Systems, pages 595–603, 2015.

Sophia Daskalaki, Ioannis Kopanas, and Nikolaos Avouris. Evaluation of classifiers for an uneven class distribution problem. Applied Artificial Intelligence, 20(5):381–417, 2006.

Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.

Qiong Gu, Li Zhu, and Zhihua Cai. Evaluation measures of the classification performance of imbalanced data sets. In International Symposium on Intelligence Computation and Applications, pages 461–471. Springer, 2009.

Martin Jansche. A maximum expected utility framework for binary sequence labeling. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 736–743, 2007.

Heinrich Jiang. Uniform convergence rates for kernel density estimation. In International Conference on Machine Learning, pages 1694–1703, 2017.

Thorsten Joachims. A support vector method for multivariate performance measures. In Proceedings of the 22nd International Conference on Machine Learning, pages 377–384. ACM, 2005.

Purushottam Kar, Harikrishna Narasimhan, and Prateek Jain. Online and stochastic gradient methods for non-decomposable loss functions. In Advances in Neural Information Processing Systems, pages 694–702, 2014.

Kenneth Kennedy, Brian Mac Namee, and Sarah Jane Delany. Learning without default: A study of one-class classification and the low-default portfolio problem. In Artificial Intelligence and Cognitive Science, pages 174–187. Springer, 2009.

Oluwasanmi O. Koyejo, Nagarajan Natarajan, Pradeep K. Ravikumar, and Inderjit S. Dhillon. Consistent binary classification with generalized performance metrics. In Advances in Neural Information Processing Systems, pages 2744–2752, 2014.

Miroslav Kubat, Stan Matwin, et al. Addressing the curse of imbalanced training sets: one-sided selection. In ICML, volume 97, pages 179–186, Nashville, USA, 1997.

David D. Lewis. Evaluating and optimizing autonomous text classification systems. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 246–254. ACM, 1995.

Tianyang Li, Adarsh Prasad, and Pradeep K. Ravikumar. Fast classification rates for high-dimensional Gaussian generative models. In Advances in Neural Information Processing Systems, pages 1054–1062, 2015.

Enno Mammen, Alexandre B. Tsybakov, et al. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.

Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, et al. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge, 2008.

Aditya Menon, Harikrishna Narasimhan, Shivani Agarwal, and Sanjay Chawla. On the statistical consistency of algorithms for binary classification under class imbalance. In International Conference on Machine Learning, pages 603–611, 2013.
Ye Nan, Kian Ming Chai, Wee Sun Lee, and Hai Leong Chieu. Optimizing F-measure: A tale of two approaches. arXiv preprint arXiv:1206.4625, 2012.

Harikrishna Narasimhan, Rohit Vaish, and Shivani Agarwal. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In Advances in Neural Information Processing Systems, pages 1493–1501, 2014.

Harikrishna Narasimhan, Purushottam Kar, and Prateek Jain. Optimizing non-decomposable performance measures: A tale of two classes. In 32nd International Conference on Machine Learning (ICML), 2015a.

Harikrishna Narasimhan, Harish Ramaswamy, Aadirupa Saha, and Shivani Agarwal. Consistent multiclass algorithms for complex performance measures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2398–2407, 2015b.

Andrew Y. Ng and Michael I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems, pages 841–848, 2002.

Shameem A. Puthiya Parambath, Nicolas Usunier, and Yves Grandvalet. Theory of optimizing pseudolinear performance measures: Application to F-measure. arXiv preprint arXiv:1505.00199, 2015.

Wolfgang Polonik. Measuring mass concentrations and estimating density contour clusters – an excess mass approach. The Annals of Statistics, pages 855–881, 1995.

Cornelis Joost Van Rijsbergen. Foundation of evaluation. Journal of Documentation, 30(4):365–373, 1974.

Byron C. Wallace, Kevin Small, Carla E. Brodley, and Thomas A. Trikalinos. Class imbalance, redux. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pages 754–763. IEEE, 2011.

William J. Youden. Index for rating diagnostic tests. Cancer, 3(1):32–35, 1950.

Ming-Jie Zhao, Narayanan Edakunni, Adam Pocock, and Gavin Brown. Beyond Fano's inequality: bounds on the optimal F-score, BER, and cost-sensitive risk and their implications. The Journal of Machine Learning Research, 14(1):1033–1090, 2013.