# Consistency Analysis for Binary Classification Revisited

Krzysztof Dembczyński¹, Wojciech Kotłowski¹, Oluwasanmi Koyejo², Nagarajan Natarajan³ (authors listed in alphabetical order)

¹Institute of Computing Science, Poznań University of Technology, Poland. ²Department of Computer Science, University of Illinois at Urbana-Champaign, USA. ³Microsoft Research, India. Correspondence to: Wojciech Kotłowski.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

**Abstract.** Statistical learning theory is at an inflection point enabled by recent advances in understanding and optimizing a wide range of metrics. Of particular interest are non-decomposable metrics such as the F-measure and the Jaccard measure, which cannot be represented as a simple average over examples. Non-decomposability is the primary source of difficulty in theoretical analysis, and interestingly has led to two distinct settings and notions of consistency. In this manuscript we analyze both settings, from statistical and algorithmic points of view, to explore the connections and to highlight differences between them for a wide range of metrics. The analysis complements previous results on this topic, clarifies common confusions around both settings, and provides guidance to the theory and practice of binary classification with complex metrics.

## 1. Introduction

Real-world applications of binary classification to complex decision problems have led to the design of a wide range of evaluation metrics (Choi & Cha, 2010). Prominent examples include area under the ROC curve (AUC) for imbalanced labels (Menon et al., 2013), F-measure for information retrieval (Lewis, 1995), and precision at the top (Kar et al., 2014; 2015; Jasinska et al., 2016). To this end, several algorithms have been proposed for optimizing many of these metrics, primarily focusing on large-scale learning, without a conscious emphasis on the statistical consequences of choosing models and their asymptotic behavior (Kar et al., 2015; Joachims, 2005). Wide use of such complex metrics has also re-invigorated research into their theoretical properties, which can then serve as a guide to practice (Koyejo et al., 2014a; Narasimhan et al., 2014a; Dembczyński et al., 2012; Waegeman et al., 2014; Natarajan et al., 2016).

Complex evaluation metrics for binary classification are best described as set metrics, or non-decomposable metrics, as, in general, the evaluation of a set of predictions cannot be decomposed into the average of individual instance evaluations. This is in contrast to decomposable metrics such as accuracy, which are defined as the empirical average of instance evaluations. This property is the primary source of difficulty in theoretical analysis, and interestingly has led to two distinct settings and notions of consistency. On one hand, Population Utility (PU) focuses on estimation, so a consistent PU classifier is one which correctly estimates the population-optimal utility as the size of the training set (equivalently, test set) increases. The PU approach has its strongest roots in classical statistical analysis, which often deals with asymptotically optimal estimation. On the other hand, Expected Test Utility (ETU) focuses on generalization; thus, a consistent ETU classifier is one which optimizes the expected prediction error over test sets of a pre-defined size.
The ETU approach has its strongest roots in statistical machine learning, which prizes generalization as the primary goal. Importantly, these distinctions are irrelevant when the metric is a linear function of the confusion matrix, e.g. (weighted) accuracy and other linear metrics. To the best of our knowledge, this dichotomy was first explicitly noted by Ye et al. (2012) in the context of the F-measure.¹ Like Ye et al. (2012), our goal is not to adjudicate the correctness of either approach, but instead to explore deep connections, and highlight significant differences, between both approaches for a wide range of metrics.

**Contributions:** We present a variety of results comparing and contrasting the PU and ETU approaches to consistent classification:

- We show that for a wide range of metrics, PU and ETU are asymptotically equivalent with respect to the size of the test set, subject to a certain p-Lipschitzness condition which is satisfied by many metrics of interest. This further implies asymptotic equivalence of the Bayes optimal classifiers (Section 3.1). Similar results were previously only known for the F-measure.
- We provide lower bounds on the difference between the PU and ETU metrics for finite test sets for certain metrics, thereby highlighting the difference between PU- and ETU-consistent classifiers with small test sets (Section 3.2).
- We analyze approximate ETU classification using low-order Taylor approximations, showing that the approximation can be computed with effectively linear complexity, yet achieves low error under standard assumptions (Section 4.1).
- We consider the effects of model misspecification and find that ETU may be more sensitive than PU, but this may be alleviated by properly calibrating the estimated probabilities (Section 4.2).
- In addition, we present experimental results using simulated and real data to evaluate our theoretical claims (Section 5).

¹Note that Ye et al. (2012) termed the two approaches Empirical Utility Maximization (EUM) and Decision Theoretic Approach (DTA), respectively. We have instead chosen the more descriptive names Population Utility (PU) and Expected Test Utility (ETU).

## 2. Preliminaries and Problem Setup

We consider the binary classification problem, where the input is a feature vector $x \in \mathcal{X}$, and the output is a label $y \in \{0, 1\}$. We assume the examples $(x, y)$ are generated i.i.d. according to $P(x, y)$. A classifier is a mapping $h \colon \mathcal{X} \to \{0, 1\}$. We let $\mathbf{1}_C$ denote the indicator function, i.e. equal to one if $C$ is satisfied, and zero otherwise. Given a distribution $P$ and a binary classifier $h$, define:

$$\mathrm{TP}(h) = P(h = 1, y = 1), \quad \mathrm{TN}(h) = P(h = 0, y = 0), \quad \mathrm{FP}(h) = P(h = 1, y = 0), \quad \mathrm{FN}(h) = P(h = 0, y = 1),$$

which are the entries of the so-called confusion matrix, namely true positives, true negatives, false positives and false negatives. In this paper, we are interested in optimizing performance metrics $\Phi(h, P)$ (we make the dependence on $P$ explicit because we will also consider the empirical version of $\Phi$) that are functions of the above four quantities. However, since the entries of the confusion matrix are interdependent, it suffices to use only three independent combinations of them. Following Natarajan et al. (2016), we parametrize $\Phi(h, P) = \Phi(u(h), v(h), p)$ by means of:

$$u(h) = \mathrm{TP}(h), \qquad v(h) = P(h = 1), \qquad p = P(y = 1).$$

As argued by Natarajan et al. (2016), any metric that is a function of the confusion matrix can be parameterized in this way.
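As a small illustration of this parameterization, the following minimal sketch (ours, not from the paper; the helper names are illustrative) maps normalized confusion-matrix entries to $(u, v, p)$ and evaluates two of the metrics listed in Table 1 below.

```python
# Minimal sketch (assumptions: normalized confusion-matrix entries summing to 1).

def confusion_to_uvp(tp, tn, fp, fn):
    """Map confusion-matrix entries to the (u, v, p) parameterization."""
    u = tp            # u = TP = P(h = 1, y = 1)
    v = tp + fp       # v = P(h = 1), the predicted-positive rate
    p = tp + fn       # p = P(y = 1), the class prior
    return u, v, p

def f_beta(u, v, p, beta=1.0):
    """F_beta written as Phi(u, v, p) = (1 + beta^2) u / (beta^2 p + v)."""
    return (1.0 + beta**2) * u / (beta**2 * p + v)

def jaccard(u, v, p):
    """Jaccard written as Phi(u, v, p) = u / (p + v - u)."""
    return u / (p + v - u)

# Example: TP = 0.3, TN = 0.5, FP = 0.1, FN = 0.1 (entries sum to 1).
u, v, p = confusion_to_uvp(0.3, 0.5, 0.1, 0.1)
print(f_beta(u, v, p), jaccard(u, v, p))   # F1 = 0.75, Jaccard = 0.6
```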
Table 1 lists popular examples of such metrics together with their explicit parameterization $\Phi(u, v, p)$.

| Metric | Definition | $\Phi(u, v, p)$ |
| --- | --- | --- |
| Accuracy | $\mathrm{TP} + \mathrm{TN}$ | $1 + 2u - v - p$ |
| AM | $\frac{\mathrm{TP}/2}{\mathrm{TP}+\mathrm{FN}} + \frac{\mathrm{TN}/2}{\mathrm{TN}+\mathrm{FP}}$ | $\frac{u + p(1 - v - p)}{2p(1-p)}$ |
| $F_\beta$ | $\frac{(1+\beta^2)\mathrm{TP}}{(1+\beta^2)\mathrm{TP} + \beta^2\mathrm{FN} + \mathrm{FP}}$ | $\frac{(1+\beta^2)u}{\beta^2 p + v}$ |
| Jaccard | $\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}+\mathrm{FN}}$ | $\frac{u}{p + v - u}$ |
| G-mean | $\sqrt{\frac{\mathrm{TP}\cdot\mathrm{TN}}{(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})}}$ | $\sqrt{\frac{u(1 - v - p + u)}{p(1-p)}}$ |
| AUC | $1 - \frac{\mathrm{FP}\cdot\mathrm{FN}}{(\mathrm{TP}+\mathrm{FN})(\mathrm{FP}+\mathrm{TN})}$ | $1 - \frac{(v-u)(p-u)}{p(1-p)}$ |

*Table 1. Examples of performance metrics.*

Throughout the paper we assume $\Phi(u, v, p)$ is bounded from above and from below.²

²In fact, for essentially all metrics used in practice it holds that $0 \le \Phi(u, v, p) \le 1$.

### 2.1. Formal Definitions of PU and ETU

**Definition 1 (Population Utility (PU)).** Given a distribution $P$ and a classifier $h$, the PU of $h$ for a performance metric $\Phi$ is defined as $\Phi(u(h), v(h), p)$. We let $h^*_{\mathrm{PU}}$ denote any maximizer of the PU,
$$h^*_{\mathrm{PU}} \in \operatorname*{argmax}_h \ \Phi(u(h), v(h), p).$$

In words, the PU is obtained by taking the value of the metric $\Phi$ evaluated at the expected confusion matrix of $h$ over $P$. Thus, one can think of the PU as evaluating the classifier $h$ on a single test set of infinite size drawn i.i.d. from $P$. In contrast, ETU evaluates the expected utility for a fixed-size test set. Formally, given a sample $S = \{(x_i, y_i)\}_{i=1}^n$ of size $n$, generated i.i.d. from $P$, we let $\hat u(h), \hat v(h), \hat p$ denote the corresponding empirical quantities:
$$\hat u(h) = \frac{1}{n}\sum_{i=1}^n h(x_i)\, y_i, \qquad \hat v(h) = \frac{1}{n}\sum_{i=1}^n h(x_i), \qquad \hat p = \frac{1}{n}\sum_{i=1}^n y_i,$$
and the empirical value of the metric $\Phi$ is then $\Phi(\hat u(h), \hat v(h), \hat p)$.

**Definition 2 (Expected Test Utility (ETU)).** Let $\mathbf{x} = (x_1, \ldots, x_n) \in \mathcal{X}^n$ be an arbitrary sequence of inputs. Given a distribution $P$ and a classifier $h$, the ETU of $h$ for a performance metric $\Phi$ conditioned on $\mathbf{x}$ is defined as³
$$\mathbb{E}_{\mathbf{y}|\mathbf{x}}\big[\Phi(\hat u(h), \hat v(h), \hat p)\big],$$
where the expectation over $\mathbf{y} = (y_1, \ldots, y_n)$ is with respect to the conditional distribution $P(y|x)$, i.i.d. over the examples. We let $h^*_{\mathrm{ETU}}(\mathbf{x})$ denote any maximizer of the ETU,
$$h^*_{\mathrm{ETU}}(\mathbf{x}) \in \operatorname*{argmax}_h \ \mathbb{E}_{\mathbf{y}|\mathbf{x}}\big[\Phi(\hat u(h), \hat v(h), \hat p)\big].$$

³The conditional expectation $\mathbf{y}|\mathbf{x}$ is defined up to a zero-measure set (over $\mathbf{x}$), but this does not create any problems as we always consider $\mathbf{x}$ being sampled from the data distribution.

One can think of ETU as evaluating the classifier $h$ on infinitely many test sets of size $n$ drawn i.i.d. from $P$. We will see (in Section 4) that the optimal predictions (in both the PU and ETU approaches) can be accurately estimated using the conditional probabilities $P(y_i|x_i)$. In practice, we first obtain an estimator of the conditional probability and then compute the optimal predictions on test data based on their conditional probability estimates.

**Remark 1.** More generally, ETU optimizes the expected utility $\mathbb{E}_{\mathbf{y},\mathbf{x}}\big[\Phi(\hat u(h), \hat v(h), \hat p)\big]$. However, it is clearly sufficient to analyze the predictions at any given $\mathbf{x}$ (Natarajan et al., 2016), as in Definition 2.

### 2.2. Well-behaved Performance Metrics

The two frameworks treat the metrics as utility measures (i.e., they are to be maximized). Further, it is reasonable to expect that $\Phi(h, P)$ is non-decreasing in the true positive and true negative rates (and indeed, virtually all performance measures used in practice behave this way). As shown by Natarajan et al. (2016), such monotonicity in the true positive and true negative rates implies another property, called TP monotonicity, which is better suited to the parameterization employed here.

**Definition 3 (TP monotonicity).** $\Phi(u, v, p)$ is said to be TP monotonic if for any $v, p$ and $u_1 > u_2$, it holds that $\Phi(u_1, v, p) > \Phi(u_2, v, p)$.

It is easy to verify that all measures in Table 1 are TP monotonic.
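To make Definitions 1 and 2 concrete, here is a minimal sketch (ours, not the paper's code) that computes the ETU of Definition 2 exactly for a tiny test set by enumerating all $2^n$ label vectors; restricting attention to "predict positive on the top-$k$ instances by $\eta$" is justified by TP monotonicity, and the $0/0$ convention for $F_1$ is an assumption of the sketch.

```python
# Minimal sketch: exact ETU by enumeration (feasible only for very small n).
from itertools import product

def etu(phi, h, eta):
    """Exact E_{y|x}[Phi(u_hat, v_hat, p_hat)] for predictions h and probabilities eta."""
    n = len(eta)
    value = 0.0
    for y in product([0, 1], repeat=n):
        prob = 1.0
        for yi, ei in zip(y, eta):                       # P(y | x) under independence
            prob *= ei if yi == 1 else 1.0 - ei
        u = sum(hi * yi for hi, yi in zip(h, y)) / n     # empirical u_hat(h)
        v = sum(h) / n                                   # empirical v_hat(h)
        p = sum(y) / n                                   # empirical p_hat
        value += prob * phi(u, v, p)
    return value

def f1(u, v, p):
    return 2.0 * u / (p + v) if (p + v) > 0 else 1.0     # 0/0 convention (assumption)

eta = [0.9, 0.6, 0.2, 0.1]                               # sorted decreasingly
for k in range(len(eta) + 1):                            # candidate top-k predictions
    h = [1] * k + [0] * (len(eta) - k)
    print(k, etu(f1, h, eta))
```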
A contribution of this work is to develop a notion of regularity for metrics that helps establish statistical connections between the two frameworks and their optimal classifiers. We call it p-Lipschitzness, defined next.

**Definition 4 (p-Lipschitzness).** $\Phi(u, v, p)$ is said to be p-Lipschitz if
$$|\Phi(u, v, p) - \Phi(u', v', p')| \le U_p|u - u'| + V_p|v - v'| + P_p|p - p'|,$$
for any feasible $u, v, p, u', v', p'$.

The Lipschitz constants $U_p, V_p, P_p$ are allowed to depend on $p$, in contrast to standard Lipschitz functions. The rationale behind p-Lipschitzness is that we want to control the change in value of the measure under small changes in its arguments. This property turns out to be essential for showing the equivalence between the ETU and PU approaches. On the other hand, if we simply used the standard definition of a Lipschitz function (with global constants), it would not be satisfied by many interesting measures. Hence, we weaken the Lipschitz property by allowing the constants to vary as functions of $p$. One can also show that the general linear-fractional performance metrics studied in (Koyejo et al., 2014a; Narasimhan et al., 2015; Kotłowski & Dembczyński, 2016) satisfy p-Lipschitzness under mild conditions (Appendix A).

**Proposition 1.** All measures in Table 1 are p-Lipschitz.

*Proof.* We only give a proof for the $F_\beta$-measure here (see Appendix A for the rest). For ease, let us denote $F_\beta(u, v, p)$ by $F_\beta$ and $F_\beta(u', v', p')$ by $F'_\beta$. Let $\Delta u = u - u'$, $\Delta v = v - v'$, $\Delta p = p - p'$. We have:
$$|F_\beta - F'_\beta| = \frac{(1+\beta^2)\,\big|u(\beta^2 p' + v') - u'(\beta^2 p + v)\big|}{(\beta^2 p + v)(\beta^2 p' + v')} = \frac{(1+\beta^2)\,\big|\Delta u\,(\beta^2 p' + v') - u'\beta^2 \Delta p - u'\Delta v\big|}{(\beta^2 p + v)(\beta^2 p' + v')}$$
$$\le \frac{1+\beta^2}{\beta^2 p + v}\left(|\Delta u| + \frac{u'\beta^2}{\beta^2 p' + v'}|\Delta p| + \frac{u'}{\beta^2 p' + v'}|\Delta v|\right).$$
Since $u' \le \min\{p', v'\}$, we have $\frac{\beta^2 u'}{\beta^2 p' + v'} \le 1$ and $\frac{u'}{\beta^2 p' + v'} \le 1$, and thus we can choose $U_p = V_p = P_p = \frac{1+\beta^2}{\beta^2 p}$. $\square$

As an example of a metric which is not p-Lipschitz, consider precision, defined as $\Phi(u, v, p) = \frac{u}{v}$. Indeed, if $v$ is close to zero, choosing $v' = 2v$, $u' = u$ and $p' = p$ gives
$$\Phi(u, v, p) - \Phi(u', v', p') = \frac{u}{v} - \frac{u}{2v} = \frac{u}{2v},$$
which can be arbitrarily large for sufficiently small $v$, while the difference $|v - v'| = v$ is small. As it turns out in Section 3.2, this pathological behavior of the precision metric is responsible for a large deviation between PU and ETU, which suggests that p-Lipschitzness is in some sense necessary for establishing connections.

## 3. Equivalence of PU and ETU

Most of the existing literature on optimizing non-decomposable classification metrics focuses on one of the two approaches in isolation. In this section, we show that the two approaches are in fact asymptotically equivalent for a range of well-behaved metrics. Informally, given a distribution $P$ and a performance metric $\Phi$, our first result is that for sufficiently large $n$, the PU of the associated $h^*_{\mathrm{ETU}}$ is arbitrarily close to that of $h^*_{\mathrm{PU}}$, and likewise, the ETU of $h^*_{\mathrm{PU}}$ is arbitrarily close to that of $h^*_{\mathrm{ETU}}$. In contrast, we also show that the PU and ETU optimal classifiers may differ for small samples.

### 3.1. Asymptotic Equivalence

The intuition behind the equivalence lies in the observation that the optimal classifiers under the two approaches exhibit a very simple, similar form, under mild assumptions on the distribution (Koyejo et al., 2014b; Narasimhan et al., 2014b; Natarajan et al., 2016). Let $\eta(x) := P(y = 1|x)$ denote the conditional probability of the positive class as a function of $x$.
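As a quick numerical illustration, the following minimal sketch assumes a synthetic distribution with $x \sim U(0,1)$ and $\eta(x) = x$ (an assumption of the sketch, not taken from the paper), and compares the closed-form PU value of $F_1$ for a thresholded classifier with a Monte Carlo estimate of its ETU value, averaging over draws of both $\mathbf{x}$ and $\mathbf{y}$ (cf. Remark 1); the gap shrinks as the test-set size $n$ grows.

```python
# Minimal sketch: PU vs. ETU for h(x) = 1[eta(x) >= tau] under eta(x) = x, x ~ U(0,1).
import numpy as np

rng = np.random.default_rng(0)
tau = 0.5

def pu_f1(tau):
    # u = E[x 1(x >= tau)] = (1 - tau^2)/2,  v = 1 - tau,  p = E[x] = 1/2
    u, v, p = (1 - tau**2) / 2, 1 - tau, 0.5
    return 2 * u / (p + v)

def etu_f1(tau, n, reps=2000):
    vals = []
    for _ in range(reps):
        x = rng.uniform(size=n)
        h = (x >= tau).astype(float)
        y = (rng.uniform(size=n) < x).astype(float)   # y_i ~ Bernoulli(eta(x_i))
        u, v, p = (h * y).mean(), h.mean(), y.mean()
        vals.append(2 * u / (p + v) if (p + v) > 0 else 1.0)
    return float(np.mean(vals))

for n in [10, 100, 1000]:
    print(n, abs(pu_f1(tau) - etu_f1(tau, n)))        # gap decreases with n
```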
The following lemma shows that for any fixed classifier $h$ that thresholds $\eta(x)$, and a sufficiently large sample size $n$, its performance measured with respect to PU and with respect to ETU are close; in particular, they differ by a term that decays as fast as $O(1/\sqrt{n})$. In fact, the result holds uniformly over all such binary classifiers.

**Lemma 1.** Let $H = \{h \mid h = \mathbf{1}_{\eta(x) \ge \tau},\ \tau \in [0, 1]\}$ be the class of thresholded binary decision functions. Let $\Phi$ be a performance metric which is p-Lipschitz. Then, with probability at least $1 - \delta$ over a random sample $S = \{(x_i, y_i)\}_{i=1}^n$ of size $n$ generated i.i.d. from $P$, it holds uniformly over all $h \in H$ that
$$\Big|\Phi\big(u(h), v(h), p\big) - \mathbb{E}_{\mathbf{y}|\mathbf{x}}\big[\Phi(\hat u(h), \hat v(h), \hat p)\big]\Big| \;\le\; L_p\sqrt{\frac{2\log\frac{n+1}{\delta}}{2n}} + \frac{L_p}{\sqrt{n}},$$
where $\hat u(h), \hat v(h), \hat p$ are the empirical quantities evaluated on $S$, and $L_p = \max\{U_p, V_p, P_p\}$.

**Remark 2.** Lemma 1 generalizes the result obtained by Ye et al. (2012) for the $F_\beta$-measure to arbitrary p-Lipschitz metrics. Furthermore, using a more careful bounding technique, we are able to get a better dependence on the sample size $n$, essentially $O(1/\sqrt{n})$ (neglecting logarithmic terms). In fact, this dependence cannot be improved any further in general (see Appendix B).

The uniform convergence result in Lemma 1 enables the first main result of this work. In particular, the convergence holds when the optimal classifiers with respect to ETU and PU are of the thresholded form, i.e. $h^*_{\mathrm{PU}} \in H$, and $h^*_{\mathrm{ETU}}(\mathbf{x}) \in H$ almost surely (with respect to the random sample of inputs $\mathbf{x}$), where $H = \{h \mid h = \mathbf{1}_{\eta(x) \ge \tau},\ \tau \in [0, 1]\}$ is the class of threshold functions on $\eta(x)$. Several recent results have shown that the optimal classifier for many popular metrics (including all metrics in Table 1) indeed has the thresholded form (Narasimhan et al., 2014a; Lewis, 1995), under a mild condition related to the continuity of the distribution of $\eta(x)$ (see the proof of Theorem 1 in Appendix B.2 for details):

**Assumption 1.** The random variable $\eta(x)$ has a density (with respect to the Lebesgue measure) on $[0, 1]$.

We are now ready to state the result. Proofs omitted in the main text are supplied in the Appendix.

**Theorem 1.** Let $\Phi$ be a performance metric that is TP monotonic and p-Lipschitz, and let $P$ be a distribution satisfying Assumption 1. Consider the ETU-optimal classifier $h^*_{\mathrm{ETU}}$ (Definition 2) and the PU-optimal classifier $h^*_{\mathrm{PU}}$ (Definition 1). Then, for any given $\epsilon$ and $\delta$, we can choose $n$ large enough (in Definition 2 of ETU) such that, with probability at least $1 - \delta$ over the random choice of the sample of inputs $\mathbf{x}$, we have
$$\Big|\Phi\big(u(h^*_{\mathrm{ETU}}(\mathbf{x})), v(h^*_{\mathrm{ETU}}(\mathbf{x})), p\big) - \Phi\big(u(h^*_{\mathrm{PU}}), v(h^*_{\mathrm{PU}}), p\big)\Big| \le \epsilon.$$
Similarly, for large enough $n$, with probability at least $1 - \delta$,
$$\Big|\mathbb{E}_{\mathbf{y}|\mathbf{x}}\big[\Phi(\hat u(h^*_{\mathrm{ETU}}(\mathbf{x})), \hat v(h^*_{\mathrm{ETU}}(\mathbf{x})), \hat p)\big] - \mathbb{E}_{\mathbf{y}|\mathbf{x}}\big[\Phi(\hat u(h^*_{\mathrm{PU}}), \hat v(h^*_{\mathrm{PU}}), \hat p)\big]\Big| \le \epsilon.$$

**Remark 3.** In essence, Theorem 1 says that, for large sample sizes, the optimum in the sense of one approach gives an accurate estimate (or a proxy) of the optimum in the sense of the other approach. Our characterization of p-Lipschitzness is key to showing the equivalence.

### 3.2. Finite Sample Regime

The aforementioned result is asymptotic; to elucidate the point, we now give an example where the optimal classifiers corresponding to PU and ETU differ. It is important to be aware of such extremities, especially when one applies a learned model to test data of modest sample size. The way we argue a lower bound is by specifying a metric and a distribution such that, on a randomly obtained test set of modest size, the gap in the empirical metric computed on the test data for the two optimal classifiers can be large.
As one is typically interested primarily in the empirical metric on a given test set, focusing on the empirical metric ensures fairness and avoids favoring either definition.

**Example.** For some constant $\alpha > 0$, consider the (adjusted) empirical precision metric defined as
$$\Phi_{\mathrm{Prec}}\big(\hat u(h(\mathbf{x})), \hat v(h(\mathbf{x})), p\big) = \frac{\hat u(h(\mathbf{x}))}{\hat v(h(\mathbf{x})) + \alpha}.$$
Note that $\Phi_{\mathrm{Prec}} \in [0, \frac{1}{1+\alpha}]$; furthermore, it is p-Lipschitz with Lipschitz constant $V_p = \frac{1}{\alpha}$ (see Definition 4). Thus, choosing a very small value of $\alpha$ implies a very high Lipschitz constant, and in turn the metric becomes less "stable". To establish the desired lower bound, we choose a small $0 < \alpha \ll 1$. Let $\{X_1, X_2, X_3\}$ denote a partition of the instance space $\mathcal{X}$, i.e. $\cup_{i=1}^3 X_i = \mathcal{X}$ and $X_i \cap X_j = \emptyset$ for any pair $i \ne j$. Consider the joint distribution $P$ defined as
$$P(y = 1 \mid x \in X_1) = 1, \qquad P(y = 1 \mid x \in X_3) = 0, \qquad P(y = 1 \mid x \in X_2) = 1 - \epsilon = 1 - \alpha, \qquad (1)$$
$$P(X_1) + P(X_3) = \epsilon^2, \qquad P(X_2) = 1 - \epsilon^2,$$
for some $\epsilon$ with $1 - \epsilon^2 > 0$, and note that the distribution is defined to depend on our choice of $\alpha$. The last line of the above set of equations says that the distribution has a small region where labels are deterministically positive or negative, but is overwhelmingly positive elsewhere.

**Theorem 2.** Let $\mathbf{x} = \{x_1, x_2, \ldots, x_n\}$ denote a set of instances drawn i.i.d. from the distribution $P$, and let $\mathbf{y} = \{y_1, y_2, \ldots, y_n\}$ denote their labels drawn from the same distribution. Then, with probability at least $1 - \epsilon^2 - \epsilon^n$,
$$\Phi_{\mathrm{Prec}}\big(\hat u(h^*_{\mathrm{ETU}}(\mathbf{x})), \hat v(h^*_{\mathrm{ETU}}(\mathbf{x})), \hat p\big) - \Phi_{\mathrm{Prec}}\big(\hat u(h^*_{\mathrm{PU}}), \hat v(h^*_{\mathrm{PU}}), \hat p\big) \ \ge\ \frac{1}{n(1 + \alpha)}.$$

## 4. Algorithms: Optimization and Conditional Probability Estimation

The characterization of the optimal classifier as a thresholding of the conditional probability yields simple and efficient PU-consistent estimators. The idea is to first obtain an estimator of the conditional probability using training data, and then search for an optimal threshold on a separate validation set (Narasimhan et al., 2014b; Koyejo et al., 2014b). The threshold search can be implemented efficiently in linear time (assuming the probabilities are pre-sorted). In contrast, although a similar thresholding characterization exists for ETU (Natarajan et al., 2016), evaluation and prediction require the computation of an expensive expectation (Definition 2). For general metrics, there is an $O(n^3)$ procedure to determine the optimal test set labeling (Jansche, 2007; Chai, 2005; Natarajan et al., 2016), and the procedure can be sped up to $O(n^2)$ in some special cases (Ye et al., 2012; Natarajan et al., 2016). Here, we consider an approximation to ETU that requires only $O(n)$ computation, yet achieves error $O(n^{-3/2})$ compared to exact optimization.

### 4.1. Approximation Algorithms

Recall that ETU seeks to find the classifier of the form
$$h^*_{\mathrm{ETU}}(\mathbf{x}) = \operatorname*{argmax}_h \ \mathbb{E}_{\mathbf{y}|\mathbf{x}}\big[\Phi(\hat u(h), \hat v(h), \hat p)\big].$$
Following (Lewis, 1995; Natarajan et al., 2016) we know that when $\Phi$ is TP monotonic, it suffices to sort the observations in decreasing order of $\eta(x)$ and assign positive labels to the top $k$ of them, for $k = 0, \ldots, n$. Unfortunately, for each $k$, we need to calculate the expected utility measure, which is time consuming, requiring $O(n^2)$ in general.

**Algorithm 1: Approximate ETU Consistent Classifier**
1. Input: $\Phi$ and sorted estimates $\eta_i$, $i = 1, 2, \ldots, n$
2. Init $s_i = 0$ for $i \in [n]$, $\tilde p = \frac{1}{n}\sum_{i=1}^n \eta_i$, $\tilde u_0 = 0$
3. Set $\Phi_0 = \Phi(0, 0, \tilde p)$
4. for $k = 1, 2, \ldots, n$ do
5. &nbsp;&nbsp;&nbsp; Set $\tilde u_k = \tilde u_{k-1} + \frac{\eta_k}{n}$, $\hat v_k = \frac{k}{n}$
6. &nbsp;&nbsp;&nbsp; Set $\Phi_k = \Phi(\tilde u_k, \hat v_k, \tilde p)$ (via Lemma 2 or 3)
7. end for
8. $k^* \leftarrow \operatorname{argmax}_{k = 0, \ldots, n} \Phi_k$
9. return $s$ such that $s_i = 1$ for $i \in [k^*]$ and $s_i = 0$ otherwise
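In code, a minimal sketch of Algorithm 1 using the zeroth-order (plug-in) value $\Phi(\tilde u_k, \hat v_k, \tilde p)$ developed in Lemma 2 below (the function names and the $0/0$ convention for $F_1$ are ours, not the paper's):

```python
# Minimal sketch of Algorithm 1 with the zeroth-order plug-in approximation.
import numpy as np

def approx_etu_predict(phi, eta):
    """eta: estimated P(y=1|x) for the n test points; returns a 0/1 prediction vector."""
    eta = np.asarray(eta, dtype=float)
    order = np.argsort(-eta)                  # sort decreasingly by eta
    eta_sorted = eta[order]
    n = len(eta)
    p_tilde = eta.mean()                      # semi-empirical p
    best_k, best_val = 0, phi(0.0, 0.0, p_tilde)
    u_tilde = 0.0
    for k in range(1, n + 1):
        u_tilde += eta_sorted[k - 1] / n      # u_tilde_k = (1/n) * sum of the top-k eta
        v_hat = k / n
        val = phi(u_tilde, v_hat, p_tilde)
        if val > best_val:
            best_k, best_val = k, val
    h = np.zeros(n, dtype=int)
    h[order[:best_k]] = 1                     # positives on the top best_k instances
    return h

f1 = lambda u, v, p: 2 * u / (p + v) if (p + v) > 0 else 1.0
print(approx_etu_predict(f1, [0.9, 0.8, 0.4, 0.2, 0.05]))   # predicts positive on the top 3 here
```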
Our goal here is to approximate this expected-utility term so that it can be computed in amortized constant time per $k$; the whole procedure can then be implemented in $O(n)$ time.

Fix a binary classifier $h \colon \mathcal{X} \to \{0, 1\}$ and the input sample $\mathbf{x} = (x_1, \ldots, x_n)$. Let $\hat u(h), \hat v(h), \hat p$ denote the empirical quantities, as defined in Section 2.1. Furthermore, we define the semi-empirical quantities
$$\tilde u(h) = \frac{1}{n}\sum_{i=1}^n h(x_i)\,\eta(x_i), \qquad \tilde p = \frac{1}{n}\sum_{i=1}^n \eta(x_i)$$
(there is no need to define $\tilde v(h)$, since $\hat v(h)$ does not depend on $\mathbf{y}$). Note that $\tilde u(h) = \mathbb{E}_{\mathbf{y}|\mathbf{x}}[\hat u(h)]$ and $\tilde p = \mathbb{E}_{\mathbf{y}|\mathbf{x}}[\hat p]$.

**Zeroth-order approximation.** Our first approximation is based on Taylor-expanding the measure up to the second order:

**Lemma 2.** If $\Phi$ is twice differentiable in $(u, p)$ and all its second-order derivatives are bounded by a constant $A$, then
$$\Big|\mathbb{E}_{\mathbf{y}|\mathbf{x}}\big[\Phi(\hat u(h), \hat v(h), \hat p)\big] - \Phi(\tilde u(h), \hat v(h), \tilde p)\Big| \le \frac{A}{2n}.$$

We note that the first-order terms vanish in the Taylor approximation (proof in the Appendix). This constitutes a simple, yet powerful method for approximating the ETU utility. Algorithm 1 outlines the resulting algorithm. As shown, the classifier can be computed in $O(n)$ time overall, assuming the data is already sorted according to $\eta(x_i)$ (otherwise, the procedure is dominated by the sorting time $O(n \log n)$). We note that Lewis (1995) proposed a similar first-order approximation, albeit without any rigorous guarantee.

**Second-order approximation.** Naturally, we can get a better approximation by Taylor-expanding the measure up to the third order.

**Lemma 3.** Assume $\Phi$ is three times differentiable in $(u, p)$ and all its third-order derivatives are bounded by a constant $B$. Let $\nabla^2_{uu}, \nabla^2_{up}, \nabla^2_{pp}$ denote the second-order derivatives of $\Phi$ evaluated at $(\tilde u(h), \hat v(h), \tilde p)$. We then have
$$\Big|\mathbb{E}_{\mathbf{y}|\mathbf{x}}\big[\Phi(\hat u(h), \hat v(h), \hat p)\big] - \Phi_{\mathrm{appr}}(h)\Big| \le \frac{B}{3n^{3/2}},$$
where
$$\Phi_{\mathrm{appr}}(h) = \Phi(\tilde u(h), \hat v(h), \tilde p) + \frac{1}{2n^2}\Big[\big(\nabla^2_{uu} + 2\nabla^2_{up}\big)\,s_u + \nabla^2_{pp}\,s_p\Big],$$
$$s_p = \sum_{i=1}^n \eta(x_i)\big(1 - \eta(x_i)\big), \qquad s_u = \sum_{i=1}^n h(x_i)\,\eta(x_i)\big(1 - \eta(x_i)\big).$$

**Theorem 3 (Consistency).** Given $n$ instances $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, sort them in decreasing order of $\eta(x_i)$. For $0 \le k \le n$, let $s^{(k)}$ denote the prediction vector with the positions corresponding to the top $k$ sorted instances set to 1, and 0 otherwise.

(a) Suppose the second-order derivatives of $\Phi$ are bounded by $A$, and let
$$h_a = \operatorname*{argmax}_{s^{(k)}} \ \Phi\big(\tilde u(s^{(k)}), \hat v(s^{(k)}), \tilde p\big).$$
Then $\Phi(h^*_{\mathrm{ETU}}) - \Phi(\tilde u(h_a), \hat v(h_a), \tilde p) \le \frac{A}{n}$.

(b) Suppose the third-order derivatives of $\Phi$ are bounded by $B$, and let
$$h_b = \operatorname*{argmax}_{s^{(k)}} \ \Phi_{\mathrm{appr}}(s^{(k)}),$$
where $\Phi_{\mathrm{appr}}(h)$ is defined in Lemma 3. Then $\Phi(h^*_{\mathrm{ETU}}) - \Phi_{\mathrm{appr}}(h_b) \le \frac{2B}{3n^{3/2}}$.

As before, the approximation can be computed in $O(n)$ total time. We could also expand the function up to orders higher than the third, and get better approximations (still with $O(n)$ computation if the order of the expansion is independent of $n$) at the cost of an even more complicated approximation formula. In experiments, we find that on real datasets with test sets of size 100 or more, even the zeroth-order approximation is highly accurate.

### 4.2. Conditional Probability Estimation and Model Misspecification

So far, we have assumed that we have access to the true conditional probability $\eta(x) = P(y = 1|x)$, and the resulting classifier is a threshold function on $\eta(x)$. In practice, one employs some probability estimation procedure and gets $\hat\eta(x)$, which we call a model.⁴ Then, one uses $\hat\eta(x)$ as if it were the true conditional probability $\eta(x)$ to obtain PU or ETU classifiers.

⁴For instance, $\hat\eta(x)$ could be obtained from logistic regression or a neural network with a soft-max function on the final layer.
Note that since $\eta(x)$ is unknown and we only have access to $\hat\eta(x)$, the best we can hope for is to choose the optimal threshold on $\hat\eta(x)$ (for PU), or to choose the optimal number $k$ of test set observations to be classified as positive after sorting them according to $\hat\eta(x)$ (for ETU). Next, we investigate these finite-sample effects in practical PU and ETU procedures. For this analysis, we treat $\hat\eta(x)$ as given and fixed, and make no other assumptions on how it was obtained. Let $\hat H = \{h \mid h = \mathbf{1}_{\hat\eta(x) \ge \tau},\ \tau \in [0, 1]\}$ denote the class of binary threshold functions on $\hat\eta(x)$.

Consider PU first, and let $h^*$ be the PU-optimal classifier from $\hat H$, i.e.
$$h^* = \operatorname*{argmax}_{h \in \hat H} \ \Phi(u(h), v(h), p).$$
In practice, however, one does not have access to $P$, and thus $u(h), v(h), p$ cannot be computed. Instead, given $\hat\eta(x)$, one uses a validation sample $S = \{(x_i, y_i)\}_{i=1}^n$ to choose a threshold on $\hat\eta(x)$ (and thus a classifier from $\hat H$), by directly optimizing the empirical version of the metric on $S$:
$$\hat h = \operatorname*{argmax}_{h \in \hat H} \ \Phi(\hat u(h), \hat v(h), \hat p).$$
We would like to assess how close $\hat h$ is to $h^*$. By following the proof of Lemma 1 (which never assumes the class $H$ is based on thresholding $\eta(x)$), it is easy to show that, with high probability,
$$\Phi(u(\hat h), v(\hat h), p) \ \ge\ \Phi(u(h^*), v(h^*), p) - \tilde O\!\left(\tfrac{1}{\sqrt{n}}\right).$$
Thus, if we have a sufficiently large validation sample at our disposal, we can set the threshold which maximizes the empirical version of the metric, and our performance is guaranteed to be $\tilde O(1/\sqrt{n})$-close to the performance of the $\Phi$-optimal classifier from $\hat H$. In other words, PU does not require knowing the true distribution in order to select the best classifier in $\hat H$; only a sufficiently large validation sample is required.

In contrast, the ETU procedure is inherently based on using $\hat\eta(x)$ as a replacement for $\eta(x)$ (which we do not know) to decide upon label assignments. Let $\mathbf{x} = (x_1, \ldots, x_n)$ be the input sample of size $n$. Assume for simplicity that the distributions of $\eta(x)$ and $\hat\eta(x)$ are continuous on $[0, 1]$, so that for any $i \ne j$, $\eta(x_i) \ne \eta(x_j)$ with probability one, and similarly for $\hat\eta$. Then, given $\mathbf{x}$ and $\hat\eta$, the ETU procedure chooses the classifier of the form
$$\hat h = \operatorname*{argmax}_{h \in \hat H} \ \mathbb{E}_{\mathbf{y} \sim \hat\eta(\mathbf{x})}\big[\Phi(\hat u(h), \hat v(h), \hat p)\big].$$
Likewise, the optimal ETU classifier in $\hat H$ is given by
$$h^* = \operatorname*{argmax}_{h \in \hat H} \ \mathbb{E}_{\mathbf{y} \sim \eta(\mathbf{x})}\big[\Phi(\hat u(h), \hat v(h), \hat p)\big],$$
i.e. by definition, the optimal classifier in the restricted class $\hat H$ involves the expectation with respect to the true $\eta$. Let us denote $\Phi_{\mathrm{ETU}}(h) = \mathbb{E}_{\mathbf{y} \sim \eta(\mathbf{x})}\big[\Phi(\hat u(h), \hat v(h), \hat p)\big]$, so that $h^*$ maximizes $\Phi_{\mathrm{ETU}}$. In the supplementary material, we show that under some mild assumptions on $\Phi$:
$$\mathbb{E}_{\mathbf{x}}\big[\Phi_{\mathrm{ETU}}(h^*) - \Phi_{\mathrm{ETU}}(\hat h)\big] \ \le\ O\!\left(\tfrac{1}{\sqrt{n}}\right) + P_p\,|p - p_{\hat\eta}| + \sup_{h \in \hat H} U_p\,\big|u(h) - u_{\hat\eta}(h)\big|,$$
where $p_{\hat\eta} = \mathbb{E}[\hat\eta(x)]$ and $u_{\hat\eta}(h) = \mathbb{E}[h(x)\hat\eta(x)]$ are the quantities corresponding to $p$ and $u(h)$, calculated by replacing the conditional probability $\eta$ with its estimate $\hat\eta$. Thus, while for the PU procedure the difference between $\hat h$ and $h^*$ diminishes as $n$ grows, this is not the case for ETU, as there are two bias terms, $|p - p_{\hat\eta}|$ and $|u(h) - u_{\hat\eta}(h)|$, which do not depend on $n$. These terms correspond to using an incorrect conditional probability $\hat\eta$ while selecting the classifier, and are present even if the sample size tends to infinity. Thus, it seems crucial for the success of the ETU procedure to have $\hat\eta$ calibrated with respect to the true distribution.
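The practical PU procedure above, by contrast, interacts with $\hat\eta$ only through the ordering it induces and scores each candidate threshold with the validation labels, which is one way to see its robustness to miscalibration. A minimal sketch of this validation-based threshold search (the helper names and the tie-breaking choice of threshold are ours; the scan is linear once the scores are sorted):

```python
# Minimal sketch: tune a threshold on eta_hat by maximizing the empirical metric
# on a labeled validation sample.
import numpy as np

def tune_threshold(phi, eta_hat, y):
    """Return (threshold, value) maximizing the empirical Phi on validation data."""
    eta_hat, y = np.asarray(eta_hat, float), np.asarray(y, float)
    order = np.argsort(-eta_hat)
    eta_s, y_s = eta_hat[order], y[order]
    n = len(y)
    p_hat = y.mean()
    best_tau, best_val = 1.0, phi(0.0, 0.0, p_hat)        # predict all negative
    tp = 0.0
    for k in range(1, n + 1):                             # predict positive on top-k
        tp += y_s[k - 1]
        val = phi(tp / n, k / n, p_hat)
        if val > best_val:
            # place the threshold halfway between the k-th and (k+1)-th sorted score
            nxt = eta_s[k] if k < n else 0.0
            best_tau, best_val = (eta_s[k - 1] + nxt) / 2, val
    return best_tau, best_val

f1 = lambda u, v, p: 2 * u / (p + v) if (p + v) > 0 else 1.0
tau, val = tune_threshold(f1, eta_hat=[0.9, 0.7, 0.55, 0.3, 0.1], y=[1, 1, 0, 1, 0])
print(tau, val)   # threshold 0.2, empirical F1 ~ 0.857 on this toy validation set
```

Together with the approximate ETU predictor sketched in Section 4.1, this is essentially the pair of procedures contrasted in the experiments of Section 5.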
A popular choice for class probability estimation is to use logistic regression. However, if the model is misspecified, which happens often in practice, the discussion above suggests that the desired ETU solution may not be achieved. Therefore, we need to learn the class probability function more carefully. Here, we consider two variants.

The first is to use the Isotron algorithm. In the case of a generalized linear model, i.e. $P(y|x) = \gamma(\langle w^*, x\rangle)$ for some unknown link function $\gamma$ and model $w^*$, Kalai & Sastry (2009) proposed a simple and elegant algorithm (see Appendix E) that alternately learns $w^*$ and the link function $\gamma$ (approximated by a piecewise linear function). It provably learns the model under certain assumptions on $P$. The model $w^*$ and the link function $\gamma$ are learned using training data, and at prediction time the link function and the scores of the training data (i.e., $x_i^\top w$) are used to calibrate the class probabilities $\eta(x)$ of test instances.

We also consider using a recalibrated logistic model, i.e., we first estimate the class probabilities via standard logistic regression, and recalibrate the probabilities by running one update of the $\gamma$ function in the Isotron algorithm (which essentially solves an isotonic regression problem via the Pool Adjacent Violators algorithm). At test time, we use the learnt $\gamma$ and the logistic model to estimate $\eta(x)$ for test instances.

## 5. Experiments

We empirically evaluate the effectiveness and accuracy of the ETU approximations introduced in Section 4.1, on synthetic as well as real datasets. We also show on several benchmark datasets that, by carefully calibrating the conditional probabilities in ETU, we can improve the classification performance.

### 5.1. Convergence of Approximations

We consider the F1 and Jaccard metrics from Table 1. We sample conditional probabilities $\eta_i$ for $n$ instances from the uniform distribution. The optimal predictions (see Definition 2) are obtained using Algorithm 1 of Natarajan et al. (2016) (which is equivalent to searching over the $2^n$ possible label vectors). Then we compute the approximate optimal predictions using the first- and second-order approximations discussed in Section 4.1. For each metric, we measure the deviation between the true and the approximate optimal values with increasing sample size in Figure 1. We observe linear convergence for the first-order approximation and quadratic convergence for the second-order approximation. This suggests that the bounds in Theorem 3 can indeed be improved for some metrics, if not in general.

*Figure 1. Convergence of approximations demonstrated on synthetic data: deviation of the approximate optimum from the exact optimum, $|F_1^{(\mathrm{app})} - F_1^*|$ (a) and $|\mathrm{Jacc}^{(\mathrm{app})} - \mathrm{Jacc}^*|$ (b), as a function of $n$, for the first- and second-order approximations.*

### 5.2. Approximations on Real Data

We report results on seven multiclass and multilabel benchmark datasets: (1) LETTERS: 16000 train, 4000 test instances; (2) SCENE: 1137 train, 1093 test; (3) YEAST: 1500 train, 917 test; (4) WEBPAGE: 6956 train, 27824 test; (5) IMAGE: 1300 train, 1010 test; (6) BREAST CANCER: 463 train, 220 test instances; (7) SPAMBASE: 3071 train, 1530 test instances.⁵ For multiclass datasets, we report results (using one-vs-all classifiers) averaged over classes (as in Natarajan et al. (2016)).

⁵See (Koyejo et al., 2014b; Ye et al., 2012) for details.
We compare the exact ETU-optimal predictions, computed using the algorithm of Natarajan et al. (2016), with the approximations. The results for the F1 and Jaccard metrics are presented in Table 2. They convincingly show that the approximations are highly accurate, and almost always indistinguishable from optimizing the true metrics, on real datasets. Note that even the first-order approximation (in fact, this is the zeroth-order approximation, as the first-order term is zero; see Section 4) achieves high accuracy, as the test set sizes are relatively large.

| DATASET | Exact F1 | Approx (first) F1 | Approx (second) F1 | Exact Jaccard | Approx (first) Jaccard | Approx (second) Jaccard |
| --- | --- | --- | --- | --- | --- | --- |
| LETTERS (26) | 0.5666 | 0.5666 | 0.5666 | 0.4273 | 0.4272 | 0.4273 |
| SCENE (6) | 0.6916 | 0.6917 | 0.6916 | 0.5374 | 0.5376 | 0.5374 |
| YEAST (14) | 0.4493 | 0.4494 | 0.4493 | 0.3242 | 0.3242 | 0.3242 |
| WEB PAGE | 0.6336 | 0.6336 | 0.6336 | 0.4637 | 0.4637 | 0.4637 |
| SPAMBASE | 0.8448 | 0.8457 | 0.8448 | 0.7314 | 0.7326 | 0.7314 |
| IMAGE | 0.8542 | 0.8542 | 0.8542 | 0.7455 | 0.7455 | 0.7455 |
| BREAST CANCER | 0.9660 | 0.9660 | 0.9660 | 0.9342 | 0.9342 | 0.9342 |

*Table 2. Comparison of ETU approximation methods on the F1 and Jaccard metrics defined in Table 1. Reported values are performance on held-out data (higher is better). For multiclass datasets (number of classes in parentheses), average performance over classes is reported. Evidently, the ETU approximations are accurate to at least 4 decimal digits in several cases.*

### 5.3. Model Misspecification

We now study how class probability estimation (CPE) and model misspecification affect the performance of the PU and ETU approaches, on the seven benchmark datasets. We compare four methods: (a) ETU with logistic regression based CPE, (b) ETU with Isotron based CPE (discussed in Section 4.2), (c) ETU with recalibrated logistic regression based CPE (discussed in Section 4.2), and (d) PU using logistic regression based CPE followed by threshold tuning on a validation set (Koyejo et al., 2014b). Additional comparisons to structured SVM (Joachims, 2005) and other classifiers are available in previously published work by others (Koyejo et al., 2014b; Natarajan et al., 2016), and are omitted here. The results are presented in Table 3.

| DATASET | Logistic F1 | Isotron F1 | Recalibrated Logistic F1 | PU F1 | Logistic Jaccard | Isotron Jaccard | Recalibrated Logistic Jaccard | PU Jaccard |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LETTERS (26) | 0.5666 | 0.5905 | 0.5722 | 0.5745 | 0.4273 | 0.4447 | 0.4304 | 0.4318 |
| SCENE (6) | 0.6916 | 0.6222 | 0.6809 | 0.4397 | 0.5374 | 0.4673 | 0.5362 | 0.4318 |
| YEAST (14) | 0.4493 | 0.4666 | 0.4629 | 0.4730 | 0.3242 | 0.3399 | 0.3414 | 0.3422 |
| WEB PAGE | 0.6336 | 0.7362 | 0.6756 | 0.6809 | 0.4637 | 0.5825 | 0.5037 | 0.5194 |
| SPAMBASE | 0.8448 | 0.7825 | 0.8905 | 0.8839 | 0.7314 | 0.5729 | 0.7837 | 0.8003 |
| IMAGE | 0.8542 | 0.8683 | 0.8569 | 0.8581 | 0.7455 | 0.7673 | 0.7704 | 0.7623 |
| BREAST CANCER | 0.9660 | 0.9669 | 0.9799 | 0.9766 | 0.9342 | 0.9359 | 0.9605 | 0.9481 |

*Table 3. Modeling conditional probabilities in ETU: logistic model vs. models calibrated using Isotron, for the F1 and Jaccard metrics defined in Table 1. Reported values are performance on held-out data (higher is better). For multiclass datasets (number of classes in parentheses), average performance over classes is reported.*

We observe that the logistic model (column 1) is insufficient for many of the datasets. The results improve in several cases using the generalized linear model estimated with Isotron (column 2).
However, there is a confounding factor: the two algorithms are very different, and the observed improvement may not necessarily be due to better CPE. To isolate this, the recalibrated logistic model results are presented in column 3. The results are in general much better than those of the standard logistic model, which suggests that it is indeed a case of model misspecification in these datasets. Finally, we present the results for the PU algorithm in column 4. We find that they closely match those of the recalibrated logistic model (except in the case of the SCENE dataset); thus, correcting for model misspecification helps demonstrate the theorized asymptotic equivalence of the PU and ETU approaches in practice.

## 6. Conclusions and Future Work

We have presented new results which elucidate the relationship between the two notions of consistency for complex binary classification metrics. Next, we plan to explore surrogates to further improve training efficiency for non-decomposable metrics. We will also extend the analysis to more complex prediction problems such as multilabel classification, where a similar dichotomy exists.

## Acknowledgments

W. Kotłowski has been supported by the Polish National Science Centre under Grant No. 2013/11/D/ST6/03050.

## References

Chai, Kian Ming Adam. Expectation of F-measures: tractable exact computation and some empirical observations of its properties. In Proceedings of the 28th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 593-594. ACM, 2005.

Choi, Seung-Seok and Cha, Sung-Hyuk. A survey of binary similarity and distance measures. Journal of Systemics, Cybernetics and Informatics, pp. 43-48, 2010.

Dembczyński, Krzysztof, Waegeman, Willem, Cheng, Weiwei, and Hüllermeier, Eyke. On label dependence and loss minimization in multi-label classification. Machine Learning, 88(1-2):5-45, 2012.

Jansche, Martin. A maximum expected utility framework for binary sequence labeling. In Annual Meeting of the Association of Computational Linguistics, volume 45, pp. 736, 2007.

Jasinska, Kalina, Dembczyński, Krzysztof, Busa-Fekete, Róbert, Pfannschmidt, Karlson, Klerx, Timo, and Hüllermeier, Eyke. Extreme F-measure maximization using sparse probability estimates. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pp. 1435-1444, 2016.

Joachims, Thorsten. A support vector method for multivariate performance measures. In Proceedings of the 22nd Intl. Conf. on Machine Learning, pp. 377-384. ACM, 2005.

Kalai, Adam and Sastry, Ravi. The Isotron algorithm: High-dimensional isotonic regression. In Conference on Learning Theory (COLT), 2009.

Kar, Purushottam, Narasimhan, Harikrishna, and Jain, Prateek. Online and stochastic gradient methods for non-decomposable loss functions. In Advances in Neural Information Processing Systems, pp. 694-702, 2014.

Kar, Purushottam, Narasimhan, Harikrishna, and Jain, Prateek. Surrogate functions for maximizing precision at the top. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 189-198, 2015.

Kotłowski, Wojciech and Dembczyński, Krzysztof. Surrogate regret bounds for generalized classification performance metrics. Machine Learning Journal, DOI 10.1007/s10994-016-5591-7, 2016.

Koyejo, Oluwasanmi, Natarajan, Nagarajan, Ravikumar, Pradeep K., and Dhillon, Inderjit S. Consistent binary classification with generalized performance metrics.
In Neural Information Processing Systems (NIPS), 2014a.

Koyejo, Oluwasanmi O., Natarajan, Nagarajan, Ravikumar, Pradeep K., and Dhillon, Inderjit S. Consistent binary classification with generalized performance metrics. In Advances in Neural Information Processing Systems, pp. 2744-2752, 2014b.

Lewis, David D. Evaluating and optimizing autonomous text classification systems. In Proceedings of the 18th Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 246-254. ACM, 1995.

Menon, Aditya, Narasimhan, Harikrishna, Agarwal, Shivani, and Chawla, Sanjay. On the statistical consistency of algorithms for binary classification under class imbalance. In Proceedings of the 30th Intl. Conf. on Machine Learning, pp. 603-611, 2013.

Mohri, Mehryar, Rostamizadeh, Afshin, and Talwalkar, Ameet. Foundations of Machine Learning. The MIT Press, 2012.

Narasimhan, Harikrishna, Vaish, Rohit, and Agarwal, Shivani. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In Neural Information Processing Systems (NIPS), 2014a.

Narasimhan, Harikrishna, Vaish, Rohit, and Agarwal, Shivani. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In Advances in Neural Information Processing Systems, pp. 1493-1501, 2014b.

Narasimhan, Harikrishna, Ramaswamy, Harish, Saha, Aadirupa, and Agarwal, Shivani. Consistent multiclass algorithms for complex performance measures. In Proceedings of the 32nd Intl. Conf. on Machine Learning, pp. 2398-2407, 2015.

Natarajan, Nagarajan, Koyejo, Oluwasanmi, Ravikumar, Pradeep, and Dhillon, Inderjit. Optimal classification with multivariate losses. In Proceedings of the 33rd International Conference on Machine Learning, pp. 1530-1538, 2016.

Waegeman, Willem, Dembczyński, Krzysztof, Jachnik, Arkadiusz, Cheng, Weiwei, and Hüllermeier, Eyke. On the Bayes-optimality of F-measure maximizers. Journal of Machine Learning Research, 15:3333-3388, 2014.

Ye, Nan, Chai, Kian Ming A., Lee, Wee Sun, and Chieu, Hai Leong. Optimizing F-measures: a tale of two approaches. In Proceedings of the Intl. Conf. on Machine Learning, 2012.