# Estimating weighted areas under the ROC curve

Andreas Maurer, Istituto Italiano di Tecnologia, am@andreas-maurer.eu

Massimiliano Pontil, Istituto Italiano di Tecnologia & University College London, massimiliano.pontil@iit.it

Abstract

Exponential bounds on the estimation error are given for the plug-in estimator of weighted areas under the ROC curve. The bounds hold for single score functions and uniformly over classes of functions, whose complexity can be controlled by Gaussian or Rademacher averages. The results justify learning algorithms which select score functions to maximize the empirical partial area under the curve (pAUC). They also illustrate the use of some recent advances in the theory of nonlinear empirical processes.

## 1 Introduction

Using the area under the ROC curve (the AUC) to evaluate score functions has a long history in medical screening and bioinformatics ([8], [9]). In the last decades several algorithms have been developed to learn score functions which maximize the AUC ([6], [24]). In some cases, however, different regions of the false positive range may not be equally relevant to assess the quality of the score and should therefore be weighted differently, say with some nonconstant weight function $W$. In melanoma detection, for example, false negatives have disastrous consequences for a patient, while a certain number of false positives is tolerable. A good scoring function should then have very large values on the right-hand side of the ROC plot. Such considerations have created interest in a partial area under the ROC curve (the pAUC), which only measures the area between two specified false positive rates, so that $W$ is a step function ([26], [15], [18]). Several algorithms have been designed with the goal of maximizing the pAUC over classes of scoring functions ([20], [21], [19], [22]).

Any measurement or optimization of the AUC or pAUC must rely on a finite number of observations, which raises the questions of estimation and generalization. These problems are well understood for the AUC ([2], [5], [23]), where $W$ is constant, but for more general weight functions estimation may be difficult or even impossible. In Proposition 1 below we will show that the bias of its plug-in estimator may be bounded away from zero even in the limit of an infinite sample.

Our first contribution shows that these problems are absent if the weight function $W$ is Lipschitz continuous. Theorem 2 shows that then the weighted AUC can be estimated at rate $n^{-1/2}$ by its plug-in estimator, where $n$ is the sample size. Theorem 3 shows that the Lipschitz-weighted AUC can be uniformly estimated over a class of score functions at rate $n^{-1/2}$ if the Gaussian complexity of the class grows as $n^{1/2}$, which is standard for most function classes in machine learning. This result also implies a statistical performance guarantee for algorithms maximizing the empirical pAUC as in [21] or [19].

Even if $W$ is Lipschitz, the weighted AUC remains a nonlinear statistical functional of challenging complexity. Our second contribution is Theorem 7, which provides a general method to control the estimation errors of any statistical functional which satisfies certain Lipschitz conditions with respect to the total-variation, Wasserstein or Kolmogorov metrics. The verification of these conditions for the weighted AUC then provides the proofs of Theorem 2 and Theorem 3.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
### 1.1 Related work

The pAUC estimator is a special case of an L-estimator (see [25]). Asymptotic results for L-estimators have been known for a long time. Helmers [10] gave Berry-Esseen type rates of normal approximation for smoothed L-statistics. Finite sample bounds for special distributions are shown in [3]. The asymptotics of different estimators for the pAUC are considered in [7]. More recent asymptotic results for the pAUC are given in [27] and [1]. [13] announces a uniform bound of order $n^{-1/4}$ for left-partial areas. We do not know of any finite sample bounds comparable to Theorem 2, or uniform bounds of order $n^{-1/2}$ comparable to Theorem 3 below.

### 1.2 Notation

With $1_S$ we denote the indicator function of a set $S$ and with $H := 1_{(0,\infty)}$ the Heaviside function. $\mathcal{P}(\mathcal{X})$ is the set of non-negative measures and $\mathcal{P}_1(\mathcal{X}) \subset \mathcal{P}(\mathcal{X})$ the set of probability measures on a measurable space $\mathcal{X}$. For $\lambda \in [0,1]$ and $\mu, \nu \in \mathcal{P}(\mathcal{X})$ we write $\lambda(\mu,\nu)$ for the convex combination $\lambda(\mu,\nu) = \lambda\mu + (1-\lambda)\nu$. If $x \in \mathcal{X}$ then $\delta_x \in \mathcal{P}_1(\mathcal{X})$ is the unit mass at $x$. If $\mu \in \mathcal{P}_1(\mathcal{X})$ and $\mathbf{X} = (X_1,\dots,X_n) \sim \mu^n$ is an iid sample of size $n$, the empirical measure is $\hat{\mu}(\mathbf{X}) = \frac{1}{n}\sum_{i=1}^n \delta_{X_i} \in \mathcal{P}_1(\mathcal{X})$. If $\mathbf{X}$ is unambiguous we write $\hat{\mu} = \hat{\mu}(\mathbf{X})$. $\mathcal{Y}^{\mathcal{X}}$ is the set of functions $f:\mathcal{X}\to\mathcal{Y}$; for $f \in \mathcal{Y}^{\mathcal{X}}$ and $\mu \in \mathcal{P}(\mathcal{X})$ the push-forward $f_\#\mu \in \mathcal{P}(\mathcal{Y})$ is defined by $f_\#\mu(A) = \mu(f^{-1}(A))$ for $A \subseteq \mathcal{Y}$. For $f \in \mathbb{R}^{\mathcal{X}}$ we write $\|f\|_\infty = \sup_{x\in\mathcal{X}}|f(x)|$, and if $(\mathcal{X},d)$ is a metric space then $\|f\|_{\mathrm{Lip}} = \sup_{x\neq y \in \mathcal{X}} |f(x)-f(y)|/d(x,y)$. For $\mu \in \mathcal{P}_1(\mathcal{X}\times\{0,1\})$ (random labeled data) the measures $\mu_0, \mu_1 \in \mathcal{P}(\mathcal{X})$ are defined by $\mu_0(A) = \mu(A\times\{0\})$ and $\mu_1(A) = \mu(A\times\{1\})$ for $A \subseteq \mathcal{X}$, and $\mu(0) = \mu(\mathcal{X}\times\{0\})$ and $\mu(1) = \mu(\mathcal{X}\times\{1\})$ denote the relative frequencies of the two labels. A summary of notation in tabular form is given in the supplement.

## 2 Weighted areas under the ROC-curve

Underlying the concept of the ROC-curve is the joint random occurrence of scores $X$ in some open, bounded interval $I \subset \mathbb{R}$ and binary labels $Y \in \{0,1\}$. We assume $(X,Y)\sim\mu$ for some law $\mu \in \mathcal{P}_1(I\times\{0,1\})$, with corresponding un-normalized measures $\mu_0,\mu_1 \in \mathcal{P}(I)$ and scalar frequencies $\mu(0)$ and $\mu(1)$. Any threshold $t \in I$ on a score $x$ induces a second labeling $H(x-t)$, alternative to $y$. The second labeling is considered correct iff it coincides with $y$. Correspondingly the true positive rate and false positive rate are the functions $\mathrm{tpr}, \mathrm{fpr}: I \to [0,1]$ defined respectively as
$$\mathrm{tpr}(t) = \frac{\mu_1((t,\infty))}{\mu(1)} \quad\text{and}\quad \mathrm{fpr}(t) = \frac{\mu_0((t,\infty))}{\mu(0)}.$$
If we assume that $\mu_0$ has a positive density, then $\mathrm{fpr}$ has a unique inverse, and the ROC-curve can be defined as the function $\mathrm{roc}: [0,1]\to[0,1]$,
$$\mathrm{roc}(u) = \mathrm{tpr}\left(\mathrm{fpr}^{-1}(u)\right).$$
The ROC-curve gives the true positive rate corresponding to a specified false positive rate. Knowledge of the ROC-curve allows us to adjust the classification threshold according to respective cost estimates for false positives and false negatives. This accounts for the great importance of ROC-curves in practice.

Let $W : [0,1] \to \mathbb{R}_+$ be some specified weight function and consider the quantity
$$\int_{[0,1]} \mathrm{roc}(u)\, W(1-u)\, du.$$
With the change of variables $u = \mathrm{fpr}(t)$ this is seen to be equal to
$$f_{W,\ell}(\mu) = \frac{1}{\mu(1)\,\mu(0)} \int_{I^2} \ell(t'-t)\, W\!\left(\frac{\mu_0((-\infty,t])}{\mu(0)}\right) d\mu_1(t')\, d\mu_0(t), \tag{1}$$
as long as $\ell$ is the Heaviside function, $\ell = H$ (see the supplement). Other choices of $\ell : \mathbb{R}\to[0,1]$ replacing or approximating $H$ will play a role when we want to learn scoring functions to optimize properties of the ROC-curve, but $\ell$ will generally be assumed to be nondecreasing. Equation (1) makes sense even when $\mathrm{fpr}$ is not uniquely invertible, and it will serve as our definition of the statistical functional $f_{W,\ell}:\mathcal{P}_1(I\times\{0,1\})\to\mathbb{R}_+$ in the sequel.
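The weighted area above can be approximated numerically once an empirical ROC-curve is available. The following minimal numpy sketch is not from the paper: the function name, the grid-based inversion of the empirical false positive rate, the synthetic data and the particular Lipschitz weight are all illustrative assumptions.

```python
import numpy as np

def weighted_roc_area(scores, labels, weight, num_grid=1000):
    """Numerically approximate the integral of roc(u) * W(1 - u) over u in [0, 1].

    scores, labels: arrays of scores x_i and binary labels y_i in {0, 1}.
    weight: a function W from [0, 1] to [0, infinity).
    """
    pos = scores[labels == 1]
    neg = scores[labels == 0]

    def tpr(t):
        # empirical true positive rate at threshold t
        return np.mean(pos > t)

    # roc(u) = tpr(fpr^{-1}(u)); invert the empirical fpr on a grid of
    # false positive rates by taking quantiles of the negative scores.
    u_grid = np.linspace(0.0, 1.0, num_grid)
    thresholds = np.quantile(neg, 1.0 - u_grid)
    vals = np.array([tpr(t) * weight(1.0 - u) for t, u in zip(thresholds, u_grid)])
    # trapezoidal rule on the grid
    return float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(u_grid)))

# Illustrative usage: positives score higher on average; W is a smooth
# surrogate for the step weight 1_[1/2, 1].
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=2000)
scores = rng.normal(loc=labels.astype(float), scale=1.0)
W = lambda v: np.clip(2.0 * v - 1.0, 0.0, 1.0)
print(weighted_roc_area(scores, labels, W))
```

With $W \equiv 1$ this reduces, up to discretization error, to the usual empirical AUC.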
Note that $f_{W,\ell}$ is monotonic in $W$ and $\ell$, and that its range lies in $[0, \|W\|_\infty]$. An important special case is the "area under the curve" (AUC), obtained if $W \equiv 1$:
$$f_{1,H}(\mu) = \int_{[0,1]} \mathrm{roc}(u)\, du = \frac{1}{\mu(1)\,\mu(0)}\int_{I^2} H(t'-t)\, d\mu_1(t')\, d\mu_0(t).$$
$f_{1,H}(\mu)$ is the probability that $t' > t$ if $t'$ and $t$ are drawn independently from the positive and the negative conditional distributions. The AUC has been widely used to measure the quality of score functions and its statistical properties have been thoroughly analyzed from different perspectives ([6], [2], [5]).

If the respective costs of false negatives and false positives are very different, different regions of the false positive range are not equally relevant to measure the quality of the score. This leads to the consideration of a weighted or partial AUC, where $W$ is nonconstant and assigns different weights to different false positive rates. In the simplest case $W$ is a step function, such as $1_{[a,b]}$ with $0 \le a < b \le 1$, but $W$ may also be chosen to approximate the value of the ROC-curve itself at a given point.

### 2.1 Estimation

The measure $\mu$, and with it the ROC-curve itself, are mathematical idealizations, accessible only through observations. We assume that we have access to an iid sample $(\mathbf{X},\mathbf{Y}) = ((X_1,Y_1),\dots,(X_n,Y_n)) \sim \mu^n$ of labeled data. This defines the empirical measure $\hat{\mu} = \hat{\mu}(\mathbf{X},\mathbf{Y}) \in \mathcal{P}_1(I\times\{0,1\})$ and corresponding empirical variants $\hat{\mu}_0, \hat{\mu}_1$, as well as the empirical rates $\hat{\mu}(0)$ and $\hat{\mu}(1)$. We will study the estimation of $f_{W,\ell}(\mu)$ by the plug-in estimator
$$f_{W,\ell}(\hat{\mu}) = \frac{1}{\hat{\mu}(1)\,\hat{\mu}(0)} \int_{I^2} \ell(t'-t)\, W\!\left(\frac{\hat{\mu}_0((-\infty,t])}{\hat{\mu}(0)}\right) d\hat{\mu}_1(t')\, d\hat{\mu}_0(t) = \frac{1}{|S_1|\,|S_0|}\sum_{i\in S_1}\sum_{j\in S_0} \ell(X_i - X_j)\, W\!\left(\frac{|\{k\in S_0 : X_k \le X_j\}|}{|S_0|}\right),$$
where $S_1 = \{i : Y_i = 1\}$ and $S_0 = \{i : Y_i = 0\}$.

The dependence of the weight function on the order statistic of $\hat{\mu}_0/\hat{\mu}(0)$ causes problems which are absent in the estimation of the AUC as in [2]. If $W$ has a discontinuity then the value of the functional depends discontinuously on the underlying law in the weak topology, and random fluctuations can make estimation impossible, even if the factor $1/(\hat{\mu}(1)\hat{\mu}(0))$ is bounded. The supplement gives a proof of the following proposition, which shows that for discontinuous $W$ the bias of the plug-in estimator may be bounded away from zero even in the limit of infinite sample sizes.

**Proposition 1** For $W = 1_{[1/2,1]}$ there exists a bounded, open interval $I$ and $\mu \in \mathcal{P}_1(I\times\{0,1\})$ such that, along samples $\mathbf{X}\sim\mu^n$ of even size $n$,
$$\lim_{n\to\infty} \mathbb{E}_{\mathbf{X}\sim\mu^n}\left[f_{W,H}(\hat{\mu}(\mathbf{X}))\right] = 1/4 < 1/2 = f_{W,H}(\mu).$$

To resolve this problem one might assume some regularity of $\mu$, but since we wish to estimate a property of an unknown distribution from observations, we prefer to assume that the weight function $W$ is Lipschitz. We still need to exclude the case that one of the observed label frequencies $\hat{\mu}(0)$ and $\hat{\mu}(1)$ is too small relative to the sample size. For $\delta > 0$ we define the event
$$A_\delta := \left\{\min\{\hat{\mu}(0),\hat{\mu}(1)\} > \sqrt{\frac{2\ln(8/\delta)}{n}}\right\}. \tag{2}$$
We then have the following result.

**Theorem 2** Suppose $W : \mathbb{R} \to [0,\infty)$ satisfies $\|W\|_\infty, \|W\|_{\mathrm{Lip}} < \infty$, that $\ell : \mathbb{R} \to [0,1]$ is nondecreasing and $\mu \in \mathcal{P}_1(I\times\{0,1\})$. Then for any $\delta > 0$ we have with probability at least $1-\delta$ in $(\mathbf{X},\mathbf{Y})\sim\mu^n$ that $A_\delta$ implies
$$\left|f_{W,\ell}(\hat{\mu}(\mathbf{X},\mathbf{Y})) - f_{W,\ell}(\mu)\right| \le \frac{\|W\|_{\mathrm{Lip}} + 9\|W\|_\infty}{\min\{\hat{\mu}(0),\hat{\mu}(1)\}^2}\,\sqrt{\frac{\ln(8/\delta)}{2n}}.$$

1. For any event $B$ the statement "with probability at least $1-\delta$ it holds that $A_\delta$ implies $B$" means $\Pr\{A_\delta \cap B^c\} < \delta$, where $B^c$ is the complement of $B$; so if we observe $A_\delta$, the bound holds with high probability.
2. Since $f_{W,H}$ is monotonic in $W$, the bound implies one-sided bounds for discontinuous weight functions, such as the step functions defining the pAUC. If $\hat{W} \le W_{\mathrm{Lip}} \le W$, where $\hat{W}$ and $W$ are discontinuous and $W_{\mathrm{Lip}}$ is Lipschitz, then with high probability $f_{W,H}(\mu) \ge f_{\hat{W},H}(\hat{\mu}) - O(1/\sqrt{n})$. Upper estimates and sandwich estimates may be constructed similarly.
3. As in the previous remark, an approximation to the value of $\mathrm{roc}$ at some $u \in [0,1]$ can be estimated with the plug-in estimators $f_{\hat{W}_n,H}(\hat{\mu})$, if the weight functions $\hat{W}_n$ are Lipschitz and, if regarded as probability densities, converge weakly to the unit mass $\delta_{1-u}$.
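The double-sum form of the plug-in estimator above can be written out directly in code. The sketch below is illustrative rather than the authors' implementation: `plug_in_estimator` and the synthetic data are assumptions, and the quadratic-time form simply mirrors the formula. The ranks passed to $W$ are exactly the empirical order statistic whose discontinuous dependence on the sample causes the problems described in Proposition 1.

```python
import numpy as np

def plug_in_estimator(scores, labels, weight, link=None):
    """Plug-in estimator of the weighted AUC in the double-sum form
    (1/(|S1||S0|)) * sum_{i in S1} sum_{j in S0}
        l(X_i - X_j) * W(|{k in S0 : X_k <= X_j}| / |S0|).
    link: the function l; defaults to the Heaviside step 1_{(0, inf)}.
    """
    if link is None:
        link = lambda d: (d > 0).astype(float)      # Heaviside function H
    pos = scores[labels == 1]                        # scores with label 1 (S1)
    neg = scores[labels == 0]                        # scores with label 0 (S0)
    # empirical CDF of the negative scores evaluated at each negative score
    ranks = np.array([np.mean(neg <= x) for x in neg])
    w = np.array([weight(r) for r in ranks])
    # double sum: rows index S1, columns index S0
    diffs = link(pos[:, None] - neg[None, :])
    return float(np.mean(diffs * w[None, :]))

# Illustrative usage with synthetic data and a Lipschitz weight.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)
scores = rng.normal(loc=labels.astype(float), scale=1.0)
W = lambda v: np.clip(2.0 * v - 1.0, 0.0, 1.0)
print(plug_in_estimator(scores, labels, W))
```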
### 2.2 Uniform bounds

Now let $\mathcal{X}$ be some space of instances and let $\mu \in \mathcal{P}_1(\mathcal{X}\times\{0,1\})$ be a law for labeled instances. Suppose that $\mathcal{H} \subseteq I^{\mathcal{X}}$ is a class of candidate scoring functions $h : \mathcal{X} \to I$. For every $h \in \mathcal{H}$ define, with a slight abuse of notation, $h : \mathcal{X}\times\{0,1\} \to I\times\{0,1\}$ by $h(x,y) = (h(x),y)$. Given a sample $(\mathbf{X},\mathbf{Y}) \sim \mu^n$, we would like to find $h \in \mathcal{H}$ so as to (approximately) maximize the value of $f_{W,H}(h_\#\mu)$, where $h_\#\mu$ is the push-forward of $\mu$ under $h$. The strategy is to pick an $h$ which (approximately) maximizes a regularized empirical surrogate $f_{W',\ell}(h_\#\hat{\mu}(\mathbf{X},\mathbf{Y}))$, where $W'$ and $\ell$ are Lipschitz lower bounds of $W$ and $H$ respectively. The situation mirrors that of support vector machines, where one minimizes the hinge loss as a Lipschitz upper bound of the 0-1 loss. Since the chosen score function $h$ depends on $(\mathbf{X},\mathbf{Y})$, it becomes a random variable, and a justification of the method is given by a high probability bound on
$$\sup_{h\in\mathcal{H}}\left( f_{W',\ell}(h_\#\hat{\mu}) - f_{W,H}(h_\#\mu)\right) \le \sup_{h\in\mathcal{H}}\left( f_{W',\ell}(h_\#\hat{\mu}) - f_{W',\ell}(h_\#\mu)\right),$$
where the inequality follows from $f_{W',\ell} \le f_{W,H}$. We can bound the right hand side.

**Theorem 3** Suppose $W : \mathbb{R} \to [0,\infty)$ satisfies $\|W\|_\infty, \|W\|_{\mathrm{Lip}} < \infty$, that $\ell : \mathbb{R} \to [0,1]$ is nondecreasing with $\|\ell\|_{\mathrm{Lip}} < \infty$, that $\mathcal{H} \subseteq I^{\mathcal{X}}$ and that $\mu \in \mathcal{P}_1(\mathcal{X}\times\{0,1\})$. Let $A_\delta$ be as in (2). Then for any $\delta > 0$ we have with probability at least $1-\delta$ in $(\mathbf{X},\mathbf{Y})\sim\mu^n$ that $A_\delta$ implies, for every $h \in \mathcal{H}$,
$$f_{W,\ell}(h_\#\mu) \ge f_{W,\ell}(h_\#\hat{\mu}) - \frac{\sqrt{2\pi}\,\|\ell\|_{\mathrm{Lip}}\left(\|W\|_\infty + \|W\|_{\mathrm{Lip}}\right)}{\hat{\mu}(0)^2\,\hat{\mu}(1)}\,\frac{G(\mathcal{H})}{n} - \frac{\|W\|_{\mathrm{Lip}} + 10\|W\|_\infty}{\min\{\hat{\mu}(0),\hat{\mu}(1)\}^2}\,\sqrt{\frac{4\ln(16n/\delta)}{n}},$$
where $G(\mathcal{H})$ is the expected Gaussian complexity $G(\mathcal{H}) = \mathbb{E}_{\mathbf{X}}\mathbb{E}_{\boldsymbol{\gamma}}\sup_{h\in\mathcal{H}}\sum_{i=1}^n \gamma_i\, h(X_i)$, with independent standard normal variables $\gamma_1,\dots,\gamma_n$.

1. Again we must observe $A_\delta$ for the bound to hold with high probability.
2. The use of Rademacher and Gaussian complexities has become standard in learning theory, and there is a large body of literature giving bounds ([4], [14], [12], [11]) which can be substituted above. These bounds apply to many different function classes, such as multi-layer neural networks or bounded sets of linear functionals in a reproducing kernel Hilbert space. Typically the bounds on the Gaussian complexity are of order $O(\sqrt{n\ln n})$, making the above bound $O(\sqrt{\ln(n)/n})$.
3. Existing algorithms of [21] or [19] optimize the pAUC for discontinuous weight functions $\hat{W}$, and practitioners may wish for guarantees for discontinuous weight functions $W$. The next corollary, which follows directly from the monotonicity of $f_{W,\ell}$ in $W$ and $\ell$, shows that we can give such guarantees of order $\sqrt{\ln(n)/n}$ whenever a Lipschitz weight function $W_{\mathrm{Lip}}$ can be jammed between $\hat{W}$ and $W$. This can be regarded as a statistical guarantee for the algorithms of [21] and [19].

**Corollary 4** Let $\hat{W}, W_{\mathrm{Lip}}, W : [0,1] \to [0,\infty)$ with $\hat{W} \le W_{\mathrm{Lip}} \le W$ and $\|W_{\mathrm{Lip}}\|_{\mathrm{Lip}} < \infty$. Then with probability at least $1-\delta$ in $(\mathbf{X},\mathbf{Y})\sim\mu^n$ we have that $A_\delta$ implies
$$\forall h \in \mathcal{H},\quad f_{W,H}(h_\#\mu) \ge f_{\hat{W},\ell}(h_\#\hat{\mu}) - B(n, W_{\mathrm{Lip}}, \mathcal{H}, \hat{\mu}, \delta),$$
where $B(n, W_{\mathrm{Lip}}, \mathcal{H}, \hat{\mu}, \delta)$ is the bound in Theorem 3.
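As a toy illustration of the strategy of this subsection, the sketch below picks Lipschitz lower bounds $W_{\mathrm{Lip}} \le W = 1_{[a,1]}$ and $\ell \le H$ and maximizes the empirical surrogate $f_{W_{\mathrm{Lip}},\ell}(h_\#\hat{\mu})$ over normalized linear score functions by random search. None of this is the machinery of [21] or [19]: the linear class, the surrogates, the optimizer and the reuse of `plug_in_estimator` from the earlier sketch are all assumptions made for illustration.

```python
import numpy as np

def make_surrogates(a=0.8, gamma=0.1):
    """Lipschitz lower bounds of the step weight W = 1_[a, 1] and of the
    Heaviside H; gamma controls the Lipschitz constants 1/gamma (illustrative)."""
    W_lip = lambda v: np.clip((v - a) / gamma, 0.0, 1.0)   # W_lip <= 1_[a,1]
    ell = lambda d: np.clip(d / gamma, 0.0, 1.0)           # ell <= H, nondecreasing
    return W_lip, ell

def fit_linear_score(X, y, n_trials=200, seed=0):
    """Pick a linear score h_theta(x) = <theta, x> (approximately) maximizing the
    empirical surrogate by random search over unit-norm directions."""
    rng = np.random.default_rng(seed)
    W_lip, ell = make_surrogates()
    best_theta, best_val = None, -np.inf
    for _ in range(n_trials):
        theta = rng.normal(size=X.shape[1])
        theta /= np.linalg.norm(theta)           # bounded class, keeps G(H) under control
        scores = X @ theta
        # plug_in_estimator is the function from the sketch in Section 2.1
        val = plug_in_estimator(scores, y, W_lip, link=ell)
        if val > best_val:
            best_theta, best_val = theta, val
    return best_theta, best_val
```

By Corollary 4, the empirical surrogate value returned here is, with high probability and up to the stated bound, a lower estimate of the true weighted area $f_{W,H}(h_\#\mu)$ of the selected score function.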
## 3 Proofs

In this section we outline the proofs. Several technical details are given in the supplementary material.

### 3.1 Conditioning on the empirical label frequencies

The appearance of the true label frequencies $\mu(0)$ and $\mu(1)$ in the various denominators in the definition (1) of $f_{W,\ell}$ is a nuisance. But if the weight function is Lipschitz, we can approximate the estimation difference for $f_{W,\ell}$ by the estimation difference of another functional $g_{W,\ell,c}$, which is independent of the label frequencies and defined for $c > 0$ as
$$g_{W,\ell,c}(\mu) := \int_I \left(\int_I \ell(t'-t)\, d\mu_1(t')\right) W\!\left(\frac{\mu_0((-\infty,t])}{c}\right) d\mu_0(t).$$
In situations where $W$, $\ell$ and $c$ are unambiguously fixed, we will simply write $g$ for $g_{W,\ell,c}$.

**Lemma 5** Let $\delta \in (0,1)$ and $A_\delta$ as in (2). Then with probability at least $1-\delta/2$ it holds that $A_\delta$ implies both $\mu(0) \ge \hat{\mu}(0)/2$ and
$$\left|f_{W,\ell}(\hat{\mu}) - f_{W,\ell}(\mu)\right| \le \frac{\left|g_{W,\ell,\mu(0)}(\hat{\mu}) - g_{W,\ell,\mu(0)}(\mu)\right|}{\hat{\mu}(0)\,\hat{\mu}(1)} + \frac{\|W\|_{\mathrm{Lip}} + 8\|W\|_\infty}{\min\{\hat{\mu}(0),\hat{\mu}(1)\}^2}\,\sqrt{\frac{\ln(8/\delta)}{2n}}.$$

The proof, a straightforward application of Hoeffding's inequality, is given in the supplement. If $W$ is Lipschitz and $\hat{\mu}(0)$, $\hat{\mu}(1)$ and $n$ are reasonably large, the lemma allows us to bound the estimation error for $f_{W,\ell}$ in terms of the estimation error of the functional $g_{W,\ell,c}$, for which we have the following result.

**Proposition 6** (i) For $\delta > 0$, with probability at least $1-\delta$,
$$\left|g_{W,\ell,c}(\mu) - g_{W,\ell,c}(\hat{\mu})\right| \le 2\|W\|_\infty\sqrt{\frac{\ln(2/\delta)}{2n}}. \tag{3}$$
(ii) Under the conditions of Theorem 3 we have with probability at least $1-\delta$ in $(\mathbf{X},\mathbf{Y})\sim\mu^n$ that
$$\sup_{h\in\mathcal{H}}\left|\mathbb{E}\, g_{W,\ell,c}(h_\#\hat{\mu}) - g_{W,\ell,c}(h_\#\hat{\mu}(\mathbf{X},\mathbf{Y}))\right| \le \frac{\sqrt{2\pi}\,\|\ell\|_{\mathrm{Lip}}\left(2\|W\|_\infty + c^{-1}\|W\|_{\mathrm{Lip}}\right)}{n}\, G(\mathcal{H}) + 2\|W\|_\infty\sqrt{\frac{\ln(1/\delta)}{n}}.$$

The proof of this proposition is not easy, but it easily implies Theorem 2 and Theorem 3.

**Proof** (Theorem 2). A union bound of the inequalities in (i) and Lemma 5, together with some algebraic simplifications.

**Proof** (Theorem 3). Set $\delta = n^{-1/2}$ in (3) and take the expectation inside the absolute value. Using $0 \le g_{W,\ell,c} \le \|W\|_\infty$ this gives for every $\mu$ (and every $h_\#\mu$) the bias bound
$$\left|g_{W,\ell,c}(\mu) - \mathbb{E}\, g_{W,\ell,c}(\hat{\mu})\right| \le 2\|W\|_\infty\sqrt{\frac{\ln(2\sqrt{n})}{2n}} + \frac{\|W\|_\infty}{\sqrt{n}}.$$
Combine this with (ii), set $c = \mu(0)$ and simplify to obtain, uniformly over $h \in \mathcal{H}$,
$$\left|g_{W,\ell,\mu(0)}(h_\#\mu) - g_{W,\ell,\mu(0)}(h_\#\hat{\mu}(\mathbf{X},\mathbf{Y}))\right| \le \frac{\sqrt{2\pi}\,\|\ell\|_{\mathrm{Lip}}\left(2\|W\|_\infty + \mu(0)^{-1}\|W\|_{\mathrm{Lip}}\right)}{n}\, G(\mathcal{H}) + 2\|W\|_\infty\sqrt{\frac{2\ln(8n/\delta)}{n}}.$$
Now assume $A_\delta$ and combine with Lemma 5 in a union bound, using also $\mu(0) \ge \hat{\mu}(0)/2$, and simplify to obtain the conclusion.

### 3.2 Plug-in estimators for Lipschitz functionals

To establish Proposition 6 we will give a general method to prove high probability bounds on the estimation and uniform estimation error for Lipschitz functionals on $\mathcal{P}_1(U)$ with $U \subseteq \mathbb{R}$. For any $\mu, \nu \in \mathcal{P}(\mathbb{R})$ we define the metrics
$$d_{TV}(\mu,\nu) = |\mu - \nu|(\mathbb{R}),\qquad d_1(\mu,\nu) = \int_{\mathbb{R}}\left|\mu((-\infty,t]) - \nu((-\infty,t])\right| dt,\qquad d_\infty(\mu,\nu) = \sup_{t\in\mathbb{R}}\left|\mu((-\infty,t]) - \nu((-\infty,t])\right|.$$
If $\mu, \nu \in \mathcal{P}_1(\mathbb{R})$ then $d_{TV}$ is the total variation metric, $d_1$ is the 1-Wasserstein metric and $d_\infty$ is the Kolmogorov metric. All these metrics have the convexity property
$$d(\lambda(\mu,\nu), \lambda(\mu',\nu')) \le \lambda\, d(\mu,\mu') + (1-\lambda)\, d(\nu,\nu') \tag{4}$$
for any $\lambda \in [0,1]$ and $\mu, \mu', \nu, \nu' \in \mathcal{P}_1(\mathbb{R})$.

**Theorem 7** Let $U \subseteq \mathbb{R}$, $f : \mathcal{P}_1(U) \to \mathbb{R}$ and let $\delta > 0$. Consider the following conditions on $f$:

(a) $\forall \mu, \nu \in \mathcal{P}_1(U)$: $|f(\mu) - f(\nu)| \le L_\infty\, d_\infty(\mu,\nu)$.

(b) $\forall \mu, \nu \in \mathcal{P}_1(U)$: $|f(\mu) - f(\nu)| \le L_1\, d_1(\mu,\nu)$.

(c) $\forall \lambda \in [0,1]$ and $\mu, \mu', \nu, \nu' \in \mathcal{P}_1(U)$ we have
$$\left|f(\lambda(\mu,\nu)) - f(\lambda(\mu,\nu')) - f(\lambda(\mu',\nu)) + f(\lambda(\mu',\nu'))\right| \le (1-\lambda)\,\lambda\, L_2\, d_1(\nu,\nu')\, d_{TV}(\mu,\mu').$$

(i) Suppose $f$ satisfies (a). Let $\mu \in \mathcal{P}_1(U)$ and $\hat{\mu} = \hat{\mu}(\mathbf{X})$ with $\mathbf{X} \sim \mu^n$. Then with probability at least $1-\delta$
$$\left|f(\mu) - f(\hat{\mu})\right| \le L_\infty\sqrt{\frac{\ln(2/\delta)}{2n}}.$$
(ii) Suppose $f$ satisfies (a), (b) and (c). Let $\mathcal{H} \subseteq U^{\mathcal{X}}$, $\mu \in \mathcal{P}_1(\mathcal{X})$ and $\hat{\mu} = \hat{\mu}(\mathbf{X})$ with $\mathbf{X} \sim \mu^n$. Then with probability at least $1-\delta$
$$\sup_{h\in\mathcal{H}}\left|f(h_\#\hat{\mu}) - \mathbb{E} f(h_\#\hat{\mu})\right| \le \frac{\sqrt{8\pi}\,(L_1 + L_2)}{n}\, G(\mathcal{H}) + L_\infty\sqrt{\frac{\ln(1/\delta)}{n}}.$$

The proof uses two nontrivial auxiliary results. For (i) we need the following version of the Dvoretzky-Kiefer-Wolfowitz theorem as sharpened by Massart [16].

**Theorem 8** If $\mu$ is a probability measure on the real line and $\hat{\mu}$ is the empirical measure for $\mathbf{X} \sim \mu^n$, then for $t > 0$
$$\Pr\left\{d_\infty(\mu,\hat{\mu}) > t\right\} \le 2e^{-2nt^2}.$$
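Theorem 8 is easy to check numerically. The sketch below is only an illustration (scipy is assumed for the Gaussian CDF, and the parameter values are arbitrary): it draws repeated samples, computes the Kolmogorov distance $d_\infty(\mu,\hat{\mu})$ between the true and empirical distribution functions, and compares the empirical exceedance frequency with the bound $2e^{-2nt^2}$.

```python
import numpy as np
from scipy import stats

def kolmogorov_distance(sample, cdf):
    """d_infty between the empirical measure of `sample` and the law with CDF `cdf`."""
    x = np.sort(sample)
    n = len(x)
    ecdf_right = np.arange(1, n + 1) / n    # empirical CDF just after each sorted point
    ecdf_left = np.arange(0, n) / n         # empirical CDF just before each sorted point
    f = cdf(x)
    return max(np.max(np.abs(ecdf_right - f)), np.max(np.abs(f - ecdf_left)))

# Monte Carlo check of Theorem 8: Pr{ d_infty(mu, mu_hat) > t } <= 2 exp(-2 n t^2).
rng = np.random.default_rng(0)
n, t, trials = 200, 0.1, 5000
exceed = np.mean([kolmogorov_distance(rng.normal(size=n), stats.norm.cdf) > t
                  for _ in range(trials)])
print(exceed, 2 * np.exp(-2 * n * t ** 2))  # empirical frequency should not exceed the bound
```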
For part (ii) we use a result about nonlinear empirical processes [17], for which we need some additional notation. For $\mathbf{x} \in \mathcal{X}^n$ and $h \in U^{\mathcal{X}}$ we write $h(\mathbf{x}) = (h(x_1),\dots,h(x_n)) \in U^n$. For $k \in \{1,\dots,n\}$ and $y, y' \in U$ we define the partial difference operator for functions $f : U^n \to \mathbb{R}$ by
$$D^k_{y,y'} f(\mathbf{x}) = f(x_1,\dots,x_{k-1},y,x_{k+1},\dots,x_n) - f(x_1,\dots,x_{k-1},y',x_{k+1},\dots,x_n).$$

**Theorem 9 (see [17])** Let $\mathbf{X} = (X_1,\dots,X_n)$ be a vector of independent random variables with values in $\mathcal{X}$ and let $\mathbf{X}'$ be iid to $\mathbf{X}$. Let $U \subset \mathbb{R}$, $\mathcal{H} \subseteq U^{\mathcal{X}}$ and $f : U^n \to \mathbb{R}$. Then for any $\delta \in (0,1)$, with probability at least $1-\delta$,
$$\sup_{h\in\mathcal{H}}\left( f(h(\mathbf{X})) - \mathbb{E}\left[f(h(\mathbf{X}'))\right]\right) \le \sqrt{2\pi}\left(2M_L(f) + J_L(f)\right) G(\mathcal{H}) + M_\infty(f)\sqrt{n\ln(1/\delta)},$$
where the three seminorms $M_\infty$, $M_L$ and $J_L$ are defined as
$$M_\infty(f) = \max_k \sup_{\mathbf{x}\in U^n,\, y,y'\in U} \left|D^k_{y,y'} f(\mathbf{x})\right|,\qquad M_L(f) = \max_k \sup_{\mathbf{x}\in U^n,\, y,y'\in U,\, y\neq y'} \frac{\left|D^k_{y,y'} f(\mathbf{x})\right|}{|y-y'|},$$
$$J_L(f) = n\,\max_{k,l:\,k\neq l}\ \sup_{\mathbf{x}\in U^n,\, z,z',y,y'\in U,\, y\neq y'} \frac{\left|D^k_{y,y'} D^l_{z,z'} f(\mathbf{x})\right|}{|y-y'|}.$$

**Proof** (of Theorem 7). (i) If (a) holds then by Theorem 8
$$\Pr\left\{\left|f(\mu) - f(\hat{\mu})\right| > t\right\} \le \Pr\left\{d_\infty(\mu,\hat{\mu}) > t/L_\infty\right\} \le 2e^{-2nt^2/L_\infty^2}.$$
Set the right hand side to $\delta$ and solve for $t$ to obtain the conclusion.

(ii) Suppose (a), (b) and (c) hold. We will use Theorem 9 and bound the seminorms for $f(\hat{\mu})$. Fix $\mathbf{x} \in U^n$ and $k \in \{1,\dots,n\}$ and set $\hat{\mu}_k = \frac{1}{n-1}\sum_{i:i\neq k}\delta_{x_i}$. Then
$$\left|D^k_{y,y'} f(\hat{\mu})\right| = \left|f\!\left(\tfrac{n-1}{n}(\hat{\mu}_k,\delta_y)\right) - f\!\left(\tfrac{n-1}{n}(\hat{\mu}_k,\delta_{y'})\right)\right| \le L_\infty\, d_\infty\!\left(\tfrac{n-1}{n}(\hat{\mu}_k,\delta_y),\ \tfrac{n-1}{n}(\hat{\mu}_k,\delta_{y'})\right) \le \frac{L_\infty}{n}\, d_\infty(\delta_y,\delta_{y'}) \le \frac{L_\infty}{n},$$
where the first inequality follows from (a), the second from the convexity of $d_\infty$ (see (4)) and the last from $d_\infty(\delta_y,\delta_{y'}) \le 1$. In the same way (b) gives $\left|D^k_{y,y'} f(\hat{\mu})\right| \le L_1|y-y'|/n$, since $d_1(\delta_y,\delta_{y'}) = |y-y'|$. This gives $M_\infty(f) \le L_\infty/n$ and $M_L(f) \le L_1/n$. To bound $J_L(f)$ let $l \neq k$ and write $\hat{\mu}_{kl} = \frac{1}{n-2}\sum_{i:i\notin\{k,l\}}\delta_{x_i}$, $\hat{\mu}_{k,z} = \frac{n-2}{n-1}\hat{\mu}_{kl} + \frac{1}{n-1}\delta_z$ and $\hat{\mu}_{k,z'} = \frac{n-2}{n-1}\hat{\mu}_{kl} + \frac{1}{n-1}\delta_{z'}$. Then
$$\left|D^k_{y,y'} D^l_{z,z'} f(\hat{\mu})\right| = \left|f\!\left(\tfrac{n-1}{n}(\hat{\mu}_{k,z},\delta_y)\right) - f\!\left(\tfrac{n-1}{n}(\hat{\mu}_{k,z},\delta_{y'})\right) - f\!\left(\tfrac{n-1}{n}(\hat{\mu}_{k,z'},\delta_y)\right) + f\!\left(\tfrac{n-1}{n}(\hat{\mu}_{k,z'},\delta_{y'})\right)\right|$$
$$\le \frac{n-1}{n^2}\, L_2\, d_1(\delta_y,\delta_{y'})\, d_{TV}(\hat{\mu}_{k,z},\hat{\mu}_{k,z'}) \le \frac{L_2}{n^2}\, d_{TV}(\delta_z,\delta_{z'})\,|y-y'| \le \frac{2L_2}{n^2}\,|y-y'|.$$
The first inequality follows from (c), the second from the convexity of $d_{TV}$ (see (4)) and the last from $d_{TV}(\delta_z,\delta_{z'}) \le 2$. This gives $J_L(f) \le 2L_2/n$. Substitution in the conclusion of Theorem 9 completes the proof.

### 3.3 Proof of Proposition 6

To apply Theorem 7 we need to control the Lipschitz properties of $g_{W,\ell,c}$. We do so at first with respect to the unnormalized measures $\mu_0$ and $\mu_1$.

**Proposition 10** Fix $W$, $\ell$ and $c$. The functional $g = g_{W,\ell,c}$ satisfies, for all $\mu, \mu', \nu, \nu' \in \mathcal{P}_1(I\times\{0,1\})$ and $\lambda \in [0,1]$:

(a) $|g(\nu) - g(\nu')| \le \|W\|_\infty\left(d_\infty(\nu_0,\nu'_0) + d_\infty(\nu_1,\nu'_1)\right)$;

(b) if $\ell$ is Lipschitz then $|g(\nu) - g(\nu')| \le \|W\|_\infty\|\ell\|_{\mathrm{Lip}}\left(d_1(\nu_0,\nu'_0) + d_1(\nu_1,\nu'_1)\right)$;

(c) if $W$ and $\ell$ are Lipschitz then
$$\left|g(\lambda(\mu,\nu)) - g(\lambda(\mu,\nu')) - g(\lambda(\mu',\nu)) + g(\lambda(\mu',\nu'))\right| \le (1-\lambda)\,\lambda\,\|\ell\|_{\mathrm{Lip}}\left(c^{-1}\|W\|_{\mathrm{Lip}} + \|W\|_\infty\right)\left(d_1(\nu_1,\nu'_1) + d_1(\nu_0,\nu'_0)\right) d_{TV}(\mu,\mu').$$

The proof is given in the supplement. We then eliminate the labels and replace the disconnected space $I\times\{0,1\}$ by a subset of the real line. Let $a = \sup I$ and $b = \sup I - \inf I$. We map $I\times\{0\}$ to $(-b,0)$ in ascending order of $I$, and we map $I\times\{1\}$ to $(0,b)$ in descending order of $I$. More formally, we define the bijection $\tau : (x,y) \in I\times\{0,1\} \mapsto (2y-1)(a-x) \in (-b,0)\cup(0,b)$. The push-forward $\mu \mapsto \tau_\#\mu$ is then a bijection between $\mathcal{P}_1(I\times\{0,1\})$ and $\mathcal{P}_1((-b,0)\cup(0,b))$ with inverse $\tau^{-1}_\#$. Then for $A \subseteq I$ we have
$$\mu_0(A) = \mu(A\times\{0\}) = \tau_\#\mu(A - a)\quad\text{and}\quad \mu_1(A) = \tau_\#\mu(a - A).$$
The supplement gives a proof of the following lemma.

**Lemma 11** For $\mu, \nu \in \mathcal{P}_1(I\times\{0,1\})$ we have

(i) $d_{TV}(\mu_0,\nu_0) + d_{TV}(\mu_1,\nu_1) = d_{TV}(\mu,\nu) = d_{TV}(\tau_\#\mu,\tau_\#\nu)$;

(ii) $d_\infty(\mu_0,\nu_0) + d_\infty(\mu_1,\nu_1) \le 2\, d_\infty(\tau_\#\mu,\tau_\#\nu)$;

(iii) $d_1(\mu_0,\nu_0) + d_1(\mu_1,\nu_1) \le 2\, d_1(\tau_\#\mu,\tau_\#\nu)$.

Since $g_{W,\ell,c}(\mu) = g_{W,\ell,c}\circ\tau^{-1}_\#(\tau_\#\mu)$, we can shift perspective to the functional $g_{W,\ell,c}\circ\tau^{-1}_\#$ on $\mathcal{P}_1((-b,0)\cup(0,b))$.
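A small numpy sketch of the label-eliminating map $\tau$ may make the shift of perspective concrete. The choice $I = (0,1)$, so that $a = b = 1$, and the synthetic sample are purely illustrative assumptions; the snippet checks that the sign of $\tau(x,y)$ encodes the label and verifies the identity $\mu_0((-\infty,t]) = \tau_\#\mu((-\infty,t-a])$ on an empirical measure.

```python
import numpy as np

def tau(x, y, a):
    """The bijection tau(x, y) = (2y - 1) * (a - x) from I x {0,1} to (-b,0) u (0,b)."""
    return (2 * y - 1) * (a - x)

rng = np.random.default_rng(0)
a = 1.0                                         # sup I for I = (0, 1)
y = rng.integers(0, 2, size=1000)               # labels
x = rng.uniform(0.0, 1.0, size=1000)            # scores in I
z = tau(x, y, a)                                # transformed sample on (-1, 0) u (0, 1)

# labels determine the sign of the transformed score
assert np.all(z[y == 0] < 0) and np.all(z[y == 1] > 0)

# empirical version of mu_0((-inf, t]) = tau_# mu((-inf, t - a])
for t in (0.25, 0.5, 0.75):
    lhs = np.mean((y == 0) & (x <= t))
    rhs = np.mean(z <= t - a)
    assert np.isclose(lhs, rhs)
```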
Proposition 10 and Lemma 11 show that the functional $\mu \mapsto g_{W,\ell,c}\circ\tau^{-1}_\#(\mu)$ satisfies the Lipschitz condition (a) of Theorem 7 with $L_\infty = 2\|W\|_\infty$ and, if $\ell$ is Lipschitz, also (b) with $L_1 = \|W\|_\infty\|\ell\|_{\mathrm{Lip}}$ and (c) with $L_2 = 2\|\ell\|_{\mathrm{Lip}}\left(c^{-1}\|W\|_{\mathrm{Lip}} + \|W\|_\infty\right)$. In Theorem 7 we then replace $U$ by $(-b,0)\cup(0,b)$, $f$ by $g_{W,\ell,c}\circ\tau^{-1}_\#$, $\mathcal{X}$ by $\mathcal{X}\times\{0,1\}$, $\mathcal{H}$ by the class of functions $\mathcal{H}' = \{(x,y) \mapsto \tau(h(x),y) : h \in \mathcal{H}\}$, and we substitute the constants $L_\infty$, $L_1$ and $L_2$ by the values given above. Theorem 7 then gives us parts (i) and (ii) of Proposition 6, since by symmetry of the standard normal distribution
$$G(\mathcal{H}') = \mathbb{E}_{\mathbf{X}}\mathbb{E}_{\boldsymbol{\gamma}}\sup_{h\in\mathcal{H}}\sum_{i=1}^n \gamma_i\,(2Y_i-1)(a - h(X_i)) = G(\mathcal{H}).$$

## 4 Conclusion

In a nonparametric setting the estimation of partial areas under the ROC-curve by plug-in estimators was shown to be impossible in general. It is nevertheless possible to give error bounds for Lipschitz weight functions, and uniform bounds which provide some justification for existing algorithms optimizing the pAUC. The method to control the uniform estimation errors for nonlinear statistical functionals in terms of Lipschitz conditions appears promising in a more general context.

## Acknowledgments and Disclosure of Funding

This work was partially supported by a grant from SAP SE.

## Broader Impact

A solid mathematical basis is beneficial to the development of practical statistical methods. We believe that the present work improves the understanding of ROC-curves and the optimization of score functions used in machine learning and medical diagnostics.

## References

[1] Gianfranco Adimari and Monica Chiogna. Jackknife empirical likelihood based confidence intervals for partial areas under ROC curves. Statistica Sinica, pages 1457-1477, 2012.

[2] Shivani Agarwal, Thore Graepel, Ralf Herbrich, Sariel Har-Peled, and Dan Roth. Generalization bounds for the area under the ROC curve. Journal of Machine Learning Research, 6(Apr):393-425, 2005.

[3] E. A. Baklanov and I. S. Borisov. Probability inequalities and limit theorems for generalized L-statistics. Lithuanian Mathematical Journal, 43(2):125-140, 2003.

[4] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.

[5] Stéphan Clémençon, Gábor Lugosi, and Nicolas Vayatis. Ranking and empirical minimization of U-statistics. The Annals of Statistics, 36(2):844-874, 2008.

[6] Corinna Cortes and Mehryar Mohri. AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems, pages 313-320, 2004.

[7] Lori E. Dodd and Margaret S. Pepe. Partial AUC estimation and regression. Biometrics, 59(3):614-623, 2003.

[8] David J. Goodenough, Kurt Rossmann, and Lee B. Lusted. Radiographic applications of receiver operating characteristic (ROC) curves. Radiology, 110(1):89-95, 1974.

[9] James A. Hanley and Barbara J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29-36, 1982.

[10] Roelof Helmers. The order of the normal approximation for linear combinations of order statistics with smooth weight functions. The Annals of Probability, 5(6):940-953, 1977.

[11] S. Kakade, S. Shalev-Shwartz, and A. Tewari. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13:1865-1890, 2012.

[12] Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pages 793-800, 2009.
[13] Purushottam Kar, Harikrishna Narasimhan, and Prateek Jain. Online and stochastic gradient methods for non-decomposable loss functions. In Advances in Neural Information Processing Systems, pages 694-702, 2014.

[14] V. Koltchinskii and D. Panchenko. Rademacher processes and bounding the risk of function learning. In E. Giné, D. Mason, and J. Wellner, editors, High Dimensional Probability II, pages 443-459, 2000.

[15] Zhenqiu Liu and Terry Hyslop. Partial AUC for differentiated gene detection. In 2010 IEEE International Conference on BioInformatics and BioEngineering, pages 310-311. IEEE, 2010.

[16] Pascal Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, pages 1269-1283, 1990.

[17] Andreas Maurer and Massimiliano Pontil. Uniform concentration and symmetrization for weak interactions. In Proceedings of the Thirty-Second Conference on Learning Theory, pages 2372-2387, 2019.

[18] Donna Katzman McClish. Analyzing a portion of the ROC curve. Medical Decision Making, 9(3):190-195, 1989.

[19] H. Narasimhan and S. Agarwal. Support vector algorithms for optimizing the partial area under the ROC curve. Neural Computation, 29(7):1919-1963, 2017.

[20] Harikrishna Narasimhan and Shivani Agarwal. A structural SVM based approach for optimizing partial AUC. In International Conference on Machine Learning, pages 516-524, 2013.

[21] Harikrishna Narasimhan and Shivani Agarwal. SVMpAUC-tight: a new support vector method for optimizing partial AUC based on a tight convex upper bound. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 167-175. ACM, 2013.

[22] Sakrapee Paisitkriangkrai, Chunhua Shen, and Anton van den Hengel. Efficient pedestrian detection by directly optimizing the partial area under the ROC curve. In Proceedings of the IEEE International Conference on Computer Vision, pages 1057-1064, 2013.

[23] Gengsheng Qin and Xiao-Hua Zhou. Empirical likelihood inference for the area under the ROC curve. Biometrics, 62(2):613-622, 2006.

[24] Alain Rakotomamonjy. Optimizing area under ROC curve with SVMs. In ROCAI, pages 71-80, 2004.

[25] Robert J. Serfling. Approximation Theorems of Mathematical Statistics, volume 162. John Wiley & Sons, 2009.

[26] Stephen D. Walter. The partial area under the summary ROC curve. Statistics in Medicine, 24(13):2025-2040, 2005.

[27] Hanfang Yang, Kun Lu, and Yichuan Zhao. A nonparametric approach for partial areas under ROC curves and ordinal dominance curves. Statistica Sinica, pages 357-371, 2017.