τ-FPL: Tolerance-Constrained Learning in Linear Time

Ao Zhang,1 Nan Li,2 Jian Pu,1 Jun Wang,1 Junchi Yan,3,1 Hongyuan Zha1,4
1Shanghai Key Laboratory of Trustworthy Computing, MOE International Joint Lab of Trustworthy Software, School of Computer Science and Software Engineering, East China Normal University, Shanghai, China
2Institute of Data Science and Technologies, Alibaba Group, Hangzhou, China
3IBM Research China
4Georgia Institute of Technology, Atlanta, USA
az.aozhang@gmail.com, nanli.ln@alibaba-inc.com, {jianpu,jwang,zha}@sei.ecnu.edu.cn, yanesta@163.com

Abstract

In many real-world applications, learning a classifier whose false-positive rate stays under a specified tolerance is appealing. Existing approaches either introduce label costs that depend on prior knowledge, or tune the parameters of traditional classifiers after the fact; both are methodologically limited because they do not directly incorporate the false-positive rate tolerance. In this paper, we propose a novel scoring-thresholding approach, τ-False Positive Learning (τ-FPL), to address this problem. We show that the scoring problem, which takes the false-positive rate tolerance into account, can be solved efficiently in linear time, and that an out-of-bootstrap thresholding method can transform the learned ranking function into a low false-positive classifier. Both theoretical analysis and experimental results show the superior performance of the proposed τ-FPL over existing approaches.

Introduction

In real-world applications such as spam filtering (Drucker, Wu, and Vapnik 1999) and medical diagnosis (Huang, Liu, and Zhou 2010), the losses of misclassifying a positive instance and a negative instance can be rather different. For instance, in medical diagnosis, misdiagnosing a patient as healthy is more dangerous than misclassifying a healthy person as sick. Meanwhile, in reality it is often infeasible to define accurate costs for these two kinds of errors (Liu and Zhou 2010; Zhou and Zhou 2016). In such situations, it is often necessary to keep the classifier working under a small tolerance τ on the false-positive rate (FPR), i.e., to allow the classifier to misclassify no more than a fraction τ of the negative instances. Traditional classifiers trained by maximizing classification accuracy or AUC are not suitable, because their goals are mismatched with this requirement.

In the literature, classification under a constrained false-positive rate is known as the Neyman-Pearson (NP) classification problem (Scott and Nowak 2005; Lehmann and Romano 2006; Rigollet and Tong 2011), and existing approaches can be roughly grouped into several categories. One common approach is cost-sensitive learning, which assigns different costs to different classes; representatives include cost-sensitive SVM (Osuna, Freund, and Girosi 1997; Davenport, Baraniuk, and Scott 2006; 2010), cost-interval SVM (Liu and Zhou 2010) and cost-sensitive boosting (Masnadi-Shirazi and Vasconcelos 2007; 2011). Though effective and efficient in handling different misclassification costs, it is usually not easy to find the appropriate misclassification cost for a specific FPR tolerance. Another group of methods formulates this task as a constrained optimization problem with the FPR tolerance as an explicit constraint (Mozer et al. 2002; Gasso et al. 2011; Mahdavi, Yang, and Jin 2013).
These methods often need to find a saddle point of the Lagrangian, leading to time-consuming alternating optimization. Moreover, a surrogate loss is often used to simplify the optimization problem, so the tolerance constraint may not be satisfied in practice. The third line of research comprises scoring-thresholding methods, which first train a scoring function and then find a threshold that meets the target FPR tolerance (Drucker, Wu, and Vapnik 1999). In practice, the scoring function can be trained either by class-conditional density estimation (Tong 2013) or by bipartite ranking (Narasimhan and Agarwal 2013a). However, density estimation is itself difficult, and most bipartite ranking methods have super-linear training complexity, which limits their scalability. Meanwhile, some methods pay special attention to the positive class. For example, asymmetric SVM (Wu et al. 2008) maximizes the margin between the negative samples and the core of the positive samples, and one-class SVM (Ben-Hur et al. 2001) finds the smallest ball that encloses the positive samples. However, these methods do not incorporate the FPR tolerance into the learning procedure either.

In this paper, we address the tolerance-constrained learning problem by proposing τ-False Positive Learning (τ-FPL), a scoring-thresholding method. In the scoring stage, we explicitly learn a ranking function that optimizes the probability of ranking any positive instance above the centroid of the worst τ fraction of the negative instances. We then show that, with the help of our newly proposed Euclidean projection algorithm, this ranking problem can be solved in linear time within the projected gradient framework. It is worth noting that this Euclidean projection problem generalizes a large family of projection problems, and our linear-time algorithm, based on bisection and divide-and-conquer, is one to three orders of magnitude faster than existing state-of-the-art methods. In the thresholding stage, we devise an out-of-bootstrap thresholding method that transforms the learned ranking function into a low false-positive classifier and is more stable than existing thresholding methods. Theoretical analysis and experimental results show that the proposed method achieves superior performance over existing approaches.

From Constrained Optimization to Ranking

In this section, we show that the FPR-tolerance problem can be transformed into a ranking problem, and then give a convex ranking loss function. Let $X = \{x \in \mathbb{R}^d : \|x\| \le 1\}$ be the instance space, and let $S = S^+ \cup S^-$ be a set of training instances, where $S^+ = \{x^+_i \in X\}_{i=1}^{m}$ and $S^- = \{x^-_j \in X\}_{j=1}^{n}$ contain $m$ and $n$ instances independently sampled from distributions $P^+$ and $P^-$, respectively. Let $0 \le \tau \le 1$ be the maximum tolerance of the false-positive rate. Consider the following optimization problem with a false-positive rate constraint, known as the Neyman-Pearson classification criterion¹:

$$\min_{f,b}\ \mathbb{P}_{x^+ \sim P^+}\!\left(f(x^+) < b\right) \quad \text{s.t.}\quad \mathbb{P}_{x^- \sim P^-}\!\left(f(x^-) > b\right) \le \tau \qquad (1)$$

where $f: X \to \mathbb{R}$ is a scoring function and $b \in \mathbb{R}$ is a threshold. With finite training instances, the corresponding empirical risk minimization problem is

$$\min_{f,b}\ L_{emp}(f,b) = \frac{1}{m}\sum_{i=1}^{m} \mathbb{I}\!\left(f(x^+_i) < b\right) \quad \text{s.t.}\quad \frac{1}{n}\sum_{j=1}^{n} \mathbb{I}\!\left(f(x^-_j) > b\right) \le \tau \qquad (2)$$

where $\mathbb{I}(u)$ is the indicator function. Although one could optimize (2) directly by finding a saddle point of its Lagrangian, this usually leads to a time-consuming alternating optimization framework (Mozer et al. 2002; Gasso et al. 2011).

¹The case of $f(x) = b$ is omitted for simplicity.
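For concreteness, the following minimal NumPy sketch (ours, not from the paper) evaluates the empirical objective and constraint of (2) for given scores and a threshold; all names are illustrative.

```python
import numpy as np

def empirical_np_risk(scores_pos, scores_neg, b, tau):
    """Evaluate the empirical Neyman-Pearson problem (2):
    objective  = fraction of positives scored below the threshold b,
    constraint = fraction of negatives scored above b must not exceed tau."""
    objective = np.mean(scores_pos < b)   # (1/m) * sum I(f(x+) < b)
    fpr = np.mean(scores_neg > b)         # (1/n) * sum I(f(x-) > b)
    return objective, fpr, fpr <= tau

# toy usage with random scores
rng = np.random.default_rng(0)
s_pos, s_neg = rng.normal(1.0, 1.0, 100), rng.normal(0.0, 1.0, 400)
print(empirical_np_risk(s_pos, s_neg, b=1.5, tau=0.05))
```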
Indeed, there have been considerable efforts to approximate this problem by introducing asymmetric costs for the two types of error into the classification learning framework (Davenport, Baraniuk, and Scott 2010):

$$\min_{f,b}\ L^{C}_{emp}(f,b) = \frac{1}{m}\sum_{i=1}^{m} \mathbb{I}\!\left(f(x^+_i) < b\right) + \frac{C}{n}\sum_{j=1}^{n} \mathbb{I}\!\left(f(x^-_j) > b\right)$$

where $C \ge 0$ is a hyper-parameter that penalizes false-positive instances. Although reasonable, we point out that all methods under this framework in fact minimize a lower bound of problem (2). This can be verified by writing (2) in the unconstrained form $\tilde{L}_{emp}(f,b)$:

$$\tilde{L}_{emp}(f,b) = \max_{\lambda \ge 0}\ \frac{1}{m}\sum_{i=1}^{m}\mathbb{I}\!\left(f(x^+_i) < b\right) + \lambda\Big(\frac{1}{n}\sum_{j=1}^{n}\mathbb{I}\!\left(f(x^-_j) > b\right) - \tau\Big) \ \ge\ \frac{1}{m}\sum_{i=1}^{m}\mathbb{I}\!\left(f(x^+_i) < b\right) + C\Big(\frac{1}{n}\sum_{j=1}^{n}\mathbb{I}\!\left(f(x^-_j) > b\right) - \tau\Big) = L^{C}_{emp}(f,b) - C\tau.$$

Thus, for fixed $C$, minimizing $L^{C}_{emp}$ is equivalent to minimizing a lower bound of $L_{emp}$. In other words, cost-sensitive learning methods are insecure in this setting. On the other hand, very little work tries to minimize a more reasonable upper bound of (2) or its equivalent problems, and a clear formulation of such an upper bound is also lacking. We address these problems in detail below.

Proposition 1. Denote by $f(x^-_{[j]})$ the $j$-th largest value in the set $\{f(x^-_i) \mid x^-_i \in S^-\}$, and by $\lceil \cdot \rceil$ the ceiling function. Then the constrained optimization problem (2) shares the same optimal solution $f^*$ with the following ranking problem:

$$\min_f\ \frac{1}{m}\sum_{i=1}^{m} \mathbb{I}\!\left(f(x^+_i) - f\big(x^-_{[\lceil \tau n \rceil]}\big) < 0\right). \qquad (3)$$

Proof. For fixed $f$, the constraint in (2) is equivalent to $b \ge f(x^-_{[\lceil \tau n \rceil]})$. Since the objective in (2) is non-decreasing in $b$, its minimum over feasible $b$ is achieved at $b = f(x^-_{[\lceil \tau n \rceil]})$. Substituting this value transforms the original problem (2) into its equivalent form (3).

Proposition 1 establishes the connection between the constrained optimization (2) and the ranking problem (3). Optimizing (3) directly is difficult because of the order-statistic operator $[\cdot]$, which is non-convex when $\lceil \tau n \rceil \ge 2$. Hence, we consider optimizing the following upper bound of (3):

$$\min_f\ \frac{1}{m}\sum_{i=1}^{m} \mathbb{I}\Big(f(x^+_i) - \frac{1}{\lceil \tau n \rceil}\sum_{j=1}^{\lceil \tau n \rceil} f\big(x^-_{[j]}\big) < 0\Big) \qquad (4)$$

which prefers the score of each positive example to exceed the mean score of the worst τ-fraction of negative examples. If τ = 0, (4) is equivalent to the original problem (3); in general, equality also holds when the scores of the worst τ-fraction of negative examples are all equal. The advantages of considering the ranking problem (4) are threefold: replacing $\mathbb{I}(\cdot)$ by a convex surrogate yields a tight convex upper bound of the original minimization problem, in contrast to cost-sensitive classification, which may only offer an unstable lower bound; its formulation leads to a linear-time ranking algorithm, matching the training complexity of cost-sensitive classification and outperforming most traditional ranking algorithms; and it is cost-free, so the generalization performance does not depend on additional hyper-parameters.

Tolerance-Constrained False Positive Learning

Based on the previous discussion, our framework is divided into two stages: scoring and thresholding. In scoring, a function $f(\cdot)$ is learned to maximize the probability of giving a positive instance a higher score than the centroid of the top τ fraction of the negative instances. In thresholding, a suitable threshold $b$ is chosen, and the final prediction for an instance $x$ is

$$y = \mathrm{sgn}(f(x) - b). \qquad (5)$$

Tolerance-Constrained Ranking

In (4), we consider the linear scoring function $f(x) = w^T x$, where $w \in \mathbb{R}^d$ is the weight vector to be learned, and replace $\mathbb{I}(u < 0)$ by its convex surrogate $l(u) = [1 - u]_+^2$.
Here $[x]_+ = \max\{x, 0\}$; kernel methods can be used for nonlinear ranking functions. The resulting learning problem is

$$\min_w\ \frac{R}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^{m} l\Big(w^T x^+_i - \frac{1}{k}\sum_{j=1}^{k} w^T x^-_{[j]}\Big) \qquad (6)$$

where $R > 0$ is the regularization parameter and $k = \lceil \tau n \rceil$. Directly minimizing (6) is challenging due to the order-statistic operator $[\cdot]$; we address this by developing its dual.

Theorem 1. Define $X^+ = [x^+_1, \ldots, x^+_m]^T$ and $X^- = [x^-_1, \ldots, x^-_n]^T$ to be the matrices containing the positive and negative instances in their rows, respectively. The dual problem of (6) can be written as

$$\min_{(\alpha, \beta) \in \Gamma_k}\ g(\alpha, \beta) = \frac{1}{2m^2 R}\,\big\|\alpha^T X^+ - \beta^T X^-\big\|^2 + \frac{1}{m}\sum_{i=1}^{m} l^*(-\alpha_i) \qquad (7)$$

where $\alpha$ and $\beta$ are dual variables, $l^*(\cdot)$ is the convex conjugate of $l(\cdot)$, and the domain $\Gamma_k$ is defined as

$$\Gamma_k = \Big\{\alpha \in \mathbb{R}^m_+,\ \beta \in \mathbb{R}^n_+ \ \Big|\ \sum_{i=1}^{m}\alpha_i = \sum_{j=1}^{n}\beta_j;\ \ \beta_j \le \frac{1}{k}\sum_{j'=1}^{n}\beta_{j'}\ \ \forall j \Big\}.$$

Let $\alpha^*$ and $\beta^*$ be the optimal solution of (7); then the optimal weight vector is

$$w^* = (mR)^{-1}\big(\alpha^{*T} X^+ - \beta^{*T} X^-\big)^T. \qquad (8)$$

Proof. Due to space limits, the proof is given in the appendix.

According to Theorem 1, learning the scoring function $f$ is equivalent to learning the dual variables $\alpha$ and $\beta$ by solving problem (7). Its optimization naturally falls into the scope of projected gradient methods; the key steps are summarized in Algorithm 1. At each iteration, we first update the solution using the gradients of the objective $g(\alpha, \beta)$, and then project the dual solution onto the feasible set $\Gamma_k$. In the sequel, we show that this projection problem can be solved efficiently in linear time. In practice, since $g(\cdot)$ is smooth, we also leverage Nesterov's method to further accelerate the convergence of our algorithm; Nesterov's method (Nesterov 2003) achieves an $O(1/T^2)$ convergence rate for smooth objective functions, where $T$ is the number of iterations.

Algorithm 1 τ-FPL Ranking
Require: $X^+ \in \mathbb{R}^{m \times d}$, $X^- \in \mathbb{R}^{n \times d}$, maximal FPR tolerance τ, regularization parameter R, stopping condition ϵ
1: Randomly initialize $\alpha_0$ and $\beta_0$
2: Set counter: $t \leftarrow 0$
3: while $t = 0$ or $|g(\alpha_t, \beta_t) - g(\alpha_{t-1}, \beta_{t-1})| > \epsilon$ do
4:   Compute the gradient of $g(\cdot)$ at the point $(\alpha_t, \beta_t)$
5:   Compute $\alpha'_{t+1}, \beta'_{t+1}$ by gradient descent
6:   Project $\alpha'_{t+1}, \beta'_{t+1}$ onto the feasible set $\Gamma_k$: $(\alpha_{t+1}, \beta_{t+1}) \leftarrow \Pi_{\Gamma_k}(\alpha'_{t+1}, \beta'_{t+1})$
7:   Update counter: $t \leftarrow t + 1$
8: end while
9: Return $w \leftarrow (mR)^{-1}(\alpha_t^T X^+ - \beta_t^T X^-)^T$
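To make Algorithm 1 concrete, here is a minimal NumPy sketch of the plain (non-accelerated) projected-gradient loop. It assumes the standard conjugate of the squared hinge, $l^*(-\alpha) = \alpha^2/4 - \alpha$ for $\alpha \ge 0$, and a hand-picked step size; the routine project_top_k_simplex is the bisection sketch given in the next subsection, and all function names are ours.

```python
import numpy as np

def tau_fpl_rank(Xp, Xn, tau, R, eps=1e-6, lr=0.01, max_iter=1000):
    """Sketch of Algorithm 1 (plain gradient descent; the paper uses
    Nesterov acceleration). project_top_k_simplex is defined below."""
    m, n = Xp.shape[0], Xn.shape[0]
    k = max(int(np.ceil(tau * n)), 1)
    alpha, beta = np.ones(m), np.full(n, m / n)        # a feasible start

    def g(a, b):
        v = Xp.T @ a - Xn.T @ b
        # dual objective (7); l*(-a) = a**2/4 - a for the squared hinge
        return v @ v / (2 * m**2 * R) + np.sum(a**2 / 4 - a) / m

    prev = np.inf
    for _ in range(max_iter):
        v = Xp.T @ alpha - Xn.T @ beta
        grad_a = Xp @ v / (m**2 * R) + (alpha / 2 - 1) / m
        grad_b = -Xn @ v / (m**2 * R)
        alpha, beta = project_top_k_simplex(alpha - lr * grad_a,
                                            beta - lr * grad_b, k)
        cur = g(alpha, beta)
        if abs(prev - cur) <= eps:
            break
        prev = cur
    return (Xp.T @ alpha - Xn.T @ beta) / (m * R)      # w*, as in (8)
```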
Linear Time Projection onto the Top-k Simplex

One of our main technical results is a linear-time algorithm for the projection onto $\Gamma_k$, even when $k$ grows linearly with $n$. For clear notation, we reformulate the projection problem as

$$\min_{\alpha \ge 0,\, \beta \ge 0}\ \frac{1}{2}\|\alpha - \alpha^0\|^2 + \frac{1}{2}\|\beta - \beta^0\|^2 \quad \text{s.t.}\quad \sum_{i=1}^{m}\alpha_i = \sum_{j=1}^{n}\beta_j,\ \ \beta_j \le \frac{1}{k}\sum_{i=1}^{m}\alpha_i\ \ \forall j. \qquad (9)$$

It should be noted that many Euclidean projection problems studied in the literature can be seen as special cases of this problem. If the term $\sum_{i=1}^{m}\alpha_i$ is fixed, or replaced by a constant upper bound $C$, we obtain the well-studied continuous quadratic knapsack problem (CQKP), $\min_\beta \|\beta - \beta^0\|^2$ s.t. $\sum_{i=1}^{n}\beta_i = C$, $0 \le \beta_i \le C_1$, where $C_1 = C/k$; several efficient methods based on median-selecting or variable-fixing techniques are available (Patriksson 2008). On the other hand, if $k = 1$, all upper-bound constraints are automatically satisfied and can be omitted; this special case has been well studied, for example in (Liu and Ye 2009) and (Li, Jin, and Zhou 2014), both of which achieve $O(n)$ complexity. Unfortunately, none of the above methods can be directly applied to the generalized case (9), because the upper bound on $\beta$ is not fixed when $k > 1$. To our knowledge, the only attempt to address an unfixed upper bound is (Lapin, Hein, and Schiele 2015), which solves a similar (but simpler) problem, $\min_\beta \|\beta - \beta^0\|^2$ s.t. $0 \le \beta_j \le \frac{1}{k}\sum_{j'=1}^{n}\beta_{j'}$, based on sorting and exhaustive search; that method has runtime complexity $O(n\log(n) + kn)$, which is super-linear, and even quadratic when $k$ and $n$ are linearly dependent. By contrast, our proposed method applies to both of the aforementioned special cases with slight changes and retains $O(n)$ complexity. The notable characteristic of our method is the efficient combination of bisection and divide-and-conquer: the former guarantees the worst-case complexity, and the latter significantly reduces the large constant factor of the bisection method.

We first introduce the following theorem, which gives a detailed description of the solution of (9).

Theorem 2. $(\alpha^* \in \mathbb{R}^m, \beta^* \in \mathbb{R}^n)$ is the optimal solution of (9) if and only if there exist dual variables $C^* \ge 0$ and $\lambda^*, \mu^* \in \mathbb{R}$ satisfying the following system of constraints:

$$C^* = \sum_{i=1}^{m}\big[\alpha^0_i - \lambda^*\big]_+ \qquad (10)$$
$$C^* = \sum_{j=1}^{n}\min\big\{\big[\beta^0_j - \mu^*\big]_+,\ C^*/k\big\} \qquad (11)$$
$$0 = \lambda^* + \mu^* + \frac{1}{k}\sum_{j=1}^{n}\big[\beta^0_j - \mu^* - C^*/k\big]_+ \qquad (12)$$

with $\alpha^*_i = [\alpha^0_i - \lambda^*]_+$ and $\beta^*_j = \min\{[\beta^0_j - \mu^*]_+,\ C^*/k\}$.

Proof. Due to space limits, the proof is given in the appendix.

Based on Theorem 2, the projection problem can be solved by finding values of the three dual variables $C$, $\lambda$ and $\mu$ that satisfy the above system. We first propose a baseline bisection method that guarantees the worst-case time complexity; a similar method was used in (Liu and Ye 2009). For brevity, denote by $\alpha^0_{[i]}$ and $\beta^0_{[i]}$ the $i$-th largest entries of $\alpha^0$ and $\beta^0$, respectively, and define the functions $C(\lambda)$, $\mu(C)$, $\delta(C)$ and $f(\lambda)$ as follows²:

$$C(\lambda) = \sum_{i=1}^{m}\big[\alpha^0_i - \lambda\big]_+ \qquad (13)$$
$$\mu(C) = \text{the value of } \mu \text{ satisfying (11)} \qquad (14)$$
$$\delta(C) = \mu(C) + C/k \qquad (15)$$
$$f(\lambda) = k\lambda + k\,\mu(C(\lambda)) + \sum_{j=1}^{n}\big[\beta^0_j - \delta(C(\lambda))\big]_+ \qquad (16)$$

The main idea of leveraging bisection to solve the system in Theorem 2 is to find the root of $f(\lambda) = 0$. For bisection to work, we need three guarantees: $f$ should be continuous; the root of $f$ should be efficiently bounded in an interval; and the values of $f$ at the two endpoints of this interval should have opposite signs. The following three lemmas ensure all of these requirements.

Lemma 1 (Zero case). $(0_m, 0_n)$ is an optimal solution of (9) if and only if $k\alpha^0_{[1]} + \sum_{j=1}^{k}\beta^0_{[j]} \le 0$.

Lemma 2 (Bounding $\lambda^*$). If $C^* > 0$, then $\lambda^* \in (-\beta^0_{[1]},\ \alpha^0_{[1]})$.

Lemma 3 (Monotonicity and convexity).
1. $C(\lambda)$ is convex, continuous and strictly decreasing on $(-\infty, \alpha^0_{[1]})$;
2. $\mu(C)$ is continuous and monotonically decreasing on $(0, +\infty)$;
3. $\delta(C)$ is continuous and strictly increasing on $(0, +\infty)$;
4. $f(\lambda)$ is continuous and strictly increasing on $(-\infty, \alpha^0_{[1]})$.
Furthermore, we can define the inverse function of $C(\lambda)$ as $\lambda(C)$ and rewrite $f(\lambda)$ as

$$f(\lambda(C)) = k\lambda(C) + k\mu(C) + \sum_{j=1}^{n}\big[\beta^0_j - \delta(C)\big]_+ \qquad (17)$$

which is a convex function of $C$, strictly decreasing on $(0, +\infty)$.

Lemma 1 deals with the special case $C^* = 0$. Lemmas 2 and 3 jointly ensure that bisection works when $C^* > 0$: Lemma 2 bounds $\lambda^*$, and Lemma 3 shows that $f$ is continuous and strictly increasing, so the values of $f$ at the two endpoints of the interval must have opposite signs.

Baseline method. We start by selecting the current $\lambda$ in the range $(-\beta^0_{[1]}, \alpha^0_{[1]}) \equiv (l, u)$. We then compute the corresponding $C$ by (13) in $O(m)$, and use the current $C$ to compute $\mu$ by (14); computing $\mu$ can be completed in $O(n)$ by a well-designed median-selecting algorithm (Kiwiel 2007). With the current $C$, $\lambda$ and $\mu$ in hand, we can evaluate the sign of $f(\lambda)$ in $O(n)$ and determine the new bounds for $\lambda$. In addition, the special case $C^* = 0$ can be checked using Lemma 1 in $O(m + n)$ by a linear-time $k$-largest element selection algorithm (Kiwiel 2005). Since the width of the interval for $\lambda$ does not depend on $m$ and $n$, the number of iterations needed to locate $\lambda^*$ is $\log(\frac{u-l}{\epsilon})$, where $\epsilon$ is the maximum tolerated error; the worst-case runtime of this algorithm is therefore $O(m + n)$. Furthermore, we can also leverage the convexity of $f(\lambda(C))$ and $C(\lambda)$ to further improve this algorithm; please refer to (Liu and Ye 2009) for more details on the related techniques.

²Indeed, for some $C$, $\mu(C)$ is not single-valued and requires a more careful definition; we omit it for brevity and leave the details to the supplementary material.
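The following NumPy sketch implements the baseline method with nested bisection, using an inner bisection for μ(C) in place of the median-selecting routine; it is our own minimal rendering for illustration, not the paper's optimized Algorithm 2.

```python
import numpy as np

def project_top_k_simplex(a0, b0, k, eps=1e-10, iters=100):
    """Projection onto Gamma_k (problem (9)) via the baseline bisection
    method built on Theorem 2."""
    # Lemma 1: check whether (0, 0) is the solution.
    if k * a0.max() + np.sort(b0)[-k:].sum() <= 0:
        return np.zeros_like(a0), np.zeros_like(b0)

    def C_of(lam):                            # eq. (13)
        return np.maximum(a0 - lam, 0.0).sum()

    def mu_of(C):                             # eq. (14): solve (11) for mu
        lo, hi = b0.min() - C / k, b0.max()   # sum >= C at lo, <= C at hi
        for _ in range(iters):
            mid = (lo + hi) / 2
            s = np.minimum(np.maximum(b0 - mid, 0.0), C / k).sum()
            if s < C:
                hi = mid                      # s decreases in mu: root is below
            else:
                lo = mid
        return (lo + hi) / 2

    # Lemma 2: lambda* lies in (-max(b0), max(a0)); f in (16) is increasing.
    lo, hi = -b0.max(), a0.max()
    while hi - lo > eps:
        lam = (lo + hi) / 2
        C = C_of(lam)
        mu = mu_of(C)
        f = k * lam + k * mu + np.maximum(b0 - mu - C / k, 0.0).sum()
        if f < 0:
            lo = lam
        else:
            hi = lam
    lam = (lo + hi) / 2
    C = C_of(lam)
    mu = mu_of(C)
    return np.maximum(a0 - lam, 0.0), np.minimum(np.maximum(b0 - mu, 0.0), C / k)
```

With project_top_k_simplex in place, the tau_fpl_rank sketch above is complete. Algorithm 2 below computes the same quantities, but prunes the breakpoint caches of C(λ), μ(C), δ(C) and f(λ) at each step to cut the constant factor.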
Although bisection solves the projection in linear time, it may converge slowly in practice. We further improve the runtime by reducing the constant factor $\log(\frac{u-l}{\epsilon})$. The improvement comes from exploiting the monotonicity of the functions $C(\lambda)$, $\mu(C)$, $\delta(C)$ and $f(\lambda)$ stated in Lemma 3.

Improved method by endpoint divide-and-conquer. Lemma 3 reveals an important chain of monotonicity among the dual variables, which can be used to improve the performance of our baseline method; the key steps are summarized in Algorithm 2. Denote by $z_t$ the value of a variable $z$ at iteration $t$. For instance, if $\lambda_t > \lambda_{t-1}$, then by Lemma 3 we have $C_t < C_{t-1}$, $\mu_t > \mu_{t-1}$ and $\delta_t < \delta_{t-1}$. This implies that we can maintain uncertainty intervals for $\lambda$, $C$, $\mu$ and $\delta$ simultaneously: as the interval of $\lambda$ shrinks, the lengths of the other three intervals can be reduced as well. On the other hand, notice that $C(\lambda)$ is a piecewise linear function (with at most $m + 1$ segments), so evaluating it only requires comparing $\lambda_t$ with the $\alpha^0_i$'s. By keeping a cache of the $\alpha^0_i$'s and discarding in advance the elements that fall outside the current interval of $\lambda$, each iteration halves the expected number of comparisons. A similar but more involved procedure applies to computing $\mu(C)$, $\delta(C)$ and $f(\lambda)$, since these functions are also piecewise linear and their main cost is the comparison against $O(m + n)$ breakpoints. As a result, for approximately linear functions (convexity is not required) and evenly distributed breakpoints, if the first bisection iteration costs $\gamma(m + n)$ time, the overall runtime of the projection algorithm is about $\gamma(m+n) + \gamma(m+n)/2 + \cdots \le 2\gamma(m + n)$, which is much less than the $\log(\frac{u-l}{\epsilon})\,\gamma(m + n)$ of the original bisection algorithm.

Algorithm 2 Linear-time Projection onto the Top-k Simplex
Require: $\alpha^0 \in \mathbb{R}^m$, $\beta^0 \in \mathbb{R}^n$, maximal accuracy ϵ
1: Calculate the initial uncertainty intervals for λ, C, δ and μ
2: Initialize the breakpoint caches for C(λ), μ(C), δ(C), f(λ): Cache_C ← {α⁰_i | ∀i}; Cache_μ, Cache_δ, Cache_f ← {β⁰_j | ∀j}
3: Initialize the partial sums of C(λ), μ(C), δ(C), f(λ) with zero
4: Set $t \leftarrow 0$ and $\lambda_0 \leftarrow (\alpha^0_{[1]} - \beta^0_{[1]})/2$
5: while $t = 0$ or $|\lambda_t - \lambda_{t-1}| > \epsilon$ do
6:   Calculate $C_t, \mu_t, \delta_t, f_t$ (leveraging the corresponding caches and partial sums)
7:   Prune the caches and update the partial sums
8:   Shrink the intervals of λ, C, δ and μ based on the sign of $f(\lambda_t)$
9:   $t \leftarrow t + 1$
10:  Set $\lambda_t$ to the midpoint of the current interval
11: end while
12: Return $\lambda^* \leftarrow \lambda_t$, $\mu^* \leftarrow \mu_t$, $C^* \leftarrow C_t$

Table 1: Complexity comparison with state-of-the-art approaches (T denotes the number of iterations).

Algorithm          | Training                                                  | Validation
τ-FPL              | O(d(m+n)/T²)                                              | Linear
Top Push           | O(d(m+n)/T²)                                              | Linear
CS-SVM             | O(d(m+n)/T)                                               | Quadratic
SVMpAUC-tight      | O((m log m + n log n + d(m+n))/T)                         | Linear
Bipartite Ranking  | O((d(m+n) + (m+n) log(m+n))/T) to O(dmn + mn log(mn)/T)   | Linear
Convergence and Computational Complexity

The following result follows immediately from the convergence guarantee of Nesterov's method.

Theorem 3. Let $\alpha_T$ and $\beta_T$ be the output of the τ-FPL algorithm after $T$ iterations. Then $g(\alpha_T, \beta_T) \le \min g(\alpha, \beta) + \epsilon$, where $T = O(1/\sqrt{\epsilon})$.

Finally, the computational cost of each iteration is dominated by the gradient evaluation and the projection step. Since the complexity of the projection step is $O(m + n)$ and the cost of computing the gradient is $O(d(m + n))$, combining this with Theorem 3 shows that, to find an ϵ-suboptimal solution, the total computational complexity of τ-FPL is $O(d(m + n)/\sqrt{\epsilon})$. Table 1 compares the computational complexity of τ-FPL with that of several state-of-the-art methods; the order of the validation complexity corresponds to the number of hyper-parameters. From this, it is easy to see that τ-FPL is asymptotically more efficient.

Out-of-Bootstrap Thresholding

In the thresholding stage, the task is to identify the boundary between the positive instances and a $(1 - \tau)$ fraction of the negative instances. Although thresholding on the training set is commonly used (Joachims 1996; Davenport, Baraniuk, and Scott 2010; Scheirer et al. 2013), it may introduce overfitting. Hence, we propose an out-of-bootstrap method to find a more accurate and stable threshold. In each round, we randomly split the training set into two sets $S_1$ and $S_2$, train on $S_1$, and select the threshold on $S_2$. The procedure can be run for multiple rounds to make use of all the training data; once it completes, the final threshold is obtained by averaging. The final scoring function, in turn, can be obtained in two ways: learn a scoring function on the full training set, or average the weights learned in the individual rounds. This method combines the advantages of out-of-bootstrap estimation and soft-thresholding techniques: accurate error estimates and reduced variance with little sacrifice in bias, which fits the setting of thresholding near the risk area.
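As an illustration of this procedure, here is a minimal sketch under our own simplifying assumptions: plain 50/50 hold-out splits rather than bootstrap resampling, and a per-round threshold taken as the empirical (1 − τ)-quantile of the held-out negative scores (the paper only specifies that the threshold is selected on S2). The function train_scorer stands for the ranking routine of Algorithm 1.

```python
import numpy as np

def oob_threshold(X_pos, X_neg, tau, train_scorer, rounds=10, seed=0):
    """Out-of-bootstrap thresholding (sketch). Each round splits the data,
    trains a scorer on one half (S1) and picks a threshold on the other (S2);
    the final weights and threshold are averaged over rounds."""
    rng = np.random.default_rng(seed)
    thresholds, weights = [], []
    for _ in range(rounds):
        perm_p, perm_n = rng.permutation(len(X_pos)), rng.permutation(len(X_neg))
        half_p, half_n = len(X_pos) // 2, len(X_neg) // 2
        w = train_scorer(X_pos[perm_p[:half_p]], X_neg[perm_n[:half_n]])  # S1
        held_out_neg = X_neg[perm_n[half_n:]] @ w                         # S2
        # assumed rule: FPR <= tau on S2 via the (1 - tau)-quantile
        thresholds.append(np.quantile(held_out_neg, 1.0 - tau))
        weights.append(w)
    # final model: average the round weights (or retrain on all data)
    return np.mean(weights, axis=0), np.mean(thresholds)
```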
Theoretical Guarantees

We now develop a theoretical guarantee for the scoring function, which bounds the probability of a positive instance being scored above a $(1 - \tau)$ fraction of the negative instances. To this end, we first define $h(x, f)$, the probability that a negative instance is ranked above $x$ by $f$, i.e., $h(x, f) = \mathbb{E}_{x^- \sim P^-}[\mathbb{I}(f(x) \le f(x^-))]$, and then measure the quality of $f$ by $P(f, \tau) = \mathbb{P}_{x^+ \sim P^+}(h(x^+, f) \ge \tau)$, the probability of a positive instance being scored below a τ fraction of the negative instances. The following theorem bounds $P(f, \tau)$ by the empirical loss $L^*_k$.

Theorem 4. Given training data $S$ consisting of $m$ independent instances from distribution $P^+$ and $n$ independent instances from distribution $P^-$, let $f^*$ be the optimal solution of problem (6). Assume $m \ge 12$ and $n \ge s$. Then, for any $k \le n$, with probability at least $1 - 2e^{-s}$,

$$P(f^*, \tau) \le L^*_k + O\Big(\sqrt{(s + \log m)/m}\Big) \qquad (18)$$

where $\tau = O\big(\sqrt{\log m / n} + k/n\big)$ and $L^*_k = \frac{1}{m}\sum_{i=1}^{m} l\Big(f^*(x^+_i) - \frac{1}{k}\sum_{j=1}^{k} f^*(x^-_{[j]})\Big)$.

Proof. Due to space limits, the proof is given in the appendix.

Theorem 4 implies that if $L^*_k$ is upper bounded by $O(\log(m)/m)$, then the probability of ranking a positive sample below a τ fraction of the negative samples is also bounded by $O(\sqrt{\log(m)/m})$. As $m$ approaches infinity, $P(f^*, \tau)$ gets close to 0, which means that, by thresholding at a suitable point, the true-positive rate will approach 1. Moreover, we observe that $m$ and $n$ play different roles in this bound. For instance, it is well known that the largest absolute value of $n$ Gaussian random variables grows as $\sqrt{\log(n)}$; thus we believe that growing $n$ only slightly affects the largest and the centroid of the top-proportion scores of the negative samples. This leads to the conclusion that increasing $n$ only slightly raises $L^*_k$, but significantly reduces the margin between the target τ and $k/n$. On the other hand, increasing $m$ reduces the upper bound on $P$, thus increasing the chance of finding positive instances at the top. In sum, $n$ and $m$ control τ and $P$, respectively.

Experiment Results

Effectiveness of the Linear-time Projection

We first demonstrate the effectiveness of our projection algorithm. Following the settings of (Liu and Ye 2009), we randomly sample 1000 points from the normal distribution N(0, 1) and solve the projection problem. The competing method is ibis (Liu and Ye 2009), an improved bisection algorithm that also exploits convexity and monotonicity. All experiments are run on an Intel Core i5 processor. As shown in Fig. 1(a), thanks to the efficient reduction of the constant factor, our method outperforms ibis, saving almost 75% of the running time in the limit case. We also solve the projection problem proposed in (Lapin, Hein, and Schiele 2015) using a simplified version of our method and compare it with the method presented there (PTkC), whose complexity is $O(n\log(n) + kn)$. As can be observed from Fig. 1(b), our method is linear in $n$ and does not suffer from growth in $k$; in the limit case (both large $k$ and large $n$), it is even three orders of magnitude faster than the competitor.

Ranking Performance

Next, we validate the ranking performance of our τ-FPL method: we score and sort the test samples, and then evaluate the proportion of positive samples ranked above a $(1 - \tau)$ fraction of the negative samples.

Table 2: Ranking performance for different values of the tolerance τ. The number of positive/negative instances and the feature dimension d are shown with each dataset name. The best results are shown in bold; N/A denotes experiments that require more than one week of training.

              heart            spambase                         real-sim                          w8a
              120/150,d:13     1813/2788,d:57                   22238/50071,d:20958               2933/62767,d:300
τ(%)           5     10        0.1   0.5   1     5     10       0.01  0.1   1     5     10        0.05  0.1   0.5   1     5     10
CS-SVM        .526  .691       .109  .302  .487  .811  .920     .376  .748  .921  .972  .990      .501  .520  .649  .695  .828  .885
Top Push      .541  .711       .112  .303  .484  .774  .845     .391  .747  .920  .968  .983      .508  .551  .627  .656  .761  .842
SVMpAUC-tight .509  .728       N/A   N/A   N/A   N/A   N/A      N/A   N/A   N/A   N/A   N/A       N/A   N/A   N/A   N/A   N/A   N/A
τ-Rank        .541  .740       .112  .305  .460  .842  .929     .391  .747  .920  .975  .991      .508  .551  .645  .710  .832  .894
2τ-Rank       .547  .734       .112  .311  .477  .862  .936     .391  .747  .922  .978  .992      .508  .549  .675  .739  .841  .902
Table 3: (mean false-positive rate, mean true-positive rate), NP-score on real-world datasets for different values of the tolerance τ. The leftmost column shows the number of positive/negative instances and the feature dimension d of each dataset. For each dataset, the best results are shown in bold.

Dataset(+/-)     τ(%)   BS-SVM                CS-LR                 CS-SVM                CS-SVM-OOB            τ-FPL                 2τ-FPL
heart            5      (.069, .675), .713    (.035, .394), .606    (.027, .327), .673    (.058, .553), .609    (.053, .582), .468    (.055, .584), .514
120/150,d:13     10     (.121, .774), .435    (.058, .615), .385    (.078, .666), .334    (.088, .682), .318    (.086, .686), .314    (.080, .679), .317
breast-cancer    1      (.015, .964), .559    (.007, .884), .116    (.006, .870), .130    (.014, .955), .451    (.013, .955), .324    (.011, .949), .192
239/444          5      (.063, .978), .276    (.013, .965), .035    (.017, .965), .034    (.046, .974), .026    (.041, .976), .025    (.045, .974), .026
d:10             10     (.113, .985), .142    (.035, .970), .030    (.044, .973), .027    (.095, .981), .020    (.098, .982), .018    (.094, .982), .018
spambase         0.5    (.008, .426), 1.220   (.007, .011), 1.362   (.002, .109), .891    (.005, .275), .790    (.005, .278), .722    (.004, .268), .732
1813/2788        1      (.013, .583), .748    (.007, .011), .989    (.004, .256), .744    (.009, .418), .582    (.008, .416), .584    (.008, .440), .560
d:57             5      (.054, .895), .192    (.007, .011), .989    (.020, .667), .333    (.047, .793), .207    (.041, .822), .178    (.046, .845), .155
                 10     (.103, .941), .087    (.007, .011), .989    (.051, .716), .284    (.090, .902), .099    (.087, .925), .075    (.090, .928), .072
real-sim         0.01   (.002, .813), 22.376  (.001, .207), 7.939   (.000, .209), .791    (.000, .268), .833    (.000, .270), .730    (.000, .270), .730
22238/50071      0.1    (.008, .919), 7.09    (.001, .207), .826    (.001, .700), .428    (.001, .584), .416    (.001, .585), .415    (.001, .585), .415
d:20958          0.5    (.023, .966), 3.680   (.001, .207), .794    (.001, .755), .245    (.003, .810), .190    (.003, .829), .174    (.003, .827), .181
                 1      (.036, .978), 2.570   (.001, .207), .794    (.007, .880), .121    (.007, .875), .125    (.007, .894), .115    (.006, .891), .109
                 5      (.094, .994), .878    (.078, .994), .575    (.029, .931), .139    (.039, .965), .035    (.041, .972), .028    (.044, .974), .028
                 10     (.133, .997), .336    (.078, .994), .007    (.069, .993), .007    (.099, .986), .019    (.092, .991), .009    (.094, .991), .009
w8a              0.05   (.001, .525), .966    (.000, .101), .900    (.000, .420), .580    (.000, .438), .562    (.000, .428), .572    (.000, .428), .572
1933/62767       0.1    (.001, .585), .710    (.000, .119), .881    (.000, .447), .553    (.001, .493), .507    (.001, .495), .505    (.001, .499), .501
d:123            0.5    (.006, .710), .437    (.000, .119), .881    (.002, .595), .405    (.003, .634), .366    (.003, .654), .347    (.003, .667), .333
                 1      (.011, .749), .341    (.014, .696), .715    (.006, .642), .358    (.006, .695), .305    (.006, .702), .298    (.007, .726), .274
                 5      (.048, .823), .177    (.014, .696), .305    (.013, .701), .299    (.046, .805), .195    (.033, .818), .182    (.036, .827), .173
                 10     (.049, .823), .177    (.014, .696), .305    (.013, .701), .299    (.053, .814), .186    (.042, .833), .167    (.038, .826), .174

Figure 1: Running time against two peer projection methods. (a) Running time against the method ibis (log-log). (b) Running time against PTkC (log-log).

Considering the ranking performance independently avoids the practical problem of the constraint in (2) being mismatched on the test set, and always offers the optimal threshold. Specifically, we choose (3) as the evaluation and validation criterion.
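Concretely, the empirical version of criterion (3) on a test set can be computed as in the following sketch (ours); Table 2 reports the complementary quantity, i.e., the fraction of positives ranked above the ⌈τn⌉-th largest negative score.

```python
import numpy as np

def ranking_loss(scores_pos, scores_neg, tau):
    """Empirical criterion (3): fraction of positive test samples whose score
    does not exceed the ceil(tau*n)-th largest negative score. Lower is
    better; 1 - ranking_loss is the proportion reported in Table 2."""
    n = len(scores_neg)
    k = max(int(np.ceil(tau * n)), 1)
    kth_largest_neg = np.partition(scores_neg, n - k)[n - k]
    return np.mean(scores_pos <= kth_largest_neg)
```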
The compared methods include: cost-sensitive SVM (CS-SVM) (Osuna, Freund, and Girosi 1997), which was shown above to be a lower-bound approximation of (3); Top Push ranking (Li, Jin, and Zhou 2014), which focuses on optimizing the absolute top of the ranking list and is a special case of our model (τ = 0); and SVMpAUC-tight (Narasimhan and Agarwal 2013b), a more general method designed for optimizing arbitrary partial AUC. We test two versions of our algorithm, τ-Rank and 2τ-Rank, corresponding to different choices of τ in the learning scheme. Intuitively, enlarging τ in the training phase can be seen as a top-down approximation, from the upper bound, of the original objective (2). The reason for choosing 2τ is that, roughly speaking, the average score of the top 2τ fraction of the negative samples may be close to the score of the ⌈τn⌉-th negative sample.

Settings. We evaluate the performance on public benchmark datasets from different domains and of various sizes.³ For small-scale datasets (fewer than 10,000 instances), 30 stratified hold-out tests are carried out, with 2/3 of the data as the training set and 1/3 as the test set. For large datasets, we instead run 10 rounds. In each round, hyper-parameters are chosen by 5-fold cross-validation over a grid, and the search scope is extended if the optimum lies on the boundary.

³https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary

Results. Table 2 reports the experimental results. We note that in most cases our proposed methods outperform the other peer methods, confirming the theoretical analysis that our methods better extract the capacity of the model. Top Push is highly competitive for extremely small τ, but gradually loses its advantage as τ increases. SVMpAUC-tight is based on a cutting-plane method with an exponential number of constraints; similar technology is used in many other ranking or structured prediction methods, e.g., structured SVM (Tsochantaridis et al. 2005). The time complexity of this kind of method is O((m + n) log(m + n)), and we found that even for thousands of training samples it is hard to finish the experiments in the allowed time.

Overall Classification Accuracy

In this section, we compare the performance of different models when the scoring function and the threshold are learned jointly in the training phase, i.e., the model outputs a classifier. To evaluate a classifier under the maximum tolerance, we use the Neyman-Pearson score (NP-score) (Scott 2007), defined as

$$\frac{1}{\tau}\max\{fpr - \tau,\ 0\} + (1 - tpr)$$

where fpr and tpr are the false-positive rate and true-positive rate of the classifier, respectively, and τ is the maximum tolerance. This measure punishes classifiers whose false-positive rate exceeds τ, and the punishment becomes heavier as τ → 0.
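In code, the NP-score is a one-liner; the example below (ours) reproduces the τ-FPL entry for the heart dataset at τ = 10% from Table 3.

```python
def np_score(fpr, tpr, tau):
    """Neyman-Pearson score (Scott 2007): lower is better; FPR above the
    tolerance tau is penalized with weight 1/tau."""
    return max(fpr - tau, 0.0) / tau + (1.0 - tpr)

print(round(np_score(0.086, 0.686, 0.10), 3))  # 0.314, matching Table 3
```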
Settings. We use a setting similar to the ranking experiments: for small-scale datasets, 30 stratified hold-out tests are carried out; for large datasets, we instead run 10 rounds. The comparison baselines include: cost-sensitive logistic regression (CS-LR), which uses a surrogate function different from that of CS-SVM; the bias-shifting support vector machine (BS-SVM), which first trains a standard SVM and then tunes the threshold to meet the specified false-positive rate; and cost-sensitive SVM (CS-SVM). For a complete comparison, we also construct a CS-SVM with our out-of-bootstrap thresholding (CS-SVM-OOB), to eliminate possible performance gains coming from the different thresholding method and to focus on the training algorithm itself. For all comparing methods, hyper-parameters are selected by 5-fold cross-validation with grid search, aiming to minimize the NP-score, and the search scope is extended when the optimum lies on the boundary. For our τ-FPL, the regularization parameter R is selected in the ranking stage to minimize (3), and the threshold is then chosen to minimize the NP-score. We test two variants of our algorithm, τ-FPL and 2τ-FPL, corresponding to different choices of τ in the learning scheme; as mentioned previously, enlarging τ can be seen as a top-down approximation of the original objective.

Results. The NP-score results are given in Table 3. First, we note that our methods achieve the best performance in most of the tests, compared to the various baselines. Moreover, it is clear that even when the same method is used to select the threshold, the performance of the cost-sensitive method is still limited. Another observation is that all three algorithms that use out-of-bootstrap thresholding can efficiently control the false-positive rate under the constraint. Moreover, the τ-FPL variants are more stable than the other algorithms, which we believe benefits from the accurate separation of the positive and negative instances and the stable thresholding technique.

Scalability

We study how τ-FPL scales with the number of training examples using the largest dataset, real-sim. To simulate the limit situation, we construct six datasets of different sizes by up-sampling the original dataset with ratios {1, 2, 2², ..., 2⁵}, resulting in data sizes from 72,309 to 2,313,888. We run the τ-FPL ranking algorithm on these datasets with different τ and the optimal R (chosen by cross-validation), and report the corresponding training time. Up-sampling ensures that, for fixed τ, all six datasets share the same optimal regularization parameter R, so the data size is the only variable.

Figure 2: Training time of τ-FPL versus training data size for different τ (log-log).

Figure 2 shows the log-log plot of the training time of τ-FPL versus the size of the training data, where different lines correspond to different τ. The training time of τ-FPL is clearly linear in the number of training examples, which is consistent with our theoretical analysis and demonstrates the scalability of τ-FPL.

Conclusion

In this paper, we focus on learning a binary classifier under a specified FPR tolerance τ. To this end, we have proposed a novel ranking method that directly optimizes the probability of ranking positive samples above a (1 − τ) fraction of the negative samples. The ranking problem is solved efficiently by a projected gradient method with the proposed linear-time projection. Moreover, out-of-bootstrap thresholding is applied to transform the learned ranking model into a classifier with a low false-positive rate. We demonstrate the superiority of our method through both theoretical analysis and extensive experiments on several benchmark datasets.

Acknowledgments

The main part of this work was done when the first author was an intern at iDST, Alibaba.
This work is supported by the National Natural Science Foundation of China (NSFC) (61702186, 61672236, 61602176, 61672231), the NSFC-Zhejiang Joint Fund for the Integration of Industrialization and Informatization (U1609220), the Key Program of Shanghai Science and Technology Commission (15JC1401700), and the Joint Research Grant for Overseas Chinese Scholars (61628203).

References

Ben-Hur, A.; Horn, D.; Siegelmann, H. T.; and Vapnik, V. 2001. Support vector clustering. JMLR 2(Dec):125–137.
Davenport, M. A.; Baraniuk, R. G.; and Scott, C. D. 2006. Controlling false alarms with support vector machines. In ICASSP, volume 5.
Davenport, M. A.; Baraniuk, R. G.; and Scott, C. D. 2010. Tuning support vector machines for minimax and Neyman-Pearson classification. TPAMI 32(10):1888–1898.
Drucker, H.; Wu, D.; and Vapnik, V. N. 1999. Support vector machines for spam categorization. IEEE Transactions on Neural Networks 10(5):1048–1054.
Gasso, G.; Pappaioannou, A.; Spivak, M.; and Bottou, L. 2011. Batch and online learning algorithms for nonconvex Neyman-Pearson classification. ACM Transactions on Intelligent Systems and Technology (TIST) 2(3):28.
Huang, H.; Liu, C.-C.; and Zhou, X. J. 2010. Bayesian approach to transforming public gene expression repositories into disease diagnosis databases. PNAS 107(15):6823–6828.
Joachims, T. 1996. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Technical report, DTIC Document.
Kiwiel, K. C. 2005. On Floyd and Rivest's SELECT algorithm. Theoretical Computer Science 347(1-2):214–238.
Kiwiel, K. C. 2007. On linear-time algorithms for the continuous quadratic knapsack problem. Journal of Optimization Theory and Applications 134(3):549–554.
Lapin, M.; Hein, M.; and Schiele, B. 2015. Top-k multiclass SVM. In NIPS, 325–333.
Lehmann, E. L., and Romano, J. P. 2006. Testing Statistical Hypotheses. Springer Science & Business Media.
Li, N.; Jin, R.; and Zhou, Z.-H. 2014. Top rank optimization in linear time. In NIPS, 1502–1510.
Liu, J., and Ye, J. 2009. Efficient Euclidean projections in linear time. In ICML, 657–664.
Liu, X.-Y., and Zhou, Z.-H. 2010. Learning with cost intervals. In KDD, 403–412.
Mahdavi, M.; Yang, T.; and Jin, R. 2013. Stochastic convex optimization with multiple objectives. In NIPS, 1115–1123.
Masnadi-Shirazi, H., and Vasconcelos, N. 2007. Asymmetric boosting. In ICML, 609–619.
Masnadi-Shirazi, H., and Vasconcelos, N. 2011. Cost-sensitive boosting. TPAMI 33(2):294–309.
Mozer, M. C.; Dodier, R.; Colagrosso, M. D.; Guerra-Salcedo, C.; and Wolniewicz, R. 2002. Prodding the ROC curve: Constrained optimization of classifier performance. In NIPS, 1409–1415.
Narasimhan, H., and Agarwal, S. 2013a. On the relationship between binary classification, bipartite ranking, and binary class probability estimation. In NIPS, 2913–2921.
Narasimhan, H., and Agarwal, S. 2013b. A structural SVM based approach for optimizing partial AUC. In ICML.
Nesterov, Y. 2003. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers.
Osuna, E.; Freund, R.; and Girosi, F. 1997. Support vector machines: Training and applications. Technical Report AIM-1602.
Patriksson, M. 2008. A survey on the continuous nonlinear resource allocation problem. European Journal of Operational Research 185(1):1–46.
Rigollet, P., and Tong, X. 2011. Neyman-Pearson classification, convexity and stochastic constraints. JMLR 12(Oct):2831–2855.
Scheirer, W. J.; de Rezende Rocha, A.; Sapkota, A.; and Boult, T. E. 2013. Toward open set recognition. TPAMI 35(7):1757–1772.
Scott, C., and Nowak, R. 2005. A Neyman-Pearson approach to statistical learning. IEEE Transactions on Information Theory (TIT) 51(11):3806–3819.
Scott, C. 2007. Performance measures for Neyman-Pearson classification. IEEE Transactions on Information Theory (TIT) 53(8):2852–2863.
Tong, X. 2013. A plug-in approach to Neyman-Pearson classification. JMLR 14(Oct):3011–3040.
Tsochantaridis, I.; Joachims, T.; Hofmann, T.; and Altun, Y. 2005. Large margin methods for structured and interdependent output variables. JMLR 6(Sep):1453–1484.
Wu, S.-H.; Lin, K.-P.; Chen, C.-M.; and Chen, M.-S. 2008. Asymmetric support vector machines: Low false-positive learning under the user tolerance. In KDD, 749–757.
Zhou, Y.-H., and Zhou, Z.-H. 2016. Large margin distribution learning with cost interval and unlabeled data. TKDE 28(7):1749–1763.