# top_rank_optimization_in_linear_time__a6fc1120.pdf

Top Rank Optimization in Linear Time

Nan Li1 Rong Jin2 Zhi-Hua Zhou1

1National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China 2Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824 {lin,zhouzh}@lamda.nju.edu.cn rongjin@cse.msu.edu

Bipartite ranking aims to learn a real-valued ranking function that orders positive instances before negative instances. Recent efforts of bipartite ranking are focused on optimizing ranking accuracy at the top of the ranked list. Most existing approaches are either to optimize task speciﬁc metrics or to extend the rank loss by emphasizing more on the error associated with the top ranked instances, leading to a high computational cost that is super-linear in the number of training instances. We propose a highly efﬁcient approach, titled Top Push, for optimizing accuracy at the top that has computational complexity linear in the number of training instances. We present a novel analysis that bounds the generalization error for the top ranked instances for the proposed approach. Empirical study shows that the proposed approach is highly competitive to the state-of-the-art approaches and is 10-100 times faster.

1 Introduction

Bipartite ranking aims to learn a real-valued ranking function that places positive instances above negative instances. It has attracted much attention because of its applications in several areas such as information retrieval and recommender systems [32, 25]. Many ranking methods have been developed for bipartite ranking, and most of them are essentially based on pairwise ranking. These algorithms reduce the ranking problem into a binary classiﬁcation problem by treating each positivenegative instance pair as a single object to be classiﬁed [16, 12, 5, 39, 38, 33, 1, 3]. Since the number of instance pairs can grow quadratically in the number of training instances, one limitation of these methods is their high computational costs, making them not scalable to large datasets.

Considering that for applications such as document retrieval and recommender systems, only the top ranked instances will be examined by users, there has been a growing interest in learning ranking functions that perform especially well at the top of the ranked list [7, 39, 38, 33, 1, 3, 27, 40]. Most of these approaches can be categorized into two groups. The ﬁrst group maximizes the ranking accuracy at the top of the ranked list by optimizing task speciﬁc metrics [17, 21, 23, 40], such as average precision (AP) [42], NDCG [39] and partial AUC [27, 28]. The main limitation of these methods is that they often result in non-convex optimization problems that are difﬁcult to solve efﬁciently. Structural SVM [37] addresses this issue by translating the non-convexity into an exponential number of constraints. It can still be computationally challenging because it usually requires to search for the most violated constraint at each iteration of optimization. In addition, these methods are statistically inconsistent [36, 21], leading to suboptimal solutions. The second group of methods are based on pairwise ranking. They design special convex loss functions that place more penalties on the ranking errors related to the top ranked instances [38, 33, 1]. Since these methods are based on pairwise ranking, their computational costs are usually proportional to the number of positive-negative instance pairs, making them unattractive for large datasets.

In this paper, we address the computational challenge of bipartite ranking by designing a ranking algorithm, named Top Push, that can efﬁciently optimize the ranking accuracy at the top. The key feature of the proposed Top Push algorithm is that its time complexity is only linear in the number of training instances. This is in contrast to most existing methods for bipartite ranking whose computational costs depend on the number of instance pairs. Moreover, we develop novel analysis for bipartite ranking. One deﬁciency of the existing theoretical studies [33, 1] on bipartite ranking is that they try to bound the probability for a positive instance to be ranked before any negative instance, leading to relatively pessimistic bounds. We overcome this limitation by bounding the probability of ranking a positive instance before most negative instances, and show that Top Push is effective in placing positive instances at the top of a ranked list. Extensive empirical study shows that Top Push is computationally more efﬁcient than most ranking algorithms, and yields comparable performance as the state-of-the-art approaches that maximize the ranking accuracy at the top.

The rest of this paper is organized as follows. Section 2 introduces the preliminaries of bipartite ranking, and addresses the difference between AUC optimization and maximizing accuracy at the top. Section 3 presents the proposed Top Push algorithm and its key theoretical properties. Section 4 summarizes the empirical study, and Section 5 concludes this work with future directions.

2 Bipartite Ranking: AUC vs. Accuracy at the Top

Let X = {x Rd : x 1} be the instance space. Let S = S+ S be a set of training instances, where S+ = {x+ i X}m i=1 and S = {x i X}n i=1 include m positive instances and n negative instances independently sampled from distributions P+ and P , respectively. The goal of bipartite ranking is to learn a ranking function f : X 7 R that is likely to place a positive instance before most negative ones. In the literature, bipartite ranking has found applications in many domains [32, 25], and its theoretical properties have been examined by several studies [2, 6, 20, 26].

AUC is a commonly used evaluation metric for bipartite ranking [15, 9]. By exploring its equivalence to Wilcoxon-Mann-Whitney statistic [15], many ranking algorithms have been developed to optimize AUC by minimizing the ranking loss deﬁned as

Lrank(f; S) = 1 mn

j=1 I f(x+ i ) f(x j ) , (1)

where I( ) is the indicator function. Other than a few special loss functions (e.g., exponential and logistic loss) [33, 20], most of these methods need to enumerate all the positive-negative instance pairs, making them unattractive for large datasets. Various methods have been developed to address this computational challenge [43, 13].

Recently, there is a growing interest on optimizing ranking accuracy at the top [7, 3]. Maximizing AUC is not suitable for this goal as indicated by the analysis in [7]. To address this challenge, we propose to maximize the number of positive instances that are ranked before the ﬁrst negative instance, which is known as positives at the top [33, 1, 3]. We can translate this objective into the minimization of the following loss

L(f; S) = 1

i=1 I f(x+ i ) max 1 j n f(x j ) . (2)

which computes the fraction of positive instances ranked below the top-ranked negative instance. By minimizing the loss in (2), we essentially push negative instances away from the top of the ranked list, leading to more positive ones placed at the top. We note that (2) is fundamentally different from AUC optimization as AUC does not focus on the ranking accuracy at the top. More discussion about the relationship between (1) and (2) can be found in the longer version of the paper [22].

To design practical learning algorithms, we replace the indicator function in (2) with its convex surrogate, leading to the following loss function

Lℓ(f; S) = 1

i=1 ℓ max 1 j n f(x j ) f(x+ i ) , (3)

where ℓ( ) is a convex loss function that is non-decreasing1 and differentiable. Examples of such loss functions include truncated quadratic loss ℓ(z) = [1 + z]2 +, exponential loss ℓ(z) = ez, or

1In this paper, we let ℓ(z) to be non-decreasing for the simplicity of formulating dual problem.

logistic loss ℓ(z) = log(1 + ez). In the discussion below, we restrict ourselves to the truncated quadratic loss, though most of our analysis applies to others.

It is easy to verify that the loss Lℓ(f; S) in (3) is equivalent to the loss used in Inﬁnite Push [1] (a special case of P-norm Push [33]), i.e.,

Lℓ (f; S) = max 1 j n 1 m

i=1 ℓ f(x j ) f(x+ i ) . (4)

The apparent advantage of employing Lℓ(f; S) instead of Lℓ (f; S) is that it only needs to evaluate on m positive-negative instance pairs, whereas the later needs to enumerate all the mn instance pairs. As a result, the number of dual variables induced by Lℓ(f; S) is n + m, linear in the number of training instances, which is signiﬁcantly smaller than mn, the number of dual variables induced by Lℓ (f; S) [1, 31]. It is this difference that makes the proposed algorithm achieve a computational complexity linear in the number of training instances and therefore be more efﬁciently than the existing algorithms for most state-of-the-art algorithms for bipartite ranking.

3 Top Push for Optimizing Top Accuracy

We ﬁrst present a learning algorithm to minimize the loss function in (3), and then the computational complexity and performance guarantee for the proposed algorithm.

3.1 Dual Formulation

We consider linear ranking function2, i.e., f(x) = w x, where w Rd is the weight vector to be learned. As a result, the learning problem is given by the following optimization problem

min w λ 2 w 2 + 1

i=1 ℓ max 1 j n w x j w x+ i , (5)

where λ > 0 is a regularization parameter. Directly minimizing the objective in (5) can be challenging because of the max operator in the loss function. We address this challenge by developing a dual formulation for (5). Speciﬁcally, given a convex and differentiable function ℓ(z), we can rewrite it in its convex conjugate form as ℓ(z) = maxα Ωαz ℓ (α) , where ℓ (α) is the convex conjugate of ℓ(z) and Ωis the domain of dual variable [4]. For example, the convex conjugate of truncated quadratic loss is ℓ (α) = α + α2/4 with Ω= R+. We note that dual form has been widely used to improve computational efﬁciency [35] and connect different styles of learning algorithms [19]. Here we exploit it to overcome the difﬁculty caused by max operator. The dual form of (5) is given in the following theorem, whose detailed proof can be found in the longer version [22].

Theorem 1. Deﬁne X+ = (x+ 1 , . . . , x+ m) and X = (x 1 , . . . , x n ) , the dual problem of (5) is

min (α,β) Ξ g(α, β) = 1 2λm α X+ β X 2 + Xm

i=1 ℓ (αi) (6)

where α and β are dual variables, and the domain Ξ is deﬁned as

Ξ = α Rm +, β Rn + : 1 mα = 1 n β .

Let α and β be the optimal solution to the dual problem (6). Then, the optimal solution w to the primal problem in (5) is given by

w = 1 λm a X+ β X . (7)

Remark The key feature of the dual problem in (6) is that the number of dual variables is m + n, leading to a linear time ranking algorithm. This is in contrast to the Inﬁnit Push algorithm in [1] that introduces mn dual variables and a higher computational cost. In addition, the objective function in (6) is smooth if the convex conjugate ℓ ( ) is smooth, which is true for many common loss functions (e.g., truncated quadratic loss and logistic loss). It is well known in the literature of optimization [4] that an O(1/T 2) convergence rate can be achieved if the objective function is smooth, where T is the number of iterations; this also helps in designing efﬁcient learning algorithm.

2Nonlinear function can be trained by kernel methods, and Nystr om method and random Fourier features can transform the kernelized problem into a linear one. See [41] for more discussions.

3.2 Linear Time Bipartite Ranking

According to Theorem 1, to learn a ranking function f(w), it is sufﬁcient to learn the dual variables α and β by solving the problem in (6). For this purpose, we adopt the accelerated gradient method due to its light computation per iteration, and refer the obtained algorithm as Top Push. Speciﬁcally, we choose the Nesterov s method [30, 29] that achieves an optimal convergence rate O(1/T 2) for smooth objective function. One of the key features of the Nesterov s method is that it maintains two sequences of solutions: {(αk, βk)} and {(sα k; sβ k)}, where the sequence of auxiliary solutions {(sα k; sβ k)} is introduced to exploit the smoothness of the objective to achieve a faster convergence rate. Algorithm 1 shows the key steps3 of the Nesterov s method for solving the problem in (6), where the gradients of the objective function g(α, β) can be efﬁciently computed as

αg(α, β) = X+ν /λm + ℓ (α) , βg(α, β) = X ν /λm . (8)

where ν = α X+ β X and ℓ ( ) is the derivative of ℓ ( ).

Algorithm 1 The Top Push Algorithm

Input: X+ Rm d, X Rn d, λ, ϵ Output: w

1: initialize α1 = α0 = 0m, β1 = β0 = 0n, and let t 1 = 0, t0 = 1, L0 = 1 m+n 2: repeat for k = 1, 2, . . . 3: compute sa k = αk + ωk(αk αk 1) and sβ k = βk + ωk(βk βk 1), where ωk = tk 2 1

tk 1 4: compute gα = αg(sα k , sβ k) and gβ = βg(sα k , sβ k) based on (8) 5: ﬁnd Lk > Lk 1 such that g(αk+1, βk+1) > g(sα k , sβ k) + ( gα 2 + gβ 2)/(2Lk), where [αk+1; βk+1] = πΞ([α k+1; β k+1]) with α k+1 = sα k 1 Lk gα and β k+1 = sβ k 1 Lk gβ

6: update tk = (1 + q

1 + 4t2 k 1)/2

7: until convergence (i.e., |g(αk+1, βk+1) g(αk, βk)| < ϵ) 8: return w = 1 λ m(α k X+ β k X )

It should be noted that, (6) is a constrained problem, and therefore, at each step of gradient mapping, we have to project the dual solution into the domain Ξ (i.e, [αk+1; βk+1] = πΞ([α k+1; β k+1]) in step 5) to keep them feasible. Below, we discuss how to solve this projection step efﬁciently.

Projection Step For clear notations, we expand the projection step into the problem

min α 0,β 0 1 2 α α0 2 + 1

2 β β0 2 s.t. 1 mα = 1 n β , (9)

where α0 and β0 are the solutions obtained in the last iteration. We note that similar projection problems have been studied in [34, 24] where they either have O((m + n) log(m + n)) time complexity [34] or only provide approximate solutions [24]. Instead, based on the following proposition, we provide a method which ﬁnd the exact solution to (9) in O(n+m) time. By using proof technique similar to that for Theorem 2 in [24], we can prove the following proposition:

Proposition 1. The optimal solution to the projection problem in (9) is given by

α = [α0 γ ]+ and β = [β0 + γ ]+ ,

where γ is the root of function ρ(γ) = Pm i=1[α0 i γ]+ Pn j=1[β0 j + γ]+ .

Based on Proposition 1, we provide a method which ﬁnd the exact solution to (9) in O(m+n) time. According to Proposition 1, the key to solving this problem is to ﬁnd the root of ρ(γ). Instead of approximating the solution via bisection as in [24], we develop a divide-and-conquer method to ﬁnd the exact solution of γ in O(m + n) time, where a similar approach has been used in [10]. The basic idea is to ﬁrst identify the smallest interval that contains the root based on a modiﬁcation of the randomized median ﬁnding algorithm [8], and then solve the root exactly based on the interval. The detailed projection procedure can be found in the longer version [22].

3The step size of the Nesterov s method depends on the smoothness of the objective function. In current work we adopt the Nemirovski s line search scheme [29] to compute the smoothness parameter, and the detailed algorithm can be found in [22].

Table 1: Comparison of computational complexities for ranking algorithms, where d is the number of dimensions, ϵ is the precision parameter, m and n are the number of positive and negative instances, respectively.

Algorithm Computational Complexity

SVMRank [18] O (m + n)d + (m + n) log(m + n) /ϵ

SVMMAP [42] O (m + n)d + (m + n) log(m + n) /ϵ

OWPC [38] O (m + n)d + (m + n) log(m + n) /ϵ

SVMp AUC [27, 28] O n log n + m log m + (m + n)d /ϵ

Inﬁnite Push [1] O mnd + mn log(mn) /ϵ2

L1SVIP [31] O mnd + mn log(mn) /ϵ

Top Push this paper O (m + n)d/ ϵ

3.3 Convergence and Computational Complexity

The theorem below states the convergence of the Top Push algorithm, which follows immediately from the convergence result for the Nesterov s method [29].

Theorem 2. Let αT and βT be the solution output from Top Push after T iterations, we have

g(αT , βT ) min (α,β) Ξ g(α, β) + ϵ

provided T O(1/ ϵ).

Finally, since the computational cost of each iteration is dominated by the gradient evaluation and the projection step, the time complexity of each iteration is O((m + n)d) since the complexity of projection step is O(m+n) and the cost of computing the gradient is O((m+n)d). Combining this result with Theorem 2, we have, to ﬁnd an ϵ-suboptimal solution, the total computational complexity of the Top Push algorithm is O((m + n)d/ ϵ), which is linear in the number of training instances.

Table 1 compares the computational complexity of Top Push with that of the state-of-the-art algorithms. It is easy to see that Top Push is asymptotically more efﬁcient than the state-of-the-art ranking algorithms4. For instances, it is much more efﬁcient than Inﬁnite Push and its sparse extension L1SVIP whose complexity depends on the number of positive-negative instance pairs; compared with SVMRank, SVMMAP and SVMp AUC that handle speciﬁc performance metrics via structural SVM, the linear dependence on the number of training instances makes our Top Push approach more appealing, especially for large datasets.

3.4 Theoretical Guarantee

We develop theoretical guarantee for the ranking performance of Top Push. In [33, 1], the authors have developed margin-based generalization bounds for the loss function Lℓ . One limitation with the analysis in [33, 1] is that they try to bound the probability for a positive instance to be ranked before any negative instance, leading to relatively pessimistic bounds5. Our analysis avoids this pitfall by considering the probability of ranking a positive instance before most negative instances.

To this end, we ﬁrst deﬁne hb(x, w), the probability for any negative instance to be ranked above x using ranking function f(x) = w x, as

hb(x, w) = Ex P I(w x w x ) .

Since we are interested in whether positive instances are ranked above most negative instances, we will measure the quality of f(x) = w x by the probability for any positive instance to be ranked below δ percent of negative instances, i.e.,

Pb(w, δ) = Prx+ P+ hb(x+ i , w) δ .

Clearly, if a ranking function achieves a high ranking accuracy at the top, it should have a large percentage of positive instances with ranking scores higher than most of the negative instances, leading to a small value for Pb(w, δ) with little δ. The following theorem bounds Pb(w, δ) for Top Push, and the detailed proof can be found in the longer version [22].

4In Table 1, we report the complexity of SVMp AUC tight in [28], which is more efﬁcient than SVMp AUC in [27]. In addition, SVMp AUC tight is used in experiments and we do not distinguish between them in this paper. 5For instance, for the bounds in [33], the failure probability can be as large as 1 if the parameter p is large.

Theorem 3. Given training data S consisting of m independent samples from P+ and n independent samples from P , let w be the optimal solution to the problem in (5). Assume m 12 and n t, we have, with a probability at least 1 2e t,

Pb(w , δ) Lℓ(w , S) + O p

(t + log m)/m

where δ = O( p

log m/n) and Lℓ(w , S) = 1

m Pm i=1 ℓ(max1 j n w x j w x+ i ).

Remark Theorem 3 implies that if the empirical loss Lℓ(w , S) O(log m/m), for most positive instance x+ (i.e., 1 O(log m/m)), the percentage of negative instances ranked above x+ is upper bounded by O( p

log m/n). We observe that m and n play different roles in the bound; that is, because the empirical loss compares the positive instances to the negative instance with the largest score, it usually grows signiﬁcantly slower with increasing n. For instance, the largest absolute value of Gaussian random samples grows in log n. Thus, we believe that the main effect of increasing n in our bound is to reduce δ (decrease at the rate of 1/ n), especially when n is large. Meanwhile, by increasing the number of positive instances m, we will reduce the bound for Pb(w, δ), and consequently increase the chance of ﬁnding positive instances at the top.

4 Experiments

4.1 Settings

To evaluate the performance of the Top Push algorithm, we conduct a set of experiments on realworld datasets. Table 2 (left column) summarizes the datasets used in our experiments. Some of them were used in previous studies [1, 31, 3], and others are larger datasets from different domains. We compare Top Push with state-of-the-art algorithms that focus on accuracy at the top, including SVMMAP [42], SVMp AUC [28] with α = 0 and β = 1/n, AATP [3] and Inﬁnite Push [1]. In addition, for completeness, several state-of-the-art classiﬁcation and ranking models are included in the comparison: logistic regression (LR) for binary classiﬁcation, cost-sensitive SVM (cs-SVM) that addresses imbalance class distribution by introducing a different misclassiﬁcation cost for each class, and SVMRank [18] for AUC optimization. We implement Top Push and Inﬁnite Push using MATLAB, implement AATP using CVX [14], and use LIBLINEAR [11] for LR and cs-SVM, and use the codes shared by the authors of the original works.

We measure the accuracy at the top by commonly used metrics6: (i) positives at the top (Pos@Top) [1, 31, 3], which is deﬁned as the fraction of positive instances ranked above the topranked negative, (ii) average precision (AP) and (iii) normalized DCG scores (NDCG). On each dataset, experiments are run for thirty trials. In each trial, the dataset is randomly divided into two subsets: 2/3 for training and 1/3 for test. For all algorithms, we set the precision parameter ϵ to 10 4, choose other parameters by 5-fold cross validation (based on the average value of Pos@Top) on training set, and perform the evaluation on test set. Finally, averaged results over thirty trails are reported. All experiments are run on a machine with two Intel Xeon E7 CPUs and 16GB memory.

4.2 Results

In table 2, we report the performance of the algorithms in comparison, where the statistics of testbeds are included in the ﬁrst column of the table. For better comparison between the performance of Top Push and baselines, pairwise t-tests at signiﬁcance level of 0.9 are performed and results are marks / in table 2 when Top Push is statistically signiﬁcantly better/worse.

When an evaluation task can not be completed in two weeks, it will be stopped automatically, and no result will be reported. As a consequence, we observe that results for some algorithms are missing in Table 2 for certain datasets, especially for large ones. We can see from Table 2 that Top Push, LR and cs-SVM succeed to ﬁnish the evaluation on all datasets (even the largest datasets url). In contrast, SVMRank, SVMRank and SVMp AUC fail to complete the training in time for several large datasets. Inﬁnite Push and AATP have the worst scalability: they are only able to ﬁnish the smallest dataset diabetes. We thus conclude that overall, Top Push scales well to large datasets.

6It is worth mentioning that we also measure the ranking performance by AUC, and the results can be found in [22]. In addition, more details of the experimental setting can be found there.

Table 2: Data statistics (left column) and experimental results. For each dataset, the number of positive and negative instances is below the data name as m/n, together with dimensionality d. For training time comparison, ( ) are marked if Top Push is at least 10 (100) times faster than the compared algorithm. For performance (mean std) comparison, ( ) is marked if Top Push performs signiﬁcantly better (worse) than the baseline based on pairwise t-test at 0.9 signiﬁcance level. On each dataset, if the evaluation of an algorithm can not be completed in two weeks, it will be stopped and its results will be missing from the table.

Data Algorithm Time (s) Pos@Top AP NDCG

diabetes Top Push 5.11 10 3 .123 .056 .872 .023 .976 .005 500/268 LR 2.30 10 2 .064 .075 .881 .022 .973 .008 d : 34 cs-SVM 7.70 10 2 .077 .088 .758 .166 .920 .078 SVMRank 6.11 10 2 .087 .082 .879 .022 .975 .006 SVMMAP 4.71 100 .077 .072 .879 .012 .969 .009 SVMp AUC 2.09 10 1 .053 .096 .668 .123 .884 .065 Inﬁnite Push 2.63 101 .119 .051 .877 .035 .978 .007 AATP 2.72 103 .127 .061 .881 .035 .979 .010

news20-forsale Top Push 2.16 100 .191 .088 .843 .018 .970 .005 999/18, 929 LR 4.14 100 .086 .067 .803 .020 .962 .005 d : 62, 061 cs-SVM 1.89 100 .114 .069 .766 .021 .955 .006 SVMRank 2.96 102 .149 .056 .850 .016 .972 .003 SVMMAP 8.42 102 .184 .092 .832 .022 .969 .007 SVMp AUC 3.25 102 .196 .087 .812 .019 .963 .005

nslkdd Top Push 7.64 101 .633 .088 .978 .001 .997 .001 71, 463/77, 054 LR 3.63 101 .220 .053 .981 .002 .998 .001 d : 121 cs-SVM 1.86 100 .556 .037 .980 .001 .998 .001 SVMp AUC 1.72 102 .634 .059 .956 .002 .996 .001

real-sim Top Push 1.34 101 .186 .049 .986 .001 .998 .001 22, 238/50, 071 LR 7.67 100 .100 .043 .989 .001 .999 .001 d : 20, 958 cs-SVM 4.84 100 .146 .031 .979 .001 .998 .001 SVMRank 1.83 103 .090 .045 .986 .000 .999 .001

spambase Top Push 1.51 10 1 .129 .077 .922 .006 .988 .001 1, 813/2, 788 LR 3.11 10 2 .071 .053 .920 .010 .987 .003 d : 57 cs-SVM 8.31 10 2 .069 .059 .907 .010 .980 .004 SVMRank 2.31 101 .069 .076 .931 .010 .990 .003 SVMMAP 1.92 102 .097 .069 .935 .014 .984 .005 SVMp AUC 1.73 100 .073 .058 .854 .024 .975 .007 Inﬁnite Push 1.78 103 .132 .087 .920 .005 .987 .002

url Top Push 5.11 103 .474 .046 .986 .001 .999 .001 792, 145/1, 603, 985 LR 8.98 103 .362 .113 .993 .001 .999 .001 d : 3, 231, 961 cs-SVM 3.78 103 .432 .069 .991 .002 .998 .001

w8a Top Push 7.35 100 .226 .053 .710 .019 .938 .005 1, 933/62, 767 LR 2.46 100 .107 .093 .450 .374 .775 .221 d : 300 cs-SVM 3.87 100 .118 .105 .447 .372 .774 .220 SVMp AUC 2.59 103 .207 .046 .673 .021 .929 .006

Performance Comparison In terms of evaluation metric Pos@Top, we ﬁnd that Top Push yields similar performance as Inﬁnite Push and AATP, and performs signiﬁcantly better than the other baselines including LR and cs-SVM, SVMRank, SVMRank and SVMp AUC. This is consistent with the design of Top Push that aims to maximize the accuracy at the top of the ranked list. Since the loss function optimized by Inﬁnite Push and AATP are similar as that for Top Push, it is not surprising that they yield similar performance. The key advantage of using the proposed algorithm versus Inﬁnite Push and AATP is that it is computationally more efﬁcient and scales well to large datasets. In terms of AP and NDCG, we observe that Top Push yield similar, if not better, performance as the state-of-the-art methods, such as SVMMAP and SVMp AUC, that are designed to optimize these metrics. Overall, we conclude that the proposed algorithm is effective in optimizing the ranking accuracy for the top ranked instances.

Training Efﬁciency To evaluate the computational efﬁciency, we set the parameters of different algorithms to be the values that are selected by cross-validation, and run these algorithms on full datasets that include both training and testing sets. Table 2 summarizes the training time of different algorithms. From the results, we can see that Top Push is faster than state-of-the-art ranking methods on most datasets. In fact, the training time of Top Push is similar to that of LR and cs-SVM

implemented by LIBLINEAR. Since the time complexity of learning a binary classiﬁcation model is usually linear in the number of training instances, this result implicitly suggests a linear time complexity for the proposed algorithm.

10 2 10 3 10 4 10 5

trainign time (s)

=100 =10 =1 =0.1 =0.01 (x)

Figure 1: Training time of Top Push versus training data size for different values of λ.

Scalability We study how Top Push scales to different number of training examples by using the largest dataset url. Figure 1 shows the log-log plot for the training time of Top Push vs. the size of training data, where different lines correspond to different values of λ. For the purpose of comparison, we also include a black dash-dot line that tries to ﬁt the training time by a linear function in the number of training instances (i.e., Θ(m + n)). From the plot, we can see that for different regularization parameter λ, the training time of Top Push increases even slower than the number of training data. This is consistent with our theoretical analysis given in Section 3.3.

5 Conclusion

In this paper, we focus on bipartite ranking algorithms that optimize accuracy at the top of the ranked list. To this end, we consider to maximize the number of positive instances that are ranked above any negative instances, and develop an efﬁcient algorithm, named as Top Push to solve related optimization problem. Compared with existing work on this topic, the proposed Top Push algorithm scales linearly in the number of training instances, which is in contrast to most existing algorithms for bipartite ranking whose time complexities dependents on the number of positive-negative instance pairs. Moreover, our theoretical analysis clearly shows that it will lead to a ranking function that places many positive instances the top of the ranked list. Empirical studies verify the theoretical claims: the Top Push algorithm is effective in maximizing the accuracy at the top and is signiﬁcantly more efﬁcient than the state-of-the-art algorithms for bipartite ranking. In the future, we plan to develop appropriate univariate loss, instead of pairwise ranking loss, for efﬁcient bipartite ranking that maximize accuracy at the top.

Acknowledgement This research was supported by the 973 Program (2014CB340501), NSFC (61333014), NSF (IIS-1251031), and ONR Award (N000141210431).

[1] S. Agarwal. The inﬁnite push: A new support vector ranking algorithm that directly optimizes accuracy at the absolute top of the list. In SDM, pages 839 850, 2011.

[2] S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, and D. Roth. Generalization bounds for the area under the ROC curve. JMLR, 6:393 425, 2005.

[3] S. Boyd, C. Cortes, M. Mohri, and A. Radovanovic. Accuracy at the top. In NIPS, pages 962 970. 2012.

[4] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[5] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In ICML, pages 89 96, 2005.

[6] S. Cl emenc on, G. Lugosi, and N. Vayatis. Ranking and empirical minimization of U-statistics. Annals of Statistics, 36(2):844 874, 2008.

[7] S. Cl emenc on and N. Vayatis. Ranking the best instances. JMLR, 8:2671 2699, 2007.

[8] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to algorithms. MIT Press, 2001.

[9] C. Cortes and M. Mohri. AUC optimization vs. error rate minimization. In NIPS, pages 313 320. 2004.

[10] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efﬁcient projections onto the ℓ1-ball for learning in high dimensions. In ICML, pages 272 279, 2008.

[11] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classiﬁcation. JMLR, 9:1871 1874, 2008.

[12] Y. Freund, R. Iyer, R. Schapire, and Y. Singer. An efﬁcient boosting algorithm for combining preferences. JMLR, 4:933 969, 2003.

[13] W. Gao, R. Jin, S. Zhu, and Z.-H. Zhou. One-pass AUC optimization. In ICML, pages 906 914, 2013.

[14] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 2.1. http: //cvxr.com/cvx, March 2014.

[15] J. Hanley and B. Mc Neil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29 36, 1982.

[16] R. Herbrich, T. Graepel, and K. Obermayer. Large Margin Rank Boundaries for Ordinal Regression, chapter Advances in Large Margin Classiﬁers, pages 115 132. MIT Press, Cambridge, MA, 2000.

[17] T. Joachims. A support vector method for multivariate performance measures. In ICML, pages 377 384, Bonn, Germany, 2005.

[18] T. Joachims. Training linear SVMs in linear time. In KDD, pages 217 226, 2006.

[19] T. Kanamori, A. Takeda, and T. Suzuki. Conjugate relation between loss functions and uncertainty sets in classiﬁcation problems. JMLR, 14:1461 1504, 2013.

[20] W. Kotlowski, K. Dembczynski, and E. H ullermeier. Bipartite ranking through minimization of univariate loss. In ICML, pages 1113 1120, 2011.

[21] Q.V. Le and A. Smola. Direct optimization of ranking measures. Co RR, abs/0704.3359, 2007.

[22] N. Li, R. Jin, and Z.-H. Zhou. Top rank optimization in linear time. Co RR, abs/1410.1462, 2014.

[23] N. Li, I. W. Tsang, and Z.-H. Zhou. Efﬁcient optimization of performance measures by classiﬁer adaptation. IEEE-PAMI, 35(6):1370 1382, 2013.

[24] J. Liu and J. Ye. Efﬁcient Euclidean projections in linear time. In ICML, pages 657 664, 2009.

[25] T.-Y. Liu. Learning to Rank for Information Retrieval. Springer, 2011.

[26] H. Narasimhan and S. Agarwal. On the relationship between binary classiﬁcation, bipartite ranking, and binary class probability estimation. In NIPS, pages 2913 2921. 2013.

[27] H. Narasimhan and S. Agarwal. A structural SVM based approach for optimizing partial AUC. In ICML, pages 516 524, 2013.

[28] H. Narasimhan and S. Agarwal. SVMtight p AUC: A new support vector method for optimizing partial AUC based on a tight convex upper bound. In KDD, pages 167 175, 2013.

[29] A. Nemirovski. Efﬁcient methods in convex programming. Lecture Notes, 1994.

[30] Y. Nesterov. Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, 2003.

[31] A. Rakotomamonjy. Sparse support vector inﬁnite push. In ICML, 2012.

[32] S. Rendle, L. Balby Marinho, A. Nanopoulos, and L. Schmidt-Thieme. Learning optimal ranking with tensor factorization for tag recommendation. In KDD, pages 727 736, 2009.

[33] C. Rudin and R. Schapire. Margin-based ranking and an equivalence between adaboost and rankboost. JMLR, 10:2193 2232, 2009.

[34] S. Shalev-Shwartz and Y. Singer. Efﬁcient learning of label ranking by soft projections onto polyhedra. JMLR, 7:1567 1599, 2006.

[35] S. Sun and J. Shawe-Taylor. Sparse semi-supervised learning using conjugate functions. JMLR, 11:2423 2455, 2010.

[36] A. Tewari and P. Bartlett. On the consistency of multiclass classiﬁcation methods. JMLR, 8:1007 1025, 2007.

[37] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 6:1453 1484, 2005.

[38] N. Usunier, D. Buffoni, and P. Gallinari. Ranking with ordered weighted pairwise classiﬁcation. In ICML, pages 1057 1064, Montreal, Canada, 2009.

[39] H. Valizadegan, R. Jin, R. Zhang, and J. Mao. Learning to rank by optimizing NDCG measure. In NIPS, pages 1883 1891. 2009.

[40] M. Xu, Y.-F. Li, and Z.-H. Zhou. Multi-label learning with PRO loss. In AAAI, pages 998 1004, 2013.

[41] T. Yang, Y.-F. Li, M. Mahdavi, R. Jin, and Z.-H. Zhou. Nystr om method vs random Fourier features: A theoretical and empirical comparison. In NIPS, pages 485 493. MIT Press, 2012.

[42] Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average precision. In SIGIR, pages 271 278, 2007.

[43] P. Zhao, S.C.H. Hoi, R. Jin, and T. Yang. Online AUC maximization. In ICML, pages 233 240, Bellevue, WA, 2011.