Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate

Mingrui Liu 1, Xiaoxuan Zhang 1, Zaiyi Chen 2, Xiaoyu Wang 3, Tianbao Yang 1

Abstract

In this paper, we consider statistical learning with AUC (area under ROC curve) maximization in the classical stochastic setting, where one random data point drawn from an unknown distribution is revealed at each iteration for updating the model. Although consistent convex surrogate losses for AUC maximization have been proposed to make the problem tractable, it remains a challenging problem to design fast optimization algorithms in the classical stochastic setting, since the convex surrogate loss depends on random pairs of examples from the positive and negative classes. Building on a saddle point formulation for a consistent square loss, this paper proposes a novel stochastic algorithm that improves the standard O(1/√n) convergence rate to Õ(1/n) without a strong convexity assumption or any favorable statistical assumptions (e.g., low noise), where n is the number of random samples. To the best of our knowledge, this is the first stochastic algorithm for AUC maximization with a statistical convergence rate as fast as O(1/n) up to a logarithmic factor. Extensive experiments on eight large-scale benchmark data sets demonstrate the superior performance of the proposed algorithm compared with existing stochastic and online algorithms for AUC maximization.

1. Introduction

Area under ROC curve (AUC) (Metz, 1978; Hanley & McNeil, 1982; 1983) is a commonly used metric for evaluating the performance of a classifier.
The ROC (receiver operating characteristic) curve is the "true positive rate vs. false positive rate" curve, and AUC represents the probability that examples in the positive class are scored higher than those in the negative class (Hanley & McNeil, 1982).

1Department of Computer Science, The University of Iowa, IA 52242, USA. 2University of Science and Technology of China. 3Intellifusion. Correspondence to: Mingrui Liu, Tianbao Yang. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Compared with the misclassification rate, AUC is more favorable in applications with imbalanced datasets and in which a real-valued classifier is used for ranking. However, algorithms designed to minimize the misclassification error rate may not lead to maximization of AUC (Cortes & Mohri, 2004). In a batch learning setting where all training data is given beforehand, one may formulate AUC maximization as a convex empirical optimization problem over the training data using a convex surrogate loss. The resulting optimization problem can be solved by existing algorithms (Herschtal & Raskutti, 2004; Cortes & Mohri, 2004; Ferri et al., 2011). However, such an approach has two limitations: (i) it is not applicable to many applications in which random data is received in a sequential manner (e.g., online advertisement); (ii) little is known about the statistical convergence rate of the empirical maximizer to the true maximizer, because the i.i.d. assumption does not hold for pairwise data. Therefore, a stochastic algorithm with a provable convergence rate in the classical stochastic setting for minimizing an expected convex surrogate loss for AUC maximization is desirable for addressing these two limitations. It is also useful for tackling large-scale data by one pass over the training data.
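As a concrete illustration of this probabilistic interpretation (our own sketch, not part of the paper), the empirical AUC of a scoring function is the fraction of positive-negative pairs ranked correctly, i.e., the Wilcoxon-Mann-Whitney statistic, with ties counted as one half:

```python
import numpy as np

def auc(scores, labels):
    """Empirical AUC: fraction of positive-negative pairs ranked correctly,
    counting ties as 1/2 (the Wilcoxon-Mann-Whitney statistic)."""
    pos = scores[labels == 1]
    neg = scores[labels == -1]
    diff = pos[:, None] - neg[None, :]          # all positive-negative pairs
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

scores = np.array([0.9, 0.8, 0.4, 0.3, 0.2])
labels = np.array([1, 1, -1, 1, -1])
print(auc(scores, labels))  # 5 of 6 pairs correctly ranked -> 0.8333...
```

This O(n+ · n−) pairwise form is exactly what makes the batch objective awkward in the one-sample-at-a-time stochastic setting discussed next.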
Nevertheless, the pairwise nature of the definition of AUC makes it challenging to design algorithms suitable for the classical stochastic setting. To address this challenge, several online and stochastic algorithms have been developed based on convex surrogate losses (Zhao et al., 2011; Gao et al., 2013; Ying et al., 2016). An interesting observation in (Ying et al., 2016) is that AUC maximization using a consistent square loss is equivalent to a stochastic min-max saddle point problem, for which a primal-dual style stochastic gradient algorithm can be employed to yield an Õ(1/√n) convergence rate for minimizing an expected square loss, where n is the number of samples. How to improve the convergence rate of stochastic optimization of AUC remains an open problem, though. Fast rates such as O(1/n) have been established for stochastic algorithms for expected convex risk minimization in the literature (Hazan & Kale, 2011a; Srebro et al., 2010; Bach & Moulines, 2013). However, these studies either impose strong assumptions about the problem (e.g., a strong convexity assumption, a low-noise assumption, etc.), or restrict their analysis to settings that are not applicable to AUC maximization. Therefore, a stochastic algorithm with a provable fast rate of O(1/n) without imposing strong assumptions should be considered a significant contribution for AUC maximization. The proposed algorithm, referred to as FSAUC, is a Fast Stochastic algorithm for AUC maximization. It is based on the min-max saddle point formulation observed in (Ying et al., 2016). However, different from (Ying et al., 2016), we develop a novel multi-stage scheme for running a primal-dual stochastic gradient method with adaptively changing parameters and initial solutions for both primal and dual variables.
The adaptive multi-stage scheme not only leverages the primal solution from previous stages but also utilizes an empirical estimation of the feature vectors of examples from the positive and negative classes for initializing the dual variables. The convergence analysis of the proposed algorithm hinges on a quadratic growth property of a reformulated AUC objective and a novel synthesis of the adaptive scheme and the convergence result of the primal-dual stochastic gradient method. To summarize, the major contributions of this work are:

- We propose a novel fast stochastic algorithm that can be run in the classical stochastic setting for maximizing AUC with a known number of total samples n.
- We establish an Õ(1/n) convergence rate for the proposed algorithm, where n is the total number of samples and the Õ(·) notation hides logarithmic factors. This is the first Õ(1/n) convergence result for stochastic AUC optimization.
- We evaluate the proposed algorithm on eight large-scale benchmark datasets. The results show that our algorithm significantly outperforms two state-of-the-art stochastic/online AUC methods, namely the SOLAM algorithm (Ying et al., 2016) and the OPAUC algorithm (Gao et al., 2013).

2. Related Work

(Zhao et al., 2011) is probably the first work on online maximization of AUC. In order not to store all received positive and negative examples, they proposed to use reservoir sampling to maintain a buffer of positive and negative examples. At each iteration, they run a gradient update based on an online empirical version of AUC defined on the buffered data. A regret bound was established with the cost function at each iteration t defined based on the example received at the t-th iteration and all data in the buffer. Even with the optimal buffer size, their regret bound is worse than √n. Gao et al. (2013) proposed another online algorithm without resorting to reservoir sampling.
Their algorithm hinges on the observation that, if a consistent square loss is used, the gradient of an online empirical version of AUC can be computed from first- and second-order statistics of the received data. Hence, their algorithm needs to maintain a covariance matrix of received examples, which renders it impractical for high-dimensional data. Although a randomized version was proposed for maintaining a low-rank approximation of the covariance matrix, it still has considerable computational overhead. In terms of guarantees, they established a regret bound of order √T in the general case and a possible O(1) regret bound under a low-noise condition. However, these regret bounds do not directly imply a convergence rate for statistical AUC maximization, because the data that define each cost function are not independent. The stochastic algorithm proposed in the recent work (Ying et al., 2016) is the first with a convergence guarantee for statistical AUC optimization, and also the first that stores neither historical examples nor their covariance matrix. As mentioned above, their algorithm is based on a novel min-max saddle point formulation of AUC maximization and has a convergence rate of Õ(1/√n). Fast rates of stochastic optimization such as O(1/n) have been studied for standard classification and regression problems under certain conditions. For example, Hazan & Kale (2011a) proposed a method with an O(1/n) convergence rate under a (weak) strong convexity assumption. Their algorithm requires knowing the strong convexity parameter. For statistical learning problems, Srebro et al. (2010) established an O(1/n) convergence rate for smooth loss functions under a low-noise assumption (e.g., the optimal risk value is close to zero). However, their algorithm requires knowing a good estimate of the optimal risk value.
Bach & Moulines (2013) established the first O(1/n) convergence rate of stochastic algorithms for minimizing the expected square loss for regression and the expected logistic loss for classification without the strong convexity assumption. However, none of these algorithms and analyses are applicable to AUC maximization in the classical stochastic setting. Finally, we note that the multi-stage scheme of the proposed algorithm is similar to those developed in (Hazan & Kale, 2011b; Ghadimi & Lan, 2013; Juditsky et al., 2014; Xu et al., 2017) for minimizing strongly or uniformly convex functions, or problems with error bound conditions. However, the differences between this work and those works are that (i) our development is based on a primal-dual stochastic method for solving a stochastic saddle point problem of AUC maximization with particular treatment of the dual variables, while their algorithms solve stochastic convex minimization problems; (ii) we do not require a strong convexity assumption with known parameters as in (Hazan & Kale, 2011b; Ghadimi & Lan, 2013); (iii) we do not assume a uniform strong convexity assumption as in (Juditsky et al., 2014). Instead, we prove a quadratic growth condition for the studied problem, which is a special case of the error bound condition studied in (Xu et al., 2017). However, the stochastic algorithms proposed in (Xu et al., 2017) require a target accuracy level and cannot be applied to the one-pass learning setting for large-scale data.

3. Algorithm and Main Result

3.1. Preliminaries and Notations

Let z = (x, y) ∼ P denote a random data point following an unknown distribution P, where x ∈ X ⊆ ℝ^d represents the feature vector and y ∈ {+1, −1} represents the label. Denote Z = X × {+1, −1} and p = Pr(y = 1) = E_y[I[y=1]], where I is the indicator function. We assume that sup_{x∈X} ‖x‖₂ ≤ κ.
Let x̄ ∈ ℝ^{d+2} denote the augmented feature vector whose last two components are 0, and let B(x₀, r) = {x : ‖x − x₀‖₂ ≤ r} be the ℓ2-ball centered at x₀ with radius r. Given a score function h : ℝ^d → ℝ, the AUC at the population level (referred to as the true AUC in this paper) is defined as

AUC(h) = Pr(h(x) ≥ h(x′) | y = 1, y′ = −1),

where z = (x, y) and z′ = (x′, y′) are a pair of random data points. The AUC maximization problem is to find an optimal score function in a hypothesis class such that the AUC is maximized. Since this problem is non-convex, it is usually solved by using a consistent convex surrogate loss. A common choice in previous studies (Ying et al., 2016; Gao et al., 2013) is the square loss. In this paper, we consider learning a linear function h(x) = w⊤x to maximize the AUC using the square loss, i.e.,

min_{w∈ℝ^d} L(w) := E_{z,z′}[(1 − w⊤(x − x′))² | y = 1, y′ = −1].  (1)

Since the loss function depends on a random pair of data points, which makes it difficult to handle in the classical stochastic setting, a solution is to cast the above problem into an equivalent saddle point problem (Ying et al., 2016):

min_{w∈ℝ^d, (a,b)∈ℝ²} max_{α∈ℝ} { f(w, a, b, α) := E_z[F(w, a, b, α; z)] },

where

F(w, a, b, α; z) = (1 − p)(w⊤x − a)² I[y=1] + p(w⊤x − b)² I[y=−1] − p(1 − p)α² + 2(1 + α)(p w⊤x I[y=−1] − (1 − p) w⊤x I[y=1]).

Assuming the optimal solution w* lies in a bounded domain such that ‖w*‖₁ ≤ R, we can restrict the primal and dual variables to the constrained domains Ω1 = {(w, a, b) : ‖w‖₁ ≤ R, |a| ≤ Rκ, |b| ≤ Rκ} and Ω2 = {α ∈ ℝ : |α| ≤ 2Rκ}, i.e.,

min_{(w,a,b)∈Ω1} max_{α∈Ω2} { f(w, a, b, α) := E_z[F(w, a, b, α; z)] }.  (2)

Note that adding the ℓ1-ball constraint on w does not restrict the performance of the learned model because (i) if R is large enough that the optimal solution w* to (1) satisfies ‖w*‖₁ ≤ R, then w* is also an optimal solution to (2)²; (ii) for a finite number of samples, the constraint can serve as regularization that improves generalization performance.
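To see the equivalence concretely, note that for a fixed w the inner minimization over (a, b) and maximization over α have closed forms: a* and b* are the class-conditional means of w⊤x, and α* = b* − a*. Plugging them in, a direct expansion gives min_{a,b} max_α f(w, a, b, α) = p(1−p)(L(w) − 1). The following numerical check on synthetic data is our own sketch (not code from the paper; all names are illustrative), evaluating both sides empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 5
X = rng.normal(size=(n, d))
y = np.where(rng.random(n) < 0.3, 1, -1)   # ~30% positives
w = rng.normal(size=d)

s = X @ w                                   # scores w^T x
pos, neg = s[y == 1], s[y == -1]
p_hat = pos.size / n

# Empirical pairwise square loss L(w): average of (1 - (s_i - s_j))^2
# over positive/negative pairs
L = np.mean((1.0 - (pos[:, None] - neg[None, :])) ** 2)

# Empirical average of F(w, a, b, alpha; z) at the closed-form optimizers
# a* = mean_+(w^T x), b* = mean_-(w^T x), alpha* = b* - a*
a, b = pos.mean(), neg.mean()
alpha = b - a
F = ((1 - p_hat) * np.sum((pos - a) ** 2) / n
     + p_hat * np.sum((neg - b) ** 2) / n
     - p_hat * (1 - p_hat) * alpha ** 2
     + 2 * (1 + alpha) * (p_hat * np.sum(neg) / n
                          - (1 - p_hat) * np.sum(pos) / n))

# Identity: min_{a,b} max_alpha f = p(1-p) (L(w) - 1)
assert abs(F - p_hat * (1 - p_hat) * (L - 1)) < 1e-8
```

The key point is that the saddle objective decouples over single examples z, so an unbiased stochastic gradient is available from one sample at a time, unlike the pairwise loss (1).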
We use an ℓ1-ball constraint because it allows us to show that the proposed stochastic optimization algorithm has a convergence rate of Õ(1/n). Next, we present the proposed algorithm and its convergence results. We also provide a convergence analysis of a core component of the proposed algorithm.

3.2. Algorithm and its Convergence

The proposed algorithm, referred to as FSAUC, is presented in Algorithm 1. The parameters R₀, G, D₀, β₀ will be explained shortly. The algorithm divides the updates into m stages, and each stage performs n₀ = ⌊n/m⌋ updates. Each stage of FSAUC runs the primal-dual style stochastic gradient (PDSG) method outlined in Algorithm 2, which is similar to that in (Ying et al., 2016) except for three differences: (i) the step size is a constant within each call of PDSG; (ii) each update of the primal variable v = (w⊤, a, b)⊤ and the dual variable α is projected onto the intersection of its constrained domain and an ℓ2-ball centered at the initial solution v1 or α1; (iii) upon receiving a random data point zt, the variables Â±, T±, p̂ are updated according to Algorithm 3, where T± denote the numbers of positive/negative examples received so far, Â± denote the cumulative (augmented) feature vectors for the positive and negative classes, and p̂ denotes the estimated positive ratio based on the received examples. The parameters for each call of PDSG change adaptively, including the initial solutions v1, α1, the radii of the ℓ2-balls for the primal and dual variables, and the step size η. The initial parameter R₀ is a bound on the Euclidean distance from v̂₀ to the optimal set of (2) in terms of v. The initial parameter D₀ sets the constrained domain of the dual variable. The initial parameter β₀ sets the initial step size. The parameter G is an upper bound on the Euclidean norms of G(u, z), Ĝt(u, zt), and g(u) for any u ∈ Ω1 × Ω2 and z, zt ∈ Z, which are defined in (3).
The function F̂t in lines 5 and 6 of Algorithm 2,

F̂t(v, α, zt) = (1 − p̂)(w⊤xt − a)² I[yt=1] + p̂(w⊤xt − b)² I[yt=−1] − p̂(1 − p̂)α² + 2(1 + α)(p̂ w⊤xt I[yt=−1] − (1 − p̂) w⊤xt I[yt=1]),

is an estimate of F(v, α, zt) using the current estimate p̂ of p.

²This can be shown as follows: if ‖w‖₁ ≤ R, then the optimal solutions for a, b, α satisfy the prescribed bounds, in light of their closed-form solutions depending on w.

Algorithm 1 FSAUC
1: Set m = ⌊(1/2) log₂(2n/log₂ n)⌋ − 1, n₀ = ⌊n/m⌋, R₀ = √(1 + 2κ²) R, G > max{(1 + 4κ)κ(R + 1), 2κ(2R + 1 + 2Rκ), 2κ(4κR + 11R + 1)}, β₀ = 1 + 8κ², D₀ = 2√2 κR₀
2: Initialize v̂₀ = 0 ∈ ℝ^{d+2}, α̂₀ = 0
3: for k = 1, ..., m do
4:   Set ηk = √(β_{k−1}/(3n₀)) · R_{k−1}/G
5:   Call PDSG to obtain (v̂k, α̂k, βk, Rk, Dk) = PDSG(v̂_{k−1}, α̂_{k−1}, R_{k−1}, D_{k−1}, n₀, ηk)
6: end for
7: return v̂m

It is worth mentioning that lines 5 and 6 of Algorithm 2 require a projection onto the intersection of two convex sets. We can use the alternating projection algorithm to generate a sequence that converges linearly to the desired point. Indeed, it is easy to show that the two sets in our case are boundedly linearly regular (Definition 3.11 of (Bauschke & Borwein, 1993)), so linear convergence is guaranteed by Corollary 3.14 of (Bauschke & Borwein, 1993). Finally, the convergence rate of FSAUC is presented in the following theorem.

Theorem 1. Given δ ∈ (0, 1), assume n is sufficiently large such that n > max{100, 32m ln(12/δ)/(min(p, 1 − p))²}. Then with probability at least 1 − δ,

max_{α∈Ω2} f(v̂m, α) − min_{v∈Ω1} max_{α∈Ω2} f(v, α) ≤ Õ(ln(1/δ)/n),

where Õ(·) suppresses a logarithmic factor of log(n) and constants of the problem independent of n.

Remark: In practice, we can maintain global variables Â+, Â−, p̂ and use them for updating parameters and initializing the dual variables. The local versions used in Algorithm 2 are just for simplicity of the analysis.
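The projection onto an intersection such as Ω1 ∩ B(v1, r) can be approximated by alternating projections between the two sets, as discussed above. Below is a minimal sketch (our own illustration, not the paper's code), using the standard sorting-based Euclidean projection onto the ℓ1-ball (Duchi et al., 2008). Note that plain alternating projections return a point in the intersection; recovering the exact nearest point would require Dykstra's variant.

```python
import numpy as np

def proj_l2_ball(v, center, r):
    """Euclidean projection onto the l2-ball B(center, r)."""
    d = v - center
    nrm = np.linalg.norm(d)
    return v if nrm <= r else center + (r / nrm) * d

def proj_l1_ball(v, R):
    """Euclidean projection onto {w : ||w||_1 <= R} via sorting."""
    if np.abs(v).sum() <= R:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    # largest index rho (0-based) with u_rho * (rho+1) > css_rho - R
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css - R)[0][-1]
    theta = (css[rho] - R) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def proj_intersection(v, R, center, r, iters=100):
    """Alternating projections between the l1-ball (radius R) and the
    l2-ball B(center, r); converges linearly when the sets are
    boundedly linearly regular (Bauschke & Borwein, 1993)."""
    for _ in range(iters):
        v = proj_l2_ball(proj_l1_ball(v, R), center, r)
    return v
```

For example, `proj_intersection(np.array([3.0, -1.0]), 1.0, np.zeros(2), 10.0)` returns a point satisfying both ‖·‖₁ ≤ 1 and the ℓ2-ball constraint.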
The above theorem implies the convergence of AUC maximization in terms of the square loss.

Corollary 1. Under the same condition as in Theorem 1, with probability at least 1 − δ,

L(ŵm) − min_{‖w‖₁≤R} L(w) ≤ Õ(ln(1/δ)/n).

Algorithm 2 PDSG(v1, α1, r, D, T, η)
1: Initialize Â+ ∈ ℝ^{d+2}, Â− ∈ ℝ^{d+2}, T+, T−, p̂ ∈ ℝ as zeros
2: for t = 1, ..., T do
3:   Receive a sample zt = (xt, yt)
4:   Update Â±, T±, p̂ using the data zt (Algorithm 3)
5:   v_{t+1} = Π_{Ω1 ∩ B(v1, r)}(vt − η ∇_v F̂t(vt, αt, zt))
6:   α_{t+1} = Π_{Ω2 ∩ B(α1, D)}(αt + η ∇_α F̂t(vt, αt, zt))
7: end for
8: Compute v̄_T = (1/T) Σ_{t=1}^T vt and α̂ = (Â−/T− − Â+/T+)⊤ v̄_T
9: Let r ← r/2
10: Update β, D according to Lemma 1
11: return (v̄_T, α̂, β, r, D)

Algorithm 3 Update Â±, T±, p̂ given a data point (xt, yt)
1: Â+ = Â+ + I[yt=1] x̄t
2: Â− = Â− + I[yt=−1] x̄t
3: T+ = T+ + I[yt=1]
4: T− = T− + I[yt=−1]
5: p̂ = T+/(T+ + T−)

4. Convergence Analysis

This section is devoted to the convergence analysis. Due to space limitations, we present a detailed analysis of a key lemma; more details of the proof of the main theorem can be found in the supplement. Before the analysis, we first give some notation that will be used frequently in this section:

v = (w⊤, a, b)⊤ ∈ ℝ^{d+2},  u = (v⊤, α)⊤ ∈ ℝ^{d+3},
G(u, z) = (∇_v F(v, α; z)⊤, −∇_α F(v, α; z))⊤,
Ĝt(u, zt) = (∇_v F̂t(v, α; zt)⊤, −∇_α F̂t(v, α; zt))⊤,
g(u) = (∇_v f(v, α)⊤, −∇_α f(v, α))⊤.  (3)

Lemma 1. For each call of Algorithm 2, update D, β according to

D = 2√2 κr + 4√(2 ln(12/δ)) (1 + 2κ)R / √(min(p̂, 1 − p̂) T),
β = (1 + 8κ²) + 32κ²(1 + 2κ)²(√2 + √(ln(12/δ)))² R² / (min(p̂, 1 − p̂) T r²),

where p̂ is an estimate of p. Suppose ‖v1 − v*‖₂ ≤ r, where v* ∈ Ω1 is the optimal solution closest to v1, and T > max{R²/r², 32 ln(12/δ)/(min(p, 1 − p))²}; then

max_{α∈Ω2} f(v̄_T, α) − min_{v∈Ω1} max_{α∈Ω2} f(v, α) ≤ (2√(3γ1) + γ2) Gr/√T,

where
γ1 = (1 + 8κ²) + 32κ²(1 + 2κ)²(√2 + √(ln(12/δ)))² R² / ((min(p, 1 − p)/2) T r²),
γ2 = 16(1 + 2√2 κ + 8κ(√2 + √(ln(12/δ)))/√((min(p, 1 − p)/2) T)).

Remark: When n is sufficiently large as stated in the main theorem, √(log(2/δ)/T) < min(p̂, 1 − p̂) by Hoeffding's inequality, since p̂ is estimated using n₀ examples.
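The bookkeeping of Algorithm 3 is cheap to maintain in a stream: one pass and O(d) memory. A minimal sketch of the update (our own illustrative code; the class and attribute names are hypothetical):

```python
import numpy as np

class StreamStats:
    """Running statistics in the spirit of Algorithm 3: cumulative augmented
    feature sums per class (A_plus, A_minus), class counts, and the
    estimated positive ratio p_hat."""
    def __init__(self, d):
        self.A_plus = np.zeros(d + 2)    # augmented with two trailing zeros
        self.A_minus = np.zeros(d + 2)
        self.T_plus = 0
        self.T_minus = 0

    def update(self, x, y):
        x_aug = np.concatenate([x, [0.0, 0.0]])
        if y == 1:
            self.A_plus += x_aug
            self.T_plus += 1
        else:
            self.A_minus += x_aug
            self.T_minus += 1

    @property
    def p_hat(self):
        return self.T_plus / (self.T_plus + self.T_minus)
```

From these statistics one can form class-conditional mean estimates (Â±/T±) for initializing the dual variable at the start of each stage.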
The above result implies that the proposed algorithm has at least an O(1/√n) convergence rate. Based on the above lemma, Algorithm 1 uses a geometrically decreasing radius r in order to achieve faster convergence. The main theorem is proved by further utilizing the following property.

Lemma 2. f1(v) = max_{α∈Ω2} f(v, α) restricted to the set Ω1 satisfies a quadratic growth condition, i.e., there exists c > 0 such that for any v ∈ Ω1,

‖v − v*‖₂ ≤ c (f1(v) − min_{v∈Ω1} f1(v))^{1/2},

where v* is the optimal solution to min_{v∈Ω1} f1(v) that is closest to v.

Remark: Notably, the proposed algorithm FSAUC does not require the value of c, which is difficult to compute.

4.1. Proof of Lemma 1

To prove this lemma, we need the following lemma, whose proof is included in the supplement.

Lemma 3. Let Â = Â−/T− − Â+/T+ and A = ((E[x|y = −1] − E[x|y = 1])⊤, 0, 0)⊤. After the k-th call of Algorithm 2 with k ≥ 1, with probability 1 − δ,

‖Â − A‖₂ ≤ 4κ(√2 + √(ln(12/δ)))/ξ,  where ξ = √(min(p̂, 1 − p̂) T).

Proof of Lemma 1. By the setting of G, we have max_{t,ut,zt} {‖g(ut)‖₂, ‖G(ut, zt)‖₂, ‖Ĝt(ut, zt)‖₂} ≤ G. Let α*,T = argmax_α f(v̄_T, α) = w̄_T⊤[E(x|y = −1) − E(x|y = 1)]. Following the standard analysis of the primal-dual update (e.g., see inequality 17 in (Ying et al., 2016)), we have

max_{α∈Ω2} f(v̄_T, α) − min_{v∈Ω1} max_{α∈Ω2} f(v, α) ≤ f(v̄_T, α*,T) − f(v*, ᾱ_T)
≤ ‖v1 − v*‖₂²/(2ηT) + |α1 − α*,T|²/(2ηT) + ηG² + (1/T) Σ_{t=1}^T (ut − u*)⊤(g(ut) − G(ut, zt)) + (1/T) Σ_{t=1}^T (ut − u*)⊤(G(ut, zt) − Ĝt(ut, zt))
= I + II + III + IV + V,

where v* is the optimal solution of (2) closest to v1, and the first inequality follows from the fact that min_{v∈Ω1} max_{α∈Ω2} f(v, α) ≥ f(v*, ᾱ_T). Now we bound the five terms. It is obvious that I = ‖v1 − v*‖₂²/(2ηT) ≤ r²/(2ηT). Let Â_{k−1}, p̂_{k−1}, ξ_{k−1} be the counterparts of the quantities in Lemma 3 estimated from the data in the (k−1)-th stage. According to the setting of α in Algorithm 1, for the first call of Algorithm 2 we have α1 = A⊤v1 due to v̂₀ = 0, and for the k-th call of Algorithm 2 with k > 1 we have α1 = Â_{k−1}⊤v1.
Then, for the first call of Algorithm 2 we have II = |A⊤v1 − A⊤v̄_T|²/(2ηT), and for the other calls we have

II = |Â_{k−1}⊤v1 − A⊤v̄_T|²/(2ηT) ≤ (2|Â_{k−1}⊤(v1 − v̄_T)|² + 2|(Â_{k−1} − A)⊤v̄_T|²)/(2ηT).

By the Cauchy-Schwarz inequality,

max{|A⊤(v1 − v̄_T)|², |Â_{k−1}⊤(v1 − v̄_T)|²} ≤ 4κ²‖v1 − v̄_T‖₂²,  |(Â_{k−1} − A)⊤v̄_T|² ≤ ‖Â_{k−1} − A‖₂² ‖v̄_T‖₂².

Combining with the result in Lemma 3, with probability 1 − δ/3,

II ≤ 8κ²r²/(2ηT) + I[k>1] · 32κ²(√2 + √(ln(12/δ)))²(1 + 2κ)²R²/(ξ_{k−1}² · 2ηT).

Similarly, we can bound |α1 − α*,T| by 2√2 κr for the first call, and by

|α1 − α*,T| ≤ 2√2 κr + 4(√2 + √(ln(12/δ)))(1 + 2κ)R/√(min(p̂_{k−1}, 1 − p̂_{k−1}) n₀)

for the k-th call with k > 1. According to the initial values of D and r for each call, we have |α1 − α*,T| ≤ D. Next, we bound the last two terms similarly to (Ying et al., 2016). Define ũ1 = u1 ∈ (Ω1 ∩ B(v1, r)) × (Ω2 ∩ B(α1, D)) and

ũ_{t+1} = Π_{(Ω1∩B(v1,r)) × (Ω2∩B(α1,D))}(ũt − η(g(ut) − G(ut, zt))).

Then we have with probability at least 1 − δ/6,

Σ_{t=1}^T η(ũt − u*)⊤(g(ut) − G(ut, zt)) ≤ ‖ũ1 − u*‖₂²/2 + (1/2) Σ_{t=1}^T η²‖g(ut) − G(ut, zt)‖₂² ≤ (‖v1 − v*‖₂² + |α1 − α*,T|²)/2 + 2η²TG².

Note that both ut and ũt are measurable with respect to Ft = σ(z1, ..., z_{t−1}), so {St := η(ut − ũt)⊤(g(ut) − G(ut, zt)) : t = 1, ..., T} is a martingale difference sequence, and for any t we have |St| ≤ 2ηG‖ut − ũt‖₂ ≤ 2ηG(2r + 2D). Then, by the Azuma-Hoeffding inequality, with probability at least 1 − δ/6,

Σ_{t=1}^T η(ut − ũt)⊤(g(ut) − G(ut, zt)) ≤ 2ηG(2r + 2D)√(2T ln(6/δ)).

Hence, with probability 1 − 2δ/3,

I + II + III + IV ≤ (1 + 8κ²)r²/(2ηT) + I[k>1] · 16κ²(√2 + √(ln(12/δ)))²(1 + 2κ)²R²/(ξ_{k−1}² ηT) + 2ηG² + 4G(r + D)√(2 ln(6/δ)/T).

Next we bound V.
According to Lemma 3 of (Ying et al., 2016),

V ≤ sup_t (‖ũt − u1‖₂ + ‖u1 − u*‖₂) · sup_{u∈Ω, z∈Z} ‖Ĝt(u, z) − G(u, z)‖₂ ≤ 4(r + D) · 2κ(4κR + 11R + 1) · (1/T) Σ_{t=1}^T 1/√t ≤ 8(r + D)G/√T,

where the last inequality holds since Σ_{t=1}^T 1/√t ≤ 2√T and 2κ(4κR + 11R + 1) ≤ G. Combining the bounds on the five terms, by the union bound, with probability at least 1 − δ we have

max_{α∈Ω2} f(v̄_T, α) − min_{v∈Ω1} max_{α∈Ω2} f(v, α) ≤ ζ1 r²/(ηT) + 3ηG² + ζ2 rG/√T,

where
ζ1 = (1 + 8κ²) + I[k>1] · 32κ²(1 + 2κ)²(√2 + √(ln(12/δ)))² R²/(ξ_{k−1}² r²),
ζ2 = 16(1 + 2√2 κ + I[k>1] · 8κ(√2 + √(ln(12/δ)))/ξ_{k−1}),
ξ_{k−1} = √(min(p̂_{k−1}, 1 − p̂_{k−1}) n₀).

Moreover, by choosing η = √(ζ1/(3T)) · r/G, we have with probability at least 1 − δ,

max_{α∈Ω2} f(v̄_T, α) − min_{v∈Ω1} max_{α∈Ω2} f(v, α) ≤ (2√(3ζ1) + ζ2) Gr/√T.

By the Hoeffding inequality, when T ≥ 32 ln(12/δ)/(min(p, 1 − p))² we have min(p̂_{k−1}, 1 − p̂_{k−1}) ≥ min(p, 1 − p)/2 with high probability, which gives ζ1 ≤ γ1 and ζ2 ≤ γ2, and the conclusion follows.

Figure 1. AUC-Iteration curves of FSAUC and the baselines (panels: a9a, ijcnn1, covtype, HIGGS, w8a, real-sim, rcv1 binary, news20.binary).

5. Experiments

In this section, we conduct experiments to compare our FSAUC algorithm with two state-of-the-art stochastic/online AUC optimization methods, namely SOLAM (Ying et al., 2016) and OPAUC (Gao et al., 2013).
To make the comparison fair, we implement two variants of SOLAM using different norm constraints on w: SOLAM with the ℓ2-norm constraint (the original one), referred to as SOLAM, and SOLAM with the ℓ1-norm constraint, referred to as SOLAM-l1. We use eight large-scale benchmark datasets from the LIBSVM website (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), ranging from high-dimensional to low-dimensional and from balanced to imbalanced class distributions. The statistics of these datasets are summarized in Table 1. We randomly divide each dataset into three sets: training, validation, and testing. For the a9a and w8a datasets, we randomly split the given testing set into half validation and half testing. For the datasets that do not explicitly provide a testing set, we randomly split the entire data into 4:1:1 for training, validation, and testing. For the ijcnn1 and rcv1 binary datasets, although a test set is given, the size of the training set is relatively small; thus we first combine the training and test sets and then follow the above procedure to split them. The parameters of each algorithm are tuned on the validation data. FSAUC has two parameters, R and G. R is chosen from the range 10^[−1:1:5]. G affects the step size of each stage (Algorithm 1, lines 4-5). Since η1 = √(β0/(3n0)) R0/G and η_{k+1} = (√β_k/(2√β_{k−1})) η_k, we equivalently tune η1 ∈ 2^[−10:1:10]. As for SOLAM, following the same strategy as the original paper (Ying et al., 2016), we tune R in 10^[−1:1:5] and the learning rate in 2^[−10:1:10].

Table 1. Statistics of Datasets

Datasets        #Training   #Testing   #Features  % of Pos
a9a                32,561      8,141         123      24.08
ijcnn1             94,460     23,616          22       9.49
covtype binary    387,341     96,836          54      51.19
HIGGS           7,333,333  1,833,334          28      52.98
w8a                49,749      7,476         300       2.97
real-sim           48,206     12,052      20,958      30.68
rcv1 binary       465,094    116,274      47,236      52.41
news20.binary      13,330      3,333   1,355,191      50.29
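The 4:1:1 splitting procedure described above can be done with a single random permutation of indices; a minimal sketch (the function name and seed are our own, for illustration):

```python
import numpy as np

def split_4_1_1(n, seed=0):
    """Random 4:1:1 split of n examples into train/validation/test index
    sets, as used for datasets that ship without an explicit test set."""
    idx = np.random.default_rng(seed).permutation(n)
    n_tr = (4 * n) // 6
    n_va = n // 6
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

train_idx, val_idx, test_idx = split_4_1_1(600)
```

Fixing the seed per run makes the five random runs reported below reproducible while still varying the split across runs.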
OPAUC has two versions corresponding to different ways of maintaining the covariance matrix: the full version and an approximate version designed for high-dimensional data. On the five low-dimensional datasets (a9a, ijcnn1, covtype binary, HIGGS, w8a), we use the full version of OPAUC, since the dimension is low enough that the low-rank approximation method, which is less accurate and more complicated, is unnecessary. For the three high-dimensional datasets (real-sim, rcv1 binary, news20.binary), we set the rank τ = 50, the same as in the original paper (Gao et al., 2013). There are two parameters shared by both versions of OPAUC, the step size η and the regularization parameter λ. Following the suggestion of the original paper (Gao et al., 2013), we tune η ∈ 2^[−12:1:−4] and λ ∈ 2^[−10:1:0].

Figure 2. AUC-Time curves of FSAUC and the baselines (panels: a9a, ijcnn1, covtype, HIGGS, w8a, real-sim, rcv1 binary, news20.binary).

Table 2. Averaged final AUC on testing data (mean ± standard deviation over 5 runs)

Datasets        FSAUC              SOLAM              SOLAM-l1           OPAUC
a9a             .900782 ± .000197  .899884 ± .000358  .898382 ± .000973  .900350 ± .000450
ijcnn1          .923230 ± .001310  .917147 ± .002526  .921682 ± .001249  .922163 ± .002296
covtype binary  .824046 ± .000242  .823440 ± .000489  .819894 ± .001434  .822669 ± .000973
HIGGS           .683727 ± .000093  .683064 ± .000280  .683526 ± .000086  .681697 ± .000401
w8a             .962703 ± .001485  .945508 ± .011601  .949083 ± .004182  .952536 ± .002712
real-sim        .991862 ± .000243  .990473 ± .000111  .979830 ± .000911  .988357 ± .000147
rcv1 binary     .995448 ± .000052  .993601 ± .000044  .994767 ± .000071  .990634 ± .000039
news20.binary   .975428 ± .000513  .956472 ± .000957  .937493 ± .006501  .967471 ± .000721

Each algorithm updates the model with one pass over the training data, and the models at different iterations are evaluated by AUC computed on the testing data to demonstrate the (testing) convergence speed of the different algorithms. We report results on the testing sets averaged over 5 random runs over the training data. The convergence curves of the considered algorithms are plotted in Figure 1 and Figure 2 in terms of the number of iterations and training time, respectively. The final AUC with the standard deviation over 5 runs is summarized in Table 2. From Figure 1, we can see that the proposed FSAUC converges faster than the two state-of-the-art methods, which is consistent with our theory. For convergence in terms of running time, shown in Figure 2, the proposed FSAUC also has the best performance.

6. Conclusion

In this paper, we have proposed a novel stochastic algorithm for AUC maximization in the classical stochastic setting, where one random data point is received at each iteration. We theoretically analyze the proposed algorithm and derive a fast convergence rate of Õ(1/n), a significant improvement over the best rate achieved by the current state-of-the-art method, Õ(1/√n).
We have also verified the efficiency of our algorithm by experiments on multiple benchmark datasets, and the results show that our algorithm converges faster than two strong baseline algorithms in terms of both the number of iterations and running time. For future work, we will consider fast stochastic algorithms for AUC maximization based on different surrogate losses. One may also consider extending the proposed algorithm to learning to rank from pairwise data, in which the label of each data point may take more than the two possible values of binary classification.

Acknowledgments

We thank the anonymous reviewers for their helpful comments. M. Liu, X. Zhang, and T. Yang are partially supported by the National Science Foundation (IIS-1545995).

References

Bach, F. R. and Moulines, E. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems 26 (NIPS), pp. 773-781, 2013.

Bauschke, H. H. and Borwein, J. M. On the convergence of von Neumann's alternating projection algorithm for two sets. Set-Valued Analysis, 1(2):185-212, 1993.

Cortes, C. and Mohri, M. AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems, pp. 313-320, 2004.

Ferri, C., Hernandez-Orallo, J., and Flach, P. A. A coherent interpretation of AUC as a measure of aggregated classification performance. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 657-664, 2011.

Gao, W., Jin, R., Zhu, S., and Zhou, Z.-H. One-pass AUC optimization. In International Conference on Machine Learning, pp. 906-914, 2013.

Ghadimi, S. and Lan, G. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061-2089, 2013.

Hanley, J. A. and McNeil, B. J. The meaning and use of the area under a receiver operating characteristic (ROC) curve.
Radiology, 143(1):29-36, 1982.

Hanley, J. A. and McNeil, B. J. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology, 148(3):839-843, 1983.

Hazan, E. and Kale, S. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. Journal of Machine Learning Research - Proceedings Track, 19:421-436, 2011a.

Hazan, E. and Kale, S. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory (COLT), pp. 421-436, 2011b.

Herschtal, A. and Raskutti, B. Optimising area under the ROC curve using gradient descent. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 49. ACM, 2004.

Juditsky, A., Nesterov, Y., et al. Deterministic and stochastic primal-dual subgradient algorithms for uniformly convex minimization. Stochastic Systems, 4(1):44-80, 2014.

Metz, C. E. Basic principles of ROC analysis. In Seminars in Nuclear Medicine, volume 8, pp. 283-298. Elsevier, 1978.

Srebro, N., Sridharan, K., and Tewari, A. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems (NIPS), pp. 2199-2207, 2010.

Xu, Y., Lin, Q., and Yang, T. Stochastic convex optimization: faster local growth implies faster global convergence. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 3821-3830, 2017.

Ying, Y., Wen, L., and Lyu, S. Stochastic online AUC maximization. In Advances in Neural Information Processing Systems, pp. 451-459, 2016.

Zhao, P., Jin, R., Yang, T., and Hoi, S. C. Online AUC maximization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 233-240, 2011.