# Stochastic Online AUC Maximization

Yiming Ying¹, Longyin Wen², Siwei Lyu²
¹ Department of Mathematics and Statistics, SUNY at Albany, Albany, NY 12222, USA
² Department of Computer Science, SUNY at Albany, Albany, NY 12222, USA

## Abstract

Area under the ROC curve (AUC) is a metric widely used to measure classification performance on imbalanced data. It is of theoretical and practical interest to develop online learning algorithms that maximize AUC for large-scale data. A specific challenge in developing an online AUC maximization algorithm is that the learning objective is defined over pairs of training examples of opposite classes, so existing methods achieve online processing only at the cost of higher space and time complexity. In this work, we propose a new stochastic online algorithm for AUC maximization. In particular, we show that AUC optimization can be equivalently formulated as a convex-concave saddle point problem. From this saddle point representation, we derive a stochastic online algorithm (SOLAM) whose per-iteration time and space complexity is that of a single datum. We establish theoretical convergence of SOLAM with high probability and demonstrate its effectiveness on standard benchmark datasets.

## 1 Introduction

Area under the ROC curve (AUC) [8] is a widely used metric for measuring classification performance. Unlike the misclassification error, which reflects a classifier's ability to classify a single randomly chosen example, AUC concerns the overall performance of a functional family of classifiers and quantifies their ability to rank any positive instance above a randomly chosen negative instance. Most algorithms optimizing AUC for classification [5, 9, 12, 17] are batch learning methods, which assume that all training data are available. Online learning algorithms [1, 2, 3, 16, 19, 22], on the other hand, have proven very efficient for dealing with large-scale datasets. However, most studies of online learning focus on the misclassification error or its surrogate losses, where the objective function is a sum of losses over individual examples. It is thus desirable to develop online learning algorithms that optimize the AUC metric.

The main challenge for an online AUC algorithm is that the objective of AUC maximization depends on a sum of pairwise losses between instances from different classes, which is quadratic in the number of training examples. As such, directly deploying existing online algorithms would require storing all training data received, which is infeasible for large-scale data analysis. Several recent works [6, 11, 18, 20, 21] have studied online AUC maximization methods that update the classifier upon the arrival of each new training example. However, this type of algorithm needs to access all previous examples at iteration t, incurring O(td) space and per-iteration time complexity, where d is the dimension of the data; per-iteration cost that grows with the iteration count is undesirable for online applications that must run within fixed resources. This problem is partially alleviated by the use of buffers of a fixed size s in [11, 21], which reduces the per-iteration space and time complexity to O(sd). Although this makes the per-iteration complexity independent of the number of iterations, in practice the buffer size needs to be set sufficiently large to reduce the variance of the learning performance.
The work of [6] proposes an alternative method that maintains the first-order (mean) and second-order (covariance) statistics of the training data, so the space and per-iteration complexity becomes O(d²). Although this eliminates the need to access all previous training examples, the per-iteration cost is now quadratic in the data dimension, making the method inefficient for high-dimensional data. To mitigate this, the authors of [6] further proposed approximating the covariance matrices by low-rank random Gaussian matrices. However, this approximation is not a general solution to the original problem, and its convergence was established only under the assumption that the effective numerical rank of the covariance matrices is small (i.e., that they can be well approximated by low-rank matrices).

In this work, we present a new stochastic online AUC maximization (SOLAM) method for the $\ell_2$ loss function. In contrast to existing online AUC maximization methods, e.g. [6, 21], SOLAM does not need to store previously received training examples or covariance matrices, while at the same time enjoying a convergence rate comparable, up to a logarithmic term, to those of [6, 21]. To the best of our knowledge, this is the first online learning algorithm for AUC optimization with O(d) space and per-iteration time complexity, the same as the online gradient descent algorithm [1, 2, 16, 22] for classification. The key step of SOLAM is to reformulate the original problem as a stochastic saddle point problem [14]; this connection is the foundation of the SOLAM algorithm and its convergence analysis. When evaluated on several standard benchmark datasets, SOLAM achieves performance on par with state-of-the-art online AUC optimization methods, with a significant improvement in running time.

The main contributions of our work can be summarized as follows:

- We provide a new formulation of the AUC optimization problem as a stochastic Saddle Point Problem (SPP). This formulation facilitates the development of online algorithms for AUC optimization.
- Our algorithm SOLAM achieves per-iteration space and time complexity linear in the data dimension.
- Our theoretical analysis guarantees, with high probability, the convergence of the proposed algorithm.

## 2 Formulation

Let the input space be $\mathcal{X} \subseteq \mathbb{R}^d$ and the output space $\mathcal{Y} = \{-1, +1\}$. We assume the training data $\mathbf{z} = \{(x_i, y_i) : i = 1, \ldots, n\}$ is an i.i.d. sample drawn from an unknown distribution $\rho$ on $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. The ROC curve is the plot of the true positive rate versus the false positive rate. The area under the ROC curve (AUC) of a scoring function $f: \mathcal{X} \to \mathbb{R}$ equals the probability that a positive sample ranks higher than a negative sample (e.g. [4, 8]). It is defined as

$$\mathrm{AUC}(f) = \Pr\big(f(x) \ge f(x') \mid y = +1,\; y' = -1\big), \qquad (1)$$

where $(x, y)$ and $(x', y')$ are drawn independently from $\rho$. The goal of AUC maximization is to find the optimal decision function $f$:

$$\arg\max_{f}\, \mathrm{AUC}(f) = \arg\min_{f}\, \Pr\big(f(x) < f(x') \mid y = 1,\; y' = -1\big) = \arg\min_{f}\, \mathbb{E}\big[\mathbb{I}_{[f(x') - f(x) > 0]} \mid y = 1,\; y' = -1\big], \qquad (2)$$

where $\mathbb{I}_{(\cdot)}$ is the indicator function, which takes value 1 if its argument is true and 0 otherwise. Let $p = \Pr(y = 1)$. For any random variable $\xi(z)$, recall that its conditional expectation is defined by $\mathbb{E}[\xi(z) \mid y = 1] = \frac{1}{p}\int_{\mathcal{Z}} \xi(z)\,\mathbb{I}_{[y=1]}\,d\rho(z)$. Since $\mathbb{I}_{(\cdot)}$ is not continuous, it is often replaced by a convex surrogate. Two common choices are the $\ell_2$ loss $(1 - (f(x) - f(x')))^2$ and the hinge loss $(1 - (f(x) - f(x')))_+$.
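To make the pairwise nature of this objective concrete, the short Python sketch below (our illustration; the function name and arrays are hypothetical) computes the empirical AUC of a linear scorer together with the pairwise $\ell_2$ surrogate averaged over all positive-negative pairs. Its $O(n_+ n_-)$ cost is exactly what an online method must avoid.

```python
import numpy as np

def empirical_auc_and_l2_loss(w, X, y):
    """Empirical AUC of the linear scorer f(x) = w.x, and the pairwise
    l2 surrogate (1 - (f(x) - f(x')))^2 averaged over all positive-negative
    pairs. Both require enumerating O(n+ * n-) pairs."""
    scores = X @ w
    pos, neg = scores[y == 1], scores[y == -1]
    diff = pos[:, None] - neg[None, :]   # f(x) - f(x') for every pair
    auc = np.mean(diff > 0)              # ties ignored for simplicity
    l2 = np.mean((1.0 - diff) ** 2)      # pairwise least-squares surrogate
    return auc, l2
```

On a training sample, the second returned value is the quantity that the empirical risk minimization problem below asks to minimize over $w$.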
In this work, we use the $\ell_2$ loss, as it has been shown to be statistically consistent with AUC while the hinge loss is not [6, 7]. We also restrict our attention to the family of linear functions, i.e., $f(x) = w^\top x$. In summary, AUC maximization can be formulated as

$$\arg\min_{\|w\| \le R}\; \mathbb{E}\big[(1 - w^\top(x - x'))^2 \mid y = 1,\; y' = -1\big] = \arg\min_{\|w\| \le R}\; \frac{1}{p(1-p)}\iint_{\mathcal{Z}\times\mathcal{Z}} (1 - w^\top(x - x'))^2\,\mathbb{I}_{[y=1]}\,\mathbb{I}_{[y'=-1]}\,d\rho(z)\,d\rho(z'). \qquad (3)$$

When $\rho$ is the uniform distribution over the training data $\mathbf{z}$, we obtain the empirical risk minimization (ERM) problem for AUC optimization studied in [6, 21]:¹

$$\min_{\|w\| \le R}\; \frac{1}{n_+ n_-}\sum_{i=1}^n\sum_{j=1}^n (1 - w^\top(x_i - x_j))^2\,\mathbb{I}_{[y_i = 1 \,\wedge\, y_j = -1]}, \qquad (4)$$

where $n_+$ and $n_-$ denote the numbers of instances in the positive and negative classes, respectively.

### 2.1 Equivalent Representation as a (Stochastic) Saddle Point Problem (SPP)

The main result of this work is the equivalence of problem (3) to a stochastic Saddle Point Problem (SPP) (e.g., [14]). A stochastic SPP is generally of the form

$$\min_{u \in \Omega_1}\max_{\alpha \in \Omega_2}\big\{ f(u, \alpha) := \mathbb{E}[F(u, \alpha, \xi)] \big\}, \qquad (5)$$

where $\Omega_1 \subseteq \mathbb{R}^d$ and $\Omega_2 \subseteq \mathbb{R}^m$ are nonempty closed convex sets, $\xi$ is a random vector taking values in a nonempty measurable set $\Xi \subseteq \mathbb{R}^p$, and $F: \Omega_1 \times \Omega_2 \times \Xi \to \mathbb{R}$. Here $\mathbb{E}[F(u, \alpha, \xi)] = \int_\Xi F(u, \alpha, \xi)\,d\Pr(\xi)$, and the function $f(u, \alpha)$ is convex in $u \in \Omega_1$ and concave in $\alpha \in \Omega_2$. In general, $u$ and $\alpha$ are referred to as the primal and dual variables, respectively.

The following theorem shows that (3) is equivalent to a stochastic SPP of the form (5). First, define $F: \mathbb{R}^d \times \mathbb{R}^3 \times \mathcal{Z} \to \mathbb{R}$, for any $w \in \mathbb{R}^d$, $a, b, \alpha \in \mathbb{R}$ and $z = (x, y) \in \mathcal{Z}$, by

$$F(w, a, b, \alpha; z) = (1-p)(w^\top x - a)^2\,\mathbb{I}_{[y=1]} + p(w^\top x - b)^2\,\mathbb{I}_{[y=-1]} + 2(1+\alpha)\big(p\,w^\top x\,\mathbb{I}_{[y=-1]} - (1-p)\,w^\top x\,\mathbb{I}_{[y=1]}\big) - p(1-p)\alpha^2. \qquad (6)$$

**Theorem 1.** The AUC optimization problem (3) is equivalent to

$$\min_{\|w\| \le R,\;(a,b) \in \mathbb{R}^2}\;\max_{\alpha \in \mathbb{R}}\;\Big\{ f(w, a, b, \alpha) := \int_{\mathcal{Z}} F(w, a, b, \alpha; z)\,d\rho(z) \Big\}. \qquad (7)$$

*Proof.* It suffices to prove the claim that the objective function of (3) equals $1 + \frac{1}{p(1-p)}\min_{(a,b) \in \mathbb{R}^2}\max_{\alpha \in \mathbb{R}}\int_{\mathcal{Z}} F(w, a, b, \alpha; z)\,d\rho(z)$. To this end, note that $z = (x, y)$ and $z' = (x', y')$ are independent samples from $\rho$. Therefore, the objective function of (3) can be rewritten as

$$\begin{aligned}
\mathbb{E}\big[(1 - w^\top(x - x'))^2 \mid y = 1, y' = -1\big]
&= 1 + \mathbb{E}[(w^\top x)^2 \mid y = 1] + \mathbb{E}[(w^\top x')^2 \mid y' = -1] - 2\mathbb{E}[w^\top x \mid y = 1] \\
&\quad + 2\mathbb{E}[w^\top x' \mid y' = -1] - 2\,\mathbb{E}[w^\top x \mid y = 1]\,\mathbb{E}[w^\top x' \mid y' = -1] \\
&= 1 + \big(\mathbb{E}[(w^\top x)^2 \mid y = 1] - (\mathbb{E}[w^\top x \mid y = 1])^2\big) + \big(\mathbb{E}[(w^\top x')^2 \mid y' = -1] - (\mathbb{E}[w^\top x' \mid y' = -1])^2\big) \\
&\quad - 2\mathbb{E}[w^\top x \mid y = 1] + 2\mathbb{E}[w^\top x' \mid y' = -1] + \big(\mathbb{E}[w^\top x \mid y = 1] - \mathbb{E}[w^\top x' \mid y' = -1]\big)^2.
\end{aligned} \qquad (8)$$

Note that

$$\mathbb{E}[(w^\top x)^2 \mid y = 1] - \big(\mathbb{E}[w^\top x \mid y = 1]\big)^2 = \min_{a \in \mathbb{R}}\;\mathbb{E}\big[(w^\top x - a)^2 \mid y = 1\big], \qquad (9)$$

where the minimum is achieved at $a = \mathbb{E}[w^\top x \mid y = 1]$. Likewise,

$$\min_{b \in \mathbb{R}}\;\mathbb{E}\big[(w^\top x' - b)^2 \mid y' = -1\big] = \mathbb{E}[(w^\top x')^2 \mid y' = -1] - \big(\mathbb{E}[w^\top x' \mid y' = -1]\big)^2, \qquad (10)$$

where the minimum is achieved at $b = \mathbb{E}[w^\top x' \mid y' = -1]$. Moreover, observe that

$$\big(\mathbb{E}[w^\top x \mid y = 1] - \mathbb{E}[w^\top x' \mid y' = -1]\big)^2 = \max_{\alpha \in \mathbb{R}}\big\{ 2\alpha\big(\mathbb{E}[w^\top x' \mid y' = -1] - \mathbb{E}[w^\top x \mid y = 1]\big) - \alpha^2 \big\}, \qquad (11)$$

where the maximum is achieved at $\alpha = \mathbb{E}[w^\top x' \mid y' = -1] - \mathbb{E}[w^\top x \mid y = 1]$. Putting all these equalities into (8) implies that

$$\mathbb{E}\big[(1 - w^\top(x - x'))^2 \mid y = 1, y' = -1\big] = 1 + \frac{1}{p(1-p)}\min_{(a,b) \in \mathbb{R}^2}\max_{\alpha \in \mathbb{R}}\int_{\mathcal{Z}} F(w, a, b, \alpha; z)\,d\rho(z).$$

This proves the claim and hence the theorem. ∎

In addition, we can prove the following result.

**Proposition 1.** The function $f(w, a, b, \alpha)$ is convex in $(w, a, b) \in \mathbb{R}^{d+2}$ and concave in $\alpha \in \mathbb{R}$.

The proof of this proposition can be found in the Supplementary Materials.

¹ The works [6, 21] studied the regularized ERM problem $\min_{w \in \mathbb{R}^d}\frac{1}{n_+ n_-}\sum_{i=1}^n\sum_{j=1}^n (1 - w^\top(x_i - x_j))^2\,\mathbb{I}_{[y_i=1]}\,\mathbb{I}_{[y_j=-1]} + \frac{\lambda}{2}\|w\|^2$, which is equivalent to (4) with $w$ restricted to a bounded ball in $\mathbb{R}^d$.

### 2.2 Stochastic Online Algorithm for AUC Maximization

The optimal solution to an SPP is called a saddle point. Stochastic first-order methods are widely used to find such a saddle point.
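Because Theorem 1 is an exact algebraic identity, it also holds when $\rho$ is the empirical distribution, which makes it easy to sanity-check numerically. The following sketch (ours, on synthetic data) plugs the closed-form optimizers from (9)-(11) into $F$ of (6) and verifies that the result reproduces the pairwise $\ell_2$ objective of (3):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.normal(size=(n, d))
y = np.where(rng.random(n) < 0.3, 1, -1)   # imbalanced labels
w = rng.normal(size=d)

s = X @ w
pos, neg = s[y == 1], s[y == -1]
p = np.mean(y == 1)

# Left-hand side: the pairwise l2 objective of (3) under the empirical measure.
lhs = np.mean((1.0 - (pos[:, None] - neg[None, :])) ** 2)

# Right-hand side: plug the optimizers of (9)-(11) into F of (6) and average.
a, b = pos.mean(), neg.mean()              # a* = E[w.x|y=1], b* = E[w.x'|y'=-1]
alpha = b - a                              # alpha* from (11)
F = ((1 - p) * (s - a) ** 2 * (y == 1)
     + p * (s - b) ** 2 * (y == -1)
     + 2 * (1 + alpha) * (p * s * (y == -1) - (1 - p) * s * (y == 1))
     - p * (1 - p) * alpha ** 2)
rhs = 1.0 + F.mean() / (p * (1 - p))

assert np.isclose(lhs, rhs)                # the identity of Theorem 1 holds exactly
```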
The main idea of such algorithms (e.g. [13, 14]) is to use an unbiased stochastic estimator of the true gradient to perform, at each iteration, gradient descent in the primal variable and gradient ascent in the dual variable. Using the stochastic SPP formulation (7) for AUC optimization, we can develop stochastic online learning algorithms that need only a single pass over the data. For notational simplicity, let $v = (w^\top, a, b)^\top \in \mathbb{R}^{d+2}$, and, for any $w \in \mathbb{R}^d$, $a, b, \alpha \in \mathbb{R}$ and $z = (x, y) \in \mathcal{Z}$, write $f(w, a, b, \alpha)$ as $f(v, \alpha)$ and $F(w, a, b, \alpha; z)$ as $F(v, \alpha, z)$. The gradient of the objective of the stochastic SPP (7) is the $(d+3)$-dimensional column vector $g(v, \alpha) = \big(\partial_v f(v, \alpha),\, -\partial_\alpha f(v, \alpha)\big)$, and its unbiased stochastic estimator is given, for any $z \in \mathcal{Z}$, by $G(v, \alpha, z) = \big(\partial_v F(v, \alpha, z),\, -\partial_\alpha F(v, \alpha, z)\big)$.

One could directly apply the stochastic first-order method of [14] to the stochastic SPP formulation (7) for AUC optimization. However, by the definition of $F$ in (6), this would require knowing the probability $p = \Pr(y = 1)$ a priori. To overcome this problem, for any $v = (w^\top, a, b)^\top \in \mathbb{R}^{d+2}$, $\alpha \in \mathbb{R}$ and $z \in \mathcal{Z}$, let

$$\hat F_t(v, \alpha, z) = (1 - \hat p_t)(w^\top x - a)^2\,\mathbb{I}_{[y=1]} + \hat p_t(w^\top x - b)^2\,\mathbb{I}_{[y=-1]} + 2(1+\alpha)\big(\hat p_t\,w^\top x\,\mathbb{I}_{[y=-1]} - (1 - \hat p_t)\,w^\top x\,\mathbb{I}_{[y=1]}\big) - \hat p_t(1 - \hat p_t)\alpha^2, \qquad (12)$$

where $\hat p_t = \frac{1}{t}\sum_{i=1}^t \mathbb{I}_{[y_i = 1]}$ is the empirical estimate of $p$ at iteration $t$. We propose, at iteration $t$, to use the stochastic estimator

$$\hat G_t(v, \alpha, z) = \big(\partial_v \hat F_t(v, \alpha, z),\, -\partial_\alpha \hat F_t(v, \alpha, z)\big) \qquad (13)$$

in place of the unbiased, but practically inaccessible, stochastic estimator $G(v, \alpha, z)$.

Assume $\kappa = \sup_{x \in \mathcal{X}}\|x\| < \infty$, and recall that $\|w\| \le R$. For any optimal solution $(w^*, a^*, b^*)$ of the stochastic SPP (7) for AUC optimization, by (9), (10) and (11) we know that

$$|a^*| = \frac{1}{p}\Big|\int_{\mathcal{Z}}\langle w^*, x\rangle\,\mathbb{I}_{[y=1]}\,d\rho(z)\Big| \le R\kappa, \qquad |b^*| = \frac{1}{1-p}\Big|\int_{\mathcal{Z}}\langle w^*, x'\rangle\,\mathbb{I}_{[y'=-1]}\,d\rho(z')\Big| \le R\kappa,$$

and

$$|\alpha^*| = \Big|\frac{1}{1-p}\int_{\mathcal{Z}}\langle w^*, x'\rangle\,\mathbb{I}_{[y'=-1]}\,d\rho(z') - \frac{1}{p}\int_{\mathcal{Z}}\langle w^*, x\rangle\,\mathbb{I}_{[y=1]}\,d\rho(z)\Big| \le 2R\kappa.$$

Therefore, we can restrict $(w, a, b)$ and $\alpha$ to the bounded domains

$$\Omega_1 = \big\{(w, a, b) \in \mathbb{R}^{d+2} : \|w\| \le R,\ |a| \le R\kappa,\ |b| \le R\kappa\big\}, \qquad \Omega_2 = \big\{\alpha \in \mathbb{R} : |\alpha| \le 2R\kappa\big\}. \qquad (14)$$

In this case, the projection steps (steps 4 and 5 in Table 1) can be easily computed. The pseudocode of the online AUC optimization algorithm, to which we refer as SOLAM, is given in Table 1.

**Table 1: Pseudocode of the proposed SOLAM algorithm.** In steps 4 and 5, $P_{\Omega_1}(\cdot)$ and $P_{\Omega_2}(\cdot)$ denote the projections onto the convex sets $\Omega_1$ and $\Omega_2$, respectively.

1. Choose step sizes $\{\gamma_t > 0 : t \in \mathbb{N}\}$.
2. Initialize $t = 1$, $v_1 \in \Omega_1$, $\alpha_1 \in \Omega_2$, and let $\hat p_0 = 0$, $\bar v_0 = 0$, $\bar\alpha_0 = 0$ and $\bar\gamma_0 = 0$.
3. Receive a sample $z_t = (x_t, y_t)$ and compute $\hat p_t = \frac{(t-1)\hat p_{t-1} + \mathbb{I}_{[y_t=1]}}{t}$.
4. Update $v_{t+1} = P_{\Omega_1}\big(v_t - \gamma_t\,\partial_v \hat F_t(v_t, \alpha_t, z_t)\big)$.
5. Update $\alpha_{t+1} = P_{\Omega_2}\big(\alpha_t + \gamma_t\,\partial_\alpha \hat F_t(v_t, \alpha_t, z_t)\big)$.
6. Update $\bar\gamma_t = \bar\gamma_{t-1} + \gamma_t$.
7. Update $\bar v_t = \frac{1}{\bar\gamma_t}\big(\bar\gamma_{t-1}\bar v_{t-1} + \gamma_t v_t\big)$ and $\bar\alpha_t = \frac{1}{\bar\gamma_t}\big(\bar\gamma_{t-1}\bar\alpha_{t-1} + \gamma_t \alpha_t\big)$.
8. Set $t \leftarrow t + 1$ and return to step 3.

## 3 Convergence Analysis

We now present the convergence results of the proposed algorithm. Let $u = (v, \alpha) = (w, a, b, \alpha)$. The quality of an approximate solution $(\bar v_t, \bar\alpha_t)$ to the SPP (5) at iteration $t$ is measured by the duality gap

$$\varepsilon_f(\bar v_t, \bar\alpha_t) = \max_{\alpha \in \Omega_2} f(\bar v_t, \alpha) - \min_{v \in \Omega_1} f(v, \bar\alpha_t). \qquad (15)$$

**Theorem 2.** Assume that the samples $\{(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)\}$ are i.i.d. draws from a distribution $\rho$ over $\mathcal{X} \times \mathcal{Y}$, let $\Omega_1$ and $\Omega_2$ be given by (14), and let the step sizes be $\{\gamma_t > 0 : t \in \mathbb{N}\}$. For the sequence $\{(\bar v_t, \bar\alpha_t) : t \in [1, T]\}$ generated by SOLAM (Table 1) and any $0 < \delta < 1$, with probability $1 - \delta$,

$$\varepsilon_f(\bar v_T, \bar\alpha_T) \le C_\kappa\max(R^2, 1)\Big(\sum_{j=1}^T \gamma_j\Big)^{-1}\Big[1 + \sum_{j=1}^T \gamma_j^2 + \Big(\sum_{j=1}^T \gamma_j^2\Big)^{1/2} + \sum_{j=1}^T \frac{\gamma_j}{\sqrt{j}}\Big]\ln\frac{4T}{\delta},$$

where $C_\kappa$ is an absolute constant independent of $R$ and $T$ (see its explicit expression in the proof).
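For concreteness, here is a minimal single-pass Python sketch of Table 1 (our transcription, not the authors' MATLAB code): the gradients are derived from (12), the projections clip onto $\Omega_1$ and $\Omega_2$ of (14), and the step sizes follow the choice $\gamma_t = \zeta t^{-1/2}$ analyzed below.

```python
import numpy as np

def solam(stream, d, R=10.0, kappa=1.0, zeta=1.0):
    """Single-pass SOLAM (Table 1). `stream` yields (x, y) with y in {+1, -1};
    R bounds ||w||, kappa bounds ||x||, and gamma_t = zeta / sqrt(t) as in
    Corollary 1. Returns the averaged iterates (w_bar, a_bar, b_bar, alpha_bar)."""
    w, a, b, alpha = np.zeros(d), 0.0, 0.0, 0.0
    w_bar, a_bar, b_bar, alpha_bar = np.zeros(d), 0.0, 0.0, 0.0
    p_hat, gamma_bar = 0.0, 0.0
    for t, (x, y) in enumerate(stream, start=1):
        gamma = zeta / np.sqrt(t)
        p_hat += (float(y == 1) - p_hat) / t        # step 3: running estimate of p
        wx = float(w @ x)
        # Gradients of F_hat_t in (12) at the current iterate.
        if y == 1:
            gw = 2 * (1 - p_hat) * ((wx - a) - (1 + alpha)) * x
            ga = -2 * (1 - p_hat) * (wx - a)
            gb = 0.0
            galpha = -2 * (1 - p_hat) * wx - 2 * p_hat * (1 - p_hat) * alpha
        else:
            gw = 2 * p_hat * ((wx - b) + (1 + alpha)) * x
            ga = 0.0
            gb = -2 * p_hat * (wx - b)
            galpha = 2 * p_hat * wx - 2 * p_hat * (1 - p_hat) * alpha
        # Steps 6-7: weighted average of the current (pre-update) iterate.
        gamma_bar += gamma
        rho = gamma / gamma_bar
        w_bar += rho * (w - w_bar)
        a_bar += rho * (a - a_bar)
        b_bar += rho * (b - b_bar)
        alpha_bar += rho * (alpha - alpha_bar)
        # Steps 4-5: projected primal descent / dual ascent onto Omega_1, Omega_2.
        w = w - gamma * gw
        norm = np.linalg.norm(w)
        if norm > R:
            w *= R / norm
        a = float(np.clip(a - gamma * ga, -kappa * R, kappa * R))
        b = float(np.clip(b - gamma * gb, -kappa * R, kappa * R))
        alpha = float(np.clip(alpha + gamma * galpha, -2 * kappa * R, 2 * kappa * R))
    return w_bar, a_bar, b_bar, alpha_bar
```

Any iterable of (x, y) pairs can serve as the stream, e.g. `w_bar, a_bar, b_bar, alpha_bar = solam(zip(X, y), d=X.shape[1])`, after which `X_test @ w_bar` gives test scores.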
Denote by $f^*$ the optimal value of (7); by Theorem 1, the minimizer of (7) over $w$ is exactly the solution of the AUC optimization problem (3). From Theorem 2, the following convergence rate is immediate.

**Corollary 1.** Under the same assumptions as in Theorem 2, with step sizes $\{\gamma_j = \zeta j^{-1/2} : j \in \mathbb{N}\}$ for some constant $\zeta > 0$, with probability $1 - \delta$ it holds that

$$|f(\bar v_T, \bar\alpha_T) - f^*| \le \varepsilon_f(\bar v_T, \bar\alpha_T) = O\Big(\frac{\ln T}{\sqrt{T}}\,\ln\frac{4T}{\delta}\Big).$$

While the above convergence rate is obtained with decaying step sizes, a similar result can be established when a constant step size is appropriately chosen.
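To see where this rate comes from, the step-size sums appearing in Theorem 2 can be bounded by elementary comparisons; the following worked calculation (ours, with constants suppressed) makes this explicit:

```latex
% Step-size sums for \gamma_j = \zeta j^{-1/2}:
% each term of the first sum is at least T^{-1/2}, and the other two are
% harmonic sums bounded by integral comparison.
\sum_{j=1}^{T} \gamma_j = \zeta \sum_{j=1}^{T} j^{-1/2} \ge \zeta\sqrt{T}, \qquad
\sum_{j=1}^{T} \gamma_j^2 = \zeta^2 \sum_{j=1}^{T} j^{-1} \le \zeta^2 (1 + \ln T), \qquad
\sum_{j=1}^{T} \frac{\gamma_j}{\sqrt{j}} = \zeta \sum_{j=1}^{T} j^{-1} \le \zeta (1 + \ln T).
```

The bracketed factor in the bound of Theorem 2 is therefore $O(\ln T)$, while the prefactor $(\sum_j \gamma_j)^{-1}$ decays as $(\zeta\sqrt{T})^{-1}$, which yields the rate stated in Corollary 1.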
The proof of Theorem 2 requires several lemmas. The first is a standard result from convex online learning [16, 22]; we include its proof in the Supplementary Materials for completeness.

**Lemma 1.** For any $T \in \mathbb{N}$, let $\{\xi_j : j \in [1, T]\}$ be a sequence of vectors in $\mathbb{R}^m$ and $\tilde u_1 \in \Omega$, where $\Omega$ is a convex set. For any $t \in [1, T]$ define $\tilde u_{t+1} = P_\Omega(\tilde u_t - \xi_t)$. Then, for any $u \in \Omega$,

$$\sum_{t=1}^T (\tilde u_t - u)^\top \xi_t \le \frac{\|\tilde u_1 - u\|^2}{2} + \frac{1}{2}\sum_{t=1}^T \|\xi_t\|^2.$$

The second lemma is the Pinelis-Bernstein inequality for martingale difference sequences in a Hilbert space, which follows from [15, Theorem 3.4].

**Lemma 2.** Let $\{S_k : k \in \mathbb{N}\}$ be a martingale difference sequence in a Hilbert space. Suppose that almost surely $\|S_k\| \le B$ and $\sum_{k=1}^T \mathbb{E}\big[\|S_k\|^2 \mid S_1, \ldots, S_{k-1}\big] \le \sigma_T^2$. Then, for any $0 < \delta < 1$, with probability at least $1 - \delta$,

$$\sup_{1 \le j \le T}\Big\|\sum_{k=1}^j S_k\Big\| \le 2\Big(\frac{B}{3} + \sigma_T\Big)\log\frac{2}{\delta}.$$

The third lemma shows that the approximate stochastic estimator $\hat G_t(u, z)$ defined by (13) is not far from the unbiased one $G(u, z)$; its proof is given in the Supplementary Materials.

**Lemma 3.** Let $\Omega_1$ and $\Omega_2$ be given by (14) and write $\Omega = \Omega_1 \times \Omega_2$. For any $t \in \mathbb{N}$ and $0 < \delta < 1$, with probability $1 - \delta$,

$$\sup_{u \in \Omega,\, z \in \mathcal{Z}}\|\hat G_t(u, z) - G(u, z)\| \le 2\kappa(4\kappa R + 11R + 1)\sqrt{\frac{\ln(2/\delta)}{t}}.$$

*Proof of Theorem 2.* By the convexity of $f(\cdot, \alpha)$ and the concavity of $f(v, \cdot)$, for any $u = (v, \alpha) \in \Omega_1 \times \Omega_2$ we get

$$f(v_t, \alpha) - f(v, \alpha_t) = \big(f(v_t, \alpha_t) - f(v, \alpha_t)\big) + \big(f(v_t, \alpha) - f(v_t, \alpha_t)\big) \le (v_t - v)^\top \partial_v f(v_t, \alpha_t) - (\alpha_t - \alpha)\,\partial_\alpha f(v_t, \alpha_t) = (u_t - u)^\top g(u_t).$$

Hence,

$$\max_{\alpha \in \Omega_2} f(\bar v_T, \alpha) - \min_{v \in \Omega_1} f(v, \bar\alpha_T) \le \Big(\sum_{t=1}^T \gamma_t\Big)^{-1}\Big(\max_{\alpha \in \Omega_2}\sum_{t=1}^T \gamma_t f(v_t, \alpha) - \min_{v \in \Omega_1}\sum_{t=1}^T \gamma_t f(v, \alpha_t)\Big) \le \Big(\sum_{t=1}^T \gamma_t\Big)^{-1}\max_{u \in \Omega_1\times\Omega_2}\sum_{t=1}^T \gamma_t (u_t - u)^\top g(u_t). \qquad (16)$$

Recall that $\Omega = \Omega_1 \times \Omega_2$. Steps 4 and 5 of SOLAM can be written jointly as $u_{t+1} = (v_{t+1}, \alpha_{t+1}) = P_\Omega\big(u_t - \gamma_t \hat G_t(u_t, z_t)\big)$. Applying Lemma 1 with $\xi_t = \gamma_t \hat G_t(u_t, z_t)$ gives, for any $u \in \Omega$, $\sum_{t=1}^T \gamma_t (u_t - u)^\top \hat G_t(u_t, z_t) \le \frac{1}{2}\|u_1 - u\|^2 + \frac{1}{2}\sum_{t=1}^T \gamma_t^2\|\hat G_t(u_t, z_t)\|^2$, which yields

$$\sup_{u \in \Omega}\sum_{t=1}^T \gamma_t (u_t - u)^\top g(u_t) \le \frac{1}{2}\sup_{u \in \Omega}\|u_1 - u\|^2 + \frac{1}{2}\sum_{t=1}^T \gamma_t^2\|\hat G_t(u_t, z_t)\|^2 + \sup_{u \in \Omega}\sum_{t=1}^T \gamma_t (u_t - u)^\top\big(g(u_t) - G(u_t, z_t)\big) + \sup_{u \in \Omega}\sum_{t=1}^T \gamma_t (u_t - u)^\top\big(G(u_t, z_t) - \hat G_t(u_t, z_t)\big). \qquad (17)$$

We now estimate the four terms on the right-hand side of (17). For the first term, we have

$$\frac{1}{2}\sup_{u \in \Omega}\|u_1 - u\|^2 \le 2\sup_{u \in \Omega}\|u\|^2 = 2\sup_{v \in \Omega_1,\,\alpha \in \Omega_2}\big(\|v\|^2 + |\alpha|^2\big) \le 2R^2(1 + 6\kappa^2). \qquad (18)$$

For the second term, observe that $\sup_{x \in \mathcal{X}}\|x\| \le \kappa$ and $u_t = (w_t, a_t, b_t, \alpha_t) \in \Omega = \{(w, a, b, \alpha) : \|w\| \le R,\ |a| \le \kappa R,\ |b| \le \kappa R,\ |\alpha| \le 2\kappa R\}$. Combining this with the definition of $\hat G_t(u_t, z_t)$ in (13), one can easily get $\|\hat G_t(u_t, z_t)\| \le \|\partial_w \hat F_t\| + |\partial_a \hat F_t| + |\partial_b \hat F_t| + |\partial_\alpha \hat F_t| \le 2\kappa(2R + 1 + 2R\kappa)$. Hence,

$$\frac{1}{2}\sum_{t=1}^T \gamma_t^2\|\hat G_t(u_t, z_t)\|^2 \le 2\kappa^2(2R + 1 + 2R\kappa)^2\sum_{t=1}^T \gamma_t^2. \qquad (19)$$

The third term on the right-hand side of (17) can be bounded by

$$\sup_{u \in \Omega}\sum_{t=1}^T \gamma_t (u_t - u)^\top\big(g(u_t) - G(u_t, z_t)\big) \le \sup_{u \in \Omega}\Big[\sum_{t=1}^T \gamma_t (\tilde u_t - u)^\top\big(g(u_t) - G(u_t, z_t)\big)\Big] + \sum_{t=1}^T \gamma_t (u_t - \tilde u_t)^\top\big(g(u_t) - G(u_t, z_t)\big),$$

where $\tilde u_1 = 0 \in \Omega$ and $\tilde u_{t+1} = P_\Omega\big(\tilde u_t - \gamma_t(g(u_t) - G(u_t, z_t))\big)$ for any $t \in [1, T]$. Applying Lemma 1 with $\xi_t = \gamma_t\big(g(u_t) - G(u_t, z_t)\big)$ yields

$$\sup_{u \in \Omega}\sum_{t=1}^T \gamma_t (\tilde u_t - u)^\top\big(g(u_t) - G(u_t, z_t)\big) \le 2R^2(1 + 6\kappa^2) + 4\kappa^2(2R + 1 + 2R\kappa)^2\sum_{t=1}^T \gamma_t^2, \qquad (20)$$

where we used that $G(u_t, z_t)$ and $g(u_t)$ are each uniformly bounded by $2\kappa(2R + 1 + 2R\kappa)$. Since $u_t$ and $\tilde u_t$ depend only on $\{z_1, z_2, \ldots, z_{t-1}\}$, the sequence $\{S_t = \gamma_t(u_t - \tilde u_t)^\top(g(u_t) - G(u_t, z_t)) : t = 1, \ldots, T\}$ is a martingale difference sequence. Observe that

$$\mathbb{E}\big[\|S_t\|^2 \mid z_1, \ldots, z_{t-1}\big] = \gamma_t^2\int_{\mathcal{Z}}\big((u_t - \tilde u_t)^\top(g(u_t) - G(u_t, z))\big)^2 d\rho(z) \le \gamma_t^2\sup_{u \in \Omega,\, z \in \mathcal{Z}}\big[\|u_t - \tilde u_t\|^2\,\|g(u_t) - G(u_t, z)\|^2\big] \le \gamma_t^2\big[2\kappa R\sqrt{1 + 6\kappa^2}\,(2R + 1 + 2R\kappa)\big]^2.$$

Applying Lemma 2 with $\sigma_T^2 = \big[2\kappa R\sqrt{1 + 6\kappa^2}\,(2R + 1 + 2R\kappa)\big]^2\sum_{t=1}^T \gamma_t^2$ and $B = \sup_{t \in [1, T]}\gamma_t\big|(u_t - \tilde u_t)^\top(g(u_t) - G(u_t, z_t))\big| \le \sigma_T$ implies that, with probability $1 - \frac{\delta}{2}$,

$$\sum_{t=1}^T \gamma_t (u_t - \tilde u_t)^\top\big(g(u_t) - G(u_t, z_t)\big) \le 16\kappa R\sqrt{1 + 6\kappa^2}\,(2R + 1 + 2R\kappa)\Big(\sum_{t=1}^T \gamma_t^2\Big)^{1/2}\log\frac{4}{\delta}. \qquad (21)$$

Combining (20) with (21) implies that, with probability $1 - \frac{\delta}{2}$,

$$\sup_{u \in \Omega}\sum_{t=1}^T \gamma_t (u_t - u)^\top\big(g(u_t) - G(u_t, z_t)\big) \le 2R^2(1 + 6\kappa^2) + 4\kappa^2(2R + 1 + 2R\kappa)^2\sum_{t=1}^T \gamma_t^2 + 16\kappa R\sqrt{1 + 6\kappa^2}\,(2R + 1 + 2R\kappa)\Big(\sum_{t=1}^T \gamma_t^2\Big)^{1/2}\log\frac{4}{\delta}. \qquad (22)$$

By Lemma 3 applied with confidence $\frac{\delta}{2T}$, for any $t \in [1, T]$ there holds, with probability $1 - \frac{\delta}{2T}$, $\sup_{u \in \Omega,\, z \in \mathcal{Z}}\|\hat G_t(u, z) - G(u, z)\| \le 2\kappa(4\kappa R + 11R + 1)\sqrt{\ln(4T/\delta)/t}$. Hence the fourth term on the right-hand side of (17) can be estimated as follows: with probability $1 - \frac{\delta}{2}$,

$$\sup_{u \in \Omega}\sum_{t=1}^T \gamma_t (u_t - u)^\top\big(G(u_t, z_t) - \hat G_t(u_t, z_t)\big) \le 2\sup_{u \in \Omega}\|u\|\sum_{t=1}^T \gamma_t\sup_{u \in \Omega,\, z \in \mathcal{Z}}\|\hat G_t(u, z) - G(u, z)\| \le 8R\kappa(4R\kappa + 11R + 1)\sqrt{\ln\frac{4T}{\delta}}\sum_{t=1}^T \frac{\gamma_t}{\sqrt{t}}. \qquad (23)$$

Putting the estimates (18), (19), (22) and (23) into (17), and then into (16), implies that

$$\varepsilon_f(\bar v_T, \bar\alpha_T) \le C_\kappa\max(R^2, 1)\Big(\sum_{t=1}^T \gamma_t\Big)^{-1}\Big[1 + \sum_{t=1}^T \gamma_t^2 + \Big(\sum_{t=1}^T \gamma_t^2\Big)^{1/2} + \sum_{t=1}^T \frac{\gamma_t}{\sqrt{t}}\Big]\ln\frac{4T}{\delta},$$

where $C_\kappa = 5\sqrt{2}(1 + 6\kappa^2) + 6\kappa^2(\kappa + 3)^2 + 112\sqrt{6\kappa^2 + 1}\,(2\kappa + 3)$. ∎

## 4 Experiments

In this section, we report experimental evaluations of the SOLAM algorithm and compare its performance with existing state-of-the-art learning algorithms for AUC optimization. SOLAM was implemented in MATLAB, and MATLAB code for the compared methods was obtained from the authors of the corresponding papers. In the training phase, we use five-fold cross validation to determine the initial learning rate $\zeta \in [1:9:100]$ and the bound on $w$, $R \in 10^{[-1:1:5]}$, by a grid search. Following the evaluation protocol of [6], the performance of SOLAM was evaluated by averaging results from five runs of five-fold cross validation. Our experiments were performed on 12 datasets that have been used in previous studies. For multi-class datasets, e.g. news20 and sector, we transform them into binary classification problems by randomly partitioning the data into two groups, where each group includes the same number of classes. Information about these datasets is summarized in Table 2.

| dataset | # inst | # feat |
|---|---|---|
| diabetes | 768 | 8 |
| fourclass | 862 | 2 |
| german | 1,000 | 24 |
| splice | 3,175 | 60 |
| usps | 9,298 | 256 |
| a9a | 32,561 | 123 |
| mnist | 60,000 | 780 |
| acoustic | 78,823 | 50 |
| ijcnn1 | 141,691 | 22 |
| covtype | 581,012 | 54 |
| sector | 9,619 | 55,197 |
| news20 | 15,935 | 62,061 |

Table 2: Basic information about the benchmark datasets used in the experiments.
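As an illustration of the training protocol described above, the sketch below (ours; it reuses the hypothetical `solam` function from Section 2.2 and scikit-learn utilities, and the split details are illustrative) runs the grid search over $\zeta \in [1:9:100]$ and $R \in 10^{[-1:1:5]}$ with five-fold cross validation:

```python
import numpy as np
from itertools import product
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

def cross_validate_solam(X, y):
    """Five-fold CV over the grids used in the experiments:
    zeta in 1:9:100 (i.e. 1, 10, ..., 100) and R in 10^{-1..5}."""
    zetas = np.arange(1, 101, 9)
    Rs = 10.0 ** np.arange(-1, 6)
    best = (None, -np.inf)
    for zeta, R in product(zetas, Rs):
        aucs = []
        for tr, va in KFold(n_splits=5, shuffle=True).split(X):
            # `solam` is the single-pass sketch given after Table 1.
            w, _, _, _ = solam(zip(X[tr], y[tr]), d=X.shape[1], R=R, zeta=zeta)
            aucs.append(roc_auc_score(y[va], X[va] @ w))
        if np.mean(aucs) > best[1]:
            best = ((zeta, R), np.mean(aucs))
    return best  # ((zeta, R), mean validation AUC)
```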
On these datasets, we evaluate and compare SOLAM with four online and two offline learning algorithms for AUC maximization: one-pass AUC maximization (OPAUC) [6], which uses the $\ell_2$ surrogate of the AUC objective; online AUC maximization [21], which uses the hinge surrogate of the AUC objective and comes in two variants, one with sequential update (OAMseq) and the other with gradient update (OAMgra); online Uni-Exp [12], which uses a weighted univariate exponential loss; B-SVM-OR [10], a batch learning algorithm using the hinge surrogate of the AUC objective; and B-LS-SVM, a batch learning algorithm using the $\ell_2$ surrogate of the AUC objective.

Classification performance of all methods on the test data is given in Table 3. These results show that SOLAM achieves performance similar to other state-of-the-art online and offline AUC maximization methods. SOLAM outperforms the offline methods on acoustic and covtype, which could be due to the feature normalization used in our experiments for SOLAM. On the other hand, the main advantage of SOLAM is its running efficiency: as pointed out in the Introduction, its per-iteration time and space complexity is linear in the data dimension and does not depend on the iteration number.

| Datasets | SOLAM | OPAUC | OAMseq | OAMgra | online Uni-Exp | B-SVM-OR | B-LS-SVM |
|---|---|---|---|---|---|---|---|
| diabetes | .8253±.0314 | .8309±.0350 | .8264±.0367 | .8262±.0338 | .8215±.0309 | .8326±.0328 | .8325±.0329 |
| fourclass | .8226±.0240 | .8310±.0251 | .8306±.0247 | .8295±.0251 | .8281±.0305 | .8305±.0311 | .8309±.0309 |
| german | .7882±.0243 | .7978±.0347 | .7747±.0411 | .7723±.0358 | .7908±.0367 | .7935±.0348 | .7994±.0343 |
| splice | .9253±.0097 | .9232±.0099 | .8594±.0194 | .8864±.0166 | .8931±.0213 | .9239±.0089 | .9245±.0092 |
| usps | .9766±.0032 | .9620±.0040 | .9310±.0159 | .9348±.0122 | .9538±.0045 | .9630±.0047 | .9634±.0045 |
| a9a | .9001±.0042 | .9002±.0047 | .8420±.0174 | .8571±.0173 | .9005±.0024 | .9009±.0036 | .8982±.0028 |
| mnist | .9324±.0020 | .9242±.0021 | .8615±.0087 | .8643±.0112 | .7932±.0245 | .9340±.0020 | .9336±.0025 |
| acoustic | .8898±.0026 | .8192±.0032 | .7113±.0590 | .7711±.0217 | .8171±.0034 | .8262±.0032 | .8210±.0033 |
| ijcnn1 | .9215±.0045 | .9269±.0021 | .9209±.0079 | .9100±.0092 | .9264±.0035 | .9337±.0024 | .9320±.0037 |
| covtype | .9744±.0004 | .8244±.0014 | .7361±.0317 | .7403±.0289 | .8236±.0017 | .8248±.0013 | .8222±.0014 |
| sector | .9834±.0023 | .9292±.0081 | .9163±.0087 | .9043±.0100 | .9215±.0034 | - | - |
| news20 | .9467±.0039 | .8871±.0083 | .8543±.0099 | .8346±.0094 | .8880±.0047 | - | - |

Table 3: Comparison of the testing AUC values (mean±std.) on the evaluated datasets. To accelerate the experiments, the performances of OPAUC, OAMseq, OAMgra, online Uni-Exp, B-SVM-OR and B-LS-SVM were taken from [6].

In Figure 1, we show AUC vs. running time (seconds) for SOLAM and three other state-of-the-art online learning algorithms, i.e. OPAUC [6], OAMseq [21], and OAMgra [21], on three datasets (a9a, usps, and sector), with the average per-pass running time shown in parentheses in the legend.²

[Figure 1: AUC vs. time curves of the SOLAM algorithm and three state-of-the-art AUC learning algorithms, OPAUC [6], OAMseq [21], and OAMgra [21], on (a) a9a, (b) usps, and (c) sector. The values in parentheses indicate the average running time (seconds) per pass for each algorithm.]

These results show that SOLAM generally reaches convergence faster than the compared methods while achieving competitive performance.

## 5 Conclusion

In this paper, we showed that AUC maximization is equivalent to a stochastic saddle point problem, from which we proposed a novel online learning algorithm for AUC optimization.
In contrast to the existing algorithms [6, 21], the main advantage of our algorithm is that it needs to store neither previous examples nor second-order covariance matrices. Hence, it is a truly online learning algorithm with one-datum space and per-iteration time complexity, the same as online gradient descent algorithms [22] for classification.

There are several directions for future work. First, the convergence rate $O(1/\sqrt{T})$ of SOLAM only matches that of the black-box subgradient method. It would be interesting to derive the fast rate $O(1/T)$ by exploiting the special structure of the objective function $F$ defined in (6). Second, the convergence was established using the duality gap associated with the stochastic SPP formulation (7). It would be interesting to establish strong convergence of the output $\bar w_T$ of SOLAM to the optimal solution of the actual AUC optimization problem (3). Third, the SPP formulation (7) holds for the least squares loss; we do not know whether a similar formulation holds true for other loss functions such as the logistic loss or the hinge loss.

² Experiments were performed, with running time reported, on a workstation with 12 nodes, each with an Intel Xeon E5-2620 2.0GHz CPU and 64GB RAM.

## References

[1] F. R. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In NIPS, 2011.
[2] L. Bottou and Y. Le Cun. Large scale online learning. In NIPS, 2003.
[3] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050-2057, 2004.
[4] S. Clemencon, G. Lugosi, and N. Vayatis. Ranking and empirical minimization of U-statistics. The Annals of Statistics, 36(2):844-874, 2008.
[5] C. Cortes and M. Mohri. AUC optimization vs. error rate minimization. In NIPS, 2003.
[6] W. Gao, R. Jin, S. Zhu, and Z.-H. Zhou. One-pass AUC optimization. In ICML, 2013.
[7] W. Gao and Z.-H. Zhou. On the consistency of AUC pairwise optimization. In IJCAI, 2015.
[8] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29-36, 1982.
[9] T. Joachims. A support vector method for multivariate performance measures. In ICML, 2005.
[10] T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217-226, 2006.
[11] P. Kar, B. K. Sriperumbudur, P. Jain, and H. Karnick. On the generalization ability of online learning algorithms for pairwise loss functions. In ICML, 2013.
[12] W. Kotlowski, K. Dembczynski, and E. Hüllermeier. Bipartite ranking through minimization of univariate loss. In ICML, 2011.
[13] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365-397, 2012.
[14] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.
[15] I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, 22(4):1679-1706, 1994.
[16] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.
[17] A. Rakotomamonjy. Optimizing area under ROC curve with SVMs. In 1st International Workshop on ROC Analysis in Artificial Intelligence, 2004.
[18] Y. Wang, R. Khardon, D. Pechyony, and R. Jones. Generalization bounds for online learning algorithms with pairwise loss functions. In COLT, 2012.
[19] Y. Ying and M. Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8(5):561-596, 2008.
[20] Y. Ying and D.-X. Zhou. Online pairwise learning algorithms. Neural Computation, 28:743-777, 2016.
[21] P. Zhao, S. C. H. Hoi, R. Jin, and T. Yang. Online AUC maximization. In ICML, 2011.
[22] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.