# stochastic_proximal_auc_maximization__0e5d0d3f.pdf

Journal of Machine Learning Research 22 (2021) 1-45 Submitted 5/19; Revised 12/20; Published 2/21

Stochastic Proximal AUC Maximization

Yunwen Lei y.lei@bham.ac.uk School of Computer Science University of Birmingham Birmingham B15 2TT, United Kingdom

Yiming Ying yying@albany.edu Department of Mathematics and Statistics State University of New York at Albany Albany, USA

Editor: Massimiliano Pontil

In this paper we consider the problem of maximizing the Area under the ROC curve (AUC) which is a widely used performance metric in imbalanced classiﬁcation and anomaly detection. Due to the pairwise nonlinearity of the objective function, classical SGD algorithms do not apply to the task of AUC maximization. We propose a novel stochastic proximal algorithm for AUC maximization which is scalable to large scale streaming data. Our algorithm can accommodate general penalty terms and is easy to implement with favorable O(d) space and per-iteration time complexities. We establish a high-probability convergence rate O(1/

T) for the general convex setting, and improve it to a fast convergence rate O(1/T) for the cases of strongly convex regularizers and no regularization term (without strong convexity). Our proof does not need the uniform boundedness assumption on the loss function or the iterates which is more ﬁdelity to the practice. Finally, we perform extensive experiments over various benchmark data sets from real-world application domains which show the superior performance of our algorithm over the existing AUC maximization algorithms. Keywords: AUC maximization, imbalanced classiﬁcation, stochastic gradient descent, proximal operator

1. Introduction

Area under the ROC curve (AUC) (Hanley and Mc Neil, 1982) measures the probability for a randomly drawn positive instance to have a higher decision value than a randomly sampled negative instance. It is a widely used metric for measuring the performance of machine learning algorithms in imbalanced classiﬁcation and anomaly detection (Bradley, 1997; Cortes and Mohri, 2004; Fawcett, 2006; Narasimhan and Agarwal, 2017; Maurer and Pontil, 2020). In particular, minimization of the rank loss in bipartite ranking is equivalent to maximizing the AUC criterion (Agarwal et al., 2005; G uvenir and Kurtcephe, 2013; Kotlowski et al., 2011). At the same time, we are experiencing the fundamental change of the sheer size of commonly generated datasets where streaming data is continuously arriving in a real time manner. Hence, it is of practical importance to develop eﬃcient optimization

. Corresponding author

c 2021 Yunwen Lei and Yiming Ying.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/19-418.html.

Lei and Ying

algorithms for maximizing the AUC score which is scalable to large-scale streaming datasets for real-time predictions. Stochastic (proximal) gradient descent (SGD), also known as stochastic approximation or incremental gradient, has become the workhorse in machine learning (Bach and Moulines, 2013; Bottou and Cun, 2004; Orabona, 2014; Rakhlin et al., 2012; Rosasco et al., 2014; Srebro and Tewari, 2010; Denevi et al., 2019). It can be regarded as online learning (Cesa-Bianchi and Lugosi, 2006; Hazan, 2016; Shalev-Shwartz, 2012; Orabona, 2019) in the stochastic setting where the individual data point is assumed to be drawn randomly from a (unknown) distribution. These algorithms are iterative and incremental in nature and process each new sample (input) with a computationally cheap update, making them amenable for streaming data analysis. The working mechanism behind classical SGD algorithms is to perform gradient descent using unbiased (random) samples of the true gradient. In the sense, the objective function is required to be linear in the sampling distribution. For example, in binary classiﬁcation, let ρ be a probability measure (sampling distribution) deﬁned on input/output space X Y with X Rd and Y = { 1}. The linearity with respect to the sampling distribution ρ in this case means that the objective function (true risk) is the expectation of a pointwise loss function ℓ: Rd X Y [0, ), i.e.

R(w) = E[ℓ(w, x, y)] = ZZ

X Y ℓ(w, x, y)dρ(x, y).

This linearity plays a pivotal role in studying the convergence of SGD and deriving many of its appealing properties. In contrast, the problem of AUC maximization involves the expectation of a pairwise loss function which depends on pairs of data points. Consequently, the objective function in AUC maximization is pairwise nonlinear with respect to the sampling distribution ρ. To be more precise, recall (Hanley and Mc Neil, 1982; Cl emen con et al., 2008) that the AUC score of a function hw(x) = w x is deﬁned by

AUC(w) = Pr{w x w x |y = +1, y = 1} = E I[w x w x ]|y = +1, y = 1 , (1.1)

where E[ ] is with respect to (x, y) and (x , y ) independently drawn from ρ. Since the indicator function I[ ] is discontinuous, one often resorts to a convex surrogate loss ℓ: R 7 R+ and some common choices are the square loss ℓ(a) = (1 a)2, the hinge loss ℓ(a) = max{0, 1 a} (Zhao et al., 2011) and the exponential loss ℓ(a) = exp( a) (Rudin and Schapire, 2009). In this paper, we consider the square loss since it is statistically consistent with AUC (Gao and Zhou, 2015) and its speciﬁc structure allows us to reformulate the pairwise learning problem as a pointwise learning problem (Ying et al., 2016; Liu et al., 2018; Natole et al., 2018). As a comparison, AUC maximization based on other loss functions requires to compare a pair of examples in updating models (Zhao et al., 2011; Kar et al., 2013; Wang et al., 2012b), which causes expensive space and per-iteration complexity. Then, we have

p(1 p) h 1 AUC(w) i f(w) :=p(1 p)E[(1 w (x x ))2|y = 1, y = 1]

=E (1 w (x x ))2I[y=1,y = 1] , (1.2)

Stochastic Proximal AUC Maximization

where p = Pr(y = 1). As in Ying et al. (2016); Natole et al. (2018); Liu et al. (2018), the idea of introducing the factor p(1 p) is to replace the conditional expectation in the AUC score with the expectation, which is more convenient to deal with. Now the regularization framework for maximizing the AUC score can be formulated as follows

n φ(w) := f(w) + Ω(w) o , (1.3)

where Ω: Rd 7 R+ is a convex regularizer. This pairwise nonlinearity in the sampling distribution makes the direct deployment of standard SGD infeasible.

1.1 Related Work

There are considerable eﬀorts on developing optimization algorithms for AUC maximization, which can roughly be divided into three categories. The ﬁrst category is batch learning algorithms for AUC maximization with focus on the empirical risk minimization (Cortes and Mohri, 2004) which use the training data at once. For instance, the early work (Joachims, 2005; Herschtal and Raskutti, 2004) proposed to use the cutting plane method and gradient descent algorithm, respectively. Zhang et al. (2012) developed an appealing algorithmic framework for optimizing the multivariate performance measures (Joachims, 2005) including the AUC score and precision-recall break-even point. The algorithms there used the smoothing techniques (Nesterov, 2007) and the Nesterov s accelerated gradient algorithm (Nesterov, 1983). Support Vector Algorithms were proposed to maximize the partial area under the ROC curve between any two false positive rates, which is interesting in several applications, e.g., ranking, biometric screening and medicine (Narasimhan and Agarwal, 2017). Such batch learning algorithms generally require O min 1

λε iterations to achieve an accuracy of ϵ, but have a high per-iteration cost of O(nd). Here, λ, n, and d are the regularization parameter, the number of samples, and the dimension of the data, respectively. Such algorithms train the model on the whole training data which are not suitable for analyzing massive streaming data that arrives continuously. The second category of work (Kar et al., 2013; Wang et al., 2012b; Ying and Zhou, 2016) extended the classical online gradient descent (OGD) (Zinkevich, 2003; Hazan, 2016; Shalev Shwartz, 2012) to the setting of pairwise learning and hence is applicable to the problem of AUC maximization. Regret bounds were established there which can be converted to generalization bounds in the stochastic setting as shown by Kar et al. (2013); Wang et al. (2012b). Such algorithms, however, need to compare the latest arriving data with previous data which require to store the historic data. This leads to expensive space and per-iteration complexities O(td) at the t-th iteration which is not feasible for streaming data. For the speciﬁc square loss, Gao et al. (2013) developed an one-pass AUC maximization method by updating the covariance matrices of the training data, which has O(d2) space and periteration time complexity which could be problematic for high-dimensional data. The third category of work (Ying et al., 2016; Liu et al., 2018; Natole et al., 2018; Liu et al., 2020) considered the expected risk and used primal-dual SGD algorithms. In particular, Ying et al. (2016); Natole et al. (2018) formulated AUC maximization (1.3) as

Lei and Ying

a saddle point problem as follows

min w,a,b R max α R Ez F(w, a, b, α; z) + Ω(w), (1.4)

where F(w, a, b, α; z) = p(1 p) + (1 p)(w x a)2I[y=1] + p(w x b)2I[y= 1] + 2(1 + α)w x p I[y= 1] (1 p)I[y=1] p(1 p)α2. Then, they proposed to perform SGD on both the primal variables w, a and b, and the dual variable α. This algorithm has per-iteration and space cost of O(d), making them amenable for streaming data analysis. It enjoys a moderate convergence rate O(1/

T). The most recent work by Liu et al. (2018) also used this saddle point formulation and developed a novel multi-stage scheme for running primal-dual stochastic gradient algorithms which enjoy a fast convergence of e O(1/T)1 for non-strongly-convex objective functions. Both algorithms in Ying et al. (2016); Liu et al. (2018) require a critical assumption of uniform boundedness for model parameters. i.e. w R which might be diﬃcult to adjust in practice. Natole et al. (2018) developed a stochastic proximal algorithm for AUC maximization with a convergence rate e O(1/T) for strongly convex objective function. The potential limitation of this method is that it assumes the conditional expectations E[x |y = 1] and E[x |y = 1] are known a priori which is hard to satisfy in practice. There are some other related work. For instance, Palaniappan and Bach (2016) developed an appealing stochastic primal-dual algorithm for saddle point problems with convergence rate of O( 1

T ) which, as a by-product, can be applied to AUC maximization with the square loss. However, their saddle point formulation focused on the empirical risk minimization and cannot be applied to the population risk in our case. In addition, the primal-dual algorithm there requires strong convexity on both the primal and dual variables, and the algorithm has per-iteration complexity O(n + d) where n is the total number of training samples and d is the dimension of the data. Our work falls in the regime where the aim is to minimize an expected-valued objective function which is nonlinear with respect to the sampling distribution. This research area is attracting more and more attention in optimization and machine learning with important applications to reinforcement learning and robust learning. For example, Wang et al. (2016, 2017) proposed a stochastic compositional gradient descent (SCGD) for solving the problem

min w ΩE[fv(E(gw(w)|v))], (1.5)

where Ωis a closed convex set of Rn, fv : Rm 7 R and gw : Rn 7 Rm are functions parametrized by the random variables w and v. However, it is not clear how to formulate the problem of AUC maximization as (1.5). In addition, the SCGD algorithms proposed in (Wang et al., 2017, 2016) require that both the gradients of fv and gw are bounded which is not the case for our setting since we use the square loss. As we show soon in the next section, we explore the intrinsic structure of AUC maximization to show our proposed algorithms are guaranteed to converge with high probability without boundedness assumptions. Moreover, it can achieve a fast convergence rate of e O( 1

T ) without strong convexity.

1. We use the notation e O to hide polynomial of logarithms.

Stochastic Proximal AUC Maximization

Algorithm storage/per-iteration bound type rate penalty OAM (Zhao et al., 2011) O(Bd) regret O(1/

T) ℓ2 OPAUC (Gao et al., 2013) O(d2) regret O(1/

T) ℓ2 SOLAM (Ying et al., 2016) O(d) w.h.p. O(1/

T) ℓ2constraint FSAUC (Liu et al., 2018) O(d) w.h.p. e O(1/T) ℓ1constraint SPAM (Natole et al., 2018) O(d) expectation e O(1/T) strongly convex SPAUC (this work) O(d) w.h.p. e O(1/

T) convex regularizer

SPAUC (this work) O(d) w.h.p. e O(1/T) strongly convex or no regularizer

Table 1: Comparison of diﬀerent AUC maximization methods. The notation B refers to the buﬀer size in Zhao et al. (2011). For the bound type, regret refers to regret bounds, expectation refers to convergence rates in expectation and w.h.p. refers to convergence rates with high probability. If the bound type is regret , we use the rate O(1/

T) to mean regret bounds O(

T) for a consistent comparison.

1.2 Main Contributions

In this paper, we propose novel SGD algorithms for AUC maximization which does not need the boundedness assumptions and can achieve a fast convergence rate without strong convexity. Our key idea is the new decomposition technique (see Proposition 1) which directly works with the objective function motivated by the saddle point formulation (Ying et al., 2016; Natole et al., 2018). From this new decomposition, we are able to design approximately unbiased estimators for the true gradient f(w). Our algorithms do not need to store the previous data points in contrast to the approaches in Wang et al. (2012b); Kar et al. (2013); Zhao et al. (2011) or accessing true conditional expectations as in Natole et al. (2018). A comparison of our algorithm with other methods is summarized in Table 1. From the side of technical novelty, we develop techniques to control the norm of iterates with high probability for stochastic learning based on biased estimators of f(w), and hence there is no boundedness assumptions on the iterates. Our major contributions can be summarized as follows.

We propose a novel stochastic proximal algorithm for AUC maximization which accommodates general convex regularizers with favorable O(d) space and per-iteration time complexities. Our algorithm is gradient-based and hence is simple and easy to implement which does not need the multi-stage design (Liu et al., 2018) and boundedness assumption on model parameters (Liu et al., 2018; Ying et al., 2016).

We establish a convergence rate e O(1/

T) with high probability for our algorithm with T iterations, and improve it to a fast convergence e O(1/T) for both cases of no regularization term (non-strong convexity) and strongly convex regularizers.

We perform a comprehensive empirical comparison against ﬁve state-of-the-art AUC maximization algorithms over sixteen benchmark data sets from real-world applica-

Lei and Ying

tion domains. Experimental results show that our algorithm can achieve superior performance with a consistent and signiﬁcant reduction in running time.

Organization of the paper. The remainder of this paper is organized as follows. We state the algorithm with motivation in Section 2. Theoretical and experimental results are presented in Section 3 and Section 4, respectively. We give some proofs in Section 5 and defer others to the Appendix. We conclude the paper in Section 6.

2. Proposed Algorithm

Our objective is to develop eﬃcient SGD-type algorithms for AUC maximization scalable to large-scale streaming data. In particular, we aim to design an (approximately) unbiased estimator for the true gradient f(w) with per-iteration cost O(d) to perform SGD-type algorithms. In particular, our new design is mainly motivated by the saddle point formulation in Ying et al. (2016); Natole et al. (2018). To illustrate the main idea, let

e F(w; z) = p(1 p) + (1 p) w x E[x |y = 1] 2I[y=1]

+ p w x E[x |y = 1] 2I[y= 1] + 2p(1 p)w E[x |y = 1] E[x |y = 1]

+ p(1 p) w E[x |y = 1] E[x |y = 1] 2. (2.1)

We will give an intuitive explanation of e F in Remark 2. It was shown in Ying et al. (2016); Natole et al. (2018) that the saddle point formulation (1.4) implies that f(w) = minw,a,b maxα E[F(w, a, b, α; z)]. In particular, for any ﬁxed w the optima a, b, α have a closed-form solution of a(w), b(w) and α(w) which are given by

a(w) = w E[x |y = 1], b(w) = w E[x |y = 1], α(w) = b(w) a(w). (2.2)

Indeed, let F1(w; z) = F(w, a(w), b(w), α(w); z) and then

F1(w; z) = (1 p) w x E[x |y = 1] 2I[y=1] + p w x E[x |y = 1] 2I[y= 1] + 2 1 + w E[x |y = 1] E[x |y = 1] w x p I[y= 1] (1 p)I[y=1]

+ p(1 p) p(1 p) w E[x |y = 1] E[x |y = 1] 2.

Note that w E x p I[y= 1] (1 p)I[y=1] = p(1 p)w E[x |y = 1] E[x |y = 1] . After organizing the terms, one can easily see that E[ e F(w; z)] = E[F1(w; z)] = f(w) for any w. Consequently, one can see that both e F(w; z) and F1(w, z) are unbiased estimators of f(w), i.e. E[ F1(w; z)] = E[ e F(w; z)] = f(w). The work of Natole et al. (2018) proposed to use F(w, a(w), b(w), α(w); z) as an unbiased gradient estimator and the convergence analysis was proved in expectation. It is easy to see that F1(w; z) is not convex, i.e. the Hessian of F1(w; z) is not positive-semi-deﬁnite (PSD). The non-convexity of F1(w; z) presents daunting diﬃculties to bound the iterates and deriving the convergence of the algorithm with high probability using concentration inequalities. In contrast, the new design of e F(w; z) is convex with respect to w which will enable us to prove convergence with high probability.

Stochastic Proximal AUC Maximization

In a nutshell, we have the following important proposition. Motivated by the saddlepoint formulation in Ying et al. (2016); Natole et al. (2018) as mentioned above, we also give an alternative but self-contained proof by writing the objective function as

(1 w (x x ))2 = [1 + α(w)] + [w x b(w)] [w x a(w)] 2

= 1 + w (E[ x| y = 1] E[ x| y = 1])

+ w (x E[ x| y = 1]) w (x E[ x| y = 1]) 2. (2.3)

Proposition 1 For any w, we have

E e F(w; z) = f(w) and E e F (w; z) = f(w), (2.4)

where we use the abbreviation e F (w; z) := e F(w;z)

w . Furthermore, for any z the function e F(w; z) is a convex function of w.

Proof As indicated by (2.3), we write (1 w (x x ))2 as three terms: 1 + w (E[ x| y = 1] E[ x| y = 1]) + w (x E[ x| y = 1]) w (x E[ x| y = 1]) 2

= 1 + w (E[ x| y = 1] E[ x| y = 1]) 2 + w (x E[ x| y = 1]) w (x E[ x| y = 1]) 2

+ 2 1 + w (E[ x| y = 1] E[ x| y = 1]) w (x E[ x| y = 1]) w (x E[ x| y = 1])

= I + II + III.

It suﬃces to estimate the above terms one by one. To this end, the ﬁrst term is deterministic, and hence

E[ I |y = 1, y = 1] = 2w E[ x| y = 1] E[ x| y = 1]

+ w E[ x| y = 1] E[ x| y = 1] 2 + 1. (2.5)

For the second term, noticing that E w (x E[ x| y = 1])w (x E[ x| y = 1]) |y = 1, y = 1 = E w (x E[ x| y = 1]) |y = 1 E w (x E[ x| y = 1]) |y = 1 = 0 as (x, y) and (x , y ) are independent, we have

E[ II |y = 1, y = 1]

= E w (x E[ x| y = 1]) w (x E[ x| y = 1]) 2|y = 1, y = 1

= E w (x E[ x| y = 1]) 2|y = 1, y = 1 + E w (x E[ x| y = 1]) 2|y = 1, y = 1

= E w (x E[ x| y = 1]) 2|y = 1 + E w (x E[ x| y = 1]) 2|y = 1

= 1 1 p E w (x E[ x| y = 1]) 2I[y = 1] + 1

p E w (x E[ x| y = 1]) 2I[y=1] . (2.6)

For the third term,

E[ III |y = 1, y = 1] = 2 1 + w (E[ x| y = 1] E[ x| y = 1])

E w (x E[ x| y = 1]) w (x E[ x| y = 1])|y = 1, y = 1 = 0. (2.7)

Lei and Ying

Combining equations (2.5),(2.6), and (2.7) together, we have

f(w) = p(1 p)E[(1 w (x x ))2|y = 1, y = 1]

= 2p(1 p)w E[ x| y = 1] E[ x| y = 1]

+ p(1 p) w E[ x| y = 1] E[ x| y = 1] 2 + p(1 p)

+ p E h w (x E[ x| y = 1]) 2I[y = 1] i + (1 p)E h w (x E[ x| y = 1]) 2I[y=1] i ,

which implies that f(w) = E[ e F(w; z)]. The fact of E e F (w; z) = f(w) follows directly from the Leibniz s integral rule that the derivative and the integral can be interchangeable as F is a quadratic function and the input x is from a bounded domain. For the last statement, notice that

2 e F(w; z) = 2(1 p) x E[ x| y = 1] x E[ x| y = 1] I[y=1] + 2p(x E[ x| y = 1])(x E[ x| y = 1]) I[y= 1]

+ 2p(1 p) E[ x| y = 1] E[ x| y = 1] E[ x| y = 1] E[ x| y = 1] .

It is clear that 2 e F(w; z) is positive semi-deﬁnite, and hence e F(w; z) is a convex function of w for any z. This completes the proof of the proposition.

Remark 2 We summarize the key idea in the proof. Let e E[ ] := E[ |y = 1, y = 1]. Then the proof essentially shows

e E 1 w (x x ) 2 = e E h 1 w e E[x x ] + w (e E[x] x) + w (x e E[x ]) 2i

= e E h 1 w e E[x x ] 2i + e E h w (e E[x] x) 2i + e E h w (e E[x ] x ) 2i . (2.8)

Note that e E h w (e E[x] x) 2i and e E h w (e E[x ] x ) 2i are conditional variance of w x in the positive class and negative class, respectively. Therefore, the AUC score can be considered as the summation of two conditional variances and a term depending on E[x|y = 1] E[x |y = 1]. This formulation motivates us to introduce e F in (2.1). Indeed, it is clear that

1 p(1 p) e F(w; z) = 1 w E[x |y = 1] E[x |y = 1] 2

p w x E[x |y = 1] 2I[y=1] + 1 1 p w x E[x |y = 1] 2I[y= 1].

Note that 1

p w x E[x |y = 1] 2I[y=1] is an unbiased estimator of the conditional variance

E[(w (x E[x|y = 1]))2|y = 1], and 1 1 p w x E[x |y = 1] 2I[y= 1] is an unbiased

estimator of the conditional variance E[(w (x E[x|y = 1]))2|y = 1]. Therefore, e F is derived from (2.8) by replacing the conditional variances with their unbiased estimators.

Stochastic Proximal AUC Maximization

Proposition 1 indicates to use e F (w; z) as an unbiased estimator for the gradient f(w). However, the function e F requires the unknown information p, E[x |y = 1] and E[x |y = 1], which is unknown in practice. We propose to replace them by their empirical counterpart deﬁned as follows

Pt 1 i=0 I[yi=1]

Pt 1 i=0 xi I[yi=1] Pt 1 i=0 I[yi=1] , vt =

Pt 1 i=0 xi I[yi= 1] Pt 1 i=0 I[yi= 1] , (2.9)

where we reserve an example (x0, y0) drawn independently from ρ. The resulting estimator for F at time t then becomes

ˆFt(w; z) = (1 pt) w x ut 2I[y=1] + pt w x vt 2I[y= 1]

+ 2pt(1 pt)w vt ut + pt(1 pt) w vt ut 2 + pt(1 pt).

It is easy to verify by computing its Hessian that ˆFt(w; z) is convex with respect to w. Its gradient can be directly computed as follows

ˆF t(w; z) := ˆFt(w; z)

w = 2(1 pt) x ut x ut w I[y=1] +2pt x vt x vt w I[y= 1]

+ 2pt(1 pt) vt ut + 2pt(1 pt) vt ut vt ut w. (2.10)

Note the stochastic gradient ˆF t(w; z) can be eﬃciently computed with an arithmetic cost O(d) and we do not need to store covariance matrices.

Algorithm 1: Stochastic Proximal AUC Maximization (SPAUC)

Input: {ηt}t, Ω, w1 and T. Output: an approximate solution of (1.3)

initialize: n+ 0, n 0, s+ 0, s 0

1 for t = 1, 2, . . . , T do

2 n+ n+ + I[yt=1]

3 n n + I[yt= 1]

4 s+ s+ + xt I[yt=1]

5 s s + xt I[yt= 1]

7 calculate the stochastic gradient ˆF t(w; zt) according to (2.10)

8 update wt+1 according to (2.11)

Algorithm: We propose to solve this regularization problem (1.3) by the following Stochastic proximal AUC maximization (SPAUC) algorithm with w1 = 0 and for any t 1,

wt+1 = arg min w Rd ηt w wt, ˆF t(wt; zt) + ηtΩ(w) + 1

2 w wt 2 2, (2.11)

where {ηt}t is a sequence of positive step sizes and zt is drawn independently from ρ at the t-th iteration. At the t-th iteration, SPAUC builds a temporary objective function

Lei and Ying

consisting of three components: a ﬁrst order approximation of f(w) based on the stochastic gradient ˆF t(w; z), a regularizer kept intact to preserve a composite structure and a term 1 2 w wt 2 2 to make sure the upcoming iterate wt+1 not far away from the current iterate. The pseudo-code of SPAUC is given in Algorithm 1. We comment that Algorithm 1 diﬀers from the standard stochastic proximal algorithm (Duchi and Singer, 2009; Rosasco et al., 2014; Lei and Tang, 2018) in that ˆF t is a biased estimator of f due to the use of pt, ut and vt. High-probability convergence rates of standard stochastic proximal algorithms without boundedness assumptions were recently established in Lei and Tang (2018). We extend the analysis in Lei and Tang (2018) by building novel techniques to handle the bias of ˆF t. Although we only consider the development of AUC maximization here, the developed techniques may apply to general stochastic proximal algorithms based on approximately unbiased gradient estimators.

3. Main Convergence Results

In this section, we present theoretical convergence rates with high probability for SPAUC. We consider two types of objective functions of the form (1.3): AUC maximization with a convex φ and AUC maximization with φ satisfying a quadratic functional growth. Let S = {w Rd : φ(w) = min w φ( w)} be the set of minimizers. For any w, we denote by w = arg min w S w w 2 the projection of w on to S , where for any p 1 and

w = (w1, . . . , wd) , we denote w p = Pd j=1 |wj|p 1

p . We always assume w 1 2 < .

3.1 General Convergence Rates

In this subsection, we present convergence rates for the general regularization framework for AUC maximization. To this aim, we need to impose a so-called self-bounding property on the regularizers, meaning the subgradients can be bounded in terms of function values. We denote by Ω (w) a subgradient of Ωat w and assume Ω(0)=0.

Assumption 1 (Self-bounding property) There exist constants A1, A2 0 such that the convex regularizer Ωsatisﬁes

Ω (w) 2 2 A1Ω(w) + A2, for all w Rd. (3.1)

Based on A1 and κ := max{1, supx X x 2}, we introduce a constant

C1 = max{A1, 16κ2}.

This self-bounding assumption above is very mild as many regularizers satisfy self-bounding property, including all smooth regularizers and all Lipschitz regularizers. For example, if Ω(w) = λ w 2 2, then (3.1) holds with A1 = 4λ and A2 = 0. If Ω(w) = λ w 1, then (3.1) holds with A1 = 0 and A2 = λ2. It is reasonable to assume a small regularization parameters in practice (e.g., λ 1), in which case we can take universal constants A1 and A2 for the above mentioned regularizers. Our theoretical analysis requires to estimate wt 2, which is achieved by the following lemma to be proved in Section 5.3. Essentially, it shows that wt 2 is bounded (ignoring logarithmic factors) if we consider step sizes satisfying (3.2). This result shows that the complexity of wt is well controlled even if the iterates are updated in an unbounded domain.

Stochastic Proximal AUC Maximization

Theorem 3 Let {wt}t be produced by (2.11) with ηt (2C1) 1 and ηt+1 ηt. We suppose Assumption 1 holds,

t=1 η2 t < . (3.2)

Then for any δ (0, 1), there exists a constant C2 independent of T (explicitly given in the proof and of the order of w 1 2 2) such that the following inequality holds with probability at least 1 δ max 1 t T w t w 1 2 2 C2 log(2T/δ). (3.3)

We are now ready to present convergence rates for SPAUC applied to general AUC objectives. In Theorem 4 we present general convergence rates in terms of step sizes satisfying (3.2), which are then instantiated in Corollary 5 by specifying step sizes. The convergence rate O(T 1

δ ) is optimal up to a logarithmic factor for stochastic algorithms applied to general convex optimization problems (Agarwal et al., 2009).

Theorem 4 Let the conditions of Theorem 3 hold. Then, for any δ (0, 1) there exists a constant C3 independent of T such that the following inequality holds with probability at least 1 δ

φ( w(1) T ) inf w φ(w) C3 T X

t=1 ηt 1 max n T X

t o log 3 2 T

where w(1) T = PT t=1 ηtwt/ PT t=1 ηt is a weighted average of the ﬁrst T iterates.

Corollary 5 Let {wt}t be produced by (2.11) and δ (0, 1). Suppose Assumption 1 holds and η1 (2C1) 1.

(1) If we choose ηt = η1t θ with θ > 1/2, then with probability at least 1 δ we have φ( w(1) T ) infw φ(w) = O T θ 1 log 3 2 T

(2) If we choose ηt = η1(t logβ(et)) 1

2 with β > 2, then with probability at least 1 δ we have φ( w(1) T ) infw φ(w) = O T 1

The proofs for Theorem 4 and Corollary 5 can be found in Section 5.4.

3.2 Fast Convergence Rates

In this subsection, we show that a faster convergence rate is possible for SPAUC if a quadratic functional growth condition is imposed to the objective function (Anitescu, 2000; Necoara et al., 2018).

Assumption 2 (Quadratic functional growth) We assume the existence of σφ > 0 such that φ(w) φ(w ) σφ w w 2 2, for all w Rd. (3.4)

Lei and Ying

The quadratic functional growth assumption (3.4) means that the objective function grows faster than the squared distance between any feasible point and the optimal set (Necoara et al., 2018). This condition is milder than assuming a strong convexity of φ (Necoara et al., 2018). Indeed, it holds if Ω(w) = λ w 2 2. It also holds if we consider no regularization, i.e., Ω(w) = 0 as shown in the next proposition. We give the proof for completeness.

Proposition 6 The function φ(w) = f(w) satisﬁes Assumption 2.

Proof Indeed, the objective function can be written as

f(w) = p(1 p) Aw 2 2 + w c + 1

with A Rd d being a symmetric matrix satisfying A2 = E[(x x )(x x ) |y = 1, y = 1] and c = 2E[x x |y = 1, y = 1]. Analyzing analogously to the proof of Theorem 9 in Necoara et al. (2018), one can show that S = w : Aw = g for some g Rd. By the deﬁnition of w we know that w w is orthogonal to the kernel of A2 and therefore λmin(A2) w w 2 2 Aw Aw 2 2, where λmin(A2) denotes the smallest nonzero eigenvalue of A2. Furthermore, we know

p 1(1 p) 1 f(w) f(w ) = Aw 2 2 Aw 2 2 + (w w ) c

= Aw Aw 2 2 + 2 Aw Aw Aw + (w w ) c = Aw Aw 2 2,

where the last identity is due to the optimality condition 2A Aw + c = 0. It then follows that f(w) f(w ) p(1 p)λmin(A2) w w 2 2. The proof is complete.

Under Assumption 2, we show with high probability that the suboptimality measured by both the parameter distance and function values decay with the rate e O(T 1), which is optimal up to a logarithmic factor (Agarwal et al., 2009). Let σΩ 0 be a constant satisfying

Ω(w) Ω( w) w w, Ω ( w) 2 1σΩ w w 2 2, w, w Rd.

Note σΩcan be zero and therefore our results apply to non-strongly-convex regularizers, e.g., Ω(w) = 0 for all w. Without loss of generality, we assume σf := σφ σΩ 0.

Theorem 7 Let δ (0, 1). Suppose Assumption 1 and Assumption 2 hold. Let {wt}t be the sequence produced by (2.11) with ηt = 2 σφt+2σf+σφt1 , where t1 32C1σ 1 φ log 2T

δ . Then, the following inequality holds with probability at least 1 δ for t = 1, . . . , T (T > 2)

wt w t 2 2 = e O(1/t) and φ( w(2) t ) inf w φ(w) = e O(1/t), (3.5)

where w(2) t is a weighted average of iterates deﬁned by

w(2) t = t X

k=1 (k + t1 + 1) 1 t X

k=1 (k + t1 + 1)wk.

Stochastic Proximal AUC Maximization

The proof of Theorem 7 is postponed to Section C. The following two corollaries follow directly from Theorem 7 by noting the quadratic functional growth property of the associated objective functions and the self-bounding property of the regularizers. We omit the proof here for brevity.

Corollary 8 Let δ (0, 1). Let {wt}t be produced by (2.11) with ηt = 2 σφt+2σf+σφt1 and

Ω(w) = λ w 2 2/2, where t1 32C1σ 1 φ log 2T

δ . Then, (3.5) holds with probability 1 δ.

Corollary 9 Let δ (0, 1). Let {wt}t be produced by (2.11) with ηt = 2 σφt+2σf+σφt1 and

Ω(w) = 0, where t1 32C1σ 1 φ log 2T

δ . Then, (3.5) holds with probability 1 δ.

Remark 10 Our excess risk bounds are established for the square loss. There are some existing studies on connecting the bounds measured by convex surrogate and the bounds measured by AUC scores (Agarwal, 2013; Gao and Zhou, 2015). However, it is not clear to us how to combine these results to derive excess risk bounds in terms of AUC. We mention some diﬃculties in this direction as follows. First, the loss function e F(w; z) in (2.1) is deﬁned on the model parameter w instead of the predicted output w x (note there is w E[x |y = 1] in e F), and therefore it is not clear to us how to deﬁne conditional risks as in Agarwal (2013); Gao and Zhou (2015). Furthermore, we only consider linear models and Proposition 1 only holds for linear models. It would be very interesting to study excess risk bounds for the AUC score.

4. Experiments

In this section, we present experimental results to show the eﬀectiveness of the proposed algorithm in achieving a satisfactory AUC with a fast convergence speed. We ﬁrst describe the baseline methods used in our experimental comparison as well as the associated parameter setting in Section 4.1. Datasets used in the experiments and detailed experimental results are presented in Section 4.2 and Section 4.3, respectively.

4.1 Baseline Methods

We compare SPAUC to several state-of-the-art online AUC maximization algorithms. The algorithms we consider include

the stochastic proximal AUC maximization (SPAUC) (2.11) with either no regularizers Ω(w) = 0, an ℓ1 regularizer Ω(w) = λ w 1 or an ℓ2 regularizer Ω(w) = λ w 2 2;

the stochastic proximal AUC maximization (SPAM) (Natole et al., 2018) with Ω(w) = λ w 2 2;

the stochastic online AUC maximization (SOLAM) (Ying et al., 2016) based on a saddle problem formulation;

the one-pass AUC maximization (OPAUC) (Gao et al., 2013) which uses the ﬁrst and second-order statistics of training data to compute gradients;

Lei and Ying

the online AUC maximization based on the hinge loss function (OAM gra) (Zhao et al., 2011);

the fast stochastic AUC maximization (FSAUC) (Liu et al., 2018) which applies a multi-stage stochastic optimization technique to a saddle problem formulation.

datasets # inst # feat datasets # inst # feat datasets # inst # feat datasets # inst # feat diabetes 768 8 ijcnn1 141691 22 german 1000 24 satimage 6435 36 acoustic 78823 50 covtype 581012 54 a9a 32561 123 connect 67557 126 usps 9298 256 w8a 49749 300 mnist 60000 780 gisette 7000 5000 real-sim 72309 20958 protein h 145751 74 malware 71709 122 webspam u 350000 254

Table 2: Description of the datasets used in the experiments.

The performance of these algorithms depends on some parameters, which, as described below, we tune with the ﬁve-fold cross-validation. For SPAUC, SPAM and SOLAM, we consider step sizes of the form ηt = 2/(µt + 1) and validate the parameter µ over the interval 10{ 7, 6.5,..., 2.5}. Both SPAM and SPAUC with the ℓ1/ℓ2 regularizer require another regularization parameter to tune, which is validated over the interval 10{ 5, 4,...,0}. SOLAM involves the constraint on w, i.e. w belonging to ℓ2-ball with radius R in Rd, for which we tune over the interval 10{ 1,0,...,5}. For OAM gra, we need to tune a parameter to weight the comparison between released examples and bulk, which is validated over the interval 10{ 3, 2.5,...,1.5}. As recommended in Zhao et al. (2011), we ﬁx the buﬀer size to 100. For OPAUC, we need to tune both the constant step size and the regularization parameter λ, which are validated over the interval 10{ 3.5, 3,...,1} and 10{ 5, 4,...,0}, respectively. The multi-stage scheme in FSAUC speciﬁes how the step size decreases along the implementation of the algorithm and leave the initial step size as a free parameter to tune, which we validate over the interval 10{ 2.5, 2,...,2}. Furthermore, each iteration of FSAUC requires a projection onto an ℓ1-ball of radius of R, which we tune over the interval 10{ 1,0,...,5}. It should be noticed that both SPAUC with no regularizers and OAM gra only have a single parameter to tune, while all other algorithms have two parameters to tune. To speed up the training process, if the algorithm has two parameters p1, p2 to tune, we ﬁrst construct all the possible pairs (p1, p2) by enumerating all possible candidate values of p1 and p2, out of which we randomly sample 15 pairs without replacement to tune. We repeat the experiments 20 times and report the average of experimental results.

4.2 Datasets

We perform our experiments on several real-world datasets. We consider two types of datasets: the UCI benchmark dataset and the dataset in the domain of anomaly detection. The task of anomaly detection is to identify rare items, events or observations which raise suspicions by diﬀering signiﬁcantly from the majority of the data. As such, this is suitable to test the performance of AUC maximization methods since the class there is intrinsically and highly imbalanced. We consider three datasets in the domain of anomaly detection: protein h, webspam u and malware. In particular, webspam u is a subset used in the Pascal Large Scale Learning Challenge (Wang et al., 2012a) to detect malicious web pages,

Stochastic Proximal AUC Maximization

protein h is a dataset in bioinformatics used to predict which proteins are homologous to a query sentence (non-homologous sequences are labeled as anomalies) (Caruana et al., 2004), and malware was collected in the Android Malware Genome Project used to detect mobile malware app (Jiang and Zhou, 2012). The remaining UCI datasets can be downloaded from the LIBSVM webpage (Chang and Lin, 2011). For each dataset, we use 80% of data for training and the remaining 20% for testing. We transform datasets with multiple class labels into datasets with binary class labels by grouping the ﬁrst half of class labels into positive labels, and grouping the remaining class labels into negative labels. We run each algorithm until 15 passes of the training data are reached, and report the AUC values on the test dataset. The information of the dataset is summarized in Table 2 where we list the UCI datasets according to the dimensionality while datasets for anomaly detection are listed at the end.

4.3 Experimental Results

In this section, we present the experimental results and discuss the comparisons of our algorithm against other ones. In Figure 1, we plot the AUC values of the constructed models on the test data versus execution time in seconds for SPAUC (without regularization), SPAM, SOLAM, OPAUC, OAM gra and FSAUC. It is observed that SPAUC attains a faster training speed than all baseline methods. In particular, the curve of SOLAM ﬂuctuates rapidly, especially in the early stage of the optimization, which is perhaps due to the requirement of updating both primal and dual variables. OAM gra behaves more robustly, which, however, requires a high computation burden due to the requirement in updating a buﬀer and comparing the current example and examples in the buﬀer per iteration. As one can see from Figure 1, SPAUC converges faster than FSAUC on most of the datasets. The underlying reason could be two-fold. Firstly, FSAUC requires a projection onto the intersection of an ℓ1-ball and ℓ2-ball which requires an alternating projection step. Secondly, FSAUC requires to update both primal and dual variables, which further increases the computational cost per iteration. OPAUC has a low training speed due to the requirement in handling a covariance matrix, which is especially unfavorable for high-dimensional datasets. For example, OPAUC has the slowest training speed on USPS for which the dimensionality is 256. We do not run OPAUC on datasets with dimensionality larger than 1000 due to the heavy dependency of its time complexity on the dimensionality. The implementation of SPAM requires an accurate information of p, E[x |y = 1] and E[x |y = 1], which we approximate with

Pn i=1 I[yi=1]

Pn i=1 xi I[yi=1] Pn i=1 I[yi=1] , ˆv =

Pn i=1 xi I[yi= 1] Pn i=1 I[yi= 1] . (4.1)

It is observed that the AUC curve for SPAM attains a sharp increase at the beginning of the curve and then moderately increases. The underlying reason is that we include the computational cost of calculating ˆp, ˆu and ˆv in the curve, which requires to go through the whole training set. In Table 3, we also report detailed AUCs as well as the execution time per pass, both in the form of mean standard deviation. We can see from Table 3 that SPAUC achieves accuracies comparable to the state-of-the-art methods over all datasets. SPAUC (without

Lei and Ying

(a) diabetes

(d) satimage

(e) acoustic

(f) covtype

(h) connect

(l) gisette

(m) real-sim

(n) protein h

(o) malware

(p) webspam u

Figure 1: AUC versus time curves (in seconds) for SPAUC (without regularization), SPAM, SOLAM and OPAUC, OAM gra and FSAUC.

regularization) and SPAM require comparable running time per iteration since both algorithms require no projections and no updates on the dual variables. An advantage of SPAUC with no regularization over SPAM is that SPAUC can deal with streaming data in a truly online fashion, while SPAM needs to know the conditional expectations in (4.1) and hence is not an online learning algorithm. Furthermore, the fast convergence of SPAM requires the objective function to be strongly convex (Natole et al., 2018), which introduces an additional regularization parameter to tune. Other baseline methods require longer per-

Stochastic Proximal AUC Maximization

datasets SPAUC SPAM SOLAM OPAUC OAM gra FSAUC

diabetes AUC 0.8266 0.0284 0.8246 0.0303 0.8264 0.0308 0.7926 0.0462 0.8247 0.0266 0.8293 0.0375 Time 0.0075 0.0013 0.0071 0.0002 0.0141 0.0015 0.0121 0.0011 0.0201 0.0014 0.0210 0.0011

ijcnn1 AUC 0.9361 0.0019 0.9358 0.0018 0.9362 0.0019 0.9127 0.0021 0.9331 0.0031 0.9361 0.0015 Time 1.2881 0.0076 1.3080 0.0500 2.4498 0.0682 2.3807 0.0934 3.5811 0.1253 3.8352 0.0709

german AUC 0.7938 0.0246 0.7943 0.0255 0.7879 0.0326 0.7932 0.0313 0.7890 0.0278 0.7933 0.0262 Time 0.0094 0.0011 0.0099 0.0015 0.0179 0.0010 0.0169 0.0007 0.0281 0.0026 0.0278 0.0015

satimage AUC 0.9772 0.0029 0.9769 0.0040 0.9765 0.0028 0.9300 0.0066 0.9760 0.0029 0.9770 0.0041 Time 0.0609 0.0038 0.0589 0.0010 0.1181 0.0102 0.1212 0.0024 0.1802 0.0114 0.1826 0.0123

acoustic AUC 0.8952 0.0026 0.8910 0.0028 0.8911 0.0032 0.8911 0.0026 0.8929 0.0028 0.8877 0.0076 Time 0.7281 0.0216 0.7304 0.0278 1.3972 0.0168 1.6672 0.0213 2.7367 0.2275 2.2063 0.1009

covtype AUC 0.8236 0.0009 0.8235 0.0009 0.8228 0.0013 0.8233 0.0009 0.8134 0.0036 0.8233 0.0007 Time 5.4320 0.2447 5.4988 0.0629 10.5169 0.3023 13.1290 0.6334 19.303 3.6072 16.064 0.2350

a9a AUC 0.9000 0.0033 0.9003 0.0042 0.9003 0.0033 0.8957 0.0028 0.8879 0.0043 0.9002 0.0031 Time 0.3123 0.0018 0.3120 0.0023 0.5862 0.0035 0.9417 0.0610 0.9686 0.1143 0.9273 0.0146

connect AUC 0.8786 0.0031 0.8783 0.0023 0.8783 0.0032 0.8716 0.0027 0.8583 0.0035 0.8776 0.0036 Time 0.6520 0.0082 0.6532 0.0053 1.2386 0.0142 1.9633 0.0864 2.0464 0.2030 2.0307 0.0376

usps AUC 0.9225 0.0048 0.9226 0.0046 0.9182 0.0065 0.8765 0.0105 0.9079 0.0069 0.9154 0.0050 Time 0.0947 0.0044 0.1026 0.0022 0.1821 0.0093 0.2851 0.0282 0.2638 0.0195 0.2949 0.0045

w8a AUC 0.9694 0.0035 0.9692 0.0040 0.9663 0.0041 0.9454 0.0057 0.9640 0.0044 0.9695 0.0036 Time 0.5414 0.0069 0.5401 0.0262 0.9725 0.0119 1.6080 0.1558 1.5541 0.1302 1.5782 0.0228

mnist AUC 0.9306 0.0020 0.9302 0.0017 0.9304 0.0027 0.8345 0.0086 0.8908 0.0047 0.9302 0.0015 Time 0.8272 0.0241 0.8409 0.0168 1.3983 0.0592 2.3366 0.2359 2.2333 0.2533 2.3376 0.0418

gisette AUC 0.9970 0.0011 0.9969 0.0011 0.9940 0.0014 - 0.9931 0.0017 0.9908 0.0024 Time 0.2899 0.0224 0.2846 0.0208 0.3291 0.0253 - 0.8719 0.0807 0.5778 0.0423

real-sim AUC 0.9955 0.0004 0.9959 0.0002 0.9936 0.0005 - 0.9842 0.0021 0.9934 0.0006 Time 8.6884 0.2815 9.5146 0.3872 8.9692 0.3466 - 25.505 0.7707 16.132 0.4540

protein h AUC 0.9858 0.0029 0.9806 0.0030 0.9807 0.0054 0.9825 0.0040 0.9895 0.0017 0.9793 0.0036

Time 1.1807 0.0293 1.1943 0.0296 2.2331 0.0853 4.4396 0.4447 3.5537 0.7592 3.5484 0.1478

malware AUC 0.9606 0.0122 0.9595 0.0126 0.9589 0.0143 0.9587 0.0129 0.9566 0.0152 0.9581 0.0114 Time 0.7291 0.0212 0.7296 0.0183 1.2822 0.0473 2.1483 0.2648 1.9903 0.1021 1.9604 0.0637

webspam u AUC 0.9673 0.0008 0.9664 0.0007 0.9668 0.0005 0.9659 0.0006 0.9670 0.0012 0.9671 0.0006

Time 3.7163 0.2121 3.3759 0.1412 6.0562 0.3276 12.3849 0.9478 9.9146 0.1668 9.7933 0.5240

Table 3: Comparison of the testing AUC values and running time per pass (mean std.).

pass running time due to the same reasons we mentioned above for explaining the AUC curve in Figure 1. It can be seen that OAM gra requires longer per-pass running time than OPAUC if the dimensionality is relatively small, while the reverse is the case for datasets with a relatively large dimensionality. This is consistent with the dependency of the time complexity on the dimensionality for these two methods, i.e., linear versus quadratic.

Lei and Ying

(a) diabetes

(d) satimage

(e) acoustic

(f) covtype

(h) connect

(l) gisette

(m) real-sim

(n) protein h

(o) malware

(p) webspam u

Figure 2: AUC versus time curves (in seconds) for SPAUC, SPAM, SOLAM and OPAUC for objective functions with ℓ2 regularizer Ω(w) = λ w 2 2, λ = 10 6.

To show that SPAUC also works well with regularization, we consider (2.11) with either Ω(w) = λ w 2 2 or Ω(w) = λ w 1 in our experiments. We compare our method with SPAM, SOLAM and OPAUC. For the comparison of ℓ2 regularization, we modify the original SOLAM in Ying et al. (2016) by replacing the ℓ2-constraint with an ℓ2-regularizer. For the comparison with ℓ1 regularization, we modify the original implementation of SPAM, SOLAM and OPAUC to handle the ℓ1 regularizer. Therefore, these methods optimize the same objective function. We ﬁx the regularization parameter and tune the step-size

Stochastic Proximal AUC Maximization

(a) diabetes

(d) satimage

(e) acoustic

(f) covtype

(h) connect

(l) gisette

(m) real-sim

(n) protein h

(o) malware

(p) webspam u

Figure 3: AUC versus time curves (in seconds) for SPAUC, SPAM, SOLAM and OPAUC for objective functions with ℓ1 regularizer Ω(w) = λ w 1, λ = 10 6.

parameter µ by 5-fold cross validation. In Figure 2, we plot the AUC values as a function of execution time (in seconds) for SPAUC, SPAM, SOLAM and OPAUC in the setting of ℓ2 regularization with λ = 10 6. In Figure 3, we plot the AUC values as a function of execution time (in seconds) for SPAUC, SPAM, SOLAM and OPAUC in the setting of ℓ1 regularization with λ = 10 6. It can be seen that SPAUC attains a faster convergence speed as compared to the baseline methods. The same phenomenon also occurs for other choice

Lei and Ying

of regularization parameters, e.g., λ = 10 2 and λ = 10 4. We omit these results to save space.

In this section, we present proofs for theoretical properties of SPAUC. We ﬁrst give a road-map to clarify the basic idea. An essential ingredient of our proof is to show the almost boundedness of iterates. To this end, we ﬁrst establish the self-bounding property of ˆFt (Lemma 11) and the one-step progress inequality (Lemma 12), based on which we establish a crude bound of wt (Corollary 13). Then, we give high-probability bounds on the bias of using ˆF t(wt; zit) as a gradient estimate (Lemma 14). We then use these results and a Bernstein-type inequality to tackle a martingale diﬀerence sequence in the one-step progress inequality, yielding a high-probability bound on the iterates (Theorem 3). This bound on {wt} is further used to prove the convergence rate for general objective functions and objective functions with a quadratic functional growth.

5.1 Preliminary Lemmas

We present here some preliminary lemmas. The following lemma shows that an approximation of p, E[x |y = 1] and E[x |y = 1] by (2.9) still preserves the convexity. It also establishes the self-bounding property of ˆFt(w; z).

Lemma 11 For any w and z, we have

ˆF t(w; z) 2 2 16κ2 ˆFt(w; z) and ˆFt(w; z) 0. (5.1)

Furthermore, for any z the function ˆFt(w; z) is a convex function of w.

Proof The inequality ˆFt(w; z) 0 follows directly from the Schwartz s inequality:

ˆFt(w; z) 2pt(1 pt)w vt ut + pt(1 pt) w vt ut 2 + pt(1 pt) 0.

For any w and w, we have

ˆF t(w; z) ˆF t( w; z) 2 2(1 pt) (x ut)(x ut) (w w) 2I[y=1] + 2pt (x vt)(x vt) (w w) 2I[y= 1]+

2pt(1 pt) vt ut vt ut (w w) 2 8κ2 w w 2,

where in the last inequality we have used the deﬁnition of κ. Therefore, it follows from the self-bounding property of non-negative smooth functions (Lemma 17) that ˆF t(w; z) 2 2 16κ2 ˆFt(w; z). This establishes (5.1). It is clear that the Hessian matrix of ˆFt(w; z) is

2(1 pt) x ut x ut I[y=1] +2pt x vt x vt I[y= 1] +2pt(1 pt) vt ut vt ut ,

which is a semi-positive deﬁnite matrix. Therefore, ˆFt( ; z) is a convex function for any z. The proof is complete.

Stochastic Proximal AUC Maximization

Our theoretical analysis roots its foundation on the following one-step progress inequality measuring how the iterate would change after a single iteration of (2.11). The proof is standard and is deferred to Appendix B.

Lemma 12 (One-step Progress Inequality) Let {wt}t be produced by (2.11). If Assumption 1 holds, then for any w Rd we have

wt+1 w 2 2 w wt 2 2 2ηt w wt, ˆF t(wt; zt) + 2ηt Ω(w) Ω(wt)

ηtσΩ w wt+1 2 2 + 2η2 t C1 ˆFt(wt; zt) + C1Ω(wt) + A2 . (5.2)

Based on Lemma 12, we can derive several useful inequalities collected in the following corollary. Eq. (5.3) provides a general bound on the norm of iterates in terms of step sizes. Eqs. (5.4) and (5.5) show how the accumulation of function values can be controlled by step sizes, which, according to Lemma 11 and Assumption 1, in turn give useful estimates on Pt k=1 η2 k ˆF k(wk, zk) 2 2+ Ω (wk) 2 2 and Pt k=1 ˆF k(wk, zk) 2 2+ Ω (wk) 2 2 required to handle in convergence analysis. Since similar ideas have been used in Lei and Tang (2018), we defer the proof of Corollary 13 to Appendix B.

Corollary 13 Let {wt}t be produced by (2.11). Suppose ηt (2C1) 1 and Assumption 1 holds. Let C4 := C 1 1 A2 + 2 1. Then

wt+1 2 2 C4

k=1 ηk. (5.3)

Furthermore, if ηt+1 ηt, then

k=1 η2 k ˆFk(wk; zk) + Ω(wk) C4

k=1 η2 k (5.4)

ˆFk(wk; zk) + Ω(wk) C4t + C4η 1 t

k=1 ηk. (5.5)

5.2 Approximation of Stochastic Gradients

The implementation of SPAUC requires to approximate the unbiased stochastic gradient e F (wt; zt) by replacing the involved p, E[x |y = 1], E[x |y = 1] with their empirical counterparts. The following lemma gives a quantitative measure on the accuracy of this approximation.

Lemma 14 Let δ (0, 1). For any t N, the following inequality holds with probability at least 1 δ

e F (wt; zt) ˆF t(wt; zt) 2 2κ2 2 + p

p I[yt=1]+ 8 1 p I[yt= 1] wt 2+3 .

Lei and Ying

Before proving Lemma 14, we need to introduce the following preliminary lemma. For a matrix A, we denote by A op the operator norm of A, i.e., A op = sup w 2=1 Aw 2. For any u, v Rd, there holds uv op u 2 v 2. (5.6)

Lemma 15 Let δ (0, 1). For any t N, with probability at least 1 δ the following inequalities hold simultaneously

|p pt| 2 + p

2 log(3/δ) /

E[x |y = 1] ut 2 2κ(2 + p

2 log(3/δ)) p

E[x |y = 1] vt 2 2κ(2 + p

2 log(3/δ)) (1 p)

(1 pt)(x ut)(x ut) (1 p) x E[x |y = 1] x E[x |y = 1] op

pt(x vt)(x vt) p x E[x |y = 1] x E[x |y = 1] op

p(1 p) E[x |y = 1] E[x |y = 1] pt(1 pt)(vt ut) 2 3κ 2 + p

2 log(3/δ) /

p(1 p) E[x |y = 1] E[x |y = 1] E[x |y = 1] E[x |y = 1]

pt(1 pt)(vt ut)(vt ut) op 24κ2 2 + p

Proof According to Lemma 18, with probability at least 1 δ the following three inequalities hold simultaneously

|p pt| 2 + p

E[x I[y =1]] 1

i=0 xi I[yi=1] 2 (2 + p

2 log(3/δ))κ

E[x I[y = 1]] 1

i=0 xi I[yi= 1] 2 (2 + p

2 log(3/δ))κ

Stochastic Proximal AUC Maximization

We now prove (5.8). According to (2.9), we know

E[x |y = 1] ut 2 = 1

p E[x |y = 1] ptut + ptut put 2

E[x I[y =1]] 1

i=0 xi I[yi=1] 2 + ut 2

Then we can apply (5.7) and (5.14) to derive (5.8) with probability at least 1 δ.

Eq. (5.9) can be proved in a similar manner and we omit the proof for brevity.

We now show (5.10). It is clear that

(1 pt)(x ut)(x ut) (1 p) x E[x |y = 1] x E[x |y = 1] = (1 pt) (1 p) (x ut)(x ut)

+ (1 p)(x ut)(x ut) (1 p)(x ut) x E[x |y = 1]

+ (1 p)(x ut) x E[x |y = 1] (1 p) x E[x |y = 1] x E[x |y = 1] ,

from which and (5.6) we derive

(1 pt)(x ut)(x ut) (1 p) x E[x |y = 1] x E[x |y = 1] op |p pt| x ut x ut op

+(1 p) (x ut) E[x |y = 1] ut op +(1 p) E[x |y = 1] ut x E[x |y = 1] op 4κ2|p pt| + 4κ(1 p) E[x |y = 1] ut 2.

This together with (5.7) and (5.8) shows (5.10) with probability at least 1 δ.

Eq. (5.11) can be proved in a similar manner and we omit the proof for brevity.

We now prove (5.12). It is clear

p(1 p) E[x |y = 1] E[x |y = 1] pt(1 pt)(vt ut) = p E[x I[y = 1]] (1 p)E[x I[y =1]]

i=0 xi I[yi= 1] + 1 pt

i=0 xi I[yi=1] ,

from which we derive p(1 p) E[x |y = 1] E[x |y = 1] pt(1 pt)(vt ut) 2

p E[x I[y = 1]] pt

i=0 xi I[yi= 1] 2 + (1 p)E[x I[y =1]] (1 pt)

i=0 xi I[yi=1] 2

2κ|p pt| + pt E[x I[y = 1]] 1

i=0 xi I[yi= 1] 2 + (1 pt) E[x I[y =1]] 1

i=0 xi I[yi=1] 2,

where we have used p E[x I[y = 1]] = (p pt)E[x I[y = 1]] + pt E[x I[y = 1]]. We can then apply (5.7), (5.14) and (5.15) to derive the bound (5.12) with probability 1 δ.

Lei and Ying

We now prove (5.13). It is clear

p(1 p) E[x |y = 1] E[x |y = 1] E[x |y = 1] E[x |y = 1] pt(1 pt)(vt ut)(vt ut)

= p(1 p) E[x |y = 1] E[x |y = 1] E[x |y = 1] E[x |y = 1] vt ut

+ p(1 p) E[x |y = 1] E[x |y = 1] vt ut vt ut

+ p(1 p) pt(1 pt) vt ut vt ut ,

from which and (5.6) it follows that p(1 p) E[x |y = 1] E[x |y = 1] E[x |y = 1] E[x |y = 1] pt(1 pt)(vt ut)(vt ut) op

4p(1 p)κ E[x |y = 1] E[x |y = 1] vt ut 2 + 4κ2|p pt||p + pt 1|.

Furthermore, there holds that

p(1 p) E[x |y = 1] E[x |y = 1] vt ut 2

p(1 p) E[x |y = 1] E[x|y=1] pt(1 pt)(vt ut) 2 + pt(1 pt) p(1 p) vt ut 2

p(1 p) E[x |y = 1] E[x |y = 1] pt(1 pt)(vt ut) 2 + 2κ|p pt||p + pt 1|.

Combining the above two inequalities and (5.7), (5.12) together then imply the stated inequality (5.13) with probability 1 δ. The proof is complete.

Proof of Lemma 14 It follows from (2.1) that

e F (w; z) = 2(1 p) x E[x |y = 1] x E[x |y = 1] w I[y=1]+

2p(x E[x |y = 1])(x E[x |y = 1]) w I[y= 1] + 2p(1 p) E[x |y = 1] E[x |y = 1]

+ 2p(1 p) E[x |y = 1] E[x |y = 1] E[x |y = 1] E[x |y = 1] w. (5.16)

This together with (2.10) shows that e F (wt; zt) ˆF t(wt; zt) 2 2 p(1 p) E[x |y = 1] E[x |y = 1] pt(1 pt)(vt ut) 2

+ 2 (1 p) x E[x |y = 1] x E[x |y = 1] (1 pt)(x ut)(x ut) op wt 2I[yt=1]

+ 2 p x E[x |y = 1] x E[x |y = 1] pt(x vt)(x vt) op wt 2I[yt= 1]

+ 2 p(1 p) E[x |y = 1] E[x|y=1] E[x |y = 1] E[x|y=1] pt(1 pt)(vt ut)(vt ut) op wt 2

:= I + II + III + IV.

We can apply (5.12), (5.10), (5.11), and (5.13) to develop upper bounds of I, II, III and IV with probability at least 1 δ, respectively. Also, note that the high probability is w.r.t.

Stochastic Proximal AUC Maximization

z1, . . . , zt 1, which is independent of zt. We can plug these high-probability bounds back to the above inequality, and get the stated bound. The proof is complete.

5.3 Boundedness of Iterates

In this subsection, we prove Lemma 13 on the almost boundedness of iterates. To this aim, we ﬁrst establish a recursive inequality showing how wt+1 w 1 2 2 can be controlled by wk w 1 2 2 for k = 1, . . . , t. Our basic idea is to control wt+1 w 1 2 2 by

k=1 ηk(φ(w 1) φ(wk)) +

k=1 ηk w 1 wk, ˆF k(wk; zk) e F (wk; zk) +

k=1 ξk , (5.17)

where {ξk}k is a martingale diﬀerence sequence deﬁned in (5.23). We apply Lemma 14 to control Pt k=1 ηk w 1 wk, ˆF k(wk; zk) e F (wk; zk) , and apply Part (b) of Lemma 19 to show with high probability that Pt k=1 ξk Pt k=1 ηk(φ(wk) φ(w 1))+ e C Pt k=1 η2 k wk w 1 2 2 for a constant e C > 0. The key observation is that the partial variance Pt k=1 ηk(φ(wk) φ(w 1)) can be cancelled out by the term Pt k=1 ηk(φ(w 1) φ(wk)) in (5.17).

Proposition 16 Let {wt}t be produced by (2.11) with ηt (2C1) 1 and ηt+1 ηt. We suppose Assumption 1 holds,

C5 = sup k ηk

j=1 ηj < , C6 = η1 sup z e F(w 1, z)+2p(1 p) 7κ2C4C5+η1 1+2κ w 1 2+2κ2 w 1 2 2 .

Then for any δ (0, 1) and ρ = min{1, (2C1) 1(η1 w 1 2 2 + C4C5) 1C6}, the following inequality holds with probability at least 1 δ simultaneously for all t = 1, . . . , T

wt+1 w 1 2 2 w 1 2 2 +

2Ck,δηk( wk w 1 2 2 + 1)

k + φ(w 1) C4C5

k=1 η2 k wk w 1 2 2

+ 2C6 log(2T/δ)

ρ + 2(C1C4 + A2)

where we introduce Cp = 8 max{p 1, (1 p) 1} + 24 and

Ck,δ = 2κ2 2 + p

2 log(12k2/δ) max Cp + 1, 4 1(Cp w 1 2 + 3)2 .

Proof Taking w = w 1 in (5.2) gives

wt+1 w 1 2 2 wt w 1 2 2 2ηt w 1 wt, ˆF t(wt, zt) + 2ηt(Ω(w 1) Ω(wt))

+ 2η2 t C1 ˆFt(wt; zt) + C1Ω(wt) + A2 .

Lei and Ying

Taking a summation of the above inequality gives (w1 = 0)

wt+1 w 1 2 2 w 1 2 2 =

wk+1 w 1 2 2 wk w 1 2 2 2

k=1 ηk w 1 wk, ˆF k(wk; zk)

k=1 ηk(Ω(w 1) Ω(wk)) + 2

k=1 η2 k C1 ˆFk(wk; zk) + C1Ω(wk) + A2

k=1 ηk w 1 wk, ˆF k(wk; zk) + 2

k=1 ηk(Ω(w 1) Ω(wk)) + 2(C1C4 + A2)

where the last inequality is due to (5.4). We consider the following decomposition

k=1 ηk w 1 wk, ˆF k(wk; zk) =

k=1 ηk w 1 wk, ˆF k(wk; zk) e F (wk; zk)

k=1 ηk w 1 wk, e F (wk; zk) f(wk) +

k=1 ηk w 1 wk, f(wk) . (5.19)

For any k N, by Lemma 14 the following inequality holds with probability at least 1 δ/(4k2)

e F (wk; zk) ˆF k(wk; zk) 2 2κ2 2 + p

2 log(12k2/δ)

Cp wk 2 + 3 ,

which together with union bounds and P k=1 k 2 2 gives the following inequality with probability 1 δ/2 simultaneously for all k = 1, . . . ,

e F (wk; zk) ˆF k(wk; zk) 2 2κ2 2 + p

2 log(12k2/δ)

Cp wk w 1 2 +Cp w 1 2 +3 .

It then follows that the following inequality holds with probability at least 1 δ/2 simultaneously for all t = 1, . . . ,

k=1 ηk w 1 wk, ˆF k(wk; zk) e F (w; zk)

k=1 ηk wk w 1 2 ˆF k(wk; zk) e F (wk; zk) 2

k=1 ηk 2 + p

2 log(12k2/δ) Cp wk w 1 2 2 + (Cp w 1 2 + 3) wk w 1 2

k=1 ηk Ck,δ( wk w 1 2 2 + 1)/

where in the last step we have used the Schwartz s inequality 3 + Cp w 1 2 wk w 1 2 wk w 1 2 2 + (3 + Cp w 1 2)2/4.

Stochastic Proximal AUC Maximization

It follows from the convexity of f that

k=1 ηk w 1 wk, f(wk)

k=1 ηk f(w 1) f(wk) . (5.22)

We now control the last second term of (5.19) with an application of a concentration inequality for a martingale diﬀerence sequence. Introduce a sequence of random variables

ξk := ηk w 1 wk, e F (wk; zk) f(wk) , k N. (5.23)

It follows from Proposition 1 that Ezk[ξk] = 0 and therefore {ξk}k is a martingale diﬀerence sequence. Analogous to Lemma 11, we can show

e F (wk; zk) 2 2 16κ2 e F(wk, zk). (5.24)

Since E[(ξ E[ξ])2] E[ξ2] for any real-valued random variable ξ, it then follows that

Ezk h w 1 wk, e F (wk; zk) f(wk) 2i Ezk h w 1 wk, e F (wk; zk) 2i

wk w 1 2 2Ezk h e F (wk; zk) 2 2

i wk w 1 2 2Ezk C1 e F(wk, zk) = C1f(wk) wk w 1 2 2,

where we have used the deﬁnition of C1 and Proposition 1. It then follows that

k=1 Ezk ξk Ezk[ξk] 2 =

k=1 η2 k Ezk h w 1 wk, e F (wk; zk) f(wk) 2i

k=1 η2 k wk w 1 2 2 C1φ(wk) C1φ(w 1) +

k=1 η2 k wk w 1 2 2C1φ(w 1).

By (5.3), C5 = supk ηk Pk 1 j=1 ηj < and f(w) φ(w), we know

η2 k wk w 1 2 2 2η2 k( wk 2 2+ w 1 2 2) 2ηk ηk w 1 2 2+C4ηk

j=1 ηj 2ηk η1 w 1 2 2+C4C5 .

Combining the above two inequalities together, we derive

k=1 Ezk ξk Ezk[ξk] 2

2C1 η1 w 1 2 2 + C4C5 t X

k=1 ηk φ(wk) φ(w 1)) + C1φ(w 1)

k=1 η2 k wk w 1 2 2. (5.25)

Lei and Ying

According to the convexity of e F established in Proposition 1, we know

ξk Ezk[ξk] = ηk w 1 wk, e F (wk; zk) + ηk wk w 1, f(wk)

ηk e F(w 1; zk) e F(wk; zk) + ηk wk 2 + w 1 2 4p(1 p)κ + 8p(1 p)κ2 wk 2

ηk e F(w 1; zk) + 4ηkp(1 p) κ wk 2 + κ w 1 2 + 2κ2 wk 2 2 + 2κ2 wk 2 w 1 2

ηk e F(w 1; zk) + 2ηkp(1 p) κ2 wk 2 2 + 1 + 2κ w 1 2 + 4κ2 wk 2 2 + 2κ2 wk 2 2 + 2κ2 w 1 2 2

ηk e F(w 1; zk) + 2p(1 p) 7κ2ηk C4

j=1 ηj + ηk 1 + 2κ w 1 2 + 2κ2 w 1 2 2 C6,

where we have used the following inequality in the second inequality ( f(w) = 2p(1 p)E (1 w (x x ))(x x )|y = 1, y = 1 )

f(w) 2 4p(1 p)κ + 8p(1 p)κ2 w 2, w Rd, (5.26)

(5.3) and C5 = supk ηk Pk 1 j=1 ηj < in the last inequality. The above bounds on magnitudes and variances of ξk together with Part (b) of Lemma 19 (see the Appendix) imply the following inequality with probability 1 δ/2

2C1(η1 w 1 2 2+C4C5)

k=1 ηk φ(wk) φ(w 1) +C1φ(w 1)

k=1 η2 k wk w 1 2 2

+ C6 log(2/δ)

k=1 ηk φ(wk) φ(w 1) + φ(w 1) 2C4C5

k=1 η2 k wk w 1 2 2 + C6 log(2/δ)

where we have used the inequality 2C1ρ(η1 w 1 2 2 + C4C5) C6. Plugging (5.21), (5.22) and (5.27) into (5.19) gives the following inequality with probability 1 δ

k=1 ηk w 1 wk, ˆF k(wk; zk)

k=1 Ck,δηk( wk w 1 2 2 + 1)/

k=1 ηk f(w 1) f(wk)

k=1 ηk φ(wk) φ(w 1) + φ(w 1) 2C4C5

k=1 η2 k wk w 1 2 2 + C6 log(2/δ)

This together with (5.18) shows the following inequality with probability 1 δ

wt+1 w 1 2 2 w 1 2 2 +

2Ck,δηk( wk w 1 2 2 + 1)

k + φ(w 1) C4C5

k=1 η2 k wk w 1 2 2

+ 2C6 log(2/δ)

ρ + 2(C1C4 + A2)

Note (5.20) holds simultaneously for all k = 1, . . . , . To derive the stated inequality for all t = 1, . . . , T, one needs to derive (5.27) simultaneously for all k = 1, . . . , T. This can be

Stochastic Proximal AUC Maximization

done by replacing log(2/δ) in (5.27) with log(2T/δ). The proof is complete.

According to the assumption P k=1 η2 k < and P k=1 ηk log k/

k < , Proposition 16 essentially implies that

max 1 k t wk w 1 2 2 1

2 max 1 k t wk w 1 2 2 + e C log 1

for a e C > 0, from which we can derive an almost boundedness of {wt}t. We will rigorously show this in the following proof. Proof of Theorem 3 Introduce the set

ΩT = (z1, . . . , z T ) : wt+1 w 1 2 2 w 1 2 2 +

2Ck,δηk( wk w 1 2 2 + 1)

φ(w 1) C4C5

k=1 η2 k wk w 1 2 2 + 2C6 log(2T/δ)

ρ + 2(C1C4 + A2)

k=1 η2 k for all t = 1, . . . , T ,

where ρ is deﬁned in Proposition 16. Proposition 16 shows that Pr(ΩT ) 1 δ. Since P t=1 ηt log t/

t < and P t=1 η2 t < , we can ﬁnd a t2 N such that

k < 1/4 and

k=t2+1 η2 k < C4C5 4φ(w 1). (5.28)

Conditioned on the event ΩT , we derive the following inequality for all t = 1, . . . , T

wt+1 w 1 2 2 w 1 2 2

2Ck,δηk wk w 1 2 2

k + max 1 t T w t w 1 2 2

k + φ(w 1) C4C5

k=1 η2 k wk w 1 2 2

+ φ(w 1) max1 t T w t w 1 2 2 C4C5

k=t2+1 η2 k +

k + 2C6 log(2T/δ)

ρ + 2(C1C4 + A2)

2Ck,δηk wk w 1 2 2

4 max 1 t T w t w 1 2 2 + φ(w 1) C4C5

k=1 η2 k wk w 1 2 2

4 max 1 t T w t w 1 2 2 +

k + 2C6 log(2T/δ)

ρ + 2(C1C4 + A2)

It then follows the following inequality under the event ΩT

max 1 t T w t w 1 2 2 w 1 2 2 +

2Ck,δηk wk w 1 2 2

2 max 1 t T w t w 1 2 2

+ φ(w 1) C4C5

k=1 η2 k wk w 1 2 2 +

k + 2C6 log(2T/δ)

ρ + 2(C1C4 + A2)

Lei and Ying

from which we derive the stated inequality with probability 1 δ (notice wk w 1 2 2 2( w 1 2 2 + C4 Pk 1 j=1 ηj))

C2 = 2 w 1 2 2 +

8C7ηk w 1 2 2 + C4 Pk 1 j=1 ηj

4φ(w 1) C4C5

k=1 η2 k w 1 2 2 + C4

ρ + 4(C1C4 + A2)

where we introduce (notice Ck,δ C7 p

C7 = 2κ2 2 + p

2 log 12 + 4 max n Cp + 1, 4 1(CP w 1 2 + 3)2o .

The proof is complete.

5.4 Proofs for General Convergence Rates

In this subsection, we prove Theorem 4 on the probabilistic convergence rates by taking a deduction analogous to the proof of Proposition 16. The diﬀerence is to apply Part (a) of Lemma 19 together with the bound of wt 2 established in Theorem 3 to control Pt k=1 ξk in (5.23). Proof of Theorem 4 According to Lemma 14 followed with union bounds, we know the existence of Ω(1) T with Pr(Ω(1) T ) 1 δ/3 such that the following inequality holds with probability 1 δ/3 simultaneously for all t = 1, . . . , T conditioned on Ω(1) T

e F (wt; zt) ˆF t(wt; zt) 2 2κ2 2 + p

2 log(9T/δ)

Cp wt w 1 2 + 3 + Cp w 1 2 .

It then follows the following inequality conditioned on Ω(1) T

t=1 ηt w 1 wt, ˆF t(wt; zt) e F (wt; zt) I[ wt w 1 2 2 C2 log(6T/δ)]

t=1 ηt w 1 wt 2 ˆF t(wt; zt) e F (wt; zt) 2I[ wt w 1 2 2 C2 log(6T/δ)] e CT,δ

where we introduce

e CT,δ = 2κ2p

2 log(9T/δ) Cp p

C2 + 3 + Cp w 1 2 log(6T/δ).

Introduce a sequence of random variables

ξ t = ηt w 1 wt, e F (wt; zt) f(wt) I[ wt w 1 2 2 C2 log(6T/δ)], t = 1, . . . , T.

Stochastic Proximal AUC Maximization

According to Schwartz s inequality, we derive

|ξ t| ηt h w 1 wt 2 2 + 4 1 e F (wt; zt) f(wt) 2 2 i I[ wt w 1 2 2 C2 log 6T

ηt h wt w 1 2 2 + 2 1 e F (wt; zt) 2 2 + 2 1 f(wt) 2 2 i I[ wt w 1 2 2 C2 log 6T

According to (5.16) and (5.26), it is clear that

max f(w) 2, e F (w; z) 2 8κ2 w 2 + κ (5.30)

8κ2 w w 1 2 + 8κ2 w 1 2 + κ.

Therefore, there holds

|ξ t| C8ηt log(6T/δ), where C8 = C2 + 2(8κ2 w 1 2 + κ)2 + 128κ4C2.

It is clear that {ξ t} is a martingale diﬀerence sequence and therefore we can apply Part (a) of Lemma 19 in the Appendix A to show the existence of Ω(2) T with Pr(Ω(2) T ) 1 δ/3 such that the following inequality holds conditioned on Ω(2) T

t=1 η2 t log 3

Theorem 3 implies the existence of Ω(3) T with Pr(Ω(3) T ) 1 δ/3 such that max1 t T w t w 1 2 2 C2 log(6T/δ). According to (5.19), (5.22), (5.29) and (5.31), it is clear that the following inequality holds under the event Ω(1) T Ω(2) T Ω(3) T (note ξ t = ηt w 1 wt, e F (wt; zt) f(wt) in this case)

t=1 ηt w 1 wt, ˆF t(wt; zt) e CT,δ

t+C8 log 6T

t=1 η2 t log 3

t=1 ηt f(w 1) f(wt) .

Plugging the above inequality back into (5.18) and noting Pr Ω(1) T Ω(2) T Ω(3) T 1 δ, we derive the following inequality with probability at least 1 δ

w T+1 w 1 2 2 w 1 2 2 2

t=1 ηt φ(w 1) φ(wt) + 2(C1C4 + A2)

t + 2C8 log 6T

t=1 η2 t log 3

This combined with the convexity of φ establishes the stated inequality with probability 1 δ. The proof is complete.

Proof of Corollary 5 We ﬁrst prove Part (a). It is clear that the step sizes satisfy (3.2) and therefore Theorem 4 holds. Part (a) then follows from the standard inequality

Lei and Ying

PT t=1 t θ (1 θ) 1(T 1 θ 1), θ (0, 1). We now turn to Part (b). It is clear that P t=1 ηt log 1 2 t/

t η1 P t=1 log 1 β

2 (et)/t < and P t=1 η2 t < . Part (b) then follows

from the inequality PT t=1 t logβ(et) 1

2 (e T). The proof is complete.

5.5 Discussion of the Proof

We follow the arguments in Lei and Tang (2018) in our proofs. In this subsection, we give details on the similarity and diﬀerence between our proofs and those in Lei and Tang (2018). Both arguments ﬁrst build a crude estimate wt 2 2 = O Pt k=1 ηk and then reﬁne it to an almost boundedness of iterates with high probability by martingale analysis. As in Lei and Tang (2018), we use the self-bounding property of loss functions to control a weighted summation of loss functions (5.4), which removes bounded gradient assumptions imposed in the literature. As in Lei and Tang (2018), we use the Bernstein inequality to control the martingale diﬀerence sequence {ξk} deﬁned in (5.23), and show that the conditional variance of this martingale can be oﬀset by some negative terms in the one-step progress inequality.

A key diﬀerence is that we use approximately biased stochastic gradients in our algorithm, while the discussions (Lei and Tang, 2018) consider stochastic optimization with unbiased gradient estimates. Speciﬁcally, the approximate unbiased stochastic gradient is caused by replacing p = P(y = 1), E[x|y = 1] and E[x|y = 1] by its empirical counterparts at the present time t. To overcome this hindrance, we build high-probability bounds on e F (wt; zt) ˆF t(wt; zt) 2 (Lemma 14) by developing concentration inequalities for approximating conditional expectations and variances by their empirical counterparts (Lemma 15). We show that this approximate unbias does not aﬀect the almost boundedness of iterates provided that the step size sequence satisﬁes an additional assumption P t=1 ηt log t/t < . Another notable diﬀerence is that we show that the fast convergence rates can be derived for objective functions satisfying the quadratic functional growth, while the discussions in Lei and Tang (2018) require a stronger assumption on the strong convexity of objective functions.

As compared to the technical analysis, our main novelty is the development of a new stochastic optimization algorithm for AUC maximization. Previous reformulation of AUC maximization as a pointwise problem either introduces additional dual variables (Ying et al., 2016; Liu et al., 2018) or uses a non-convex estimator of stochastic gradients (Natole et al., 2018). The former formulation requires to introduce an explicit boundedness constraint on w (Ying et al., 2016; Liu et al., 2018), while the latter one requires the information of p, conditional expectation and is only guaranteed convergence rates in expectation (Natole et al., 2018). We develop a novel reformulation of AUC maximization as a pointwise problem which leads to a convex estimator of stochastic gradient. This key convexity is critical to handle the case when w Rd, as well as in both the algorithm design and the highprobability theoretical analysis.

Stochastic Proximal AUC Maximization

6. Conclusion

In this paper, we presented a new stochastic gradient descent method for AUC maximization which can accommodate general penalty terms. Our algorithm can update the model parameter upon receiving individual data with favorable O(d) space and per-iteration time complexity, making it amenable for streaming data analysis. We established a highprobability convergence rate e O(1/

T) for the general convex setting, and a fast convergence e O(1/T) for the cases of strongly convex regularizers and no regularization term (without strong convexity). There are several directions for future work. Firstly, we focused on the square loss and it remains unclear to us on how to develop similar algorithms for general loss functions. Secondly, it would be very interesting to develop stochastic optimization algorithms for AUC maximization under nonlinear models. There are two possible approaches for developing nonlinear models for AUC maximization including the kernel trick and and deep neural networks. For the approach using the kernel trick, one could use the techniques of random feature (Rahimi and Recht, 2008) for RBF kernels and then apply the linear model in this paper. One can easily prove a similar saddle point formulation even for non-convex deep neural network, and develop stochastic primal-dual stochastic gradient decent algorithms (Nemirovski et al., 2009) for deep AUC maximization models. However, it is not clear on how to establish theoretical guarantees for the convergence of such algorithms as the objective function is generally non-convex.

Acknowledgments

The authors would like to thank the anonymous reviewers and the editor for their constructive comments and suggestions. This work is supported by National Science Foundation (NSF) grants IIS-1816227 and IIS-2008532.

Appendix A. Lemmas

In this section we provide some useful lemmas. Lemma 17 shows a self-bounding property for smooth and non-negative functions (Nesterov, 2013).

Lemma 17 If h : Rd R is non-negative and β-smooth, i.e., h(w) h( w) 2 β w w 2, then h(w) 2 2 2βh(w) for all w Rd.

Our discussion is also based on some concentration inequalities. Lemma 18 is the Hoeﬀding s inequality for vector-valued random variables (Boucheron et al., 2013).

Lemma 18 (Hoeﬀding s inequality) Let Z1, . . . , Zn be a sequence of i.i.d. random variables taking values in Rd with Zi 2 B for every i. Then, for any 0 < δ < 1, with probability 1 δ we have

Zi E[Zi] 2 B n

2 log 1/δ i .

Lei and Ying

Part (a) of Lemma 19 is the Azuma-Hoeﬀding inequality for martingales with bounded increments (Hoeﬀding, 1963), and part (b) is a conditional Bernstein inequality using the conditional variance to quantify better the concentration behavior of martingales (Zhang, 2005).

Lemma 19 Let z1, . . . , zn be a sequence of random variables such that zk may depend on the previous random variables z1, . . . , zk 1 for all k = 1, . . . , n. Consider a sequence of functionals ξk(z1, . . . , zk), k = 1, . . . , n. Let σ2 n = Pn k=1 Ezk ξk Ezk[ξk] 2 be the conditional variance and δ (0, 1).

(a) Assume that |ξk Ezk[ξk]| bk for each k. With probability at least 1 δ we have

k=1 Ezk[ξk] 2

k=1 b2 k log 1

(b) Assume that ξk Ezk[ξk] b for each k and ρ (0, 1]. With probability at least 1 δ we have n X

k=1 Ezk[ξk] ρσ2 n b + b log 1

δ ρ . (A.2)

Appendix B. Proof of Results in Section 5.1

Proof of Lemma 12 According to the ﬁrst-order optimality condition in (2.11), we get

ηt ˆF t(wt; zt) + ηtΩ (wt+1) + wt+1 wt = 0, (B.1)

from which we derive

wt+1 w 2 2 = wt+1 w, wt+1 wt + wt w

= ηt wt+1 w, ˆF t(wt; zt) + ηt w wt+1, Ω (wt+1) + wt+1 w, wt w . (B.2)

It follows from the deﬁnition of σΩthat

w wt+1, Ω (wt+1) Ω(w) Ω(wt+1) 2 1σΩ w wt+1 2 2 = Ω(w) Ω(wt) + Ω(wt) Ω(wt+1) 2 1σΩ w wt+1 2 2 Ω(w) Ω(wt) + wt wt+1, Ω (wt) 2 1σΩ w wt+1 2 2 + wt wt+1 2 2 . (B.3)

It can be directly checked that

wt+1 w, wt w = 1

w wt 2 2 + w wt+1 2 2 wt wt+1 2 .

Plugging the above identity and (B.3) back into (B.2), we derive

wt+1 w 2 2 ηt w wt+wt wt+1, ˆF t(wt; zt) +ηtΩ(w) ηtΩ(wt)+ηt wt wt+1, Ω (wt)

2 1ηtσΩ w wt+1 2 2 + 1

w wt 2 2 + w wt+1 2 2 wt wt+1 2 . (B.4)

Stochastic Proximal AUC Maximization

According to the Schwartz s inequality, we know

ηt wt wt+1, ˆF t(wt; zt) +ηt wt wt+1, Ω (wt) 1

2 wt wt+1 2 2+η2 t ˆF t(wt; zt) 2 2+η2 t Ω (wt) 2 2.

Plugging the above inequality back into (B.4) gives

wt+1 w 2 2 w wt 2 2 2ηt w wt, ˆF t(wt; zt) + 2ηt Ω(w) Ω(wt)

ηtσΩ w wt+1 2 2 + 2η2 t ˆF t(wt; zt) 2 2 + 2η2 t Ω (wt) 2 2. (B.5)

The stated bound then follows from Lemma 11, Assumption 1 and the deﬁnition of C1. The proof is complete.

Proof of Corollary 13 Eq. (5.2) together with the convexity of ˆFt established in Lemma 11 implies

wt+1 w 2 2 wt w 2 2 2ηt ˆFt(w; zt) ˆFt(wt; zt) + 2ηt Ω(w) Ω(wt)

ηtσΩ w wt+1 2 2 + 2η2 t C1 ˆFt(wt; zt) + C1Ω(wt) + A2 . (B.6)

Taking w = 0 in (B.6) and using ˆFt(0; zt) = pt(1 pt), Ω(0) = 0, we get

wt+1 2 2 wt 2 2 2ηt ˆFt(0; zt) ˆFt(wt; zt) + 2ηt Ω(0) Ω(wt) + 2η2 t C1 ˆFt(wt; zt) + C1Ω(wt) + A2

2ηt(C1ηt 1) ˆFt(wt; zt) + Ω(wt) + ηt/2 + 2η2 t A2 (B.7)

ηt/2 + C 1 1 A2ηt,

where the last inequality follows from ˆFt(wt; zt) + Ω(wt) 0 due to Lemma 11 and the assumption 0 ηt (2C1) 1. Taking a summation of the above inequality then shows

wt+1 2 2 C 1 1 A2 + 2 1 t X

This establishes (5.3). Plugging the assumption ηt (2C1) 1 into (B.7) gives

ηt ˆFt(wt; zt) + Ω(wt) wt 2 2 wt+1 2 2 + ηt/2 + C 1 1 A2ηt.

Multiplying both sides by ηt, we derive

η2 t ˆFt(wt; zt) + Ω(wt) ηt wt 2 2 ηt wt+1 2 2 + η2 t /2 + η2 t C 1 1 A2 ηt wt 2 2 ηt+1 wt+1 2 2 + η2 t /2 + η2 t C 1 1 A2,

where we have used the assumption ηt+1 ηt. Taking a summation of the above inequality further yields t X

k=1 η2 k ˆFk(wk; zk) + Ω(wk) C 1 1 A2 + 2 1 t X

Lei and Ying

We now turn to (5.5). Plugging the assumption ηt (2C1) 1 into (B.7) and multiplying both sides by η 1 t , we derive

ˆF (wt; zt) + Ω(wt) η 1 t wt 2 2 wt+1 2 2 + 2 1 + C 1 1 A2.

Taking a summation of the above inequality implies

ˆFk(wk; zk) + Ω(wk) t C4 +

k=1 η 1 k wk 2 2 wk+1 2 2

k=2 wk 2 2(η 1 k η 1 k 1) + η 1 1 w1 2 2

t C4 + max 1 k t w k 2 2

k=2 (η 1 k η 1 k 1)

t C4 + C4η 1 t

where the last inequality is due to (5.3). The proof is complete.

Appendix C. Proofs for Fast Convergence Rates

In this subsection, we prove Theorem 7 on convergence rates for φ with a quadratic functional growth. To this aim, we need to introduce some lemmas. The following lemma provides probabilistic bounds for approximating e F (wk; zk) with ˆF k(wk; zk) for {wk} produced by (2.11) with speciﬁc step sizes.

Lemma 20 Suppose Assumption 1 holds. Let {wt}t be the sequence produced by (2.11) with ηt = 2 σφt+2σf+σφt1 , where t1 4C1σ 1 φ . Then, for any k T the following inequality holds with probability 1 δ e F (wk; zk) ˆF k(wk; zk) 2 Cδ p

where Cδ := 2κ2 2 + p

2 log(3/δ) 32 q

2C4σ 1 φ + 3 .

Proof Since t1 4C1σ 1 φ we know ηt (2C1) 1 and therefore Corollary 13 holds. It follows from the deﬁnition of ηt that

k=1 ηk 2σ 1 φ

k=1 (k + t1) 1 2σ 1 φ log(et). (C.2)

This together with (5.3) shows

wt 2 2 2C4σ 1 φ log(et). (C.3)

Stochastic Proximal AUC Maximization

For all k = 1, . . . , T, we can then apply Lemma 14 to derive the following inequality with probability 1 δ

e F (wk; zk) ˆF k(wk; zk) 2 2κ2 2 + p

2 log(3/δ) 32 q

2C4σ 1 φ + 3 p

The proof is complete with the introduction of Cδ.

The following lemma plays a fundamental role in our analysis. It shows that both wt w t 2 2 and a weighted summation of φ(wk) φ(w k) can be controlled by a summation of martingale diﬀerence sequences. It is established by taking a weighted summation of the one-step progress inequality (5.2).

Lemma 21 Suppose Assumption 1 and Assumption 2 hold. Let {wt}t be the sequence produced by (2.11) with ηt = 2 σφt+2σf+σφt1 with t1 4C1σ 1 φ . Let δ (0, 1) and C9 = 16(C1C4+A2). Then the following inequality holds with probability 1 δ for all t = 1, 2, . . . , T

Pt k=1(k + t1 + 1)(φ(wk) φ(w k)) (t + t1 + 1)(t + t1 + 2)σφ + wt+1 w t+1 2 2 (t1 + 1)(t1 + 2) w1 w 1 2 2 (t + t1 + 1)(t + t1 + 2)

+ 4 Pt k=1(k + t1 + 1)ξk (t + t1 + 1)(t + t1 + 2)σφ + 2 log2(e T)(2C2 δ/T + C9)

(t + t1 + 2)σ2 φ . (C.4)

Proof It follows from (5.2) that

wk+1 w 2 2 w wk 2 2 2ηk w wk, ˆF k(wk; zk) e F (wk; zk) +

+ 2ηk w wk, e F (wk; zk) f(wk) + 2ηk w wk, f(wk) + 2ηk Ω(w) Ω(wk)

ηkσΩ w wk+1 2 2 + 2η2 k C1 ˆFk(wk; zk) + C1Ω(wk) + A2 .

Taking w = w k in the above inequality and introducing the sequence of random variables {ξk}k as ξk = w k wk, e F (wk; zk) f(wk) , k = 1, 2, . . . , (C.5)

(1+ηkσΩ) wk+1 w k 2 2 wk w k 2 2+2 1ηkσφ w k wk 2 2+2ηkσ 1 φ ˆF k(wk, zk) e F (wk; zk) 2 2 +2ηkξk+2 1ηk[φ(w k) φ(wk)] 3ηkσφ wk w k 2 2/2+2η2 k C1 ˆFk(wk; zk)+C1Ω(wk)+A2 ,

where we have used Schwartz s inequality

2 w k wk, ˆF k(wk; zk) e F (wk; zk) σφ

2 w k wk 2 2 + 2

ˆF k(wk; zk) e F (wk; zk) 2 2

and the following inequality due to Assumption 2

2 w k wk, f(wk) + 2 Ω(w k) Ω(wk) 1

φ(w k) φ(wk)

φ(w k) φ(wk) 3

2σφ w k wk 2 2.

Lei and Ying

It then follows from wk+1 w k+1 2 wk+1 w k 2 that

ηk(φ(wk) φ(w k)) 2(1 + ηkσΩ) + wk+1 w k+1 2 1 ηkσφ

1 + ηkσΩ wk w k 2 2+

2ηk ˆF k(wk; zk) e F (wk; zk) 2 2 (1 + ηkσΩ)σφ + 2ηkξk 1 + ηkσΩ + 2η2 k C1 ˆFk(wk; zk) + C1Ω(wk) + A2

1 + ηkσΩ . (C.6)

According to the step size choice ηk = 2 σφk+2σf+σφt1 and σφ = σf + σΩwe know

1 σφηk 1 + σΩηk 1 σfηk

1 + σΩηk = k + t1 k + t1 + 2 and ηk 1 + σΩηk = 2 σφ(k + t1 + 2).

According to Lemma 20, we derive the following inequality with probability at least 1 δ simultaneously for all k = 1, . . . , T e F (wk; zk) ˆF k(wk; zk) 2 Cδ/T p

Plugging the above two inequalities back into (C.6), we get the following inequality with probability 1 δ for all k = 1, . . . , T

φ(wk) φ(w k) σφ(k + t1 + 2) + wk+1 w k+1 2 2 (k + t1) wk w k 2 2 k + t1 + 2 +

4C2 δ/T log(e T)

σ2 φk(k + t1 + 2) + 4ξk σφ(k + t1 + 2) + 4ηk C1 ˆFk(wk; zk) + C1Ω(wk) + A2

σφ(k + t1 + 2) .

Multiplying both sides with (k + t1 + 2)(k + t1 + 1) implies the following inequality with probability 1 δ for all k = 1, . . . , T

(k + t1 + 1)(φ(wk) φ(w k)) σφ + (k + t1 + 1)(k + t1 + 2) wk+1 w k+1 2 2

(k + t1)(k + t1 + 1) wk w k 2 2 + 4C2 δ/T log(e T)(k + t1 + 1)

σ2 φk + 4(k + t1 + 1)ξk

+ 4ηk(k + t1 + 1) C1 ˆFk(wk; zk) + C1Ω(wk) + A2

Taking a summation of the above inequality from k = 1 to t shows the following inequality with probability 1 δ for all t = 1, . . . , T

k=1 (k + t1 + 1)(φ(wk) φ(w k)) + (t + t1 + 1)(t + t1 + 2) wt+1 w t+1 2 2

(t1 + 1)(t1 + 2) w1 w 1 2 2 + 4C2 δ/T log(e T)

k=1 (k + t1 + 1)ξk + 16σ 2 φ

C1 ˆFk(wk; zk) + C1Ω(wk) + A2 , (C.7)

Stochastic Proximal AUC Maximization

where we have used ηk 4/((k + t1 + 1)σφ). Since t1 4C1σ 1 φ we know ηt (2C1) 1 and therefore Corollary 13 holds. According to (C.2) and η 1 t 2 1σφ(t + t1 + 2), we know

k=1 ηk η 1 t 2σ 1 φ log(et) 2 1σφ(t + t1 + 2) = (t + t1 + 2) log(et).

This together with (5.5) implies that

C1 ˆFk(wk; zk) + C1Ω(wk) + A2

(C1C4 + A2)t + C1C4 t X

k=1 ηk η 1 t (C1C4 + A2) log(e T)(2t + t1 + 2).

Plugging the above inequality into (C.7) and using Pt k=1 k 1 log(e T) give the following inequality with probability 1 δ

k=1 (k + t1 + 1)(φ(wk) φ(w k)) + (t + t1 + 1)(t + t1 + 2) wt+1 w t+1 2 2

(t1 + 1)(t1 + 2) w1 w 1 2 2 + 4C2 δ/T log(e T)

t + (t1 + 1) log(e T)

k=1 (k + t1 + 1)ξk + C9σ 2 φ log(e T)(2t + t1 + 2).

We can get the stated bound by dividing both sides by (t + t1 + 1)(t + t1 + 2) and noting that

4C2 δ/T log(e T) t+(t1+1) log(e T) +C9 log(e T)(2t+t1+2) 2(t+t1+1) log2(e T) 2C2 δ/T +C9 .

The proof is complete.

To tackle the martingale diﬀerence sequence {ξk}k in (C.4), we need to control the magnitudes and variances which are established in the following lemma.

Lemma 22 Let Assumption 1 and Assumption 2 hold. Let {wt}t be the sequence produced by (2.11) with ηt = 2 σφt+2σf+σφt1 , where t1 4C1σ 1 φ . Let {ξk}t k=1 be deﬁned by (C.5). Then for all k T we have

|ξk| C10 log(e T) and Ezk ξk Ezk[ξk] 2 C1φ(wk) w k wk 2 2,

where C10 = 34κ2C4σ 1 φ + 2κ w 1 2 + (8κ w 1 2 + 1)2.

Lei and Ying

Proof It follows from the inequality wk w k 2 wk w 1 2 and (5.30) that

w k wk, e F (wk; zk) f(wk) w 1 wk 2 e F (wk; zk) 2 + f(wk) 2

2 w 1 2 + wk 2 8κ2 wk 2 + κ

= 16κ2 wk 2 2 + 2κ w 1 2 + 2 wk 2(8κ2 w 1 2 + κ)

17κ2 wk 2 2 + 2κ w 1 2 + (8κ w 1 2 + 1)2

34κ2C4σ 1 φ log(ek) + 2κ w 1 2 + (8κ w 1 2 + 1)2 C10 log(e T),

where we have used (C.3). It is clear from Proposition 1 that Ezk[ξk] = 0 and therefore it follows from E[(ξ E[ξ])2] E[ξ2] for any real-valued random variables ξ that

Ezk ξk Ezk[ξk] 2 = Ezk[ξ2 k] Ezk w k wk, e F (wk; zk) 2

w k wk 2 2Ezk[ e F (wk; zk) 2 2] w k wk 2 2C1f(wk)

C1φ(wk) wk w k 2 2.

where we have used Ezk[ e F (wk; zk) 2 2] C1Ezk[ e F(wk; zk)] = C1f(wk) which can be shown analogously to the proof of Lemma 11. The proof is complete.

We are now ready to prove Theorem 7. Our key idea is to apply Part (b) of Lemma 19 in the Appendix to show that Pt k=1(k + t1 + 1)ξk can be controlled by Pt k=1 φ(wk) φ(w k) (k + t1 + 1), which can be oﬀset by the ﬁrst term of (C.4). Then we can apply the induction strategy to derive the stated bound. Proof of Theorem 7 Since t1 32C1σ 1 φ log 2T

δ and T 2, we know t1 4C1σ 1 φ and therefore Lemmas 20, 21, 22 hold. According to Lemma 21, there exists a set Ω(1) T = {(z1, . . . , z T )} with Pr(Ω(1) T ) 1 δ/2 such that for all (z1, . . . , z T ) Ω(1) T we have

Pt k=1(k + t1 + 1)(φ(wk) φ(w k)) (t + t1 + 1)(t + t1 + 2)σφ + wt+1 w t+1 2 2 (t1 + 1)(t1 + 2) w 1 2 2 (t + t1 + 1)(t + t1 + 2)+

+ 4 Pt k=1(k + t1 + 1)ξk (t + t1 + 1)(t + t1 + 2)σφ + 2 log2(e T)(2C2 δ/(2T) + C9)

(t + t1 + 2)σ2 φ . (C.8)

According to Lemma 22, we know the following inequalities for k = 1, . . . , t

|(k + t1 + 1)ξk| C10(t + t1 + 1) log(e T)

Ezk (k + t1 + 1)ξk Ezk[(k + t1 + 1)ξk] 2 (k + t1 + 1)2C1φ(wk) w k wk 2 2.

Let ρ (0, 1] to be ﬁxed later. It then follows from Part (b) of Lemma 19 the following inequality with probability 1 δ/(2T)

k=1 (k+t1+1)ξk C1ρ Pt k=1 φ(wk)(k + t1 + 1)2 w k wk 2 2 C10(t + t1 + 1) log(e T) +C10(t + t1 + 1) log(e T) log 2T

Stochastic Proximal AUC Maximization

By the union bounds of probability, we know the existence of Ω(2) T = {(z1, . . . , z T )} with probability Pr(Ω(2) T ) 1 δ/2 such that (C.9) holds under the event Ω(2) T simultaneously for all t = 1, . . . , T. In the remainder of the proof, we always assume that Ω(1) T Ω(2) T holds (with probability 1 δ), and show by induction that w t+1 w t+1 2 2 CT,δ/( t + t1 + 2)

for all t = 0, 1, . . . , T 1 conditioned on Ω(1) T Ω(2) T , where we introduce

CT,δ = max n 2(t1 + 1) w 1 2 2 + 3t1φ(w )

2σφ + 4 log2(e T)(2C2 δ/(2T) + C9)

σ2 φ , C10t1 log(e T)

and ρ = C10t1 log(e T)

4C1CT,δ . It is clear that ρ 1. The case with t = 0 is clear from the deﬁnition

of CT,δ. We now show wt+1 w t+1 2 2 CT,δ/(t + t1 + 2) under the induction assumption

w t+1 w t+1 2 2 CT,δ/( t + t1 + 2) (C.10)

for t = 0, 1, . . . , t 1. Plugging the induction assumption (C.10) into (C.9) gives (φ(w k) is the same for all k)

k=1 (k + t1 + 1)ξk C1ρCT,δ Pt k=1 φ(wk)(k + t1 + 1) C10(t + t1 + 1) log(e T) + C10(t + t1 + 1) log(e T) log 2T

t1 Pt k=1 φ(wk) φ(w k) (k+t1+1) 4(t+t1+1) + t1φ(w k) Pt k=1(k+t1+1) 4(t+t1+1) + 4C1(t+t1+1)CT,δ log 2T

t1 Pt k=1 φ(wk) φ(w k) (k + t1 + 1) 4(t + t1 + 1) + 3t1φ(w k)(t + t1 + 1)

16 + 4C1(t + t1 + 1)CT,δ log 2T

where the second inequality is due to the deﬁnition of ρ and the last inequality is due to Pt k=1(k + t1 + 1) 3(t+t1+1)2

4 . Plugging the above inequality back into (C.8) yields the following inequality

1 t1 t + t1 + 1

Pt k=1(k + t1 + 1)(φ(wk) φ(w k)) (t + t1 + 1)(t + t1 + 2)σφ + wt+1 w t+1 2 2

(t1 + 1) w 1 2 2 t + t1 + 2 + 3t1φ(w ) 4σφ(t + t1 + 2) + 16C1CT,δ log 2T

δ t1(t + t1 + 2)σφ + 2 log2(e T)(2C2 δ/2T + C9)

(t + t1 + 2)σ2 φ

(t1 + 1) w 1 2 2 t + t1 + 2 + 3t1φ(w ) 4σφ(t + t1 + 2) + CT,δ 2(t + t1 + 2) + 2 log2(e T)(2C2 δ/2T + C9)

(t + t1 + 2)σ2 φ , (C.11)

where the last inequality is due to t1 32C1σ 1 φ log 2T

δ . By the deﬁnition of CT,δ, it is clear

that the right-hand side of (C.11) is less than or equal to CT,δ t+t1+2. Therefore, we ﬁnish the induction process and show (C.10) for t = t.

Lei and Ying

We now prove the second inequality of (3.5). It follows from the convexity of φ and (C.11) that

φ( w(2) t ) φ(w 1) t X

k=1 (k + t1 + 1) 1 t X

k=1 (k + t1 + 1) φ(wk) φ(w )

2σφ(t + t1 + 1)2

t(t + 1)(t + 2t1 + 3)

(t1 + 1) w 1 2 2 + 3t1φ(w )

2 + 2 log2(e T)(2C2 δ/2T + C9)

The second inequality of (3.5) then follows. The proof is complete.

A. Agarwal, M. Wainwright, P. Bartlett, and P. Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1 9, 2009.

S. Agarwal. Surrogate regret bounds for the area under the roc curve via strongly proper losses. In Conference on Learning Theory, pages 338 353, 2013.

S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, and D. Roth. Generalization bounds for the area under the roc curve. Journal of Machine Learning Research, 6(Apr):393 425, 2005.

M. Anitescu. Degenerate nonlinear programming with a quadratic growth condition. SIAM Journal on Optimization, 10(4):1116 1135, 2000.

F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in neural information processing systems, pages 773 781, 2013.

L. Bottou and Y. Cun. Large scale online learning. In Advances in neural information processing systems, 2004.

S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.

A. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern recognition, 30(7):1145 1159, 1997.

R. Caruana, T. Joachims, and L. Backstrom. Kdd-cup 2004: results and analysis. ACM SIGKDD Explorations Newsletter, 6(2):95 108, 2004.

N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge university press, 2006.

C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27, 2011.

Stochastic Proximal AUC Maximization

S. Cl emen con, G. Lugosi, and N. Vayatis. Ranking and empirical minimization of ustatistics. Annals of Statistics, pages 844 874, 2008.

C. Cortes and M. Mohri. Auc optimization vs. error rate minimization. In Advances in neural information processing systems, pages 313 320, 2004.

G. Denevi, C. Ciliberto, R. Grazzi, and M. Pontil. Learning-to-learn stochastic gradient descent with biased regularization. In International Conference on Machine Learning, pages 1566 1575, 2019.

J. Duchi and Y. Singer. Eﬃcient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10(Dec):2899 2934, 2009.

T. Fawcett. An introduction to ROC analysis. Pattern recognition letters, 27(8):861 874, 2006.

W. Gao and Z.-H. Zhou. On the consistency of AUC pairwise optimization. In International Joint Conferences on Artiﬁcial Intelligence, pages 939 945, 2015.

W. Gao, R. Jin, S. Zhu, and Z.-H. Zhou. One-pass AUC optimization. In International Conference on Machine Learning, pages 906 914, 2013.

H. G uvenir and M. Kurtcephe. Ranking instances by maximizing the area under ROC curve. IEEE Transactions on Knowledge and Data Engineering, 25(10):2356 2366, 2013.

J. Hanley and B. Mc Neil. The meaning and use of the area under a receiver operating characteristic (roc) curve. Radiology, 143(1):29 36, 1982.

E. Hazan. Introduction to online convex optimization. Foundations and Trends R in Optimization, 2(3-4):157 325, 2016.

A. Herschtal and B. Raskutti. Optimising area under the ROC curve using gradient descent. In International Conference on Machine Learning, page 49. ACM, 2004.

W. Hoeﬀding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13 30, 1963.

X. Jiang and Y. Zhou. Dissecting android malware: Characterization and evolution. In IEEE Symposium on Security and Privacy, pages 95 109. IEEE, 2012.

T. Joachims. A support vector method for multivariate performance measures. In International Conference on Machine Learning, pages 377 384. ACM, 2005.

P. Kar, B. Sriperumbudur, P. Jain, and H. Karnick. On the generalization ability of online learning algorithms for pairwise loss functions. In International Conference on Machine Learning, pages 441 449, 2013.

W. Kotlowski, K. Dembczynski, and E. Huellermeier. Bipartite ranking through minimization of univariate loss. In International Conference on Machine Learning, pages 1113 1120, 2011.

Lei and Ying

Y. Lei and K. Tang. Stochastic composite mirror descent: Optimal bounds with high probabilities. In Advance in Neural Information Processing Systems, pages 1524 1534, 2018.

M. Liu, X. Zhang, Z. Chen, X. Wang, and T. Yang. Fast stochastic AUC maximization with o(1/n)-convergence rate. In International Conference on Machine Learning, pages 3195 3203, 2018.

M. Liu, Z. Yuan, Y. Ying, and T. Yang. Stochastic auc maximization with deep neural networks. In International Conference on Learning Representations (ICLR), 2020.

A. Maurer and M. Pontil. Estimating weighted areas under the ROC curve. Advances in Neural Information Processing Systems, 33, 2020.

H. Narasimhan and S. Agarwal. Support vector algorithms for optimizing the partial area under the roc curve. Neural Computation, 29(7):1919 1963, 2017.

M. Natole, Y. Ying, and S. Lyu. Stochastic proximal algorithms for AUC maximization. In International Conference on Machine Learning, pages 3707 3716, 2018.

I. Necoara, Y. Nesterov, and F. Glineur. Linear convergence of ﬁrst order methods for non-strongly convex optimization. Mathematical Programming, pages 1 39, 2018.

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574 1609, 2009.

Y. Nesterov. A method of solving a convex programming problem with convergence rate o(1/k2). In Soviet Mathematics Doklady, volume 27, pages 372 376, 1983.

Y. Nesterov. Smoothing technique and its applications in semideﬁnite optimization. Mathematical Programming, 110(2):245 259, 2007.

Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.

F. Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems, pages 1116 1124, 2014.

F. Orabona. A modern introduction to online learning. ar Xiv preprint ar Xiv:1912.13213, 2019.

B. Palaniappan and F. Bach. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems, pages 1416 1424, 2016.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177 1184, 2008.

Stochastic Proximal AUC Maximization

A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In International Conference on Machine Learning, pages 449 456, 2012.

L. Rosasco, S. Villa, and B. V u. Convergence of stochastic proximal gradient algorithm. ar Xiv preprint ar Xiv:1403.5074, 2014.

C. Rudin and R. Schapire. Margin-based ranking and an equivalence between adaboost and rankboost. Journal of Machine Learning Research, 10:2193 2232, 2009.

S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends R in Machine Learning, 4(2):107 194, 2012.

N. Srebro and A. Tewari. Stochastic optimization for machine learning. ICML Tutorial, 2010.

D. Wang, D. Irani, and C. Pu. Evolutionary study of web spam: Webb spam corpus 2011 versus webb spam corpus 2006. In Collaborative Computing: Networking, Applications and Worksharing, pages 40 49. IEEE, 2012a.

M. Wang, J. Liu, and E. Fang. Accelerating stochastic composition optimization. In Advances in Neural Information Processing Systems, pages 1714 1722, 2016.

M. Wang, E. Fang, and H. Liu. Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161 (1-2):419 449, 2017.

Y. Wang, R. Khardon, D. Pechyony, and R. Jones. Generalization bounds for online learning algorithms with pairwise loss functions. In Conference on Learning Theory, volume 23, pages 13 1, 2012b.

Y. Ying and D.-X. Zhou. Online pairwise learning algorithms. Neural computation, 28(4): 743 777, 2016.

Y. Ying, L. Wen, and S. Lyu. Stochastic online AUC maximization. In Advances in Neural Information Processing Systems, pages 451 459, 2016.

T. Zhang. Data dependent concentration bounds for sequential prediction algorithms. In Conference on Learning Theory, pages 173 187, 2005.

X. Zhang, A. Saha, and S. V. N. Vishwanathan. Smoothing multivariate performance measures. Journal of Machine Learning Research, 13:3623 3680, 2012.

P. Zhao, S. Hoi, R. Jin, and T. Yang. Online AUC maximization. In International Conference on Machine Learning, pages 233 240, 2011.

M. Zinkevich. Online convex programming and generalized inﬁnitesimal gradient ascent. In International Conference on Machine Learning, pages 928 936, 2003.