# pairwise_learning_with_adaptive_online_gradient_descent__b645340b.pdf

Published in Transactions on Machine Learning Research (10/2023)

Pairwise Learning with Adaptive Online Gradient Descent

Tao Sun suntao.saltfish@outlook.com College of Computer National University of Defense Technology Changsha, Hunan, China

Qingsong Wang wang.8973@osu.edu Department of Mathematics Scientific Computing and Imaging (SCI) Institute University of Utah Salt Lake City, Utah, USA

Yunwen Lei leiyw@hku.hk Department of Mathematics The University of Hong Kong Pokfulam, Hong Kong

Dongsheng Li dsli@nudt.edu.cn College of Computer National University of Defense Technology Changsha, Hunan, China

Bao Wang wangbaonj@gmail.com Department of Mathematics Scientific Computing and Imaging (SCI) Institute University of Utah Salt Lake City, Utah, USA

Reviewed on Open Review: https: // openreview. net/ forum? id= rq1Sa HQg2k

In this paper, we propose an adaptive online gradient descent method with momentum for pairwise learning, in which the step size is determined by historical information. Due to the structure of pairwise learning, the sample pairs are dependent on the parameters, causing difficulties in the convergence analysis. To this end, we develop novel techniques for the convergence analysis of the proposed algorithm. We show that the proposed algorithm can output the desired solution in strongly convex, convex, and nonconvex cases. Furthermore, we present theoretical explanations for why our proposed algorithm can accelerate previous workhorses for online pairwise learning. All assumptions used in the theoretical analysis are mild and common, making our results applicable to various pairwise learning problems. To demonstrate the efficiency of our algorithm, we compare the proposed adaptive method with the non-adaptive counterpart on the benchmark online AUC maximization problem.

1 Introduction

Let K Rd be a closed convex set (can be the full space Rd) representing the parameter space. Given a statistical sample space Ξ with probability distribution P; let F( ; ξ, ξ ) : Rd R be a closed function

The first and second authors contributed equally to this paper. Dongsheng Li is the corresponding author. Dongsheng Li and Tao Sun are supported in part by the National Science Foundation of China (62025208), Hunan Provincial Natural Science Foundation of China (2022JJ10065), Young Elite Scientists Sponsorship Program by CAST (2022QNRC001), and Continuous Support of PDL (WDZC20235250101).

Published in Transactions on Machine Learning Research (10/2023)

associated with two samples ξ, ξ Ξ. This paper considers the following pairwise learning problem

n f(x) := E(ξ,ξ ) P PF(x; ξ, ξ ) o , (1)

where the function F(x; ξ, ξ ) can be either convex or nonconvex in x. The pairwise learning model (1) describes various classical machine learning tasks, including the metric learning (Weinberger & Saul, 2009; Kulis et al., 2013; Xing et al., 2002; Ying & Li, 2012), ranking (Rejchel, 2012; Agarwal & Niyogi, 2009), two-stage multiple kernel learning (Kumar et al., 2012), neural link prediction (Wang et al., 2021), the minimum error entropy principle (Hu et al., 2013), and can be adapted to the area under ROC curve (AUC) maximization (Zhao et al., 2011; Gao et al., 2013; Ying et al., 2016; Liu et al., 2018) (see Remark 2 in Section 3 for more details).

There are two major kinds of workhorses for the model (1), i.e., offline and online. The offline one is similar to the empirical risk minimization (ERM): given an i.i.d. sample set ξ1, . . . , ξn , we solve

min x K 1 n(n 1)

i,j [n],i =j F(x; ξi, ξj),

where [n] := n 1, 2, . . . , n o . The major difference between the offline pairwise learning model and the ERM lies in the efficiency of the samples and whether the objective functions are independent of each other. A n-samples training set outputs a finite-sum minimization with O(n) sub-functions in ERM, while the same training set results in O(n2) in the offline pairwise learning. Furthermore, the objective functions in the ERM are independent of each other, which is broken for the offline pairwise learning1. One possible modification to circumvent the dependent objective functions is to form the objective function with two new independent samples in each iteration. This method is suggested in (Peel et al., 2010, Section 4.2) and can be regarded as the algorithm proposed by Zhao et al. (2011) with buffer size one. However, as shown in (Zhao et al., 2011), this two-dependent-points version of online learning does not fully utilize the sampling sequence and results in inferior performance compared with algorithms that utilize some historical samples.

The online pairwise learning assumes the i.i.d. samples (ξk)k [n] are continuously received by the model. In the kth iteration, the online style algorithm proceeds to sample new data ξk from P and reuses the previous samples (ξji)1 i s with {ji}1 i s [k 1] to get the mini-batch stochastic gradient gk = 1 s Ps i=1 F(xk; ξk, ξji) (Wang et al., 2012; Zhao et al., 2011; Ying & Zhou, 2016). Thus, the online method needs to employ an O(s) memory to store the previously sampled data and O(s) computations to calculate the gradient. Nevertheless, the mini-batch version suffers two drawbacks: 1) It has been proved

that its excess generalization bound can be as large as O 1 s + 1 n (Wang et al., 2012), and we need to use

a large s to improve generalization of the online method. 2) A large s causes tremendous computational and memory costs that are unacceptable for online settings. To this end, a simple yet efficient online gradient descent (OGD) is proposed (presented as Algorithm 1), in which the stochastic gradient gk is set to be F xk; ξk, ξk 1 (Yang et al., 2021b). An interesting finding is that the OGD can achieve O 1 n excess generalization bound. The favorable memory and computation costs make the OGD applicable to broader online settings.

In this paper, we focus on developing provably convergent online algorithms with adaptive stepsizes for pairwise learning. Considering the efficiency of the sampling method of OGD (Yang et al., 2021b), our algorithm inherits such kind of sampling. Moreover, our algorithm employs momentum.

1.1 The Adaptive Online Gradient Descent for Pairwise Learning

The OGD performs an SGD-style (stochastic gradient descent) iteration but with biased stochastic gradients since ξk 1 is related to xk. Motivated by the remarkable success of the adaptive variants of SGD for machine learning (Duchi et al., 2011; Mc Mahan & Streeter, 2010; Tieleman & Hinton, 2012; Kingma & Ba, 2015; Reddi et al., 2018; Ward et al., 2019), we propose the adaptive variant of OGD (AOGD) for pairwise learning,

1For example, F(x; ξ1, ξ2) and F(x; ξ1, ξ3) are not independent because they have a shared data ξ1.

Published in Transactions on Machine Learning Research (10/2023)

Algorithm 1 Online Gradient Descent (OGD) for Pairwise Learning (Yang et al., 2021b) Parameters: η > 0. Initialization: x0 = 0, ξ1 P for k = 1, 2, 3, . . . step 1: receive ξk and calculate gk = F(xk; ξk, ξk 1) step 2: xk+1 = Proj K(xk ηgk) End for

Algorithm 2 Adaptive Online Gradient Descent (AOGD) for Pairwise Learning Parameters: η > 0, 0 θ < 1. Initialization: x0 = m0 = 0, ξ1 P for k = 1, 2, 3, . . . step 1: receive ξk and calculate gk = F(xk; ξk, ξk 1) step 2: mk = θmk 1 + (1 θ)gk

step 3: vk = vk 1 + gk 2

step 4: xk+1 = Proj K(xk ηmk/(vk) 1 2 ) End for

presented as Algorithm 2. AOGD directly uses the historical sum of the moment rather than the weighted average form used in Adam (Kingma & Ba, 2015), and AOGD can be rewritten in the weighted average form: let ˆvk := Pk i=1 gi 2/k = vk/k, then steps 3 and 4 of AOGD can be reformulated as follows:

xk+1 = Proj K

In reformulation (2), weights of the moment are 1/k

k 1 and stepsizes are η/

k 1, satisfying the sufficient conditions for the convergence of Adam-type algorithms (Zou et al., 2019; Chen et al., 2019).

Compared with OGD, AOGD employs the momentum and adaptive stepsize generated by the historical information. Thus unlike OGD, AOGD needs a memory of history, but without much computational overhead since AOGD only computes gradient once in each iteration. Another difference between OGD and AOGD lies in the hyper-parameter η: in OGD, η shall be set as small as the desired error ϵ; while in AOGD, η can be set as a constant that is independent of ϵ.

1.2 Comparison with A Closely Related Work

In (Ding et al., 2015), by adopting the Ada Grad-style adaptive gradient update, Duchi et al. (2011) have proposed an adaptive method for online AUC maximization, which is a kind of pairwise learning. Although both (Ding et al., 2015) and our paper consider the adaptive online method for pairwise learning, there are four major differences between (Ding et al., 2015) and our paper, summarized below.

1) We follow the sampling method used by the OGD algorithm in (Yang et al., 2021b). 2) We consider more general cases and provide the corresponding theoretical guarantees, including the more general model (pairwise learning rather than only AUC maximization), and convergence for more general settings, including the nonconvex case, and more general schemes e.g., the use of momentum. 3) We get rid of using the regret bound because it does not directly tell us whether the algorithm converges to the desired minimizer. Another important reason for not using the regret bound for the analysis is that the regret bound has difficulties in covering the nonconvex cases. Based on the above reasons, a non-regret analysis is necessary. 4) We develop new analysis techniques to get the non-regret convergence analysis. Notice that the regret bound is not affected by the stochasticity of the data, and thus the analysis in (Ding et al., 2015) does not need to consider how to deal with the biased stochastic gradients.

Published in Transactions on Machine Learning Research (10/2023)

bounding E[f(xk+1) f(xk)] Analysis

properties and techniques

Convergence Results intermediate variables

properties and novel techniques

properties of intermediate variables

Convergence Results

Adaptive SGD or OGD AOGD

Figure 1: Contrasting the analysis of AOGD against adaptive SGD/OGD. Our analysis of AOGD studies the properties of some intermediate variables. To this end, we develop novel techniques. In the last step, we use the properties of the intermediate variables to derive the convergence guarantee for AOGD.

1.3 Challenges in the Analysis and Difference from Existing Analysis

The primary source of challenges in theoretical analysis comes from the fact that xk is dependent on ξk 1, which immediately breaks the unbiased expectation of the stochastic gradient F(xk; ξk, ξk 1), i.e., E F(xk; ξk, ξk 1) = f(xk). As such, we cannot directly follow the techniques from adaptive SGD (Duchi et al., 2011; Reddi et al., 2018; Chen et al., 2019; Ward et al., 2019; Zou et al., 2019; Li & Orabona, 2019). In paper (Yang et al., 2021b), the authors consider the following decomposition

η F(xk; ξk, ξk 1) = η F(xk 1; ξk, ξk 1) + η F(xk; ξk, ξk 1) η F(xk 1; ξk, ξk 1).

Notice that xk 1 is independent of ξk, ξk 1, and E η F(xk 1; ξk, ξk 1) = η f(xk 1). The Lipschitz property of the gradient then gives us η F(xk; ξk, ξk 1) η F(xk 1; ξk, ξk 1) = O(η xk xk 1 ),

and the proof is similar to the delayed SGD.

However, our proof cannot directly follow the technique above because AOGD involves two extra recipes, i.e., momentum and adaptive stepsize. When momentum exists, we need to deal with both gk and mk

rather than only gk, and we have to modify the techniques from (Yang et al., 2021b) to analyze the effects of momentum 2. Another difficulty is the use of the adaptive stepsize variable η/(vk) 1 2 . Furthermore, vk is dependent of the pair (ξk, ξk 1), making the analysis more challenging.

In contrast to the existing analysis, our approach is not directly establishing the Lyapunov descent starting from bounding E[f(xk+1) f(xk)]. Instead, we recruit some intermediate variables. Taking the nonconvex case as an example, we first establish a Lyapunov-like descent property as follows

E f(xk), mk/(vk) 1 2 θE f(xk 1), mk 1/(vk 1) 1 2 + φ(xk, xk 1, ξk, ξk 1, ξk 2),

where φ(xk, xk 1, ξk, ξk 1, ξk 2) is a function of variables xk, xk 1, ξk, ξk 1, and ξk 2. We stress that the descent is not Lyapunov because θ = 1 and φ(xk, xk 1, ξk, ξk 1, ξk 2) is not always negative. Then, we build the correspondence between the mathematical convergence measurement and E f(xk), mk/(vk) 1 2 . A big picture of the difference in analyzing adaptive SGD/OGD and AOGD is presented in Figure 1.

1.4 Contributions

Our major contributions are threefold, which are summarized below.

2Indeed, in the conclusion part of (Yang et al., 2021b), the authors have listed the momentum variant as future work.

Published in Transactions on Machine Learning Research (10/2023)

We propose an adaptive online gradient descent algorithm for pairwise learning with a simple sampling strategy. The proposed algorithm uses adaptive stepsize and momentum, requiring only a small overhead in memory and computational costs. We present the convergence results for the proposed algorithm under different settings, including strongly convex, general convex, and nonconvex cases. The use of adaptive stepsize and momentum requires non-trivial techniques for the convergence analysis. We also provide theoretical explanations for why our proposed algorithm can accelerate OGD. We verify the efficiency of the proposed AOGD on the benchmark online AUC maximization task, showing that AOGD outperforms OGD.

1.5 Notation

Throughout this paper, we use boldface letters to denote vectors, e.g., x, y Rd. The jth coordinate of the vector x is denoted by xj. The L2 norm of the vector x is denoted by x . We denote E[ ] as the expectation with respect to the underlying probability space. We denote the minimum value of the function f over K as min K f, and denote Proj K(x) as the projection of x onto the set K. For two positive sequences (ak, bk)k 0, ak = O(bk) means that there exists C > 0 such that ak Cbk. The notation ak = Θ(bk) indicates that ak = O(bk) and bk = O(ak). We use ak = O(bk) and ak = Θ(bk) to hide the logarithmic factor but still with the same order. We use ak Θ(bk) to present the relation ak Cbk with C > 0.

1.6 Organization

We organize this paper as follows: In Section 2, we present assumptions and theoretical convergence results for AOGD under general convex, strongly convex, and nonconvex settings. We numerically verify the efficiency of AOGD and compare it to the benchmark OGD in Section 3. More related works are discussed in Section 4, followed by concluding remarks. The detailed proofs are provided in the supplementary materials.

2 Convergence Analysis

2.1 Assumptions

We first collect several necessary assumptions for the convergence analysis of AOGD:

Assumption 1: The function F( ; ξ, ξ ) is differentiable, and its gradient is L-Lipschitz, i.e.,

F(x; ξ, ξ ) F(y; ξ, ξ ) L x y , x, y K, ξ, ξ Ξ. (3)

Assumption 2: The gradient of F(x; ξ, ξ ) is uniformly bounded, i.e., F(x; ξ, ξ ) B for some constant B > 0, x K, and ξ, ξ Ξ.

The Lipschitz smooth gradient assumption is widely used in the (non)convex optimization and pairwise learning communities. While Assumption 2 is frequently used in the adaptive SGD community, see, e.g., (Duchi et al., 2011; Reddi et al., 2018; Chen et al., 2019; Ward et al., 2019; Zou et al., 2019; Li & Orabona, 2019) 3. Note that when K is bounded Duchi et al. (2011); Reddi et al. (2018) have assumed the boundedness of the constrained set Assumption 2 directly holds for the continuity of the gradient 4, but not vice versa. Moreover, Assumption 2 indicates the following estimate of f(x):

f(x) = Eξ,ξ P P F(x; ξ, ξ ) Eξ,ξ P P F(x; ξ, ξ ) B.

Using mathematical induction, we can see that the momentum mk also enjoys the uniform bound according to Assumption 2. Furthermore, we stress that we do not need to assume the variance is bounded, which is indicated by Assumption 2 since we have

E F(x; ξ, ξ ) f(x) 2 E F(x; ξ, ξ ) 2 B2.

3The uniform assumption presented in (Li & Orabona, 2019) enjoys another presentation, i.e., (Assumption (H2)) presented as |f(x) f(y)| G x y . Note that when f is differentiable, it is equivalent to supx f(x) G. With bounded assumption for F(x; ξ) f(x) in (Li & Orabona, 2019), which is indeed the uniform bound assumption. 4Any continuous function is uniformly bounded over a closed bounded subset of Rd.

Published in Transactions on Machine Learning Research (10/2023)

Assumptions 1 and 2 will be used in the analysis of AOGD for all different scenarios in the subsequent analysis.

2.2 General Convex Cases

In this subsection, we present the convergence result of AOGD for the general convex case, i.e., F(x; ξ, ξ ) is convex with respect to x and any fixed ξ, ξ . Note that the convexity of F(x; ξ, ξ ) indicates the convexity of f(x) in (1), but not vice versa.

Theorem 1 (General Convexity) Let Assumption 1 hold, and g0 2 δ > 0 for some constant δ, and F(x; ξ, ξ ) be convex. Assume {xk}k 1 is generated by AOGD for pairwise learning, x := arg minx K f(x), and K is additionally bounded, i.e., maxx,y K x y D. Then we have

where c1 := E

2(1 θ)η + 3B2

δ , and c2(K) :=

2(1 θ) + (ηθ + 1)

Assumption 2 is implicitly used in Theorem 1 because we have assumed the boundedness of the constrained set K. The bounded constrained set is indeed stronger than the uniform bounded gradient assumption. We leave how to relax the bounded set assumption as future work. From Theorem 1, we can see that the convergence rate is dependent on E[

v K]. The boundedness of the stochastic gradients indicates that

K), which means the worst convergence rate of AOGD is e O 1

. We notice that the

convergence rate e O 1

coincides with the rate of OGD for pairwise learning in the general convex case

(Yang et al., 2021b). However, in some cases E[

v K] can decay faster than O(

K), based on which we can establish an accelerated rate of AOGD.

Proposition 1 Assume the conditions of Theorem 1 hold, and assume E[

v K] = O(Kα) with 0 < α 1

2. Then we have

If α < 1 2, the convergence rate of AOGD is thus better than e O 1

. Proposition 1 then provides a

theoretical interpretation of why AOGD is possible to be faster than OGD (Yang et al., 2021b).

Remark 1 Note that the condition α = 1/2 directly holds due to the boundedness of the stochastic gradients. The fast decaying condition α < 1/2 has been commonly used as a standard explanation for why adaptive stochastic optimization algorithms often outperform non-adaptive schemes, as shown in several studies (Reddi et al., 2018; Chen et al., 2018; 2019; Liu et al., 2019). As far as our current knowledge goes, the superiority of adaptive SGD over non-adaptive approaches is not well-explained, apart from the assumption that α < 1/2. This assumption is commonly used in previous works because many training tasks involve sparse stochastic gradients. However, it is important to note that this assumption is not a logical consequence of sparse gradients resulting in α < 1/2. Instead, the assumption is more of a hypothetical analysis. In summary, while the assumption α < 1/2 is widely employed in the literature, we do not have a definitive explanation for the effectiveness of adaptive SGD beyond this assumption. Further research is required to gain a comprehensive understanding of the benefits of adaptive SGD compared to non-adaptive methods.

2.3 Strongly Convex Cases

In this subsection, we consider the case that the function F(x; ξ, ξ ) is ν-strongly convex for some constant ν > 0, i.e., F(x; ξ, ξ ) F(y; ξ, ξ ) F(y; ξ, ξ ), x y ν

2 x y 2. In particular, if ν = 0, F(x; ξ, ξ ) then reduces to general convex.

Published in Transactions on Machine Learning Research (10/2023)

Stepsize rule for the strongly convex case. Before developing a convergence guarantee of AOGD for strongly convex cases, we need to select the appropriate stepsize rule for AOGD. We first need to explain the stepsize rule of AOGD for the general convex case, i.e., η/

vk for some constant η > 0, is inappropriate for

solving strongly convex problems. The boundedness of gradient directly gives us 1

. However, such a stepsize rule causes the accumulation of stochastic noise; it will not improve the convergence rate of AOGD under strong convexity compared to the rate of AOGD under general convexity. A proper stepsize choice for the strongly convex case is Θ 1 k (Bach & Moulines, 2013). Indeed, in papers (Duchi et al., 2011;

Sun et al., 2020), the authors use the 1

vk stepsize rule for Ada Grad when the problem is strongly convex,

and we follow this stepsize rule for the strongly convex online pairwise learning. Note that 1

vk Θ 1 k , which coincides with the stepsize used for SGD when the underlying problem is strongly convex. In summary, in the strongly convex case, we replace step 4 of Algorithm 2 with step 4 , which is given below

step 4 : xk+1 = Proj K

Next, we present the convergence rate of AOGD for pairwise learning with the strong convexity assumption.

Theorem 2 Let Assumption 1 hold, and g0 2 δ > 0, and F(x; ξ, ξ ) be strongly convex. Assume {xk}k 1 is generated by the AOGD for pairwise learning with θ = 0 using stepsize rule step 4 , and x = arg minx K f(x). By setting η = B

2ν , then we have

E x K x 2 = O

In the strongly convex case, Assumption 2 is equivalent to the boundedness assumption of the constrained set K. This is because the function f(x) is also ν-strongly convex 5, yielding f(x) = f(x) f(x ) ν x x with x being the global minimizer of f. When Assumption 2 holds, in Subsection 2.1 we have shown that f(x) B over K. Thus, we get x x x + x B

ν + x when x K. Notice that x is fixed, and the set K is then uniformly bounded.

Theorem 2 above shows that AOGD can achieve a faster convergence rate under the strong convexity assumption than that for general convex cases. Theorem 2 also shows that in the strongly convex case, AOGD achieves an almost optimal convergence rate of SGD under strong convexity, specifically e O 1 K (the

optimal convergence rate is O 1 K according to (Rakhlin et al., 2012)). The result in Theorem 2 does not

generalize to the general convex case since we set η = B

2ν , which is infinity when ν = 0. For technical reasons, we set θ = 0 in Theorem 2, i.e., we only consider the momentum-free case. We will consider how to build the convergence rate e O 1 K for AOGD with momentum in our future work.

2.4 Nonconvex Cases

In this part, we consider the case when F(x; ξ, ξ ) is nonconvex. The assumptions for the nonconvex case are much milder than the convex and strongly convex cases. We can even get rid of using the projection operator Proj K( ) for AOGD. The convergence result of AOGD for the nonconvex case is presented as follows.

Theorem 3 Let Assumptions 1, 2 hold, and let {xk}k 1 be generated by AOGD, and g0 2 δ for some constant δ > 0. Suppose

vk C kα for two constants C > 0 and 0 < α 1

2, and K is the full space. Then, we have

n E f(xk) 2o c3 + c4(K)

5Taking expectation of both sides of the inequality F(x; ξ, ξ ) F(y; ξ, ξ ) F(y; ξ, ξ ), x y ν

2 x y 2 gives us the strong convexity of f.

Published in Transactions on Machine Learning Research (10/2023)

where c3 := 2(7 6θ)B2C

δ + 4Cf(x1)

η , and c4(K) := (1+θ)η

δ + L2/δ + 5/2 ln

From Theorem 3, we can see that the convergence rate of AOGD is e O 1 K1 α , and α = 1

2 is the worst case

due to the boundedness of the stochastic gradient. When α < 1

2, we get a faster convergence of AOGD compared with OGD or SGD in the general nonconvex case.

The conditions of Theorem 3 can also be satisfied by the general convex case without projection. Therefore, (8) also holds when F(x; ξ, ξ ) is convex. However, (8) is weaker than (4) since the convergence rate in (8) is not established with respect to the function values.

3 Numerical Results

Table 1: Statistics of the dataset used for contrasting the performance of AOGD and OGD, where n is the number of samples in each dataset, and d is the number of features of each instance in a given dataset. All datasets come from the LIBSVM website (Chang & Lin, 2011), and they are used in (Yang et al., 2021b).

diabtes german ijcnn1 letter mnist usps n 768 1,000 49,990 15,000 60,000 7,291 d 8 24 22 161 780 256

In this section, we numerically validate our theoretical findings for both convex and nonconvex cases. To this end, we compare our proposed AOGD against the baseline OGD proposed in (Yang et al., 2021b) 6

for pairwise learning in terms of generalization and rate of convergence with respect to the number of iteration. We also include the results of Ada OAM in (Ding et al., 2015), an adaptive online algorithm for AUC maximization. However, we note that the algorithm in (Ding et al., 2015) is not designed for general pairwise learning. Following (Yang et al., 2021b), we consider six benchmark datasets, summarized in Table 1. Also, following the data split strategy used in (Yang et al., 2021b), for the dataset with multiple classes, we convert the first half of classes to the positive class and the second half of classes to the negative class.

For each dataset, we use 80% of the data for training and the remaining 20% for testing. All the reported results are based on 25 runs with random shuffling. The generalization performance is reported using the average AUC score and standard deviation on the test data. To determine proper hyperparameters for OGD, AOGD, and Ada OAM, we conduct 5-fold cross-validation on the training sets: 1) for OGD, we select stepsizes ηt = η 10[ 5:5]7 and the parameter space K is set to be the L2-ball centered at the origin with radius R 10[ 3,3]; 2) for AOGD, we let θ = 0.9 and we select stepsizes ηt = η 10[ 5:5] and the parameter space K is also the L2-ball centered at the origin with radius R 10[ 3,3]; 3) for Ada OAM, we select stepsizes ηt = η 10[ 5:5] and the parameter space K is set to be the L2-ball centered at the origin with radius R 10[ 3,3].

Convex case: We run experiments on AUC maximization using the following convex loss function

f(w; (x, y), (x , y )) = ℓ(w (x x ))I[y=1 y = 1] + ℓ(w (x x))I[y= 1 y =1],

I[y=1 y = 1] =

( 1 if y = 1 and y = 1, 0 otherwise.

and similarly for I[y= 1 y =1]. Also, ℓis a surrogate loss function, e.g., the hinge loss

ℓ(t) = (1 t)+ =

( 0 if 1 t < 0, 1 t otherwise.

6In (Yang et al., 2021b), the authors have shown that OGD can remarkably outperform existing algorithms, including OLP (Kar et al., 2013), OAMgra (Zhao et al., 2011), SGDpair (Lei et al., 2020), and SPAUC (Lei & Ying, 2021). 7[ 5 : 5] stands for integers in the interval [ 5, 5].

Published in Transactions on Machine Learning Research (10/2023)

Remark 2 The loss function above follows the setup in (Yang et al., 2021b) which is designed for the AUC maximization problem with data stream (x, y) sampled from a distribution that contains both positive and negative samples by utilizing the indicator function I[y=1 y = 1] and I[y= 1 y =1]. We want to mention that the second term is not presented in the main text of (Yang et al., 2021b) but implicitly utilized their code 8 on which our implementation is based. In particular, in both OGD and AOGD, the loss is activated or non-zero whenever the consecutive sample pairs (x, y) and (x , y ) are of opposite labels, i.e., y = 1 and y = 1 or y = 1 and y = 1.

We now provide further clarification on the pairwise model and AUC maximization: In model (1), the distribution for ξ and ξ is assumed to be the same. This is in contrast to the early AUC maximization formulation, as seen in (Zhao et al., 2011), which does not have this property. In the early formulation, the positive and negative classes are treated differently, with potentially different distributions. However, we can introduce indicator functions and show that the AUC maximization can be represented as a special form of (1) as above.

A natural question that arises is why we would choose to reformulate the AUC maximization problem into the model (1). At first glance, there does not seem to be any apparent advantage to such a reformulation, as the algorithm only works when a negative sample is encountered along with a positive sample. In the worst-case scenario, if all the negative labels are encountered before the positive labels, the algorithm would almost output no predictions.

We provide a necessary explanation here:

a) The worst-case scenario mentioned above is highly unlikely to occur because we assume that the data is independently and identically distributed from the underlying distribution.

b) In the online setting, the labels are unknown in advance. While the work of Zhao et al. (2011) discusses the online scenario, they employ an update buffer data pre-processing step to divide the data and subsequently run the algorithm offline. Therefore, we cannot directly apply the formulation described in (Zhao et al., 2011). To address this, it is more flexible to utilize the formulation presented in (Yang et al., 2021b). This formulation is better suited for online learning, as it does not require an offline processing step like the one used in (Zhao et al., 2011).

c) Existing results from (Yang et al., 2021b) demonstrate that the formulation (1) outperforms previous methods, even when a negative sample is encountered along with a positive sample. This is because previous methods often have high gradient complexity at each iteration, while the online complexity is very small in the proposed formulation.

In summary, the reformulation (1) allows for more flexibility in the optimization process and has the potential to yield better results in practical scenarios.

Figure 2 plots the AUC scores of AOGD, OGD, and Ada OAM against the number of iterations (in log scale) 9

Table 2: Average AUC scores standard deviation with convex loss function across the six benchmark datasets listed in Table 1. The best results are highlighted in boldface.

Algorithm diabetes german ijcnn1 letter mnist usps AOGD .831 .027 .795 .026 .934 .002 .814 .006 .931 .002 .925 .004 OGD .831 .030 .793 .021 .934 .002 .810 .007 .932 .001 .926 .006 Ada OAM .829 .027 .792 .028 .934 .003 .814 .009 .932 .002 .924 .005

Table 2 summarizes the generalization performance between AOGD and OGD. The results for AOGD and Ada OAM are obtained from our above experiment and the results for OGD are adapted from (Yang et al.,

8https://github.com/zhenhuan-yang/simple-pairwise 9Number of iterations N is given by 2Nlog/2+4 where Nlog is the log-scaled number of iterations. on the six benchmark datasets listed in Table 1. The numerical results on the six benchmark datasets show that AOGD converges faster than OGD and Ada OAM in general, confirming our theoretical results.

Published in Transactions on Machine Learning Research (10/2023)

0 5 10 15 20 Iterations (log scale)

AOGD OGD Ada OAM

0 5 10 15 20 Iterations (log scale)

AOGD OGD Ada OAM

0 5 10 15 20 25 Iterations (log scale)

AOGD OGD Ada OAM

0 10 20 30 Iterations (log scale)

AOGD OGD Ada OAM

0 10 20 30 Iterations (log scale)

AOGD OGD Ada OAM

0 5 10 15 20 25 Iterations (log scale)

AOGD OGD Ada OAM

Figure 2: The AUC score of AOGD and OGD against the number of iterations (in log scale) for AUC maximization with convex loss (hinge loss). It is evident that AOGD converges faster than OGD and Ada OAM in general, confirming our established theoretical results for AOGD.

2021b). Overall, AOGD generalizes as well as OGD and Ada OAM as the peak performance of these three methods are comparable. Establishing the generalization of OAGD is an interesting future direction.

Non-convex case: We follow the setting in Appendix G of (Yang et al., 2021b) on AUC maximization using the logistic link function logit(t) = (1 + exp( t)) 1 and then the square loss function ℓ(t) = (1 t)2. Hence, the loss function for the AUC maximization problem is given by the non-convex function

f(w; (x, y), (x , y )) = 1 logit w (x x ) 2 I[y=1 y = 1] + 1 logit w (x x) 2 I[y= 1 y =1].

We plot in Figure 3 the AUC scores of AOGD, OGD, and Ada OAM against the number of iterations on the six benchmark datasets listed in Table 1. The numerical results on the six benchmark datasets show that AOGD converges faster than OGD and Ada OAM in general, confirming our theoretical results.

Published in Transactions on Machine Learning Research (10/2023)

Table 3: Average AUC scores standard deviation with nonconvex loss function across the six benchmark datasets listed in Table 1. The best results are highlighted in boldface.

Algorithm diabetes german ijcnn1 letter mnist usps AOGD .834 .022 .797 .023 .935 .002 .815 .005 .932 .002 .928 .004 OGD .829 .033 .794 .022 .934 .002 .815 .003 .931 .002 .926 .005 Ada OAM .831 .025 .789 .021 .931 .002 .815 .006 .932 .003 .926 .007

0 5 10 15 20 Iterations (log scale)

AOGD OGD Ada OAM

0 5 10 15 20 Iterations (log scale)

AOGD OGD Ada OAM

0 10 20 30 Iterations (log scale)

AOGD OGD Ada OAM

0 10 20 30 Iterations (log scale)

AOGD OGD Ada OAM

0 10 20 30 Iterations (log scale)

AOGD OGD Ada OAM

0 5 10 15 20 25 Iterations (log scale)

AOGD OGD Ada OAM

Figure 3: The AUC score of AOGD, OGD, and Ada OAM against the number of iterations (in log scale) with non-convex loss function. In this case, AOGD converges faster than OGD and Ada OAM in general, confirming our established theoretical results for AOGD. In particular, the slower convergence of Ada OAM is more prominent in the non-convex case.

4 More Related Works

Because offline methods employ the ERM-like policy, the core problem of most offline methods is establishing the generalization bound of the finite-sum model with statistical learning theory or algorithmic stability

Published in Transactions on Machine Learning Research (10/2023)

(Agarwal & Niyogi, 2009; Jin et al., 2009; Wang et al., 2019; Gao & Zhou, 2013; Lei et al., 2020). It is worth mentioning that the difficulty in the generalization analysis for pairwise learning lies in that the objective functions fail to be i.i.d. with each other, which breaks the fundamental assumption in statistical learning theory and the algorithmic stability communities.

The online methods for pairwise learning assume the model accesses a data stream of i.i.d. samples, including online AUC maximization algorithms (Zhao et al., 2011; Ying et al., 2016; Liu et al., 2018; Natole et al., 2018; Lei & Ying, 2021; Guo et al., 2020), online metric learning algorithms (Shalev-Shwartz et al., 2004; Davis et al., 2007; Jain et al., 2008; Jin et al., 2009), online learning to rank (Rejchel, 2012; Schuth et al., 2013; Zoghi et al., 2017; Li et al., 2019), neural link prediction (Wang et al., 2021), etc. The online AUC maximization is proposed by Zhao et al. (2011) with theoretical guarantees. In the paper (Ying et al., 2016), a stochastic online AUC maximization algorithm is proposed from the perspective of a saddle representation. The main advantage of the algorithm in (Ying et al., 2016) is to avoid storing all previous examples and second-order covariance matrices. Leveraging saddle representation, Liu et al. (2018) propose a faster online AUC maximization algorithm with provably improved statistical convergence rates. The stochastic proximal algorithms for AUC maximization with non-differentiable regularization are proposed and studied in (Natole et al., 2018). To make the algorithm scalable to large-scale streaming data, Lei & Ying (2021) propose a new stochastic proximal algorithm. In the paper (Guo et al., 2020), the authors consider the distributed setting and propose a communication-efficient stochastic AUC maximization with deep neural networks. Shalev-Shwartz et al. (2004) propose an online algorithm for supervised learning of pseudo-metrics. In (Davis et al., 2007), the authors present an information-theoretic approach for online metric learning. In (Jain et al., 2008), leveraging the Log Det regularization, the authors propose a fast online metric learning for the similarity search. The generalization bound of regularized distance metric learning is established in (Jin et al., 2009). Rejchel (2012) consider ranking estimators that minimize the convex empirical risks and prove their generalization bounds. A framework of online learning to rank is proposed by Schuth et al. (2013). In the paper (Zoghi et al., 2017), the authors investigate online learning to rank in stochastic click models. Paper (Li et al., 2019) introduces a new model for online ranking with features. The differentially private pairwise learning has been recently studied, and representative works include (Huai et al., 2020; Yang et al., 2021a; Xue et al., 2021; Yang et al., 2021b).

5 Conclusions

In this paper, we propose adaptive online gradient descent algorithms to solve pairwise learning problems and establish their theoretical performance bounds in strongly convex, convex, and nonconvex settings. Our theoretical results explain why the convergence speed of adaptive online gradient descent can outperform the one without adaptive stepsize for pairwise learning. We also provide numerical experiments to demonstrate the efficiency of the proposed algorithm.

Limitation and future work. There are two major limitations in our analysis: 1) we assume the set K is bounded in establishing Theorem 1, and 2) the convergence rate in Theorem 2 is analyzed for the adaptive online gradient descent without momentum. We leave how to overcome the above two limitations as future work. There are numerous other avenues for future work, including 1) Can we establish the lower bound of the convergence rates for the adaptive online gradient descent applied to pairwise learning? 2) Can we extend the online adaptive gradient descent to the proximal settings to solve nonsmooth pairwise learning problems?

Published in Transactions on Machine Learning Research (10/2023)

Broader Impact Statement

Shivani Agarwal and Partha Niyogi. Generalization bounds for ranking algorithms via algorithmic stability. Journal of Machine Learning Research, 10(2), 2009.

Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Advances in neural information processing systems, 26, 2013.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1 27:27, 2011. Software available at http://www.csie.ntu. edu.tw/~cjlin/libsvm.

Xiangyi Chen, Sijia Liu, Ruoyu Sun, and Mingyi Hong. On the convergence of a class of Adam-type algorithms for non-convex optimization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1x-x309tm.

Zaiyi Chen, Zhuoning Yuan, Jinfeng Yi, Bowen Zhou, Enhong Chen, and Tianbao Yang. Universal stagewise learning for non-convex problems with convergence on averaged solutions. In International Conference on Learning Representations, 2018.

Jason V Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S Dhillon. Information-theoretic metric learning. In Proceedings of the 24th international conference on Machine learning, pp. 209 216, 2007.

Yi Ding, Peilin Zhao, Steven Hoi, and Yew-Soon Ong. An adaptive gradient method for online auc maximization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121 2159, 2011.

Wei Gao and Zhi-Hua Zhou. Uniform convergence, stability and learnability for ranking problems. In Twenty-Third International Joint Conference on Artificial Intelligence. Citeseer, 2013.

Wei Gao, Rong Jin, Shenghuo Zhu, and Zhi-Hua Zhou. One-pass AUC optimization. In International conference on machine learning, pp. 906 914. PMLR, 2013.

Zhishuai Guo, Mingrui Liu, Zhuoning Yuan, Li Shen, Wei Liu, and Tianbao Yang. Communication-efficient distributed stochastic AUC maximization with deep neural networks. In International Conference on Machine Learning, pp. 3864 3874. PMLR, 2020.

Ting Hu, Jun Fan, Qiang Wu, and Ding-Xuan Zhou. Learning theory approach to minimum error entropy criterion. Journal of Machine Learning Research, 14(2), 2013.

Mengdi Huai, Di Wang, Chenglin Miao, Jinhui Xu, and Aidong Zhang. Pairwise learning with differential privacy guarantees. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 694 701, 2020.

Prateek Jain, Brian Kulis, Inderjit Dhillon, and Kristen Grauman. Online metric learning and fast similarity search. Advances in neural information processing systems, 21, 2008.

Rong Jin, Shijun Wang, and Yang Zhou. Regularized distance metric learning: Theory and algorithm. Advances in neural information processing systems, 22, 2009.

Purushottam Kar, Bharath Sriperumbudur, Prateek Jain, and Harish Karnick. On the generalization ability of online learning algorithms for pairwise loss functions. In Sanjoy Dasgupta and David Mc Allester (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 441 449, Atlanta, Georgia, USA, 17 19 Jun 2013. PMLR. URL https://proceedings.mlr.press/v28/kar13.html.

Published in Transactions on Machine Learning Research (10/2023)

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Brian Kulis et al. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287 364, 2013.

Abhishek Kumar, Alexandru Niculescu-Mizil, Koray Kavukcoglu, and Hal Daumé. A binary classification framework for two-stage multiple kernel learning. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pp. 1331 1338, 2012.

Yunwen Lei and Yiming Ying. Stochastic proximal AUC maximization. Journal of Machine Learning Research, 22(61):1 45, 2021.

Yunwen Lei, Antoine Ledent, and Marius Kloft. Sharper generalization bounds for pairwise learning. Advances in Neural Information Processing Systems, 33:21236 21246, 2020.

Shuai Li, Tor Lattimore, and Csaba Szepesvári. Online learning to rank with features. In International Conference on Machine Learning, pp. 3856 3865. PMLR, 2019.

Xiaoyu Li and Francesco Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. The 22nd International Conference on Artificial Intelligence and Statistics, pp. 983 992, 2019.

Mingrui Liu, Xiaoxuan Zhang, Zaiyi Chen, Xiaoyu Wang, and Tianbao Yang. Fast stochastic AUC maximization with O(1/n)-convergence rate. In International Conference on Machine Learning, pp. 3189 3197. PMLR, 2018.

Mingrui Liu, Youssef Mroueh, Jerret Ross, Wei Zhang, Xiaodong Cui, Payel Das, and Tianbao Yang. Towards better understanding of adaptive gradient algorithms in generative adversarial nets. In International Conference on Learning Representations, 2019.

H Brendan Mc Mahan and Matthew Streeter. Adaptive bound optimization for online convex optimization. COLT 2010, pp. 244, 2010.

Michael Natole, Yiming Ying, and Siwei Lyu. Stochastic proximal algorithms for AUC maximization. In International Conference on Machine Learning, pp. 3710 3719. PMLR, 2018.

Thomas Peel, Sandrine Anthoine, and Liva Ralaivola. Empirical bernstein inequalities for u-statistics. Advances in Neural Information Processing Systems, 23, 2010.

Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pp. 1571 1578, 2012.

Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ry Qu7f-RZ.

Wojciech Rejchel. On ranking and generalization bounds. Journal of Machine Learning Research, 13(5), 2012.

Anne Schuth, Katja Hofmann, Shimon Whiteson, and Maarten De Rijke. Lerot: An online learning to rank framework. In Proceedings of the 2013 workshop on Living labs for information retrieval evaluation, pp. 23 26, 2013.

Shai Shalev-Shwartz, Yoram Singer, and Andrew Y Ng. Online and batch learning of pseudo-metrics. In Proceedings of the twenty-first international conference on Machine learning, pp. 94, 2004.

Tao Sun, Linbo Qiao, Qing Liao, and Dongsheng Li. Novel convergence results of adaptive stochastic gradient descents. IEEE Transactions on Image Processing, 30:1044 1056, 2020.

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-RMSProp, coursera: Neural networks for machine learning. University of Toronto, Technical Report, 2012.

Published in Transactions on Machine Learning Research (10/2023)

Boyu Wang, Hejia Zhang, Peng Liu, Zebang Shen, and Joelle Pineau. Multitask metric learning: Theory and algorithm. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3362 3371. PMLR, 2019.

Yuyang Wang, Roni Khardon, Dmitry Pechyony, and Rosie Jones. Generalization bounds for online learning algorithms with pairwise loss functions. In Conference on Learning Theory, pp. 13 1. JMLR Workshop and Conference Proceedings, 2012.

Zhitao Wang, Yong Zhou, Litao Hong, Yuanhang Zou, and Hanjing Su. Pairwise learning for neural link prediction. ar Xiv preprint ar Xiv:2112.02936, 2021.

Rachel Ward, Xiaoxia Wu, and Leon Bottou. Adagrad stepsizes: Sharp convergence over nonconvex landscapes. In International Conference on Machine Learning, pp. 6677 6686. PMLR, 2019.

Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. Journal of machine learning research, 10(2), 2009.

Eric Xing, Michael Jordan, Stuart J Russell, and Andrew Ng. Distance metric learning with application to clustering with side-information. Advances in neural information processing systems, 15, 2002.

Zhiyu Xue, Shaoyang Yang, Mengdi Huai, and Di Wang. Differentially private pairwise learning revisited. IJCAI, 2021.

Zhenhuan Yang, Yunwen Lei, Siwei Lyu, and Yiming Ying. Stability and differential privacy of stochastic gradient descent for pairwise learning with non-smooth loss. In International Conference on Artificial Intelligence and Statistics, pp. 2026 2034. PMLR, 2021a.

Zhenhuan Yang, Yunwen Lei, Puyu Wang, Tianbao Yang, and Yiming Ying. Simple stochastic and online gradient descent algorithms for pairwise learning. Advances in Neural Information Processing Systems, 34, 2021b.

Yiming Ying and Peng Li. Distance metric learning with eigenvalue optimization. The Journal of Machine Learning Research, 13(1):1 26, 2012.

Yiming Ying and Ding-Xuan Zhou. Online pairwise learning algorithms. Neural computation, 28(4):743 777, 2016.

Yiming Ying, Longyin Wen, and Siwei Lyu. Stochastic online AUC maximization. Advances in neural information processing systems, 29, 2016.

Peilin Zhao, Steven CH Hoi, Rong Jin, and Tianbo Yang. Online AUC maximization. In Proceedings of the 28th International Conference on Machine Learning ICML, pp. 233 240, 2011.

Masrour Zoghi, Tomas Tunys, Mohammad Ghavamzadeh, Branislav Kveton, Csaba Szepesvari, and Zheng Wen. Online learning to rank in stochastic click models. In International Conference on Machine Learning, pp. 4199 4208. PMLR, 2017.

Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, and Wei Liu. A sufficient condition for convergences of Adam and RMSProp. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11127 11135, 2019.

Published in Transactions on Machine Learning Research (10/2023)

A Proofs of Results in the Convex Scenario

A.1 Technical Lemmas

Given any x arg K min f, in the (strongly) convex setting, we introduce the following notation

vk 1 i + L2E xk xk 1 2

δ + B2E h 1

Ak := E( mk 2/(vk) 1 2 ), Bk := E x xk, mk , Ck := (ηθ + 1 θ

2 )Ak 1 + (1 θ)ϕk

We have the following lemmas.

Lemma 1 [Lemma 9 in the appendix, (Li & Orabona, 2019)] Let a1, a2, . . . , a K be non-negative, and h be a non-increasing function. Then we have

k=1 akh(a0 +

i=1 ai) Z PK

Lemma 2 Assume {xk}k 1 is generated by Algorithm 2, then we have

k=1 E[ gk 2/(vk) 1 2 ] E(vk) 1 2 .

Lemma 3 Assume {xk}k 1 is generated by Algorithm 2 and g0

δ > 0, then we have

k=1 E xk xk 1 2

k=1 E[ gk 2/vk] E ln vk

Lemma 4 Assume {xk}k 1 is generated by Algorithm 2, then we have

E F(xk 1; ξk, ξk 1) 2 2L2E xk xk 1 2 + 2E gk 2.

Lemma 5 Assume {xk}k 1 is generated by Algorithm 2, and Assumptions 1 and 2 hold, then the following result holds

Bk θBk 1 + Ck + (1 θ)E h f(x ) f(xk) i .

A.2 Proof of Theorem 1

Given k Z+, with Lemma 5 and mathematical induction, we have the following inequality

i=1 θk 1 i Ci +

i=1 (1 θ)θk 1 i E(f(x ) f(xi)).

Notice that m1 = 0 and B1 = 0, we get

i=1 θk 1 i Ci +

i=1 (1 θ)θk 1 i E(f(x ) f(xi)). (10)

Published in Transactions on Machine Learning Research (10/2023)

Summing the inequality (10) from k = 1 to K gives us

i=1 θk 1 i Ci +

i=1 (1 θ)θk 1 i E(f(x ) f(xi))

PK k=1 Ck 1 θ + (1 θ)

k=1 E(f(x ) f(xk)).

Therefore, we have

k=1 E(f(xk) f(x ))

Bk (1 θ) + PK k=1 Ck (1 θ)2 . (11)

The scheme of the algorithm indicates that

xk+1 = Proj K(xk ηmk/(vk) 1 2 ).

We then get

xk+1 x 2 = Proj K(xk ηmk/(vk) 1 2 ) x 2

= Proj K(xk ηmk/(vk) 1 2 ) Proj K(x ) 2

xk ηmk/(vk) 1 2 x 2.

Multiplying both sides with (vk) 1 2 , we are then led to

(vk) 1 2 xk+1 x 2 (vk) 1 2 x xk 2 + 2η mk, x xk + η2 mk 2/(vk) 1 2 ,

which is equivalent to

(vk+1) 1 2 xk+1 x 2 (vk) 1 2 x xk 2 + 2η mk, x xk

+ η2 mk 2/(vk) 1 2 + ((vk+1) 1 2 (vk) 1 2 ) xk+1 x 2.

Taking the total expectation of both sides of the above equation gives us

2ηBk E(vk) 1 2 xk x 2 E(vk+1) 1 2 xk+1 x 2

+ η2Ak + (E(vk+1) 1 2 E(vk) 1 2 )D2. (12)

Summing (12) from k = 1 to K gives us

k=1 ( Bk)/(1 θ) E

2η(1 θ) + η 2(1 θ)

k=1 Ak + D2

Together with (11), we then have

k=1 E(f(xk) f(x )) E

k=1 Ak + D2

v K+1 + PK k=1 Ck (1 θ)2 .

We turn to bound the right-hand side of (13) and get the following bound

k=1 Ak η 2(1 θ)E(v K+1) 1 2 . (14)

Published in Transactions on Machine Learning Research (10/2023)

On the other hand, we can get PK k=1 Ck (1 θ)2 ηθ (1 θ)2 E(v K+1) 1 2 + 2L (1 θ)2 E ln v K+1

δ + 1 (1 θ)2

ηθ (1 θ)2 E(v K+1) 1 2 + 2L (1 θ)2 E ln v K+1

δE ln v K+1

δ + E(v K+1) 1 2

Substituting the bounds (14) and (15) into (13), we are then led to

k=1 E(f(xk) f(x )) i

2η(1 θ) + [η + 2D2

2(1 θ) + ηθ + 1

+ [ 4L (1 θ)2 + 2L2

Notice that ln( ) and 1

are both convex when is positive, using Jensen s inequality gives us

Therefore, (16) can be presented as

f PK k=1 xk

2(1 θ)η + 3B2

c2(K) := [η + 2D2

2(1 θ) + ηθ + 1

v K+1 + [ 4L (1 θ)2 + 2L2

B Proofs of Results in the Strongly Convex Scenario

B.1 Proof of Proposition 2

The boundedness of the gradient and the strong convexity indicate K is bounded, whose radius is assumed to be D > 0. Notice that the operator Proj K( ) is contractive, as θ = 0, we get

E xk+1 x 2 = E Proj K(xk η

= E Proj K(xk η

vk) Proj K(x ) 2

= E xk x 2 2 η

k E( F(xk; ξk, ξk 1), xk x

k E mk 2/vk

E h (1 2ην/

kvk)E xk x 2i + η2

k E([F(x ; ξk, ξk 1) F(xk; ξk, ξk 1)]/

Published in Transactions on Machine Learning Research (10/2023)

where we used the strong convexity of F(x; ξk, ξk 1). Now, we turn to bound

E([F(x ; ξk, ξk 1) F(xk; ξk, ξk 1)]/

With direct computations, we have the decomposition

E([F(x ; ξk, ξk 1) F(xk; ξk, ξk 1)]/

= E([F(x ; ξk, ξk 1) F(xk 1; ξk, ξk 1)]/

+ E([F(xk 1; ξk, ξk 1) F(xk; ξk, ξk 1)]/

k E [F(x ; ξk, ξk 1) F(xk; ξk, ξk 1)] (1/

consequently, E([F(x ; ξk, ξk 1) F(xk; ξk, ξk 1)]/

E( mk 1, F(xk 1; ξk, ξk 1)

vk 2 ) + DBη

k 1E F(xk 1; ξk, ξk 1) 2

δ E( mk 1 2

k 1vk 1 ) + DBη

where we used E([F(x ; ξk, ξk 1) F(xk 1; ξk, ξk 1)]/

vk 2) = [f(x ) f(xk 1)]/

vk 2 0. Letting η = B 2ν and βk := η

k k 1E mk 1

k k 1E F (xk 1;ξk,ξk 1) 2

vk 2 + 2DBη2

k )ak + βk.

With direct computations, we have

Notice that

kβk = O E mk 1

vk 1 + E F(xk 1; ξk, ξk 1) 2

vk 2 + E(1/

vk 1) + E mk 2

Published in Transactions on Machine Learning Research (10/2023)

we just need to compute P k=1 E F (xk 1;ξk,ξk 1) 2

vk 2 . By using Lemma 4, we have

E F(xk 1; ξk, ξk 1) 2/vk 2 2L2E xk xk 1 2/δ + 2E gk 2/vk 2.

With the fact

vk 2 E gk 2

vk + B2(E1/vk 2 E1/vk),

we then get X

k=1 E F(xk 1; ξk, ξk 1) 2

vk 2 = O(ln v K).

With Lemma 3, we can get

k=2 kβk = O(ln v K) = O(ln K) a K+1 = ln K

C Proofs of Results in Nonconvex Scenario

C.1 Additional Technical Lemmas

In the nonconvex case, we denote the following items to simplify the presentations of the technical lemmas

ˆAk := E h mk 2/vki ,

ˆBk := E f(xk), mk/(vk) 1 2 ,

ˆCk := θη ˆAk + 2(1 θ)E h gk 2/vki + 6(1 θ)B2E[1/(vk 2) 1 2 1/(vk) 1 2 ]

+(1 θ)(2L2/

δ + L2/δ + 1/2)E xk xk 1 2.

Lemma 6 Assume {xk}k 1 is generated by Algorithm 2, then we have

k=1 E gk 2/vk E ln vk

Lemma 7 Assume {xk}k 1 is generated by Algorithm 2 and the functions are nonconvex. Let K be the full space and Assumptions 1 and 2 hold, then the following result holds

ˆBk + (1 θ)

2 E f(xk) 2/(vk) 1 2 θ ˆBk 1 + ˆCk.

C.2 Proof of Theorem 3

According to Lemma 7, we have

k=1 E f(xk) 2/(vk) 1 2 ˆBK + (θ 1)

k=1 ˆCk + B2

The Lipschitz property of the gradients gives

Ef(xk+1) Ef(xk) E f(xk), xk+1 xk + LE xk+1 xk 2

2 = η ˆBk + Lη2

2 ˆAk. (19)

Published in Transactions on Machine Learning Research (10/2023)

Combining with (19), we get the following estimate

k=1 ˆAk + 2f(x1)

On the other hand, with Lemma 3, we have the following bound

k=1 ˆCk 2θη

k=1 ˆAk + 6B2

δ + L2/δ + 5/2)E ln(v K

Using [Lemma 2, (Li & Orabona, 2019)] and Lemma 6, we have

k=1 ˆAk E ln(v K

Substituting (22), (21) and (20) into (18), then we get

k=1 E f(xk) 2/(vk) 1 2

δ + L2/δ + 5/2 E ln(v K

δ ) + (7 6θ)B2

Notice that 1

(vk) 1 2 1 Ckα , then we have

k=1 E [ f(xk)]2/(vk) 1 2 1 (

1 kαC ) min 1 k K{E f(xk) 2}.

We then complete the proof by using the fact that 0 < 1 1 α 2 when 0 < α 1/2. Thus, we can get the following estimate

min 1 k K{E f(xk) 2} c3 + c4(K)

c3 := 2(7 6θ)B2C

δ + 4Cf(x1)

c4(K) := (1 + θ)η

δ + L2/δ + 5/2 ln(CK

D Proofs of the Technical Lemmas

D.1 Proof of Lemma 2

With the fact that mk = (1 θ) Pk j=1 θk jgj when k 1, we have

[mk]2/(vk) 1 2 =

i=1 |mk i /(vk) 1 4 |2

i=1 (1 θ)2|

j=1 θk jgj i /(vk) 1 4 |2

i=1 (1 θ)2(

j=1 θk j(vk) 1 2 )

j=1 θk j(gj i )2/(vk)

i=1 (1 θ)2 (vk) 1 2 1 θ

j=1 θk j(gj i )2/(vk)

j=1 θk j gj 2/(vk) 1 2 b) (1 θ)

j=1 θk j gj 2/(vj) 1 2

Published in Transactions on Machine Learning Research (10/2023)

where a) uses the Cauchy s inequality (Pk j=1 ajbj)2 (Pk j=1 a2 j) (Pk j=1 b2 j) with aj = θ k j

2 (vk) 1 4 and

2 gj i /(vk) 1 2 ; b) is due to vj vk when j k. Thus, we are led to

j=1 θk j gj 2/(vj) 1 2 =

k=j θk j gj 2/(vj) 1 2

k=j θk j gj 2/(vj) 1 2 1 1 θ

j=1 gj 2/(vj) 1 2 .

Thus, we can get

k=1 E[ gk 2/(vk) 1 2 ].

Using Lemma 1 with ak = gk 2 and h( ) = 1 , we get

k=1 gk 2/(vk) 1 2

where we used the fact that v0 0. The proof is then completed.

D.2 Proof of Lemma 3

Similar to the proof of Lemma 2, we have

xk+1 xk 2 = Proj K(xk ηmk) xk 2 = Proj K(xk ηmk) Proj K(xk) 2

mk/(vk) 1 2 2 =

i=1 |mk i /(vk) 1 2 |2

i=1 (1 θ)2|

j=1 θk jgj i /(vk) 1 2 |2

i=1 (1 θ)2(

j=1 θk j (gj i )2

i=1 (1 θ)2 1 1 θ

j=1 θk j(gj i )2/(vk)

j=1 θk j gj 2/(vk) b)= (1 θ)

j=1 θk j gj 2/vj

where a) uses the fact that (Pk j=1 ajbj)2 Pk j=1 a2 j Pk j=1 b2 j with aj = θ k j

2 and bj = θ k j

2 gj i /(vk) 1 2 , and b) is due to vj vk when j k. Thus, we can get

k=1 xk xk 1 2

k=1 gk 2/vk.

Using Lemma 1 with ak = gk 2 and h( ) = 1

, we are then led to

k=1 gk 2/vk ln vk

The proof is then completed.

Published in Transactions on Machine Learning Research (10/2023)

D.3 Proof of Lemma 4

Direct computations give us

E gk 2 = E F(xk; ξk, ξk 1) 2

= E F(xk; ξk, ξk 1) F(xk 1; ξk, ξk 1) + F(xk 1; ξk, ξk 1) 2

= E F(xk 1; ξk, ξk 1) 2 + E F(xk; ξk, ξk 1) F(xk 1; ξk, ξk 1) 2

+ 2E F(xk; ξk, ξk 1) F(xk 1; ξk, ξk 1), F(xk 1; ξk, ξk 1)

2E F(xk 1; ξk, ξk 1) 2 E F(xk; ξk, ξk 1) F(xk 1; ξk, ξk 1) 2

2E F(xk 1; ξk, ξk 1) 2 L2E xk xk 1 2,

where a) uses the inequality 2E F(xk; ξk, ξk 1) F(xk 1; ξk, ξk 1), F(xk 1; ξk, ξk 1) 1

2E F(xk 1; ξk, ξk 1) 2 2E F(xk; ξk, ξk 1) F(xk 1; ξk, ξk 1) 2. Thus we can get

E F(xk 1; ξk, ξk 1) 2 2L2E xk xk 1 2 + 2E gk 2.

D.4 Proof of Lemma 5

The convexity of fik(x) with respect to x and gk = F(xk; ξk, ξk 1) gives us

E x xk, gk E[F(x ; ξk, ξk 1) F(xk; ξk, ξk 1)]

= E[F(xk 1; ξk, ξk 1) F(xk; ξk, ξk 1) + F(x ; ξk, ξk 1) F(xk 1; ξk, ξk 1)]. (24)

The Lipchitz gradient continuity of F(x; ξk, ξk 1) with respect to x indicates

E[F(xk 1; ξk, ξk 1) F(xk; ξk, ξk 1)]

2 E xk xk 1 2 + E xk 1 xk, F(xk; ξk, ξk 1)

2 E xk xk 1 2 + E xk 1 xk, F(xk 1; ξk, ξk 1)

+ E xk 1 xk, F(xk; ξk, ξk 1) F(xk 1; ξk, ξk 1)

2 E xk xk 1 2 + E xk 1 xk, F(xk 1; ξk, ξk 1) ,

where we used the Cauchy s inequality E xk 1 xk, F(xk; ξk, ξk 1) F(xk 1; ξk, ξk 1) L xk 1 xk 2. Because ξk, ξk 1 are independent of xk 1,

E(F(x ; ξk, ξk 1) F(xk 1; ξk, ξk 1)) = f(x ) f(xk 1). (26)

Turning back to (24), we get

E x xk, gk E[F(xk 1; ξk, ξk 1) F(xk; ξk, ξk 1)] + f(x ) f(xk 1). (27)

Notice that xk 1 is independent of (ξk, ξk 1),

E xk 1 xk, F(xk 1; ξk, ξk 1)

= E xk 1 xk, f(xk 1)

+ E xk 1 xk, F(xk 1; ξk, ξk 1) E F(xk 1; ξk, ξk 1)

E xk 1 xk, f(xk 1) + 1

2E F(xk 1; ξk, ξk 1) E F(xk 1; ξk, ξk 1) 2

Published in Transactions on Machine Learning Research (10/2023)

where we used |E X, Y | E X 2 + E Y 2 with X = 4

vk 1(xk 1 xk), and Y = [ F(xk 1; ξk, ξk 1) E F(xk 1; ξk, ξk 1)]/ 4

vk 1. Now, we turn to the upper bound of 1

2E F (xk 1;ξk,ξk 1) E F (xk 1;ξk,ξk 1) 2

1 2E F(xk 1; ξk, ξk 1) E F(xk 1; ξk, ξk 1) 2

2E F(xk 1; ξk, ξk 1) E F(xk 1; ξk, ξk 1) 2

2E h ( F(xk 1; ξk, ξk 1) E F(xk 1; ξk, ξk 1) 2) ( 1

2 E F(xk 1; ξk, ξk 1) 2

vk 2 + 2B2E( 1

a) L2E xk xk 1 2 + E gk 2

vk 2 + 2B2E( 1

where ϕk := E gk 2

vk + 2B2E( 1

vk 1 ) + L2E xk xk 1 2

vk ), and a) depends on Lemma 4. Thus, we have

E xk 1 xk, F(xk 1; ξk, ξk 1)

E xk 1 xk, f(xk 1) + 1

vk 1 + ϕk. (30)

Once with the Lipchitz property,

f(xk 1), xk 1 xk f(xk 1) f(xk) + L

2 xk xk 1 2. (31)

Combing (31) and (30), we then get

E xk 1 xk, F(xk 1; ξk, ξk 1) f(xk 1) f(xk)

2 xk xk 1 2 + 1

vk 1 + ϕk. (32)

Substituting (32) into (25),

E[F(xk 1; ξk, ξk 1) F(xk; ξk, ξk 1)] 2LE xk xk 1 2

+ f(xk 1) f(xk) + 1

vk 1 + ϕk. (33)

Substituting (33) into (27), we then get

E x xk, gk 2LE xk xk 1 2 + f(x ) f(xk) + 1

vk 1 + ϕk. (34)

According to our algorithm and we denote Λ := E( x xk, gk ), then we have

E x xk, mk = E x xk, θmk 1 + (1 θ)gk

= (1 θ) Λ + θE x xk, mk 1

= (1 θ) Λ + θE x xk 1, mk 1 + θE xk xk 1, mk 1

b) (1 θ) Λ + θE x xk 1, mk 1 + ηθE mk 1 2/(vk 1) 1 2 ,

where b) depends on that xk xk 1, mk 1 xk xk 1 mk 1 = Proj K(xk 1 ηmk) Proj K(xk 1) mk 1 mk 1 2/(vk 1) 1 2 . Then, we get

Bk (1 θ)E x xk, gk + θBk 1 + ηθAk 1. (35)

Substituting (34) into (35), we then proved the desired result.

Published in Transactions on Machine Learning Research (10/2023)

D.5 Proof of Lemma 6

This proof is identical to the proof of Lemma 3 and will not be reproduced.

D.6 Proof of Lemma 7

Notice that E F(xk 1; ξk, ξk 1) = f(xk 1), we then get

E f(xk)/(vk)1/2, gk

= E f(xk)/(vk)1/2, F(xk; ξk, ξk 1)

= E f(xk) 2/(vk)1/2 + E f(xk)/(vk) 1 2 , f(xk) f(xk 1)

+ E f(xk)/(vk) 1 2 , f(xk 1) F(xk 1; ξk, ξk 1) | {z } :=( )

(vk) 1 2 , F(xk 1; ξk, ξk 1) F(xk; ξk, ξk 1) .

The Cauchy s inequality together with the smooth assumption gives us

E f(xk)/(vk) 1 2 , f(xk) f(xk 1)

|E f(xk)/(vk) 1 4 , [ f(xk 1) f(xk)]/(vk) 1 4 |

4E f(xk) 2/(vk) 1 2 + E f(xk 1) f(xk) 2/(vk) 1 2

4E f(xk) 2/(vk) 1 2 + L2E xk 1 xk 2/(vk) 1 2 .

Similarly, the Lipschitz property of the stochastic gradient yields

E f(xk)/(vk) 1 2 , F(xk 1; ξk, ξk 1) F(xk; ξk, ξk 1)

(vk) 1 4 , [ F(xk 1; ξk, ξk 1) F(xk; ξk, ξk 1)]

4E f(xk) 2/(vk) 1 2 + L2E xk 1 xk 2/(vk) 1 2 .

Substituting (38) and (37) into (36), we are then led to

E f(xk)/(vk) 1 2 , gk E f(xk) 2/(vk) 1 2 + ( )

2E( f(xk) 2/(vk) 1 2 ) + 2L2E( xk 1 xk 2/(vk) 1 2 ). (39)

Now, we turn to bound ( ):

( ) = E f(xk 1)/(vk 1) 1 2 , f(xk 1) F(xk 1; ξk, ξk 1) | {z } =0

+ E D f(xk)/(vk) 1 2 f(xk 1)/(vk) 1 2

+ f(xk 1)/(vk) 1 2 f(xk 1)/(vk 1) 1 2 , f(xk 1) F(xk 1; ξk, ξk 1) E

E [ f(xk) f(xk 1)]/(vk) 1 2 , f(xk 1) F(xk 1; ξk, ξk 1)

+ 2B2E[1/(vk 2) 1 2 1/(vk) 1 2 ]

E xk xk 1 2

2 + 2B2E[1/(vk 2) 1 2 1/(vk) 1 2 ]

2E f(xk 1) F(xk 1; ξk, ξk 1) 2/vk.

Published in Transactions on Machine Learning Research (10/2023)

Furthermore, we have

1 2E f(xk 1) F(xk 1; ξk, ξk 1) 2/vk

2E f(xk 1) F(xk 1; ξk, ξk 1) 2/vk 2

2E f(xk 1) F(xk 1; ξk, ξk 1) 2(1/vk 1/vk 2)

Lemma 4 L2E xk xk 1 2/vk 2 + 2E gk 2/vk 2 + 2B2E(1/vk 1/vk 2)

δ E xk xk 1 2 + 2E gk 2/vk + 4B2E(1/vk 1/vk 2).

Thus, (39) can also be bounded as

E f(xk)/(vk) 1 2 , gk 1

2E f(xk) 2/(vk) 1 2

+ 6B2E[1/(vk 2) 1 2 1/(vk) 1 2 ] + 2E gk 2/vk

δ + L2/δ + 1/2)E xk xk 1 2.

We also use a shorthand notation Λ := E( f(xk)/(vk) 1 2 , gk ) and then

E f(xk), mk/(vk) 1 2

= E f(xk)/(vk) 1 2 , θmk 1 + (1 θ)gk

= (1 θ) Λ + θE f(xk)/(vk) 1 2 , mk 1

a) (1 θ) Λ + θE f(xk 1)/(vk 1) 1 2 , mk 1 + θηE mk 1 2/vk 1. (41)

where a) uses the Cauchy-Schwarz inequality and the Lipschitz property of f and vk 1 vk. Substituting (41) into (40), we then proved the result.

E Ablation study of the parameter θ

In this section, we investigate the influence of the parameter θ on the convergence rate of AOGD. The experimental setup adheres to the one detailed in Section 3, focusing on the hinge loss applied to the ijcnn1 dataset.

We fix the parameters η and R at values 1.0 and 10.0, respectively, while varying θ over the set {0.1, 0.3, 0.5, 0.7, 0.9}. The rationale behind these chosen values for η and R stems from optimization results observed at θ = 0.9, aligning with the configurations presented in Table 2 and Fig. 2.

As depicted in Figure 4, the convergence rate of AOGD across different θ values is displayed. Notably, the convergence rate appears robust to the specific selection of θ with slightly better performance observed at an intermediate θ value, specifically θ = 0.7.

Published in Transactions on Machine Learning Research (10/2023)

0 10 20 30 Iterations (log scale)

θ = 0.1 θ = 0.3 θ = 0.5 θ = 0.7 θ = 0.9

Figure 4: The effect of θ on the convergence rate of AOGD with hinge loss on ijcnn1 dataset. The result suggests AOGD achieves robust performance across different θ values.