# nonsmooth_weaklyconvex_finitesum_coupled_compositional_optimization__f7f6d7ec.pdf

Non-Smooth Weakly-Convex Finite-sum Coupled Compositional Optimization

Quanqi Hu, Department of Computer Science, Texas A&M University, College Station, TX 77843, quanqi-hu@tamu.edu
Dixian Zhu, Department of Genetics, Stanford University, Stanford, CA 94305, dixian-zhu@stanford.edu
Tianbao Yang, Department of Computer Science, Texas A&M University, College Station, TX 77843, tianbao-yang@tamu.edu

Abstract. This paper investigates new families of compositional optimization problems, called non-smooth weakly-convex finite-sum coupled compositional optimization (NSWC FCCO). There has been a growing interest in FCCO due to its wide-ranging applications in machine learning and AI, as well as its ability to address the shortcomings of stochastic algorithms based on empirical risk minimization. However, current research on FCCO presumes that both the inner and outer functions are smooth, limiting their potential to tackle a more diverse set of problems. Our research expands on this area by examining non-smooth weakly-convex FCCO, where the outer function is weakly convex and non-decreasing, and the inner function is weakly convex. We analyze a single-loop algorithm and establish its complexity for finding an ϵ-stationary point of the Moreau envelope of the objective function. Additionally, we extend the algorithm to solving novel non-smooth weakly-convex tri-level finite-sum coupled compositional optimization problems, which feature a nested arrangement of three functions. Lastly, we explore the applications of our algorithms in deep learning for two-way partial AUC maximization and multi-instance two-way partial AUC maximization, using empirical studies to showcase the effectiveness of the proposed algorithms.

1 Introduction

In this paper, we consider two classes of non-convex compositional optimization problems. The first class is formulated as follows:

    min_{w ∈ R^d} F(w) := (1/n) Σ_{i ∈ S} f_i( E_{ξ∼D_i}[ g_i(w; ξ) ] ),    (1)

where S denotes a finite set of n items and D_i denotes a distribution that could depend on i. The second class is given by:

    min_{w ∈ R^d} F(w) := (1/n_1) Σ_{i ∈ S_1} f_i( (1/n_2) Σ_{j ∈ S_2} g_i( E_{ξ∼D_{i,j}}[ h_{i,j}(w; ξ) ] ) ),    (2)

where S_1 denotes a finite set of n_1 items, S_2 denotes a finite set of n_2 items, and D_{i,j} denotes a distribution that could depend on (i, j). For simplicity of discussion, we denote by g_i(w) = E_{ξ∼D_i}[g_i(w; ξ)] : R^d → R^{d_1} and by h_{i,j}(w) = E_{ξ∼D_{i,j}}[h_{i,j}(w; ξ)] : R^d → R^{d_2}. For both classes of problems, we focus our attention on non-convex F with non-smooth non-convex functions f_i and g_i, which, to the best of our knowledge, has not been studied in any prior works.

The first problem (1) with smooth functions f_i and g_i has been explored in previous works [26, 15, 21, 33] and is known as finite-sum coupled compositional optimization (FCCO). It is subtly different from standard stochastic compositional optimization (SCO) [27] and conditional stochastic optimization (CSO) [13]. FCCO has been successfully applied to optimizing a wide range of X-risks [33] with convergence guarantees, including smooth surrogate losses of areas under the curves [20] and ranking measures [21], listwise losses [21], and contrastive losses [36]. The second problem (2) is a novel class and is referred to as tri-level finite-sum coupled compositional optimization (TCCO).
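To make the coupled structure in (1) concrete, the following toy numerical sketch (our own illustration with hypothetical choices f_i(u) = max(u, 0) and g_i(w; ξ) = ξ^T w − b_i, not taken from the paper) evaluates such an objective by Monte-Carlo approximation of the inner expectations:

```python
# Toy sketch of the FCCO objective (1): assumed choices, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 5, 3, 10_000               # blocks i, dimension, Monte-Carlo samples per D_i
w = rng.normal(size=d)
centers = rng.normal(size=(n, d))    # D_i = N(centers[i], I), a toy per-index distribution
b = rng.normal(size=n)

def f(u):                            # outer non-smooth, non-decreasing function f_i
    return np.maximum(u, 0.0)

def g(i, w, xi):                     # inner function g_i(w; xi), coupled with i via b_i
    return xi @ w - b[i]

def F(w):
    # F(w) = (1/n) * sum_i f( E_{xi ~ D_i}[ g_i(w; xi) ] ), inner expectation approximated
    vals = []
    for i in range(n):
        xi = centers[i] + rng.normal(size=(m, d))    # samples from D_i
        vals.append(f(np.mean(g(i, w, xi))))         # f is applied to the *inner mean*
    return float(np.mean(vals))

print(F(w))
```

The defining feature is that the outer non-linear f_i is applied to the expectation of g_i over a distribution D_i that is coupled with the index i, which is what makes unbiased (sub)gradient estimation non-trivial.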
Both problems differ from traditional two-level or multi-level compositional optimization due to the coupling of the variables i, ξ in (1), or of the variables i, j, ξ in (2), at the innermost level. One limitation of prior works on non-convex FCCO is that their convergence analysis relies heavily on the smoothness of f_i and g_i [26, 15]. This raises the question of whether existing techniques can be leveraged for solving non-smooth non-convex FCCO problems with non-asymptotic convergence guarantees. Non-smooth non-convex FCCO and TCCO problems have important applications in ML and AI, e.g., group distributionally robust optimization [4] and two-way partial AUC maximization for deep learning [44]. We defer discussions and formulations of these problems to Section 5.

The difficulty in solving smooth FCCO lies in the high cost of computing a stochastic gradient ∇g_i(w)^T ∇f_i(g_i(w)) for a randomly sampled i and of computing the overall gradient ∇F(w). To approximate the stochastic gradient, a variance-reduced estimator of g_i(w_t), denoted by u_{i,t}, is usually maintained and updated for the sampled data i ∈ B_t in the mini-batch. As a result, the stochastic gradient can be approximated by ∇g_i(w_t; ξ_t)^T ∇f_i(u_{i,t}), where ξ_t ∼ D_i is a random sample. The overall gradient can be estimated by averaging the stochastic gradient estimator over the mini-batch or by using variance-reduction techniques. A key insight of the convergence analysis for smooth FCCO is to bound the following error using the L-smoothness of f_i, which reduces to bounding the error of u_{i,t} for estimating g_i(w_t):

    ‖∇g_i(w_t; ξ_t)^T ∇f_i(u_{i,t}) − ∇g_i(w_t; ξ_t)^T ∇f_i(g_i(w_t))‖^2 ≤ ‖∇g_i(w_t; ξ_t)‖^2 L^2 ‖u_{i,t} − g_i(w_t)‖^2.

A central question to be addressed in this paper is: "Can these gradient estimators be used in stochastic optimization for solving non-smooth non-convex FCCO with a provable convergence guarantee?" To address this question, we focus our attention on a specific class of FCCO/TCCO called non-smooth weakly-convex (NSWC) FCCO/TCCO. This approach aligns with many established works on NSWC optimization [6-9]. Nevertheless, NSWC FCCO/TCCO is more complex than a standard weakly-convex optimization problem because an unbiased stochastic subgradient is not readily accessible. In addition, the convergence measure in terms of the gradient norm of smooth non-convex objectives is not applicable to weakly convex optimization, which complicates the analysis involving the biased stochastic gradient estimator ∇g_i(w_t; ξ_t)^T ∂f_i(u_{i,t})¹.

Contributions. A major contribution of this paper is to present a novel convergence analysis of single-loop stochastic algorithms for solving NSWC FCCO and TCCO problems, respectively. In particular,

• For non-smooth FCCO, we analyze the following single-loop updates:

    w_{t+1} = w_t − η (1/B) Σ_{i ∈ B_t} ∇g_i(w_t; ξ_t)^T ∂f_i(u_{i,t}),    (3)

where B_t is a random mini-batch of B items, and u_{i,t} is an appropriate variance-reduced estimator of g_i(w_t) that is updated only for i ∈ B_t at the t-th iteration. To overcome the non-smoothness, we adopt the tool of the Moreau envelope of the objective as in previous works [6, 7]. The key difference of our convergence analysis from previous ones for smooth FCCO is that we bound the inner product E_i[⟨∇g_i(w_t)^T ∂f_i(u_{i,t}), ŵ_t − w_t⟩], where ŵ_t is the solution of the proximal mapping of the objective at w_t. To this end, specific conditions on f_i, g_i are imposed, i.e., f_i is weakly convex and non-decreasing and g_i(w) is weakly convex, under which we establish an iteration complexity of T = O(ϵ^{-6}) for finding an ϵ-stationary point of the Moreau envelope of F(·).
• For non-smooth TCCO, we analyze the following single-loop updates:

    w_{t+1} = w_t − η (1/(B_1 B_2)) Σ_{i ∈ B^1_t} Σ_{j ∈ B^2_t} ∇h_{i,j}(w_t; ξ_t)^T ∇g_i(v_{i,j,t})^T ∂f_i(u_{i,t}),    (4)

where B^1_t and B^2_t are random mini-batches of B_1 and B_2 items, respectively, u_{i,t} is an appropriate variance-reduced estimator of (1/n_2) Σ_{j ∈ S_2} g_i(h_{i,j}(w_t)) that is updated only for i ∈ B^1_t, and v_{i,j,t} is an appropriate variance-reduced estimator of h_{i,j}(w_t) that is updated only for i ∈ B^1_t, j ∈ B^2_t. To prove the convergence, we impose conditions on f_i, g_i, h_{i,j}, i.e., f_i is weakly convex and non-decreasing, g_i(·) is weakly convex and non-decreasing (or monotone), and h_{i,j} is weakly convex (or smooth), and establish an iteration complexity of T = O(ϵ^{-6}) for finding an ϵ-stationary point of the Moreau envelope of F(·).

• We extend the above algorithms to solving (multi-instance) two-way partial AUC maximization for deep learning, and conduct extensive experiments to verify the effectiveness of both algorithms.

¹ We use ∇ to denote the gradient of a differentiable function and ∂ to denote a subgradient of a non-smooth function.

Table 1: Comparison with prior works for solving (1) and (2). In the monotonicity column, the notation ↑ means the given function is required to be non-decreasing. If not specified, the given function is only required to be monotone.

Method | Objective | Smoothness | Weak Convexity | Monotonicity | Complexity
SOX [26] | (1) | f_i, g_i | none | none | O(ϵ^{-4})
MSVR [15] | (1) | f_i, g_i | none | none | O(ϵ^{-3})
SONX (Ours) | (1) | none | f_i, g_i | f_i ↑ | O(ϵ^{-6})
SONT (Ours) | (2) | none | f_i, g_i, h_{i,j} | f_i ↑, g_i ↑ | O(ϵ^{-6})
SONT (Ours) | (2) | h_{i,j} | f_i, g_i | f_i ↑, g_i | O(ϵ^{-6})

2 Related work

Smooth SCO. There are many studies on two-level smooth SCO [27, 38, 10, 19, 3, 28] and multi-level smooth SCO [32, 1, 39]. The complexity of finding an ϵ-stationary point for two-level smooth SCO has been improved from O(ϵ^{-5}) [27] to O(ϵ^{-3}) [19], and that for multi-level smooth SCO has been improved from a level-dependent complexity of O(ϵ^{-(7+K)/2}) [32] to a level-independent complexity of O(ϵ^{-3}) [1], where K is the number of levels. The improvements mostly come from using advanced variance reduction techniques for estimating each level's function or its Jacobian and for estimating the overall gradient. Two stochastic algorithms have been developed in [13] for CSO, but they suffer from the limitation of requiring large batch sizes.

Smooth FCCO. FCCO was first introduced in [20] for optimizing average precision. Its algorithm and convergence analysis were improved in [26] and [15]. The former work [26] proposed an algorithm named SOX that uses a moving average (MA) to estimate the inner function values and the overall gradient. In the smooth non-convex setting, SOX is proved to achieve an iteration complexity of O(ϵ^{-4}). The latter work [15] proposed a novel multi-block-single-probe variance reduced (MSVR) estimator for estimating the inner function values, which helps achieve a lower iteration complexity of O(ϵ^{-3}). Recently, [11] proposed an extrapolation-based estimator for the inner function, which yields a method with a complexity that matches MSVR when n is on the order of ϵ^{-2/3}. These techniques have been employed for optimizing various X-risks, including contrastive losses [36], ranking measures and listwise losses [21], and other objectives [26, 15]. However, all of these prior works assume the smoothness of f_i and g_i. Hence, their analysis is not applicable to NSWC FCCO problems. Our novel analysis of a simple algorithm for NSWC FCCO problems yields an iteration complexity of O(ϵ^{-6}) when using the MSVR estimators of the inner functions.
The comparison with [26, 15] is shown in Table 1.

Non-smooth Weakly Convex Optimization. The analysis of weakly convex optimization with unbiased stochastic subgradients was pioneered by [6, 7]. Optimization of compositional functions that are weakly convex has been tackled in earlier works [8, 9], where the inner function is deterministic or does not involve coupling between two random variables. A closely related work to our NSWC FCCO is weakly-convex concave minimax optimization [22]. Assuming f_i is convex, (1) can be written as:

    min_w max_{π ∈ R^n} (1/n) Σ_{i ∈ S} ⟨π_i, g_i(w)⟩ − f_i^*(π_i),

where f_i^*(·) is the convex conjugate of f_i. It can be solved using existing methods [22, 31, 41, 43, 17], but with several limitations: (i) the algorithms in [22, 31, 41, 43] have a comparable complexity of O(1/ϵ^6) but have unnecessary double loops, which require setting the number of iterations for the inner loop; (ii) the algorithm in [17] is single-loop but has a worse complexity of O(1/ϵ^8); (iii) these existing algorithms and analyses do not account for the complexity of updating all coordinates of π, which could be prohibitive in many applications; (iv) these approaches are not applicable to NSWC FCCO/TCCO with weakly convex f_i. In fact, the double-loop algorithm has been leveraged and extended to solving the two-way partial AUC maximization problem, a special case of NSWC FCCO [44], by sampling and updating a batch of coordinates of π at each iteration. However, it is less practical and thus not implemented, and its analysis did not explicitly show the dependency of the convergence rate on n_+, n_− and the block batch size. A special case of the NSWC SCO problem was considered in [46], which is given by min_{x ∈ X} f(x, g(x)), with f(x, u) = E_ζ[u + κ max(0, g(x; ζ) − u)] and g(x) = E_ξ[g(x; ξ)]. They proposed two methods, SCS for smooth g(x) and SCS with SPIDER for non-smooth g(x). For both proposed methods, they proved a sample complexity of O(1/ϵ^6) for achieving an ϵ-stationary point of the objective's Moreau envelope². We would like to remark that the above problem with a non-smooth g(x) is a special case of NSWC FCCO with only a convex outer function, one block and no coupled structure. Nevertheless, their algorithm for non-smooth g(·) suffers from the limitation of requiring a large batch size in the order of O(1/ϵ^2) for achieving the same convergence. Finally, we would like to mention that non-smooth convex or strongly convex SCO problems have been considered in [27, 42, 26], which, however, are out of the scope of the present work.

3 Preliminaries

Let ‖·‖ be the Euclidean norm of a vector and the spectral norm of a matrix. We use Π_C[·] to denote the Euclidean projection onto {v ∈ R^m : ‖v‖ ≤ C}. For vectors, the inequality notations ≤, ≥, >, < are used to denote element-wise inequality. For an expectation function f(·) = E_ξ[f(·; ξ)], let f(·; B) = (1/|B|) Σ_{ξ ∈ B} f(·; ξ) be its stochastic unbiased estimator evaluated on a sample batch B. A stochastic unbiased estimator is said to have bounded variance σ^2 if E_ξ[‖f(·) − f(·; ξ)‖^2] ≤ σ^2. The Jacobian matrix of a function f : R^{m_1} → R^{m_2} has dimension R^{m_1 × m_2}. We recall the definition of the general subgradient and subdifferential following [6, 24].

Definition 3.1 (subgradient and subdifferential). Consider a function f : R^n → R ∪ {∞} and a point x with f(x) finite. A vector v ∈ R^n is a general subgradient of f at x if f(y) ≥ f(x) + ⟨v, y − x⟩ + o(‖y − x‖) as y → x. The subdifferential ∂f(x) is the set of subgradients of f at point x.
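As a quick sanity check of Definition 3.1 on a simple example (the function f(x) = |x| and the grid below are our own illustration, not from the paper), one can verify numerically which scalars satisfy the subgradient inequality at x = 0:

```python
# Numerical check of Definition 3.1 for the toy function f(x) = |x| at x = 0.
import numpy as np

f = np.abs                                # toy non-smooth function f(x) = |x|
x = 0.0
ys = np.linspace(-1e-3, 1e-3, 2001)       # grid of points y approaching x

def is_subgradient(v, tol=1e-9):
    # For this convex f the o(||y - x||) term can be dropped, so we simply check
    # the first-order inequality f(y) >= f(x) + v*(y - x) on nearby points.
    return bool(np.all(f(ys) >= f(x) + v * (ys - x) - tol))

print(is_subgradient(0.3))    # True:  0.3 lies in the subdifferential [-1, 1] at 0
print(is_subgradient(-1.0))   # True:  boundary of the subdifferential
print(is_subgradient(1.5))    # False: the inequality fails for y > 0
```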
For simplicity, we abuse the notation and also use ∂f(x) to denote one subgradient from the corresponding subgradient set when no confusion could be caused. We use ∂f(x; B) to represent a stochastic unbiased estimator of the subgradient ∂f(x) that is evaluated on a sample batch B. A function is called C^1-smooth if it is continuously differentiable. A function f = (f_1, . . . , f_{m_2}) : R^{m_1} → R^{m_2} is called monotone if, for all i ∈ {1, . . . , m_2}, f_i : R^{m_1} → R is monotone with respect to each element of the input. Note that if a Lipschitz continuous function f : O → R^{m_2} is assumed to be non-increasing (resp. non-decreasing), where the domain O ⊂ R^{m_1} is open, then all subgradients of f are element-wise non-positive (resp. non-negative). We refer the details to Appendix D.1. A function f is C-Lipschitz continuous if ‖f(x) − f(y)‖ ≤ C‖x − y‖. A differentiable function f is L-smooth if ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖. A function f : R^d → R ∪ {∞} is ρ-weakly-convex if the function f(·) + (ρ/2)‖·‖^2 is convex. A vector-valued function f : R^d → {R ∪ {∞}}^m is called ρ-weakly-convex if it is ρ-weakly-convex for each output.

It is difficult, sometimes impossible, to find an ϵ-stationary point of a non-smooth weakly-convex function F, i.e., a point with dist(0, ∂F(w)) ≤ ϵ. For example, an ϵ-stationary point of the function f(x) = |x| does not exist for 0 ≤ ϵ < 1 unless it is the optimal solution. To tackle this issue, [6] proposed to use the stationarity of the problem's Moreau envelope as the convergence metric, which has become a standard metric for solving weakly-convex problems [7, 22, 31, 41, 43, 17]. Given a weakly-convex function φ : R^m → R, its Moreau envelope and proximal map with λ > 0 are constructed as

    φ_λ(x) := min_y {φ(y) + (1/(2λ))‖y − x‖^2},    prox_{λφ}(x) := argmin_y {φ(y) + (1/(2λ))‖y − x‖^2}.

The Moreau envelope is an implicit smoothing of the original problem, and thus it is continuously differentiable. As a formal statement, the following lemma follows from standard results [6, 18].

Lemma 3.2. Given a ρ-weakly-convex function φ and λ < ρ^{-1}, the envelope φ_λ is C^1-smooth with gradient given by ∇φ_λ(x) = λ^{-1}(x − prox_{λφ}(x)).

Moreover, for any point x ∈ R^m, the proximal point x̂ := prox_{λφ}(x) satisfies [6]

    ‖x̂ − x‖ = λ‖∇φ_λ(x)‖,    φ(x̂) ≤ φ(x),    dist(0, ∂φ(x̂)) ≤ ‖∇φ_λ(x)‖.

Thus if ‖∇φ_λ(x)‖ ≤ ϵ, we can say x is close to a point x̂ that is ϵ-stationary, which is called a nearly ϵ-stationary solution of φ(x).

² It is notable that we use a slightly different definition of ϵ-stationary point with ‖∇F_ρ(w)‖^2 ≤ ϵ^2.

Algorithm 1 Stochastic Optimization algorithm for Non-smooth FCCO (SONX)
1: Initialization: w_0, {u_{i,0} : i ∈ S}.
2: for t = 0, . . . , T − 1 do
3:   Draw sample batches B^1_t ⊂ S, and B^2_{t,i} ∼ D_i for each i ∈ B^1_t.
4:   u_{i,t+1} = (1 − τ)u_{i,t} + τ g_i(w_t; B^2_{t,i}) + γ(g_i(w_t; B^2_{t,i}) − g_i(w_{t−1}; B^2_{t,i})) for i ∈ B^1_t, and u_{i,t+1} = u_{i,t} for i ∉ B^1_t.
5:   Compute G_t = (1/B_1) Σ_{i ∈ B^1_t} ∇g_i(w_t; B^2_{t,i})^T ∂f_i(u_{i,t})
6:   Update w_{t+1} = w_t − η G_t
7: end for

4 Algorithms and Convergence

4.1 Non-Smooth Weakly-Convex FCCO

In this section, we assume the following conditions hold for the FCCO problem (1).

Assumption 4.1. For all i ∈ S, we assume that: f_i is ρ_f-weakly-convex, C_f-Lipschitz continuous and non-decreasing; g_i(·) is ρ_g-weakly-convex and g_i(·; ξ) is C_g-Lipschitz continuous; the stochastic estimators g_i(w; ξ) and ∇g_i(w; ξ) have bounded variance σ^2.

Proposition 4.2. Under Assumption 4.1, F(w) in (1) is ρ_F-weakly convex with ρ_F = d_1 ρ_g C_f + ρ_f C_g^2.

One challenge in solving FCCO is the lack of access to an unbiased estimate of the subgradient (1/n) Σ_{i ∈ S} ∇g_i(w)^T ∂f_i(g_i(w)), due to the expectation form of g_i(w) inside the non-linear function f_i (a small numerical illustration of this bias is given below).
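The following toy computation (our own illustration; the choices f(u) = max(u, 0), the batch size and the noise level are hypothetical) shows the bias incurred by plugging a small mini-batch estimate of the inner value directly into a non-linear outer function:

```python
# Toy illustration of the bias of f(mini-batch mean of g) versus f(E[g]).
import numpy as np

rng = np.random.default_rng(0)
true_g = -0.1                        # true inner value g_i(w), slightly negative
sigma, batch, trials = 1.0, 4, 200_000

f = lambda u: np.maximum(u, 0.0)     # non-linear, non-decreasing outer function

# each trial simulates the mean of one small mini-batch estimate of g_i(w)
batch_means = true_g + sigma / np.sqrt(batch) * rng.standard_normal(trials)
print("f(E[g])          =", f(true_g))              # 0.0, the target value
print("E[f(batch mean)] =", f(batch_means).mean())  # noticeably > 0 for small batches
```

Maintaining and reusing tracked estimators u_i across iterations, as described next, is what controls this bias without requiring large batches.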
A common solution in existing works for solving smooth FCCO is to maintain function value estimators {u_i : i ∈ S} for {g_i(w) : i ∈ S}, and approximate the true gradient by a stochastic version (1/B_1) Σ_{i ∈ B_1} ∇g_i(w; B_2)^T ∇f_i(u_i) [26, 15], where B_1, B_2 are sampled mini-batches. Simply using a mini-batch estimator of g_i inside f_i does not ensure convergence if the mini-batch size is small. Inspired by existing algorithms for smooth FCCO, a simple method for solving non-smooth FCCO is presented in Algorithm 1, referred to as SONX. A key step is step 4, which uses the multi-block-single-probe variance reduced (MSVR) estimator proposed in [15] to update {u_i : i ∈ S} in a block-wise manner. It is an advanced variance-reduced update strategy for multi-block variables inspired by STORM [5]. In the MSVR update, for each sampled i ∈ B^1_t, u_{i,t} is updated following a STORM-like rule with a specialized parameter γ = (n − B_1)/(B_1(1 − τ)) + (1 − τ) for the error correction term. For the unsampled i ∉ B^1_t, no update of u_{i,t} is needed. When γ = 0, the estimator becomes the moving average estimator analyzed in [26] for smooth FCCO, which is also analyzed in the Appendix. With the function values {g_i(w_t) : i ∈ S} well estimated, the gradient can be approximated by G_t in step 5. Next, we directly update w_t by subgradient descent using the stochastic gradient estimator G_t. Note that, unlike existing works on smooth FCCO that often maintain a moving average estimator [26] or a STORM estimator [15] of the overall gradient to attain better rates, this is not possible in the non-smooth case, as those variance reduction techniques for the overall gradient critically rely on the Lipschitz continuity of ∇F, i.e., the smoothness of F.

4.2 Non-Smooth Weakly-Convex TCCO

In this section, we consider the non-smooth TCCO problem and aim to extend Algorithm 1 to solve it. First of all, for the convergence analysis and to ensure the weak convexity of F(w) in (2), we make the following assumptions.

Assumption 4.3. For all (i, j) ∈ S_1 × S_2, we assume that: f_i is C_f-Lipschitz continuous, ρ_f-weakly-convex and non-decreasing; g_i is ρ_g-weakly-convex and C_g-Lipschitz continuous; h_{i,j}(·; ξ) is C_h-Lipschitz continuous; either g_i is non-decreasing and h_{i,j} is L_h-weakly-convex, or g_i is monotone and h_{i,j} is L_h-smooth; the stochastic estimators h_{i,j}(w; ξ) and ∇h_{i,j}(w; ξ) have bounded variance σ^2, and ‖h_{i,j}(w)‖ ≤ C_h; E_j[‖g_i(v_j) − (1/n_2) Σ_{j ∈ S_2} g_i(v_j)‖^2] ≤ σ^2 for any v.

Algorithm 2 Stochastic Optimization algorithm for Non-smooth TCCO (SONT)
1: Initialization: w_0, {u_{i,0} : i ∈ S_1}, v_{i,j,0} = h_{i,j}(w_0; B^0_{3,i,j}) for all (i, j) ∈ S_1 × S_2.
2: for t = 0, . . . , T − 1 do
3:   Sample batches B^1_t ⊂ S_1, B^2_t ⊂ S_2, and B^3_{t,i,j} ∼ D_{i,j} for i ∈ B^1_t and j ∈ B^2_t.
4:   v_{i,j,t+1} = Π_{C_h}[(1 − τ_1)v_{i,j,t} + τ_1 h_{i,j}(w_t; B^3_{t,i,j}) + γ_1(h_{i,j}(w_t; B^3_{t,i,j}) − h_{i,j}(w_{t−1}; B^3_{t,i,j}))] for (i, j) ∈ B^1_t × B^2_t, and v_{i,j,t+1} = v_{i,j,t} otherwise.
5:   u_{i,t+1} = (1 − τ_2)u_{i,t} + (1/B_2) Σ_{j ∈ B^2_t}[τ_2 g_i(v_{i,j,t}) + γ_2(g_i(v_{i,j,t}) − g_i(v_{i,j,t−1}))] for i ∈ B^1_t, and u_{i,t+1} = u_{i,t} otherwise.
6:   G_t = (1/B_1) Σ_{i ∈ B^1_t} [ (1/B_2) Σ_{j ∈ B^2_t} ∇h_{i,j}(w_t; B^3_{t,i,j})^T ∇g_i(v_{i,j,t}) ] ∂f_i(u_{i,t})
7:   Update w_{t+1} = w_t − η G_t
8: end for

The weak convexity of F(w) in (2) is guaranteed by the following proposition.

Proposition 4.4. Under Assumption 4.3, F(w) in (2) is ρ_F-weakly-convex with ρ_F = d_1(d_2 L_h C_g + ρ_g C_h^2) C_f + ρ_f C_g^2 C_h^2.

We extend SONX to Algorithm 2 for solving (2), which is referred to as SONT. To deal with the extra layer of composition, we maintain another multi-block variable to track the extra layer of function value estimates; a compact sketch of this block-wise tracking pattern, shown for the simpler SONX loop, is given below.
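The sketch below is a minimal illustration (the oracle callables and their signatures are our own placeholders, not an API from the paper) of how the block-wise MSVR update and the composed subgradient step fit together in the SONX loop of Algorithm 1; SONT repeats the same pattern with one more tracked block variable v_{i,j,t} for the innermost values h_{i,j}(w_t).

```python
# Minimal sketch of the SONX loop (Algorithm 1) with scalar-valued inner functions g_i.
# val_g(i, w, batch), grad_g(i, w, batch), subgrad_f(i, u) and sample_batch(i, B2, rng)
# are user-supplied placeholders for a concrete problem.
import numpy as np

def sonx(val_g, grad_g, subgrad_f, sample_batch, w0, n, T, B1, B2, eta, tau, seed=0):
    rng = np.random.default_rng(seed)
    w, w_prev = w0.astype(float).copy(), w0.astype(float).copy()
    u = np.zeros(n)                                        # u_{i,0}, one block per i in S
    gamma = (n - B1) / (B1 * (1.0 - tau)) + (1.0 - tau)    # MSVR error-correction weight
    for _ in range(T):
        blocks = rng.choice(n, size=B1, replace=False)     # mini-batch B_1^t of indices
        grad_est = np.zeros_like(w)
        for i in blocks:
            batch = sample_batch(i, B2, rng)               # B_{2,i}^t drawn from D_i
            # following the indexing in Algorithm 1, the subgradient of f_i is
            # taken at u_{i,t}, i.e., before the MSVR update of this block
            grad_est += grad_g(i, w, batch) * subgrad_f(i, u[i])
            g_now, g_old = val_g(i, w, batch), val_g(i, w_prev, batch)
            u[i] = (1 - tau) * u[i] + tau * g_now + gamma * (g_now - g_old)
        grad_est /= B1
        w_prev, w = w, w - eta * grad_est                  # subgradient-type step on w
    return w
```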
To understand this, we first write down the true subgradient of (2):

    ∂F(w) ∋ (1/n_1) Σ_{i ∈ S_1} [ (1/n_2) Σ_{j ∈ S_2} ∇h_{i,j}(w)^T ∂g_i(h_{i,j}(w)) ] ∂f_i( (1/n_2) Σ_{j ∈ S_2} g_i(h_{i,j}(w)) ).

To approximate this subgradient, we need estimates of (1/n_2) Σ_{j ∈ S_2} g_i(h_{i,j}(w)) and of h_{i,j}(w), which can be tracked by MSVR estimators denoted by {u_{i,t} : i ∈ S_1} and {v_{i,j,t} : (i, j) ∈ S_1 × S_2}, respectively. As a result, a stochastic estimate of ∂F(w_t) is computed in step 6 of Algorithm 2, and the model parameter is updated similarly as before.

4.3 Convergence Analysis

In this section, we present a proof sketch of the convergence guarantee for Algorithm 1. The analysis for Algorithm 2 follows in a similar manner. The detailed proofs can be found in Appendix A (please refer to the supplement). Before starting the proof, we define a constant M^2 ≥ C_f^2 C_g^2 so that under Assumption 4.1 we have E_t[‖G_t‖^2] ≤ M^2. We then start by giving the error bound of the MSVR estimator in Algorithm 1. The following norm bound of the estimation error follows from the squared-norm error bound in Lemma 1 of [15], whose proof is given in Appendix D.3.

Lemma 4.5. Consider the update for {u_{i,t} : i ∈ S} in Algorithm 1 and assume g_i is C_g-Lipschitz for all i ∈ S. With γ = (n − B_1)/(B_1(1 − τ)) + (1 − τ) and τ ≤ 1,

    (1/n) Σ_{i ∈ S} E[‖u_{i,t+1} − g_i(w_{t+1})‖] ≤ (1 − B_1τ/(2n))^{t+1} (1/n) Σ_{i ∈ S} ‖u_{i,0} − g_i(w_0)‖ + 2τ^{1/2}σ/B_2^{1/2} + 4nC_gMη/(B_1τ^{1/2}).

For simplicity, denote by ŵ_t := prox_{F/ρ̂}(w_t). Then, using the definition of the Moreau envelope and the update rule of w_t, we can obtain a bound for the change in the Moreau envelope,

    E_t[F_{1/ρ̂}(w_{t+1})] ≤ F_{1/ρ̂}(w_t) + ρ̂η ⟨ŵ_t − w_t, E_t[G_t]⟩ + η^2 ρ̂ M^2/2,    (5)

where E_t[G_t] = (1/n) Σ_{i ∈ S} ∇g_i(w_t)^T ∂f_i(u_{i,t}) is the subgradient approximation based on the MSVR estimator u_{i,t} of the inner function value. This is a standard result in weakly-convex optimization [6]. To bound the inner product ⟨ŵ_t − w_t, E_t[G_t]⟩ on the right-hand side of (5), we apply the assumptions that f_i is weakly convex, Lipschitz continuous and non-decreasing, and that g_i is weakly convex. Its upper bound is given as follows:

    (ŵ_t − w_t)^T E_t[G_t] ≤ F(ŵ_t) − F(w_t) + (1/n) Σ_{i ∈ S} [ f_i(g_i(w_t)) − f_i(u_{i,t}) − ∂f_i(u_{i,t})^T (g_i(w_t) − u_{i,t}) + ρ_f ‖g_i(w_t) − u_{i,t}‖^2 + (ρ_g C_f/2 + ρ_f C_g^2) ‖ŵ_t − w_t‖^2 ].    (6)

Due to the ρ_F-weak convexity of F(w), the function w ↦ F(w) + (ρ̂/2)‖w_t − w‖^2 is (ρ̂ − ρ_F)-strongly convex. It then follows that F(ŵ_t) − F(w_t) ≤ (ρ_F/2 − ρ̂)‖w_t − ŵ_t‖^2. Combining this with inequalities (5), (6), and setting ρ̂ sufficiently large, we have

    E_t[F_{1/ρ̂}(w_{t+1})] ≤ F_{1/ρ̂}(w_t) + η^2 ρ̂ M^2/2 − (ρ̂^2 η/2)‖ŵ_t − w_t‖^2 + (ρ̂η/n) Σ_{i ∈ S} [ f_i(g_i(w_t)) − f_i(u_{i,t}) − ∂f_i(u_{i,t})^T (g_i(w_t) − u_{i,t}) + ρ_f ‖g_i(w_t) − u_{i,t}‖^2 ].    (7)

Recalling Lemma 3.2, we have ‖w_t − ŵ_t‖^2 = (1/ρ̂^2)‖∇F_{1/ρ̂}(w_t)‖^2. Moreover, the last three terms on the R.H.S. of inequality (7) can be bounded using the Lipschitz continuity of f_i and the error bound given in Lemma 4.5. We can then conclude the complexity of SONX with the following theorem.

Theorem 4.6. Under Assumption 4.1, with γ = (n − B_1)/(B_1(1 − τ)) + (1 − τ), τ = O(B_2 ϵ^4) ≤ 1/2, η = O(B_1 B_2^{1/2} ϵ^4/n), and ρ̂ = ρ_F + ρ_g C_f + 2ρ_f C_g^2, Algorithm 1 converges to an ϵ-stationary point of the Moreau envelope F_{1/ρ̂} in T = O(n/(B_1 B_2^{1/2}) · ϵ^{-6}) iterations.

Remark. Similar to the complexity for smooth FCCO problems [26, 15], Theorem 4.6 guarantees that SONX for NSWC FCCO has a parallel speed-up in terms of the batch size B_1 and a linear dependency on n. The dependency of the complexity on the batch size B_2 is due to the use of the MSVR estimator, which matches the results in [15]. If the MSVR estimator in SONX is replaced by the moving average estimator, the complexity becomes O(n/(B_1 B_2) · ϵ^{-8}) (cf. Appendix B). Following a similar proof strategy, the convergence guarantee of Algorithm 2 is given below.
Theorem 4.7 (Informal). Under Assumption 4.3, with appropriate values of γ_1, γ_2, τ_1, τ_2, η and a proper constant ρ̂, Algorithm 2 converges to an ϵ-stationary point of the Moreau envelope F_{1/ρ̂} in

    T = O( max{ 1/B_3^{1/2}, n_1^{1/4}/(B_1^{1/4} n_2^{1/4}), n_1^{1/2}/(B_1^{1/2} n_2^{1/2}) } · (n_1 n_2)/(B_1 B_2) · ϵ^{-6} )

iterations.

Remark. In the worst case, the complexity has a worse dependency on n_1/B_1, i.e., O(n_1^{3/2}/B_1^{3/2}). This is caused by the two layers of block-sampling updates for {u_{i,t} : i ∈ S_1} and {v_{i,j,t} : (i, j) ∈ S_1 × S_2}. When n_1 = B_1 = 1 and B_3 ≥ n_2, the complexity of SONT becomes similar to that of SONX, which is understandable as the inner two levels of TCCO are the same as FCCO.

5 Applications

NSWC FCCO finds important applications in group distributionally robust optimization (group DRO) and two-way partial AUC (TPAUC) maximization. Consider N groups with different distributions. Each group k has an average loss L_k(w) = (1/n_k) Σ_{i=1}^{n_k} ℓ(f_w(x_i^k), y_i^k), where w is the model parameter and (x_i^k, y_i^k) is a data point. It has been shown in a previous study [23] that the group DRO problem can be formulated as

    min_w min_s F(w, s) = (1/(Nα)) Σ_{k=1}^{N} [L_k(w) − s]_+ + s,

where α ∈ (0, 1] denotes the CVaR ratio in the formulation of [23]. This formulation can be mapped into non-smooth weakly-convex FCCO under certain assumptions. Due to the space limitation, we defer the comprehensive discussion of group DRO to Appendix E.

The rest of this section focuses on TPAUC maximization. Let X denote an input example and h_w(X) denote the prediction of a parameterized deep net on data X. Denote by S_+ the set of n_+ positive examples and by S_− the set of n_− negative examples. TPAUC measures the area under the ROC curve where the true positive rate (TPR) is higher than α and the false positive rate (FPR) is lower than an upper bound β. A surrogate loss for optimizing TPAUC with FPR ≤ β, TPR ≥ α is given by [34]:

    (1/(k_1 k_2)) Σ_{X_i ∈ S_+[1,k_1]} Σ_{X_j ∈ S_−[1,k_2]} ℓ(h_w(X_j) − h_w(X_i)),    (8)

where ℓ(·) is a convex, monotonically non-decreasing surrogate loss of the indicator function I(h_w(X_j) ≥ h_w(X_i)), S_+[1, k_1] is the set of positive examples with the k_1 = n_+α smallest scores, and S_−[1, k_2] is the set of negative examples with the k_2 = n_−β largest scores. To tackle the challenge of selecting examples from S_+[1, k_1] and S_−[1, k_2], the above problem is cast into the following [44]:

    min_{w, s', s} (1/n_+) Σ_{X_i ∈ S_+} f_i(ψ_i(w, s_i), s'),    (9)

where f_i(g, s') = s' + (g − s')_+/α, ψ_i(w, s_i) = (1/n_−) Σ_{X_j ∈ S_−} [ s_i + (ℓ(h_w(X_j) − h_w(X_i)) − s_i)_+/β ], and s = (s_1, . . . , s_{n_+}).

We will consider two scenarios, namely the regular learning scenario where X_i ∈ R^{d_0} is a single instance, and the multi-instance learning (MIL) scenario where X_i = {x_i^1, . . . , x_i^{m_i}} ⊂ R^{d_0} contains multiple instances (e.g., one patient has hundreds of high-resolution CT images). A challenge in MIL is that the number of instances m_i for each data item might be so large that it is difficult to load all instances into memory for mini-batch training. It becomes more nuanced especially because MIL involves a pooling operation that aggregates the predicted information of individual instances into a single prediction, which can usually be written as a compositional function whose inner function is an average over instances from X. For simplicity of exposition, below we consider the mean pooling h_w(X) = (1/|X|) Σ_{x ∈ X} e(w_e; x)^T w_c, where e(w_e; x) is the encoded feature representation of instance x with a parameter w_e, and w_c is the parameter of the classifier. We will map the regular learning problem to NSWC FCCO and the MIL problem to NSWC TCCO. The problem (9) is slightly more complicated than (1) or (2) due to the presence of s', s; a small numerical sketch of the component functions f_i and ψ_i is given below.
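As a concrete illustration of the component functions in (9), the toy scores, the squared-hinge surrogate and all parameter values below are our own choices, not the paper's experimental setup:

```python
# Toy evaluation of the TPAUC surrogate (9): psi_i over negative scores, then f_i.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, margin = 0.5, 0.5, 1.0
pos_scores = rng.uniform(0.4, 1.0, size=20)     # h_w(X_i) for X_i in S_+
neg_scores = rng.uniform(0.0, 0.6, size=200)    # h_w(X_j) for X_j in S_-

def ell(z):
    # squared-hinge surrogate of the indicator I(h_w(X_j) >= h_w(X_i))
    return np.maximum(margin + z, 0.0) ** 2

def psi_i(score_i, s_i):
    # psi_i(w, s_i) = (1/n_-) * sum_j [ s_i + (ell(h_w(X_j) - h_w(X_i)) - s_i)_+ / beta ]
    return np.mean(s_i + np.maximum(ell(neg_scores - score_i) - s_i, 0.0) / beta)

def f_i(g, s_prime):
    # f_i(g, s') = s' + (g - s')_+ / alpha : non-smooth, convex, non-decreasing in g
    return s_prime + np.maximum(g - s_prime, 0.0) / alpha

s = np.full(len(pos_scores), 0.5)               # per-positive variables s_i
s_prime = 0.5                                   # shared variable s'
objective = np.mean([f_i(psi_i(p, si), s_prime) for p, si in zip(pos_scores, s)])
print("TPAUC surrogate objective (9) on toy scores:", objective)
```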
In order to understand the applicability of our analysis and results to (9), we ignore s', s for a moment. In the regular learning setting, when h_w(X) = e(w_e; X)^T w_c can be directly computed, we can map the problem into NSWC FCCO, where f_i(g, s') is non-smooth, convex, and non-decreasing in terms of g, and g_i(w, s_i) = ψ_i(w, s_i) is non-smooth and is proved to be weakly convex when ℓ(·) is convex and h_w(X) is smooth in terms of w. In the MIL setting with mean pooling, we can map the problem into NSWC TCCO by defining h_i(w) = (1/|X_i|) Σ_{x ∈ X_i} e(w_e; x)^T w_c, h_{i,j}(w) = h_j(w) − h_i(w), g_i(h_{i,j}(w), s_i) = s_i + (ℓ(h_{i,j}(w)) − s_i)_+/β, and f_i(g_i, s') = s' + (g_i − s')_+/α, where f_i is non-smooth, convex, and non-decreasing in terms of g_i, g_i(h_{i,j}(w), s_i) is non-smooth, convex, and monotone in terms of h_{i,j}(w) when ℓ(·) is convex and monotonically non-decreasing, and g_i(h_{i,j}(w), s_i) is weakly convex in terms of w when h_{i,j}(w) is smooth and Lipschitz continuous in terms of w. Hence, the problem (9) satisfies the conditions in Assumption 4.1 in the regular learning setting and those in Assumption 4.3 in the MIL setting with mean pooling, under mild regularity conditions on the neural network. We present full details in Appendix C.1 for interested readers.

To compute the gradient estimator w.r.t. w, u_{i,t} is maintained for tracking g_i(w, s_i) in the regular setting, or (1/n_−) Σ_{X_j ∈ S_−} g_i(h_{i,j}(w), s_i) in the MIL setting, and v_{i,t} is maintained for tracking h_i(w) in the MIL setting; these are updated similarly to those in SONX and SONT. One difference from SONT is that v_{i,j,t} is decoupled into v_{i,t} and v_{j,t}, due to the fact that h_{i,j} can be decoupled. In terms of the extra variables s', s, the objective function is convex w.r.t. both s' and s, which allows us to simply update s' by SGD using the stochastic gradient estimator (1/B_1) Σ_{i ∈ B^1_t} ∂_{s'} f_i(u_{i,t}, s'_t), and to update s_i by SGD using the stochastic gradient estimator [ (1/B_2) Σ_{j ∈ B^2_t} ∂_{s_i} g_i(v_{j,t} − v_{i,t}, s_{i,t}) ] ∂_u f_i(u_{i,t}, s'_t). Detailed updates are presented in Algorithm 5 and Algorithm 6 in Appendix C.2. We can extend the convergence analysis of SONX and SONT to the two learning settings of TPAUC maximization, which is included in Appendix C.4. Finally, it is worth mentioning that we can also extend the results to other pooling operations, including smoothed-max pooling and attention-based pooling [45]. Due to the limit of space, we include these discussions in Appendix C.3 as well.

6 Experimental Results

We justify the effectiveness of the proposed SONX and SONT algorithms for TPAUC maximization in the regular learning setting and the MIL setting [14, 45].

Table 2: Testing TPAUC on molecule datasets (top) and on MIL datasets (bottom). The two numbers in parentheses in the header of each block refer to the lower bound of TPR and the upper bound of FPR used for evaluating TPAUC. The two numbers for each method are the mean TPAUC and its standard deviation.
Method | moltox21 (t0): (0.6, 0.4) | (0.5, 0.5) | molmuv (t1): (0.6, 0.4) | (0.5, 0.5) | molpcba (t0): (0.6, 0.4) | (0.5, 0.5)
CE | 0.067 (0.001) | 0.208 (0.001) | 0.161 (0.034) | 0.469 (0.018) | 0.095 (0.001) | 0.264 (0.001)
AUC-SH | 0.064 (0.008) | 0.217 (0.014) | 0.260 (0.130) | 0.444 (0.128) | 0.140 (0.003) | 0.312 (0.003)
AUC-M | 0.066 (0.009) | 0.209 (0.01) | 0.114 (0.079) | 0.433 (0.053) | 0.142 (0.009) | 0.313 (0.003)
MB | 0.067 (0.015) | 0.215 (0.023) | 0.173 (0.153) | 0.426 (0.118) | 0.095 (0.002) | 0.262 (0.003)
AW-poly | 0.064 (0.01) | 0.206 (0.025) | 0.172 (0.144) | 0.393 (0.123) | 0.110 (0.001) | 0.281 (0.002)
SOTA-s | 0.068 (0.018) | 0.23 (0.021) | 0.327 (0.164) | 0.526 (0.122) | 0.143 (0.001) | 0.314 (0.002)
SONX | 0.07 (0.035) | 0.252 (0.025) | 0.347 (0.175) | 0.575 (0.122) | 0.158 (0.006) | 0.335 (0.006)

Method | MUSK2: (0.5, 0.5) | (0.3, 0.7) | (0.1, 0.9) | Fox: (0.5, 0.5) | (0.3, 0.7) | (0.1, 0.9)
AUC-M (att) | 0.675 (0.1) | 0.783 (0.067) | 0.867 (0.036) | 0.032 (0.03) | 0.253 (0.098) | 0.444 (0.118)
MIDAM (smx) | 0.525 (0.2) | 0.667 (0.149) | 0.8 (0.097) | 0.048 (0.059) | 0.265 (0.119) | 0.449 (0.113)
MIDAM (att) | 0.6 (0.215) | 0.717 (0.135) | 0.819 (0.092) | 0.016 (0.032) | 0.249 (0.125) | 0.509 (0.065)
SOTAs (att) | 0.6 (0.267) | 0.683 (0.178) | 0.819 (0.097) | 0.024 (0.032) | 0.278 (0.059) | 0.477 (0.046)
SONT (att) | 0.7 (0.1) | 0.8 (0.067) | 0.867 (0.036) | 0.12 (0.131) | 0.343 (0.176) | 0.578 (0.119)

Method | Colon: (0.5, 0.5) | (0.3, 0.7) | (0.1, 0.9) | Lung: (0.5, 0.5) | (0.3, 0.7) | (0.1, 0.9)
AUC-M (att) | 0.576 (0.1) | 0.739 (0.061) | 0.803 (0.038) | 0.32 (0.181) | 0.609 (0.113) | 0.744 (0.082)
MIDAM (smx) | 0.646 (0.083) | 0.787 (0.04) | 0.863 (0.026) | 0.43 (0.195) | 0.68 (0.128) | 0.824 (0.055)
MIDAM (att) | 0.548 (0.253) | 0.738 (0.149) | 0.826 (0.102) | 0.544 (0.261) | 0.716 (0.189) | 0.815 (0.129)
SOTAs (att) | 0.772 (0.124) | 0.862 (0.073) | 0.911 (0.045) | 0.539 (0.153) | 0.745 (0.077) | 0.841 (0.049)
SONT (att) | 0.8 (0.166) | 0.875 (0.099) | 0.916 (0.065) | 0.639 (0.137) | 0.779 (0.041) | 0.865 (0.028)

Baselines. For regular TPAUC maximization, we compare SONX with the following competitive methods: 1) Cross Entropy (CE) loss minimization; 2) AUC maximization with the squared hinge loss (AUC-SH); 3) AUC maximization with the min-max margin loss (AUC-M) [37]; 4) the Mini-Batch based heuristic loss (MB) [16]; 5) the Adhoc-Weighting based method with a polynomial function (AW-poly) [35]; 6) a single-loop algorithm (SOTAs) for optimizing a smooth surrogate of TPAUC [44]. For MIL TPAUC maximization, we consider the following baselines: 1) AUC-M with attention-based pooling (AUC-M (att)); 2) SOTAs with attention-based pooling, which is a natural combination of advanced TPAUC optimization and an MIL pooling technique; 3) the recently proposed provable multi-instance deep AUC maximization methods with stochastic smoothed-max pooling and attention-based pooling (MIDAM (smx) and MIDAM (att)) [45]. The first two baselines use naive mini-batch pooling for computing the loss function in AUC-M and SOTAs. We implement SONT for MIL TPAUC maximization with attention-based pooling, which is referred to as SONT (att).

Datasets. For regular TPAUC maximization, we use three molecule datasets as in [44], namely moltox21 (the No. 0 target), molmuv (the No. 1 target) and molpcba (the No. 0 target) [29]. For MIL TPAUC maximization, we use four MIL datasets, including two tabular datasets, MUSK2 and Fox, and two medical image datasets, Colon and Lung. MUSK2 and Fox are two tabular datasets that have been widely adopted for MIL benchmark studies [14]. Colon and Lung are two histopathology (medical image) datasets that have large image size (512 × 512) but local regions of interest for classification [2].
For the Colon dataset, adenocarcinoma is regarded as the positive label and benign as the negative; for the Lung dataset, we treat adenocarcinoma as positive and squamous cell carcinoma as negative³. For both histopathology datasets, we uniformly randomly sample 100 positive and 1000 negative data points for the experiments. For all MIL datasets, we uniformly randomly split 10% as the testing set and the remaining as the training and validation sets. The statistics for all used datasets are summarized in Table 3 and Table 4 in Appendix F.

³ Data available: https://www.kaggle.com/datasets/biplobdey/lung-and-colon-cancer

Experiment Settings. For regular TPAUC maximization, we use the same setting as in [44]. The adopted backbone Graph Neural Network (GNN) model is the Graph Isomorphism Network (GIN), which has 5 mean-pooling layers with 64 hidden units and dropout rate 0.5 [30]. We utilize the sigmoid function in the final output layer to generate the prediction score, and set the surrogate loss ℓ(·) as the squared hinge loss with a margin parameter. We follow exactly the same model training and tuning setups as the prior work [44]. Essentially, the model is trained for 60 epochs and the learning rate is decreased 10-fold after every 20 epochs. The model is initialized from a model pretrained with the CE loss on the training datasets. We fix the learning rate of SONX as 1e-2 and the moving average parameter τ as 0.9; we tune the parameter γ in {0, 1e-1, 1e-2, 1e-3} and the parameters α, β in {0.1, 0.3, 0.5}, and fix the margin parameter of the surrogate loss ℓ as 1.0, which costs the same tuning effort as the other baselines. The weight decay is set to the same value (2e-4) as for the other baselines. For the baselines, we directly use the results reported in [44] since we use the same setting.

For MIL TPAUC maximization, we train a simple Feed-Forward Neural Network (FFNN) with one hidden layer (the number of neurons equal to the data dimension) for the two tabular datasets and a ResNet20 for the two medical image datasets. A sigmoid transformation is adopted for the output layer to generate the prediction score. The number of training epochs is fixed as 100 for all methods; the bag batch size is fixed as 16 (resp. 8) and the number of sampled instances per bag is fixed as 4 (resp. 128) for the tabular (resp. medical image) datasets; the learning rate is tuned in {1e-2, 1e-3, 1e-4} and decreased 10-fold at the end of the 50-th and 75-th epochs for all baselines. For SONT (att), we set the moving average parameters τ_1 = τ_2 as 0.9, tune the parameters γ_1 = γ_2 = γ in {0, 1e-1, 1e-2, 1e-3}, fix the margin parameter of the surrogate loss ℓ as 0.5, and tune the parameters α, β in {0.1, 0.5, 0.9}. Similar parameters in the baselines are set the same or tuned similarly. For all experiments, we utilize 5-fold cross-validation to evaluate the testing performance based on the best validation performance with a possible early-stopping choice.

[Figure 1: Training curves of SONX (left two panels: (a) molmuv (t1), (b) molpcba (t0)) and SONT (right two panels) for TPAUC maximization with different γ. The x-axis is the number of epochs and the y-axis is the TPAUC (0.5, 0.5).]

Results. The testing results for the regular and MIL TPAUC maximization with different TPAUC measures are summarized in Table 2. From Table 2, we observe that our method SONX achieves the best performance for regular TPAUC maximization.
It is better than the state-of-the-art method SOTAs for TPAUC maximization. We attribute the better performance of SONX to the fact that the objective of SONX is an exact estimator of TPAUC, while the smoothed objective of SOTAs is an inexact estimator of TPAUC. We also observe that SONT (att) achieves the best performance in all cases, which is not surprising since it is the only method that directly optimizes the TPAUC surrogate. In contrast, the other baselines either optimize a different objective (MIDAM) or do not ensure convergence due to the use of mini-batch pooling (AUC-M, SOTAs).

Ablation Study. We conduct ablation studies to demonstrate the effect of the error correction term on the training convergence by varying the γ value for SONX and SONT, where γ_1 = γ_2 = γ is set to the same value in SONT. The training convergence results are presented in Figure 1. We can see that an appropriate value of γ > 0 can yield faster convergence than γ = 0, which verifies the faster convergence of using MSVR estimators over moving average estimators. However, we do observe a gap between theory and practice, as setting a large value of γ > 1 as in the theory might not yield convergence. This phenomenon is also observed in [12]. We conjecture that the gap could be closed by considering convex objectives [40], which is left as future work.

7 Conclusions

In this paper, we have considered non-smooth weakly-convex two-level and tri-level finite-sum coupled compositional optimization problems. We presented a novel convergence analysis of two stochastic algorithms and established their complexities. Applications in deep learning to two-way partial AUC maximization were considered, and the strong performance of the proposed algorithms was demonstrated through experiments on multiple datasets. A future work is to prove the convergence of both algorithms for convex objectives.

Acknowledgements

We thank the anonymous reviewers for their constructive comments. Q. Hu, D. Zhu and T. Yang were partially supported by NSF Career Award 2246753 and NSF Grants 2246757, 2246756 and 2306572.

References

[1] Krishnakumar Balasubramanian, Saeed Ghadimi, and Anthony Nguyen. Stochastic multilevel composition optimization algorithms with level-independent convergence rates. SIAM Journal on Optimization, 32(2):519-544, 2022.

[2] Andrew A Borkowski, Marilyn M Bui, L Brannon Thomas, Catherine P Wilson, Lauren A DeLand, and Stephen M Mastorides. Lung and colon cancer histopathological image dataset (LC25000). arXiv preprint arXiv:1912.12142, 2019.

[3] Tianyi Chen, Yuejiao Sun, and Wotao Yin. Solving stochastic compositional optimization is nearly as easy as solving stochastic optimization. IEEE Transactions on Signal Processing, 69:4937-4948, 2021.

[4] Sebastian Curi, Kfir Y. Levy, Stefanie Jegelka, and Andreas Krause. Adaptive sampling for stochastic risk-averse learning. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1036-1047. Curran Associates, Inc., 2020.

[5] Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex SGD. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

[6] Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions, 2018.

[7] Damek Davis and Benjamin Grimmer. Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems.
SIAM Journal on Optimization, 29(3):1908 1930, 2019. [8] Dmitriy Drusvyatskiy and Courtney Paquette. Efficiency of minimizing compositions of convex functions and smooth maps. Math. Program., 178(1-2):503 558, 2019. [9] John C. Duchi and Feng Ruan. Stochastic methods for composite and weakly convex optimization problems. SIAM Journal on Optimization, 28(4):3229 3259, 2018. [10] S. Ghadimi, Andrzej Ruszczy nski, and Mengdi Wang. A single timescale stochastic approximation method for nested stochastic optimization. SIAM J. Optim., 30:960 979, 2020. [11] Lie He and Shiva Prasad Kasiviswanathan. Debiasing conditional stochastic optimization, 2023. [12] Quanqi Hu, Zi-Hao Qiu, Zhishuai Guo, Lijun Zhang, and Tianbao Yang. Blockwise stochastic variance-reduced methods with parallel speedup for multi-block bilevel optimization. In Proceedings of the 39th International Conference on Machine Learning, 2023. [13] Yifan Hu, Siqi Zhang, Xin Chen, and Niao He. Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning. Advances in Neural Information Processing Systems, 33, 2020. [14] Maximilian Ilse, Jakub Tomczak, and Max Welling. Attention-based deep multiple instance learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2127 2136. PMLR, 10 15 Jul 2018. [15] Wei Jiang, Gang Li, Yibo Wang, Lijun Zhang, and Tianbao Yang. Multi-block-single-probe variance reduced estimator for coupled compositional optimization, 2022. [16] Purushottam Kar, Harikrishna Narasimhan, and Prateek Jain. Online and stochastic gradient methods for non-decomposable loss functions. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS 14, page 694 702, Cambridge, MA, USA, 2014. MIT Press. [17] Tianyi Lin, Chi Jin, and Michael Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 6083 6093. PMLR, 13 18 Jul 2020. [18] J.J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93:273 299, 1965. [19] Qi Qi, Zhishuai Guo, Yi Xu, Rong Jin, and Tianbao Yang. An online method for a class of distributionally robust optimization with non-convex objectives. Advances in Neural Information Processing Systems, 34, 2021. [20] Qi Qi, Youzhi Luo, Zhao Xu, Shuiwang Ji, and Tianbao Yang. Stochastic optimization of area under precision-recall curve for deep learning with provable convergence. In Advances in neural information processing systems, volume abs/2104.08736, 2021. [21] Zi-Hao Qiu, Quanqi Hu, Yongjian Zhong, Lijun Zhang, and Tianbao Yang. Large-scale stochastic optimization of NDCG surrogates for deep learning with provable convergence. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 18122 18152. PMLR, 17 23 Jul 2022. [22] Hassan Rafique, Mingrui Liu, Qihang Lin, and Tianbao Yang. Weakly-convex concave min max optimization: provable algorithms and applications in machine learning. Optimization Methods and Software, 37(3):1087 1121, 2022. [23] R. T. Rockafellar and S. Uryasev. 
Optimization of conditional valueat-risk. Journal of Risk, 2(3), 1999. [24] R.T. Rockafellar, M. Wets, and R.J.B. Wets. Variational Analysis. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2009. [25] Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization, 2020. [26] Bokun Wang and Tianbao Yang. Finite-sum coupled compositional stochastic optimization: Theory and applications. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 23292 23317. PMLR, 17 23 Jul 2022. [27] Mengdi Wang, Ethan X Fang, and Han Liu. Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1-2):419 449, 2017. [28] Mengdi Wang, Ji Liu, and Ethan X. Fang. Accelerating stochastic composition optimization, 2016. [29] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. Molecule Net: a benchmark for molecular machine learning. Chemical science, 9(2):513 530, 2018. [30] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In 7th International Conference on Learning Representations, 2019. [31] Yan Yan, Yi Xu, Qihang Lin, Wei Liu, and Tianbao Yang. Optimal epoch stochastic gradient descent ascent methods for min-max optimization. In Advances in Neural Information Processing Systems 33 (Neur IPS), 2020. [32] Shuoguang Yang, Mengdi Wang, and Ethan X. Fang. Multi-level stochastic gradient methods for nested composition optimization, 2018. [33] Tianbao Yang. Algorithmic foundation of deep x-risk optimization. Co RR, abs/2206.00439, 2022. [34] Tianbao Yang and Yiming Ying. AUC maximization in the era of big data and AI: A survey. ACM Comput. Surv., 55(8):172:1 172:37, 2023. [35] Zhiyong Yang, Qianqian Xu, Shilong Bao, Yuan He, Xiaochun Cao, and Qingming Huang. When all we need is a piece of the pie: A generic framework for optimizing two-way partial auc. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11820 11829. PMLR, 18 24 Jul 2021. [36] Zhuoning Yuan, Yuexin Wu, Zi-Hao Qiu, Xianzhi Du, Lijun Zhang, Denny Zhou, and Tianbao Yang. Provable stochastic optimization for global contrastive learning: Small batch does not harm performance. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 25760 25782. PMLR, 17 23 Jul 2022. [37] Zhuoning Yuan, Yan Yan, Milan Sonka, and Tianbao Yang. Large-scale robust deep auc maximization: A new surrogate loss and empirical studies on medical image classification, 2021. [38] Junyu Zhang and Lin Xiao. A stochastic composite gradient method with incremental variance reduction. In Advances in Neural Information Processing Systems, pages 9075 9085, 2019. [39] Junyu Zhang and Lin Xiao. Multilevel composite stochastic optimization via nested variance reduction. SIAM J. Optim., 31(2):1131 1157, 2021. 
[40] Xuan Zhang, Necdet Serhat Aybat, and Mert Gurbuzbalaban. Robust accelerated primal-dual methods for computing saddle points. 2021. [41] Xuan Zhang, Necdet Serhat Aybat, and Mert Gurbuzbalaban. Sapd+: An accelerated stochastic method for nonconvex-concave minimax problems, 2023. [42] Zhe Zhang and Guanghui Lan. Optimal algorithms for convex nested stochastic composite optimization, 2022. [43] Renbo Zhao. A primal-dual smoothing framework for max-structured non-convex optimization, 2022. [44] Dixian Zhu, Gang Li, Bokun Wang, Xiaodong Wu, and Tianbao Yang. When AUC meets DRO: Optimizing partial AUC for deep learning with non-convex convergence guarantee. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 27548 27573. PMLR, 17 23 Jul 2022. [45] Dixian Zhu, Bokun Wang, Zhi Chen, Yaxing Wang, Milan Sonka, Xiaodong Wu, and Tianbao Yang. Provable multi-instance deep auc maximization with stochastic pooling. In International Conference on Machine Learning. PMLR, 2023. [46] Landi Zhu, Mert Gürbüzbalaban, and Andrzej Ruszczy nski. Distributionally robust learning with weakly convex losses: Convergence rates and finite-sample guarantees, 2023. A Proofs of Theorem 4.6 and Theorem 4.7 In this section, we provide the detailed proofs for Theorem 4.6 and Theorem 4.7. We first give a basic property for weakly-convex functions. Proposition A.1 (Proposition 2.1 in [7]). Suppose function g : Rd R { } is lowersemicontinuous. Then g is ρ-weakly-convex if and only if g(y) g(x) + v, y x ρ 2 y x 2 (10) holds for all vectors v g(x) and x, y Rd. A.1 Proof of Theorem 4.6 Note that the proof of Lemma 4.5 also implies the following squared-norm error bound, i S ui,t+1 gi(wt+1) 2 (1 B1τ i S ui,0 gi(w0) 2 + 4τσ2 B2 + 16n2C2 g M 2η2 Proof of Theorem 4.6. Define ˆwt := prox F/ ρ(wt). For a given i S, we have fi(gi( ˆwt)) fi(ui,t) (a) fi(ui,t) (gi( ˆwt) ui,t) ρf 2 gi( ˆwt) ui,t 2 fi(ui,t) (gi( ˆwt) ui,t) ρf gi( ˆwt) gi(wt) 2 ρf gi(wt) ui,t 2 fi(ui,t) (gi( ˆwt) ui,t) ρf C2 g ˆwt wt 2 ρf gi(wt) ui,t 2 (b) fi(ui,t) gi(wt) ui,t + gi(wt) ( ˆwt wt) ρg ρf C2 g ˆwt wt 2 ρf gi(wt) ui,t 2 (c) fi(ui,t) (gi(wt) ui,t) + fi(ui,t) gi(wt) ( ˆwt wt) (ρg Cf 2 + ρf C2 g) ˆwt wt 2 ρf gi(wt) ui,t 2 where (a) follows from the ρf-weak-convexity of fi, (b) follows from that fi( ) is non-decreasing and the weak convexity of gi, (c) is due to 0 fi(ui,t) Cf. Then it follows 1 n i S fi(ui,t) gi(wt) ( ˆwt wt) fi(gi( ˆwt)) fi(ui,t) fi(ui,t) (gi(wt) ui,t) + (ρg Cf 2 + ρf C2 g) ˆwt wt 2 + ρf gi(wt) ui,t 2 Now we consider the change in the Moreau envelope: Et[F1/ ρ(wt+1)] = Et h min w F( w) + ρ 2 w wt+1 2i Et h F( ˆwt) + ρ 2 ˆwt wt+1 2i = F( ˆwt) + Et 2 ˆwt (wt ηGt) 2 F( ˆwt) + ρ 2 ˆwt wt 2 + ρEt[η ˆwt wt, Gt ] + η2 ρM 2 = F1/ ρ(wt) + ρη ˆwt wt, Et[Gt] + η2 ρM 2 i S gi(wt) fi(ui,t), and the second inequality uses the bound of E[ Gt 2], which follows from the Lipschitz continuity and bounded variance assumptions and is denoted by M. Combining inequality 29 and 30 yields Et[F1/ ρ(wt+1)] F1/ ρ(wt) + η2 ρM 2 fi(gi( ˆwt)) fi(ui,t) fi(ui,t) (gi(wt) ui,t) + (ρg Cf 2 + ρf C2 g) ˆwt wt 2 + ρf gi(wt) ui,t 2 = F1/ ρ(wt) + η2 ρM 2 Fi( ˆwt) Fi(wt) + fi(gi(wt)) fi(ui,t) fi(ui,t) (gi(wt) ui,t) + (ρg Cf 2 + ρf C2 g) ˆwt wt 2 + ρf gi(wt) ui,t 2 Due to the ρF -weak convexity of Fi(w), we have ( ρ ρF )-strong convexity of w 7 Fi(w) + ρ 2 wt w 2. 
Then it follows Fi( ˆwt) Fi(wt) = Fi( ˆwt) + ρ 2 wt ˆwt 2 Fi(wt) + ρ 2 wt wt 2 ρ 2 ρ) wt ˆwt 2 (14) Plugging inequality 32 into inequality 31 yields Et[F1/ ρ(wt+1)] E[F1/ ρ(wt)] + η2 ρM 2 2 ρ) wt ˆwt 2 + fi(gi(wt)) fi(ui,t) fi(ui,t) (gi(wt) ui,t) 2 + ρf C2 g) ˆwt wt 2 + ρf gi(wt) ui,t 2 (15) Set ρ = ρF + ρg Cf + 2ρf C2 g. We have Et[F1/ ρ(wt+1)] F1/ ρ(wt) + η2 ρM 2 2 wt ˆwt 2 + fi(gi(wt)) fi(ui,t) fi(ui,t) (gi(wt) ui,t) + ρf gi(wt) ui,t 2 (a) F1/ ρ(wt) + η2 ρM 2 2 F1/ ρ(wt) 2 + ρη fi(gi(wt)) fi(ui,t) fi(ui,t) (gi(wt) ui,t) + ρf gi(wt) ui,t 2 where inequality (a) follows from Lemma 3.2. Using the Lipschitz continuity of fi, we have Et[F1/ ρ(wt+1)] F1/ ρ(wt) + η2 ρM 2 2 F1/ ρ(wt) 2 + ρη i S 2Cf gi(wt) ui,t i S ρf gi(wt) ui,t 2 By Lemma 4.5, the error bound of the MSVR update gives i S ui,t gi(wt) (1 µ)t 1 i S ui,0 gi(w0) + R1, i S ui,t gi(wt) 2 (1 µ)t 1 i S ui,0 gi(w0) 2 + R2, 2n , R1 = 2τ 1/2σ B1/2 2 + 4n Cg Mη B1τ 1/2 , R2 = 4τσ2 B2 + 16n2C2 g M 2η2 E[F1/ ρ(wt+1)] F1/ ρ(wt) + η2 ρM 2 2E[ F1/ ρ(wt) 2] i S gi(w0) ui,0 + R1 i S gi(w0) ui,0 2 + R2 Taking summation from t = 0 to T 1 yields E[F1/ ρ(w T )] F1/ ρ(w0) + η2 ρM 2T t=0 E[ F1/ ρ(wt) 2] t=0 (1 µ)t 1 i S gi(w0) ui,0 + R1T i S gi(w0) ui,0 2 + R2T (a) F1/ ρ(w0) + η2 ρM 2T t=0 E[ F1/ ρ(wt) 2] i S gi(w0) ui,0 + 2Cf ρηR1T + ρf ρη i S gi(w0) ui,0 2 + 2ρf ρηR2T, (17) where (a) uses PT 1 t=0 (1 µ)t 1 Lower bounding the left-hand-side by minw F(w), we obtain t=0 E[ F1/ ρ(wt) 2] F1/ ρ(w0) min w F(w) + η2 ρM 2T i S gi(w0) ui,0 + 2Cf ρηR1T i S gi(w0) ui,0 2 + ρf ρηR2T ηT + η ρM 2 + 4Cf ρ i S gi(w0) ui,0 + 4Cf ρR1 + 2ρf ρ i S gi(w0) ui,0 2 + 2ρf ρR2 µ) + C(η + R1 + R2) where we assume F1/ ρ(w0, s0, s 0) minw,s,s F(w, s, s ) and C = max{8 , 12 ρM 2, 16Cf ρ i S gi(w0) ui,0 , 8ρf ρ i S gi(w0) ui,0 2, 16Cf ρ, 8ρf ρ}. t=0 E[ F1/ ρ(wt) 2] B1τ ) + C(η + 2τ 1/2σ B1/2 2 + 4n Cg Mη B1τ 1/2 + 4τσ2 B2 + 16n2C2 g M 2η2 η + n B1τ ) + (η + τ 1/2σ B1/2 2 + nη B1τ 1/2 + τσ2 τ = O(B2ϵ4), η = O B1B1/2 2 ϵ4 To reach an ϵ-stationary point, we need B1B1/2 2 ϵ6 A.2 Proof of Theorem 4.7 A formal statement in given below. Theorem A.2. Under Assumption 4.3, with γ1 = n1n2 B1B2 B1B2(1 τ1) + (1 τ1), γ2 = n1 B1 B1(1 τ2) + (1 τ2), τ1 = O min{B3, B1/2 1 n1/2 2 n1/2 1 }ϵ4 1 2, τ2 = O(B2ϵ4) 1 2, η = O min B1/2 3 , B1/4 1 n1/4 2 n1/4 1 , B1/2 1 n1/2 2 n1/2 1 n1n2 ϵ4 , and ρ = ρF + 4ρf C2 g + 2ρg Cf C2 h + Cf Cg Lh, Algorithm 2 converges to an ϵ-stationary point of the Moreau envelope F1/ ρ in T = O max 1 B1/2 3 , n1/4 1 B1/4 1 n1/4 2 , n1/2 1 B1/2 1 n1/2 2 n1n2 B1B2 ϵ 6 iterations. We first define constant M 2 max{ 3C2 f C2 gσ2 B3 + 3C2 f C2 g C2 h B2 + 3C2 f C2 g C2 h B1 , C2 h+σ2} so that Et[ Gt 2] M 2 and vi,j,t 2 M 2 for all i S1, j S2 and t. Then to prove Theorem A.2, we need the following Lemmas. Lemma A.3. Consider MSVR update for v. Assume hi,j(w; ξ) is Ch-Lipshitz for all (i, j) S1 S2 , and E[ Gt 2] M 2. With γ1 = n1n2 B1B2 B1B2(1 τ1) + (1 τ1), and τ1 1 j S2 vi,j,t+1 hi,j(wt+1) 2n1n2 )t+1 1 j S2 vi,j,0 hi,j(w0) + 2τ 1/2 1 σ B1/2 3 + 4n1n2Ch Mη B1B2τ 1/2 1 j S2 vi,j,t+1 hi,j(wt+1) 2 2n1n2 )2(t+1) 1 j S2 vi,j,0 hi,j(w0) 2 + 4τ1σ2 B3 + 16n2 1n2 2C2 h M 2η2 Lemma A.4. Consider MSVR update for u. Assume gi( ) is Cg-Lipshitz for all i S1. With γ2 = n+ B1 B1(1 τ2) + (1 τ2) and τ2 1 i S1 ui,t+1 1 j S2 gi(vi,j,t+1) i S1 ui,0 1 j S2 gi(vi,j,0) + 2τ 1/2 2 σ B1/2 2 + C2n1/2 1 B1/2 2 τ1 B1/2 1 n1/2 2 τ 1/2 2 + C2n3/2 1 n1/2 2 η B3/2 1 B1/2 2 τ 1/2 2 where C2 is a constant defined in the proof. Proof of Theorem A.2. 
Consider the change in the Moreau envelope: Et[F1/ ρ(wt+1)] = Et h min w F( w) + ρ 2 w wt+1 2i Et h F( ˆwt) + ρ 2 ˆwt wt+1 2i = F( ˆwt) + Et 2 ˆwt (wt ηGt) 2 F( ˆwt) + ρ 2 ˆwt wt 2 + ρEt[η ˆwt wt, Gt ] + η2 ρM 2 = F1/ ρ(wt) + ρEt[η ˆwt wt, Gt ] + η2 ρM 2 j=1 hi,j(wt) gi(vi,j,t) fi (ui,t) , and the second inequality uses the bound of E[ Gt 2], which follows from the Lipschitz continuity and bounded variance assumptions and is denoted by M. Define ˆwt := prox F/ ρ(wt). For a given i {1, . . . , m}, we have 1 n1 j S2 gi(hi,j( ˆwt))) 1 i S1 fi(ui,t) i S1 fi(ui,t) ( 1 j S2 gi(hi,j( ˆwt)) ui,t) 1 j S2 gi(hi,j( ˆwt)) ui,t 2 i S1 fi(ui,t) ( 1 j S2 gi(hi,j( ˆwt)) ui,t) j S2 gi(hi,j( ˆwt)) 1 j S2 gi(vi,j,t) 2 1 j S2 gi(vi,j,t) ui,t 2 i S1 fi(ui,t) ( 1 j S2 gi(hi,j( ˆwt)) ui,t) 1 j S2 ρf C2 g hi,j( ˆwt) vi,j,t 2 j S2 gi(vi,j,t) ui,t 2 i S1 fi(ui,t) 1 j S2 gi(vi,j,t) ui,t + 1 j S2 gi(vi,j,t) (hi,j( ˆwt) vi,j,t) 2 hi,j( ˆwt) vi,j,t 2 1 j S2 2ρf C2 g hi,j(wt) vi,j,t 2 2ρf C2 g ˆwt wt 2 1 j S2 gi(vi,j,t) ui,t 2 i S1 fi(ui,t) 1 j S2 gi(vi,j,t) ui,t j S2 fi(ui,t) gi(vi,j,t) (hi,j( ˆwt) vi,j,t) | {z } A1 2 hi,j( ˆwt) vi,j,t 2 1 j S2 2ρf C2 g hi,j(wt) vi,j,t 2 2ρf C2 g ˆwt wt 2 1 j S2 gi(vi,j,t) ui,t 2 i S1 fi(ui,t) 1 j S2 gi(vi,j,t) ui,t j S2 fi(ui,t) gi(vi,j,t) (hi,j( ˆwt) vi,j,t) | {z } A1 j S2 (2ρf C2 g + ρg Cf) hi,j(wt) vi,j,t 2 (2ρf C2 g + ρg Cf C2 h) ˆwt wt 2 1 j S2 gi(vi,j,t) ui,t 2 (19) where (a) follows from the convexity of fi, (b) uses the assumption that fi( ) is non-decreasing and gi is weak convex, (c) is due to 0 fi(ui,t) Cf. The Lh-smoothness assumption of hi,j(w) (or weakly-convexity of hi,j(w), then only the second inequality holds) for all i, w implies hi,j( ˆwt) hi,j(wt) + hi,j(wt) ( ˆwt wt) + Lh 2 ˆwt wt 2, hi,j( ˆwt) hi,j(wt) + hi,j(wt) ( ˆwt wt) Lh 2 ˆwt wt 2. (20) We first assume that gi( ) is non-increasing. Since fi(ui,t) 0 and gi(vi,j,t) 0, we bound A1 as following A1 = fi(ui,t) gi(vi,j,t) (hi,j( ˆwt) vi,j,t) (a) fi(ui,t) gi(vi,j,t) (hi,j(wt) vi,j,t) + fi(ui,t) gi(vi,j,t) hi,j(wt) ( ˆwt wt) + fi(ui,t) gi(vi,j,t) Lh (b) Cf Cg hi,j(wt) vi,j,t + fi(ui,t) gi(vi,j,t) hi,j(wt) ( ˆwt wt) (21) where inequality (a) follows from the first inequality in (20), (b) follows from the Lipschitz continuity and monotone assumptions on fi, gi, hi,j. On the other hand, if we assume gi( ) is non-decreasing, we may use the second inequality in (20) and obtain the same result as (21). 
Now plugging the new formulation of A1 back to inequality 19 yields 1 n1 j S2 gi(hi,j( ˆwt))) 1 i S1 fi(ui,t) i S1 fi(ui,t) 1 j S2 gi(vi,j,t) ui,t j S2 Cf Cg hi,j(wt) vi,j,t j S2 fi(ui,t) gi(vi,j,t) hi,j(wt) ( ˆwt wt) Cf Cg Lh j S2 (2ρf C2 g + ρg Cf) hi,j(wt) vi,j,t 2 (2ρf C2 g + ρg Cf C2 h) ˆwt wt 2 1 j S2 gi(vi,j,t) ui,t j S2 gi(vi,j,t) ui,t j S2 Cf Cg hi,j(wt) vi,j,t + Et[Gt], ˆwt wt 1 j S2 (2ρf C2 g + ρg Cf) hi,j(wt) vi,j,t 2 (2ρf C2 g + ρg Cf C2 h + Cf Cg Lh 2 ) ˆwt wt 2 1 j S2 gi(vi,j,t) ui,t It follows Et[Gt], ˆwt wt j S2 gi(hi,j( ˆwt))) 1 i S1 fi(ui,t) + 1 j S2 gi(vi,j,t) ui,t j S2 Cf Cg hi,j(wt) vi,j,t + 1 j S2 (2ρf C2 g + ρg Cf) hi,j(wt) vi,j,t 2 + (2ρf C2 g + ρg Cf C2 h + Cf Cg Lh 2 ) ˆwt wt 2 + 1 j S2 gi(vi,j,t) ui,t Combining inequality 22 and 18 yields Et[F1/ ρ(wt+1)] F1/ ρ(wt) + η2 ρM 2 j S2 gi(hi,j( ˆwt))) fi(ui,t) j S2 gi(vi,j,t) ui,t j S2 Cf Cg hi,j(wt) vi,j,t j S2 (2ρf C2 g + ρg Cf) hi,j(wt) vi,j,t 2 + (2ρf C2 g + ρg Cf C2 h + Cf Cg Lh 2 ) ˆwt wt 2 + ρf j S2 gi(vi,j,t) ui,t F1/ ρ(wt) + η2 ρM 2 Fi( ˆwt) Fi(wt) + Fi(wt) fi( 1 j S2 gi(vi,j,t)) j S2 gi(vi,j,t)) fi(ui,t) + Cf j S2 gi(vi,j,t) ui,t j S2 Cf Cg hi,j(wt) vi,j,t + 1 j S2 (2ρf C2 g + ρg Cf) hi,j(wt) vi,j,t 2 + (2ρf C2 g + ρg Cf C2 h + Cf Cg Lh 2 ) ˆwt wt 2 + ρf j S2 gi(vi,j,t) ui,t (a) F1/ ρ(wt) + η2 ρM 2 Fi( ˆwt) Fi(wt) + 2Cf j S2 gi(vi,j,t) ui,t j S2 2Cf Cg hi,j(wt) vi,j,t + 1 j S2 (2ρf C2 g + ρg Cf) hi,j(wt) vi,j,t 2 + (2ρf C2 g + ρg Cf C2 h + Cf Cg Lh 2 ) ˆwt wt 2 + ρf j S2 gi(vi,j,t) ui,t where (a) follows from the Lipschitz continuity of fi, gi, hi,j. Due to the ρF -weak convexity of Fi(w), we have ( ρ ρF )-strong convexity of w 7 Fi(w) + ρ 2 wt w 2. Then it follows Fi( ˆwt) Fi(wt) = Fi( ˆwt) + ρ 2 wt ˆwt 2 Fi(wt) + ρ 2 wt wt 2 ρ 2 ρ) wt ˆwt 2 Plugging inequality 23 back into A.2, we obtain Et[F1/ ρ(wt+1)] F1/ ρ(wt) + η2 ρM 2 2 ρ) wt ˆwt 2 + 2Cf j S2 gi(vi,j,t) ui,t j S2 2Cf Cg hi,j(wt) vi,j,t + 1 j S2 (2ρf C2 g + ρg Cf) hi,j(wt) vi,j,t 2 + (2ρf C2 g + ρg Cf C2 h + Cf Cg Lh 2 ) ˆwt wt 2 + ρf j S2 gi(vi,j,t) ui,t (a) F1/ ρ(wt) + η2 ρM 2 2 wt ˆwt 2 + C1 j S2 gi(vi,j,t) ui,t j S2 C1 hi,j(wt) vi,j,t + 1 j S2 C1 hi,j(wt) vi,j,t 2 j S2 gi(vi,j,t) ui,t (b) = F1/ ρ(wt) + η2 ρM 2 2 F1/ ρ(wt) 2 + C1 ρη 1 j S2 gi(vi,j,t) ui,t j S2 hi,j(wt) vi,j,t + C1 ρη 1 j S2 hi,j(wt) vi,j,t 2 j S2 gi(vi,j,t) ui,t where in inequality (a) we use ρ = ρF + 4ρf C2 g + 2ρg Cf C2 h + Cf Cg Lh and C1 = max{2Cf Cg, 2Cf, (2ρf C2 g + ρg Cf), ρf}, and equality (b) uses Lemma 3.2. With general error bounds 1 n1 j S2 E[ hi,j(wt) vi,j,t ] (1 µ1)t 1 j S2 hi,j(w0) vi,j,0 + R1, j S2 E[ hi,j(wt) vi,j,t 2] (1 µ1)t 1 j S2 hi,j(w0) vi,j,0 2 + R2, i S1 E 1 n2 j S2 gi(vi,j,t) ui,t j S2 gi(vi,j,0) ui,0 i S1 E 1 n2 j S2 gi(vi,j,t) ui,t 2 (1 µ2)t 1 j S2 gi(vi,j,0) ui,0 we have E[F1/ ρ(wt+1)] E[F1/ ρ(wt)] + η2 ρM 2 2E[ F1/ ρ(wt) 2] + C1 ρη(1 µmin)t 1 j S2 gi(vi,j,0) ui,0 j S2 gi(vi,j,0) ui,0 j S2 hi,j(w0) vi,j,0 j S2 hi,j(w0) vi,j,0 2 + C1 ρη(R1 + R2 + R3 + R4), where µmin = min{µ1, µ2}. Taking summation from t = 0 to T 1 yields E[F1/ ρ(w T )] E[F1/ ρ(w0)] + η2 ρM 2T t=0 E[ F1/ ρ(wt) 2] + C1 ρη t=0 (1 µmin)t 0 + TC1 ρη(R1 + R2 + R3 + R4) E[F1/ ρ(w0)] + η2 ρM 2T t=0 E[ F1/ ρ(wt) 2] + C1 ρη 0 µmin + TC1 ρη(R1 + R2 + R3 + R4) where we use PT 1 t=0 (1 µmin)t 1 µmin and define constant 0 such that 1 j S2 gi(vi,j,0) ui,0 j S2 gi(vi,j,0) ui,0 j S2 hi,j(w0) vi,j,0 + 1 j S2 hi,j(w0) vi,j,0 2 0. 
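The summation step above uses the elementary geometric-series bound, which we record explicitly since the same telescoping argument is used in the other convergence proofs of this appendix: for any $\mu_{\min} \in (0, 1]$,
\[
\sum_{t=0}^{T-1}(1-\mu_{\min})^{t} \;\le\; \sum_{t=0}^{\infty}(1-\mu_{\min})^{t} \;=\; \frac{1}{\mu_{\min}} .
\]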
Then it follows 1 T t=0 E[ F1/ ρ(wt) 2] F1/ ρ(w0) E[F1/ ρ(w T )] + η2 ρM 2T 2 + C1 ρη 0 µmin + TC1 ρη(R1 + R2 + R3 + R4) ηT + η ρM 2 + 2C1 ρ 0 µmin T + 2C1 ρ(R1 + R2 + R3) η + 1 µmin ) + η + R1 + R2 + R3 + R4) where we define constant such that F1/ ρ(w0, s0, s 0) E[F1/ ρ(w T , s T , s T )] . With MSVR updates for vi,j,t and ui,t, following from Lemma A.3 and Lemma A.4, we have µ1 = B1B2τ1 2n1n2 , µ2 = B1τ2 2n1 , R1 = 2τ 1/2 1 σ B1/2 3 + 4n1n2 Ch Mη B1B2τ 1/2 1 B3 + 16n2 1n2 2Ch M 2η2 B2 1B2 2τ1 , R3 = 2τ 1/2 2 σ B1/2 2 + C2n1/2 1 B1/2 2 τ1 B1/2 1 n1/2 2 τ 1/2 2 + C2n3/2 1 n1/2 2 η B3/2 1 B1/2 2 τ 1/2 2 , B2 + C2 2n1B2τ 2 1 B1n2τ2 + C2 2n3 1n2η2 t=0 E[ F1/ ρ(wt) 2] η + 1 µmin ) + η + τ 1/2 1 B1/2 3 + τ1 B3 + τ 1/2 2 B1/2 2 + τ2 B1B2τ 1/2 1 + n2 1n2 2η2 B2 1B2 2τ1 + n1/2 1 B1/2 2 τ1 B1/2 1 n1/2 2 τ 1/2 2 + n3/2 1 n1/2 2 η B3/2 1 B1/2 2 τ 1/2 2 + n1B2τ 2 1 B1n2τ2 + n3 1n2η2 η + 1 µmin ) + τ 1/2 1 B1/2 3 + τ 1/2 2 B1/2 2 + n1n2η B1B2τ 1/2 1 + n1/2 1 B1/2 2 τ1 B1/2 1 n1/2 2 τ 1/2 2 + n3/2 1 n1/2 2 η B3/2 1 B1/2 2 τ 1/2 2 Algorithm 3 Stochastic Optimization algorithm for Non-smooth FCCO with coordinate moving average 1: Initialization: w0, {ui,0 : i S}. 2: for t = 0, . . . , T 1 do 3: Draw sample batches Bt 1 S, and Bt 2,i Di for each i Bt 1. 4: ui,t+1 = (1 τ)ui,t + τgi(wt; Bt 2,i), i Bt 1 ui,t, i Bt 1 5: Compute Gt = 1 B1 P i Bt 1 gi(wt; Bt 2,i) fi(ui,t) 6: Update wt+1 = wt ηGt 7: end for 8: return w t with uniformly sampled t {0, T 1}. min{B3, B1/2 1 n1/2 2 n1/2 1 }ϵ4 ! , τ2 = O(B2ϵ4), n1n2 τ 1/2 1 ϵ2, B3/2 1 B1/2 2 n3/2 1 n1/2 2 τ 1/2 2 B1/2 3 , B1/4 1 n1/4 2 n1/4 1 , B1/2 1 n1/2 2 n1/2 1 B1/2 3 , n1/4 1 B1/4 1 n1/4 2 , n1/2 1 B1/2 1 n1/2 2 ) n1n2 B1B2 ϵ 6 ! t=0 E[ F1/ ρ(wt) 2] ϵ2 B Solving Non-smooth FCCO and TCCO with Coordinate Moving Average In this section we consider solving non-smooth weakly-convex FCCO and TCCO without variance reduction method. To be specific, we use coordinate moving average updates for function values estimations instead of MSVR. This allows us to weaken the assumption on the Lipschitz continuity, i.e. the Lipschitz continuity of the stochastic function value estimation is not required, and can be replaced by the Lipschitz continuity of the function value. Moreover, compared with MSVR, coordinate moving average update does not need the stochastic evaluation from the previous iteration, and thus has a simpler implementation. However, as a result of not using variance reduction technique, the algorithms suffer from worse convergence rates in terms of ϵ. B.1 Solving Non-smooth FCCO with Coordinate Moving Average We first assume the followings assumptions hold. Assumption B.1. For all i S, we assume that fi( ) is ρf-weakly-convex, Cf-Lipschitz continuous and non-decreasing; gi( ) is ρg-weakly-convex and Cg-Lipschitz continuous; Stochastic gradient estimators gi(w; ξ) and gi(w; ξ) have bounded variance σ2. With coordinate moving average update, we present the following lemma of error bound. Lemma B.2. Consider the coordinate moving average update for {ui,t : i S1} in Algorithm 3, assume gi(w) is Cg-Lipschitz continuous for all i S1 and τ 1, then we have E[ ui,t+1 gi(wt+1) ] (1 B1τ 4n1 )t+1 ui,0 gi(w0) + 2 E[ ui,t+1 gi(wt+1) 2] (1 B1τ 4n1 )2(t+1) ui,0 gi(w0) 2 + 8τσ2 B2 + 32n2 1C2 g M 2η2 Then we have a convergence analysis similar to Theorem 4.6. Theorem B.3. 
Consider non-smooth weakly-convex FCCO problem, under Assumption B.1, setting τ = O(B2ϵ4) 1, η = O( B1B2 n1 ϵ6), Algorithm 3 converges to an ϵ-stationary point of the Moreau envelope F1/ ρ in T = O( n1 B1B2 ϵ 8) iterations. Proof of Theorem B.3. Since the only difference between SONX and Algorithm 3 is the update for {ui,t : i S1}, the proof of Theorem 4.6 still holds with the error bound replaced by Lemma B.2, i.e., i S ui,t+1 gi(wt+1) (1 µ)t+1 1 i S ui,0 gi(w0) + R1, i S ui,t+1 gi(wt+1) 2 (1 µ)t+1 1 i S ui,0 gi(w0) 2 + R2, 4n1 , R1 = 2 B1τ , R2 = 8τσ2 B2 + 32n2 1C2 g M 2η2 Then proof proceeds to t=0 E[ F1/ ρ(wt) 2] O 1 µ) + η + R1 + R2 B1τ ) + η + τ 1/2σ B1/2 2 + n1η B2 + n2 1η2 τ = O(B2ϵ4), η = O(B1B2 then to reach a nearly ϵ-stationary point, Algorithm 3 needs T = O( n1 B1B2 ϵ 8) iterations. B.2 Solving Non-smooth TCCO with Coordinate Moving Average We first assume the following assumptions hold. Assumption B.4. For all (i, j) S1 S2, we assume that fi( ) is ρf-weakly-convex, Cf-Lipschitz continuous and non-decreasing; gi( ) is ρg-weakly-convex and Cg-Lipschitz continuous. hi,j( ) is differentiable and Ch-Lipschitz continuous. Either gi is monotone and hi,j( ) is Lh-smooth, or gi is non-decreasing and hi,j( ) is Lh-weaklyconvex. Stochastic estimators hi,j(w, ξ), hi,j(w, ξ) and gi(vi,j) have bounded variance σ2, and hi,j(w) Ch. With coordinate moving average update, we present the following lemmas of error bounds. Algorithm 4 Stochastic Optimization algorithm for Non-smooth TCCO with coordinate moving average 1: Initialization: w0, {ui,0 : i S1}, vi,j,0 = hi,j(w0; B0 3,i,j) for all (i, j) S1 S2. 2: for t = 0, . . . , T 1 do 3: Sample batches Bt 1 S1, Bt 2 S2, and Bt 3,i,j Di,j for i Bt 1 and j Bt 2. 4: vi,j,t+1 = (1 τ1)vi,j,t + τ1hi,j(wt; Bt 3,i,j), (i, j) Bt 1 Bt 2 vi,j,t, (i, j) Bt 1 Bt 2 5: ui,t+1 = ( (1 τ2)ui,t + 1 B2 P j Bt 2 τ2gi(vi,j,t), i Bt 1 ui,t, i Bt 1 6: Gt = 1 B1 P i Bt 2 hi,j(wt; Bt 3,i,j) gi(vi,j,t) fi(ui,t) i 7: Update wt+1 = wt ηGt 8: end for 9: return w t with uniformly sampled t {0, T 1}. Lemma B.5. Consider the coordinate moving average update for {vi,j,t : (i, j) S1 S2} in Algorithm 4, assume hi,j(w) is Ch-Lipschitz continuous for all (i, j) S1 S2 and τ1 1, then we have j S2 vi,j,t+1 hi,j(wt+1) 4n1n2 )t+1 1 n1n2 j S2 vi,j,0 hi,j(w0) + 2 j S2 vi,j,t+1 hi,j(wt+1) 2 4n1n2 )2(t+1) 1 n1n2 j S2 vi,j,0 hi,j(w0) 2 + 8τ1σ2 B3 + 32n2 1n2 2C2 h M 2η2 B2 1B2 2τ 2 1 . Lemma B.6. Consider the coordinate moving average update for {ui,t : i S1} in Algorithm 4, assume gi( ) is Cg-Lipschitz continuous for all i S1 and τ2 1, then we have i S1 ui,t+1 1 j S2 gi(vi,j,t+1) i S1 ui,0 1 j S2 gi(vi,j,0) + 2 2Cg Mn1/2 1 B1/2 2 τ1 B1/2 1 n1/2 2 τ2 , i S1 ui,t+1 1 j S2 gi(vi,j,t+1) 2 4n1 )2(t+1) 1 i S1 ui,0 1 j S2 gi(vi,j,0) 2 + 8τ2σ2 B2 + 32C2 g M 2n1B2τ 2 1 B1n2τ 2 2 . Then we have a convergence analysis similar to Theorem A.2. Theorem B.7. Consider non-smooth weakly-convex TCCO problem, under Assumption B.4, setting τ1 = O min B3ϵ4, B1/2 1 n1/2 2 n1/2 1 B1/2 2 B2ϵ6 1, τ2 = O(B2ϵ4) 1, η = O min B3ϵ4, B1/2 1 n1/2 2 n1/2 1 B1/2 2 B2ϵ6 B1B2 n1n2 ϵ2 , Algorithm 4 converges to an ϵ-stationary point of the Moreau envelope F1/ ρ in T = O max 1 B3 , n1/2 1 B1/2 1 B1/2 2 n1/2 2 ϵ 2 n1n2 B1B2 ϵ 8 iterations. Proof of Theorem B.7. 
Since the only difference between SONT and Algorithm 4 is the update for {ui,t : i S1} and {vi,j,t : (i, j) S1 S2}, the proof of Theorem A.2 still holds with the error bound replaced by Lemma B.5 and Lemma B.6, i.e., j S2 E[ hi,j(wt) vi,j,t ] (1 µ1)t 1 n1n2 j S2 hi,j(w0) vi,j,0 + R1, j S2 E[ hi,j(wt) vi,j,t 2] (1 µ1)t 1 n1n2 j S2 hi,j(w0) vi,j,0 2 + R2, i S1 E 1 n2 j S2 gi(vi,j,t) ui,t j S2 gi(vi,j,0) ui,0 i S1 E 1 n2 j S2 gi(vi,j,t) ui,t 2 (1 µ2)t 1 j S2 gi(vi,j,0) ui,0 µ1 = B1B2τ1 4n1n2 , µ2 = B1τ2 4n1 , R1 = 2 B3 + 32n2 1n2 2C2 h M 2η2 B2 1B2 2τ 2 1 , R3 = 2 2Cg Mn1/2 1 B1/2 2 τ1 B1/2 1 n1/2 2 τ2 , B2 + 32C2 g M 2n1B2τ 2 1 B1n2τ 2 2 Then the proof proceeds to t=0 E[ F1/ ρ(wt) 2] η + 1 µmin ) + η + R1 + R2 + R3 + R4 η + 1 µmin ) + η + τ 1/2 1 σ B1/2 3 + τ1σ2 B3 + τ 1/2 2 σ B1/2 2 + τ2σ2 B1B2τ1 + n2 1n2 2η2 B2 1B2 2τ 2 1 + n1/2 1 B1/2 2 τ1 B1/2 1 n1/2 2 τ2 + n1B2τ 2 1 B1n2τ 2 2 η + 1 µmin ) + τ 1/2 1 σ B1/2 3 + τ 1/2 2 σ B1/2 2 + n1n2η B1B2τ1 + n1/2 1 B1/2 2 τ1 B1/2 1 n1/2 2 τ2 B3ϵ4, B1/2 1 n1/2 2 n1/2 1 B1/2 2 B2ϵ6 )! , τ2 = O(B2ϵ4), B3ϵ4, B1/2 1 n1/2 2 n1/2 1 B1/2 2 B2ϵ6 ) B1B2 then to reach a nearly ϵ-stationary point, Algorithm 4 need ( 1 B3 , n1/2 1 B1/2 1 B1/2 2 n1/2 2 ϵ 2 ) n1n2 B1B2 ϵ 10 ! iterations. C Details for TPAUC Maximization C.1 Assumption Verification We first present two lemmas about the weak convexity of the objective in the regular learning setting and in the multi-instance learning setting with mean pooling. Lemma C.1. Consider the formulation in problem (9) in the regular learning setting and assume that function ℓ( ) is non-decreasing, CℓLipschitz continuous and ρℓ-weakly-convex, and function hw(Xi) is Ch Lipschitz continuous and ρh-weakly-convex. then the following statements are true: fi(g, s ) is convex and Cf-Lipschitz continuous w.r.t. (g, s ), and non-decreasing w.r.t. g. ψi(w, si) is ρψ-weakly-convex w.r.t. (w, si), and the stochastic estimator of the finite sum function value ψi(w, si) is Cψ-Lipschitz continuous w.r.t. (w, si). i S+ fi(ψi(w, si), s ) is ρF -weakly-convex w.r.t. (w, s, s ). Lemma C.2. Consider the formulation in problem (9) in the multi-instance learning setting with mean pooling, and assume that function hi(w) = 1 |Xi| P x Xi e(we; x) wc is Lh-smooth and is bounded by Ch, and hi(w; ξ) = e(we; ξ) wc is Ch-Lipschitz continuous and has bounded variance σ2, ℓis non-decreasing and Lℓ-weakly-convex, then the followings are true: fi(g, s ) is convex and Cf-Lipschitz-continuous w.r.t. (g, s ), and non-decreasing w.r.t. g; gi(v, si) = si + (ℓ(v) si)+ β is ρg-weakly convex and non-decreasing w.r.t. v, convex w.r.t. si, and Cg-Lipschitz continuous w.r.t. (v, si); hi,j(w) = hj(w) hi(w) is Lh-weakly-convex, and hi,j(w; ξ, ζ) is Ch-Lipschitz continuous; Xi S+ fi(gi(hi,j(w), si), s ) is ρF -weakly-convex w.r.t. (w, s, s ). C.1.1 Proof of Lemma C.1 Proof of Lemma C.1. The convexity of fi(g, s ) with respect to (g, s ) follows from the convexity definition. With subgradients s fi(g, s ) [1 1 α, 1], gfi(g, s ) [0, 1 α], we can see that fi(g, s ) is 1 α-Lipschitz continuous w.r.t. (g, s ), and non-decreasing w.r.t. u. We first show that ℓ(hw(Xj) hw(Xi)) is weakly-convex w.r.t. w. ℓ(h w(Xj) h w(Xi)) ℓ(hw(Xj) hw(Xi)) + ℓ(hw(Xj) hw(Xi)), (h w(Xj) h w(Xi)) (hw(Xj) hw(Xi)) 2 (h w(Xj) h w(Xi)) (hw(Xj) hw(Xi)) 2 (a) ℓ(hw(Xj) hw(Xi)) + ℓ(hw(Xj) hw(Xi)), hw(Xj) hw(Xi), w w + 2ρℓC2 h w w 2 where (a) uses the weak-convexity of hw(Xi) and hw(Xj), h w(Xj) hw(Xj) hw(Xj), w w ρh h w(Xi) + hw(Xi) hw(Xi), w w + ρh Thus ℓ(hw(Xj) hw(Xi)) is 4ρℓC2 h-weakly-convex w.r.t. w. 
By convexity of (ℓ, si) 7 si + (ℓ si)+ β , we have ψi( w, si) ψi(w, si) + ℓψi(w, si), ℓ(h w(Xj) h w(Xi)) ℓ(hw(Xj) hw(Xi)) + siψi(w, si), si si (a) ψi(w, si) + ℓψi(w, si) ℓ(hw(Xj) hw(Xi)) hw(Xj) hw(Xi), w w 2ρℓC2 h w w 2 + siψi(w, si), si si (b) ψi(w, si) + ℓψi(w, si) ℓ(hw(Xj) hw(Xi)) hw(Xj) hw(Xi), w w + siψi(w, si), si si 2ρℓC2 h β w w 2 where (a) follows from the monotonicity of ψi w.r.t. ℓand weak-convexity of ℓ(hw(Xj) hw(Xi)), and (b) is due to the Lipschitz continuity of (ℓ, si) 7 si + (ℓ si)+ β w.r.t. ℓ. Thus ψi is 4ρℓC2 h β -weaklyconvex w.r.t. (w, si). With a similar argument using the convexity and Lipschitz continuity of fi(g, s ) w.r.t. (g, s ) and the weak-convexity of ψi(w, si), we can show that fi(ψi(w, si), s ) is 4ρℓC2 h β -weakly-convex w.r.t. (w, si, s ). Thus, F(w, si, s ) is ρF = 4ρℓC2 h β -weakly-convex w.r.t. (w, s, s ). Now we show the Lipschitz continuity of ψi(w, si; Xj), i.e. an unbiased stochastic estimator of ψi(w, si). We have ψi(w, si; Xj) ψi( w, si; Xj) 2 = (si + (ℓ(hw(Xj) hw(Xi)) si)+ β ) ( si + (ℓ(h w(Xj) h w(Xi)) si)+ 2 si si 2 + 2 (ℓ(hw(Xj) hw(Xi)) si)+ β (ℓ(h w(Xj) h w(Xi)) si)+ 2 si si 2 + 2 β2 (8C2 ℓC2 h w w 2 + 2 si si 2) (2 + 4 + 16C2 ℓC2 h β2 )( w w 2 + si si 2). Thus ψi(w, si; Xj) is (2 + 4+16C2 ℓC2 h β2 )1/2-Lipschitz continuous w.r.t. (w, si). C.1.2 Proof of Lemma C.2 Proof of Lemma C.2. First of all, the convexity of fi(u, s ) w.r.t. (u, s ) and the convexity of gi(vij, si) w.r.t. (ℓ, si) directly follows from the convexity definition. Moreover, one can see from the formulation that s fi(g, s ) [1 1 α, 1], ufi(g, s ) [0, 1 α], ℓgi(vij, si) [1 1 β , 1], sigi(vij, si) [0, 1 β ]. Thus fi is Cf = 1 α-Lipschitz continuous w.r.t. (u, s ) and non-decreasing w.r.t. u, gi is 1 β -Lipschitz continuous w.r.t. (ℓ, si) and non-decreasing w.r.t. ℓ. Since ℓ( ) is nondecreasing, gi(vij, si) is non-decreasing w.r.t. vij. As a result of Proposition 4.2, gi(vij, si) is ρg = 1 β Lℓ-weakly-convex w.r.t. vij. Due to the composition structure and the Lipschitz continuity of gi and ℓ, one can see that gi(vij, si) is Cg = 1 β Cℓ-Lipschitz continuous w.r.t. (vij, si). The Lh = 2 Lh-weakly-convexity of hi,j(w) and Ch = 2 Ch-Lipschitz continuity of hi,j(w; ξ, ζ) directly follows from the Lh-smoothness of hi(w) and Ch-Lipschitz continuity of hi(w; ξ). Finally, we show the weakly-convexity of fi(gi(hi,j(w), si), s ): fi(gi(hi,j( w), si) s ) (a) fi(gi(hi,j(w), si), s ) + s fi(gi(hi,j(w), si), s ), s s + ufi(gi(hi,j(w), si), s ), gi(hi,j( w), si) gi(hi,j(w), si) (b) fi(gi(hi,j(w), si), s ) + s fi(gi(hi,j(w), si), s ), s s + ufi(gi(hi,j(w), si), s ), ℓgi(hi,j(w), si), ℓ(hi,j( w)) ℓ(hi,j(w)) + ufi(gi(hi,j(w), si)s ), sigi(hi,j(w), si), si si (c) fi(gi(hi,j(w), si), s ) + s fi(gi(hi,j(w), si), s ), s s + ufi(gi(hi,j(w), si), s ) ℓgi(hi,j(w), si) ℓ(hi,j(w)), hi,j( w) hi,j(w) 2 hi,j( w) hi,j(w) 2 + ufi(s , gi(hi,j(w), si)) sigi(hi,j(w), si), si si (d) fi(gi(hi,j(w), si), s ) + s fi(gi(hi,j(w), si), s ), s s + ufi(gi(hi,j(w), si), s ) ℓgi(hi,j(w), si) ℓ(hi,j(w)) hi,j(w), w w + ufi(gi(hi,j(w), si), s ) sigi(hi,j(w), si), si si (Cf Cg C2 h Lℓ 2 + Cf Cg Lh where (a) uses the convexity of fi, (b) uses the monotonicity of fi w.r.t. u and convexity of gi(ℓ, si) w.r.t. (ℓ, si), (c) uses monotonicity of fi w.r.t. u, monotonicity of gi w.r.t. ℓand Lℓ-weak-convexity of ℓ, (d) uses the smoothness of hi,j. Thus fi(gi(hi,j(w), si), s ) is ρF = (Cf Cg C2 h Lℓ+ Cf Cg Lh)- weakly-convex w.r.t. (w, si, s ). Therefore, 1 n+ P i S+ fi(gi(hi,j(w), si), s ) is ρF -weakly-convex w.r.t. (w, s, s ). 
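To make the compositional structure verified in Lemmas C.1 and C.2 concrete, the following NumPy sketch evaluates the TPAUC surrogate objective $\frac{1}{n_+}\sum_i f_i(\psi_i(w, s_i), s')$ for given prediction scores. The hinge surrogate $\ell(z) = (1+z)_+$ is our illustrative choice (any non-decreasing, convex, Lipschitz surrogate satisfying the assumptions could be substituted), and all function and variable names below are ours, not part of any released code.

```python
import numpy as np

def tpauc_surrogate(pos_scores, neg_scores, s, s_prime, alpha, beta,
                    loss=lambda z: np.maximum(1.0 + z, 0.0)):
    """Evaluate 1/n+ * sum_i f_i(psi_i, s') with
       f_i(g, s')    = s' + (g - s')_+ / alpha,
       psi_i(w, s_i) = 1/n- * sum_j [ s_i + (loss(h_j - h_i) - s_i)_+ / beta ].
    pos_scores: (n_pos,) scores h_w(X_i) of positive examples/bags
    neg_scores: (n_neg,) scores h_w(X_j) of negative examples/bags
    s:          (n_pos,) per-positive thresholds s_i;  s_prime: scalar threshold s'
    """
    # pairwise margins h_j - h_i for every (positive i, negative j)
    diffs = neg_scores[None, :] - pos_scores[:, None]            # (n_pos, n_neg)
    pair_loss = loss(diffs)                                      # ell(h_j - h_i)
    inner = s[:, None] + np.maximum(pair_loss - s[:, None], 0.0) / beta
    psi = inner.mean(axis=1)                                     # psi_i(w, s_i)
    outer = s_prime + np.maximum(psi - s_prime, 0.0) / alpha     # f_i(psi_i, s')
    return outer.mean()

# toy usage with synthetic scores
rng = np.random.default_rng(0)
val = tpauc_surrogate(rng.normal(1.0, 1.0, 20), rng.normal(-1.0, 1.0, 50),
                      s=np.zeros(20), s_prime=0.0, alpha=0.3, beta=0.3)
```

Minimizing this quantity jointly over (w, s, s') is the non-smooth compositional objective whose weak convexity Lemma C.1 establishes.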
C.2 Algorithms for TPAUC and Multi-instance TPAUC Maximization Algorithm 5 SONX for TPAUC 1: Initialization: w0, {ui,0 : i S+}, {si,0 : i S+}, s 0 2: for t = 0, . . . , T 1 do 3: Sample batches Bt 1 S+ and Bt 2 S . 4: ui,t+1 = (1 τ)ui,t + τψi(wt, si,t; Bt 2) + γ(ψi(wt, si,t; Bt 2) ψi(wt 1, si,t 1; Bt 2)), i Bt 1 ui,t, i Bt 1 5: si,t+1 = B1 sψi(wt, si,t; Bt 2) uf(ui,t, s t), i Bt 1 si,t, i Bt 1 6: s t+1 = s t η 1 i Bt 1 s f(ui,t, s t) 7: Compute Gt = 1 B1 P i Bt 1 wψi(wt, si,t; Bt 2) uf(ui,t, s t) 8: Update wt+1 = wt ηGt 9: end for 10: return w t with t uniformly sampled from {0, . . . , T 1}. Algorithm 6 SONT for Multi-instance TPAUC 1: Initialization: w0, {ui,0 : i S+}, {si,0 : i S+}, s 0, {vi,j,0 : (i, j) S+ S } 2: for t = 0, . . . , T 1 do 3: Sample batches Bt 1 S+, Bt 2 S , and Bt 3,i Xi for i Bt 1 Bt 2. 4: vi,t+1 = Π Ch[(1 τ1)vi,t + τ1hi(wt; Bt 3,i) + γ1(hi(wt; Bt 3,i) hi(wt 1; Bt 3,i))], i Bt 1 Π Ch[(1 τ1)vi,t + τ1hi(wt; Bt 3,i) + γ2(hi(wt; Bt 3,i) hi(wt 1; Bt 3,i))], i Bt 2 vi,t, i Bt 1 and i Bt 2 5: ui,t+1 = (1 τ2)ui,t + 1 B2 P j Bt 2[τ2g(vj,t vi,t, si,t) +γ3(g(vj,t vi,t, si,t) g(vj,t 1 vi,t 1, si,t 1))], i Bt 1 ui,t, i Bt 1 6: si,t+1 = ( si,t η1 1 j Bt 2 sig(vj,t vi,t, si,t) i uf(s t, ui,t), i Bt 1 si,t, i Bt 1 7: s t+1 = s t η2 1 B1 P i Bt 1 s f(ui,t, s t) 8: Gt = 1 B1 P i Bt 1 uf(ui,t, s t) h 1 B2 P j Bt 2 hj(wt; Bt 3,j) hi(wt; Bt 3,i) vg(vj,t vi,t, si,t) i 10: Update wt+1 = wt ηGt 11: end for 12: return w t with t uniformly sampled from {0, . . . , T 1}. C.3 TPAUC in MIL with smoothed-max pooling and attention-based pooling We can extend our results to smoothed-max pooling and attention-based pooling. Smoothed-max Pooling. The smoothed-max pooling can be written as [45]: hw(X) = τ log x X exp(ϕ(w; x)/τ) where τ > 0 is a hyperparameter and ϕ(w; x) = e(we, x) wc is the prediction score for instance x. We can see that hw(X) itself is a compositional function. To map the problem into TCCO, we define hi(w) = 1 |Xi| P x Xi exp(ϕ(w; x)/τ) + C, where C > 0 is a constant. Then the objective function becomes min w,s ,s 1 n+ Xi S+ fi(ψi(w, si), s ), where fi(g, s ) = s + (g s )+ ψi(w, si) = 1 Xj S si + (ℓ(τ log hj(w) τ log hi(w)) si)+ In this case we define gi(ℓ(v), si) = si + (ℓ(τ log v1 τ log v2) si)+ β and hi,j(w) = [hi(w), hj(w)]. We can still prove that gi(ℓ(v), si) is monotone w.r.t to each component of v. It is not difficult to prove that ℓ(τ log v1 τ log v2) is weakly convex w.r.t v because τ log v1 τ log v2 is a smooth mapping of v due to v C and ℓis a convex function [8]. As a result, since gi(ℓ, si) is non-decreasing and convex w.r.t to ℓ, it is easy to prove that gi(ℓ(v), si) is weakly convex w.r.t v and is monotone (either non-decreasing or non-increasing) w.r.t to each component of v. Hence, assuming hi(w) is a smooth and Lipchitz continuous function, we can prove that gi(hi,j(w), si) is weakly convex w.r.t. to w. Attention-based Pooling. Attention-based pooling was recently introduced for deep MIL [14], which aggregates the feature representations using attention, i.e., E(w; X) = X exp(g(w; x)) P x X exp(g(w; x ))e(we; x) (26) where g(w; x) is a parametric function, e.g., g(w; x) = w a tanh(V e(we; x)) + C, where V Rm do and wa Rm. Based on the aggregated feature representation, the bag level prediction can be computed by hw(w, X) = (w c E(w; X)) (27) exp(g(w; x))δ(w; x) P x X exp(g(w; x )) where δ(w; x) = w c e(we; x). We can see that hw(X) itself is a compositional function. 
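Both pooling operators above can be written in a few lines. The following NumPy sketch only illustrates the displayed formulas, namely the smoothed-max score $\tau \log \sum_{x} \exp(\phi(w;x)/\tau)$ and the attention-weighted bag score $\sum_x \mathrm{softmax}(g(w;x))\,\delta(w;x)$; any outer link applied to the bag-level prediction is omitted, and the array and parameter names (`features` for the instance embeddings $e(w_e;x)$ of one bag, `w_c`, `w_a`, `V`, `tau`, `C`) are of our choosing.

```python
import numpy as np

def smoothed_max_pool(features, w_c, tau):
    """h_w(X) = tau * log sum_x exp(phi(w; x) / tau), with phi(w; x) = w_c^T e(w_e; x)."""
    phi = features @ w_c                           # instance scores, shape (|X|,)
    m = phi.max() / tau                            # shift for numerical stability
    return tau * (m + np.log(np.exp(phi / tau - m).sum()))

def attention_pool(features, w_c, w_a, V, C=0.0):
    """Bag score sum_x softmax(g(w; x)) * delta(w; x), with
       g(w; x) = w_a^T tanh(V e(w_e; x)) + C and delta(w; x) = w_c^T e(w_e; x)."""
    g = np.tanh(features @ V.T) @ w_a + C          # attention logits, shape (|X|,)
    a = np.exp(g - g.max()); a /= a.sum()          # softmax attention weights
    delta = features @ w_c                         # instance-level predictions
    return float(a @ delta)

# toy bag: 8 instances with 16-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16)); w_c = rng.normal(size=16)
w_a = rng.normal(size=4); V = rng.normal(size=(4, 16))
h_smooth = smoothed_max_pool(X, w_c, tau=0.5)
h_attn = attention_pool(X, w_c, w_a, V)
```

In both cases the bag-level score is itself built from averages of per-instance quantities, which is exactly the structure the TCCO mappings described next exploit.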
To map the problem into TCCO, we define h1 i (w) = 1 |Xi| P x Xi exp(g(w; x))δ(w; x), and h2 i (w) = 1 |Xi| P x Xi exp(g(w; x )). Assume |w a tanh(V e(we; x))| Cb then h2 i (w) exp(C Cb). Then the objective function becomes min w,s ,s 1 n+ Xi S+ fi(ψi(w, si), s ), where fi(g, s ) = s + (g s )+ α , ψi(w, si) = 1 Xj S si + (ℓ( h1 j(w) h2 j(w) h1 i (w) h2 i (w)) si)+ In this case we define gi(ℓ(v), si) = si + (ℓ( v3 β and hi,j(w) = [h1 i (w), h2 i (w), h1 j(w), h2 j(w)]. We can still prove that gi(ℓ(v), si) is monotone w.r.t to each component of v. It is not difficult to prove that ℓ( v3 v2 ) is weakly convex w.r.t v because v3 v4 v1 v2 is a smooth mapping of v when v2, v4 are lower bounded and ℓis a convex function [8]. As a result, since gi(ℓ, si) is non-decreasing and convex w.r.t to ℓ, it is easy to prove that gi(ℓ(v), si) is weakly convex w.r.t v and is monotone (either non-decreasing or non-increasing) w.r.t to each component of v. Hence, assuming h1 i (w), h2 i (w) are smooth and Lipchitz continuous, we can prove that gi(hi,j(w), si) is weakly convex w.r.t. to w. C.4 Convergence Analysis of TPAUC Maximization C.4.1 Convergence analysis for Algorithm 5 We first consider TPAUC maximization in the regular learning setting. Define F(w, s, s ) := 1 n+ P Xi S+ fi(ψi(w, si), s ). Due to the weak-convexity of F(w, s, s ) w.r.t. (w, s, s ), we consider the following Moreau envelope and proximal map defined as Fλ(w, s, s ) = min w, s, s F( w, s, s ) + 1 2λ w w 2 + s s 2 + s s 2 , proxλF (w, s, s ) = arg min w, s, s F( w, s, s ) + 1 2λ w w 2 + s s 2 + s s 2 . Following the same proof of Lemma 4.5, we have the following error bound Lemma C.3. Consider the update for {ui,t : Xi S+} in Algorithm 5. Assume ψi(w, si) is Cψ-Lipshitz continuous for all Xi S+. Assume Et[ Gt 2] M 2 and Et[ 1 B1 P Xi Bt 1 sψi(wt, si,t; Bt 2) uf(ui,t, s t)ei 2] M 2, where ei is the n+-dimensional vector with 1 at the i-th entry and 0 everywhere else. With γ = n+ B1 B1(1 τ) + (1 τ) and τ 1 Xi S+ ui,t+1 ψi(wt+1, si,t+1) Xi S+ ui,0 ψi(w0, si,0) + 2τ 1/2σ B1/2 2 + 8n+CψMη Then we have following convergence guarantee. Theorem C.4. Under the assumptions given in Lemma C.1, with γ = n+ B1 B1(1 τ) + (1 τ), τ = 2, η = O( B1B1/2 2 ϵ4 n+ ), and ρ = ρF + ρψCf, Algorithm 5 converges to an ϵ-stationary point of the Moreau envelope F1/ ρ in T = O( n+ B1B1/2 2 ϵ 6) iterations. Proof of Theorem C.4. Define ( ˆwt,ˆst, ˆs t) := prox F/ ρ(wt, st, s t). For a given Xi S+, we have fi(ψi( ˆwt, ˆsi,t), ˆs t) fi(ui,t, s t) (a) s fi(ui,t, s t)(ˆs t s t) + ufi(ui,t, s t)(ψi( ˆwt, ˆsi,t) ui,t) (b) s fi(ui,t, s t)(ˆs t s t) + ufi(ui,t, s t) ψi(wt, si,t) ui,t + wψi(wt, si,t), ˆwt wt 2 ˆwt wt 2 + siψi(wt, si,t), ˆsi,t si,t ρψ 2 ˆsi,t si,t 2 (c) s fi(ui,t, s t)(ˆs t s t) + ufi(ui,t, s t) ψi(wt, si,t) ui,t + ufi(ui,t, s t) wψi(wt, si,t), ˆwt wt + ufi(ui,t, s t) siψi(wt, si,t), ˆsi,t si,t ρψCf 2 ˆwt wt 2 + ˆsi,t si,t 2 where (a) follows from the convexity of fi, (b) follows from the monotonicity of fi( , s ) and weak convexity of ψi, (c) is due to 0 ufi(ui,t, s t) Cf. 
Then it follows 1 n+ s fi(ui,t, s t)(ˆs t s t) + ufi(ui,t, s t) wψi(wt, si,t), ˆwt wt + ufi(ui,t, s t) siψi(wt, si,t), ˆsi,t si,t fi(ψi( ˆwt, ˆsi,t), ˆs t) fi(ui,t, s t) ufi(ui,t, s t) ψi(wt, si,t) ui,t 2 ˆwt wt 2 + ˆsi,t si,t 2 Now we consider the change in the Moreau envelope: Et[F1/ ρ(wt+1, st+1, s t+1)] min w, s, s F( w, st, s t) + ρ 2 w wt+1 2 + s st+1 2 + s s t+1 2 Et h F( ˆwt,ˆst, ˆs t) + ρ 2 ˆwt wt+1 2 + ˆst st+1 2 + ˆs t s t+1 2 i = F( ˆwt,ˆst, ˆs t) + Et 2 ˆwt (wt ηGt) 2 + ˆst (st ηG1 t) 2 + ˆs t (s t ηG2 t) 2 F( ˆwt, ˆst, ˆs t) + ρ 2 ˆwt wt 2 + ˆst st 2 + ˆs t s t 2 + ρEt[η ˆwt wt, Gt + η ˆst st, G1 t + η ˆs t s t, G2 t ] + 3η2 ρM 2 2 = F1/ ρ(wt, st, s t) + ρEt[η ˆwt wt, Gt + η ˆst st, G1 t + η ˆs t s t, G2 t ] where for simplicity we denote G1 t = 1 B1 P Xi Bt 1 ufi(ui,t, s t) sψi(wt, si,t; Bt 2) and G2 t = Xi Bt 1 s fi(ui,t, s t). The second inequality in the above derivation uses the bounds of E[ Gt 2], E[ G1 t 2] and E[ G2 t 2], which follow from the Lipschitz continuity and bounded variance assumptions and are denoted by M. Moreover, we have Et[η ˆwt wt, Gt + η ˆst st, G1 t + η ˆs t s t, G2 t ] = η ˆwt wt, Et[Gt] + η ˆst st, Et[G1 t] + η ˆs t s t, Et[G2 t] , and Xi S+ ufi(ui,t, s t) wψi(wt, si,t) Et[G1 t] = 1 Xi S+ ufi(ui,t, s t) sψi(wt, si,t) Et[G2 t] = 1 Xi S+ s fi(ui,t, s t). Combining inequality 29 and 30 yields Et[F1/ ρ(wt+1, st+1, s t+1)] F1/ ρ(wt, st, s t) + 3η2 ρM 2 fi(ψi( ˆwt, ˆsi,t), ˆs t) fi(ui,t, s t) ufi(ui,t, s t) ψi(wt, si,t) ui,t + ρψCf 2 ˆwt wt 2 + ˆsi,t si,t 2 F1/ ρ(wt, st, s t) + 3η2 ρM 2 2 + ρη(F( ˆwt,ˆst, ˆs t) F(wt, st, s t)) fi(ψi(wt, si,t), s t) fi(ui,t, s t) ufi(ui,t, s t) ψi(wt, si,t) ui,t 2 ˆwt wt 2 + ˆsi,t si,t 2 Due to the ρF -weak convexity of F(w, s, s ), we have ( ρ ρF )-strong convexity of (w, s, s ) 7 F(w, s, s ) + ρ 2 (wt, st, s t) (w, s, s ) 2. Then it follows F( ˆwt,ˆst, ˆs t) F(wt, st, s t) = F( ˆwt,ˆst, ˆs t) + ρ 2 (wt, st, s t) ( ˆwt,ˆst, ˆs t) 2 F(wt, st, s t) + ρ 2 (wt, st, s t) (wt, st, s t) 2 2 (wt, st, s t) ( ˆwt,ˆst, ˆs t) 2 2 ρ) (wt, st, s t) ( ˆwt,ˆst, ˆs t) 2 Plugging inequality 32 into inequality 31 yields Et[F1/ ρ(wt+1, st+1, s t+1)] E[F1/ ρ(wt, st, s t)] + 3η2 ρM 2 2 ρ) (wt, st, s t) ( ˆwt,ˆst, ˆs t) 2 fi(ψi(wt, si,t), s t) fi(ui,t, s t) ufi(ui,t, s t) ψi(wt, si,t) ui,t 2 ˆwt wt 2 + ˆsi,t si,t 2 Set ρ = ρF + ρψCf. We have Et[F1/ ρ(wt+1, st+1, s t+1)] F1/ ρ(wt, st, s t) + 3η2 ρM 2 2 (wt, st, s t) ( ˆwt,ˆst, ˆs t) 2 fi(ψi(wt, si,t), s t) fi(ui,t, s t) ufi(ui,t, s t) ψi(wt, si,t) ui,t (a) F1/ ρ(wt, st, s t) + 3η2 ρM 2 2 φ1/ ρ(wt, st, s t) 2 fi(ψi(wt, si,t), s t) fi(ui,t, s t) ufi(ui,t, s t) ψi(wt, si,t) ui,t where inequality (a) follows from Lemma 3.2. Using the Lipschitz continuity of f, we have Et[F1/ ρ(wt+1, st+1, s t+1)] F1/ ρ(wt, st, s t) + 3η2 ρM 2 2 F1/ ρ(wt, st, s t) 2 Xi S+ 2Cf ψi(wt, si,t) ui,t With the error bound from Lemma C.3, we have Xi S+ ψi(wt, si,t) ui,t Xi S+ ψi(w0, si,0) ui,0 + R with µ = B1τ 2n+ , R = 2τ 1/2σ B1/2 2 + 4n+CψMη B1τ 1/2 + 4n1/2 + CψMη B1τ 1/2 . 
Then E[F1/ ρ(wt+1, st+1, s t+1)] F1/ ρ(wt, st, s t) + 3η2 ρM 2 2E[ F1/ ρ(wt, st, s t) 2] Xi S+ ψi(w0, si,0) ui,0 + R Taking summation from t = 0 to T 1 yields E[F1/ ρ(w T , s T , s T )] F1/ ρ(w0, s0, s 0) + 3η2 ρM 2T t=0 E[ F1/ ρ(wt, st, s t) 2] t=0 (1 µ)t 1 Xi S+ ψi(w0, si,0) ui,0 + RT (a) F1/ ρ(w0, s0, s 0) + 3η2 ρM 2T t=0 E[ F1/ ρ(wt, st, s t) 2] 1 n+ ψi(w0, si,0) ui,0 + 2Cf ρηRT where (a) uses PT 1 t=0 (1 µ)t 1 Lower bounding the left-hand-side by minw,s,s F1/ ρ(w, s, s ), we obtain t=0 E[ F1/ ρ(wt, st, s t) 2] F1/ ρ(w0, s0, s 0) min w,s,s F1/ ρ(w, s, s ) + 3η2 ρM 2T Xi S+ ψi(w0, si,0) ui,0 + 2Cf ρηRT ηT + 3η ρM 2 + 8Cf ρ Xi S+ ψi(w0, si,0) ui,0 + 4Cf ρR µ) + C(η + R) where we assume F1/ ρ(w0, s0, s 0) minw,s,s F1/ ρ(w, s, s ) and C = max{8 , 12 ρM 2, 32Cf ρ X Xi S+ ψi(w0, si,0) ui,0 , 16Cf ρ}. Plugging the expression of µ and R yields t=0 E[ F1/ ρ(wt, st, s t) 2] B1τ ) + (τ 1/2σ B1/2 2 + n+η B1τ 1/2 ) Setting τ = O(B2ϵ4) and η = O( B1B1/2 2 n+ ϵ4), with T = O( n+ B1B1/2 2 ϵ 6) iterations, we have t=0 E[ F1/ ρ(wt, st, s t) 2] ϵ2. C.4.2 Convergence analysis for Algorithm 6 We now consider MIL TPAUC maximization with mean pooling. Define F(w, s, s ) := 1 n+ P Xi S+ fi(gi(hj(w) hi(w), si), s ). Due to the weak-convexity of F(w, s, s ) w.r.t. (w, s, s ), we consider the following Moreau envelope and proximal map defined as Fλ(w, s, s ) = min w, s, s F( w, s, s ) + 1 2λ w w 2 + s s 2 + s s 2 , proxλF (w, s, s ) = arg min w, s, s F( w, s, s ) + 1 2λ w w 2 + s s 2 + s s 2 . Following the same proofs of Lemma A.3 and Lemma A.4, we have the following error bounds Lemma C.5. Consider the update for {vi,t : Xi S+ S } in Algorithm 6. Assume hi(w; ξ) is Ch-Lipshitz for all Xi S+ S , and E[ Gt 2] M 2. With γ1 = n+ B1 B1(1 τ1) + (1 τ1), γ2 = n B2 B2(1 τ1) + (1 τ1) and τ1 1 Xi S+ vi,t+1 hi(wt+1) (1 B1τ1 Xi S+ vi,0 hi(wt) + 2τ 1/2 1 σ + 4n+Ch Mη Xj S vj,t+1 hj(wt+1) (1 B1τ1 Xj S vj,0 hj(wt) + 2τ 1/2 1 σ + 4n Ch Mη Xi S+ vi,t+1 hi(wt+1) 2 (1 B1τ1 2n+ )2(t+1) 1 Xi S+ vi,0 hi(wt) 2 + 4τ1σ2 + 16n2 +C2 h M 2η2 Xj S vj,t+1 hj(wt+1) 2 (1 B1τ1 2n )2(t+1) 1 Xj S vj,0 hj(wt) 2 + 4τ1σ2 + 16n2 C2 h M 2η2 Lemma C.6. Consider update for {ui,t : Xi S+} in Algorithm 6. Assume gi(vij, si) is Cg-Lipshitz w.r.t. (vij, si) for all Xi S+ and Xj S . With γ3 = n+ B1 B1(1 τ2) + (1 τ2) and τ2 1 Xi S+ ui,t+1 1 Xj S gi(vj,t+1 vi,t+1, si,t+1) Xi S+ ui,0 1 Xj S gi(vj,0 vi,0, si,0) + 2τ 1/2 2 σ + C2 n+ B1 (B1/2 1 n1/2 + + B1/2 2 n1/2 ) τ1 τ 1/2 2 + C2 n+ B1 ( n1/2 + B1/2 1 + n1/2 B1/2 2 ) η τ 1/2 2 + C2 n1/2 + η B1τ 1/2 2 where C2 is a constant defined in the proof. Then we have the following covnergence guarantee. Theorem C.7. Under assumptions given in Lemma C.2, with γ1 = n1 B1 B1(1 τ1) + (1 τ1), γ2 = n2 B2 B2(1 τ1) + (1 τ1), γ3 = n1 B1 B1(1 τ2) + (1 τ2), τ1 = O min B3, B1 n+ min{ n1/2 + B1/2 1 , n1/2 B1/2 2 }B1/2 2 ϵ4 1/2, τ2 = O(B2ϵ4) 1/2, η = O min min{ B1 n } min{B1/2 3 , B1/2 1 n1/2 + min{ n1/4 + B1/4 1 , n1/4 B1/4 2 }B1/4 2 }, B1 n+ min{ B1/2 1 n1/2 + , B1/2 2 n1/2 }B1/2 3 B2 } max{ 1 B1/2 3 , n1/2 + B1/2 1 max{B1/4 1 n1/4 + , B1/4 2 n1/4 } 1 B1/4 2 }, n+ B1 max{ n1/2 + B1/2 1 , n1/2 B1/2 2 } 1 iterations, Algorithm 6 gives ϵ-stationary point to the Moreau envelope, i.e., t=0 F1/ ρ(wt, st, s t) 2 ϵ2. where ρ = ρF + ρg Cf + 8ρg Cf Ch + Cf Cg Lh. Proof of Theorem C.7. 
Consider the change in the Moreau envelope: Et[F1/ ρ(wt+1, st+1, s t+1)] min w, s, s F( w, st, s t) + ρ 2 w wt+1 2 + s st+1 2 + s s t+1 2 Et h F( ˆwt,ˆst, ˆs t) + ρ 2 ˆwt wt+1 2 + ˆst st+1 2 + ˆs t s t+1 2 i = F( ˆwt,ˆst, ˆs t) + Et 2 ˆwt (wt ηGt) 2 + ˆst (st ηG1 t) 2 + ˆs t (s t ηG2 t) 2 F( ˆwt, ˆst, ˆs t) + ρ 2 ˆwt wt 2 + ˆst st 2 + ˆs t s t 2 + ρEt[η ˆwt wt, Gt + η ˆst st, G1 t + η ˆs t s t, G2 t ] + 3η2 ρM 2 2 = F1/ ρ(wt, st, s t) + ρEt[η ˆwt wt, Gt + η ˆst st, G1 t + η ˆs t s t, G2 t ] where for simplicity we denote G2 t = 1 B1 P i Bt 1 s fi(ui,t, s t), and G1 t is a n+-dimensional vector whose i-th coordinate is defined as ( 1 B1 ufi(ui,t, s t) h 1 B2 P Xj Bt 2 sigi(vj,t vi,t, si,t) i , Xi Bt 1 0, Xi Bt 1 . The second inequality in the above derivation uses the bounds of E[ Gt 2], E[ G1 t 2] and E[ G2 t 2], which follow from the Lipschitz continuity and bounded variance assumptions and are denoted by M. Note that Et[Gt] Xi S+ ufi(ui,t, s t) Xj S vgi(vj,t vi,t, si,t) ( hi(w) hj(w)) Et[G1 t] = 1 Xi S+ ufi(ui,t, s t) Xj S sgi(vj,t vi,t, si,t) Et[G2 t] = 1 Xi S+ s fi(ui,t, s t) Define ( ˆwt,ˆst, ˆs t) := prox F/ ρ(wt, st, s t). For a given i {1, . . . , m}, we have Xj S gi(hj( ˆwt) hi( ˆwt), ˆsi,t), ˆs t) fi(ui,t, s t) (a) s fi(ui,t, s t)(ˆs t s t) + ufi(ui,t, s t)( 1 Xj S gi(hj( ˆwt) hi( ˆwt), ˆsi,t) ui,t) (b) s fi(ui,t, s t)(ˆs t s t) + ufi(ui,t, s t) 1 Xj S gi(vj,t vi,t, si,t) ui,t Xj S vgi(vj,t vi,t, si,t), (hj( ˆwt) hi( ˆwt)) (vj,t vi,t)) 2 (hj( ˆwt) hi( ˆwt)) (vj,t vi,t) 2 Xj S sigi(vj,t vi,t), si,t, ˆsi,t si,t ρg 2 ˆsi,t si,t 2 (c) s fi(ui,t, s t)(ˆs t s t) + ufi(ui,t, s t) 1 Xj S gi(vj,t vi,t, si,t) ui,t Xj S ufi(ui,t, s t) vgi(vj,t vi,t, si,t), (hj( ˆwt) hi( ˆwt)) (vj,t vi,t)) | {z } A1 Xj S ufi(ui,t, s t) sigi(vj,t vi,t, si,t), ˆsi,t si,t 2 (hj( ˆwt) hi( ˆwt)) (vj,t vi,t) 2 ρg Cf 2 ˆsi,t si,t 2 where (a) follows from the convexity of fi, (b) follows from the monotonicity of fi( , s ) and weak convexity of gi, (c) is due to 0 ufi(ui,t, s t) Cf. The Lh-smoothness assumption of hi(w) hj(w) for all i, w implies hi( ˆwt) hj( ˆwt) hi(wt) hj(wt) + ( hi(wt) hj(wt)), ˆwt wt Lh 2 ˆwt wt 2 (39) Since ufi(ui,t, s t) vgi(vj,t vi,t, si,t) 0, we bound A1 as following A1 = ufi(ui,t, s t) vgi(vj,t vi,t, si,t), (hj( ˆwt) hi( ˆwt)) (vj,t vi,t)) (a) ufi(ui,t, s t) vgi(vj,t vi,t, si,t), (hi(wt) hj(wt)) (vj,t vi,t)) ufi(ui,t, s t) vgi(vj,t vi,t, si,t), Lh + ufi(ui,t, s t) vgi(vj,t vi,t, si,t)( hi(wt) hj(wt)), ˆwt wt (b) Cf Cg[ hi(wt) vi,t + hj(wt) vj,t ] Cf Cg Lh + ufi(ui,t, s t) ℓgi(vj,t vi,t, si,t)( hi(wt) hj(wt)), ˆwt wt where inequality (a) follows from inequality 39, (b) follows from the Lipschitz continuity and monotone assumptions on fi, gi, hi, hj. 
Then plugging the new formulation of A1 back to inequality 38 Xj S gi(hj( ˆwt) hi( ˆwt), ˆsi,t), ˆs t) fi(ui,t, s t) s fi(ui,t, s t)(ˆs t s t) + ufi(ui,t, s t) 1 Xj S gi(vj,t vi,t, si,t) ui,t Xj S [ Cf Cg[ hi(wt) vi,t + hj(wt) vj,t ]] Cf Cg Lh Xj S ufi(ui,t, s t) vgi(vj,t vi,t, si,t)( hi(wt) hj(wt)), ˆwt wt Xj S ufi(ui,t, s t) sigi(vj,t vi,t, si,t), ˆsi,t si,t 2 (hj( ˆwt) hi( ˆwt)) (vj,t vi,t) 2 ρg Cf 2 ˆsi,t si,t 2 Taking average over i S+ gives 1 n+ Xi S+ fi( 1 Xj S gi(hj( ˆwt) hi( ˆwt), ˆsi,t), ˆs t) fi(ui,t, s t) Et[G2 t], ˆs t s t + Et[Gt], ˆwt wt + Et[G1 t],ˆst st Xi S+ ufi(ui,t, s t) 1 Xj S gi(vj,t vi,t, si,t) ui,t Xi S+ hi(wt) vi,t + 1 Xj S hj(wt) vj,t 2 (hj( ˆwt) hi( ˆwt)) (vj,t vi,t) 2 1 2 ˆsi,t si,t 2 It follows Et[G2 t], ˆs t s t + Et[Gt], ˆwt wt + Et[G1 t],ˆst st Xj S gi(hj( ˆwt) hi( ˆwt), ˆsi,t), ˆs t) fi(ui,t, s t) ufi(ui,t, s t) 1 Xj S gi(vj,t vi,t, si,t) ui,t 2 (hj( ˆwt) hi( ˆwt)) (vj,t vi,t) 2 + ρg Cf 2 ˆsi,t si,t 2 Xi S+ hi(wt) vi,t + 1 Xj S hj(wt) vj,t + Cf Cg Lh Combining inequality 37 and 40 yields Et[F1/ ρ(wt+1, st+1, s t+1)] = F1/ ρ(wt, st, s t) + ρη ˆwt wt, Et[Gt] + ˆst st, Et[G1 t] + ˆs t s t, Et[G2 t] (a) F1/ ρ(wt, st, s t) + 3η2 ρM 2 Fi(ˆs t, ˆwt, ˆsi,t) Fi(s t, wt, si,t) + Cf Cg 1 n hi(wt) vi,t + hj(wt)) vj,t Xj S gi(vj,t vi,t, si,t) ui,t Xj S gi(vj,t vi,t, si,t) ui,t Xj S ρg Cf (hi( ˆwt) vi,t 2 + hj( ˆwt)) vj,t 2 + ρg Cf 2 ˆsi,t si,t 2 Xi S+ hi(wt) vi,t + 1 Xj S hj(wt) vj,t + Cf Cg Lh 2 ˆwt wt 2 ) = F1/ ρ(wt, st, s t) + 3η2 ρM 2 2 + ρη(F( ˆwt,ˆst, ˆs t) F(wt, st, s t)) hi(wt) vi,t + hj(wt)) vj,t hi(wt) vi,t 2 + hj(wt)) vj,t 2 + 4ρg Cf Ch ˆwt wt 2 Xj S gi(vj,t vi,t, si,t) ui,t 2 ˆsi,t si,t 2 + Cf Cg Lh 2 ˆwt wt 2 ) (41) where (a) follows from the Lipschitz continuity of fi, gi, hi, hj and inequality 40. Due to the ρF -weak convexity of F(w, si, s ), we have ( ρ ρF )-strong convexity of (w, si, s ) 7 F(w, s, s ) + ρ 2 (wt, st, s t) (w, s, s ) 2. Then it follows F( ˆwt,ˆst, ˆs t) Fi(wt, st, s t) = Fi( ˆwt,ˆst, ˆs t) + ρ 2 (wt, st, s t) ( ˆwt,ˆst, ˆs t) 2 Fi(wt, st, s t) + ρ 2 (wt, st, s t) (wt, st, s t) 2 2 (wt, st, s t) ( ˆwt,ˆst, ˆs t) 2 2 ρ) (wt, st, s t) ( ˆwt,ˆst, ˆs t) 2 Plugging inequality 42 back into 41, we obtain Et[F1/ ρ(wt+1, st+1, s t+1)] F1/ ρ(wt, st, s t) + 3η2 ρM 2 2 ρ) (wt, st, s t) ( ˆwt,ˆst, ˆs t) 2 hi(wt) vi,t + hj(wt) vj,t hi(wt) vi,t 2 + hj(wt) vj,t 2 Xj S gi(vj,t vi,t, si,t) ui,t 2 ˆsi,t si,t 2 + (4ρg Cf Ch + Cf Cg Lh 2 ) ˆwt wt 2 ) F1/ ρ(wt, st, s t) + 3η2 ρM 2 2 (wt, st, s t) ( ˆwt,ˆst, ˆs t) 2 hi(wt) vi,t + hj(wt) vj,t + hi(wt) vi,t 2 + hj(wt) vj,t 2 Xj S gi(vj,t vi,t, si,t) ui,t (b) F1/ ρ(wt, st, s t) + 3η2 ρM 2 2 F1/ ρ(wt, st, s t) 2 hi(wt) vi,t + hj(wt) vj,t + hi(wt) vi,t 2 + hj(wt) vj,t 2 Xj S gi(vj,t vi,t, si,t) ui,t where in inequality (a) we use ρ = ρF + ρg Cf + 8ρg Cf Ch + Cf Cg Lh and C1 = max{2Cf Cg, 2ρg Cf, 2Cf}, and inequality (b) uses Lemma 3.2. 
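Inequality (b) above, like the corresponding steps in the other convergence proofs of this appendix, invokes Lemma 3.2 of the main text. The identity being used is the standard gradient formula for the Moreau envelope of a weakly convex function, restated here for convenience in the joint variable $z = (w, s, s')$ with $\hat z_t = \mathrm{prox}_{F/\bar\rho}(z_t)$ and $\bar\rho$ larger than the weak-convexity modulus (so that the proximal point is unique):
\[
\nabla F_{1/\bar\rho}(z_t) \;=\; \bar\rho\,\big(z_t - \hat z_t\big),
\qquad\text{hence}\qquad
\|z_t - \hat z_t\|^2 \;=\; \frac{1}{\bar\rho^{\,2}}\,\big\|\nabla F_{1/\bar\rho}(z_t)\big\|^{2}.
\]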
With general error bounds 1 n+ Xi S+ E[ hi(wt) vi,t ] (1 µ1)t 1 Xi S+ hi(w0) vi,0 + R1, Xj S E[ hj(wt) vj,t ] (1 µ2)t 1 Xj S hj(w0) vj,0 + R2, Xi S+ E[ hi(wt) vi,t 2] (1 µ1)t 1 Xi S+ hi(w0) vi,0 2 + R3, Xj S E[ hj(wt) vj,t 2] (1 µ2)t 1 Xj S hj(w0) vj,0 2 + R4, Xi S+ E 1 n Xj S gi(vj,t vi,t, si,t) ui,t Xj S gi(vi,0 vj,0, si,0) ui,0 we have E[F1/ ρ(wt+1, st+1, s t+1)] E[F1/ ρ(wt, st, s t)] + 3η2 ρM 2 2E[ F1/ ρ(wt, st, s t) 2] Xi S+ hi(w0) vi,0 + (1 µ2)t 1 Xj S hj(w0) vj,0 + (1 µ1)t 1 Xi S+ hi(w0) vi,0 2 + (1 µ2)t 1 Xj S hj(w0) vj,0 2 + (1 µ3)t 1 Xj S gi(vi,0 vj,0, si,0) ui,0 + R1 + R2 + R3 + R4 + R5 E[F1/ ρ(wt, st, s t)] + 3η2 ρM 2 2E[ F1/ ρ(wt, st, s t) 2] (1 µmin)t 1 Xi S+ hi(w0) vi,0 + 1 Xj S hj(w0) vj,0 Xi S+ hi(w0) vi,0 2 + 1 Xj S hj(w0) vj,0 2 Xj S gi(vi,0 vj,0, si,0) ui,0 + R1 + R2 + R3 + R4 + R5 where µmin = min{µ1, µ2, µ3}. Taking summation from t = 0 to T 1 yields E[F1/ ρ(w T , s T , s T )] F1/ ρ(w0, s0, s 0) + 3η2 ρM 2T t=0 E[ F1/ ρ(wt, st, s t) 2] t=0 (1 µmin)t 1 Xi S+ hi(w0) vi,0 + 1 Xj S hj(w0) vj,0 Xi S+ hi(w0) vi,0 2 + 1 Xj S hj(w0) vj,0 2 Xj S gi(vi,0 vj,0, si,0) ui,0 + T(R1 + R2 + R3 + R4 + R5) F1/ ρ(w0, s0, s 0) + 3η2 ρM 2T t=0 F1/ ρ(wt, st, s t) 2 µmin + T(R1 + R2 + R3 + R4 + R5) where we use PT 1 t=0 (1 µmin)t 1 µmin and define constant 0 such that 1 Xi S+ hi(w0) vi,0 + 1 Xj S hj(w0) vj,0 Xi S+ hi(w0) vi,0 2 + 1 Xj S hj(w0) vj,0 2 Xj S gi(vi,0 vj,0, si,0) ui,0 Then it follows 1 T t=0 F1/ ρ(wt, st, s t) 2 F1/ ρ(w0, s0, s 0) E[F1/ ρ(w T , s T , s T )] + 3η2 ρM 2T µmin + T(R1 + R2 + R3 + R4 + R5) # ηT + (2 + n+ B1 )η ρM 2 + 2 ρC1 0 µmin T + 2 ρC1(R1 + R2 + R3 + R4 + R5) η + 1 µmin ) + η + R1 + R2 + R3 + R4 + R5 where we define constant such that F1/ ρ(w0, s0, s 0) E[F1/ ρ(w T , s T , s T )] . With MSVR updates for vi,t and ui,t, following from Lemma C.5 and Lemma C.6, we have 2n+ , µ2 = B1τ1 2n , µ3 = B1τ2 R1 = 2τ 1/2 1 σ B1/2 3 + 4n+Ch Mη B1τ 1/2 1 , R2 = 2τ 1/2 1 σ B1/2 3 + 4n Ch Mη B3 + 16n2 +C2 h M 2η2 B2 1τ1 , R4 = 4τ1σ2 B3 + 16n2 C2 h M 2η2 R5 = 2τ 1/2 2 σ B1/2 2 + C2 n+ B1 (B1/2 1 n1/2 + + B1/2 2 n1/2 ) τ1 τ 1/2 2 + C2 n+ B1 ( n1/2 + B1/2 1 + n1/2 B1/2 2 ) η τ 1/2 2 + C2 n1/2 + η B1τ 1/2 2 Then we have 1 T t=0 F1/ ρ(wt, st, s t) 2 η + 1 µmin ) + ( τ 1/2 1 B1/2 3 + τ 1/2 2 B1/2 2 )σ B1τ 1/2 1 + n η B2τ 1/2 1 + n+ B1 max{B1/2 1 n1/2 + , B1/2 2 n1/2 } τ1 τ 1/2 2 + n+ B1 max{ n1/2 + B1/2 1 , n1/2 B1/2 2 } η n+ min{ n1/2 + B1/2 1 , n1/2 B1/2 2 }B1/2 2 , τ2 = O(B2ϵ4), n } min{B1/2 3 , B1/2 1 n1/2 + min{ n1/4 + B1/4 1 , n1/4 B1/4 2 }B1/4 2 }, B1 n+ min{B1/2 1 n1/2 + , B1/2 2 n1/2 }B1/2 3 B2 } max{ 1 B1/2 3 , n1/2 + B1/2 1 max{B1/4 1 n1/4 + , B1/4 2 n1/4 } 1 B1/4 2 }, n+ B1 max{ n1/2 + B1/2 1 , n1/2 B1/2 2 } 1 iterations, we have t=0 F1/ ρ(wt, st, s t) 2 ϵ2. D Proofs of Lemmas and Propositions D.1 Additional Proposition Proposition D.1. Consider a Lipschitz continuous function f : O R where O Rd is an open set. Assume f to be non-increasing (resp. non-decreasing) with respect to each element in the input, then all subgradients of f are element-wise non-positive (resp. non-negative). Proof of Proposition D.1. Let D be the subset of O where f is differentiable. By Theorem 9.60 in [24], a Lipschitz continuous function f : O R, where O Rd is an open set, is differentiable almost everywhere, i.e., D is dense in O. Then by Theorem 9.61 in [24], the subdifferential of f at x is defined as f(x) = con{v| xk x with xk D, f(xk) v}, where con denotes the convex hull. 
If we assume that f is non-increasing with respect to each element of the input, then $\nabla f(x) \le 0$ (element-wise) for all differentiable points $x \in D$. It implies that all vectors in $\{v \mid \exists\, x_k \to x \text{ with } x_k \in D,\ \nabla f(x_k) \to v\}$ are element-wise non-positive. Therefore, all subgradients of f are element-wise non-positive. On the other hand, if we assume that f is non-decreasing, one may follow the same argument and conclude that all subgradients of f are element-wise non-negative.

For functions $f : \mathcal{O} \to \mathbb{R}^m$, where $\mathcal{O} \subseteq \mathbb{R}^d$ is an open set, one may write $f = (f_1, \dots, f_m)$ and apply the above proposition to each $f_k$, $k = 1, \dots, m$.

D.2 Proofs of Proposition 4.2 and Proposition 4.4

To prove Proposition 4.2 and Proposition 4.4, we first present the following proposition on the weak convexity of composite functions.

Proposition D.2. Assume $f : \mathbb{R}^d \to \mathbb{R}$ is $\rho_1$-weakly-convex and $C_1$-Lipschitz continuous, $g : \mathbb{R}^{\tilde d} \to \mathbb{R}^d$ is $C_2$-Lipschitz continuous, and either of the following holds:
1. $f(\cdot)$ is monotone and $g(\cdot)$ is $L_2$-smooth;
2. $f(\cdot)$ is non-decreasing and $g(\cdot)$ is $L_2$-weakly-convex,
then $f \circ g$ is $\rho$-weakly-convex with $\rho = d L_2 C_1 + \rho_1 C_2^2$.

Proof of Proposition D.2. The weak convexity of f implies
\[
f(g(y)) \;\ge\; f(g(x)) + v^{\top}(g(y) - g(x)) - \frac{\rho_1}{2}\|g(y) - g(x)\|^2
\;\ge\; f(g(x)) + v^{\top}(g(y) - g(x)) - \frac{\rho_1 C_2^2}{2}\|x - y\|^2,
\]
where $v \in \partial f(g(x))$. Moreover, due to the smoothness of $g(\cdot)$ (or the weak convexity of $g(\cdot)$, in which case only the second inequality holds), we have
\[
g(y) \;\preceq\; g(x) + \nabla g(x)^{\top}(y - x) + v\!\Big(\tfrac{L_2}{2}\|x - y\|^2\Big),
\qquad
g(y) \;\succeq\; g(x) + \nabla g(x)^{\top}(y - x) - v\!\Big(\tfrac{L_2}{2}\|x - y\|^2\Big), \tag{43}
\]
where $v(e)$ denotes a d-dimensional vector with value $e$ in each dimension. We first assume that f is non-increasing; then we may use the first inequality in (43) and the Lipschitz continuity of g to get
\[
f(g(y)) \;\ge\; f(g(x)) + v^{\top}\nabla g(x)^{\top}(y - x) + v^{\top} v\!\Big(\tfrac{L_2}{2}\|x - y\|^2\Big) - \frac{\rho_1 C_2^2}{2}\|x - y\|^2
\;\ge\; f(g(x)) + v^{\top}\nabla g(x)^{\top}(y - x) - \frac{d L_2 C_1 + \rho_1 C_2^2}{2}\|x - y\|^2 .
\]
On the other hand, if we assume f is non-decreasing, the same result follows from the second inequality in (43). Thus $f \circ g$ is $\rho$-weakly-convex with $\rho = d L_2 C_1 + \rho_1 C_2^2$.

Proof of Proposition 4.2. Under Assumption 4.1, Proposition D.2 directly implies the $\rho_F$-weak convexity of $F(w)$ with $\rho_F = d_1 \rho_g C_f + \rho_f C_g^2$.

Proof of Proposition 4.4. Under Assumption 4.3, we first apply Proposition D.2 to the composite function $g_i(h_{i,j}(\cdot))$ and obtain its $\rho_g' = d_2 L_h C_g + \rho_g C_h^2$-weak convexity. To show its Lipschitz continuity, we use the Lipschitz continuity of $g_i$ and $h_{i,j}$ to obtain $\|g_i(h_{i,j}(w)) - g_i(h_{i,j}(\tilde w))\|^2 \le C_g^2 C_h^2 \|w - \tilde w\|^2$. Thus $g_i(h_{i,j}(w))$ is $C_g' = C_g C_h$-Lipschitz continuous. Since we assume $f_i(\cdot)$ is non-decreasing, $\rho_f$-weakly-convex and $C_f$-Lipschitz continuous, and $g_i(h_{i,j}(\cdot))$ is $\rho_g'$-weakly-convex and $C_g'$-Lipschitz continuous, we apply Proposition D.2 again to conclude that $F(\cdot)$ is $\rho_F = d_1 \rho_g' C_f + \rho_f C_g^2$-weakly-convex.

D.3 Proof of Lemma 4.5

Proof of Lemma 4.5.
With γ = n1 B1 B1(1 τ) +(1 τ), τ 1 2, MSVR update gives recursive error bound [15] E[ ui,t+1 gi(wt+1) 2] n1 )E[ ui,t gi(wt) 2] + 2τ 2B1σ2 n1B2 + 8n1C2 g B1 E[ wt wt+1 2] n1 )E[ ui,t gi(wt) 2] + 2τ 2B1σ2 n1B2 + 8n1C2 g B1 η2E[ Gt 2] 2n1 )2E[ ui,t gi(wt) 2] + 2τ 2B1σ2 n1B2 + 8n1C2 g M 2η2 Applying this inequality recursively, we obtain E[ ui,t+1 gi(wt+1) 2] 2n1 )2(t+1) ui,0 gi(w0) 2 + 2n1 )2(t j) 2τ 2B1σ2 n1B2 + 8n+C2 g M 2η2 2n1 )2(t+1) ui,0 gi(w0) 2 + 4τσ2 B2 + 16n2 1C2 g M 2η2 where we use Pt j=0(1 B1τ 2n1 )2(t j) 2n1 It follows E [ ui,t+1 gi(wt+1) ]2 E[ ui,t+1 gi(wt+1) 2] 2n1 )2(t+1) ui,0 gi(w0) 2 + 4τσ2 B2 + 16n2 1C2 g M 2η2 2n1 )t+1 ui,0 gi(w0) + 2τ 1/2σ B1/2 2 + 4n1Cg Mη E ui,t+1 gi(wt+1) 2n1 )t+1 ui,0 gi(w0) + 2τ 1/2σ B1/2 2 + 4n1Cg Mη Taking summation over i S, we obtain the desired result i S ui,t+1 gi(wt+1) i S ui,0 gi(w0) + 2τ 1/2σ B1/2 2 + 4n Cg Mη D.4 Proof of Lemma C.6 Proof of Lemma C.6. With γ3 = n+ B1 B1(1 τ2) +(1 τ2) and τ2 1 2, MSVR update gives the following recursive error bound [15] E[ ui,t+1 1 j S gi(vj,t+1 vi,t+1, si,t+1) 2] n+ )E[ ui,t 1 j S gi(vj,t vi,t, si,t) 2] + 2τ 2 2 B1σ2 + 8n+C2 g B1 E[ (vj,t vi,t, si,t) (vj,t+1 vi,t+1, si,t+1) 2] n+ )E[ ui,t 1 j S gi(vj,t vi,t, si,t) 2] + 2τ 2 2 B1σ2 + 16n+C2 g B1 E[ vi,t vi,t+1 2 + vj,t vj,t+1 2] + 8C2 g M 2η2 It remains to bound E[ vi,t vi,t+1 2] and E[ vj,t vj,t+1 2]. We bound the former, and the latter s bound naturally follows. Consider the update of vi,t+1 and we have E[ vi,t vi,t+1 2] n+ τ1vi,t τ1h(i)(wt; Bt 3,i) γ1(h(i)(wt; Bt 3,i) h(i)(wt 1; Bt 3,i)) 2 E 2B1τ 2 1 n+ vi,t h(i)(wt; Bt 3,i) 2 + 2B1γ2 1 n+ h(i)(wt; Bt 3,i) h(i)(wt 1; Bt 3,i) 2 E 2B1τ 2 1 n+ vi,t h(i)(wt; Bt 3,i) 2 + 2B1γ2 1Ch n+ wt wt 1 2 (a) 8B1τ 2 1 M 2 n+ + 8n+C2 hη2M 2 B1 where inequality (a) uses τ1 1/2 and γ1 = n+ B1 B1(1 τ1) + (1 τ1) 2n+ B1 . Plugging the above inequality back into inequality 44 gives E[ ui,t+1 1 j S gi(vj,t+1 vi,t+1, si,t+1) 2] n+ )E[ ui,t 1 j S gi(vj,t vi,t, si,t) 2] + 2τ 2 2 B1σ2 + 16n+C2 g B1 8τ 2 1 M 2(B1 n ) + 8C2 hη2M 2(n+ B2 ) + 8C2 g M 2η2 n+ )E[ ui,t 1 j S gi(vj,t vi,t, si,t) 2] + 2τ 2 2 B1σ2 + 128C2 g M 2 n+ n )τ 2 1 + 128C2 g C2 h M 2 n+ B2 )η2 + 8C2 g M 2η2 Applying this inequality recursively, we obtain E[ ui,t+1 1 j S gi(vj,t+1 vi,t+1, si,t+1) 2] 2n+ )2(t+1) ui,0 1 j S gi(vi,0 vj,0, si,0) 2 + j=0 (1 B1τ2 2n+ )2(t j) 2τ 2 2 B1σ2 + 128C2 g M 2 n+ n )τ 2 1 + 128C2 g C2 h M 2 n+ B2 )η2 + 8C2 g M 2η2 2n+ )2(t+1) ui,0 1 j S gi(vi,0 vj,0, si,0) 2 + 4τ2σ2 + 256C2 g M 2 n2 + B2 1 (B1 n )τ 2 1 τ2 + 256C2 g C2 h M 2 n2 + B2 1 (n+ τ2 + 16n+C2 g M 2η2 2n+ )2(t+1) ui,0 1 j S gi(vi,0 vj,0, si,0) 2 + 4τ2σ2 B2 + C2 2 n2 + B2 1 (B1 n )τ 2 1 τ2 + C2 2 n2 + B2 1 (n+ τ2 + C2 2 n+η2 B2 1τ2 where we use Pt j=0(1 B1τ2 2n+ )2(t j) 2n+ B1τ1 and denotes C2 2 = 2 max{256C2 g M 2, 256C2 g C2 h M 2, 16C2 g M 2}. Taking average over i S+ gives the squarednorm error bound. 
To derive the norm error bound, we derive E[ ui,t+1 1 j S gi(vj,t+1 vi,t+1, si,t+1) ]2 E[ ui,t+1 1 j S gi(vj,t+1 vi,t+1, si,t+1) 2] 2n+ )2(t+1) ui,0 1 j S gi(vi,0 vj,0, si,0) 2 + 4τ2σ2 B2 + C2 2 n2 + B2 1 (B1 n )τ 2 1 τ2 + C2 2 n2 + B2 1 (n+ τ2 + C2 2 n+η2 2n+ )t+1 ui,0 1 j S gi(vi,0 vj,0, si,0) + 2τ 1/2 2 σ B1/2 2 + C2 n+ B1 (B1/2 1 n1/2 + + B1/2 2 n1/2 ) τ1 + C2 n+ B1 ( n1/2 + B1/2 1 + n1/2 B1/2 2 ) η τ 1/2 2 + C2 n1/2 + η E[ ui,t+1 1 j S gi(vj,t+1 vi,t+1, si,t+1) ] 2n+ )t+1 ui,0 1 j S gi(vi,0 vj,0, si,0) + 2τ 1/2 2 σ B1/2 2 + C2 n+ B1 (B1/2 1 n1/2 + + B1/2 2 n1/2 ) τ1 + C2 n+ B1 ( n1/2 + B1/2 1 + n1/2 B1/2 2 ) η τ 1/2 2 + C2 n1/2 + η Taking average over i S+, we obtain the norm error bound i S+ ui,t+1 1 j S gi(vj,t+1 vi,t+1, si,t+1) i S+ ui,0 1 j S gi(vi,0 vj,0, si,0) + 2τ 1/2 2 σ + C2 n+ B1 (B1/2 1 n1/2 + + B1/2 2 n1/2 ) τ1 τ 1/2 2 + C2 n+ B1 ( n1/2 + B1/2 1 + n1/2 B1/2 2 ) η τ 1/2 2 + C2 n1/2 + η D.5 Proof of Lemma A.3 Proof of Lemma A.3. With γ1 = n1n2 B1B2 B1B2(1 τ1) +(1 τ1) and τ1 1 2, MSVR update has the following recursive error bound [15][15] E[ vi,j,t+1 hi,j(wt+1) 2] n1n2 )E[ vi,j,t hi,j(wt) 2] + 2τ 2 1 B1B2σ2 n1n2B3 + 8n1n2C2 h B1B2 E[ wt wt+1 2] 2n1n2 )2E[ vi,j,t hi,j(wt) 2] + 2τ 2 1 B1B2σ2 n1n2B3 + 8n1n2C2 h M 2η2 Applying this inequality recursively, we obtain E[ vi,j,t+1 hi,j(wt+1) 2] 2n1n2 )2(t+1) vi,j,0 hi,j(w0) 2 + j=0 (1 B1B2τ1 2n1n2 )2(t j)(2τ 2 1 B1B2σ2 n1n2B3 + 8n1n2C2 h M 2η2 2n1n2 )2(t+1) vi,j,0 hi,j(w0) 2 + 4τ1σ2 B3 + 16n2 1n2 2C2 h M 2η2 B2 1B2 2τ1 where we use Pt j=0(1 B1B2τ1 2n1n2 )2(t j) 2n1n2 B1B2τ1 . Taking average over (i, j) S1 S2 gives the squared-norm error bound. To derive the norm error bound, we derive E[ vi,j,t+1 hi,j(wt+1) ]2 E[ vi,j,t+1 hi,j(wt+1) 2] 2n1n2 )2(t+1) vi,j,0 hi,j(w0) 2 + 4τ1σ2 B3 + 16n2 1n2 2C2 h M 2η2 2n1n2 )t+1 vi,j,0 hi,j(w0) + 2τ 1/2 1 σ B1/2 3 + 4n1n2Ch Mη B1B2τ 1/2 1 E[ vi,j,t+1 hi,j(wt+1) ] 2n1n2 )t+1 vi,j,0 hi,j(w0) + 2τ 1/2 1 σ B1/2 3 + 4n1n2Ch Mη B1B2τ 1/2 1 Taking average over (i, j) S1 S2, we obtain the norm error bound j S2 vi,j,t+1 hi,j(wt+1) 2n1n2 )t+1 1 j S2 vi,j,0 hi,j(w0) + 2τ 1/2 1 σ B1/2 3 + 4n1n2Ch Mη B1B2τ 1/2 1 . D.6 Proof of Lemma A.4 Proof of Lemma A.4. With γ2 = n1 B1 B1(1 τ2) + (1 τ2) and τ2 1 2, MSVR update has the following recursive error bound [15] E[ ui,t+1 1 j S2 gi(vi,j,t+1) 2] n1 )E[ ui,t 1 j S2 gi(vi,j,t) 2] + 2τ 2 2 B1σ2 n1B2 + 8n1C2 g B1 E[ vi,j,t+1 vi,j,t 2] (45) It remains to bound E[ vi,j,t+1 vi,j,t 2], which is done as following E[ vi,j,t+1 vi,j,t 2] n1n2 τ1vi,j,t τ1hi,j(wt; Bt 3,i,j) γ1(hi,j(wt; Bt 3,i,j) hi,j(wt 1; Bt 3,i,j)) 2 E 2B1B2τ 2 1 n1n2 vi,j,t hi,j(wt; Bt 3,i,j) 2 + 2B1B2γ2 1 n1n2 hi,j(wt; Bt 3,i,j) hi,j(wt 1; Bt 3,i,j) 2 E 2B1B2τ 2 1 n1n2 vi,j,t hi,j(wt; Bt 3,i,j) 2 + 2B1B2γ2 1Ch n1n2 wt wt 1 2 (a) 8B1B2τ 2 1 M 2 n1n2 + 8n1n2C2 hη2M 2 B1B2 where inequality (a) uses τ1 1/2 and γ1 = n1n2 B1B2 B1B2(1 τ1) + (1 τ1) 2n1n2 B1B2 . 
Plugging the above inequality back into inequality 45 gives E[ ui,t+1 1 j S2 gi(vi,j,t+1) 2] n1 )E[ ui,t 1 j S2 gi(vi,j,t) 2] + 2τ 2 2 B1σ2 n1B2 + 8n1C2 g B1 8B1B2τ 2 1 M 2 n1n2 + 8n1n2C2 hη2M 2 n1 )E[ ui,t 1 j S2 gi(vi,j,t) 2] + 2τ 2 2 B1σ2 n1B2 + 64B2τ 2 1 M 2C2 g n2 + 64n2 1n2C2 hη2M 2C2 g B2 1B2 Applying this inequality recursively, we obtain E[ ui,t+1 1 j S2 gi(vi,j,t+1) 2] 2n1 )2(t+1) ui,0 1 j S2 gi(vi,j,0) 2 + j=0 (1 B1τ2 2n1 )t j 2τ 2 2 B1σ2 n1B2 + 64B2τ 2 1 M 2C2 g n2 + 64n2 1n2C2 hη2M 2C2 g B2 1B2 2n1 )2(t+1) ui,0 1 j S2 gi(vi,j,0) 2 + 4τ2σ2 B2 + 128n1B2τ 2 1 M 2C2 g B1n2τ2 + 128n3 1n2C2 hη2M 2C2 g B3 1B2τ2 where we use Pt j=0(1 B1τ2 2n1 )2(t j) 2n1 B1τ1 . Taking average over i S1 gives the squared-norm error bound. To derive the norm error bound, we derive E[ ui,t+1 1 j S2 gi(vi,j,t+1) ]2 E[ ui,t+1 1 j S2 gi(vi,j,t+1) 2] 2n1 )2(t+1) ui,0 1 j S2 gi(vi,j,0) 2 + 4τ2σ2 B2 + 128n1B2τ 2 1 M 2C2 g B1n2τ2 + 128n3 1n2C2 hη2M 2C2 g B3 1B2τ2 2n1 )t+1 ui,0 1 j S2 gi(vi,j,0) + 2τ 1/2 2 σ 2n1/2 1 B1/2 2 τ1MCg B1/2 1 n1/2 2 τ 1/2 2 + 8 2n3/2 1 n1/2 2 ChηMCg B3/2 1 B1/2 2 τ 1/2 2 Taking squared root on both sides and taking average over i S1, we obtain the norm error bound i S1 ui,t+1 1 j S2 gi(vi,j,t+1) i S1 ui,0 1 j S2 gi(vi,j,0) + 2τ 1/2 2 σ B1/2 2 + C2n1/2 1 B1/2 2 τ1 B1/2 1 n1/2 2 τ 1/2 2 + C2n3/2 1 n1/2 2 η B3/2 1 B1/2 2 τ 1/2 2 where C2 = max{8 D.7 Proof of Lemma B.2 Proof of Lemma B.2. Define ui,t = (1 τ)ui,t + τgi(wt; Bt 2,i) Then we have EBt 2,i[ ui,t gi(wt) 2] = EBt 2,i[ (1 τ)(ui,t gi(wt)) + τ(gi(wt; Bt 2,i) gi(wt)) 2] = EBt 2,i[(1 τ)2 ui,t gi(wt) 2 + τ 2 gi(wt; Bt 2,i) gi(wt) 2 + 2(1 τ)τ ui,t gi(wt), gi(wt; Bt 2,i) gi(wt) ] (1 τ)2 ui,t gi(wt) 2 + τ 2σ2 B2 It follows EBt 2,i EBt 1[ ui,t+1 gi(wt) 2] n1 EBt 2,i[ ui,t gi(wt) 2] + (1 B1 n1 ) ui,t gi(wt) 2 n1 (1 τ)2 ui,t gi(wt) 2 + B1τ 2σ2 n1B2 + (1 B1 n1 ) ui,t gi(wt) 2 2n1 )2 ui,t gi(wt) 2 + B1τ 2σ2 n1B2 where we use B1 n1 (1 τ)2 + (1 B1 n1 (1 2τ + τ 2) + 1 B1 n1 + τ 2 B1 2n1 )2 = (1 τB1 Then Et[ ui,t+1 gi(wt+1) 2] 4n1 ) ui,t+1 gi(wt) 2 + (1 + 4n1 B1τ ) gi(wt) gi(wt+1) 2 4n1 )(1 B1τ 2n1 )2 ui,t gi(wt) 2 + (1 + B1τ 4n1 )B1τ 2σ2 B1τ )C2 g Et wt wt+1 2] 4n1 )2 ui,t gi(wt) 2 + 2B1τ 2σ2 B1τ C2 g Et wt wt+1 2] 4n1 )2 ui,t gi(wt) 2 + 2B1τ 2σ2 n1B2 + 8n1C2 g M 2η2 where we use B1τ 4n1 1. Applying this inequality recursively, we obtain E[ ui,t+1 gi(wt+1) 2] 4n1 )2E[ ui,t gi(wt) 2] + 2B1τ 2σ2 n1B2 + 8n1C2 g M 2η2 4n1 )2(t+1) ui,0 gi(w0) 2 + 4n1 )2(t j) 2B1τ 2σ2 n1B2 + 8n1C2 g M 2η2 4n1 )2(t+1) ui,0 gi(w0) 2 + 8τσ2 B2 + 32n2 1C2 g M 2η2 where we use Pt j=0(1 B1τ 4n1 )2(t j) 4n1 To obtain the absolute bound, we derive E[ ui,t+1 gi(wt+1) ]2 E[ ui,t+1 gi(wt+1) 2] 4n1 )2(t+1) ui,0 gi(w0) 2 + 8τσ2 B2 + 32n2 1C2 g M 2η2 4n1 )t+1 ui,0 gi(w0) + 2 The desired result follows by taking squared root on both sides. D.8 Proof of Lemma B.5 Proof of Lemma B.5. The proof of Lemma B.5 is the same as Lemma B.2. D.9 Proof of Lemma B.6 Proof of Lemma B.6. 
Define ui,t = (1 τ2)ui,t + τ2 1 B2 j Bt 2,i gi(vi,j,t) Then we have EBt 2[ ui,t 1 j S2 gi(vi,j,t) 2] = EBt 2[ (1 τ2)(ui,t 1 j S2 gi(vi,j,t)) + τ2( 1 j Bt 2 gi(vi,j,t) 1 j S2 gi(vi,j,t)) 2] = EBt 2[(1 τ2)2 ui,t 1 j S2 gi(vi,j,t) 2 + τ 2 2 1 j Bt 2 gi(vi,j,t) 1 j S2 gi(vi,j,t) 2 + 2(1 τ2)τ2 ui,t 1 j S2 gi(vi,j,t), 1 j Bt 2 gi(vi,j,t) 1 j S2 gi(vi,j,t) ] (1 τ2)2 ui,t 1 j S2 gi(vi,j,t) 2 + τ 2 2 σ2 EBt 2,i EBt 1[ ui,t+1 1 j S2 gi(vi,j,t) 2] n1 EBt 2[ ui,t 1 j S2 gi(vi,j,t) 2] + (1 B1 n1 ) ui,t 1 j S2 gi(vi,j,t) 2 n1 (1 τ2)2 ui,t 1 j S2 gi(vi,j,t) 2 + B1τ 2 2 σ2 n1B2 + (1 B1 n1 ) ui,t 1 j S2 gi(vi,j,t) 2 2n1 )2 ui,t 1 j S2 gi(vi,j,t) 2 + B1τ 2σ2 where we use n1 (1 τ2)2 + (1 B1 n1 ) (1 τ2B1 Et[ ui,t+1 1 j S2 gi(vi,j,t+1) 2] 4n1 ) ui,t+1 1 j S2 gi(vi,j,t) 2 + (1 + 4n1 j S2 gi(vi,j,t) 1 j S2 gi(vi,j,t+1) 2 4n1 )(1 B1τ2 2n1 )2 ui,t 1 j S2 gi(vi,j,t) 2 + (1 + B1τ2 4n1 )B1τ 2 2 σ2 B1τ2 )C2 g Et[ 1 j S2 vi,j,t vi,j,t+1 2] 4n1 )2 ui,t 1 j S2 gi(vi,j,t) 2 + 2B1τ 2 2 σ2 n1B2 + 8C2 g M 2B2τ 2 1 n2τ2 where we use B1τ2 Et[ vi,j,t vi,j,t+1 2] = B1B2 n1n2 EBt 3,i,j τ1vi,j,t τ1hi,j(wt; Bt 3,i,j) 2 B1B2τ 2 1 M 2 Applying this inequality recursively, we obtain E[ ui,t+1 1 j S2 gi(vi,j,t+1) 2] 4n1 )2E[ ui,t 1 j S2 gi(vi,j,t) 2] + 2B1τ 2 2 σ2 n1B2 + 8C2 g M 2B2τ 2 1 n2τ2 4n1 )2(t+1) ui,0 1 j S2 gi(vi,j,0) 2 + j=0 (1 B1τ2 4n1 )2(t j) 2B1τ 2 2 σ2 n1B2 + 8C2 g M 2B2τ 2 1 n2τ2 4n1 )2(t+1) ui,0 1 j S2 gi(vi,j,0) 2 + 8τ2σ2 B2 + 32C2 g M 2n1B2τ 2 1 B1n2τ 2 2 where we use Pt j=0(1 B1τ2 4n1 )2(t j) 4n1 B1τ2 . To obtain the absolute bound, we derive E[ ui,t+1 1 j S2 gi(vi,j,t+1) ]2 E[ ui,t+1 1 j S2 gi(vi,j,t+1) 2] 4n1 )2(t+1) ui,0 1 j S2 gi(vi,j,0) 2 + 8τ2σ2 B2 + 32C2 g M 2n1B2τ 2 1 B1n2τ 2 2 4n1 )t+1 ui,0 1 j S2 gi(vi,j,0) + 2 2Cg Mn1/2 1 B1/2 2 τ1 B1/2 1 n1/2 2 τ2 The desired result follows by taking squared root on both sides. E Group Distributionally Robust Optimization NSWC FCCO finds an important application in group distributionally robust optimization (group DRO), particularly valuable in addressing distributional shift [25]. Consider N groups with different distributions. Each group k has an averaged loss Lk(w) = 1 nk Pnk i=1 ℓ(fw(xk i ), yk i ), where w is the the model parameter and (xk i , yk i ) is a data point. For robust optimization, we assign different weights to different groups and form the following robust loss minimization problem: min w max p Ω k=1 pk Lk(w), where Ω and denotes a simplex. A common choice for Ωis Ω= {p , pi 1/K} where K is an integer, resulting in the so-called CVa R losses, i.e., average of top-K group losses. Consequently, the above problem can be equivalently reformulated as [23]: min w min s F(w, s) = 1 k=1 [Lk(w) s]+ + s. This formulation can be mapped into non-smooth weakly-convex FCCO when the loss function ℓ( , ) is weakly convex in terms of w. In comparison to directly solving the min-max problem, solving the above FCCO problem avoids the need of dealing with the projection onto the constraint Ωand expensive sampling as in existing works [4]. F More Information for Experiments F.1 Dataset Statistics Table 3: Datasets Statistics. The percentage in parenthesis represents the proportion of positive samples. Dataset Train Validation Test moltox21(t0) 5834 (4.25%) 722 (4.01%) 709 (4.51%) molmuv(t1) 11466 (0.18%) 1559 (0.13%) 1709 (0.35%) molpcba(t0) 120762 (9.32%) 19865 (11.74%) 20397 (11.61%) Table 4: Data statistics for the MIL datasets. D+/D is the positive/negative bag number. 
Data Format              Dataset  D+   D-    average bag size  #features
Tabular                  MUSK2    39   63    64.69             166
Tabular                  Fox      100  100   6.6               230
Histopathological Image  Lung     100  1000  256               32x32x3

F.2 Illustration for Histopathology Dataset on MIL Task

Figure 2: Illustration for Histopathology Dataset on MIL Task. Ade. is short for adenocarcinoma and SCC is short for squamous cell carcinoma. In this work, each RGB image is separated into 32x32 non-overlapping patches, which constitute the bag.
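For the histopathology MIL setting described above, a minimal sketch of the bag construction (splitting an RGB image into non-overlapping 32x32 patches that form one bag) could look as follows; the array and function names are ours, and the exact preprocessing used in the experiments may differ in details such as normalization.

```python
import numpy as np

def image_to_bag(image, patch=32):
    """Split an (H, W, 3) RGB image into non-overlapping patch x patch tiles.
    Returns an array of shape (num_patches, patch, patch, 3): one bag of instances."""
    h, w, c = image.shape
    h, w = h - h % patch, w - w % patch           # drop any ragged border
    tiles = image[:h, :w].reshape(h // patch, patch, w // patch, patch, c)
    return tiles.transpose(0, 2, 1, 3, 4).reshape(-1, patch, patch, c)

# a 512x512 image yields a bag of 256 instances of size 32x32x3,
# matching the average bag size reported in Table 4
bag = image_to_bag(np.zeros((512, 512, 3), dtype=np.uint8))
```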