# firstorder_minimax_bilevel_optimization__f69aeaed.pdf

First-Order Minimax Bilevel Optimization

Yifan Yang , Zhaofeng Si , Siwei Lyu and Kaiyi Ji Department of Computer Science and Engineering University at Buffalo Buffalo, NY 14260 {yyang99, zhaofeng, siweilyu, kaiyiji}@buffalo.edu

Multi-block minimax bilevel optimization has been studied recently due to its great potential in multi-task learning, robust machine learning, and few-shot learning. However, due to the complex three-level optimization structure, existing algorithms often suffer from issues such as high computing costs due to the second-order model derivatives or high memory consumption in storing all blocks parameters. In this paper, we tackle these challenges by proposing two novel fully first-order algorithms named FOSL and Mem CS. FOSL features a fully single-loop structure by updating all three variables simultaneously, and Mem CS is a memory-efficient double-loop algorithm with cold-start initialization. We provide a comprehensive convergence analysis for both algorithms under full and partial block participation, and show that their sample complexities match or outperform those of the same type of methods in standard bilevel optimization. We evaluate our methods in two applications: the recently proposed multi-task deep AUC maximization and a novel rank-based robust meta-learning. Our methods consistently improve over existing methods with better performance over various datasets.

1 Introduction

In this paper, we study a general multi-block minimax bilevel optimization problem given by

min x Rdx max y Rdy F(x, y, z ) := 1

i=1 fi x, y, z i (x) = 1

i=1 Eξ fi x, y, z i (x); ξi

s.t. z i (x) = arg min z Rdz gi(x, z) = Eζ gi(x, z; ζi) (1)

where the upperand lower-level function fi and gi for block i take the expectation form w.r.t. the random variables ξ, ζ, and are jointly continuously differentiable, z = z 1(x), ..., z n(x) Rdz n contains all lower-level optimal solutions, and n is the number of blocks. The above problem has various applications in machine learning, including deep AUC maximization [24, 23], meta-learning [16, 3], hyperparameter optimization [17], and robust learning [17]. This paper focuses on the setting with a nonconvex-strongly-concave minimax upper-level problem and a strongly-convex lower-level problem.

To date, barring a few works on optimizing special cases of this problem [17, 24], the solution algorithm to its general form has not been well studied. The primary obstacle lies in the significant computational cost per iteration, arising from the inherent structure of multi-block minimax bilevel optimization. To address this challenge, [17] considered a special case where y is the simplex variable, and introduced a single-loop gradient descent-ascent algorithm, based on the two-timescale bilevel

*These authors contributed equally to this work.

38th Conference on Neural Information Processing Systems (Neur IPS 2024).

framework in [22]. [24] proposed a single-loop matrix-vector-based algorithm for a special case of our problem, where each upper-level function fi is evaluated only at the ith coordinate of y. However, these methods require computing the expensive second-order derivatives (i.e., the Hessian matrix or Hessian-vector product) per iteration, and the more efficient first-order approaches have not been explored yet. In this paper, we propose two efficient first-order minimax bilevel algorithms and further apply them to two novel ML applications. Our contributions are summarized as follows.

By converting the original minimax bilevel problem into a simple minimax problem, we first propose a fully first-order single-loop algorithm named FOSL, which is easy to implement by updating x, y and z simultaneously, and is computationally efficient without the calculation of any second-order Hessian or Jacobian matrices. We provide a convergence analysis for FOSL under a practical block sampling without replacement setting and show that its sample complexity matches the best-known result of the same type of methods in standard bilevel optimization. Technically, we characterize the gap between the reformulated and original problems and need to deal with the interplay among four variables in the error analysis.

In the settings where the number of blocks is substantial (e.g., in few-shot meta-learning), it becomes impractical to store all block-specific parameters to perform the single-loop optimization. To this end, we also propose a memory-efficient method named Mem CS via a cold-start initialization, which randomly initializes a new weight for each sampled block, without saving it for the next iteration. We then analyze the convergence of Mem CS under the partial-block and full-block participation, and show that it can achieve a better sample complexity than the same type of methods in standard bilevel optimization.

We further apply our approaches to two ML applications: deep AUC maximization and robust meta-learning. The first application pertains to the established field of AUC Maximization, while the second explores a novel application known as Rank-based Robust Meta Learning. We show the effectiveness of our methods over a variety of datasets including CIFAR100, Celeb A, Che Xpert, OGBG-Mol PCBA, Mini-Image Net and Tiered-Image Net.

2 Related Work

(Minimax) bilevel optimization. Bilevel optimization, introduced in [2], has been extensively studied, with constraint-based methods [13, 20, 50, 51] and gradient-based methods [1, 45, 14, 49, 57] emerging as two predominant types of approaches. The constraint-based methods (e.g., [34, 38, 30, 35, 54]) reformulated the lower-level problem as a value-function-based constraint, and solved it via different constrained optimization algorithms. More recently, [16, 23] studied the minimax bilevel optimization problem and proposed single-loop algorithms with applications to robust machine learning and deep AUC maximization. In this paper, we propose two efficient, fully first-order algorithms with solid performance guarantees. In recent years, there has been a growing interest in gradient-based methods due to their efficiency in solving machine-learning problems. Within this category, Iterative Differentiation (ITD) based methods [6, 7, 14, 39, 49, 27] and Approximate Implicit Differentiation (AID) based methods [1, 5, 33, 45, 41, 14, 27, 22] are two important classes distinguished by their approaches to approximating hypergradients.

Deep AUC maximization (DAM). DAM methods are aimed to mitigate the impact of imbalanced data in binary classification by directly maximizing the area under the ROC curve (AUC), a performance metric less affected by imbalanced data. As the AUC is difficult to optimize directly, research on DAM primarily focuses on devising effective optimization methods for its continuous surrogates [21, 4, 46, 8]. [37] proposed to reformulate the deep AUC maximization problem as a minimax optimization problem, providing the foundation for stochastic DAM algorithms developed in recent years [59, 60, 18, 24]. Among them, the most relevant work [24] formulated the DAM problem as a multi-block minimax optimization problem. In this work, we will use this form of DAM to demonstrate the effectiveness of our algorithm. A more comprehensive overview of DAM methods can be found in the survey [56].

Robust meta-learning. Meta-learning provides effective solutions to multi-task learning in few-shot learning settings. In meta-learning, one trains a meta-model that can be quickly turned into a model that adapts to new tasks with only a few updates. Meta-learning algorithms in real-world applications must be robust to handle corrupted or low-quality data such as outliers. The majority of robust metalearning methods encompass filtering [55], re-weighting [48, 28, 31, 36], and re-labeling[43, 52, 61]

on the sample level. Moreover, some other works focus on improving task-level robustness [28, 58]. In this work, we show that robust meta-learning can be formulated as a minimax bilevel optimization problem and solved with the proposed algorithm.

3 Algorithms

3.1 Reformulation as a Minimax Problem

Motivated by [34, 38, 30] in single-machine bilevel optimization, we reformulate the lower-level problem as a value-function-based constraint and aim to solve the following equivalent problem:

min x max y 1 n

i=1 fi(x, y, zi) s.t. gi(x, zi) gi(x, z i ) 0, (2)

where z i := arg minz gi(x, z). Inspired by [30], we form a Lagrangian Li with Lagrangian multiplier λ 0 to approximate the original problem for each block i in eq. (2), as Li(x, y, zi, vi) = fi(x, y, zi) + λ gi(x, zi) gi(x, vi) ,

where vi is used to approximate the lower-level solution z i (x) of the ith block. Then, we turn to solve the following surrogate minimax problem:

min x,z max y,v L(x, y, z, v) := 1

i=1 Li(x, y, zei, vei), (3)

where z = (z1, ..., zn) Rdz n, v = (v1, ..., vn) Rdz n and the standard basis vector ei has only one non-zero element of 1 at the ith coordinate. We show later in Section 4.2 that the gap between the gradients F(x, y (x), z (x)) and L(x, y (x), z λ(x), z (x)) of the original and surrogate problems can be effectively bounded by O(1/λ), where y (x) denotes the maximize of outer-objective F(x, , z (x)) and each vector z λ,i(x) in z λ(x) := (z λ,1(x), ..., z λ,n(x)) denotes the minimizer of the Lagrangian function Li(x, y (x), , v) (where z λ,i(x) has not reliance on v). This validates the effectiveness of the Lagrangian approximation for λ sufficiently large. Next, we propose two efficient algorithms, namely FOSL and Mem CS, to solve the surrogate problem in eq. (3).

3.2 FOSL: Fully First-Order Single-Loop Method

As shown in Algorithm 1, we first sample a subset It S := {1, ..., n} of blocks without replacement. Noting that zi and vi are both block-specific variables, we then apply a stochastic ascent and descent step to update vi and zi for all block i It as

vi,t+1 = vi,t + ηv zgi(xt, vi,t; ξt v,i)

zi,t+1 = zi,t ηz z Li xt, yt, zi,t, vi,t; ξt z,i ,

where the gradient of Li w.r.t. z has no dependence on v. Since the solutions w.r.t. variables x and y depend on all blocks, we use the average of stochastic gradient estimators from the selected blocks in It to update y and x as

yt+1 = yt + ηy 1 |It|

i It yfi(xt, yt, vi,t; ξt y,i)

xt+1 = xt ηx 1 |It|

i It x Li(xt, yt, zi,t, vi,t; ξt x,i).

Note that our algorithm takes a simpler fully single-loop structure via updating {vi,t, zi,t}i It, xt and yt simultaneously at each iteration. Hence, it can also benefit from the hardware parallelism. In addition, different from the methods in [17, 24] that need to compute the higher order Hessianor Jacobian-vector products, our method only uses the first-order gradients.

3.3 Mem CS: Memory-Efficient Cold-Start Method

Note that in the single-loop optimization of Algorithm 1, all block-specific parameters vi,t and zi,t of blocks in It need to be stored for the updates at iteration t + 1. However, in some ML applications,

Algorithm 1 Fully First-Order Single-Loop Method (FOSL)

1: Input: initialization {x0,y0,z0,v0}, number of iteration rounds T, learning rates {ηx, ηy, ηz, ηv} 2: for t = 0, 1, 2, ..., T do 3: Sample blocks It S 4: for i It do 5: vi,t+1 = vi,t ηv zgi(xt, vi,t; ξt v,i) 6: zi,t+1 = zi,t ηz z Li(xt, yt, zi,t, vi,t; ξt z,i) 7: end for 8: yt+1 = yt + ηy 1 |It| P

i It yfi(xt, yt, vi,t; ξt y,i) 9: xt+1 = xt ηx 1 |It| P

i It x Li(xt, yt, zi,t, vi,t; ξt x,i) 10: end for

Algorithm 2 Memory-Efficient Cold-Start (Mem CS)

1: Input: initialization {x0,y0}, number of iteration rounds T, learning rates {ηx, ηy, ηz, ηv} 2: for t = 0, 1, 2, ..., T 1 do 3: Sample blocks It S 4: for i It do 5: Random initializations v0 i,t, z0 i,t 6: for k = 0, 1, 2, ..., K 1 do 7: vk+1 i,t = vk i,t ηv zgi(xt, vk i,t) 8: zk+1 i,t = zk i,t ηz z Li(xt, yt, zk i,t, vk i,t) 9: end for 10: end for 11: yt+1 = yt + ηy 1 |It| P

i It yfi(xt, yt, v K i,t) 12: xt+1 = xt ηx 1 |It| P

i It x Li(xt, yt, z K i,t, v K i,t) 13: end for

such as few-shot meta-learning, the number of blocks/tasks is often large, and hence Algorithm 1 can suffer from significant memory consumption. To address this challenge, we propose a memoryefficient method named Mem CS in Algorithm 2. Differently from the single-loop updates in FOSL, Mem CS contains a sub-loop of K steps of gradient descent1 in updating the block-specific variables vk i,t and zk i,t for k = 0, ..., K 1 with random initialization v0 i,t and z0 i,t. After obtaining the outputs v K i,t, z K i,t of this sub-loop, the remaining step is to update yt and xt via gradient ascent and descent similarly as in FOSL.

4 Main Results

4.1 Assumptions

Definition 4.1. A mapping f is L-Lipschitz continuous if f(x1) f(x2) L x1 x2 , for any x1, x2. We say f is L-smooth if f is L-Lipschitz continuous.

Since the overall objective is nonconvex w.r.t. x, we aim to find an ϵ-accurate stationary point. Definition 4.2. We call x as an ϵ-accurate stationary point of the objective function Φ(x) if E Φ( x) 2 ϵ, where ϵ (0, 1] and x is the output of an algorithm.

We use the following assumptions in the subsequent description. Note that these assumptions are widely adopted in existing studies [24, 17]. Assumption 4.3. For any x Rdx, y Rdy, z Rdz and i {1, 2, ..., n}, fi(x, y) and gi(x, y) are twice continuously differentiable, fi(x, y, z) is µf-strongly concave w.r.t. y and gi(x, z) is µg-strongly convex w.r.t. z.

The following assumption imposes the Lipschitz continuity on the upperand lower-level functions and their derivatives.

1For Mem CS, we focus on the few-shot setting such as meta-learning, where each block contains a small number of samples, and hence we use gradient descent here. However, the algorithm can also be extended to the stochastic setting.

Assumption 4.4. For any x Rdx, z Rdz and i {1, 2, ..., n}, fi(x, y, z) is Lf,0-Lipschitz continuous w.r.t. x, gi(x, z) is Lg,0-Lipschitz continuous w.r.t. x, fi(x, y, z) is Lf,1-smooth and gi(x, y) is Lg,1-smooth. In addition, the second-order derivatives 2fi(x, y, z) and 2gi(x, y) are Lf,2and Lg,2-Lipschitz continuous.

Next, we make a bounded variance assumption for the gradients in the stochastic setting.

Assumption 4.5. There exist constants σf and σg such that the variances E fi(x, y, z) fi(x, y, z; ξ) 2 σ2 f and E gi(x, z) gi(x, z; ζ) 2 σ2 g.

The following assumption on block heterogeneity measures the similarities of the upper-level gradients yfi(x, z) for all i. This has not been discussed in previous works, but it is necessary for our approach as we explore a more general outer-maximization solution y (x) for F, rather than for single fi.

Assumption 4.6. For any x Rdx, z Rdz, there exist constants βth 1 and σth 0 such that

i=1 E yfi(x, y, z) 2 β2 th E y F(x, y, z) 2 + σ2 th.

We have βth = 1 and σth = 0 when all gi s are identical.

4.2 Convergence analysis

For simplicity, we fix the number of involved blocks |It| = P for all t. Let y (x) be the maximizer of F function w.r.t. y. Then, the overall objective of the original problem in eq. (1) w.r.t. x is given by

Φ(x) := F x, y (x), z (x) ,

where z (x) is the lower-level minimizer and y (x) is the maximizer of F(x, , z (x)). In addition, recall that the objective function of the surrogate problem in eq. (3) w.r.t. x is given by L (x) := L x, y (x), z λ(x), z (x) . We next characterize the difference between the gradients of the original and the surrogate problems.

Proposition 4.7. Under Assumptions 4.3, 4.4, the gap between Φ(x) and L (x) can be bounded as Φ(x) L (x) = O(1/λ).

Due to the limit of space, the proof of Proposition 4.7 can be found in Lemma D.5 in the appendix. For a properly large λ, Proposition 4.7 guarantees that the stationary points of the original and surrogate problems are close to each other. However, too large λ can explode the gradient estimation variance, resulting in a much slower convergence rate. This trade-off makes the selection of λ important, as shown in our theorems later.

We next give an upper bound on the gradient norm E L (xt) 2 of the surrogate problem. Denote ht x := x Li(xt, yt, zt, vt; ξt x,i) and its expectation as eht x.

Proposition 4.8. Under Assumptions 4.3, 4.4, 4.5, the consecutive iterates of Algorithm 1 satisfy:

E L (xt) 2 2

ηx E L (xt+1) L (xt) E eht x 2 + ηx L ,1E ht x 2 + 3L2 f,1E yt y (xt) 2

+ 3L2 λ,1 n

i=1 E h zi,t z λ,i(xt) 2 + vi,t z i (xt) 2i

for all t {0, 1, ..., T 1}, where Lλ,1 and L ,1 are given in Lemma D.1 and Lemma D.6 in the appendix respectively.

The proof of Proposition 4.8 can be found in Lemma E.1 in the appendix. The same result can be obtained for Algorithm 2 by replacing vi,t and zi,t with v K i,t and z K i,t. This proposition shows that the convergence rate of our algorithm relies on how fast the iterates yt, zi,t and vi,t converge to their optimal solutions at each iteration t. We next characterize the distance of yt to its optimal solution.

Proposition 4.9. Under Assumptions 4.3, 4.4, 4.5, the iterates of yt generated according to Algorithm 1 satisfy

E yt+1 y (xt+1) 2 E yt y (xt) 2 O(ηy) E yt y (xt) 2 + O η2 y |It|

(σ2 f + σ2 th)

i=1 E vi,t z i (xt) 2 + O η2 x ηy

E eht x 2 + O(η2 x)E ht x 2,

for all t {0, ..., T 1}.

The proof of Proposition 4.9 refers to Lemma E.3 in the appendix. It can be seen that with properly small stepsizes ηx and ηy, there exists a descent of the optimal distance E yt y (xt) 2, which is critical for the final convergence analysis. Similar results are obtained for vi,t and zi,t. Combining the above propositions and the auxiliary lemmas in the appendix, we get the following result for Algorithm 1. Theorem 4.10 (Convergence of FOSL). Suppose Assumptions 4.3, 4.4, 4.5, 4.6 are satisfied. Set parameters ηx = O(T 5

7 ), ηy = O(T 4

7 ), ηz = O(T 5

7 ), ηv = O(T 4

7 ) and λ = O(T 1 7 ). Then, we have

t=0 E Φ(xt) 2 2Cgap

λ2 + 4 Ψ0 ΨT

Tηx + 4ηxλ2

(ηzλ) + ηxλ2

+ 4 ηy + (ηzλ)λ2 + ηvλ2 C3

7 ), where Cgap is defined in Lemma D.5, C2, C3 are defined in eq. (39) in the appendix, and Ψt := L (xt) + Ky E yt y (xt) 2 + Kz 1

n Pn i=1 E zi,t z λ,i(xt) 2 + Kv 1

n Pn i=1 E vi,t z i (xt) 2.

Next, we characterize the sample complexity for FOSL. Corollary 4.11. Under the same setting of Theorem 4.10, our algorithm finds an ϵ-accurate stationary solution after T = O(ϵ 7

2 ) interactions. The total sample complexity for all involved blocks is PT = O(Pϵ 7

Compared with existing works [17, 24], our algorithm is free from second-order derivative computations. In addition, the sample complexity of our algorithm matches the best result [30] of the same type of methods in standard single-block bilevel optimization.

Next, we analyze the convergence for Algorithm 2 under the partialand full-block participation. Theorem 4.12 (Convergence of Mem CS). Suppose Assumptions 4.3, 4.4, 4.5, 4.6 are satisfied. Assume there exists some B > 0 such that z i (xt) B for any xt, i = 1, ..., N. For the partial-block participation, by setting parameters ηx = O(P 1 5 T 2

3 ), ηy = O(P 1

2 ), ηz = O(P 1

6 ), ηv = O(1), λ = O(P 1 10 T 1 6 ) and taking ϵsub = O(P 2

3 ), we have

t=0 E Φ(xt) 2 2Cgap

λ2 + 2(Ψ0 ΨT )

Tηx + 4ηxλ2

P 24L2 f,1σ2 th µf

+ 4 3L2 λ,1 + 12L4 f,1 µ2 f

where Lλ,1 := 3λLg,1, Cgap is defined by Lemma D.5 in the appendix and Ψt := L (xt)+Ky E yt y (xt) 2. For the full-block participation, by setting ηx = O(1), ηy = O(1), ηz = O(T 1

2 ), ηv = O(1), λ = O(T 1 2 ) and taking ϵsub = O(T 2), we have

t=0 E Φ(xt) 2 2Cgap

λ2 + 4 Ψ0 ΨT

Tηx + 12 L2 λ,1 + 4L4 f,1 µ2 f

ϵsub O(T 1).

Note that the assumption z i (xt) B can be removed when the domain of x is a closed convex set and projected gradient descent is used to update x [12, 15]. We next characterize the sample complexity for Mem CS.

Corollary 4.13. Under the same setting of Theorem 4.12,

For partial-block participation, our algorithm finds an ϵ-accurate stationary solution of Φ(x) after T = O(P 3

5 ϵ 3) outer iterations and K = O(log 1

ϵ ) inner iterations. The total sample complexity for all involved blocks is PKT = e O(P 2 5 ϵ 3).

For full-block participation, our algorithm finds an ϵ-accurate stationary solution of Φ(x) after T = O(ϵ 1) outer iterations and K = O(log 1

ϵ ) inner iterations. The total sample complexity for all involved blocks is n KT = e O(nϵ 1).

Note that the per-block sample complexity of our Mem CS algorithm takes an order of ϵ 1, which improves that of the same-type F2SA [30] by an order of ϵ 0.5, based on a refined analysis on the smoothness of the overall objective function. Corollary 4.13 also shows that Mem CS achieves a linear convergence speedup w.r.t. the number P of blocks. As far as we know, this is the first linear speedup result in multi-block minimax bilevel optimization.

5 Applications and Experiments

In this section, we conduct extensive experiments in two applications: deep AUC maximization and rank-based robust meta-learning. More experimental results such as time and space comparison are provided in Appendix B.

5.1 Deep AUC Maximization

5.1.1 Formulation

Deep AUC Maximization (DAM) addresses machine learning challenges presented by imbalanced datasets. In particular, the AUC (Area Under the ROC Curve) measures the likelihood that a positive sample s prediction score will be higher than that of a negative sample. As outlined by [24], the DAM issue is structured as a multi-block minimax bilevel optimization problem:

min w,a,b max α

j=1 Φj u j(wj), aj, bj, αj s.t. u j(wj) = arg min uj gj(uj, wj),

where gj(uj, wj) := 1

2 uj wj η LAV G(wj) 2, LAV G(wj) := 1 n Pn i=1 ℓ(wj; xi, yi), ℓ denotes the task loss (e.g., the cross-entropy loss in binary classification tasks), and Φj denotes the sample-level AUC loss function. The detailed formulation can be found in Appendix A.1. With the method in Section 3, we reformulate this problem as a single-level minimax optimization problem:

min w,u,a,b max α,v L(w, u, a, b, α, v),

where L(w, u, a, b, α, v) := Pm j=1 Φj(uj, aj, bj, αj) + λ gj(uj, wj) gj(vj, wj) is the Lagrangian function of AUC loss function, and vj is the approximate optimal solution of gj.

5.1.2 Results

Settings. Following the work [24], we assess our methodology using four datasets, namely CIFAR100 [29], Celeb A [40], Che Xpert [26] and OGBG-Mol PCBA [25], whose details are provided in Appendix B.1. We evaluate the effectiveness of our algorithm by comparing it with direct optimization on multi-block minimax AUC loss (m AUC) and compositional training on m AUC loss (m AUC-CT). The test AUC scores of m AUC and m AUC-CT for different datasets in Table 1 are derived from the original paper. Detailed configuration of experiments can be found in Appendix B.2.

Results. The results of deep AUC maximization on different datasets are shown in Table 1. The results indicate that our FOSL outperforms the m AUC method in terms of test AUC scores on all datasets and achieves comparative or better performance than m AUC-CT on various datasets. We proceed to visualize the AUC loss during the initial stages of training for all methods on Celeb A, as depicted in Figure 1a and 1b. The figures illustrate that, in the initial stage, our method and m AUC-CT [24] exhibit a faster loss drop than m AUC, and our method achieves the fastest overall convergence rate. Furthermore, our approach exhibits a smaller fluctuation compared to other baseline methods.

Table 1: Test AUC score with 95% confidence interval on different datasets for AUC maximization.

CIFAR100 Celeb A Che Xpert OGBG-Mol PCBA

m AUC[24] 0.9044 (0.0015) 0.9062 (0.0042) 0.8084 (0.1455) 0.7793 (0.0028) m AUC-CT[24] 0.9272 (0.0014) 0.9192 (0.0004) 0.8198 (0.1495) 0.8406 (0.0044) FOSL(ours) 0.9540 (0.0009) 0.9267 (0.0018) 0.8166 (0.0051) 0.8516 (0.0014)

Figure 1: Visualization results of FOSL experiments. (a) Training AUC loss over iteration rounds during the initial stages of training. (b) Training AUC loss over time during the initial training phase. (c) Impact of λ on test AUC score throughout training on the CIFAR100 dataset. (d) Impact of λ on test AUC score throughout training on the Celeb A dataset.

Impact of λ. To evaluate the impact of the hyper-parameter λ on training with FOSL algorithm, we conduct an ablation study on the CIFAR100 and the Celeb A datasets. The test AUC scores along with training are depicted in Figure 1c and 1d. As shown in Figure 1c, our method sustains robust performance across a wide range of choices for λ. For example, training within a λ range of [2, 8] shows that the speed of convergence and the final test performance are not sensitive to the change of λ. A similar observation also holds for the Celeb A dataset as shown in Figure 1d.

5.2 Robust Meta-learning with Rank-based Loss

5.2.1 Formulation

Our objective is to devise a robust meta-learning approach wherein, during each iteration, tasks with large loss values are filtered out, and the meta-model is updated with the remaining tasks. This approach effectively reduces the impact of tasks with noisy samples (noisy tasks), because deep learning models tend to acquire clean and simple patterns in their initial training stages [19], such that noisy samples/tasks often have large loss values. Further justification can be found in Figure 2.

We first define g[i] as the ith largest element of the set G = {g1, g2, ..., gn}, such that g[n] g[n 1] ... g[1]. Denote the task-specific loss as gi(ϕ, wi), where ϕ is the parameter of the meta-model and wi is the task-specific parameter. The proposed formulation is then given by:

min ϕ F(ϕ, w ) := 1

i=n k+1 g[i] ϕ, w [i](ϕ) s.t. w i (ϕ) = arg min wi gi(ϕ, wi),

where g[i] ϕ, w [i](ϕ) is the ith largest task-specific loss given w := w i (ϕ), ..., w n(ϕ) T , and w [i](ϕ) is the corresponding optimal task-specific parameter.

With the Lemma 1 in [42], by introducing an auxiliary variable γ, we can reformulate the problem as:

min ϕ max γ F(ϕ, w , γ) = 1

i=1 fi ϕ, w i (ϕ), γ = 1

n g i (ϕ) [g i (ϕ) γ]+ n k

s.t. w i (ϕ) = arg min wi gi(ϕ, wi).

Details about the derivation of the above formulation can be found in Appendix A.2. This formulation poses a non-convex optimization challenge for the primal variable ϕ, making it challenging to address using conventional optimization methods. Nevertheless, our proposed algorithm enables efficient resolution of this problem by reformulating the problem into a singlelevel minimax optimization problem as: minϕ,w maxγ,v L(ϕ, w, γ, v), where L(ϕ, w, γ, v) := 1 n Pn i=1 fi(ϕ, wi, γ)+λ gi(ϕ, wi) gi(ϕ, vi) is the Lagrangian function of the rank based loss function, v is an approximate optimal task-specific parameter of the lower-level problem.

Table 2: Test accuracy (%) on the Mini Image Net and the Tiered-Image Net datasets for meta-learning.

Dataset Method Clean Flip Rand

Mini MAML 64.75 52.75 52.50 Mem CS(ours) 69.25 57.50 60.25

Tiered MAML 66.25 44.75 54.25 Mem CS(ours) 67.25 57.00 59.75

Table 3: Test accuracy (%) on Mini-Image Net and Tiered-Image Net with different noisy ratio for Flip setting.

Dataset Method Noisy ratio

0 0.2 0.4 0.6 0.8

Mini MAML 64.75 59.50 56.50 52.75 42.00 Mem CS 69.25 65.00 61.75 57.50 53.50

Tiered MAML 66.25 63.75 53.25 44.75 39.00 Mem CS 67.25 66.50 62.75 57.00 54.50

(d) Figure 2: The portion of tasks being noisy during training for MAML and Mem CS on Mini-Image Net.

5.2.2 Results

Settings. We perform meta-learning experiments on few-shot learning tasks, focusing on the capability of rapid adaptation to new tasks with limited samples. Adhering to standard few-shot learning configurations, we carry out 5-ways 5-shot learning experiments on Mini-Image Net [53] (referred to as Mini in Table 2) and Tiered-Image Net [47] (referred to as Tiered in Table 2), where each task involves a 5-class classification task, with five samples per class used as training data. Since our robust meta-learning formulation is built on that of MAML [6], we compare our method with MAML on both clean dataset and noisy dataset to evaluate the effectiveness and robustness of our algorithm. In the noisy setting, we adopt a standard noisy training scheme in meta-learning, where the labels in a noisy task are randomly flipped. Specifically, we employ two label flipping strategies: Flip, where in each iteration, a certain portion of tasks (60% in Table 2) are designated as noisy, and each sample within the task is assigned to one of all labels with equal probability; and Rand, where a random noisy ratio is assigned to each task in every iteration, determining the proportion of samples to be flipped by randomly assigning another label to them. Detailed configuration of experiments can be found in Appendix B.2.

Results. Table 2 displays the test accuracies. These results show that in the presence of noisy tasks, both MAML and our Mem CS method undergo a decline in performance across both datasets, yet our approach manages to sustain a reasonable performance. To further evaluate the resilience of our Mem CS method against MAML, we executed additional experiments with escalating noise levels in the Flip scenario, with these findings detailed in Table 3. The data clearly show a performance decrease for both methodologies as the noise ratio intensifies. Nonetheless, our approach exhibits a notably more gradual decline in performance as the noise ratio escalates, especially at higher noise levels, signifying superior robustness compared to MAML.

Discussion. To show the effectiveness of our approach in facilitating robust learning, we have visualized the average losses for both clean and noisy tasks separately within the MAML training framework, as demonstrated in Figure 2. The graphical representation uncovers a consistent pattern where the losses associated with noisy tasks consistently exceed those related to clean tasks throughout the training period. This pattern underscores our approach s capacity to lessen the detrimental effects of noisy tasks. Further, we examine the noisy tasks in the update mechanism at five distinct intervals during the training phase, illustrated in Figure 2. The findings show that our methodology successfully deters the influence of noisy tasks on the meta-model s updates across both Rand and Flip scenarios, maintaining this protective measure throughout the training duration.

6 Conclusion

In this paper, we propose two fully first-order algorithms designed to address the challenges posed by multi-block minimax bilevel optimization problems: a fully single-loop algorithm, FOSL, and a

memory-efficient double-loop algorithm with cold-start initialization, Mem CS. We show that our methods can achieve comparative and even better per-block sample complexities than other methods with the same type in standard bilevel optimization. The experimental results indicate that our methods consistently demonstrate superior performance and robustness in applications on deep AUC maximization and robust meta-learning.

Acknowledgement

Yifan Yang and Kaiyi Ji were partially supported by NSF grants CCF-2311274 and ECCS-2326592. Zhaofeng Si and Siwei Lyu were partially supported by an NSF research grant IIS-2008532.

[1] Michael Arbel and Julien Mairal. Amortized implicit differentiation for stochastic bilevel optimization. ar Xiv preprint ar Xiv:2111.14580, 2021.

[2] Jerome Bracken and James T Mc Gill. Mathematical programs with optimization problems in the constraints. Operations Research, 21(1):37 44, 1973.

[3] Liam Collins, Aryan Mokhtari, and Sanjay Shakkottai. Task-robust model-agnostic metalearning. Advances in Neural Information Processing Systems, 33:18860 18871, 2020.

[4] Corinna Cortes and Mehryar Mohri. Auc optimization vs. error rate minimization. Advances in Neural Information Processing Systems, 16, 2003.

[5] Justin Domke. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, pages 318 326. PMLR, 2012.

[6] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126 1135. PMLR, 2017.

[7] Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. In International Conference on Machine Learning, pages 1165 1173. PMLR, 2017.

[8] Wei Gao, Rong Jin, Shenghuo Zhu, and Zhi-Hua Zhou. One-pass auc optimization. In International Conference on Machine Learning, pages 906 914. PMLR, 2013.

[9] Wei Gao and Zhi-Hua Zhou. On the consistency of auc pairwise optimization. ar Xiv preprint ar Xiv:1208.0645, 2012.

[10] Camille Garcin, Maximilien Servajean, Alexis Joly, and Joseph Salmon. Stochastic smoothing of the top-k calibrated hinge loss for deep imbalanced classification. In International Conference on Machine Learning, pages 7208 7222. PMLR, 2022.

[11] Guillaume Garrigos and Robert M Gower. Handbook of convergence theorems for (stochastic) gradient methods. ar Xiv preprint ar Xiv:2301.11235, 2023.

[12] Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming. ar Xiv preprint ar Xiv:1802.02246, 2018.

[13] Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. ar Xiv preprint ar Xiv:1607.05447, 2016.

[14] Riccardo Grazzi, Luca Franceschi, Massimiliano Pontil, and Saverio Salzo. On the iteration complexity of hypergradient computation. In International Conference on Machine Learning, pages 3748 3758. PMLR, 2020.

[15] Riccardo Grazzi, Massimiliano Pontil, and Saverio Salzo. Bilevel optimization with a lowerlevel contraction: Optimal sample complexity without warm-start. Journal of Machine Learning Research, 24(167):1 37, 2023.

[16] Alex Gu, Songtao Lu, Parikshit Ram, and Lily Weng. Nonconvex min-max bilevel optimization for task robust meta learning. In International Conference on Machine Learning, 2021.

[17] Alex Gu, Songtao Lu, Parikshit Ram, and Lily Weng. Min-max multi-objective bilevel optimization with applications in robust machine learning. In International Conference on Learning Representations, 2023.

[18] Zhishuai Guo, Mingrui Liu, Zhuoning Yuan, Li Shen, Wei Liu, and Tianbao Yang. Communication-efficient distributed stochastic auc maximization with deep neural networks. In International Conference on Machine Learning, pages 3864 3874. PMLR, 2020.

[19] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. Advances in Neural Information Processing Systems, 31, 2018.

[20] Pierre Hansen, Brigitte Jaumard, and Gilles Savard. New branch-and-bound rules for linear bilevel programming. SIAM Journal on scientific and Statistical Computing, 13(5):1194 1217, 1992.

[21] Ralf Herbrich. Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, pages 115 132, 2000.

[22] Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actorcritic. SIAM Journal on Optimization, 33(1):147 180, 2023.

[23] Quanqi Hu, Bokun Wang, and Tianbao Yang. A stochastic momentum method for min-max bilevel optimization. 2021.

[24] Quanqi Hu, Yongjian Zhong, and Tianbao Yang. Multi-block min-max bilevel optimization with applications in multi-task deep auc maximization. Advances in Neural Information Processing Systems, 35:29552 29565, 2022.

[25] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118 22133, 2020.

[26] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 590 597, 2019.

[27] Kaiyi Ji, Junjie Yang, and Yingbin Liang. Bilevel optimization: Convergence analysis and enhanced design. In International Conference on Machine Learning, pages 4882 4892. PMLR, 2021.

[28] Krishnateja Killamsetty, Changbin Li, Chen Zhao, Feng Chen, and Rishabh Iyer. A nested bi-level optimization framework for robust few shot learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 7176 7184, 2022.

[29] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[30] Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, and Robert D Nowak. A fully first-order method for stochastic bilevel optimization. In International Conference on Machine Learning, pages 18083 18113. PMLR, 2023.

[31] Kuang-Huei Lee, Xiaodong He, Lei Zhang, and Linjun Yang. Cleannet: Transfer learning for scalable image classifier training with label noise. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5447 5456, 2018.

[32] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Non-convex finite-sum optimization via scsg methods. Advances in Neural Information Processing Systems, 30, 2017.

[33] Renjie Liao, Yuwen Xiong, Ethan Fetaya, Lisa Zhang, Ki Jung Yoon, Xaq Pitkow, Raquel Urtasun, and Richard Zemel. Reviving and improving recurrent back-propagation. In International Conference on Machine Learning, pages 3082 3091. PMLR, 2018.

[34] Gui-Hua Lin, Mengwei Xu, and Jane J Ye. On solving simple bilevel programs with a nonconvex lower level program. Mathematical Programming, 144(1-2):277 305, 2014.

[35] Bo Liu, Mao Ye, Stephen Wright, Peter Stone, and Qiang Liu. Bome! bilevel optimization made easy: A simple first-order approach. Advances in Neural Information Processing Systems, 35:17248 17262, 2022.

[36] Chenghao Liu, Zhihao Wang, Doyen Sahoo, Yuan Fang, Kun Zhang, and Steven CH Hoi. Adaptive task sampling for meta-learning. In Computer Vision ECCV 2020: 16th European Conference, Glasgow, UK, August 23 28, 2020, Proceedings, Part XVIII 16, pages 752 769. Springer, 2020.

[37] Mingrui Liu, Zhuoning Yuan, Yiming Ying, and Tianbao Yang. Stochastic auc maximization with deep neural networks. ar Xiv preprint ar Xiv:1908.10831, 2019.

[38] Risheng Liu, Xuan Liu, Xiaoming Yuan, Shangzhi Zeng, and Jin Zhang. A value-function-based interior-point method for non-convex bi-level optimization. In International Conference on Machine Learning, pages 6882 6892. PMLR, 2021.

[39] Risheng Liu, Yaohua Liu, Shangzhi Zeng, and Jin Zhang. Towards gradient-based bilevel optimization with non-convex followers and beyond. Advances in Neural Information Processing Systems, 34:8662 8675, 2021.

[40] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.

[41] Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In International Conference on Artificial Intelligence and Statistics, pages 1540 1552. PMLR, 2020.

[42] Siwei Lyu, Yanbo Fan, Yiming Ying, and Bao-Gang Hu. Average top-k aggregate loss for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1):76 86, 2020.

[43] Soufiane Mallem, Abul Hasnat, and Amir Nakib. Efficient meta label correction based on meta learning and bi-level optimization. Engineering Applications of Artificial Intelligence, 117:105517, 2023.

[44] Yurii Nesterov and Boris T Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177 205, 2006.

[45] Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pages 737 746. PMLR, 2016.

[46] Alain Rakotomamonjy. Optimizing area under roc curve with svms. In ROCAI, pages 71 80, 2004.

[47] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B Tenenbaum, Hugo Larochelle, and Richard S Zemel. Meta-learning for semi-supervised few-shot classification. ar Xiv preprint ar Xiv:1803.00676, 2018.

[48] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, pages 4334 4343. PMLR, 2018.

[49] Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated backpropagation for bilevel optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1723 1732. PMLR, 2019.

[50] Chenggen Shi, Jie Lu, and Guangquan Zhang. An extended kuhn tucker approach for linear bilevel programming. Applied Mathematics and Computation, 162(1):51 63, 2005.

[51] Ankur Sinha, Pekka Malo, and Kalyanmoy Deb. A review on bilevel optimization: From classical to evolutionary approaches and applications. IEEE Transactions on Evolutionary Computation, 22(2):276 295, 2017.

[52] Haoliang Sun, Chenhui Guo, Qi Wei, Zhongyi Han, and Yilong Yin. Learning to rectify for robust learning with noisy labels. Pattern Recognition, 124:108467, 2022.

[53] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in neural information processing systems, 29, 2016.

[54] Xiaoyu Wang, Rui Pan, Renjie Pi, and Tong Zhang. Effective bilevel optimization via minimax reformulation. ar Xiv preprint ar Xiv:2305.13153, 2023.

[55] Zhen Wang, Guosheng Hu, and Qinghua Hu. Training noise-robust deep neural networks via meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4524 4533, 2020.

[56] Tianbao Yang and Yiming Ying. Auc maximization in the era of big data and ai: A survey. ACM Computing Surveys, 55(8):1 37, 2022.

[57] Yifan Yang, Peiyao Xiao, and Kaiyi Ji. Achieving O(ϵ 1.5) complexity in Hessian/Jacobian-free stochastic bilevel optimization. ar Xiv preprint ar Xiv:2312.03807, 2023.

[58] Huaxiu Yao, Yu Wang, Ying Wei, Peilin Zhao, Mehrdad Mahdavi, Defu Lian, and Chelsea Finn. Meta-learning with an adaptive task scheduler. Advances in Neural Information Processing Systems, 34:7497 7509, 2021.

[59] Zhuoning Yuan, Zhishuai Guo, Nitesh Chawla, and Tianbao Yang. Compositional training for end-to-end deep auc maximization. In International Conference on Learning Representations, 2021.

[60] Zhuoning Yuan, Yan Yan, Milan Sonka, and Tianbao Yang. Large-scale robust deep auc maximization: A new surrogate loss and empirical studies on medical image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3040 3049, 2021.

[61] Guoqing Zheng, Ahmed Hassan Awadallah, and Susan Dumais. Meta label correction for noisy label learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11053 11061, 2021.

A Specifications of Applications

In this section, we provide a detailed introduction to the formulation utilized in Section 5.

A.1 Deep AUC Maximization (DAM)

For a classifier model f(w), we have the AUC score function as

AUC f(w) = Pr f(w; x) f(w; x )|y = 1, y = 1 ,

where Pr(X) denote probability of an event X being true. One of the surrogate loss (AUC square loss [9]) is given by:

min w 1 n+n

c f(w; xi) f(w; xj) 2 ,

where n+ and n are the numbers of positive examples and negative examples, respectively, and c is the margin parameter. One can transfer this problem into an equivalent minimax optimization problem according to Proposition 1 in [37] by:

min w,a,b max α Φ(w, a, b, α) := 1

i=1 ϕ(w, a, b, α; xi, yi),

ϕ(w, a, b, α; xi, yi) =(1 p) f(w; xi) a 2 Iyi=1 + p f(w; xi) b 2 Iyi= 1 p(1 p)α2

+ 2α p(1 p)c + pf(w; xi) Iyi= 1 (1 p)f(w; xi) Iyi=1 ,

where a, b are margin parameters, p = n+/n. This formulation decomposed the individual examples, which is more favorable in stochastic scenarios. [59] proposed a compositional training algorithm for this problem. The compositional objective function is formulated as:

min w,a,b max α Φ(w α LAV G(w), a, b, α),

where LAV G = 1

n Pn i=1 ℓ(w; xi, yi), ℓdenotes task loss, e.g. cross-entropy in classification tasks. Consider a multi-block problem with m tasks. This problem can be reformulated as a multi-block minimax bi-level optimization problem [24]:

min w,a,b max α

j=1 Φj u j(wj), aj, bj, αj s.t. u j(wj) = arg min uj gj(uj, wj),

where gj(uj, wj) := 1

2 uj wj η LAV G(wj) 2.

A.2 Robust Meta-learning

Consider the formulation of Robust Meta-learning in Section 5.2:

min ϕ F(ϕ, w ) := 1

i=n k+1 g[i] ϕ, w i (ϕ) s.t. w i (ϕ) = arg min wi gi(ϕ, wi),

where g[n] ϕ, w n(ϕ) g[n 1] ϕ, w n 1(ϕ) ... g[1] ϕ, w 1(ϕ) denotes task-specific losses. We define g i (ϕ) := gi ϕ, w i (ϕ) for simplicity in later formulations. The summation of the bottom-k losses is equivalent to the sum of all task losses subtracted by the sum of the top-(n-k) losses:

F(ϕ, w ) = 1

i=1 g i (ϕ)

i=1 g [i](ϕ) .

With the Lemma 1 in [42], we have:

i=1 g [i](ϕ) = min γ

i=1 [g i (ϕ) γ]+ o .

Now we can convert the original upper-level problem to:

min ϕ max γ ˆF(ϕ, w , γ) := 1

g i (ϕ) [g i (ϕ) γ]+ n k

The problem of robust meta-learning is then formulated as:

min ϕ max γ ˆF(ϕ, w , γ) = 1

n fi(ϕ, w i , γ) := g i (ϕ) [g i (ϕ) γ]+ n k

s.t. w i (ϕ) = arg min wi gi(ϕ, wi),

where ϕ is the parameter of meta-model, and w = [wi, ..., wn]T is the vector of task specific parameters.

Inspired by [10], we introduce a smoothed version of the upper-level loss function by incorporating Gaussian noise into the indicator function for alignment with the assumption of our Mem CS algorithm:

F(ϕ, w , γ) = 1

n fi(ϕ, w i , γ) := g i (ϕ) [g i (ϕ) γ + ϵZ]+ n k

where ϵ > 0 is the smoothing parameter, and Z N(0, 1) is standard normal random variable.

B Implementation Details and Extra Experimental Results

B.1 Datasets Description

Deep AUC Maximization. We assess our methodology using four datasets. The first dataset, CIFAR100 [29], serves primarily for classification endeavors. Within the context of the multi-block deep AUC maximization challenge, we treat each of the 100 categories as an individual block, with samples belonging to a specific category considered positive for that block. This dataset comprises 60, 000 images, each measuring 32 32 pixels, divided into 50, 000 training and 10, 000 testing images. The Celeb A [40] dataset encompasses 202,599 facial images, each annotated with a diverse set of attributes from 40 different features. The Che Xpert [26] dataset includes 224, 316 chest radiograph images from 65, 240 patients, marked for 14 distinct observations. Adhering to the methodology proposed in [24], we employ a simplified version of Che Xpert with a reduced image resolution and omit the Fracture observation due to insufficient positive samples. Lastly, the OGBGMol PCBA [25] dataset is employed to predict molecular properties, representing each molecule as a graph with atoms as nodes and chemical bonds as edges, featuring 437, 929 such graphs annotated across 128 properties.

Robust Meta-learning. Our experiments are conducted over two popular datasets for few-shot learning: Mini-Image Net [53] and Tiered-Image Net [47]. Both datasets are subsets of the ILSVRC12 dataset. Mini-Image Net comprises 100 classes, each containing 600 images with dimensions of 84 84 pixels. The 100 classes are distributed among training, validation, and testing sets with a ratio of 64:16:20, respectively. In contrast, Tiered-Image Net is a more extensive and challenging dataset, featuring 779,165 images annotated across 608 classes. These classes are further organized into 34 categories, with 20 categories designated for training, 6 for validation, and 8 for testing.

B.2 Implementation Details

Deep AUC Maximization. For the CIFAR100 and the Celeb A datasets, we use the Res Net18 architecture. For the large-scale Che Xpert dataset, we opt for the Dense Net121 model pre-trained on Image Net. When dealing with the OGBG-Mol PCBA graph dataset, the Graph Isomorphism Network (GIN) is used for training. All experimental runs are performed using a single NVIDIA RTX 6000 GPU. Regarding hyperparameters, we set the total training epoch to 2000 for the CIFAR100 and 100 for the OGBG-Mol PCBA datasets, adjust it to 40 for Celeb A, and reduce it to 6 for Che Xpert. The learning rate for the optimal approximator v is uniformly set to ηv = 0.1 across all experiments, with ηw = ηu = ηv/λ to maintain gradient magnitude consistency between u and v. This consistency is vital due to the influence of the λ parameter in the Lagrangian function on the update process for u.

Figure A1: Test AUC score along with training epochs on the CIFAR100 (left) and the Celeb A (right) datasets.

Table A1: Comparison of average iteration time and total training time of our method and AUCCT[24] on small scale dataset (Celeb A) and large scale dataset(Che Xpert).

Method Celeb A Che Xpert

FOSL 0.55s/8.2h 0.78s/7.3h AUC-CT[24] 0.69s/10.3h 0.83s/7.7h

Table A2: Comparison between FOSL and Mem CS on robust meta-learning task. Algorithm Best Test Accuracy Model Parameter Size Average Iteration Time

Mem CS 69.25 0.433MB 3.15s FOSL 67.75 611.167MB 1.42s

Learning rate decay is applied to Celeb A starting at the 30th epoch and to Che Xpert at the 4th epoch, whereas no decay strategy is applied for CIFAR100 and OGBG-Mol PCBA.

Robust Meta-learning. We adopt the Adam optimizer to update the meta-model in the context of MAML. For the hyper-parameters of MAML, we set the learning rate of meta-model update as ηmeta = 0.02, and set the learning rate of fast adaptation as ηadapt = 0.02, with an adaptation step of 15. To be consistent with the DAM experiments, we configure the learning rate of the optimal approximator as ηv = ηadapt = 0.02, and the learning rate of w and meta-model parameter ϕ as ηϕ = ηw = ηv/λ in the implementation. In practice, setting λ to 3 results in favorable performance. All experiments are conducted on a single NVIDIA RTX 6000 GPU using a widely used lightweight model featuring 4 convolutional layers (CNN4).

B.3 Extra Results on Deep AUC Maximization

This section presents the visualization of statistics throughout the training process and compares the computational costs with our method and AUC-CT [24]. To better grasp the training dynamics, we charted the test AUC scores across various training epochs for both the CIFAR100 and Celeb A datasets, as shown in Figure A1. The findings demonstrate that our method not only exhibits enhanced generalization capabilities but also greater stability.

Moreover, to assess efficiency across varying dataset scales, we examined the average iteration times and total training time of our FOSL algorithm and AUC-CT [24] on different-sized datasets (32 32 in CIFAR100 vs. 224 224 in Che Xpert) as detailed in Table A1. Note that the training time largely depends on the implementation and hyperparameters, such as the number of sampled tasks and batch sizes per task, suggesting that computational costs can be adjusted by modifying these hyperparameters according to the dataset. The result in Table A1 shows the ability to control training costs for datasets of various scales, which is indicated by the small gap between the total training time on Celeb A and Che Xpert datasets. Additionally, our method demonstrated a faster training pace compared to AUC-CT [24] under the same training settings. Note that the implementation of AUC-CT [24] avoided the calculation of second-order matrices so that the computational cost is more controllable with an increased dataset scale.

B.4 Comparison between FOSL and Mem CS

In this section, we compare our two proposed methods within the same experimental setting, i.e. robust meta-learning. To make it compatible with our FOSL algorithm, we slightly adjusted our training setting so that the number of training tasks in the dataset is known (20000 tasks) while maintaining the same testing procedures as those used with the Mem CS algorithm. The results, including test accuracy, memory cost, and computational cost, are detailed in Table A2. The result shows that using FOSL in a robust meta-learning setting can introduce a greatly increased memory cost, which is especially significant for a small model. However, the single-loop nature of FOSL can drastically shorten the average iteration time during training. This makes the FOSL algorithm a potentially advantageous choice in scenarios involving larger base models and fewer blocks.

C Notations

The original problem we solve here is

min x Rdx max y Rdy F x, y, z (x) : = 1

i=1 fi x, y, z i (x)

i=1 Eξ fi x, y, z i (x); ξi

s.t. z i (x) = arg min z Rdz gi(x, z) = Eζ gi(x, z; ζi) .

Moreover, we define z λ,i(x) = arg minz Li(x, y (x), z, v) and y (x) = arg maxy F x, y, z (x) . For the convenience of proof, we also define

Φ(x) = F x, y (x), z (x) = 1

i=1 fi x, y (x), z i (x) .

For the notational convenience of FOSL, we define the estimators of client set It as

ht x : = 1 |It|

h ht x,i := x Li(xt, yt, zi,t, vi,t; ξt x,i) i ,

ht y : = 1 |It|

h ht y,i := yfi(xt, yt, vi,t; ξt y,i) i ,

ht z : = 1 |It|

h ht y,i := z Li(xt, yt, zi,t; ξt z,i) i ,

ht v : = 1 |It|

h ht v,i := zgi(xt, vi,t; ξt v,i) i . (4)

Since we sample tasks without replacement and our estimators are unbiased, we have the expectations of estimators as

eht x : = E[ht x] = 1

h eht x,i := x Li(xt, yt, zi,t, vi,t) i ,

eht y : = E[ht y] = 1

h eht y,i := yf(xt, yt, vi,t) i ,

eht z : = E[ht z] = 1

h eht z,i := z Li(xt, yt, zi,t, vi,t) i ,

eht v : = E[ht v] = 1

h eht v,i := zg(xt, vi,t) i . (5)

We also define the optimal Lagrangian estimator of x and its gradients as

L (x) : = 1

i=1 Li x, y (x), z λ,i(x), z i (x) ,

H (x) : = 1

h H i (x) := x L x, y (x), z λ,i(x), z i (x) i

h xfi x, y (x), z λ,i(x) + λ xgi x, z λ,i(x) xgi x, z i (x) i . (6)

D Proofs of Preliminary Lemmas

D.1 Some basic properties

Lemma D.1. Under Assumptions 4.3, 4.4, for λ 2Lf,1

µg , both Li(x, y, z, v) and L(x, y, z, v) are

2 )-strongly convex in z and Lλ,1-smooth in (x, y, z), where Lλ,1 := 3λLg,1.

Proof. Since λ 2Lf,1

Lg,1 , we have that

2 zz Li(x, y, z, v) = 2 zzfi(x, y, z) + λ 2 zzgi(x, z) λ 2 zzgi(x, z) 2 zzfi(x, y, z) λµg

2 zz L(x, y, z, v) = 2 zz F(x, y, z) + λ 2 zz G(x, z) λ 2 zz G(x, z) 2 zz F(x, y, z) λµg

2Li(x, y, z, v) = 2fi(x, y, z) + λ 2gi(x, z) λ 2gi(x, v) Lf1 + 2λLg,1 3λLg,1 =: Lλ,1,

2L(x, y, z, v) = 2F(x, y, z) + λ 2G(x, z) λ 2G(x, v) Lf1 + 2λLg,1 3λLg,1 =: Lλ,1.

Then the proof is complete.

Lemma D.2. Under Assumptions 4.3, 4.4, for λ max 2Lf,1

µg , (1 + Lg,1

µg ) L2 f,1 3µf Lg,1 , we have

z i (x) Lg,1

µg , z λ,i(x) 12Lg,1

µg and y (x) 1 + Lg,1

Proof. Recall that we define z i (x) := arg minz gi(x, z) and z λ,i(x) := arg minz Li(x, y (x), z, v). Then we have zgi x, z i (x) = 0 and z Li x, y (x), z λ,i(x), v = 0. Via implicit function theorem, we obtain

2 xzgi x, z i (x) + z i (x) T 2 zzgi x, z i (x) = 0,

2 xz Li x, y (x), z λ,i(x), v + y (x) T 2 yz Li x, y (x), z λ,i(x), v

+ z λ,i(x) T 2 zz Li x, y (x), z λ,i(x), v = 0. (7)

To measure that Lipschitz continuity of z i (x) and z λ,i(x) w.r.t. x, we take spectral norm of z i (x) and z λ,i(x) as

z i (x) = 2 xzgi x, z i (x) 2 zzgi x, z i (x) 1 (a) Lg,1

z λ,i(x) = 2 xz Li x, y (x), z λ,i(x), v 2 zz Li x, y (x), z λ,i(x), v 1

y (x) T 2 yz Li x, y, z λ,i(x), v 2 zz Li x, y (x), z λ,i(x), v 1 , (8)

where (a) uses Assumption 4.4. Similarly, for y (x), we have y F x, y (x), z (x) = 0. Via implicit function theorem, we have

h 2 xyfi x, y (x), z i (x) + y (x) T 2 yyfi x, y (x), z i (x)

+ z i (x) T 2 zyfi x, y (x), z i (x) i = 0, (9)

which indicates

h 2 xyfi x, y (x), z i (x) + z i (x) T 2 yzfi x, y (x), z i (x) i

2 yy F x, y (x), z (x) 1

h 2 xyfi x, y (x), z i (x) + z i (x) T 2 yzfi x, y (x), z i (x) i

2 yy F x, y (x), z (x) 1

(a) 1 + Lg,1

where (a) uses Assumption 4.3 and Assumption 4.4. Back to the second equation in eq. (8), with y (xt) (1 + Lg,1

µf , we have

z λ,i(x) 2 xz Li x, y (x), z λ,i(x), v + y (x) T 2 yz Li x, y, z λ,i(x), v

2 zz Li x, y (x), z λ,i(x), v 1

(a) 3λLg,1 + 1 + Lg,1

where (a) uses Lemma D.1 and (b) uses λ (1 + Lg,1

µg ) L2 f,1 3µf Lg,1 . Then the proof is complete.

Lemma D.3. Under Assumptions 4.3, 4.4, the optimal solutions z i (x), z λ,i(x) and y i (x) are L ,z-, L ,zλand L ,y-smooth respectively, where we define L ,z, L ,zλ and L ,y as

L ,z := Lg,2

2 , L ,zλ := 1 + 1 + Lg,1

µf + 12Lg,1

L ,y := 1 + Lg,1

µg + 1 + Lg,1

µf + 1 + Lg,1

for any i {1, ..., n}, where we assume λ 2Lf,1/µg, (1 + Lg,1

µg ) L2 f,1 3µf Lg,1 , (1 +

µg ) Lf,1Lf,2

3µf Lg,1 , Lf,1L ,y

6Lg,1 1 + (1 + Lg,1

µf + 12Lg,1

µg 1, (1 + Lg,1

µf + 1 Lf,1

Proof. Since z i (x) = arg minz gi(x, z), we have zgi x, z (x) = 0, which indicates that

2 xzgi x, z (x) + z (x) 2 zzgi x, z (x) = 0.

For any x1, x2 Rdx, we have

z i (x1) z i (x2)

= 2 xzgi x1, z i (x1) 2 zzgi x1, z i (x1) 1 2 xzgi x2, z i (x2) 2 zzgi x2, z i (x2) 1

2 xzgi x1, z i (x1) 2 xzgi x2, z i (x2) 2 zzgi x1, z i (x1) 1

+ 2 xzgi x2, z i (x2) 2 zzgi x1, z i (x1) 1 2 zzgi x2, z i (x2) 1

2 xzgi x1, z i (x1) 2 xzgi x2, z i (x2) 2 zzgi x1, z i (x1) 1

+ 2 xzgi x2, z i (x2) 2 zzgi x1, z i (x1) 1 2 zzgi x2, z i (x2) 1

2 xzgi x1, z i (x1) 2 xzgi x2, z i (x2)

2 zzgi x2, z i (x2) 2 zzgi x1, z i (x1)

x1 x2 + z i (x1) z i (x2)

2 x1 x2 , (10)

where (a) uses Assumption 4.3, 4.4 and (A 1 B 1) = A 1(B A)B 1; (b) follows from Lemma D.2. Next, plug x = x1 and x = x2 into eq. (9) and differentiate these two equations, then we get y (x1) T 2 yy F x1, y (x1), z i (x1) y (x2) T 2 yy F x2, y (x2), z i (x2)

= y (x1) y (x2) T 2 yy F x1, y (x1), z i (x1)

+ y (x2) T 2 yy F x1, y (x1), z i (x1) 2 yy F x2, y (x2), z i (x2) , (11)

and by using eq. (9), we get

y (x1) T 2 yyfi x1, y (x1), z i (x1) y (x2) T 2 yyfi x2, y (x2), z i (x2)

h 2 xyfi x1, y (x1), z i (x1) 2 xyfi x2, y (x2), z i (x2)

+ z i (x1) z i (x2) T 2 zyfi x1, y (x1), z i (x1)

+ z i (x1) T 2 zyfi x1, y (x1), z i (x1) 2 zyfi x1, y (x1), z i (x1) i . (12)

By combining eq. (11), eq. (12) and taking norm, we have y (x1) y (x2)

2 yy F x1, y (x1), z i (x1) 1

i=1 2 xyfi x1, y (x1), z i (x1) 2 xyfi x2, y (x2), z i (x2)

z i (x1) z i (x2) T 2 zyfi x1, y (x1), z i (x1)

z i (x1) T 2 zyfi x1, y (x1), z i (x1) 2 zyfi x1, y (x1), z i (x1)

i=1 2 yyfi x1, y (x1), z i (x1) 2 yyfi x2, y (x2), z i (x2)

µg + 1 + Lg,1

x1 x2 + y (x1) y (x2) + 1

i=1 z i (x1) z i (x2)

i=1 z i (x1) z i (x2)

(b) 1 + Lg,1

µg + 1 + Lg,1

µf + 1 + Lg,1

where (a) uses Assumption 4.4 and Lemma D.2; (b) follows from Assumption 4.4, Lemma D.2, and eq. (10). Similarly to eq. (10), from eq. (7), if we simplify the notation as

A1 = 2 xz Li x1, y (x1), z λ,i(x1), v1 + y (x1) T 2 yz Li x1, y (x1), z λ,i(x), v1 ,

B1 = 2 zz Li x1, y (x1), z λ,i(x1), v1

A2 = 2 xz Li x2, y (x2), z λ,i(x2), v2 + y (x2) T 2 yz Li x2, y (x1), z λ,i(x), v2

B2 = 2 zz Li x2, y (x2), z λ,i(x2), v2 ,

then we have

z λ,i(x1) z λ,i(x2) = A1B 1 1 A2B 1 2

(A1 A2)B 1 1 + A2(B 1 1 B 1 2 )

A1 A2 B 1 1 + A2 B 1 1 B 1 2

A1 A2 B 1 1 + A2 B 1 1 B 1 2 B1 B2 . (13) For the first term in eq. (13), via Lemma D.1, we have A1 A2

3λLg,1 x1 x2 + y (x1) y (x2) + z λ,i(x1) z λ,i(x2)

+ Lf,1 y (x1) y (x2)

+ Lf,2 y (x2) x1 x2 + y (x1) y (x2) + z λ,i(x1) z λ,i(x2)

(a) 3λLg,1 + 1 + Lg,1

x1 x2 + y (x1) y (x2) + z λ,i(x1) z λ,i(x2)

+ Lf,1 y (x1) y (x2)

1 + 1 + Lg,1

µf + 12Lg,1

x1 x2 + Lf,1L ,y x1 x2

1 + 1 + Lg,1

µf + 12Lg,1

where (a) uses Lemma D.2; (b) follows from Lemmas D.2, D.3 and λ (1 + Lg,1

µg ) Lf,1Lf,2

3µf Lg,1 ; (c)

uses λ Lf,1L ,y

6Lg,1 1 + (1 + Lg,1

µf + 12Lg,1

µg 1. By using Lemma D.1, we have B 1 1 2 λµg ,

B 1 2 2 λµg , B1 B2 3λLg,1 x1 and

A2 = 2 xz Li x2, y (x2), z λ,i(x2), v2 + y (x2) T 2 yz Li x2, y (x1), z λ,i(x), v2

2 xz Li x2, y (x2), z λ,i(x2), v2 + y (x2) 2 yzfi x2, y (x1), z λ,i(x)

(a) (Lf,1 + λLg,1) + 1 + Lg,1

(b) 2λLg,1, (15)

where (a) uses Assumption 4.4 and Lemma D.2; (b) uses λ (1 + Lg,1

µf + 1 Lf,1

Lg,1 . We also have B1 B2 =3λLg,1 x1 x2 + y (x1) y (x2) + z λ,i(x1) z λ,i(x2)

1 + 1 + Lg,1

µf + 12Lg,1

x1 x2 , (16)

where (a) uses Lemma D.2. Combining eq. (14), eq. (15), eq. (16) with the results in Lemma D.1, we have z λ,i(x1) z λ,i(x2) A1 A2 B 1 1 + A2 B 1 1 B 1 2 B1 B2

µg + 24L2 g,1 µ2g

1 + 1 + Lg,1

µf + 12Lg,1

Thus, the proof is complete.

D.2 Gap of Lower-level Optimal Points

Lemma D.4. Under Assumptions 4.3, 4.4, for any given x and , the gap between the optimal solutions of the lower-level problem z i (x) and the surrogate minimax problem z λ,i(x) can be bounded as

z λ,i(x) z i (x) Lf,0

z λ,i(x) z i (x) 1

µg + Lf,1 1 + 1 + Lg,1

1 + 1 + Lg,1

Lf,1 + Lf,0Lg,2

for any i {1, ..., n}, where we assume λ max 2Lf,1

µg , (1 + Lg,1

µg ) L2 f,1 3µf Lg,1 .

Proof. For each block, we can have that

z λ,i(x) z i (x) (a) 1

zgi x, z λ,i(x) zgi x, z i (x)

zfi(x, y (x), z λ,i(x)) (c) Lf,0

where (a) uses Assumption 4.3; (b) follows from the definition of z i (x) and z λ,i(x); (c) uses Assumption 4.4. For the second part, since zgi x, z i (x) = 0, z Li x, y (x), z λ,i(x) = 0, we have

2 xzgi x, z i (x) + z i (x) 2 zzgi x, z i (x) = 0,

2 xz Li x, y, z λ,i(x) + y (x) 2 yz Li x, y,z λ,i(x) + z λ,i(x) 2 zz Li x, y, z λ,i(x) = 0,

which indicates that

z i (x) = 2 xzgi x, z i (x) 2 zzgi x, z i (x) 1,

z λ,i(x) = 2 xz Li x, y (x), z λ,i(x) + y (x) 2 yz Li x, y, z λ,i(x) 2 zz Li x, y (x), z λ,i(x) 1

2 xz Li x, y (x), z λ,i(x) + y (x) 2 yz Li x, y, z λ,i(x)

2 zz Li x, y (x), z λ,i(x)

Then the gap can be displayed as

z λ,i(x) z i (x) 2 xzgi x, z i (x) 2 xz Li x, y (x), z λ,i(x) + y (x) 2 yz Li x, y, z λ,i(x)

2 zzgi x, z i (x) 1

2 xz Li x, y (x), z λ,i(x) + y (x) 2 yz Li x, y, z λ,i(x)

2 zzgi x, z i (x) 1 2 zz Li x, y (x), z λ,i(x)

" 2 xzgi x, z i (x) 2 xzgi x, z λ,i(x)

+ 2 xzfi x, y (x), z λ,i(x) + y (x) 2 yzfi x, y, z λ,i(x)

+ 3Lg,1 1 + 1 + Lg,1

Lf,1 + Lf,0Lg,2

Lg,2 z λ,i(x) z i (x) + 1

λ Lf,1 1 + 1 + Lg,1

1 + 1 + Lg,1

Lf,1 + Lf,0Lg,2

µg + Lf,1 1 + 1 + Lg,1

1 + 1 + Lg,1

Lf,1 + Lf,0Lg,2

where (a) can be satisfied because 2 zzgi x, z i (x) 1 2 zz Li x, y (x), z λ,i(x)

= 2 zzgi x, z i (x) 1

2 zz Li x, y (x), z λ,i(x)

λ 2 zzgi x, z i (x)

2 zz Li x, y (x), z λ,i(x)

(a.1) 2 µ2g

2 zzfi x, y (x), z λ,i(x)

+ 2 zzgi x, z λ,i(x) 2 zzgi x, z i (x) !

λ + Lg,2 z λ,i(x) z i (x)

Lf,1 + Lf,0Lg,2

where (a.1) uses Assumption 4.3 and Lemma D.1. Then, the proof is complete.

Lemma D.5. Under Assumptions 4.3, 4.4, the gap between Φ(x) and H (x) can be bounded as

Φ(x) H (x) 2 Cgap

where Cgap := 3 1 + L2 f,1 µ2g

L2 f,1 Lf,0

2 + 1 + L2 g,1 µ2g

3L2 g,1 2 Lf,0

4 and we assume λ

µg , (1 + Lg,1

µg ) L2 f,1 3µf Lg,1 .

Proof. By the definitions of F x, y (x), z (x) and H (x), we have F x, y (x), z (x) H (x) 2

i=1 fi x, y (x), z i (x) Li x, y (x), z λ,i(x), z i (x)

fi x, y (x), z i (x) Li x, y (x), z λ,i(x), z i (x) 2. (17)

For any i {1, ..., n}, we have fi x, y (x), z i (x) Li x, y (x), z λ,i(x), z i (x) 2

= xfi x, y (x), z i (x) 2 xzgi x, z i (x) [ 2 zzgi x, z i (x) ] 1 zfi x, y (x), z i (x)

xfi x, y (x), z λ,i(x) λ xgi x, z λ,i(x) xgi x, z i (x) 2

3 xfi x, y (x), z i (x) xfi x, y (x), z λ,i(x) 2

+ 3 2 xzgi x, z i (x) 2 zzgi x, z i (x) 1 zfi x, y (x), z λ,i(x) zfi x, y (x), z i (x) 2

+ 3 2 xzgi x, z i (x) 2 zzgi x, z i (x) 1 zfi x, y (x), z λ,i(x)

λ xgi x, z λ,i(x) xgi x, z i (x) 2

3 xfi x, y (x), z i (x) xfi x, y (x), z λ,i(x) 2

+ 3 2 xzgi x, z i (x) 2 zzgi x, z i (x) 1 zfi x, y (x), z λ,i(x) zfi x, y (x), z i (x) 2

+ 3 λ 2 xzgi x, z i (x) 2 zzgi x, z i (x) 1 zgi x, z λ,i(x)

λ xgi x, z λ,i(x) xgi x, z i (x) 2

(a) 3 1 + L2 g,1 µ2g

L2 f,1 z λ,i(x) z i (x) 2

+ 6λ2 2 xzgi x, z i (x) 2 zzgi x, z i (x) 1

zgi x, z λ,i(x) zgi x, z i (x) 2 zzgi x, z λ,i(x) z λ,i(x) z i (x) 2

+ 6λ2 2 xzgi x, z λ,i(x) z λ,i(x) z i (x) xgi x, z λ,i(x) xgi x, z i (x) 2

(b) 3 1 + L2 f,1 µ2g

L2 f,1 z λ,i(x) z i (x) 2 + 6λ2 1 + L2 g,1 µ2g

2 z λ,i(x) z i (x) 4

(c) 3 1 + L2 f,1 µ2g

L2 f,1 Lf,0

2 + 1 + L2 g,1 µ2g

where (a) follows from Assumption 4.3, 4.4 and eq. (7); (b) follows from Assumption 4.3, 4.4 and Lemma 1 in [44]; (c) uses Lemma D.4. The proof is finished by substituting eq. (18) into eq. (17).

Lemma D.6. Under Assumptions 4.3, 4.4, the gradient of Lagrangian function with optimal solutions H (x) is L ,1-Lipschitz continuous in x, where we define L ,1 := 1+ 12Lg,1

µg + 1+ Lg,1

Lf,1+ 1+ Lg,1

µg +Lg,2 h 1 µg

µg +Lf,1 1+(1+ Lg,1

i and we assume λ 2Lf,1/µg, (1+ Lg,1

µg ) L2 f,1 3µf Lg,1 , (1+ Lg,1

µg ) Lf,1Lf,2

3µf Lg,1 , Lf,1L ,y

6Lg,1 1+(1+

µf + 12Lg,1

µg 1, (1 + Lg,1

µf + 1 Lf,1

Proof. Recall that in eq. (6),

h xfi x, y (x), z λ,i(x) + λ xgi x, z λ,i(x) xgi x, z i (x) i .

Then we have

H (x) (a) = 1

i=1 2 xxfi(x, y (x), z λ,i(x)) + y (x) T 2 yxfi x, y (x), z λ,i(x)

+ z λ,i(x) T 2 zxfi(x, y (x), z λ,i(x)) + λ 2 xxgi(x, z λ,i(x) 2 xxgi(x, z i (x))

+ λ z λ,i(x) T 2 zxgi x, z λ,i(x) z i (x) T 2 zxgi x, z i (x) . (19)

By taking norm, we have

2 xxfi x, y (x), z λ,i(x) + 1

i=1 y (x) 2 xyfi x, y (x), z λ,i(x)

i=1 z λ,i(x) 2 xzfi x, y (x), z λ,i(x)

h 2 xxgi x, z λ,i(x) 2 xxgi x, z i (x)

+ z i (x) 2 xzgi x, z λ,i(x) 2 xzgi x, z i (x)

+ z λ,i(x) z i (x) 2 xzgi x, z λ,i(x) i

(a) 1 + 12Lg,1

µg + 1 + Lg,1

Lf,1 + λ 1 + Lg,1

Lg,2 z λ,i(x) z i (x)

+ λLg,1 z λ,i(x) z i (x)

(b) 1 + 12Lg,1

µg + 1 + Lg,1

Lf,1 + 1 + Lg,1

µg + Lf,1 1 + 1 + Lg,1

1 + 1 + Lg,1

Lf,1 + Lf,0Lg,2

where (a) uses Assumption 4.4 and Lemma D.2; (b) follows from Lemma D.4 Then, the proof is complete.

E Proofs of Theorem 4.10 and and Corollary 4.11

E.1 Descent in Objective Function

Lemma E.1. Under Assumptions 4.3, 4.4, 4.5 and Lemma D.6, for L ,1-smooth L (x), the consecutive iterates of Algorithm 1 satisfy:

E L (xt+1) L (xt)

2 E H (xt) 2 ηx

2 E eht x 2 + η2 x L ,1

2 E ht x 2 + 3ηx L2 f,1 2 E yt y (xt) 2

+ 3ηx L2 λ,1 2 E 1

zi,t z λ,i(xt) 2 + 1

vi,t z i (xt) 2

for all t {0, 1, ..., T 1}, where we assume λ 2Lf,1

Proof. Recall the definitions of L (x) and H (x) in eq. (6). By using the smoothness of L (xt) in Lemma D.6, we have that

E L (xt) + E H (xt), xt+1 xt + L ,1

2 E xt+1 xt 2

=E L (xt) ηx E H (xt), ht x + η2 x L ,1

=E L (xt) ηx H (xt),eht x + η2 x L ,1

=E L (xt) ηx

2 E H (xt) 2 ηx

2 E eht x 2 + ηx

2 E H (xt) eht x 2 + η2 x L ,1

(a) E L (xt) ηx

2 E H (xt) 2 ηx

2 E eht x 2 + η2 x L ,1

2 E ht x 2 + 3ηx L2 f,1 2 E yt y (xt)

+ 3ηx L2 λ,1 2 E 1

zi,t z λ,i(xt) + 1

vi,t z i (xt) ,

where (a) follows from

E H (xt) eht x 2

i=1 x Li xt, y (xt), z λ,i(xt), z i (xt) x Li xt, yt, zi,t, vi,t

i=1 E x Li xt, y (xt), z λ,i(xt), z i (xt) x Li xt, yt, zi,t, vi,t 2

(a.2) 3L2 f,1E yt y (xt) 2 + 3L2 λ,1E 1

zi,t z λ,i(xt) 2 + 1

vi,t z i (xt) 2 ,

and (a.1) uses Jensen inequality; (a.2) follows from Assumption 4.4 and Lemma D.1. Then, the proof is complete.

E.2 Bounds of Estimators

Lemma E.2. Under Assumptions 4.3, 4.4, 4.5, 4.6, the estimators of vi, zi, y and x can be bounded as

E ht v,i 2 2L2 g,1E vi,t z i (xt) 2 + 2σ2 g,

E ht z,i 2 4(L2 f,1 + λ2L2 g,1) E yt y (xt) 2 + E zi,t z λ,i(xt) 2 + 4(σ2 f + λ2σ2 g),

E ht y 2 σ2 f |It| + (n |It|)σ2 th (n 1)|It| + 1 + β2 th |It|

L2 f,1 E yt y (xt) 2 + 1

i=1 E vi,t z i (xt) 2 ,

E ht x 2 E eht x 2 + 3(n |It|)

(n 1)|It|(L2 f,0 + 2λ2L2 g,0) + 3

|It|(σ2 f + 2λ2σ2 g).

for any i {1, ..., n} and t {0, 1, ..., T 1}.

Proof. By using the definition of z i (xt), we can have that

E ht v,i 2 2E zgi(xt, vi,t; ξv i,t) zgi(xt, vi,t) 2 + 2E zgi(xt, vi,t) zgi(xt, z i (xt)) 2

2L2 g,1E vi,t z i (xt) 2 + 2σ2 g.

Similarly, we have

E ht z,i 2 2E z Li(xt, yt, zi,t, vi,t) z Li(xt, yt, zi,t, vi,t; ξz i,t) 2

+ 2E z Li(xt, yt, zi,t, vi,t) z Li(xt, y (xt), z λ,i(xt), vi,t) 2

4(L2 f,1 + λ2L2 g,1) E zi,t z λ,i(xt) 2 + E yt y (xt) 2 + 4(σ2 f + λ2σ2 g).

Next, for the estimator of x, we have

E ht x 2 =E 1 |It|

i It x Li(xt, yt, zi,t, vi,t; ξx i,t)

(a) =E 1 |It|

h x Li(xt, yt, zi,t, vi,t; ξx i,t) x Li(xt, yt, zi,t, vi,t) i

i It x Li(xt, yt, zi,t, vi,t)

where (a) uses unbiased estimation in Assumption 4.5. For the first part of eq. (20), since tasks are selected without replacement, we have

h x Li(xt, yt, zi,t, vi,t; ξx i,t) x Li(xt, yt, zi,t, vi,t) i

(a) = 1 |It|2 X

i It E x Li(xt, yt, zi,t, vi,t; ξx i,t) x Li(xt, yt, zi,t, vi,t) 2

3 |It|(σ2 f + 2λ2σ2 g) (21)

where (a) uses the unbiased estimation assumption in Assumption 4.5. For the second part of eq. (20), we have

i It x Li(xt, yt, zi,t, vi,t)

(a) = n(|It| 1)

|It|(n 1)E 1 n

i=1 x Li(xt, yt, zi,t, vi,t)

2 + n |It| (n 1)|It| 1

i=1 E x Li(xt, yt, zi,t, vi,t)

(b) E eht x 2 + 3(n |It|)

(n 1)|It|(L2 f,0 + 2λ2L2 g,0) (22)

where (a) used the Lemma A.1 in [32]; (b) uses Assumption 4.4. By combining eq. (22) with eq. (21), the fourth inequality is proved. Last, for the estimator of y, we have

E ht y 2 =E 1 |It|

i It yfi(xt, yt, vi,t; ξy i,t)

(a) =E 1 |It|

i It yfi(xt, yt, vi,t; ξy i,t) yfi(xt, yt, vi,t)

2 + E 1 |It|

i It yfi(xt, yt, vi,t)

(b) 1 |It|2 X

i It E yfi(xt, yt, vi,t; ξy i,t) yfi(xt, yt, vi,t) 2

+ n(|It| 1)

|It|(n 1)E 1 n

i=1 yfi(xt, yt, vi,t)

2 + n |It| (n 1)|It| 1

i=1 E yfi(xt, yt, vi,t)

(c) σ2 f |It| + (n |It|)σ2 th (n 1)|It| + n(|It| 1) + β2 th(n |It|) |It|(n 1) E 1 n

i=1 yfi(xt, yt, vi,t)

(d) σ2 f |It| + (n |It|)σ2 th (n 1)|It|

+ n(|It| 1) + β2 th(n |It|) |It|(n 1) E 1 n

i=1 yfi(xt, yt, vi,t) yfi xt, y (xt), z i (xt)

(e) σ2 f |It| + (n |It|)σ2 th (n 1)|It|

+ n(|It| 1) + β2 th(n |It|) |It|(n 1) L2 f,1 E yt y (xt) 2 + 1

i=1 E vi,t z i (xt) 2

(f) σ2 f |It| + (n |It|)σ2 th (n 1)|It| + 1 + β2 th |It|

L2 f,1 E yt y (xt) 2 + 1

i=1 E vi,t z i (xt) 2 ,

where (a) uses Assumption 4.5; (b) follows from the Lemma A.1 in [32]; (c) uses Assumption 4.5, 4.6; (d) follows from definition y (x) = arg maxy 1 n Pn i fi x, y, z i (x) ; (e) uses Assumption 4.4 and (f) uses βth 1. Then, the proof is complete.

E.3 Descent in Approximation Errors

Lemma E.3. Under Assumptions 4.3, 4.4, 4.5, 4.6, there exists δv,1, δz,1, δy,1 such that the iterates of vi,t, zi,t and yt in Algorithm 1 satisfy

h E vi,t+1 z i (xt+1) 2 E vi,t z i (xt) 2i

ηvµg + δv 1

i=1 E vi,t z i (xt) 2 + 2η2 v(1 + δv)σ2 g

+ η2 x L2 g,1 δv,1µ2g E eht x 2 + η2 x

L2 g,1 µ2g + L ,z

h E zi,t+1 z λ,i(xt+1) 2 E zi,t z λ,i(xt) 2i

i=1 E zi,t z λ,i(xt) 2 + (1 + δz) 2ηz L2 f,1 λµg + 8η2 zλ2L2 g,1 E yt y (xt) 2

+ 144η2 x L2 g,1 δz,1µ2g E eht x 2 + 144η2 x L2 g,1 µ2g + η2 x L ,zλ

E ht x 2 + 8(1 + δz)η2 zλ2σ2 g,

E yt+1 y (xt+1) 2 E yt y (xt) 2

ηyµf + η2 y(1 + δy) 1 + β2 th |It|

L2 f,1 + δy

E yt y (xt) 2

+ η2 y(1 + δy) σ2 f |It| + (n |It|)σ2 th (n 1)|It|

+ η2 x δy,1

2 L2 f,1 µ2 f E eht x 2

+ ηy L2 f,1 µf + η2 y 1 + β2 th |It|

i=1 E vi,t z i (xt) 2

2 + 1 + Lg,1

2 L2 f,1 µ2 f

for all t {0, ..., T 1}, where we define δv := δv,1 + 3η2 x L ,z

2 (L2 f,0 + 2λ2L2 g,0), δz := δv,1 +

3η2 x L ,zλ

2 (L2 f,0+2λ2L2 g,0), δy := δy,1+ 3η2 x L ,y

2 (L2 f,0+2λ2L2 g,0) and we assume λ 2Lf,1/µg, (1+

µg ) L2 f,1 3µf Lg,1 , (1+ Lg,1

µg ) Lf,1Lf,2

3µf Lg,1 , Lf,1L ,y

6Lg,1 1+(1+ Lg,1

µf + 12Lg,1

µg 1, (1+ Lg,1

Lg,1 , ηv µg 2L2 g,1 , ηz µg 32L2 g,1λ.

Proof. For the iterations of the lower-level problem, we have

E vi,t+1 z i (xt+1) 2 =E vi,t+1 z i (xt) 2 + E z i (xt) z i (xt+1) 2

+ 2E vi,t+1 z i (xt), z i (xt) z i (xt+1) . (23)

For the first term of eq. (23), we have

E vi,t+1 z i (xt) 2

=E vi,t z i (xt) ηvht v,i 2

=E vi,t z i (xt) 2 + η2 v E ht v,i 2 2ηv E vi,t z i (xt), ht v,i

=E vi,t z i (xt) 2 + η2 v E ht v,i 2 2ηv E vi,t z i (xt), zgi xt, vi,t zgi xt, z i (xt)

(a) (1 2ηvµg)E vi,t z i (xt) 2 + η2 v E ht v,i 2

(b) (1 2ηvµg + 2η2 v L2 g,1)E vi,t z i (xt) 2 + 2η2 vσ2 g

(c) (1 ηvµg)E vi,t z i (xt) 2 + 2η2 vσ2 g, (24)

where (a) follows from Assumption 4.3; (b) uses Lemma E.2; (c) results from ηv µg 2L2 g,1 . For the second term of eq. (23), we have

E z i (xt) z i (xt+1) 2 L2 g,1 µ2g E xt xt+1 2 = η2 x L2 g,1 µ2g E ht x 2. (25)

For the last term of eq. (23), we have

2E vi,t+1 z i (xt), z i (xt) z i (xt+1)

= 2E vi,t+1 z i (xt), z i (xt)(xt+1 xt)

2E vi,t+1 z i (xt), z i (xt+1) z i (xt) z i (xt)(xt+1 xt)

=2E vi,t+1 z i (xt), z i (xt)ηxeht x

2E vi,t+1 z i (xt), z i (xt+1) z i (xt) z i (xt)(xt+1 xt)

2E vi,t+1 z i (xt) E z i (xt)ηxeht x

+ 2E vi,t+1 z i (xt) E z i (xt+1) z i (xt) z i (xt)(xt+1 xt)

(a) 2E vi,t+1 z i (xt) E z i (xt)ηxeht x + E vi,t+1 z i (xt) L ,z E xt+1 xt 2

δv,1E vi,t+1 z i (xt) 2 + η2 x L2 g,1 δv,1µ2g E eht x 2 + η2 x L ,z

+ 3η2 x L ,z

2 (L2 f,0 + 2λ2L2 g,0)E vi,t+1 z i (xt) 2

(b) =δv E vi,t+1 z i (xt) 2 + η2 x L2 g,1 δv,1µ2g E eht x 2 + η2 x L ,z

2 E ht x 2, (26)

where (a) use Lemma D.3 and Lemma 1 in [44]; (b) defines δv := δv,1 + 3η2 x L ,z

2 (L2 f,0 + 2λ2L2 g,0). By plugging eq. (24), eq. (25), eq. (26) into eq. (23), we have

E vi,t+1 z i (xt+1) 2 E vi,t z i (xt) 2

( ηvµg + δv)E vi,t z i (xt) 2 + 2η2 v(1 + δv)σ2 g + η2 x L2 g,1 δv,1µ2g E eht x 2 + η2 x

L2 g,1 µ2g + L ,z

Then we get

h E vi,t+1 z i (xt+1) 2 E vi,t z i (xt) 2i

ηvµg + δv 1

i=1 E vi,t z i (xt) 2 + 2η2 v(1 + δv)σ2 g

+ η2 x L2 g,1 δv,1µ2g E eht x 2 + η2 x

L2 g,1 µ2g + L ,z

Thus, the first inequality in the lemma is proved. Similarly, we have

E zi,t+1 z λ,i(xt+1) 2 =E zi,t+1 z λ,i(xt) 2 + E z λ,i(xt) z λ,i(xt+1) 2

+ 2E zi,t+1 z λ,i(xt), z λ,i(xt) z λ,i(xt+1) . (27)

We can bound the first term in eq. (27) similarly with eq. (24) as

E zi,t+1 z λ,i(xt) 2

E zi,t z λ,i(xt) 2 + η2 z E ht z,i 2 2ηz E zi,t z λ,i(xt), ht z,i

=E zi,t z λ,i(xt) 2 + η2 z E ht z,i 2

2ηz E zi,t z λ,i(xt), z Li xt, yt, zi,t, vi,t z Li xt, yt, z λ,i(xt), vi,t

2ηz E zi,t z λ,i(xt), z Li xt, yt, z λ,i(xt), vi,t z Li xt, y (xt), z λ,i(xt), vi,t

=E zi,t z λ,i(xt) 2 + η2 z E ht z,i 2

2ηz E zi,t z λ,i(xt), z Li xt, yt, zi,t, vi,t z Li xt, yt, z λ,i(xt), vi,t

2ηz E zi,t z λ,i(xt), zfi xt, yt, z λ,i(xt) zfi xt, y (xt), z λ,i(xt)

(1 ηzλµg)E zi,t z λ,i(xt) 2 + η2 z E ht z,i 2

2 E zi,t z λ,i(xt) 2 + 2ηz

λµg E zfi xt, yt, z λ,i(xt) zfi xt, y (xt), z λ,i(xt) 2

E zi,t z λ,i(xt) 2 + η2 z E ht z,i 2 + 2ηz L2 f,1 λµg E yt y (xt) 2

(a) 1 ηzλµg

E zi,t z λ,i(xt) 2 + 2ηz L2 f,1 λµg + 8η2 zλ2L2 g,1 E yt y (xt) 2

+ 8η2 zλ2L2 g,1E zi,t z λ,i(xt) 2 + 8η2 zλ2σ2 g

(b) 1 ηzλµg

E zi,t z λ,i(xt) 2 + 2ηz L2 f,1 λµg + 8η2 zλ2L2 g,1 E yt y (xt) 2 + 8η2 zλ2σ2 g, (28)

where (a) uses Lemma E.2 and λ max Lf,1

σg ; (b) uses ηzλ µg 32L2 g,1 ; we bound the second term in eq. (27) as

E z λ,i(xt) z λ,i(xt+1) 2 144L2 g,1 µ2g E xt xt+1 2 = 144η2 x L2 g,1 µ2g E ht x 2; (29)

and we bound the last term in eq. (27) similarly with eq. (26) as

2E zi,t+1 z λ,i(xt), z λ,i(xt) z λ,i(xt+1)

δz E zi,t+1 z λ,i(xt) 2 + 144η2 x L2 g,1 δz,1µ2g E eht x 2 + η2 x L ,zλ

2 E ht x 2, (30)

where δz := δz,1 + 3η2 x L ,zλ

2 (L2 f,0 + 2λ2L2 g,0). By plugging eq. (28), eq. (29), eq. (30) into eq. (27), we have

E zi,t+1 z λ,i(xt+1) 2 E zi,t z λ,i(xt) 2

4 + δz E zi,t z λ,i(xt) 2 + (1 + δz) 2ηz L2 f,1 λµg + 8η2 zλ2L2 g,1 E yt y (xt) 2

+ 144η2 x L2 g,1 δz,1µ2g E eht x 2 + 144η2 x L2 g,1 µ2g + η2 x L ,zλ

E ht x 2 + 8(1 + δz)η2 zλ2σ2 g,

After telescoping, we obtain

h E zi,t+1 z λ,i(xt+1) 2 E zi,t z λ,i(xt) 2i

i=1 E zi,t z λ,i(xt) 2 + (1 + δz) 2ηz L2 f,1 λµg + 8η2 zλ2L2 g,1 E yt y (xt) 2

+ 144η2 x L2 g,1 δz,1µ2g E eht x 2 + 144η2 x L2 g,1 µ2g + η2 x L ,zλ

E ht x 2 + 8(1 + δz)η2 zλ2σ2 g.

Then the second inequality in the lemma is proved. Last, for yt and y (xt), we have

E yt+1 y (xt+1) 2 =E yt+1 y (xt) 2 + E y (xt) y (xt+1) 2

+ 2E yt+1 y (xt), y (xt) y (xt+1) . (31)

We can bound the first term in eq. (31) as

E yt+1 y (xt) 2

=E yt + ht y y (xt) 2

=E yt y (xt) 2 + η2 y E ht y 2 + 2ηy E yt y (xt), ht y

(a) =E yt y (xt) 2 + η2 y E ht y 2

+ 2ηy E yt y (xt), 1

i=1 yfi xt, yt, vi,t yfi xt, yt, z i (xt)

+ 2ηy E yt y (xt), 1

i=1 yfi xt, yt, z i (xt) yfi xt, y (xt), z i (xt)

(b) E yt y (xt) 2 + η2 y E ht y 2

µf E yt y (xt) 2 + 1

i=1 yfi xt, yt, vi,t yfi xt, yt, z i (xt) 2

2ηyµf E yt y (xt) 2

(c) (1 ηyµf)E yt y (xt) 2 + η2 y E ht y 2 + ηy L2 f,1 µf 1

i=1 E vi,t z i (xt) 2

2 + η2 y 1 + β2 th |It|

E yt y (xt) 2 + η2 y

σ2 f |It| + (n |It|)σ2 th (n 1)|It|

+ ηy L2 f,1 µf + η2 y 1 + β2 th |It|

i=1 E vi,t z i (xt) 2, (32)

where (a) uses the definition of y (xt) and eq. (5); (b) uses strong concavity of fi in y; (c) follows from definition of y (xt) and Assumption 4.4; (d) uses Lemma E.2. We can bound the second term in eq. (31) as

E y (xt) y (xt+1) 2 (a) 1 + Lg,1

2 L2 f,1 µ2 f E xt xt+1 2 = η2 x 1 + Lg,1

2 L2 f,1 µ2 f E ht x 2,

where (a) follows from Lemma D.2. Also, we can get the bound of the last term as

2E yt+1 y (xt), y (xt) y (xt+1)

= 2E yt+1 y (xt), y (xt)(xt+1 xt)

2E yt+1 y (xt), y (xt+1) y (xt) y (xt)(xt+1 xt)

=2E yt+1 y (xt) ηx y (xt)eht x

+ 2E yt+1 y (xt) y (xt+1) y (xt) y (xt)(xt+1 xt)

(a) =δy,1E yt+1 y (xt) 2 + η2 x δy,1

2 L2 f,1 µ2 f E eht x 2

+ E yt+1 y (xt) L ,y xt+1 xt 2

(b) δy,1E yt+1 y (xt) 2 + η2 x δy,1

2 L2 f,1 µ2 f E eht x 2

2 E yt+1 y (xt) 2 xt+1 xt 2 + L ,y

2 E xt+1 xt 2

(c) δy,1 + 3η2 x L ,y

2 (L2 f,0 + 2λ2L2 g,0) E yt+1 y (xt) 2

+ η2 x δy,1

2 L2 f,1 µ2 f E eht x 2 + η2 x L ,y

(d) δy E yt+1 y (xt) 2 + η2 x δy,1

2 L2 f,1 µ2 f E eht x 2 + η2 x L ,y

2 E ht x 2, (34)

where (a) uses Lemma D.2 and Lemma D.3; (b) use Lemma D.3 and Lemma 1 in [44]; (c) follows from Assumption 4.4; (d) defines δy = δy,1 + 3η2 x L ,y

2 (L2 f,0 + 2λ2L2 g,0). By plugging eq. (32), eq. (33), eq. (34) into eq. (34), we get

E yt+1 y (xt+1) 2 E yt y (xt) 2

ηyµf + η2 y(1 + δy) 1 + β2 th |It|

L2 f,1 + δy

E yt y (xt) 2

+ η2 y(1 + δy) σ2 f |It| + (n |It|)σ2 th (n 1)|It|

+ η2 x δy,1

2 L2 f,1 µ2 f E eht x 2

+ ηy L2 f,1 µf + η2 y 1 + β2 th |It|

i=1 E vi,t z i (xt) 2

2 + 1 + Lg,1

2 L2 f,1 µ2 f

Then the last inequality is proved. Thus, the proof is complete.

E.4 Descent in the Lyapunov Function and Proof of Theorem 4.10

We define the Lyapunov function as

Ψt :=L (xt) + Ky E yt y (xt) 2 + Kz 1 n

i=1 E zi,t z λ,i(xt) 2

i=1 E vi,t z i (xt) 2, (35)

where the coefficients are given by

3L2 f,1 2 + 216L2 g,1L2 f,1 µ2g +864L2 g,1 µg

, Kz = ηxλ2

ηzλ 54L2 g,1 µg , Kv = ηxλ2

ηv 54L2 g,1 µg ;

δy,1 = ηyµf

8 , δz,1 = ηzλµg

8 , δv,1 = ηvµg

For convenience, we define the following constants:

C1 := max L ,1

2 , 54L2 g,1 µg

L2 g,1 µ2g + L ,z

, 54L2 g,1 µg

144L2 g,1 µ2g + L ,zλ

3L2 f,1 2 + 216L2 g,1L2 f,1 µ2g + 864L2 g,1 µg

2 + 1 + Lg,1

2 L2 f,1 µf

C2 := max 62208L4 g,1 µ4g , 16

3L2 f,1 2 + 216L2 g,1L2 f,1 µ2g + 864L2 g,1 µg

2 L2 f,1 µ2 f

C3 := max 864L2 g,1σ2 g µg , 4

3L2 f,1 2 + 216L2 g,1L2 f,1 µ2g + 864L2 g,1 µg

(σ2 f + σ2 th) . (37)

We also constrain the conditions as below:

µg , (1 + Lg,1

µg ) L2 f,1 3µf Lg,1 , (1 + Lg,1

µg )Lf,1Lf,2

3µf Lg,1 , Lf,1L ,y

1 + (1 + Lg,1

µf + 12Lg,1

µf + 1 Lf,1

Lg,1 , Lf,2

Lg,2 , Lf,0

σg , 16L2 f,1 27µ2 f L2 g,1

3L2 f,1 2 + 216L2 g,1L2 f,1 µ2g + 864L2 g,1 µg

ηx 1 16C1 , ηy min µf 8(1 + β2 th)L2 f,1 , 1 (1 + β2 th)µf

, ηzλ min µg 64L2 g,1 , 4

8L2 g,1 , 2

L2 g,1 , η2 x η2y 1 12C2 , η2 x η2z 1 12C2 , η2 xλ2

η2v 1 12C2 ,

ηy min 1 16C1 , µf 36L ,y L2 g,0

, η2 xλ ηz min 1 16C1 , µf 36L ,zλL2 g,0

η2 x ηv min 1 16C1 , µf 36L ,z L2 g,0

Plugging Lemma E.1, Lemma E.3 into eq. (35) and using eq. (36), we have the descent in the Lyapunov function as

2 E H (xt) 2 + η2 xλ2

(ηzλ) + ηxλ2

C1 9(L2 g,0 + σ2 g)

+ ηxηy + ηx(ηzλ)λ2 + ηxηvλ2 C3. (39)

Proof. To simplify the problem, we assume |It| = P for t = 0, ..., T 1. By taking summation of eq. (39) from t = 0 to T 1, we get

2 E H (xt) 2 1

T (Ψ0 ΨT ) + η2 xλ2

(ηzλ) + ηxλ2

+ ηxηy + ηx(ηzλ)λ2 + ηxηvλ2 C3 (40)

By using Lemma D.5 and eq. (40), we have

t=0 E Φ(xt) 2 2

t=0 E Φ(xt) H (xt) 2 + 2

t=0 E H (xt) 2

λ2 + 4(Ψ0 ΨT )

Tηx + 4ηxλ2

(ηzλ) + ηxλ2

+ 4 ηy + (ηzλ)λ2 + ηvλ2 C3

where (a) uses eq. (40); (b) takes ηx = O(T 5

7 ), ηy = O(T 2

7 ), ηz = O(T 5

7 ), ηv = O(T 4

7 ), λ = O(T 1 7 ), which satisfies eq. (38). Thus, Theorem 4.10 is proved.

E.5 Proof of Corollary 4.11

Proof. From eq. (41), to achieve ϵ-accurate stationary point of the objective function Φ(x) in definition 4.2, we let 1

T PT 1 t=0 E Φ(xt) 2 = O(T 2

7 ) ϵ. As a result, we can see that the epochs number we need is T = O(ϵ 7

2 ). The total sample complexity is PT = O(Pϵ 7

2 ). Then, we finish the proof.

F Proofs of Theorem 4.12 and Corollary 4.13

F.1 Descent in Objective Function

Lemma F.1. Under Assumptions 4.3, 4.4, 4.5 and Lemma D.6, for L ,1-smooth L (x), the consecutive iterates of Algorithm 2 satisfy:

E L (xt+1) L (xt)

2 E H (xt) 2 ηx

2 E eht x 2 + η2 x L ,1

i It x Li(xt, yt, z K i,t, v K i,t)

+ 3ηx L2 f,1 2 E yt y (xt) 2 + 3ηx L2 λ,1 2 E 1

z K i,t z λ,i(xt) 2 + 1

v K i,t z i (xt) 2

for all t {0, 1, ..., T 1}, where we assume λ 2Lf,1

Proof. Recall the definitions of L (x) and H (x) in eq. (6). By using the smoothness of L (xt) in Lemma D.6, we have that

E L (xt+1) E L (xt) + E H (xt), xt+1 xt + L ,1

2 E xt+1 xt 2

=E L (xt) ηx E H (xt),eht x + η2 x L ,1

i It x Li(xt, yt, z K i,t, v K i,t)

=E L (xt) ηx

2 E H (xt) 2 ηx

2 E eht x 2 + ηx

2 E H (xt) eht x 2

+ η2 x L ,1

i It x Li(xt, yt, z K i,t, v K i,t)

(a) E L (xt) ηx

2 E H (xt) 2 ηx

2 E eht x 2

+ η2 x L ,1

i It x Li(xt, yt, z K i,t, v K i,t)

2 + 3ηx L2 f,1 2 E yt y (xt) 2

+ 3ηx L2 λ,1 2 E 1

z K i,t z λ,i(xt) 2 + 1

v K i,t z i (xt) 2 ,

where (a) uses

E H (xt) eht x 2

i=1 x Li xt, y (xt), z λ,i(xt), z i (xt) x Li xt, yt, z K i,t, v K i,t

i=1 E x Li xt, y (xt), z λ,i(xt), z i (xt) x Li xt, yt, z K i,t, v K i,t 2

(a.2) 3L2 f,1E yt y (xt) 2 + 3L2 λ,1E 1

z K i,t z λ,i(xt) 2 + 1

v K i,t z i (xt) 2 ,

and (a.1) uses Jensen inequality; (a.2) follows from Lemma D.1. Then, the proof is complete.

F.2 Bounds of Estimators

Lemma F.2. Under Assumptions 4.3, 4.4, 4.5, we have the bounds of the estimators of yt and xt as

i It yfi(xt, yt, v K i,t)

2 1 + β2 th |It|

L2 f,1 E yt y (xt) 2 + 1

i=1 E v K i,t z i (xt) 2

+ (n |It|)σ2 th (n 1)|It| ,

i It x Li(xt, yt, z K i,t, v K i,t)

2 E eht x 2 + 3(n |It|)

(n 1)|It|(L2 f,0 + 2λ2L2 g,0)

for any i {1, ..., n} and t {0, 1, ..., T 1}.

Proof. For the estimator of y, we have

i It yfi(xt, yt, v K i,t)

(a) = n(|It| 1)

|It|(n 1)E 1 n

i=1 yfi(xt, yt, v K i,t)

2 + n |It| (n 1)|It| 1

i=1 E yfi(xt, yt, v K i,t)

(b) (n |It|)σ2 th (n 1)|It| + n(|It| 1) + β2 th(n |It|) |It|(n 1) E 1 n

i=1 yfi(xt, yt, v K i,t)

(c) (n |It|)σ2 th (n 1)|It|

+ n(|It| 1) + β2 th(n |It|) |It|(n 1) E 1 n

i=1 yfi(xt, yt, v K i,t) yfi xt, y (xt), z i (xt)

(d) (n |It|)σ2 th (n 1)|It| + n(|It| 1) + β2 th(n |It|) |It|(n 1) L2 f,1 E yt y (xt) 2 + 1

i=1 E v K i,t z i (xt) 2

(e) (n |It|)σ2 th (n 1)|It| + 1 + β2 th |It|

L2 f,1 E yt y (xt) 2 + 1

i=1 E v K i,t z i (xt) 2 ,

where (a) follows from the Lemma A.1 in [32]; (b) uses Assumption 4.5, 4.6; (c) follows from definition y (x) = arg maxy 1 n Pn i fi x, y, z i (x) ; (d) uses Assumption 4.4 and (e) uses βth 1. Next, for the estimator of y, we have

i It x Li(xt, yt, z K i,t, v K i,t)

2 (a) = n(|It| 1)

|It|(n 1)E 1 n

i=1 x Li(xt, yt, z K i,t, v K i,t)

+ n |It| (n 1)|It| 1

i=1 E x Li(xt, yt, z K i,t, v K i,t)

E eht x 2 + 3(n |It|)

(n 1)|It|(L2 f,0 + 2λ2L2 g,0). (42)

Then, the proof is complete.

F.3 Bounds of Sub-loop Errors

Lemma F.3. Under Assumptions 4.3, 4.4, for δ R+, we assume that z i (xt) B for some B < . Then we have

max n E v K i,t z i (xt) 2, E z K i,t z λ,i(xt) 2o ϵsub

when K max 1 ηv t µg log 2 v0 i,t 2+B2

ϵsub , 1 ηz t Lf,1 log 3 z0 i,t 2+ L2 f,0 4L2 f,1 +B2

, where ηv t (0, 1 2Lg,1 ),

ηz t (0, 1 4λLg,1 ), λ Lf,1

Proof. From Lemma D.1, we have that Li is λµg

2 -strongly convex in z and we have that Li is 2λLg,1-Lipschitz continue in z when λ Lf,1

µg ; also, from Assumption 4.3, 4.4, we have that gi is µg-strongly convex in v and Lg,1-smooth in v. According to Theorem 3.6 in [11], by taking 0 < ηv t < 1 2Lg,1 and 0 < ηz t < 1 4λLg,1 , we have

E v K i,t z i (xt) 2 (1 ηv t µg)KE v0 i,t z i (xt) 2,

E z K i,t z λ,i(xt) 2 (1 ηz t λµg

2 )KE z0 i,t z λ,i(xt) 2.

To make sure E v K i,t z i (xt) 2 ϵsub and E z K i,t z λ,i(xt) 2 ϵsub for some ϵsub 0, we let

(1 ηv t µg)KE v0 i,t z i (xt) 2 2(1 ηv t µg)K v0 i,t 2 + B2 ϵsub,

(1 ηz t λµg

2 )KE z0 i,t z λ,i(xt) 2

3(1 ηz t λµg

2 )K z0 i,t 2 + E z i (xt) z λ,i(xt) 2 + E z i (xt) 2

(a) 3(1 ηz t λµg

2 )K z0 i,t 2 + L2 f,0 µ2gλ2 + B2

(b) 3(1 ηz t λµg

2 )K z0 i,t 2 + L2 f,0 4L2 f,1 + B2

where (a) uses Lemma D.4; (b) take λ 2Lf,1

µg . Both can be achieved by taking

( 1 ηv t µg log 2 v0 i,t 2 + B2

ϵsub , 1 ηz t Lf,1 log 3 z0 i,t 2 + L2 f,0 4L2 f,1 + B2

Then the proof is complete.

Lemma F.4. Under the Assumptions 4.3, 4.4, 4.5, the iterates of yt updates according to Algorithm 2 satisfy

E yt+1 y (xt+1) 2 E yt y (xt) 2

ηyµf + η2 y(1 + δy) 1 + β2 th |It|

L2 f,1 + δy

E yt y (xt) 2 + η2 y(1 + δy)(n |It|)σ2 th (n 1)|It|

+ ηy L2 f,1 µf + η2 y 1 + β2 th |It|

i=1 E vi,t z i (xt) 2

+ η2 x δy,1

2 L2 f,1 µ2 f E eht x 2

2 + 1 + Lg,1

2 L2 f,1 µ2 f

i It x Li(xt, yt, z K i,t, v K i,t)

for any t {0, ..., T 1}, where we assume λ 2Lf,1/µg, (1 + Lg,1

µg ) L2 f,1 3µf Lg,1 , (1 +

µg ) Lf,1Lf,2

3µf Lg,1 , Lf,1L ,y

6Lg,1 1 + (1 + Lg,1

µf + 12Lg,1

µg 1, (1 + Lg,1

µf + 1 Lf,1

Proof. For y and y (x), we have

E yt+1 y (xt+1) 2 =E yt+1 y (xt) 2 + E y (xt) y (xt+1) 2

+ 2E yt+1 y (xt), y (xt) y (xt+1) . (43)

We can bound the first term in eq. (43) as

E yt+1 y (xt) 2

i It yfi(xt, yt, v K i,t) y (xt)

=E yt y (xt) 2 + η2 y E 1 |It|

i It yfi(xt, yt, v K i,t)

+ 2ηy E yt y (xt), 1

i It yfi(xt, yt, v K i,t)

(a) =E yt y (xt) 2 + η2 y E 1 |It|

i It yfi(xt, yt, v K i,t)

+ 2ηy E yt y (xt), 1

i=1 yfi xt, yt, v K i,t yfi xt, yt, z i (xt)

+ 2ηy E yt y (xt), 1

i=1 yfi xt, yt, z i (xt) yfi xt, y (xt), z i (xt)

(b) E yt y (xt) 2 + η2 y E 1 |It|

i It yfi(xt, yt, v K i,t)

µf E yt y (xt) 2 + 1

i=1 yfi xt, yt, v K i,t yfi xt, yt, z i (xt)

2ηyµf E yt y (xt) 2

(c) (1 ηyµf)E yt y (xt) 2 + η2 y E 1 |It|

i It yfi(xt, yt, v K i,t)

+ ηy L2 f,1 µf 1

i=1 E v K i,t z i (xt) 2

(d) 1 ηyµf + η2 y 1 + β2 th |It|

E yt y (xt) 2 + η2 y (n |It|)σ2 th (n 1)|It|

+ ηy L2 f,1 µf + η2 y 1 + β2 th |It|

i=1 E v K i,t z i (xt) 2 (44)

where (a) uses the definition of y (xt) and eq. (5); (b) uses strong concavity of fi in y; (c) follos from definition of y (xt) and Assumption 4.4; (d) uses Lemma F.2. We can bound the second term in eq. (43) as

E y (xt) y (xt+1) 2 (a) 1 + Lg,1

2 L2 f,1 µ2 f E xt xt+1 2

= η2 x 1 + Lg,1

2 L2 f,1 µ2 f E 1 |It|

i It yfi(xt, yt, v K i,t)

where (a) follows from Lemma D.2. Also, we can get the bound of the last term as

2E yt+1 y (xt), y (xt) y (xt+1)

= 2E yt+1 y (xt), y (xt)(xt+1 xt)

2E yt+1 y (xt), y (xt+1) y (xt) y (xt)(xt+1 xt)

=2E yt+1 y (xt) ηx y (xt)eht x

+ 2E yt+1 y (xt) y (xt+1) y (xt) y (xt)(xt+1 xt)

(a) =δy,1E yt+1 y (xt) 2 + η2 x δy,1

2 L2 f,1 µ2 f E eht x 2

+ E yt+1 y (xt) L ,y xt+1 xt 2

δy,1E yt+1 y (xt) 2 + η2 x δy,1

2 L2 f,1 µ2 f E eht x 2

2 E yt+1 y (xt) 2 xt+1 xt 2 + L ,y

2 E xt+1 xt 2

(b) δy,1 + 3η2 x L ,y

2 (L2 f,0 + 2λ2L2 g,0) E yt+1 y (xt) 2

+ η2 x δy,1

2 L2 f,1 µ2 f E eht x 2 + η2 x L ,y

i It x Li(xt, yt, z K i,t, v K i,t)

(c) δy E yt+1 y (xt) 2 + η2 x δy,1

2 L2 f,1 µ2 f E eht x 2

+ η2 x L ,y

i It x Li(xt, yt, z K i,t, v K i,t)

where (a) uses Lemma D.2, D.3 and Lemma 1 in [44]; (b) follows from Assumption 4.4; (c) defines δy = δy,1 + 3η2 x L ,y

2 (L2 f,0 + 2λ2L2 g,0). By plugging eq. (44), eq. (45), eq. (46) into eq. (43), we get

E yt+1 y (xt+1) 2 E yt y (xt) 2

ηyµf + η2 y(1 + δy) 1 + β2 th |It|

L2 f,1 + δy

E yt y (xt) 2 + η2 y(1 + δy)(n |It|)σ2 th (n 1)|It|

+ ηy L2 f,1 µf + η2 y 1 + β2 th |It|

i=1 E vi,t z i (xt) 2

+ η2 x δy,1

2 L2 f,1 µ2 f E eht x 2

2 + 1 + Lg,1

2 L2 f,1 µ2 f

i It x Li(xt, yt, z K i,t, v K i,t)

Then the proof is complete.

F.4 Descent in the Lyapunov Function and Proof of Theorem 4.12

We define the Lyapunov function as

Ψ(x) := L (x) + Ky E yt y (xt) 2, (47)

where the coefficient is given by Ky = 3ηx L2 f,1 ηyµf . We also constrain the conditions as below:

δy,1 = ηyµf

4 , ηx 1 3L ,1 , ηy min 1 (1 + β2 th)µf , µf 8(1 + β2 th)L2 g,1

η2 x ηy min µg 12L ,y(L2 f,0 + 2λ2L2 g,0), µf 18L2 f,1

2 + 1 + 6Lg,1

2 L2 f,1 µ2 f

ηx ηy µ2 f 6

λ max 2Lf,1/µg, (1 + Lg,1

µg ) L2 f,1 3µf Lg,1 , (1 + Lg,1

µg )Lf,1Lf,2

1 + (1 + Lg,1

µf + 12Lg,1

1, (1 + Lg,1

µf + 1 Lf,1

Plugging Lemma F.1, Lemma E.3 into eq. (47), we have the descent in the Lyapunov function as Ψt+1 Ψt

2 E H (xt) 2 ηx

2 L2 f,1 µ2 f

2 + 1 + Lg,1

2 L2 f,1 µ2 f

i It x Li(xt, yt, z K i,t, v K i,t)

+ 2Kyη2 y (n |It|)σ2 th (n 1)|It| + 3ηx L2 λ,1 2 E 1

z K i,t z λ,i(xt) + 1

v K i,t z i (xt)

ηy L2 f,1 µf + η2 y 1 + β2 th |It|

i=1 E v K i,t z i (xt) 2

2 E H (xt) 2 ηx

2 L2 f,1 µ2 f

2 + 1 + Lg,1

2 L2 f,1 µ2 f

2 + 1 + Lg,1

2 L2 f,1 µ2 f

(n 1)|It|(L2 f,0 + 2λ2L2 g,0)

+ 2Kyη2 y (n |It|)σ2 th (n 1)|It| + 3ηx L2 λ,1 2 E 1

z K i,t z λ,i(xt) + 1

v K i,t z i (xt)

+ 4Kyηy L2 f,1 µf 1

i=1 E v K i,t z i (xt) 2

2 E H (xt) 2 + η2 x

2 + 1 + Lg,1

2 L2 f,1 µ2 f

(n 1)|It|(L2 f,0 + 2λ2L2 g,0)

+ 2Kyη2 y (n |It|)σ2 th (n 1)|It| + ηx

3L2 λ,1 + 12L4 f,1 µ2 f

2 E H (xt) 2 + η2 x 1 + ηx

C4 (n |It|)λ2

(n 1)|It| + ηxηy n |It| (n 1)|It| 6L2 f,1σ2 th µf

3L2 λ,1 + 12L4 f,1 µ2 f

where (a) uses Lemma F.2; (b) uses eq. (48) and takes

K max n 1 ηv t µg log 2E v0 i,t z i (xt) 2

ϵsub , 2 ηz t λµg log 2E z0 i,t z λ,i(xt) 2

(c) defines C4 := 3(L2 f,0+2λ2L2 g,0) λ2 max n L ,1

2 , 3L2 f,1 µf

2 + 1 + Lg,1

µg 2 L2 f,1 µ2 f

o and plugs in Ky.

F.5 Proof of Theorem 4.12

Proof. For partial block participation, we take the summation of eq. (49) from t = 0 to T 1. Then we have

2 E H (xt) 2 Ψ(x0) Ψ(x T ) + η2 x 1 + ηx

+ ηxηy n P (n 1)P 6L2 f,1σ2 th µf + ηx

3L2 λ,1 + 12L4 f,1 µ2 f

For partial block participation, By using Lemma D.5, we have

t=0 E Φ(xt) 2 2

t=0 E Φ(xt) H (xt) 2 + 2

t=0 E H (xt) 2

λ2 + 2 Ψ(x0) Ψ(x T )

Tηx + 4ηx 1 + ηx

+ ηy n P (n 1)P 24L2 f,1σ2 th µf + 4 3L2 λ,1 + 12L4 f,1 µ2 f

λ2 + 2 Ψ(x0) Ψ(x T )

Tηx + 4ηxλ2

P 24L2 f,1σ2 th µf

+ 4 3L2 λ,1 + 12L4 f,1 µ2 f

λ2 + 2 Ψ(x0) Ψ(x T )

Tηx + 4ηxλ2

P 24L2 f,1σ2 th µf

+ 4 9λ2L2 g,1 + 12L4 f,1 µ2 f

where (a) follows from eq. (50); (b) follows from 1 P < n; (c) takes ηx = O(P 1 5 T 2

3 ), ηy = O(P 1

2 ), λ = O(P 1 10 T 1 6 ), ϵsub = O(P 2

Next, for full block participation (n = P), we have the descent in the Lyapunov function for full block participation as

Ψ(xt+1) Ψ(xt) ηx

2 E H (xt) 2 + ηx

3L2 λ,1 + 12L4 f,1 µ2 f

Taking summation of eq. (52) from t = 0 to T 1, we have

2 E H (xt) 2 Ψ(x0) Ψ(x T ) + ηx

3L2 λ,1 + 12L4 f,1 µ2 f

By using Lemma D.5 and eq. (53), we have

t=0 E Φ(xt) 2 2

t=0 E Φ(xt) H (xt) 2 + 2

t=0 E H (xt) 2

λ2 + 4 Ψ(x0) Ψ(x T )

Tηx + 12 L2 λ,1 + 4L4 f,1 µ2 f

λ2 + 4 Ψ(x0) Ψ(x T )

Tηx + 12 9λ2L2 g,1 + 4L4 f,1 µ2 f

O(T 1). (54)

By taking ηx = O(1), ηy = O(1), λ = O(T 1 2 ), ϵsub = O(T 2). Then Theorem 4.12 is proved.

F.6 Proof of Corollary 4.13

Proof. For tasks participate in updates partially, by eq. (51), we can find the ϵ-stationary point in definition 4.2 once we take T = O(P 3

5 ϵ 3). Note that we set the error of sub-loop as ϵsub = O(P 2

3 ) = O(ϵ2). According to Lemma F.3, once we take ηv = O(1) and ηz = O(P 1

6 ), we have the iteration number of sub-loop as K = O log( 1

ϵ ) . Thus, we have the total sample complexity PKT = O(P 2 5 ϵ 3 log( 1

ϵ )) = e O(P 2 5 ϵ 3).

Similarly, by eq. (54), we can find the ϵ-stationary point in definition 4.2 once we take T = O(ϵ 1). Note that we set the error of sub-loop as ϵsub = O(T 2) = O(ϵ2). Once we take ηv = O(1) and ηz = O(T 1

2 ), we have the iteration number of sub-loop as K = O log( 1

ϵ ) . Thus, we have the total sample complexity n KT = O(nϵ 1 log( 1

ϵ )) = e O(nϵ 1).

Neur IPS Paper Checklist

The checklist is designed to encourage best practices for responsible machine learning research, addressing issues of reproducibility, transparency, research ethics, and societal impact. Do not remove the checklist: The papers not including the checklist will be desk rejected. The checklist should follow the references and follow the (optional) supplemental material. The checklist does NOT count towards the page limit.

Please read the checklist guidelines carefully for information on how to answer these questions. For each question in the checklist:

You should answer [Yes] , [No] , or [NA] .

[NA] means either that the question is Not Applicable for that particular paper or the relevant information is Not Available.

Please provide a short (1 2 sentence) justification right after your answer (even for NA).

The checklist answers are an integral part of your paper submission. They are visible to the reviewers, area chairs, senior area chairs, and ethics reviewers. You will be asked to also include it (after eventual revisions) with the final version of your paper, and its final version will be published with the paper.

The reviewers of your paper will be asked to use the checklist as one of the factors in their evaluation. While "[Yes] " is generally preferable to "[No] ", it is perfectly acceptable to answer "[No] " provided a proper justification is given (e.g., "error bars are not reported because it would be too computationally expensive" or "we were unable to find the license for the dataset we used"). In general, answering "[No] " or "[NA] " is not grounds for rejection. While the questions are phrased in a binary way, we acknowledge that the true answer is often more nuanced, so please just use your best judgment and write a justification to elaborate. All supporting evidence can appear either in the main paper or the supplemental material, provided in appendix. If you answer [Yes] to a question, in the justification please point to the section(s) where related material for the question can be found.

IMPORTANT, please:

Delete this instruction block, but keep the section heading Neur IPS paper checklist",

Keep the checklist subsection headings, questions/answers and guidelines below.

Do not modify the questions and only use the provided macros for your answers.

Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope?

Answer: [Yes]

Justification: The abstract and introduction in this paper accurately reflect the paper s contributions and scope.

Guidelines:

The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: The limitation of previous works is discussed in the Introduction part.

Guidelines:

The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes]

Justification: We provide complete assumption in the main text and full proof in the appendix.

Guidelines:

The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [Yes]

Justification: We provide detailed setting of the experiments in the appendix.

Guidelines:

The answer NA means that the paper does not include experiments.

If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: We provide the code as the supplementary material.

Guidelines:

The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).

Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: Data and experimental settings are provided in the experiment part in the main text.

Guidelines:

The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: Error bars are provided in the experiments in Table 1.

Guidelines:

The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: The computational resources are described in the implementation detail part in the appendix.

Guidelines:

The answer NA means that the paper does not include experiments.

The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines?

Answer: [Yes]

Justification: The authors have reviewed the Code of Ethics and make sure the research conforms with it.

Guidelines:

The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA]

Justification: This paper focuses on the theoretical analysis of multi-block minimax bilevel optimization problem, which does not have a social impact.

Guidelines:

The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA] Justification: The datasets used in this paper contain widely used real-world data; This research does not use pre-trained LLM or other generative models.

Guidelines:

The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: We have introduced all datasets with proper reference in the experiments part.

Guidelines:

The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA]

Justification: This paper does not release new assets.

Guidelines:

The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: This paper does not involve crowdsourcing nor research with human subjects. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: This paper does not involve crowdsourcing nor research with human subjects. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.