# FEDNEST: Federated Bilevel, Minimax, and Compositional Optimization

Davoud Ataee Tarzanagh 1, Mingchen Li 2, Christos Thrampoulidis 3, Samet Oymak 2

Abstract: Standard federated optimization methods successfully apply to stochastic problems with single-level structure. However, many contemporary ML problems, including adversarial robustness, hyperparameter tuning, and actor-critic methods, fall under nested bilevel programming, which subsumes minimax and compositional optimization. In this work, we propose FEDNEST: a federated alternating stochastic gradient method to address general nested problems. We establish provable convergence rates for FEDNEST in the presence of heterogeneous data and introduce variations for bilevel, minimax, and compositional optimization. FEDNEST introduces multiple innovations, including federated hypergradient computation and variance reduction to address inner-level heterogeneity. We complement our theory with experiments on hyperparameter & hyper-representation learning and minimax optimization that demonstrate the benefits of our method in practice.

1. Introduction

In the federated learning (FL) paradigm, multiple clients cooperate to learn a model under the orchestration of a central server (McMahan et al., 2017) without directly exchanging local client data with the server or other clients. The locality of data distinguishes FL from traditional distributed optimization and also motivates new methodologies to address heterogeneous data across clients. Additionally, cross-device FL across many edge devices presents further challenges, since only a small fraction of clients participate in each round and clients cannot maintain state across rounds (Kairouz et al., 2019).
1 University of Michigan (tarzanaq@umich.edu). 2 University of California, Riverside ({mli176@,oymak@ece.}ucr.edu). 3 University of British Columbia (cthrampo@ece.ubc.ca). Correspondence to: Davoud Ataee Tarzanagh. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Traditional distributed SGD methods are often unsuitable in FL and incur high communication costs. To overcome this issue, popular FL methods, such as FEDAVG (McMahan et al., 2017), use local client updates: clients update their models multiple times before communicating with the server (aka local SGD). Although FEDAVG has seen great success, recent works have exposed convergence issues in certain settings (Karimireddy et al., 2020; Hsu et al., 2019). This is due to a variety of factors, including client drift, where local models move away from globally optimal models due to objective and/or systems heterogeneity.

Existing FL methods, such as FEDAVG, are widely applied to stochastic problems with single-level structure. In contrast, many machine learning tasks, such as adversarial learning (Madry et al., 2017), meta-learning (Bertinetto et al., 2018), hyperparameter optimization (Franceschi et al., 2018), reinforcement/imitation learning (Wu et al., 2020; Arora et al., 2020), and neural architecture search (Liu et al., 2018), admit nested formulations that go beyond the standard single-level structure. Toward addressing such nested problems, bilevel optimization has received significant attention in the recent literature (Ghadimi & Wang, 2018; Hong et al., 2020; Ji et al., 2021), albeit in non-FL settings. On the other hand, federated versions have been elusive, perhaps due to the additional challenges surrounding heterogeneity, communication, and inverse Hessian approximation.
Contributions: This paper addresses these challenges and develops FEDNEST: a federated machinery for nested problems with provable convergence and lightweight communication. FEDNEST is composed of FEDINN, a federated stochastic variance reduction algorithm (FEDSVRG) that solves the inner problem while avoiding client drift, and FEDOUT, a communication-efficient federated hypergradient algorithm that solves the outer problem. Importantly, we allow both inner & outer objectives to be finite sums over heterogeneous client functions. FEDNEST runs a variant of FEDSVRG on the inner & outer variables in an alternating fashion, as outlined in Algorithm 1. We make multiple algorithmic and theoretical contributions, summarized below.

Figure 1: Depiction of federated bilevel nested optimization and a high-level summary of FEDNEST (Algorithm 1). In the outer loop, FEDIHGP uses multiple rounds of matrix-vector products to facilitate hypergradient computation while only communicating vectors. In the inner loop, FEDINN uses FEDSVRG to avoid client drift and find the unique global minimum. Both are crucial for establishing the provable convergence of FEDNEST. Highlights: communication efficiency (FEDIHGP avoids explicit Hessians; LFEDNEST uses local hypergradients), client heterogeneity (FEDINN avoids client drift), finite-sample bilevel theory (stochastic inner & outer analysis), and specific nested problems (bilevel, minimax, compositional).

Client heterogeneity: The variance reduction of FEDINN enables robustness in the sense that local models converge to the globally optimal inner model despite client drift/heterogeneity, unlike FEDAVG. While FEDINN is similar to FEDSVRG (Konečný et al., 2018) and FEDLIN (Mitra et al., 2021), we make two key contributions: (i) We
leverage the global convergence of FEDINN to ensure accurate hypergradient computation, which is crucial for our bilevel proof. (ii) We establish new convergence guarantees for single-level stochastic non-convex FEDSVRG, which are then integrated within FEDOUT.

Communication-efficient bilevel optimization: Within FEDOUT, we develop an efficient federated method for hypergradient estimation that bypasses Hessian computation. Our approach approximates the global inverse Hessian-gradient product (IHGP) via matrix-vector products computed over a few communication rounds.

LFEDNEST: To further improve communication efficiency, we additionally propose a light-FEDNEST algorithm, which computes hypergradients locally and needs only a single communication round for the outer update. Experiments reveal that LFEDNEST becomes very competitive as client functions become more homogeneous.

Unified federated nested theory: We specialize our bilevel results to minimax and compositional optimization, with emphasis on the former. For these, FEDNEST significantly simplifies and leads to faster convergence. Importantly, our results are on par with the state-of-the-art non-federated guarantees for nested optimization without additional assumptions (Table 1).

We provide extensive numerical experiments 1 on bilevel and minimax optimization problems. These demonstrate the benefits of FEDNEST and the efficiency of LFEDNEST, and shed light on trade-offs surrounding communication, computation, and heterogeneity.

2. Federated Nested Problems & FEDNEST

We first provide background on bilevel nested problems and then introduce our general federated method.

1 FEDNEST code is available at https://github.com/ucr-optml/FedNest.
Stochastic Bilevel Optimization

| | FEDNEST | ALSET | BSA | TTSA |
|---|---|---|---|---|
| batch size | $O(1)$ | $O(1)$ | $O(1)$ | $O(1)$ |
| samples in $\xi$ | $O(\kappa_g^5 \epsilon^{-2})$ | $O(\kappa_g^5 \epsilon^{-2})$ | $O(\kappa_g^6 \epsilon^{-2})$ | $O(\kappa_g^p \epsilon^{-2.5})$ |
| samples in $\zeta$ | $O(\kappa_g^9 \epsilon^{-2})$ | $O(\kappa_g^9 \epsilon^{-2})$ | $O(\kappa_g^9 \epsilon^{-3})$ | $O(\kappa_g^p \epsilon^{-2.5})$ |

Stochastic Minimax Optimization

| | FEDNEST | ALSET | SGDA | SMD |
|---|---|---|---|---|
| batch size | $O(1)$ | $O(1)$ | $O(\epsilon^{-1})$ | N.A. |
| samples | $O(\kappa_f^3 \epsilon^{-2})$ | $O(\kappa_f^3 \epsilon^{-2})$ | $O(\kappa_f^3 \epsilon^{-2})$ | $O(\kappa_f^3 \epsilon^{-2})$ |

Stochastic Compositional Optimization

| | FEDNEST | ALSET | SCGD | NASA |
|---|---|---|---|---|
| batch size | $O(1)$ | $O(1)$ | $O(1)$ | $O(1)$ |
| samples | $O(\epsilon^{-2})$ | $O(\epsilon^{-2})$ | $O(\epsilon^{-4})$ | $O(\epsilon^{-2})$ |

Table 1: Sample complexity of FEDNEST and comparable non-FL methods to find an $\epsilon$-stationary point of $f$, where $\kappa_g := \ell_{g,1}/\mu_g$ and $\kappa_f := \ell_{f,1}/\mu_f$; $\kappa_g^p$ denotes a polynomial function of $\kappa_g$. ALSET (Chen et al., 2021a), BSA (Ghadimi & Wang, 2018), TTSA (Hong et al., 2020), SGDA (Lin et al., 2020), SMD (Rafique et al., 2021), SCGD (Wang et al., 2017), and NASA (Ghadimi et al., 2020).

Notation. For a differentiable function $h(x, y) : \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \to \mathbb{R}$ in which $y = y(x) : \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}$, we denote by $\nabla h \in \mathbb{R}^{d_1}$ the gradient of $h$ as a function of $x$, and by $\nabla_x h, \nabla_y h$ the partial derivatives of $h$ with respect to $x$ and $y$, respectively. We let $\nabla^2_{xy} h$ and $\nabla^2_y h$ denote the Jacobian and Hessian of $h$, respectively. We consider FL optimization over $m$ clients and denote $\mathcal{S} = \{1, \dots, m\}$. For a vector $v \in \mathbb{R}^d$ and a matrix $M \in \mathbb{R}^{d \times d}$, we denote by $\|v\|$ and $\|M\|$ the Euclidean and spectral norms, respectively.

2.1. Preliminaries on Federated Nested Optimization

In federated bilevel learning, we consider the following nested optimization problem, as depicted in Figure 1:

$$\min_{x \in \mathbb{R}^{d_1}} f(x) = \frac{1}{m} \sum_{i=1}^m f_i(x, y^*(x)) \quad \text{subj. to} \quad y^*(x) \in \operatorname*{argmin}_{y \in \mathbb{R}^{d_2}} \frac{1}{m} \sum_{i=1}^m g_i(x, y). \tag{1a}$$

Algorithm 1 FEDNEST
1: Inputs: $K, T \in \mathbb{N}$; $(x^0, y^0) \in \mathbb{R}^{d_1+d_2}$; FEDINN, FEDOUT with stepsizes $\{(\alpha_k, \beta_k)\}_{k=0}^{K-1}$
2: for $k = 0, \dots, K-1$ do
3:   $y^{k,0} = y^k$
4:   for $t = 0, \dots, T-1$ do
5:     $y^{k,t+1}$ = FEDINN$(x^k, y^{k,t}, \beta_k)$
6:   end for
7:   $y^{k+1} = y^{k,T}$
8:   $x^{k+1}$ = FEDOUT$(x^k, y^{k+1}, \alpha_k)$
9: end for

Recall that m is the number of clients.
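To make the control flow of Algorithm 1 concrete, the following is a minimal sketch, not the paper's implementation: `fed_inn` and `fed_out` stand in for the federated optimizers defined in Sections 2.4 and 2.5, and the toy closures in the usage example are purely illustrative.

```python
def fed_nest(x0, y0, K, T, alphas, betas, fed_inn, fed_out):
    """Control flow of Algorithm 1: K outer epochs, each running T rounds
    of the inner optimizer (FEDINN) before one outer round (FEDOUT)."""
    x, y = x0, y0
    for k in range(K):
        for _ in range(T):                # inner loop: refine y toward y*(x^k)
            y = fed_inn(x, y, betas[k])
        x = fed_out(x, y, alphas[k])      # outer loop: hypergradient step on x
    return x, y

# Toy bilevel instance (illustrative): inner g(x, y) = (y - x)^2 / 2, so
# y*(x) = x; outer f(x, y) = (y - 1)^2 / 2, so the solution is x = y = 1.
inner_step = lambda x, y, b: y - b * (y - x)      # gradient step on g
outer_step = lambda x, y, a: x - a * (y - 1.0)    # hypergradient step on f
x, y = fed_nest(0.0, 0.0, K=200, T=10,
                alphas=[0.1] * 200, betas=[0.5] * 200,
                fed_inn=inner_step, fed_out=outer_step)
```

With enough inner rounds `T`, the inner estimate tracks $y^*(x)$ closely, and the outer iterate converges to the bilevel solution.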
Here, to model objective heterogeneity, each client $i$ is allowed to have its own individual outer & inner functions $(f_i, g_i)$. Moreover, we consider a general stochastic oracle model: access to the local functions $(f_i, g_i)$ is via stochastic sampling as follows:

$$f_i(x, y^*(x)) := \mathbb{E}_{\xi \sim \mathcal{C}_i}[f_i(x, y^*(x); \xi)], \qquad g_i(x, y) := \mathbb{E}_{\zeta \sim \mathcal{D}_i}[g_i(x, y; \zeta)], \tag{1b}$$

where $(\xi, \zeta) \sim (\mathcal{C}_i, \mathcal{D}_i)$ are the outer/inner sampling distributions for the $i$th client. We emphasize that for $i \neq j$, the tuples $(f_i, g_i, \mathcal{C}_i, \mathcal{D}_i)$ and $(f_j, g_j, \mathcal{C}_j, \mathcal{D}_j)$ can be different.

Example 1 (Hyperparameter tuning). Each client has local validation and training datasets associated with objectives $(f_i, g_i)_{i=1}^m$ corresponding to validation and training losses, respectively. The goal is to find hyperparameters $x$ that lead to learning model parameters $y$ that minimize the (global) validation loss.

The stochastic bilevel problem (1) subsumes two popular problem classes with nested structure: stochastic minimax and stochastic compositional optimization. Therefore, results on the general nested problem (1) also imply results for these special cases. Below, we briefly describe them.

Minimax optimization. If $g_i(x, y; \zeta) := -f_i(x, y; \xi)$ for all $i \in \mathcal{S}$, the stochastic bilevel problem (1) reduces to the stochastic minimax problem

$$\min_{x \in \mathbb{R}^{d_1}} f(x) := \max_{y \in \mathbb{R}^{d_2}} \frac{1}{m} \sum_{i=1}^m \mathbb{E}[f_i(x, y; \xi)]. \tag{2}$$

Motivated by applications in fair beamforming, training generative adversarial networks (GANs), and robust machine learning, significant efforts have been made toward solving (2), including (Daskalakis & Panageas, 2018; Gidel et al., 2018; Mokhtari et al., 2020; Thekumparampil et al., 2019).

Example 2 (GANs). We train a generative model $g_x(\cdot)$ and an adversarial model $a_y(\cdot)$ using client datasets $\mathcal{C}_i$. The local functions may, for example, take the form $f_i(x, y) = \mathbb{E}_{s \sim \mathcal{C}_i}\{\log a_y(s)\} + \mathbb{E}_{z \sim \mathcal{D}_{\text{noise}}}\{\log[1 - a_y(g_x(z))]\}$.

Compositional optimization. Suppose $f_i(x, y; \xi) := f_i(y; \xi)$ and $g_i$ is quadratic in $y$, given as $g_i(x, y; \zeta) := \frac{1}{2}\|y - r_i(x; \zeta)\|^2$.
Then, the bilevel problem (1) reduces to

$$\min_{x \in \mathbb{R}^{d_1}} f(x) = \frac{1}{m} \sum_{i=1}^m f_i(y^*(x)) \quad \text{subj. to} \quad y^*(x) = \operatorname*{argmin}_{y \in \mathbb{R}^{d_2}} \frac{1}{m} \sum_{i=1}^m g_i(x, y), \tag{3}$$

with $f_i(y^*(x)) := \mathbb{E}_{\xi \sim \mathcal{C}_i}[f_i(y^*(x); \xi)]$ and $g_i(x, y) := \mathbb{E}_{\zeta \sim \mathcal{D}_i}[g_i(x, y; \zeta)]$. Optimization problems of the form (3) occur, for example, in model-agnostic meta-learning and policy evaluation in reinforcement learning (Finn et al., 2017; Ji et al., 2020b; Dai et al., 2017; Wang et al., 2017).

Assumptions. Let $z = (x, y) \in \mathbb{R}^{d_1+d_2}$. Throughout, we make the following assumptions on the inner/outer objectives.

Assumption A (Well-behaved objectives). For all $i \in [m]$: (A1) $f_i(z)$, $\nabla f_i(z)$, $\nabla g_i(z)$, $\nabla^2 g_i(z)$ are $\ell_{f,0}$-, $\ell_{f,1}$-, $\ell_{g,1}$-, $\ell_{g,2}$-Lipschitz continuous, respectively; and (A2) $g_i(x, y)$ is $\mu_g$-strongly convex in $y$ for all $x \in \mathbb{R}^{d_1}$. Throughout, we use $\kappa_g = \ell_{g,1}/\mu_g$ to denote the condition number of the inner function $g$.

Assumption B (Stochastic samples). For all $i \in [m]$: (B1) $\nabla f_i(z; \xi)$, $\nabla g_i(z; \zeta)$, $\nabla^2 g_i(z; \zeta)$ are unbiased estimators of $\nabla f_i(z)$, $\nabla g_i(z)$, $\nabla^2 g_i(z)$, respectively; and (B2) their variances are bounded, i.e., $\mathbb{E}_{\xi}[\|\nabla f_i(z; \xi) - \nabla f_i(z)\|^2] \le \sigma_f^2$, $\mathbb{E}_{\zeta}[\|\nabla g_i(z; \zeta) - \nabla g_i(z)\|^2] \le \sigma_{g,1}^2$, and $\mathbb{E}_{\zeta}[\|\nabla^2 g_i(z; \zeta) - \nabla^2 g_i(z)\|^2] \le \sigma_{g,2}^2$ for some $\sigma_f^2$, $\sigma_{g,1}^2$, and $\sigma_{g,2}^2$.

These assumptions are common in the bilevel optimization literature (Ghadimi & Wang, 2018; Chen et al., 2021a; Ji et al., 2021). Assumption A requires that the inner and outer functions be well-behaved. Specifically, strong convexity of the inner objective is a recurring assumption in bilevel optimization theory, implying a unique solution to the inner minimization in (1).

2.2. Proposed Algorithm: FEDNEST

In this section, we develop FEDNEST, which is formally presented in Algorithm 1. The algorithm operates in two nested loops. The outer loop operates in rounds k ∈ {1, . . . , K}. Within each round, an inner loop of T iterations is executed. Given estimates xk and yk, each iteration t ∈ {1, . . .
, T} of the inner loop produces a new global model $y^{k,t+1}$ estimating the inner optimization variable $y^*(x^k)$ as the output of an optimizer FEDINN. The final estimate $y^{k+1} = y^{k,T}$ of the inner variable is then used by an optimizer FEDOUT to update the outer global model $x^{k+1}$.

The subroutines FEDINN and FEDOUT are gradient-based optimizers. Each subroutine involves a certain number of local training steps, indexed by $\nu \in \{0, \dots, \tau_i - 1\}$, that are performed at the $i$th client. The local steps of FEDINN iterate over local models $y_{i,\nu}$ of the inner variable. Accordingly, FEDOUT iterates over local models $x_{i,\nu}$ of the outer variable. A critical component of FEDOUT is a communication-efficient federated hypergradient estimation routine, which we call FEDIHGP. The implementation of FEDINN, FEDOUT, and FEDIHGP is critical to circumvent the algorithmic challenges of federated bilevel optimization. In the remainder of this section, we detail the challenges and motivate our proposed implementations. Later, in Section 3, we provide a formal convergence analysis of FEDNEST.

2.3. Key Challenge: Federated Hypergradient Estimation

FEDOUT is a gradient-based optimizer for the outer minimization in (1); thus each iteration involves computing $\nabla f(x) = (1/m) \sum_{i=1}^m \nabla f_i(x, y^*(x))$. Unlike single-level FL, the fact that the outer objective $f$ depends explicitly on the inner minimizer $y^*(x)$ introduces a new challenge. A good starting point for understanding the challenge is the following evaluation of $\nabla f(x)$ in terms of partial derivatives. The result is well known from properties of implicit functions.

Lemma 2.1. Under Assumption A, for all $i \in [m]$:

$$\nabla f_i(x, y^*(x)) = D f_i(x, y^*(x)) + I f_i(x, y^*(x)),$$

where the direct and indirect gradient components are

$$D f_i(x, y^*(x)) := \nabla_x f_i(x, y^*(x)), \tag{4a}$$

$$I f_i(x, y^*(x)) := -\nabla^2_{xy} g(x, y^*(x)) \left[\nabla^2_y g(x, y^*(x))\right]^{-1} \nabla_y f_i(x, y^*(x)).$$
(4b)

We now use the above formula to describe the two core challenges of bilevel FL optimization. First, evaluating any of the terms in (4) requires access to the minimizer $y^*(x)$ of the inner problem. On the other hand, one may at best hope for a good approximation to $y^*(x)$ produced by the inner optimization subroutine. Of course, this challenge is inherent in any bilevel optimization setting, but it is exacerbated in the FL setting because of client drift. Specifically, when clients optimize their individual (possibly different) local inner objectives, the global estimate of the inner variable produced by SGD-type methods may drift far from (a good approximation to) $y^*(x)$. We explain in Section 2.5 how FEDINN solves this issue.

The second challenge comes from the stochastic nature of the problem. Observe that the indirect component in (4b) is nonlinear in the Hessian $\nabla^2_y g(x, y^*(x))$, complicating an unbiased stochastic approximation of $\nabla f_i(x, y^*(x))$. As we expose here, solutions to this complication developed in the non-federated bilevel optimization literature are not directly applicable in the FL setting. Indeed, existing stochastic bilevel algorithms, e.g. (Ghadimi & Wang, 2018), define $\bar\nabla f(x, y) := Df(x, y) + If(x, y)$ as a surrogate of $\nabla f(x, y^*(x))$ by replacing $y^*(x)$ in definition (4) with an approximation $y$ and using the following stochastic approximations:

$$Df(x, y) \approx \nabla_x f(x, y; \bar\xi), \tag{5a}$$

$$If(x, y) \approx -\nabla^2_{xy} g(x, y; \bar\zeta_{\bar N + 1}) \left[\frac{N}{\ell_{g,1}} \prod_{n=1}^{\bar N} \left(I - \frac{1}{\ell_{g,1}} \nabla^2_y g(x, y; \bar\zeta_n)\right)\right] \nabla_y f(x, y; \bar\xi). \tag{5b}$$

Here, $\bar N$ is drawn from $\{0, \dots, N-1\}$ uniformly at random (UAR) and $\{\bar\xi, \bar\zeta_1, \dots, \bar\zeta_{\bar N + 1}\}$ are i.i.d. samples. Ghadimi & Wang (2018); Hong et al. (2020) have shown that, using (5), the inverse Hessian estimation bias decreases exponentially with the number of samples $N$. One might hope to directly leverage the above approach in a local-computation fashion by replacing the global outer function $f$ with the individual function $f_i$.
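The direct-plus-indirect decomposition of Lemma 2.1 can be sanity-checked on a toy quadratic bilevel problem where $y^*(x)$ is available in closed form. The sketch below is illustrative only (not the federated algorithm): it verifies that the implicit-function formula matches a finite-difference gradient of $F(x) = f(x, y^*(x))$.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 3, 4
B = rng.standard_normal((d2, d1))
M = rng.standard_normal((d2, d2))
A = M @ M.T + d2 * np.eye(d2)        # strongly convex inner Hessian

# Inner: g(x, y) = 0.5 y^T A y - y^T B x  =>  y*(x) = A^{-1} B x
# Outer: f(x, y) = 0.5 ||x||^2 + 0.5 ||y||^2
def y_star(x):
    return np.linalg.solve(A, B @ x)

def hypergrad(x):
    y = y_star(x)
    direct = x                                   # (4a): grad_x f
    # (4b): -grad^2_{xy} g [grad^2_y g]^{-1} grad_y f, with
    # grad^2_{xy} g = -B^T and grad^2_y g = A here
    indirect = B.T @ np.linalg.solve(A, y)
    return direct + indirect

# Finite-difference check of F(x) = f(x, y*(x))
F = lambda x: 0.5 * x @ x + 0.5 * y_star(x) @ y_star(x)
x = rng.standard_normal(d1)
eps = 1e-6
fd = np.array([(F(x + eps * e) - F(x - eps * e)) / (2 * eps)
               for e in np.eye(d1)])
assert np.allclose(hypergrad(x), fd, atol=1e-5)
```

The indirect term uses a linear solve only because the toy Hessian is tiny; the federated estimator of Section 2.4 instead approximates this inverse-Hessian product with matrix-vector recursions.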
However, note from (4b) and (5b) that the proposed stochastic approximation of the indirect gradient involves, in a nonlinear way, the global Hessian, which is not available at the client. 2 Communication efficiency is one of the core objectives of FL, making the idea of communicating Hessians between clients and the server prohibitive. Is it then possible, in an FL setting, to obtain an accurate stochastic estimate of the indirect gradient while retaining communication efficiency? In Section 2.4, we show how FEDOUT and its subroutine FEDIHGP, a federated hypergradient estimator based on matrix-vector products (and thus communication efficient), answer this question affirmatively.

2.4. Outer Optimizer: FEDOUT

This section presents the outer optimizer FEDOUT, formally described in Algorithm 2. As a subroutine of FEDNEST (see Line 9, Algorithm 1), at each round $k = 0, \dots, K-1$, FEDOUT takes the most recent global outer model $x^k$ together with the updated (by FEDINN) global inner model $y^{k+1}$ and produces an update $x^{k+1}$. To lighten notation, for a round $k$, we denote the function's input as $(x, y^+)$ (instead of $(x^k, y^{k+1})$) and the output as $x^+$ (instead of $x^{k+1}$).

For each client $i \in \mathcal{S}$, FEDOUT uses stochastic approximations of $If_i(x, y^+)$ and $Df_i(x, y^+)$, which we call $h^I_i(x, y^+)$ and $h^D_i(x, y^+)$, respectively. The specific choice of these approximations (see Line 5) is critical and is discussed in detail later in this section. Before that, we explain

2 We note that the approximation in (5) is not the only construction, and bilevel optimization can accommodate other forms of gradient surrogates (Ji et al., 2021). Yet, all of these approximations require access (in a nonlinear fashion) to the global Hessian; thus, they suffer from the same challenge in the FL setting.
Algorithm 2 $x^+$ = FEDOUT$(x, y^+, \alpha)$ for stochastic bilevel and minimax problems
1: $F_i(\cdot) \equiv \nabla_x f_i(\cdot, y^+; \cdot)$
2: $x_{i,0} = x$ and $\alpha_i \in (0, \bar\alpha]$
3: Choose $N \ge 1$ and set $p_{\bar N}$ = FEDIHGP$(x, y^+, N)$
4: for $i \in \mathcal{S}$ in parallel do
5:   $h_i = F_i(x; \xi_i) - \nabla^2_{xy} g_i(x, y^+; \zeta_i)\, p_{\bar N}$ (bilevel)
6:   $h_i = F_i(x; \xi_i)$ (minimax)
7: end for
8: $h = |\mathcal{S}|^{-1} \sum_{i \in \mathcal{S}} h_i$
9: for $i \in \mathcal{S}$ in parallel do
10:   for $\nu = 0, \dots, \tau_i - 1$ do
11:     $h_{i,\nu} = F_i(x_{i,\nu}; \xi_{i,\nu}) - F_i(x; \xi_{i,\nu}) + h$
12:     $x_{i,\nu+1} = x_{i,\nu} - \alpha_i h_{i,\nu}$
13:   end for
14: end for
15: $x^+ = |\mathcal{S}|^{-1} \sum_{i \in \mathcal{S}} x_{i,\tau_i}$

how each client uses these proxies to form local updates of the outer variable. In each round, starting from a common global model $x_{i,0} = x$, each client $i$ performs $\tau_i$ local steps (in parallel):

$$x_{i,\nu+1} = x_{i,\nu} - \alpha_i h_{i,\nu}, \tag{6}$$

and then the server aggregates the local models via $x^+ = |\mathcal{S}|^{-1} \sum_{i \in \mathcal{S}} x_{i,\tau_i}$. Here, $\alpha_i \in (0, \bar\alpha]$ is the local stepsize,

$$h_{i,\nu} := h^I(x, y^+) + h^D(x, y^+) - h^D_i(x, y^+) + h^D_i(x_{i,\nu}, y^+), \tag{7}$$

and $h(x, y^+) := |\mathcal{S}|^{-1} \sum_{i \in \mathcal{S}} h_i(x, y^+) = |\mathcal{S}|^{-1} \sum_{i \in \mathcal{S}} \big(h^D_i(x, y^+) + h^I_i(x, y^+)\big)$.

The key feature of updates (6)-(7) is exploiting past gradients (variance reduction) to account for objective heterogeneity. Indeed, the ideal update in FEDOUT would be $x_{i,\nu+1} = x_{i,\nu} - \alpha_i \big[h^I(x_{i,\nu}, y^+) + h^D(x_{i,\nu}, y^+)\big]$, using the global gradient estimates. But this requires each client $i$ to have access to both the direct and indirect gradients of all other clients, which it does not, since clients do not communicate between rounds. To overcome this issue, each client $i$ uses the global gradient estimate $h^I(x, y^+) + h^D(x, y^+)$ from the beginning of the round as a guiding direction in its local update rule. However, since both $h^D$ and $h^I$ are computed at a previous $(x, y^+)$, client $i$ makes a correction by subtracting off the stale direct gradient estimate $h^D_i(x, y^+)$ and adding its own local estimate $h^D_i(x_{i,\nu}, y^+)$.
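This guided-correction update can be sketched as follows, specialized to the case where the indirect component vanishes (as in the minimax reduction discussed below); a minimal sketch under our own naming, not the paper's code, with `grads_at_anchor` playing the role of $h^D_i(x, y^+)$ and `local_grad_fns[i]` evaluating $h^D_i(x_{i,\nu}, y^+)$.

```python
import numpy as np

def fed_out_round(x, grads_at_anchor, local_grad_fns, alpha, tau=5):
    """One outer round with the SVRG-style correction of Eqs. (6)-(7):
    global anchor gradient minus the client's stale gradient plus its
    fresh local gradient."""
    h_global = np.mean(grads_at_anchor, axis=0)   # h(x, y+): server aggregate
    new_models = []
    for i, grad_i in enumerate(local_grad_fns):
        xi = x.copy()
        for _ in range(tau):
            # correction of Eq. (7), direct component only
            h = h_global - grads_at_anchor[i] + grad_i(xi)
            xi = xi - (alpha / tau) * h
        new_models.append(xi)
    return np.mean(new_models, axis=0)            # server averaging

# Two heterogeneous clients: f_1 = (x-1)^2/2, f_2 = (x+1)^2/2; global min x = 0.
g1 = lambda z: z - 1.0
g2 = lambda z: z + 1.0
x = np.array([2.0])
for _ in range(50):
    anchors = [g1(x), g2(x)]                      # gradients at the round's anchor
    x = fed_out_round(x, anchors, [g1, g2], alpha=0.5)
```

In this toy run the correction exactly cancels the heterogeneous offsets, so every client descends the global gradient and the iterate converges to the global optimum rather than drifting toward a local minimizer.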
Our local update rule in Step 11 of Algorithm 2 is precisely of this form, i.e., $h_{i,\nu}$ approximates $h^I(x_{i,\nu}, y^+) + h^D(x_{i,\nu}, y^+)$ via (7). Note that the described local correction of FEDOUT applies only to the direct gradient component (the indirect component would require global Hessian information). An alternative approach, leading to LFEDNEST, is discussed in Section 2.6.

Algorithm 3 $p_{\bar N}$ = FEDIHGP$(x, y^+, N)$: federated approximation of the inverse-Hessian-gradient product
1: Select $\bar N \in \{0, \dots, N-1\}$ UAR.
2: Select $S_0 \subseteq \mathcal{S}$ UAR.
3: for $i \in S_0$ in parallel do
4:   $p_{i,0} = \nabla_y f_i(x, y^+; \xi_{i,0})$
5: end for
6: $p_0 = \frac{N}{\ell_{g,1}} |S_0|^{-1} \sum_{i \in S_0} p_{i,0}$
7: if $\bar N = 0$ then
8:   Return $p_{\bar N}$
9: end if
10: Select $S_1, \dots, S_{\bar N} \subseteq \mathcal{S}$ UAR.
11: for $n = 1, \dots, \bar N$ do
12:   for $i \in S_n$ in parallel do
13:     $p_{i,n} = \left(I - \frac{1}{\ell_{g,1}} \nabla^2_y g_i(x, y^+; \zeta_{i,n})\right) p_{n-1}$
14:   end for
15:   $p_n = |S_n|^{-1} \sum_{i \in S_n} p_{i,n}$
16: end for

FEDOUT applied to special nested problems. Algorithm 2 naturally allows the use of other optimizers for minimax & compositional optimization. For example, in the minimax problem (2), the bilevel gradient components are $Df_i(x, y^*(x)) = \nabla_x f_i(x, y^*(x))$ and $If_i(x, y^*(x)) = 0$ for all $i \in \mathcal{S}$. Hence, the hypergradient estimate (7) reduces to

$$h_{i,\nu} = h^D(x, y^+) - h^D_i(x, y^+) + h^D_i(x_{i,\nu}, y^+). \tag{8}$$

For the compositional problem (3), the Hessian becomes the identity matrix, the direct gradient is the zero vector, and $\nabla^2_{xy} g(x, y) = -(1/m) \sum_{i=1}^m \nabla r_i(x)^\top$. Hence, $h_i = \ell_{g,1} \nabla r_i(x)^\top p_0$ for all $i \in \mathcal{S}$. More details on these special cases are provided in Appendices D and E.

Indirect gradient estimation & FEDIHGP. Here, we address one of the key challenges in nested FL: the inverse-Hessian-gradient product. Note from (5b) that the proposed stochastic approximation of the indirect gradient involves, in a nonlinear way, the global Hessian, which is not available at the client. To get around this, we use a client sampling strategy and a recursive reformulation of (5b) so that $If_i(x, y)$ can be estimated in an efficient federated manner.
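Algorithm 3 above can be sketched as follows; a simplified sketch under an assumed client interface (the `QuadClient` class and its `grad_y_f`/`hess_y_g_vec` methods are our hypothetical stand-ins for the clients' gradient and Hessian-vector oracles), with fixed client-subset sizes for brevity.

```python
import numpy as np

class QuadClient:
    """Hypothetical client: exposes only vector-valued oracles (gradient and
    Hessian-vector product), never a full Hessian."""
    def __init__(self, H, g):
        self.H, self.g = H, g
    def grad_y_f(self, x, y):
        return self.g
    def hess_y_g_vec(self, x, y, v):
        return self.H @ v

def fed_ihgp(clients, x, y, N, lg1, rng):
    """Sketch of Algorithm 3: a randomly truncated Neumann series whose
    expectation approximates [grad^2_y g]^{-1} grad_y f, built entirely from
    averaged client matrix-vector products."""
    n_bar = rng.integers(N)                               # N_bar ~ U{0,...,N-1}
    S0 = rng.choice(len(clients), size=2, replace=False)  # sampled client subset
    p = np.mean([clients[i].grad_y_f(x, y) for i in S0], axis=0)
    p = (N / lg1) * p                                     # p_0, scaled as in Line 6
    for _ in range(n_bar):                                # Neumann recursion
        Sn = rng.choice(len(clients), size=2, replace=False)
        hvp = np.mean([clients[i].hess_y_g_vec(x, y, p) for i in Sn], axis=0)
        p = p - hvp / lg1                                 # (I - H / l_{g,1}) p
    return p

# Averaging many draws approximates the inverse-Hessian-gradient product:
H, g = np.diag([1.0, 2.0]), np.ones(2)
clients = [QuadClient(H, g) for _ in range(4)]
rng = np.random.default_rng(0)
est = np.mean([fed_ihgp(clients, None, None, 50, 2.0, rng)
               for _ in range(20000)], axis=0)            # approx H^{-1} g
```

A single call returns a noisy vector; it is the expectation over the random truncation level and client subsets that matches the inverse-Hessian-gradient product, which is exactly the unbiasedness-up-to-bias-$b$ property quantified next in Lemma 2.2.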
In particular, given $N \in \mathbb{N}$, we select $\bar N \in \{0, \dots, N-1\}$ and $S_0, \dots, S_{\bar N} \subseteq \mathcal{S}$ UAR. For all $i \in \mathcal{S}$, we then define

$$h^I_i(x, y) = -\nabla^2_{xy} g_i(x, y; \zeta_i)\, p_{\bar N}, \tag{9a}$$

where $p_{\bar N} = |S_0|^{-1} \widehat{H}_y \sum_{i \in S_0} \nabla_y f_i(x, y; \xi_{i,0})$ and $\widehat{H}_y$ is the approximate inverse Hessian

$$\widehat{H}_y := \frac{N}{\ell_{g,1}} \prod_{n=1}^{\bar N} \left(I - \frac{1}{\ell_{g,1} |S_n|} \sum_{i \in S_n} \nabla^2_y g_i(x, y; \zeta_{i,n})\right). \tag{9b}$$

The subroutine FEDIHGP provides a recursive strategy to compute $p_{\bar N}$, and FEDOUT multiplies $p_{\bar N}$ with the global Jacobian to derive an indirect gradient estimate. Importantly, these approximations require only matrix-vector products and vector communications.

Lemma 2.2. Under Assumptions A and B, the approximate inverse Hessian $\widehat{H}_y$ defined in (9b) satisfies, for any $x$ and $y$,

$$\left\| \left[\nabla^2_y g(x, y)\right]^{-1} - \mathbb{E}_W\big[\widehat{H}_y\big] \right\| \le \frac{\kappa_g}{\ell_{g,1}} \left(\frac{\kappa_g - 1}{\kappa_g}\right)^N. \tag{10}$$

Here, $W := \{S_n, \xi_i, \zeta_i, \xi_{i,0}, \zeta_{i,n} \mid i \in S_n, 0 \le n \le \bar N\}$. Further, for all $i \in \mathcal{S}$, $h^I_i(x, y)$ defined in (9a) satisfies

$$\left\| \mathbb{E}_W\big[h^I_i(x, y)\big] - If_i(x, y) \right\| \le b, \tag{11}$$

where $b := \kappa_g \ell_{f,1} \left((\kappa_g - 1)/\kappa_g\right)^N$.

2.5. Inner Optimizer: FEDINN

In FL, each client performs multiple local training steps in isolation on its own data (using, for example, SGD) before communicating with the server. Due to such local steps, FEDAVG suffers from a client-drift effect under objective heterogeneity; that is, the local iterates of each client drift off toward the minimum of its own local function. In turn, this can lead to convergence to a point different from the global optimum $y^*(x)$ of the inner problem; e.g., see (Mitra et al., 2021). This behavior is particularly undesirable in a nested optimization setting, since it directly affects the outer optimization; see, e.g., (Liu et al., 2021, Section 7). In light of this observation, we build on the recently proposed FEDLIN (Mitra et al., 2021), which improves FEDSVRG (Konečný et al., 2018), to solve the inner problem; see Algorithm 4. For each $i \in \mathcal{S}$, let $q_i(x, y)$ denote an unbiased estimate of the gradient $\nabla_y g_i(x, y)$.
In each round, starting from a common global model $y$, each client $i$ performs $\tau_i$ local SVRG-type training steps in parallel: $y_{i,\nu+1} = y_{i,\nu} - \beta_i q_{i,\nu}$, where $q_{i,\nu} := q_i(x, y_{i,\nu}) - q_i(x, y) + q(x, y)$, $\beta_i \in (0, \bar\beta]$ is the local inner stepsize, and $q(x, y) := |\mathcal{S}|^{-1} \sum_{i \in \mathcal{S}} q_i(x, y)$. We note that for the optimization problems (1), (2), and (3), $q_i(x, y_{i,\nu})$ equals $\nabla_y g_i(x, y_{i,\nu}; \zeta_{i,\nu})$, $-\nabla_y f_i(x, y_{i,\nu}; \xi_{i,\nu})$, and $y_{i,\nu} - r_i(x; \zeta_{i,\nu})$, respectively; see Appendices C-E.

Algorithm 4 $y^+$ = FEDINN$(x, y, \beta)$
1: $G_i(\cdot) \equiv \nabla_y g_i(x, \cdot)$ (bilevel), $-\nabla_y f_i(x, \cdot)$ (minimax)
2: $y_{i,0} = y$ and $\beta_i \in (0, \bar\beta]$
3: for $i \in \mathcal{S}$ in parallel do
4:   $q_i = G_i(y; \zeta_i)$
5: end for
6: $q = |\mathcal{S}|^{-1} \sum_{i \in \mathcal{S}} q_i$
7: for $i \in \mathcal{S}$ in parallel do
8:   for $\nu = 0, \dots, \tau_i - 1$ do
9:     $q_{i,\nu} = G_i(y_{i,\nu}; \zeta_{i,\nu}) - G_i(y; \zeta_{i,\nu}) + q$
10:    $y_{i,\nu+1} = y_{i,\nu} - \beta_i q_{i,\nu}$
11:  end for
12: end for
13: $y^+ = |\mathcal{S}|^{-1} \sum_{i \in \mathcal{S}} y_{i,\tau_i}$

2.6. Light-FEDNEST: Communication Efficiency via Local Hypergradients

Each FEDNEST epoch $k$ requires $2T + \bar N + 3$ communication rounds, as follows: $2T$ rounds for the SVRG updates of FEDINN, $\bar N$ iterations for the inverse Hessian approximation within FEDIHGP, and 3 additional aggregations. Note that these are vector communications; we fully avoid Hessian communication. In Appendix A, we also propose simplified variants of FEDOUT and FEDIHGP, which are tailored to homogeneous or high-dimensional FL settings. These variants can use either the local Jacobian / inverse Hessian or their approximations, and can use either SVRG or SGD.

Light-FEDNEST: Specifically, we propose LFEDNEST, where each client runs IHGP locally. This reduces the number of rounds to $T + 1$, saving $T + \bar N + 2$ rounds (see the experiments in Section 4 for a performance comparison and Appendix A for further discussion).

3. Convergence Analysis for FEDNEST

In this section, we present convergence results for FEDNEST. All proofs are relegated to Appendices C-E.

Theorem 3.1. Suppose Assumptions A and B hold.
Further, assume $\alpha^k_i = \alpha^k/\tau_i$ and $\beta^k_i = \beta^k/\tau_i$ for all $i \in \mathcal{S}$, where $\alpha^k = \min\{\bar\alpha_1, \bar\alpha_2, \bar\alpha_3, \bar\alpha\}$ for some positive constants $\bar\alpha_1, \bar\alpha_2, \bar\alpha_3, \bar\alpha$, and $\beta^k$ is independent of $K$. Then, for any $T \ge 1$, the iterates $\{(x^k, y^k)\}_{k \ge 0}$ generated by FEDNEST satisfy

$$\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\left[\|\nabla f(x^k)\|^2\right] = O\left(\bar\alpha \max(\sigma^2_{g,1}, \sigma^2_{g,2}, \sigma^2_f) + \frac{1}{\min(\bar\alpha_1, \bar\alpha_2, \bar\alpha_3)\, K} + b^2\right),$$

where $b = \kappa_g \ell_{f,1} \left((\kappa_g - 1)/\kappa_g\right)^N$ and $N$ is the input parameter to FEDIHGP.

Corollary 3.1 (Bilevel). Under the same conditions as in Theorem 3.1, if $N = O(\kappa_g \log K)$ and $T = O(\kappa_g^4)$, then

$$\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\left[\|\nabla f(x^k)\|^2\right] = O\left(\frac{\kappa_g^4}{K} + \frac{\kappa_g^{2.5}}{\sqrt{K}}\right).$$

For an $\epsilon$-accurate stationary point, we need $K = O(\kappa_g^5 \epsilon^{-2})$.

Above, we choose $N \propto \kappa_g \log K$ to guarantee $b^2 \lesssim 1/\sqrt{K}$. In contrast, we use $T \propto \kappa_g^4$ inner SVRG epochs. From Section 2.6, this implies that the communication cost is dominated by the SVRG epochs, i.e., $O(\kappa_g^4)$ rounds. From Corollary 3.1, we remark that FEDNEST matches the guarantees of centralized alternating SGD methods, such as ALSET (Chen et al., 2021a) and BSA (Ghadimi & Wang, 2018), despite the federated setting, i.e., the communication challenge, heterogeneity in the client objectives, and device heterogeneity.

3.1. Minimax Federated Learning

We focus on special features of federated minimax problems and customize the general results to yield improved convergence guarantees for this special case. Recall from (2) that $g_i(x, y) = -f_i(x, y)$, which implies that $b = 0$, and, following Assumption A, $f_i(x, y)$ is $\mu_f$-strongly concave in $y$ for all $x$.

Corollary 3.2 (Minimax). Denote $\kappa_f = \ell_{f,1}/\mu_f$. Assume the same conditions as in Theorem 3.1 and $T = O(\kappa_f)$. Then,

$$\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\left[\|\nabla f(x^k)\|^2\right] = O\left(\frac{\kappa_f^2}{K} + \frac{\kappa_f}{\sqrt{K}}\right).$$

Corollary 3.2 implies that for the minimax problem, the convergence rate of FEDNEST to a stationary point of $f$ is $O(1/\sqrt{K})$. Again, we note that this matches the convergence rate of non-FL algorithms (see also Table 1), such as SGDA (Lin et al., 2020) and SMD (Rafique et al., 2021).

3.2.
Compositional Federated Learning

Observe that in the compositional problem (3), the outer function is $f_i(x, y; \xi) = f_i(y; \xi)$ and the inner function is $g_i(x, y; \zeta) = \frac{1}{2}\|y - r_i(x; \zeta)\|^2$ for all $i \in \mathcal{S}$. Hence, $b = 0$ and $\kappa_g = 1$.

Corollary 3.3 (Compositional). Under the same conditions as in Theorem 3.1, if we select $T = 1$, then

$$\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\left[\|\nabla f(x^k)\|^2\right] = O\left(\frac{1}{\sqrt{K}}\right).$$

Corollary 3.3 implies that for the compositional problem (3), the convergence rate of FEDNEST to a stationary point of $f$ is $O(1/\sqrt{K})$. This matches the convergence rate of non-federated stochastic algorithms such as SCGD (Wang et al., 2017) and NASA (Ghadimi et al., 2020) (Table 1).

3.3. Single-Level Federated Learning

Building upon the general results for stochastic non-convex nested problems, we establish new convergence guarantees for single-level stochastic non-convex federated SVRG, which is integrated within FEDOUT. Note that in the single-level setting, the optimization problem (1) reduces to

$$\min_{x \in \mathbb{R}^{d_1}} f(x) = \frac{1}{m} \sum_{i=1}^m f_i(x) \tag{13}$$

with $f_i(x) := \mathbb{E}_{\xi \sim \mathcal{C}_i}[f_i(x; \xi)]$, where $\xi \sim \mathcal{C}_i$ is the sampling distribution for the $i$th client. We make the following assumptions on (13), which are counterparts of Assumptions A and B.

Assumption C (Lipschitz continuity). For all $i \in [m]$, $\nabla f_i(x)$ is $L_f$-Lipschitz continuous.

Assumption D (Stochastic samples). For all $i \in [m]$, $\nabla f_i(x; \xi)$ is an unbiased estimator of $\nabla f_i(x)$ and its variance is bounded, i.e., $\mathbb{E}_\xi[\|\nabla f_i(x; \xi) - \nabla f_i(x)\|^2] \le \sigma_f^2$.

Theorem 3.2 (Single-Level). Suppose Assumptions C and D hold. Further, assume $\alpha^k_i = \alpha^k/\tau_i$ for all $i \in \mathcal{S}$, where $\alpha^k = \min\{\bar\alpha_1, \bar\alpha\}$ for some $\bar\alpha_1, \bar\alpha > 0$. Then,

$$\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\left[\|\nabla f(x^k)\|^2\right] = O\left(\frac{\Delta_f}{\bar\alpha_1 K} + \bar\alpha \sigma_f^2\right),$$

where $\Delta_f := f(x^0) - \mathbb{E}[f(x^K)]$.

Theorem 3.2 extends recent results by (Mitra et al., 2021) from the stochastic strongly convex setting to the stochastic non-convex setting. The above rate is also consistent with existing single-level non-FL guarantees (Ghadimi & Lan, 2013).

4.
Numerical Experiments

In this section, we numerically investigate the impact of several attributes of our algorithms on a hyper-representation problem (Franceschi et al., 2018), a hyperparameter optimization problem for loss function tuning (Li et al., 2021), and a federated minimax optimization problem.

4.1. Hyper-Representation Learning

Modern approaches in meta-learning, such as MAML (Finn et al., 2017) and Reptile (Nichol & Schulman, 2018), learn representations (shared across all tasks) in a bilevel manner.

Figure 2: Hyper-representation experiments on a 2-layer MLP and the MNIST dataset. (a) Comparison between FEDNEST and LFEDNEST. (b) SVRG in FEDINN provides better convergence and stability. (c) Larger $\tau$ in FEDOUT provides better performance.

Similarly, the hyper-representation problem optimizes a classification model in a two-phased process. The outer objective optimizes the model backbone to obtain better feature representations on validation data. The inner problem optimizes a header for downstream classification tasks on training data. In this experiment, we use a 2-layer multilayer perceptron (MLP) with 200 hidden units. The outer problem optimizes the hidden layer with 157,000 parameters, and the inner problem optimizes the output layer with 2,010 parameters. We study both i.i.d. and non-i.i.d. partitionings of the MNIST data, exactly following FEDAVG (McMahan et al., 2017), and split each client's data evenly into train and validation datasets. Thus, each client has 300 train and 300 validation samples.
Figure 2 demonstrates the impact on test accuracy of several important components of FEDNEST. Figure 2a compares FEDNEST and LFEDNEST. Both algorithms perform well in the i.i.d. setup, while in the non-i.i.d. setup, FEDNEST achieves i.i.d.-level performance, significantly outperforming LFEDNEST. These findings are in line with our discussion in Section 2.6. LFEDNEST saves on communication rounds compared to FEDNEST and performs well on homogeneous clients. However, for heterogeneous clients, the isolation of the local Hessian in LFEDNEST (see Algorithm 5 in Appendix A) degrades test performance. Next, Figure 2b demonstrates the importance of SVRG in the FEDINN algorithm for heterogeneous data (as predicted by our theoretical considerations in Section 2.5). To further clarify the algorithmic difference in Figures 2b and 3a, we use FEDNESTSGD to denote the FEDNEST algorithm in which SGD is used in FEDINN. Finally, Figure 2c elucidates the role of the local epoch count τ in FEDOUT: larger τ saves on communication and improves test performance by enabling faster convergence.

4.2. Loss Function Tuning on an Imbalanced Dataset

We use bilevel optimization to tune a loss function for learning an imbalanced MNIST dataset. We aim to maximize the class-balanced validation accuracy (which helps minority/tail classes).

[Figure 3: Loss function tuning on a 3-layer MLP and imbalanced MNIST dataset to maximize class-balanced test accuracy. The brown dashed line is the accuracy of non-federated bilevel optimization (Li et al., 2021), and the black dashed line is the accuracy without tuning the loss function. (a) FEDNEST achieves similar performance to centralized bilevel loss function tuning. (b) SVRG in FEDINN provides better convergence and stability, especially in the non-i.i.d. setup.]
Following the problem formulation in (Li et al., 2021), we tune the so-called VS-loss function (Kini et al., 2021) in a federated setting. In particular, we first create a long-tail imbalanced MNIST dataset by exponentially decreasing the number of examples per class (e.g., class 0 has 6,000 samples, class 1 has 3,597 samples, and finally class 9 has only 60 samples). We partition the dataset across 100 clients, again following FEDAVG (McMahan et al., 2017), in both i.i.d. and non-i.i.d. setups. Different from the hyper-representation experiment, we employ an 80%-20% train-validation split on each client and use a 3-layer MLP model with 200 and 100 hidden units, respectively. It is worth noting that, in this problem, the outer objective f (aka validation cost) depends on the hyperparameter x only through the optimal model parameters y*(x); thus, the direct gradient Df_i(x, y*(x)) is zero for all i ∈ S. Figure 3 displays test accuracy vs. epochs/rounds for our federated bilevel algorithms. The horizontal dashed lines serve as centralized baselines: brown depicts the accuracy reached by bilevel optimization in the non-FL setting, and black depicts the accuracy without any loss tuning. Compared to these, Figure 3a shows that FEDNEST achieves near non-federated performance. In Figure 3b, we investigate the key role of SVRG in FEDINN by comparing it with a possible alternative implementation that uses SGD-type updates. The figure confirms our discussion in Section 2.5: SVRG offers significant performance gains that are pronounced by client heterogeneity.

[Figure 4: FEDNEST converges linearly despite heterogeneity. LFEDNEST slightly outperforms FEDAVG-S. Curves are shown for FEDAVG-S, LFEDNEST, and FEDNEST with s = 1 and s = 10.]

4.3.
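The quoted per-class counts are consistent with a standard exponential long-tail profile n_k = n₀ · ρ^{k/9} with head size n₀ = 6000 and imbalance ratio ρ = 60/6000 = 0.01 (the formula is our inference from the quoted counts; the paper does not spell it out):

```python
# Long-tail class sizes: n_k = n0 * rho**(k/(C-1)), an exponential profile
# inferred from the quoted counts (6000, 3597, ..., 60) for C = 10 classes.
n0, rho, num_classes = 6000, 60 / 6000, 10

counts = [round(n0 * rho ** (k / (num_classes - 1))) for k in range(num_classes)]
print(counts[0], counts[1], counts[-1])  # 6000 3597 60
```

The middle classes then interpolate geometrically between 6,000 and 60 samples.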
Federated Minimax Problem

We conduct experiments on the minimax problem (2) with

f_i(x, y) := −(1/2)‖y‖² − b_iᵀy + yᵀA_i x + (λ/2)‖x‖²

to compare the standard FEDAVG saddle-point (FEDAVG-S) method, which updates (x, y) simultaneously (Hou et al., 2021), with our alternating approaches (LFEDNEST and FEDNEST). This is a saddle-point formulation of min_{x ∈ R^{d1}} (1/m) Σ_{i=1}^{m} ‖A_i x − b_i‖². We set λ = 10, b_i = b̃_i − (1/m) Σ_{i=1}^{m} b̃_i, and A_i = t_i I, where b̃_i ∼ N(0, s²I_d) and t_i is drawn uniformly at random over (0, 0.1). Figure 4 shows that LFEDNEST and FEDNEST outperform FEDAVG-S thanks to their alternating nature. FEDNEST significantly improves upon the convergence of LFEDNEST by controlling client drift. To our knowledge, FEDNEST is the only alternating federated SVRG method for minimax problems.

5. Related Work

Federated learning. FEDAVG was first introduced by McMahan et al. (2017), who showed that it can dramatically reduce communication costs. For identical clients, FEDAVG coincides with local SGD (Zinkevich et al., 2010), which has been analyzed in many works (Stich, 2019; Yu et al., 2019; Wang & Joshi, 2018). Recently, many variants of FEDAVG have been proposed to tackle issues such as convergence and client drift. Examples include FEDPROX (Li et al., 2020b), SCAFFOLD (Karimireddy et al., 2020), FEDSPLIT (Pathak & Wainwright, 2020), FEDNOVA (Wang et al., 2020), and, most closely relevant to us, FEDLIN (Mitra et al., 2021). A few recent studies are also devoted to the extension of FEDAVG to minimax optimization (Rasouli et al., 2020; Deng et al., 2020) and compositional optimization (Huang et al., 2021). In contrast to these methods, FEDNEST makes alternating SVRG updates between the global variables x and y, and yields sample complexity bounds and batch size choices that are on par with the non-FL guarantees (Table 1). Evaluations in Appendix H.1 reveal that both alternating updates and SVRG provide a performance boost over these prior approaches. Bilevel optimization.
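As a numeric sanity check on the saddle-point reduction (using the objective as reconstructed above, with toy dimensions and illustrative A, b, x): the concave-in-y objective is maximized at y* = A_i x − b_i, and the maximized value recovers the least-squares loss (1/2)‖A_i x − b_i‖² plus the regularizer:

```python
import numpy as np

# Saddle-point check: for f(x, y) = -0.5*||y||^2 - b^T y + y^T A x + 0.5*lam*||x||^2,
# the inner maximizer is y* = A x - b, and f(x, y*) equals
# 0.5*||A x - b||^2 + 0.5*lam*||x||^2.  Dimensions/data are illustrative.
rng = np.random.default_rng(0)
d = 5
A = 0.05 * np.eye(d)           # A_i = t_i * I with t_i in (0, 0.1)
b = rng.normal(size=d)
x = rng.normal(size=d)
lam = 10.0

y_star = A @ x - b             # argmax over y of the concave objective
f_val = (-0.5 * y_star @ y_star - b @ y_star + y_star @ (A @ x)
         + 0.5 * lam * x @ x)
target = 0.5 * np.linalg.norm(A @ x - b) ** 2 + 0.5 * lam * x @ x
assert np.isclose(f_val, target)
```

This is why solving the minimax problem (2) with this f_i solves the regularized federated least-squares problem.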
This class of problems was first introduced by Bracken & McGill (1973), and since then different types of approaches have been proposed; see (Sinha et al., 2017; Liu et al., 2021) for surveys. Earlier works (Aiyoshi & Shimizu, 1984; Lv et al., 2007) reduced the bilevel problem to a single-level optimization problem. However, the reduced problem is still difficult to solve due to, for example, a large number of constraints. Recently, more efficient gradient-based algorithms have been proposed that estimate the hypergradient ∇f(x) through iterative updates (Maclaurin et al., 2015; Franceschi et al., 2017; Domke, 2012; Pedregosa, 2016). Asymptotic and non-asymptotic analyses of bilevel optimization have been provided in (Franceschi et al., 2018; Shaban et al., 2019; Liu et al., 2020) and (Ghadimi & Wang, 2018; Hong et al., 2020), respectively. There is also a line of work focusing on minimax optimization (Nemirovski, 2004; Daskalakis & Panageas, 2018) and compositional optimization (Wang et al., 2017). Closely related to our work are (Lin et al., 2020; Rafique et al., 2021; Chen et al., 2021a) and (Ghadimi et al., 2020; Chen et al., 2021a), which provide non-asymptotic analyses of SGD-type methods for minimax and compositional problems with a nonconvex outer objective, respectively. A more in-depth discussion of related work is given in Appendix B. We summarize the complexities of different methods for FL/non-FL bilevel optimization in Table 1.

6. Conclusions

We presented a new class of federated algorithms for solving general nested stochastic optimization problems spanning bilevel and minimax formulations. FEDNEST runs a variant of federated SVRG on the inner & outer variables in an alternating fashion. We established provable convergence rates for FEDNEST under arbitrary client heterogeneity and introduced variations for minimax and compositional problems and for improved communication efficiency (LFEDNEST).
We showed that, to achieve an ε-stationary point of the nested problem, FEDNEST requires O(ε⁻²) samples in total, which matches the complexity of the non-federated nested algorithms in the literature.

Acknowledgements

Davoud Ataee Tarzanagh was supported by ARO YIP award W911NF1910027 and NSF CAREER award CCF-1845076. Christos Thrampoulidis was supported by NSF Grant Numbers CCF-2009030 and HDR-1934641, and an NSERC Discovery Grant. Mingchen Li and Samet Oymak were supported by the NSF CAREER award CCF-2046816, a Google Research Scholar award, and ARO grant W911NF2110312.

References

Aiyoshi, E. and Shimizu, K. A solution method for the static constrained Stackelberg problem via penalty method. IEEE Transactions on Automatic Control, 29(12):1111–1114, 1984.

Al-Khayyal, F. A., Horst, R., and Pardalos, P. M. Global optimization of concave functions subject to quadratic constraints: an application in nonlinear bilevel programming. Annals of Operations Research, 34(1):125–147, 1992.

Arora, S., Du, S., Kakade, S., Luo, Y., and Saunshi, N. Provable representation learning for imitation learning via bi-level optimization. In International Conference on Machine Learning, pp. 367–376. PMLR, 2020.

Barazandeh, B., Huang, T., and Michailidis, G. A decentralized adaptive momentum method for solving a class of min-max optimization problems. Signal Processing, 189:108245, 2021a.

Barazandeh, B., Tarzanagh, D. A., and Michailidis, G. Solving a class of non-convex min-max games using adaptive momentum methods. In ICASSP 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3625–3629. IEEE, 2021b.

Basu, D., Data, D., Karakus, C., and Diggavi, S. Qsparse-local-SGD: Distributed SGD with quantization, sparsification and local computations. In Advances in Neural Information Processing Systems, pp. 14668–14679, 2019.

Bertinetto, L., Henriques, J. F., Torr, P. H., and Vedaldi, A.
Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136, 2018.

Bracken, J. and McGill, J. T. Mathematical programs with optimization problems in the constraints. Operations Research, 21(1):37–44, 1973.

Brown, G. W. Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 13(1):374–376, 1951.

Chen, T., Sun, Y., and Yin, W. Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems. Advances in Neural Information Processing Systems, 34, 2021a.

Chen, T., Sun, Y., and Yin, W. A single-timescale stochastic bilevel optimization method. arXiv preprint arXiv:2102.04671, 2021b.

Dagréou, M., Ablin, P., Vaiter, S., and Moreau, T. A framework for bilevel optimization that enables stochastic and global variance reduction algorithms. arXiv preprint arXiv:2201.13409, 2022.

Dai, B., He, N., Pan, Y., Boots, B., and Song, L. Learning from conditional distributions via dual embeddings. In Artificial Intelligence and Statistics, pp. 1458–1467. PMLR, 2017.

Daskalakis, C. and Panageas, I. The limit points of (optimistic) gradient descent in min-max optimization. arXiv preprint arXiv:1807.03907, 2018.

Deng, Y. and Mahdavi, M. Local stochastic gradient descent ascent: Convergence analysis and communication efficiency. In International Conference on Artificial Intelligence and Statistics, pp. 1387–1395. PMLR, 2021.

Deng, Y., Kamani, M. M., and Mahdavi, M. Distributionally robust federated averaging. Advances in Neural Information Processing Systems, 33:15111–15122, 2020.

Diakonikolas, J., Daskalakis, C., and Jordan, M. Efficient methods for structured nonconvex-nonconcave min-max optimization. In International Conference on Artificial Intelligence and Statistics, pp. 2746–2754. PMLR, 2021.

Domke, J. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, pp. 318–326. PMLR, 2012.

Edmunds, T. A. and Bard, J. F.
Algorithms for nonlinear bilevel mathematical programs. IEEE Transactions on Systems, Man, and Cybernetics, 21(1):83–89, 1991.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135. PMLR, 2017.

Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. Forward and reverse gradient-based hyperparameter optimization. In International Conference on Machine Learning, pp. 1165–1173. PMLR, 2017.

Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pp. 1568–1577. PMLR, 2018.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

Ghadimi, S. and Wang, M. Approximation methods for bilevel programming. arXiv preprint arXiv:1802.02246, 2018.

Ghadimi, S., Ruszczynski, A., and Wang, M. A single timescale stochastic approximation method for nested stochastic optimization. SIAM Journal on Optimization, 30(1):960–979, 2020.

Gidel, G., Berard, H., Vignoud, G., Vincent, P., and Lacoste-Julien, S. A variational inequality perspective on generative adversarial networks. arXiv preprint arXiv:1802.10551, 2018.

Grazzi, R., Franceschi, L., Pontil, M., and Salzo, S. On the iteration complexity of hypergradient computation. In International Conference on Machine Learning, pp. 3748–3758. PMLR, 2020.

Guo, Z., Xu, Y., Yin, W., Jin, R., and Yang, T. On stochastic moving-average estimators for non-convex optimization. arXiv preprint arXiv:2104.14840, 2021.

Hansen, P., Jaumard, B., and Savard, G. New branch-and-bound rules for linear bilevel programming. SIAM Journal on Scientific and Statistical Computing, 13(5):1194–1217, 1992.

Hong, M., Wai, H.-T., Wang, Z., and Yang, Z.
A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic. arXiv preprint arXiv:2007.05170, 2020.

Hou, C., Thekumparampil, K. K., Fanti, G., and Oh, S. Efficient algorithms for federated saddle point optimization. arXiv preprint arXiv:2102.06333, 2021.

Hsu, T.-M. H., Qi, H., and Brown, M. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.

Huang, F. and Huang, H. BiAdam: Fast adaptive bilevel optimization methods. arXiv preprint arXiv:2106.11396, 2021.

Huang, F., Li, J., and Huang, H. Compositional federated learning: Applications in distributionally robust averaging and meta learning. arXiv preprint arXiv:2106.11264, 2021.

Ji, K. and Liang, Y. Lower bounds and accelerated algorithms for bilevel optimization. arXiv, abs/2102.03926, 2021.

Ji, K., Yang, J., and Liang, Y. Provably faster algorithms for bilevel optimization and applications to meta-learning. arXiv, abs/2010.07962, 2020a.

Ji, K., Yang, J., and Liang, Y. Theoretical convergence of multi-step model-agnostic meta-learning. arXiv preprint arXiv:2002.07836, 2020b.

Ji, K., Yang, J., and Liang, Y. Bilevel optimization: Convergence analysis and enhanced design. In International Conference on Machine Learning, pp. 4882–4892. PMLR, 2021.

Ji, S. A PyTorch implementation of federated learning. Mar 2018. doi: 10.5281/zenodo.4321561.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.

Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. T. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp. 5132–5143. PMLR, 2020.

Khaled, A., Mishchenko, K., and Richtárik, P. First analysis of local GD on heterogeneous data.
arXiv preprint arXiv:1909.04715, 2019.

Khanduri, P., Zeng, S., Hong, M., Wai, H.-T., Wang, Z., and Yang, Z. A near-optimal algorithm for stochastic bilevel optimization via double-momentum. arXiv preprint arXiv:2102.07367, 2021.

Khodak, M., Tu, R., Li, T., Li, L., Balcan, M.-F. F., Smith, V., and Talwalkar, A. Federated hyperparameter tuning: Challenges, baselines, and connections to weight-sharing. Advances in Neural Information Processing Systems, 34, 2021.

Kini, G. R., Paraskevas, O., Oymak, S., and Thrampoulidis, C. Label-imbalanced and group-sensitive classification under overparameterization. arXiv preprint arXiv:2103.01550, 2021.

Konečný, J., McMahan, H. B., Ramage, D., and Richtárik, P. Federated optimization: Distributed machine learning for on-device intelligence. International Conference on Learning Representations, 2018.

Li, J., Gu, B., and Huang, H. Improved bilevel model: Fast and optimal algorithm with theoretical guarantee. arXiv preprint arXiv:2009.00690, 2020a.

Li, M., Zhang, X., Thrampoulidis, C., Chen, J., and Oymak, S. AutoBalance: Optimized loss functions for imbalanced data. Advances in Neural Information Processing Systems, 34, 2021.

Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429–450, 2020b.

Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the convergence of FedAvg on non-IID data. arXiv preprint arXiv:1907.02189, 2019.

Lin, T., Jin, C., and Jordan, M. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, pp. 6083–6093. PMLR, 2020.

Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.

Liu, M., Mroueh, Y., Ross, J., Zhang, W., Cui, X., Das, P., and Yang, T.
Towards better understanding of adaptive gradient algorithms in generative adversarial nets. arXiv preprint arXiv:1912.11940, 2019.

Liu, R., Mu, P., Yuan, X., Zeng, S., and Zhang, J. A generic first-order algorithmic framework for bi-level programming beyond lower-level singleton. In International Conference on Machine Learning, pp. 6305–6315. PMLR, 2020.

Liu, R., Gao, J., Zhang, J., Meng, D., and Lin, Z. Investigating bi-level optimization for learning and vision from a unified perspective: A survey and beyond. arXiv preprint arXiv:2101.11517, 2021.

Luo, L., Ye, H., Huang, Z., and Zhang, T. Stochastic recursive gradient descent ascent for stochastic nonconvex-strongly-concave minimax problems. Advances in Neural Information Processing Systems, 33:20566–20577, 2020.

Lv, Y., Hu, T., Wang, G., and Wan, Z. A penalty function method based on Kuhn–Tucker condition for solving linear bilevel programming. Applied Mathematics and Computation, 188(1):808–813, 2007.

Maclaurin, D., Duvenaud, D., and Adams, R. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122. PMLR, 2015.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, pp. 1273–1282, 2017. URL http://proceedings.mlr.press/v54/mcmahan17a.html.

Mitra, A., Jaafar, R., Pappas, G., and Hassani, H. Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients. Advances in Neural Information Processing Systems, 34, 2021.

Mohri, M., Sivek, G., and Suresh, A. T. Agnostic federated learning.
In International Conference on Machine Learning, pp. 4615–4625. PMLR, 2019.

Mokhtari, A., Ozdaglar, A., and Pattathil, S. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In International Conference on Artificial Intelligence and Statistics, pp. 1497–1507. PMLR, 2020.

Moore, G. M. Bilevel programming algorithms for machine learning model selection. Rensselaer Polytechnic Institute, 2010.

Nazari, P., Tarzanagh, D. A., and Michailidis, G. DADAM: A consensus-based distributed adaptive gradient method for online optimization. arXiv preprint arXiv:1901.09109, 2019.

Nedić, A. and Ozdaglar, A. Subgradient methods for saddle-point problems. Journal of Optimization Theory and Applications, 142(1):205–228, 2009.

Nemirovski, A. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.

Nichol, A. and Schulman, J. Reptile: a scalable meta-learning algorithm. arXiv preprint arXiv:1803.02999, 2(3):4, 2018.

Nouiehed, M., Sanjabi, M., Huang, T., Lee, J. D., and Razaviyayn, M. Solving a class of non-convex min-max games using iterative first order methods. Advances in Neural Information Processing Systems, 32, 2019.

Pathak, R. and Wainwright, M. J. FedSplit: an algorithmic framework for fast federated optimization. Advances in Neural Information Processing Systems, 33:7057–7066, 2020.

Pedregosa, F. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pp. 737–746. PMLR, 2016.

Rafique, H., Liu, M., Lin, Q., and Yang, T. Weakly-convex concave min-max optimization: provable algorithms and applications in machine learning. Optimization Methods and Software, pp. 1–35, 2021.

Rasouli, M., Sun, T., and Rajagopal, R. FedGAN: Federated generative adversarial networks for distributed data.
arXiv preprint arXiv:2006.07228, 2020.

Reddi, S. J., Charles, Z., Zaheer, M., Garrett, Z., Rush, K., Konečný, J., Kumar, S., and McMahan, H. B. Adaptive federated optimization. In International Conference on Learning Representations, 2020.

Reisizadeh, A., Farnia, F., Pedarsani, R., and Jadbabaie, A. Robust federated learning: The case of affine distribution shifts. Advances in Neural Information Processing Systems, 33:21554–21565, 2020.

Shaban, A., Cheng, C.-A., Hatch, N., and Boots, B. Truncated back-propagation for bilevel optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1723–1732. PMLR, 2019.

Shen, Y., Du, J., Zhao, H., Zhang, B., Ji, Z., and Gao, M. FedMM: Saddle point optimization for federated adversarial domain adaptation. arXiv preprint arXiv:2110.08477, 2021.

Shi, C., Lu, J., and Zhang, G. An extended Kuhn–Tucker approach for linear bilevel programming. Applied Mathematics and Computation, 162(1):51–63, 2005.

Sinha, A., Malo, P., and Deb, K. A review on bilevel optimization: from classical to evolutionary approaches and applications. IEEE Transactions on Evolutionary Computation, 22(2):276–295, 2017.

Stich, S. U. Local SGD converges fast and communicates little. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=S1g2JnRcFX.

Stich, S. U. and Karimireddy, S. P. The error-feedback framework: Better rates for SGD with delayed gradients and compressed communication. arXiv preprint arXiv:1909.05350, 2019.

Thekumparampil, K. K., Jain, P., Netrapalli, P., and Oh, S. Efficient algorithms for smooth minimax optimization. arXiv preprint arXiv:1907.01543, 2019.

Tran-Dinh, Q., Liu, D., and Nguyen, L. Hybrid variance-reduced SGD algorithms for minimax problems with nonconvex-linear function. Advances in Neural Information Processing Systems, 33:11096–11107, 2020.

Wang, J. and Joshi, G.
Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint arXiv:1808.07576, 2018.

Wang, J., Liu, Q., Liang, H., Joshi, G., and Poor, H. V. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in Neural Information Processing Systems, 2020.

Wang, M., Liu, J., and Fang, E. Accelerating stochastic composition optimization. Advances in Neural Information Processing Systems, 29, 2016.

Wang, M., Fang, E. X., and Liu, H. Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1-2):419–449, 2017.

Wang, S., Tuor, T., Salonidis, T., Leung, K. K., Makaya, C., He, T., and Chan, K. Adaptive federated learning in resource constrained edge computing systems. IEEE Journal on Selected Areas in Communications, 37(6):1205–1221, 2019.

Wu, Y. F., Zhang, W., Xu, P., and Gu, Q. A finite-time analysis of two time-scale actor-critic methods. Advances in Neural Information Processing Systems, 33:17617–17628, 2020.

Xie, J., Zhang, C., Zhang, Y., Shen, Z., and Qian, H. A federated learning framework for nonconvex-PL minimax problems. arXiv preprint arXiv:2105.14216, 2021.

Yan, Y., Xu, Y., Lin, Q., Liu, W., and Yang, T. Optimal epoch stochastic gradient descent ascent methods for min-max optimization. Advances in Neural Information Processing Systems, 33:5789–5800, 2020.

Yang, J., Kiyavash, N., and He, N. Global convergence and variance reduction for a class of nonconvex-nonconcave minimax problems. Advances in Neural Information Processing Systems, 33:1153–1165, 2020.

Yoon, T. and Ryu, E. K. Accelerated algorithms for smooth convex-concave minimax problems with O(1/k²) rate on squared gradient norm. In International Conference on Machine Learning, pp. 12098–12109. PMLR, 2021.

Yu, H., Yang, S., and Zhu, S.
Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5693–5700, 2019.

Zinkevich, M., Weimer, M., Li, L., and Smola, A. J. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 2595–2603, 2010.

APPENDIX

The appendix is organized as follows: Section A introduces the LFEDNEST algorithm. Section B discusses the related work. We provide all details for the proofs of the main theorems in Sections C, D, E, and F for federated bilevel, minimax, compositional, and single-level optimization, respectively. In Section G, we state a few auxiliary technical lemmas. Finally, in Section H, we provide the detailed parameters of our numerical experiments (Section 4) and then introduce further experiments.

A. LFEDNEST

Implementing FEDINN and FEDOUT naively, by using the global direct and indirect gradients and sending the local information to the server that would then calculate the global gradients, leads to a communication and space complexity that can be prohibitive for large d1 and d2. One can consider possible local variants of FEDINN and FEDOUT tailored to such scenarios. Each of the possible algorithms (see Table 2) can then either use the global gradient or only the local gradient, and either use SVRG or SGD.

Algorithm 5: x⁺ = LFEDOUT(x, y, α) for stochastic bilevel, minimax, and compositional problems
1: x_{i,0} = x and α_i ∈ (0, α] for each i ∈ S.
2: Choose N ∈ {1, 2, ...} (the number of terms of the Neumann series).
3: for i ∈ S in parallel do
4:   for ν = 0, ..., τ_i − 1 do
5:     Select Ñ ∈ {0, ..., N − 1} uniformly at random.
6:     (bilevel) h_{i,ν} = ∇_x f_i(x_{i,ν}, y; ξ_{i,ν}) − (N/ℓ_{g,1}) ∇²_{xy} g_i(x_{i,ν}, y; ζ_{i,ν}) [ ∏_{n=1}^{Ñ} ( I − (1/ℓ_{g,1}) ∇²_{yy} g_i(x_{i,ν}, y; ζ_{i,n}) ) ] ∇_y f_i(x_{i,ν}, y; ξ_{i,ν})
7:     (minimax) h_{i,ν} = ∇_x f_i(x_{i,ν}, y; ξ_{i,ν})
8:     (compositional) h_{i,ν} = ∇r_i(x_{i,ν}; ζ_{i,ν}) ∇f_i(y; ξ_{i,ν})
9:     x_{i,ν+1} = x_{i,ν} − α_i h_{i,ν}
10:   end for
11: end for
12: x⁺ = |S|⁻¹ Σ_{i∈S} x_{i,τ_i}

Algorithm 6: y⁺ = LFEDINN(x, y, β) for stochastic bilevel, minimax, and compositional problems
1: y_{i,0} = y and β_i ∈ (0, β] for each i ∈ S.
2: for i ∈ S in parallel do
3:   for ν = 0, ..., τ_i − 1 do
4:     (bilevel) q_{i,ν} = ∇_y g_i(x, y_{i,ν}; ζ_{i,ν});  (minimax) q_{i,ν} = −∇_y f_i(x, y_{i,ν}; ξ_{i,ν});  (compositional) q_{i,ν} = y_{i,ν} − r_i(x; ζ_{i,ν})
5:     y_{i,ν+1} = y_{i,ν} − β_i q_{i,ν}
6:   end for
7: end for
8: y⁺ = |S|⁻¹ Σ_{i∈S} y_{i,τ_i}

Table 2: Definition of the studied algorithms via their inner/outer optimization algorithms and server updates, and the resulting properties. T and N denote the number of inner iterations and the number of terms of the Neumann series, respectively.

| algorithm      | outer optimization       | inner optimization       | global outer gradient | global IHGP | global inner gradient | # communication rounds |
| FEDNEST        | Algorithm 2 (SVRG on x)  | Algorithm 4 (SVRG on y)  | yes | yes | yes | 2T + N + 3 |
| LFEDNEST       | Algorithm 5 (SGD on x)   | Algorithm 6 (SGD on y)   | no  | no  | no  | T + 1      |
| FEDNESTSGD     | Algorithm 2 (SVRG on x)  | Algorithm 6 (SGD on y)   | yes | yes | no  | T + N + 3  |
| LFEDNESTSVRG   | Algorithm 5 (SGD on x)   | Algorithm 4 (SVRG on y)  | no  | no  | yes | 2T + 1     |

B. Related Work

We provide an overview of the current literature on non-federated nested (bilevel, minimax, and compositional) optimization and federated learning.

B.1. Bilevel Optimization

A broad collection of algorithms has been proposed to solve bilevel nonlinear programming problems. Aiyoshi & Shimizu (1984); Edmunds & Bard (1991); Al-Khayyal et al. (1992); Hansen et al. (1992); Shi et al. (2005); Lv et al.
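As a concrete, simplified illustration of the local inner loop in Algorithm 6 (bilevel variant), the sketch below runs τ local gradient steps per client on quadratic inner objectives g_i(x, y) = (1/2)‖y − c_i‖² and then averages the client iterates on the server; the client targets c_i, step size, and epoch counts are illustrative, not from the paper:

```python
import numpy as np

# Toy LFEDINN-style inner loop (bilevel variant): each client i runs tau
# local gradient steps on g_i(y) = 0.5*||y - c_i||^2, then the server
# averages the final local iterates. Client targets are illustrative.
def lfedinn_step(y, targets, beta=0.5, tau=10):
    local_iterates = []
    for c in targets:                     # "for i in S in parallel"
        y_i = y.copy()
        for _ in range(tau):              # tau local steps
            q = y_i - c                   # grad of 0.5*||y - c||^2
            y_i = y_i - beta * q
        local_iterates.append(y_i)
    return np.mean(local_iterates, axis=0)  # server average y+

targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 2.0])]
y = np.zeros(2)
for _ in range(5):                        # a few communication rounds
    y = lfedinn_step(y, targets)
print(np.round(y, 3))                     # close to the mean target [1.0, 1.0]
```

Note how each client's local iterate drifts toward its own minimizer c_i before averaging; this is exactly the client-drift behavior that motivates the SVRG correction in the global FEDINN.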
(2007); Moore (2010) reduce the bilevel problem to a single-level optimization problem using, for example, the Karush-Kuhn-Tucker (KKT) conditions or penalty function methods. A similar idea was also explored in Khodak et al. (2021), where the authors reformulate hyperparameter optimization (a bilevel objective) as a single-level objective and develop a federated online method to solve it. However, the reduced single-level problem is usually difficult to solve (Sinha et al., 2017). In comparison, alternating gradient-based approaches designed for bilevel problems are more attractive due to their simplicity and effectiveness. These approaches estimate the hypergradient ∇f(x) for iterative updates, and are generally divided into approximate implicit differentiation (AID) and iterative differentiation (ITD) categories. ITD-based approaches (Maclaurin et al., 2015; Franceschi et al., 2017; Finn et al., 2017; Grazzi et al., 2020) estimate the hypergradient ∇f(x) in either a reverse (automatic differentiation) or forward manner. AID-based approaches (Pedregosa, 2016; Grazzi et al., 2020; Ghadimi & Wang, 2018) estimate the hypergradient via implicit differentiation, which involves solving a linear system. Our algorithms follow the latter approach. Theoretically, bilevel optimization has been studied via both asymptotic and non-asymptotic analysis (Franceschi et al., 2018; Liu et al., 2020; Li et al., 2020a; Shaban et al., 2019; Ghadimi & Wang, 2018; Ji et al., 2021; Hong et al., 2020). In particular, Franceschi et al. (2018) provided the asymptotic convergence of a backpropagation-based approach, as one of the ITD-based algorithms, assuming the inner problem is strongly convex. Shaban et al. (2019) gave a similar analysis for a truncated backpropagation approach. Non-asymptotic complexity analysis for bilevel optimization has also been explored.
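AID-style hypergradient estimators of the kind used here must apply an inverse Hessian to a vector. A minimal sketch (dimensions, H, v, and the spectral bound ℓ are illustrative assumptions) of the truncated Neumann-series approximation H⁻¹v ≈ (1/ℓ) Σ_{k=0}^{N−1} (I − H/ℓ)^k v for a positive-definite H:

```python
import numpy as np

# Truncated Neumann series for an inverse-Hessian-vector product:
# H^{-1} v ≈ (1/ell) * sum_{k=0}^{N-1} (I - H/ell)^k v, valid when
# 0 < H <= ell * I. H, v, and ell here are illustrative.
def neumann_ihvp(H, v, ell, N):
    out = np.zeros_like(v)
    term = v.copy()
    for _ in range(N):
        out += term
        term = term - (H @ term) / ell   # term <- (I - H/ell) term
    return out / ell

rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4))
H = M @ M.T + np.eye(4)                  # positive definite "Hessian"
v = rng.normal(size=4)
ell = np.linalg.norm(H, 2) * 1.1         # bound on the spectral norm

approx = neumann_ihvp(H, v, ell, N=2000)
exact = np.linalg.solve(H, v)
assert np.allclose(approx, exact, atol=1e-6)
```

The appeal in the stochastic setting is that each term needs only a Hessian-vector product with a fresh mini-batch Hessian, never an explicit matrix inverse.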
Ghadimi & Wang (2018) provided a finite-time convergence analysis for an AID-based algorithm under three different loss geometries, where f(·) is either strongly convex, convex, or nonconvex, and g(x, ·) is strongly convex. Ji et al. (2021) provided an improved non-asymptotic analysis for AID- and ITD-based algorithms under the nonconvex-strongly-convex geometry. Ji & Liang (2021) provided the first-known lower bounds on complexity as well as tighter upper bounds. When the objective functions can be expressed in an expectation or finite-sum form, Ghadimi & Wang (2018); Ji et al. (2021); Hong et al. (2020) developed stochastic bilevel algorithms and provided non-asymptotic analyses. Chen et al. (2021a) provided a tighter analysis of SGD for stochastic bilevel problems. Chen et al. (2021b); Guo et al. (2021); Khanduri et al. (2021); Ji et al. (2020a); Huang & Huang (2021); Dagréou et al. (2022) studied accelerated SGD, SAGA, momentum, and adaptive-type bilevel optimization methods. More results can be found in the recent review paper (Liu et al., 2021) and references therein.

B.1.1. MINIMAX OPTIMIZATION

Minimax optimization has a long history dating back to Brown (1951). Earlier works focused on the deterministic convex-concave regime (Nemirovski, 2004; Nedić & Ozdaglar, 2009). Recently, a surge of studies of stochastic minimax problems has emerged. The alternating version of stochastic gradient descent ascent (SGDA) has been studied by incorporating the idea of optimism (Daskalakis & Panageas, 2018; Gidel et al., 2018; Mokhtari et al., 2020; Yoon & Ryu, 2021). Rafique et al. (2021); Thekumparampil et al. (2019); Nouiehed et al. (2019); Lin et al. (2020) studied SGDA in the nonconvex-strongly-concave setting. Specifically, the O(ε⁻²) sample complexity has been established in (Lin et al., 2020) under an increasing batch size O(ε⁻¹). Chen et al.
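For intuition on the alternating scheme, a deterministic (full-gradient) version of alternating GDA on a toy strongly-convex-strongly-concave saddle objective f(x, y) = x² + 2xy − y² (an illustrative objective and step size, not one from the paper) converges to the saddle point (0, 0):

```python
# Alternating gradient descent ascent (GDA) on the toy saddle objective
# f(x, y) = x^2 + 2*x*y - y^2 (strongly convex in x, strongly concave in y).
# The unique saddle point is (0, 0); eta is an illustrative step size.
def gda(x, y, eta=0.1, iters=200):
    for _ in range(iters):
        x = x - eta * (2 * x + 2 * y)   # descent step on x
        y = y + eta * (2 * x - 2 * y)   # ascent step on y (uses updated x)
    return x, y

x, y = gda(1.0, -1.0)
print(abs(x) < 1e-8 and abs(y) < 1e-8)  # True: iterates reach the saddle
```

For this objective the alternating update is a linear map with spectral radius below one at this step size, so the iterates contract geometrically; the stochastic analyses cited above handle the much harder nonconvex-strongly-concave case.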
(2021a) provided the O(ϵ^{-2}) sample complexity under an O(1) constant batch size. In the same setting, accelerated GDA algorithms have been developed in (Luo et al., 2020; Yan et al., 2020; Tran-Dinh et al., 2020). Going beyond the one-sided concave settings, algorithms and their convergence analyses have been studied for nonconvex-nonconcave minimax problems with certain benign structure; see, e.g., (Gidel et al., 2018; Liu et al., 2019; Yang et al., 2020; Diakonikolas et al., 2021; Barazandeh et al., 2021b). A comparison of our results with prior work can be found in Table 1.

B.1.2. COMPOSITIONAL OPTIMIZATION

Stochastic compositional gradient algorithms (Wang et al., 2017; 2016) can be viewed as an alternating SGD for the special compositional problem. However, to ensure convergence, the algorithms in (Wang et al., 2017; 2016) update two sequences of variables on two different time scales, and thus the iteration complexity of (Wang et al., 2017) and (Wang et al., 2016) is worse than the O(ϵ^{-2}) complexity of standard SGD. Our work is closely related to ALSET (Chen et al., 2021a), where an O(ϵ^{-2}) sample complexity was established in a non-FL setting.

B.2. Federated Learning

FL involves learning a centralized model from distributed client data. Although this centralized model benefits from all client data, it raises several types of issues such as generalization, fairness, communication efficiency, and privacy (Mohri et al., 2019; Stich, 2019; Yu et al., 2019; Wang & Joshi, 2018; Stich & Karimireddy, 2019; Basu et al., 2019; Nazari et al., 2019; Barazandeh et al., 2021a). FEDAVG (McMahan et al., 2017) can tackle some of these issues, such as high communication costs. Many variants of FEDAVG have been proposed to tackle other emerging issues such as convergence and client drift.
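To make the client-drift phenomenon and the control-variate remedy concrete, here is a minimal, self-contained sketch (our own toy construction, not the actual SCAFFOLD implementation; the scalar quadratic clients f_i(x) = ½(x − b_i)² and all parameter values are hypothetical). Each client corrects its local steps by c − c_i, an estimate of the gap between the global gradient and its own, which removes the drift that uncorrected local steps would accumulate toward the local minimizers b_i.

```python
def scaffold_round(x, c, c_loc, b, eta=0.1, tau=5):
    """One round of control-variate (SCAFFOLD-style) local updates.
    Clients minimize f_i(x) = 0.5 * (x - b[i])**2; the global optimum is mean(b)."""
    m = len(b)
    new_x, new_c = [], []
    for i in range(m):
        xi = x
        for _ in range(tau):
            grad = xi - b[i]                    # local gradient of f_i
            xi -= eta * (grad - c_loc[i] + c)   # drift-corrected local step
        new_x.append(xi)
        # control-variate update from the net local progress over this round
        new_c.append(c_loc[i] - c + (x - xi) / (tau * eta))
    x = sum(new_x) / m                          # server averages client models
    c = c + sum(nc - oc for nc, oc in zip(new_c, c_loc)) / m
    return x, c, new_c

# Two heterogeneous clients; without correction, local steps drift toward b_i.
x, c, c_loc = 0.0, 0.0, [0.0, 0.0]
for _ in range(200):
    x, c, c_loc = scaffold_round(x, c, c_loc, b=[-1.0, 3.0])
```

At the fixed point the client controls c_i absorb exactly the heterogeneity f_i'(x*) of each client, so the corrected local direction vanishes at the global optimum.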
Examples include adding a regularization term in the client objectives towards the broadcast model (Li et al., 2020b), proximal splitting (Pathak & Wainwright, 2020; Mitra et al., 2021), variance reduction (Karimireddy et al., 2020; Mitra et al., 2021), and adaptive updates (Reddi et al., 2020). When clients are homogeneous, FEDAVG is closely related to local SGD (Zinkevich et al., 2010), which has been analyzed by many works (Stich, 2019; Yu et al., 2019; Wang & Joshi, 2018; Stich & Karimireddy, 2019; Basu et al., 2019). In order to analyze FEDAVG in heterogeneous settings, (Li et al., 2020b; Wang et al., 2019; Khaled et al., 2019; Li et al., 2019) derived convergence rates that depend on the amount of heterogeneity; these works showed that the convergence rate of FEDAVG degrades as client heterogeneity grows. By using control variates to reduce client drift, the SCAFFOLD method (Karimireddy et al., 2020) achieves convergence rates that are independent of the amount of heterogeneity. Relatedly, FEDNOVA (Wang et al., 2020) and FEDLIN (Mitra et al., 2021) established the convergence of their methods despite arbitrary local objective and systems heterogeneity. In particular, (Mitra et al., 2021) showed that FEDLIN guarantees linear convergence to the global minimum of a deterministic objective, despite arbitrary objective and systems heterogeneity. As explained in the main body, our algorithms critically leverage these ideas after identifying the additional challenges that client drift brings to federated bilevel settings.

B.2.1. FEDERATED MINIMAX LEARNING

A few recent studies are devoted to federated minimax optimization (Rasouli et al., 2020; Reisizadeh et al., 2020; Deng et al., 2020; Hou et al., 2021). In particular, (Reisizadeh et al., 2020) consider a minimax problem with the inner problem satisfying the PL condition and the outer one being either nonconvex or satisfying PL. However, the proposed algorithm only communicates x to the server. Xie et al.
(2021) consider a general class of nonconvex-PL minimax problems in the cross-device federated learning setting. Their algorithm performs multiple local update steps on a subset of active clients in each round and leverages global gradient estimates to correct the bias in local update directions. Deng & Mahdavi (2021) studied federated optimization for a family of smooth nonconvex minimax functions. Shen et al. (2021) proposed a distributed minimax optimizer called FEDMM, designed specifically for the federated adversarial domain adaptation problem. Hou et al. (2021) proposed a SCAFFOLD saddle point algorithm (SCAFFOLD-S) for solving strongly convex-concave minimax problems in the federated setting. To the best of our knowledge, all the aforementioned developments require a bound on the heterogeneity of the local functions and do not account for the effects of systems heterogeneity, which is also a key challenge in FL. In addition, our work proposes the first alternating federated SVRG-type algorithm for minimax problems with an iteration complexity that matches the non-federated setting (see Table 1).

C. Proof for Federated Bilevel Optimization

Throughout the proof, we use $\mathcal{F}^{k,t}_{i,\nu}$ to denote the filtration that captures all the randomness up to the $\nu$-th local step of client $i$ in inner round $t$ and outer round $k$. With a slight abuse of notation, $\mathcal{F}^{k,t}_{i,-1}$ is to be interpreted as $\mathcal{F}^{k,t}$ for all $i \in \mathcal{S}$. For simplicity, we drop the subscripts $k$ and $t$ from the stepsizes and model parameters; for example, $x$ and $x^+$ denote $x^k$ and $x^{k+1}$, respectively. We further set

$\bar{h}_i(x_{i,\nu}, y^+) := \mathbb{E}\big[h_i(x_{i,\nu}, y^+) \,\big|\, \mathcal{F}_{i,\nu-1}\big]$. (15)

Proof of Lemma 2.1

Proof. Given $x \in \mathbb{R}^{d_1}$, the optimality condition of the inner problem in (1) is $\nabla_y g(x, y^*(x)) = 0$. Now, since $\nabla_x\big(\nabla_y g(x, y^*(x))\big) = 0$, we obtain

$\sum_{j=1}^{m} \nabla^2_{xy} g_j(x, y^*(x)) + \nabla y^*(x) \sum_{j=1}^{m} \nabla^2_{y} g_j(x, y^*(x)) = 0$,

which implies

$\nabla y^*(x) = -\Big[\sum_{i=1}^{m} \nabla^2_{xy} g_i(x, y^*(x))\Big] \Big[\sum_{j=1}^{m} \nabla^2_{y} g_j(x, y^*(x))\Big]^{-1}$.
The result follows from a simple application of the chain rule to $f$:

$\nabla f(x) = \nabla_x f(x, y^*(x)) + \nabla y^*(x)\, \nabla_y f(x, y^*(x))$.

Proof of Lemma 2.2

Proof. By the independence of $\bar{N}$, $\zeta_{i,n}$, and $S_n$, and under Assumption B, we have

$\mathbb{E}_W\big[\widehat{H}_y\big] = \frac{1}{\ell_{g,1}} \sum_{n=0}^{N-1} \Big[I - \frac{1}{\ell_{g,1}}\, \nabla^2_y g(x, y)\Big]^n$, (16)

where the last equality follows from the uniform distribution of $\bar{N}$ and the independence of the minibatches $\{\zeta_{i,n}\}$ across $n$. Note that since $\big\|I - \frac{1}{\ell_{g,1}} \nabla^2_y g_i(x, y)\big\| \le 1 - \frac{\mu_g}{\ell_{g,1}}$ for all $i \in [m]$ due to Assumption A, we have

$\Big\|\mathbb{E}_W\big[\widehat{H}_y\big] - \big[\nabla^2_y g(x, y)\big]^{-1}\Big\| \le \frac{1}{\mu_g}\Big(1 - \frac{\mu_g}{\ell_{g,1}}\Big)^N$.

The remainder of the proof is similar to (Ghadimi & Wang, 2018).

The following lemma extends (Ghadimi & Wang, 2018, Lemma 2.2) and (Chen et al., 2021a, Lemma 2) to the finite-sum problem (1). The proofs follow similarly by applying their analysis to the inner and outer functions $(f_i, g_i)$, $i \in \mathcal{S}$.

Lemma C.1. Under Assumptions A and B, for all $x_1, x_2$:

$\|\nabla f(x_1) - \nabla f(x_2)\| \le L_f \|x_1 - x_2\|$, (17a)
$\|y^*(x_1) - y^*(x_2)\| \le L_y \|x_1 - x_2\|$, (17b)
$\|\nabla y^*(x_1) - \nabla y^*(x_2)\| \le L_{yx} \|x_1 - x_2\|$. (17c)

Also, for all $i \in \mathcal{S}$, $\nu \in \{0, \ldots, \tau_i - 1\}$, $x_1, x_2$, and $y$, we have:

$\|\bar{\nabla} f_i(x_1, y) - \bar{\nabla} f_i(x_1, y^*(x_1))\| \le M_f \|y^*(x_1) - y\|$, (17d)
$\|\bar{\nabla} f_i(x_2, y) - \bar{\nabla} f_i(x_1, y)\| \le M_f \|x_2 - x_1\|$, (17e)
$\mathbb{E}\|h_i(x_{i,\nu}, y) - \bar{h}_i(x_{i,\nu}, y)\|^2 \le \tilde{\sigma}_f^2$, (17f)
$\mathbb{E}\big[\|\bar{h}_i(x_{i,\nu}, y^+)\|^2 \,\big|\, \mathcal{F}_{i,\nu-1}\big] \le D_f^2$. (17g)

Here, $L_y := \frac{\ell_{g,1}}{\mu_g} = O(\kappa_g)$, $L_{yx} := \frac{\ell_{g,2} + \ell_{g,2} L_y}{\mu_g} + \frac{\ell_{g,1}(\ell_{g,2} + \ell_{g,2} L_y)}{\mu_g^2} = O(\kappa_g^3)$, $M_f := \ell_{f,1} + \frac{\ell_{g,1}\ell_{f,1}}{\mu_g} + \ell_{f,0}\Big(\frac{\ell_{g,2}}{\mu_g} + \frac{\ell_{g,1}\ell_{g,2}}{\mu_g^2}\Big)$, $L_f := \ell_{f,1} + \frac{\ell_{g,1}(\ell_{f,1} + M_f)}{\mu_g} + \ell_{f,0}\Big(\frac{\ell_{g,2}}{\mu_g} + \frac{\ell_{g,1}\ell_{g,2}}{\mu_g^2}\Big)$, $\tilde{\sigma}_f^2 := \sigma_f^2 + \frac{3}{\mu_g^2}\big[(\sigma_f^2 + \ell_{f,0}^2)(\sigma_{g,2}^2 + 2\ell_{g,1}^2) + \sigma_f^2 \ell_{g,1}^2\big]$, and $D_f^2 := \big(\ell_{f,0} + \frac{\ell_{g,1}\ell_{f,0}}{\mu_g}\big)^2 + \tilde{\sigma}_f^2 = O(\kappa_g^2)$, where the other constants are provided in Assumptions A and B.

C.1. Descent of Outer Objective

The following lemma characterizes the descent of the outer objective.

Lemma C.2 (Descent Lemma). Suppose Assumptions A and B hold. Further, assume $\tau_i \ge 1$ and $\alpha_i = \alpha/\tau_i$, $i \in \mathcal{S}$, for some positive constant $\alpha$. Then, FEDOUT guarantees:

$\mathbb{E}[f(x^+)] - \mathbb{E}[f(x)] \le -\frac{\alpha}{2}\mathbb{E}\|\nabla f(x)\|^2 + \frac{\alpha^2 L_f}{2}\tilde{\sigma}_f^2 - \frac{\alpha}{2}(1 - \alpha L_f)\, \mathbb{E}\Big\|\frac{1}{m}\sum_{i \in \mathcal{S}}\frac{1}{\tau_i}\sum_{\nu=0}^{\tau_i - 1} \bar{h}_i(x_{i,\nu}, y^+)\Big\|^2 + \frac{3\alpha}{2}\Big(b^2 + M_f^2\, \mathbb{E}\|y^+ - y^*(x)\|^2 + \frac{M_f^2}{m}\sum_{i \in \mathcal{S}}\frac{1}{\tau_i}\sum_{\nu=0}^{\tau_i - 1} \mathbb{E}\|x_{i,\nu} - x\|^2\Big)$.
Proof. It follows from Algorithm 2 that xi,0 = x, i S, and ν=0 hi(xi,ν, y+). (20) FEDNEST: Federated Bilevel, Minimax, and Compositional Optimization Now, using the Lipschitz property of f in Lemma C.1, we have E f(x+) E [f(x)] E x+ x, f(x) + Lf ν=0 hi(xi,ν, y+), f(x) ν=0 hi(xi,ν, y+) In the following, we bound each term on the right hand side (RHS) of (21). For the first term, we have ν=0 hi(xi,ν, y+), f(x) ν=0 E hi(xi,ν, y+), f(x) | Fi,ν 1 # ν=0 hi(xi,ν, y+), f(x) ν=0 hi(xi,ν, y+) 2 E h f(x) 2i ν=0 hi(xi,ν, y+) f(x) where the first equality follows from the law of total expectation; the second equality uses the fact that hi(xi,ν, y+) = E [hi(xi,ν, y+)|Fi,ν 1]; and the last equality is obtained from our assumption αi = α/τi, i S. Next, we bound the last term in (22). Note that ν=0 hi(xi,ν, y+) f(x) ν=0 ( hi(xi,ν, y+) fi(x, y+)) ν=0 fi(x, y+) f(x) 2 ν=0 ( hi(xi,ν, y+) fi(xi,ν, y+)) ν=0 ( fi(xi,ν, y+) fi(x, y+)) + 3 f(x, y+) f(x) 2 , where the inequality uses Lemma G.1. ν=0 hi(xi,ν, y+) f(x) 2 # 3b2 + 3M 2 f m ν=0 E xi,ν x 2 + 3M 2 f E y+ y (x) 2 , where the inequality uses Lemmas 2.2 and C.1. FEDNEST: Federated Bilevel, Minimax, and Compositional Optimization Substituting (23) into (22) yields ν=0 hi(xi,ν, y+), f(x) ν=0 hi(xi,ν, y+) 2 # b2 + M 2 f E y+ y (x) 2 + M 2 f m ν=0 E xi,ν x 2 . Next, we bound the second term on the RHS of (21). Observe that ν=0 hi(xi,ν, y+) hi(xi,ν, y+) hi(xi,ν, y+) + hi(xi,ν, y+) ν=0 hi(xi,ν, y+) 2 # where the inequality follows from Lemmas G.3 and C.1. Plugging (25) and (24) into (21) completes the proof. C.2. Error of FEDINN The following lemma establishes the progress of FEDINN. It should be mentioned that the assumption on βi, i S is identical to the one listed in (Mitra et al., 2021, Theorem 4). Lemma C.3 (Error of FEDINN). Suppose Assumptions A and B hold. Further, assume τi 1, αi = α τi , βi = β where 0 < β < min 1/(6ℓg,1), 1 and α is some positive constant. 
Then, FEDINN guarantees: E h y+ y (x) 2i 1 βµg T E h y y (x) 2i + 25Tβ2σ2 g,1, and (26a) E h y+ y (x+) 2i a1(α)E ν=0 hi(xi,ν, y+) + a2(α)E h y+ y (x) 2i + a3(α) σ2 f. (26b) a1(α) := L2 yα2 + Lyα 4Mf + Lyxα2 a2(α) := 1 + 4Mf Lyα + ηLyx D2 fα2 a3(α) := α2L2 y + Lyxα2 for any η > 0. FEDNEST: Federated Bilevel, Minimax, and Compositional Optimization Proof. Note that E y+ y (x+) 2 =E y+ y (x) 2 + E y (x+) y (x) 2 +2E y+ y (x), y (x) y (x+) . (28) Next, we upper bound each term on the RHS of (28). Bounding the first term in (28): From (Mitra et al., 2021, Theorem 4), for all t {0, . . . , T 1}, we obtain E[ yt+1 y (x) 2] 1 βµg E[ yt y (x) 2] + 25β2σ2 g,1, which together with our setting y+ = y T implies E[ y+ y (x) 2] 1 βµg T E[ y y(x ) 2] + 25Tβ2σ2 g,1. (29) Bounding the second term in (28): By similar steps as in (25), we have E y (x+) y (x) 2 L2 y E ν=0 hi(x, y+) ν=0 hi(xi,ν, y+) + α2L2 y σ2 f, where the inequalities are obtained from Lemmas C.1 and G.3. Bounding the third term in (28): Observe that E y+ y (x), y (x) y (x+) = E y+ y (x), y (x)(x+ x) E y+ y (x), y (x+) y (x) y (x)(x+ x) . (31) For the first term on the R.H.S. of the above equality, we have E[ y+ y (x), y (x)(x+ x) ] = E y+ y (x), 1 ν=0 hi(xi,ν, y+) ν=0 hi(xi,ν, y+) ν=0 hi(xi,ν, y+) 2γE h y+ y (x) 2i + L2 yα2 ν=0 hi(xi,ν, y+) where the first equality uses the fact that hi(xi,ν, y+) = E [hi(xi,ν, y+)|Fi,ν 1]; the second inequality follows from Lemma C.1; and the last inequality is obtained from the Young s inequality such that ab 2γa2 + b2 Further, using Lemma C.1, we have E[ y+ y (x), y (x+) y (x) y (x)(x+ x) ] E y+ y (x) y (x+) y (x) y (x)(x+ x) 2 E h y+ y (x) x+ x 2i , FEDNEST: Federated Bilevel, Minimax, and Compositional Optimization where the inequality follows from Lemma C.1. From Algorithm 2, we have xi,0 = x, i S, and x+ = x 1 ν=0 hi(xi,ν, y+). (34) Note that F0 = Fi,0 for all i S. 
Hence, E h y+ y (x) 2 x+ x 2i ν=0 E h y+ y (x) 2 E h hi(xi,ν, y+) 2 | Fi,τi 1 ii α2 D2 f E h y+ y (x) 2i , where the last inequality uses Lemma C.1. Note also that for any η > 0, we have 1 η 2 + 1 2η. Combining this inequality with (33) and using (35) give E[ y+ y (x), y (x+) y (x) y (x)(x+ x) ] 2 E h y+ y (x) x+ x 2i 4 E h y+ y (x) x+ x 2i + Lyx 4η E h x+ x 2i ηLyx D2 fα2 4 E h y+ y (x) 2i + Lyxα2 ν=0 hi(xi,ν, y+) where the last inequality uses (35) and Lemma 2.2. Let γ = Mf Lyα. Plugging (36) and (32) into (31), we have E[ y+ y (x), y (x) y (x+) ] 2γ + ηLyx D2 f 4 α2 ! E h y+ y (x) 2i ν=0 hi(xi,ν, y+) 2Mf Lyα + ηLyx D2 f 4 α2 ! E h y+ y (x) 2i 8Mf + Lyxα2 ν=0 hi(xi,ν, y+) Substituting (37), (30), and (29) into (28) completes the proof. C.3. Drifting Errors of FEDOUT The following lemma provides a bound on the drift of each xi,ν from x for stochastic nonconvex bilevel problems. It should be mentioned that similar drifting bounds for single-level problems are provided under either strong convexity (Mitra et al., 2021) and/or bounded dissimilarity assumptions (Wang et al., 2020; Reddi et al., 2020; Li et al., 2020b). Lemma C.4 (Drifting Error of FEDOUT). Suppose Assumptions A and B hold. Further, assume τi 1 and αi 1/(5Mfτi), i S. Then, for each i S and ν {0, . . . , τi 1}, FEDOUT guarantees: E h xi,ν x 2i 36τ 2 i α2 i M 2 f E y+ y (x) 2 + E f(x) 2 + 3b2 + 27τiα2 i σ2 f. (38) FEDNEST: Federated Bilevel, Minimax, and Compositional Optimization Proof. The result trivially holds for τi = 1. Let τi > 1 and define vi,ν := hi(xi,ν, y+) fi(xi,ν, y+) hi(x, y+) + fi(x, y+) + h(x, y+) f(x, y+), wi,ν := hi(xi,ν, y+) hi(xi,ν, y+) + hi(x, y+) hi(x, y+) + h(x, y+) h(x, y+), zi,ν := fi(xi,ν, y+) fi(x, y+) + f(x, y+) f(x) + f(x). One will notice that vi,ν + wi,ν + zi,ν = hi(xi,ν, y+) hi(x, y+) + h(x, y+). Hence, from Algorithm 2, for each i S, and ν {0, . . . 
, τi 1}, we have xi,ν+1 x = xi,ν x αi(vi,ν + wi,ν + zi,ν), (40) which implies that E xi,ν+1 x 2 = E xi,ν x αi(vi,ν + zi,ν) 2 + α2 i E wi,ν 2 2E [E [ xi,ν x αi(vi,ν + zi,ν), αiwi,ν | Fi,ν 1]] = E xi,ν x αi(vi,ν + zi,ν) 2 + α2 i E wi,ν 2 . Here, the last equality uses Lemma G.3 since E[wi,ν|Fi,ν 1] = 0, by definition. From Lemmas G.1, 2.2, and C.1, for vi,ν, wi,ν, and zi,ν defined in (39), we have E h vi,ν 2i 3E h fi(xi,ν, y+) hi(xi,ν, y+) 2 + hi(x, y+) fi(x, y+) 2 + f(x, y+) h(x, y+) 2 i E wi,ν 2 3E h hi(xi,ν, y+) hi(xi,ν, y+) 2 + hi(x, y+) hi(x, y+) 2 + h(x, y+) h(x, y+) 2i and E zi,ν 2 3E h fi(xi,ν, y+) fi(x, y+) 2 + f(x, y+) f(x) 2 + f(x) 2i 3 M 2 f E xi,ν x 2 + M 2 f E y+ y (x) 2 + E f(x) 2 . Now, the first term in the RHS of (41) can be bounded as follows: E xi,ν x αi(vi,ν + zi,ν) 2 1 + 1 2τi 1 E xi,ν x 2 + 2τi E αi(vi,ν + zi,ν) 2 1 + 1 2τi 1 E xi,ν x 2 + 4τiα2 i E zi,ν 2 + E vi,ν 2 1 + 1 2τi 1 E xi,ν x 2 + 4τiα2 i E zi,ν 2 + 9b2 1 + 1 2τi 1 + 12τiα2 i M 2 f + 12τiα2 i M 2 f E y+ y (x) 2 + E f(x) 2 + 3b2 . FEDNEST: Federated Bilevel, Minimax, and Compositional Optimization Here, the first inequality follows from Lemma G.2; the second inequality uses Lemma G.1; and the third and last inequalities follow from (42a) and (42c). Substituting (43) into (41) gives E h xi,ν+1 x 2i 1 + 1 2τi 1 + 12τiα2 i M 2 f + 12τiα2 i M 2 f E y+ y (x) 2 + E f(x) 2 + 3b2 + 9α2 i σ2 f + 12τiα2 i M 2 f E y+ y (x) 2 + E f(x) 2 + 3b2 + 9α2 i σ2 f. Here, the first inequality uses (42b) and the last inequality follows by noting αi 1/(5Mfτi). For all τi > 1, we have 1 + 1 τi 1 ν 1 1 + 1 τi 1 1 τi exp (1)τi < 3τi. Now, iterating equation (44) and using xi,0 = x, i S, we obtain E xi,ν x 2 12τiα2 i M 2 f E y+ y (x) 2 + E f(x) 2 + 3b2 + 9α2 i σ2 f ν 1 X 3τi 12τiα2 i M 2 f E y+ y (x) 2 + E f(x) 2 + 3b2 + 9α2 i σ2 f , where the second inequality uses (45). This completes the proof. Remark C.1. 
Lemma C.4 shows that the bound on the client drift scales linearly with $\tau_i$ and the inner error $\|y^+ - y^*(x)\|^2$ in general nested FL. We aim to control such drift by selecting $\alpha_i = O(1/\tau_i)$ for all $i \in \mathcal{S}$ and using the inner error bound provided in Lemma C.3. Next, we provide the proof of our main result, which can be adapted to general nested problems (bilevel, minimax, compositional).

C.4. Proof of Theorem 3.1

Proof. We define the following Lyapunov function:

$W_k := f(x_k) + \frac{M_f}{L_y}\,\|y_k - y^*(x_k)\|^2$. (46)

Motivated by (Chen et al., 2021a), we bound the difference between two consecutive Lyapunov functions. That is,

$W_{k+1} - W_k = f(x_{k+1}) - f(x_k) + \frac{M_f}{L_y}\big(\|y_{k+1} - y^*(x_{k+1})\|^2 - \|y_k - y^*(x_k)\|^2\big)$. (47)

The first two terms on the RHS of (47) quantify the descent of the outer objective $f$, and the remaining terms measure the descent of the inner errors. From our assumption, we have $\alpha^k_i = \alpha_k/\tau_i$, $\beta^k_i = \beta_k/\tau_i$, $i \in \mathcal{S}$. Substituting these stepsizes into the bounds provided in Lemmas C.2 and C.3, and using (47), we get E[Wk+1] E[Wk] Lfα2 k 2 + Mf Ly a3(αk) σ2 f + 3αk 2 E f(xk) 2 + 3M 2 f αk 2m ν=0 E xk i,ν xk 2 (48a) 2 Lfα2 k 2 Mf Ly a1(αk) E ν=0 hi(xk i,ν, yk+1) 2 + a2(αk) E yk+1 y (xk) 2 Mf Ly E yk y (xk) 2 , (48c) where $a_1(\alpha)$-$a_3(\alpha)$ are defined in (27). Let $\alpha^k_i = \alpha_k/\tau_i$, $i \in \mathcal{S}$, where

$\alpha_k \le \frac{1}{216 M_f^2 + 5 M_f}$. (49)

The above choice of $\alpha_k$ satisfies the condition of Lemma C.4, and we have $54 M_f^2 \alpha_k^3 \le \alpha_k^2/4$. Hence, from Lemma C.4, we get 2 E f(xk) 2 + 81 2 α3 k M 2 f σ2 f + 54M 2 f α3 k M 2 f E yk+1 y (xk) 2 + E f(xk) 2 + 3b2 4 E f(xk) 2 + α2 k 4 σ2 f + α2 k 4 M 2 f E yk+1 y (xk) 2 + 3b2 4 E f(xk) 2 + α2 k 4 σ2 f + 3α2 k 4 b2 Tβ2 kσ2 g,1 + Mf where the first inequality uses (49) and the last inequality follows from (26a). To guarantee the descent of $W_k$, the following constraints need to be satisfied 2 Lfα2 k 2 Mf L2 yα2 k + Lyαk 4Mf + Lyxα2 k 2η 2Lf + 4Mf Ly + 2Mf Lyx where the second line uses (27).
Further, substituting (26) in (48c) gives 2 + a2(αk) Tβ2 kσ2 g,1 2 + a2(αk) 1 βkµg E[ yk y (xk) 2]. (52) FEDNEST: Federated Bilevel, Minimax, and Compositional Optimization Substituting (50) (52) into (48) gives E[Wk+1] E[Wk] αk 4 E[ f(xk) 2] Ly a3(αk) σ2 f 4 α2 k + 3Mf Lyαk 2 + a2(αk) Tβ2 kσ2 g,1 4 α2 k + 3Mf Lyαk 2 + a2(αk) 1 βkµg E[ yk y (xk) 2]. (53a) Let βk < min 1/(6ℓg,1), 1 . Then, we have βkµg/2 < 1. This together with (27) implies that for any αk > 0 4 α2 k + 11Mf Lyαk 2 + ηLyx D2 fα2 k 2 4 α2 k + 11Mf Lyαk 2 + ηLyx D2 fα2 k 2 = βk 11Mf Ly + ηLyx D2 fαk + Mf Lyαk From (49), (51) and (54), we select αk = min{ α1, α2, α3, α K }, βk = βαk where β := 1 11Mf Ly + ηLyx D2 f α1 + Mf Ly α1 2Lf + 4Mf Ly + 2Mf Lyx Lyη , α2 := T 8ℓg,1 β , α3 := 1 216M 2 f + 5Mf , (56) With the above choice of stepsizes, (53) can be simplified as E[Wk+1] E[Wk] αk 4 E[ f(xk) 2] 2 2 α2 k + Mf Ly a3(αk) σ2 f 4 α2 k + 3Mf Ly 2 αk + a2(αk) Tβ2 kσ2 g,1 4 E[ f(xk) 2] + c1α2 kσ2 g,1 + 3 b2 + c2α2 k σ2 f, where the constants c1 and c2 are defined as 1 + 11Mf Ly Mf Ly + 2ηLyx D2 f 4 c2 = Lf + 1 2 2 + Mf Ly + Lyx Mf FEDNEST: Federated Bilevel, Minimax, and Compositional Optimization Then telescoping gives k=0 E[ f(xk) 2] 4 PK 1 k=0 αk αk + α2 k 2 b2 + c1α2 kσ2 g,1 + c2α2 k σ2 f 4 W min{ α1, α2, α3}K + 4 W K + 6 1 + α 2 K σ2 g,1 + 4c2 α 1 min{ α1, α2, α3}K + α max(σ2 g,1, σ2 g,2, σ2 f) where W := W0 E[WK]. C.5. Proof of Corollary 3.1 Proof. Let η = Mf Ly = O(κg) in (58). It follows from (18), (56), and (58) that α1 = O(κ 3 g ), α2 = O(Tκ 3 g ), α3 = O(κ 4 g ), c1 = O(κ9 g/T), c2 = O(κ3 g). (60) Further, N = O(κg log K) gives b = 1 K1/4 . Now, if we select α = O(κ 2.5 g ) and T = O(κ4 g), Eq. (59) gives k=0 E[ f(xk) 2] = O κ4 g K + κ2.5 g To achieve ε-optimal solution, we need K = O(κ5 gε 2), and the samples in ξ and ζ are O(κ5 gε 2) and O(κ9 gε 2), respectively. D. 
Proof for Federated Minimax Optimization

Note that the minimax optimization problem (2) has the following bilevel form:

$\min_{x \in \mathbb{R}^{d_1}} f(x) = \frac{1}{m}\sum_{i=1}^{m} f_i(x, y^*(x))$ subj. to $y^*(x) = \operatorname{argmin}_{y \in \mathbb{R}^{d_2}} -\frac{1}{m}\sum_{i=1}^{m} f_i(x, y)$, (61a)

where

$f_i(x, y) = \mathbb{E}_{\xi \sim \mathcal{C}_i}[f_i(x, y; \xi)]$ (61b)

is the loss function of the $i$-th client. In this case, the hypergradient of (61) is

$\nabla f_i(x) = \nabla_x f_i(x, y^*(x)) + \nabla_x y^*(x)\, \nabla_y f_i(x, y^*(x)) = \nabla_x f_i(x, y^*(x))$, (62)

where the second equality follows from the optimality condition of the inner problem, i.e., $\nabla_y f(x, y^*(x)) = 0$. For each $i \in \mathcal{S}$, we can approximate $\nabla f_i(x)$ at a vector $y$ in place of $y^*(x)$, denoted as $\bar{\nabla} f_i(x, y) := \nabla_x f_i(x, y)$. We also note that in the minimax case $h_i$ is an unbiased estimator of $\bar{\nabla} f_i(x, y)$; thus, $b = 0$. Therefore, we can apply FEDNEST using

$q_{i,\nu} = -\nabla_y f_i(x, y_{i,\nu}; \xi_{i,\nu}) + \nabla_y f_i(x, y; \xi_{i,\nu}) - \frac{1}{m}\sum_{i=1}^{m} \nabla_y f_i(x, y; \xi_i)$,

$h_{i,\nu} = \nabla_x f_i(x_{i,\nu}, y^+; \xi_{i,\nu}) - \nabla_x f_i(x, y^+; \xi_{i,\nu}) + \frac{1}{m}\sum_{i=1}^{m} \nabla_x f_i(x, y^+; \xi_i)$.

D.1. Supporting Lemmas

Let $z = (x, y) \in \mathbb{R}^{d_1 + d_2}$. We make the following assumptions, which are counterparts of Assumptions A and B.

Assumption E. For all $i \in [m]$: (E1) $f_i(z)$, $\nabla f_i(z)$, $\nabla^2 f_i(z)$ are respectively $\ell_{f,0}$-, $\ell_{f,1}$-, $\ell_{f,2}$-Lipschitz continuous; and (E2) $-f_i(x, y)$ is $\mu_f$-strongly convex in $y$ for any fixed $x \in \mathbb{R}^{d_1}$. We use $\kappa_f = \ell_{f,1}/\mu_f$ to denote the condition number of the inner objective with respect to $y$.

Assumption F. For all $i \in [m]$: (F1) $\nabla f_i(z; \xi)$ is an unbiased estimator of $\nabla f_i(z)$; and (F2) its variance is bounded, i.e., $\mathbb{E}_\xi[\|\nabla f_i(z; \xi) - \nabla f_i(z)\|^2] \le \sigma_f^2$ for some $\sigma_f^2$.

In the following, we re-derive Lemma C.1 for the finite-sum minimax problem (61).

Lemma D.1. Under Assumptions E and F, we have $\bar{h}_i(x, y) = \bar{\nabla} f_i(x, y)$ for all $i \in \mathcal{S}$, and (17a)-(17g) hold with

$L_{yx} = \frac{\ell_{f,2} + \ell_{f,2} L_y}{\mu_f} + \frac{\ell_{f,1}(\ell_{f,2} + \ell_{f,2} L_y)}{\mu_f^2} = O(\kappa_f^3)$, $M_f = \ell_{f,1} = O(1)$, $L_f = \ell_{f,1} + \frac{\ell_{f,1}^2}{\mu_f} = O(\kappa_f)$, $L_y = \frac{\ell_{f,1}}{\mu_f} = O(\kappa_f)$, $\tilde{\sigma}_f^2 = \sigma_f^2$, $D_f^2 = \ell_{f,0}^2 + \sigma_f^2$,

where $\ell_{f,0}$, $\ell_{f,1}$, $\ell_{f,2}$, $\mu_f$, and $\sigma_f$ are given in Assumptions E and F.

D.2. Proof of Corollary 3.2

Proof. Let $\eta = 1$.
From (55) and (56), we have αk = min α1, α2, α3, α where β = 1 11ℓf,1Ly + Lyx D2 f α1 + ℓf,1Ly α1 2Lf + 4ℓf,1Ly + 2ℓf,1Lyx Ly , α2 = T 8ℓg,1 β , α3 = 1 216ℓ2 f,1 + 5ℓf,1 . (65b) Using the above choice of stepsizes, (59) reduces to k=0 E[ f(xk) 2] 4 W K min{ α1, α2, α3} + 4 W K + 4(c1 + c2) α K σ2 f, (66) where W = W0 E[WK], c1 = 25ℓf,1 1 + 11ℓf,1Ly ℓf,1Ly + 2Lyx D2 f 4 c2 = Lf + 1 2 2 + ℓf,1Ly + Lyxℓf,1 FEDNEST: Federated Bilevel, Minimax, and Compositional Optimization Let α = O(κ 1 f ). Since by our assumption, T = O(κf), it follows from (64) and (73) that α1 = O(κ 2 f ), α2 = O(κ 1 f ), α3 = O(1), c1 = O(κ2 f), c2 = O(κ2 f). (68) Substituting (68) in (66) and (67) gives k=0 E[ f(xk) 2] = O κ2 f K + κf To achieve ε-accuracy, we need K = O(κ2 fε 2). E. Proof for Federated Compositional Optimization Note that in the stochastic compositional problem (3), the inner function fi(x, y; ξ) = fi(y; ξ) for all i S, and the outer function is gi(x, y; ζ) = 1 2 y ri(x; ζ) 2, for all i S. In this case, we have ygi (x, y) = yi ri(x; ζ), yyg(x, y; ζ) = Id2 d2, and xyg(x, y; ζ) = 1 i=1 ri(x; ζ) . (70) Hence, the hypergradient of (3) has the following form fi(x) = xfi (y (x)) 2 xyg(x, y (x))[ 2 yg(x, y (x))] 1 yfi (y (x)) i=1 ri(x)) yfi(y (x)). (71) We can obtain an approximate gradient fi(x) by replacing y (x) with y; that is fi(x, y) = ( 1 m Pm i=1 ri(x)) yfi(y). It should be mentioned that in the compositional case b = 0. Thus, we can apply FEDNEST and LFEDNEST using the above gradient approximations. E.1. Supporting Lemmas Let z = (x, y) Rd1+d2. We make the following assumptions that are counterparts of Assumptions A and B. Assumption G. For all i [m], fi(z), fi(z), ri(z), ri(z) are respectively ℓf,0, ℓf,1, ℓr,0, ℓr,1-Lipschitz continuous. Assumption H. For all i [m]: (H1) fi(z; ξ), ri(x; ζ), ri(x; ζ) are unbiased estimators of fi(z), ri(x), and ri(x). 
(H2) Their variances are bounded, i.e., $\mathbb{E}_\xi[\|\nabla f_i(z; \xi) - \nabla f_i(z)\|^2] \le \sigma_f^2$, $\mathbb{E}_\zeta[\|r_i(x; \zeta) - r_i(x)\|^2] \le \sigma_{r,0}^2$, and $\mathbb{E}_\zeta[\|\nabla r_i(x; \zeta) - \nabla r_i(x)\|^2] \le \sigma_{r,1}^2$ for some $\sigma_f^2$, $\sigma_{r,0}^2$, and $\sigma_{r,1}^2$.

The following lemma is the counterpart of Lemma C.1. The proof is similar to (Chen et al., 2021a, Lemma 7).

Lemma E.1. Under Assumptions G and H, we have $\bar{h}_i(x, y) = \bar{\nabla} f_i(x, y)$ for all $i \in \mathcal{S}$, and (17a)-(17g) hold with

$M_f = \ell_{r,0}\ell_{f,1}$, $L_y = \ell_{r,0}$, $L_f = \ell_{r,0}^2 \ell_{f,1} + \ell_{f,0}\ell_{r,1}$, $L_{yx} = \ell_{r,1}$, $\tilde{\sigma}_f^2 = \ell_{r,0}^2 \sigma_f^2 + (\ell_{f,0}^2 + \sigma_f^2)\sigma_{r,1}^2$, $D_f^2 = (\ell_{f,0}^2 + \sigma_f^2)(\ell_{r,0}^2 + \sigma_{r,1}^2)$. (72)

E.2. Proof of Corollary 3.3

Proof. By our assumption, $T = 1$. Let $\alpha = 1$ and $\eta = 1/L_{yx}$. From (55) and (56), we obtain $\alpha_k = \min\{\bar{\alpha}_1, \bar{\alpha}_2, \bar{\alpha}_3, 1\}$, $\beta_k = \bar{\beta}\alpha_k$, (73a) with 11ℓf,1ℓ2 r,0 + D2 f α1 + ℓf,1ℓ2 r,0 α1 2 α1 = 1 2ℓf,0ℓr,1 + 6ℓf,1ℓ2 r,0 + 2ℓf,1ℓ2 r,1 , α2 = 1 8ℓr,0 β , α3 = 1 216(ℓr,0ℓf,1)2 + 5ℓr,0ℓf,1 . Then, using (59), we obtain $\frac{1}{K}\sum_{k=0}^{K-1} \mathbb{E}\|\nabla f(x_k)\|^2 = O(1/\sqrt{K})$. This completes the proof.

F. Proof for Federated Single-Level Optimization

Next, we re-derive Lemmas C.2 and C.4 for single-level nonconvex FL under Assumptions C and D.

Lemma F.1 (Counterpart of Lemma C.2). Suppose Assumptions C and D hold. Further, assume $\tau_i \ge 1$ and $\alpha_i = \alpha/\tau_i$, $i \in \mathcal{S}$, for some positive constant $\alpha$. Then, FEDOUT guarantees:

$\mathbb{E}[f(x^+)] - \mathbb{E}[f(x)] \le -\frac{\alpha}{2}\mathbb{E}\Big\|\frac{1}{m}\sum_{i \in \mathcal{S}}\frac{1}{\tau_i}\sum_{\nu=0}^{\tau_i-1} \nabla f_i(x_{i,\nu})\Big\|^2 - \frac{\alpha}{2}\mathbb{E}\|\nabla f(x)\|^2 + \frac{\alpha L_f^2}{2m}\sum_{i \in \mathcal{S}}\frac{1}{\tau_i}\sum_{\nu=0}^{\tau_i-1} \mathbb{E}\|x_{i,\nu} - x\|^2 + \frac{\alpha^2 L_f}{2}\sigma_f^2$.

Proof. By applying Algorithm 2 to the single-level optimization problem (13), we have $x_{i,0} = x$, $i \in \mathcal{S}$, and

$x^+ = x - \frac{1}{m}\sum_{i \in \mathcal{S}} \alpha_i \sum_{\nu=0}^{\tau_i-1} h_i(x_{i,\nu})$, where $h_i(x_{i,\nu}) := \nabla_x f_i(x_{i,\nu}; \xi_{i,\nu}) - \nabla_x f_i(x; \xi_{i,\nu}) + \frac{1}{m}\sum_{i=1}^{m} \nabla_x f_i(x; \xi_i)$.
This together with Assumption C implies that E f(x+) E [f(x)] E x+ x, f(x) + Lf ν=0 hi(xi,ν), f(x) ν=0 hi(xi,ν) FEDNEST: Federated Bilevel, Minimax, and Compositional Optimization For the first term on the RHS of (76), we obtain ν=0 hi(xi,ν), f(x) ν=0 E [ hi(xi,ν), f(x) | Fi,ν 1] ν=0 fi(xi,ν) 2 E h f(x) 2i ν=0 fi(xi,ν) f(x) ν=0 fi(xi,ν) 2 E h f(x) 2i ν=0 E h xi,ν x 2i , where the first equality follows from the law of total expectation and the last inequality is obtained from Assumption C. For the second term on the RHS of (76), Assumption D gives ν=0 hi(xi,ν, y+) ν=0 (hi(xi,ν) fi(xi,ν) + fi(xi,ν)) ν=0 fi(xi,ν) Plugging (78) and (77) into (76) gives the desired result. Lemma F.2 (Counterpart of Lemma C.4). Suppose Assumptions C and D hold. Further, assume τi 1 and αi = α/τi, i S, where α 1/(3Lf). Then, for all ν {0, . . . , τi 1}, FEDOUT gives E h xi,ν x 2i 12τ 2 i α2 i E h f(x) 2i + 27τiα2 i σ2 f. (79) Proof. The result trivially holds for τi = 1. Similar to what is done in the proof of Lemma C.4, let τi > 1 and define vi,ν := fi(xi,ν) fi(x) + f(x), wi,ν := hi(xi,ν) fi(xi,ν) + fi(x) hi(x) + h(x) f(x). (80) where hi(x) = xfi(x; ξi,ν) and h(x) = 1/m Pm i=1 xfi(x; ξi). From Algorithm 2, for each i S, and ν {0, . . . , τi 1}, we obtain xi,ν+1 x = xi,ν x αi (hi(xi,ν) hi(x) + h(x)) = xi,ν x αi (vi,ν + wi,ν) , which implies that E xi,ν+1 x 2 = E xi,ν x αivi,ν 2 + α2 i E wi,ν 2 2E [E [ xi,ν x αivi,ν, αiwi,ν | Fi,ν 1]] = E xi,ν x αivi,ν 2 + α2 i E wi,ν 2 . Here, the last equality uses Lemma G.3 since E[wi,ν|Fi,ν 1] = 0. FEDNEST: Federated Bilevel, Minimax, and Compositional Optimization From Assumption D and Lemma G.1, for wi,ν defined in (80), we have E wi,ν 2 3E hi(xi,ν) fi(xi,ν) 2 + fi(x) hi(x) 2 + h(x) f(x) 2 9σ2 f. (82) Substituting (82) into (81), we get E xi,ν x αivi,ν 2 1 + 1 2τi 1 E xi,ν x 2 + 2τiα2 i E vi,ν 2 + 9α2 i σ2 f 1 + 1 2τi 1 + 4τiα2 i L2 f E xi,ν x 2 + 4τiα2 i E f(x) 2 + 9α2 i σ2 f E xi,ν x 2 + 4τiα2 i E f(x) 2 + 9α2 i σ2 f. 
Here, the first inequality follows from Lemma G.2; the second inequality uses Assumption C and Lemma G.1; and the last inequality follows by noting αi = α/τi, i S and α 1/(3Lf). Now, iterating equation (83) and using xi,0 = x, i S, we obtain E xi,ν x 2 4τiα2 i E f(x) 2 + 9α2 i σ2 f ν 1 X 12τ 2 i α2 i E f(x) 2 + 27τiα2 i σ2 f, where the second inequality uses (45). This completes the proof. F.1. Proof of Theorem 3.2 Proof. Let α1 := 1/(3Lf(1 + 8Lf)). Note that by our assumption αk α1. Hence, the stepsize αk satisfies the condition of Lemma F.2, and we have 6L2 fα3 k α2 k/4. This together with Lemmas F.1 and F.2 gives E f(xk+1) E f(xk) αk 2 E[ f(xk) 2] + L2 fαk 2m ν=0 E h xk i,ν xk 2i 2 (1 αk Lf)E ν=0 fi(xk i,ν) 2 E[ f(xk) 2] + L2 fαk 2m ν=0 E h xk i,ν xk 2i + α2 k Lf 2 E[ f(xk) 2] + 6L2 fα3 k E[ f(xk) 2] + 27 2 α3 k L2 f + α2 k Lf 4 E[ f(xk) 2] + (4 + Lf)α2 kσ2 f, where the second and last inequalities follow from (14). Summing (85) over k and using our choice of stepsize in (14), we obtain k=0 E[ f(xk) 2] 4 f + 4(4 + Lf)σ2 f α α + 4(4 + Lf) ασ2 f where f = f(x0) E[f(x K)]. FEDNEST: Federated Bilevel, Minimax, and Compositional Optimization G. Other Technical Lemmas We collect additional technical lemmas in this section. Lemma G.1. For any set of vectors {xi}m i=1 with xi Rd, we have i=1 xi 2. (87) Lemma G.2. For any x, y Rd, the following holds for any c > 0: x + y 2 (1 + c) x 2 + 1 + 1 Lemma G.3. For any set of independent, mean zero random variables {xi}m i=1 with xi Rd, we have i=1 E h xi 2i . (89) H. Additional Experimental Results In this section, we first provide the detailed parameters in Section 4 and then discuss more experiments. In Section 4, our federated algorithm implementation is based on (Ji, 2018), both hyper-representation and loss function tuning use batch size 64 and Neumann series parameter N = 5. We conduct 5 SGD/SVRG epoch of local updates in FEDINN and τ = 1 in FEDOUT. 
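The SVRG-style local updates mentioned here follow the correction pattern used throughout the analysis (cf. Sections D and F): evaluate the local stochastic gradient at the current local iterate, subtract its value at the broadcast point, and add the global gradient at the broadcast point. A minimal sketch of this pattern (our own illustration with hypothetical quadratic client gradients grad_i(x) = x − b_i, not the experiment code):

```python
import numpy as np

def corrected_grad(grad_i, x_local, x_anchor, all_grads):
    """SVRG-style corrected direction used in place of the raw local gradient:
    grad_i(x_local) - grad_i(x_anchor) + average of all client gradients at x_anchor."""
    return (grad_i(x_local) - grad_i(x_anchor)
            + np.mean([g(x_anchor) for g in all_grads], axis=0))

# Hypothetical heterogeneous clients with gradients grad_i(x) = x - b_i.
b = [np.array([-1.0, 2.0]), np.array([3.0, 0.0])]
grads = [lambda x, bi=bi: x - bi for bi in b]

x = np.array([0.5, 0.5])                               # broadcast (anchor) point
h0 = corrected_grad(grads[0], x, x, grads)             # local iterate == anchor
global_grad = np.mean([g(x) for g in grads], axis=0)   # exact global gradient
```

At the anchor point the correction makes every client's direction coincide with the global gradient, which is why it counteracts client drift during local epochs.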
In FEDNEST, we use T = 1, have 100 clients in total, and select 10 clients in each FEDNEST epoch.

Figure 5: Hyper-representation experiment comparing LFEDNEST, LFEDNESTSVRG, and LFEDNEST-NONALT on a non-i.i.d. dataset: (a) test accuracy with respect to the algorithm epochs; (b) test accuracy with respect to the number of communications. The numbers in parentheses correspond to the communication rounds shown in Table 2.

H.1. The effect of alternating between inner and outer global variables

In our additional experiments, we investigate the effect of alternating between the inner and outer global variables x and y. We use LFEDNEST-NONALT to denote the training in which each client updates its local yi and then updates its local xi with respect to the local yi, for all i ∈ S. Hence, the nested optimization is performed locally (within the clients), and the joint variable [xi, yi] is communicated with the server. One can notice that only one communication is conducted when the server updates the global x and y by aggregating all xi and yi.

Figure 6: Learning rate analysis on non-i.i.d. data with respect to the number of communications, comparing LFEDNEST (T+1), LFEDNESTSVRG (2T+1), and FEDNEST (2T+N+3) under learning rates (a) α = 0.0075, (b) α = 0.005, (c) α = 0.0025, and (d) α = 0.001.
As illustrated in Figure 5, the test accuracy of LFEDNEST-NONALT plateaus around 80%, whereas both standard LFEDNEST and LFEDNESTSVRG achieve better performance. Here, the number of inner iterations is set to T = 1. The performance boost reveals the necessity of both averaging and SVRG in FEDINN, where the extra communication makes the clients more consistent.

H.2. The effect of the learning rate and the global inverse Hessian

Figure 6 shows that, on the non-i.i.d. dataset, both SVRG and FEDOUT have the effect of stabilizing the training. Here, we set T = 1 and N = 5. As we observe in (a)-(d), as the learning rate decreases, the algorithms with more communication per epoch converge more easily. We note that LFEDNEST successfully converges in (d) with a very small learning rate. In contrast, in (a), FEDNEST (using the global inverse Hessian) achieves better test accuracy in the same communication round with a larger learning rate.
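As a closing note on the global inverse Hessian: the truncated Neumann-series approximation that FEDNEST-style hypergradient estimators rely on is easy to sanity-check in isolation. The sketch below (our own illustration; the matrix and truncation level are arbitrary) approximates H⁻¹ for a symmetric positive definite H with eigenvalues in (0, ℓ], where the bias decays geometrically at rate (1 − µ/ℓ) as in Lemma 2.2.

```python
import numpy as np

def neumann_inverse(H, ell, N):
    """Truncated Neumann series: H^{-1} ~= (1/ell) * sum_{n=0}^{N-1} (I - H/ell)^n,
    valid when H is symmetric positive definite with eigenvalues in (0, ell]."""
    d = H.shape[0]
    term = np.eye(d)              # (I - H/ell)^0
    acc = np.zeros((d, d))
    for _ in range(N):
        acc += term
        term = term @ (np.eye(d) - H / ell)
    return acc / ell

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
H = M @ M.T + np.eye(4)               # SPD with smallest eigenvalue >= 1
ell = np.linalg.eigvalsh(H).max()     # use the top eigenvalue as the scale
err = np.linalg.norm(neumann_inverse(H, ell, 400) - np.linalg.inv(H))
```

In the federated setting only Hessian-vector products with this series are needed, which is what keeps the per-round communication of FEDNEST manageable.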