# First-Order Federated Bilevel Learning

Yifan Yang¹, Peiyao Xiao¹, Shiqian Ma², Kaiyi Ji¹
¹Department of Computer Science and Engineering, University at Buffalo
²Department of Computational Applied Math and Operations Research, Rice University
yyang99@buffalo.edu, peiyaoxi@buffalo.edu, sqma@rice.edu, kaiyiji@buffalo.edu

Abstract

Federated bilevel optimization (FBO) has garnered significant attention lately, driven by its promising applications in meta-learning and hyperparameter optimization. Existing algorithms generally aim to approximate the gradient of the upper-level objective function (the hypergradient) in the federated setting. However, because of the nonlinearity of the hypergradient and client drift, they often involve complicated computations, such as multiple optimization sub-loops and second-order derivative evaluations, which incur significant memory consumption and high computational cost. In this paper, we propose a computationally and memory-efficient FBO algorithm named MemFBO. MemFBO features a fully single-loop structure with all involved variables updated simultaneously, and uses only first-order gradient information for all local updates. We show that MemFBO exhibits a linear convergence speedup under milder assumptions in both partial and full client participation scenarios. We further implement MemFBO in a novel FBO application for federated data cleaning. Our experiments, conducted on this application and on federated hyper-representation, demonstrate the effectiveness of the proposed algorithm.

1 Introduction

Bilevel optimization has received significant attention recently in a wide range of applications, including few-shot meta-learning (Finn, Abbeel, and Levine 2017; Rajeswaran et al. 2019), hyperparameter search (Franceschi et al. 2018; Feurer and Hutter 2019), fairness-aware machine learning (Roh et al. 2021), lifelong learning (Hao, Ji, and Liu 2023), etc.
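The nesting in such problems (an inner argmin feeding an outer objective) can be made concrete with a tiny example. The sketch below uses a hypothetical instance with three clients, scalar variables, and quadratic losses; all values are illustrative assumptions, not from the paper:

```python
import numpy as np

# Hypothetical toy instance of a federated bilevel problem with n = 3 clients
# and scalar x, y (all values below are illustrative assumptions):
#   lower level: g_i(x, y) = 0.5 * (y - a_i * x)^2
#   upper level: f_i(x, y) = 0.5 * (y - b_i)^2
a = np.array([1.0, 2.0, 3.0])   # per-client lower-level parameters
b = np.array([0.5, 1.5, 2.5])   # per-client upper-level targets

def y_star(x):
    # The averaged lower level G(x, y) = mean_i g_i(x, y) is minimized at
    # y = mean(a) * x (set dG/dy = 0 and solve).
    return a.mean() * x

def Phi(x):
    # Outer objective: average upper-level loss at the lower-level solution
    return np.mean(0.5 * (y_star(x) - b) ** 2)

print(Phi(0.7))  # nested objective evaluated at x = 0.7
```

The key point is that the outer objective can only be evaluated through the inner solution `y_star`, which is exactly what makes the gradient of the outer objective nontrivial.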
Due to the significant increase in the demand for efficient computing and growing concerns about user privacy, recent studies have shifted their attention to efficiently solving the following federated bilevel optimization (FBO) problem:

$$\min_{x\in\mathbb{R}^p} \Phi(x) = F\big(x, y^*(x)\big) := \frac{1}{n}\sum_{i=1}^n f_i\big(x, y^*(x)\big) \quad \text{s.t.} \quad y^*(x) = \arg\min_{y\in\mathbb{R}^q} G(x, y) := \frac{1}{n}\sum_{i=1}^n g_i(x, y), \tag{1}$$

where n is the total number of clients; p and q are the dimensions of x and y; and $f_i(x, y^*(x)) = \mathbb{E}_{\xi_i}\big[f_i(x, y^*(x); \xi_i)\big]$ and $g_i(x, y) = \mathbb{E}_{\zeta_i}\big[g_i(x, y; \zeta_i)\big]$ denote the expectations of the ith upper- and lower-level functions w.r.t. the random variables ξ_i and ζ_i, respectively.

Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Various FBO algorithms have been developed recently (Tarzanagh et al. 2022; Xiao and Ji 2023; Yang, Xiao, and Ji 2023b) and applied to important applications such as federated meta-learning (Fallah, Mokhtari, and Ozdaglar 2020; Xiao and Ji 2023) and graph-aided federated learning (Xing et al. 2022). However, existing algorithms often suffer from computational and memory issues due to a complex implementation and the computation of high-order model information, induced by the nested optimization structure in FBO. For example, (Huang et al. 2022; Tarzanagh et al. 2022; Huang, Zhang, and Ji 2023) and (Xiao and Ji 2023) proposed two types of FBO algorithms based on approximate implicit differentiation (AID) and iterative differentiation (ITD), which involve multiple sub-loops to approximate the global federated hypergradient (i.e., the gradient of the upper-level objective function). Most recently, (Yang, Xiao, and Ji 2023b) proposed a more efficient single-loop FBO algorithm called SimFBO without any sub-loop. However, the local aggregation in each client of SimFBO needs to store multiple Hessian- or Jacobian-vector products.
This causes practical challenges related to computing and memory, as edge devices like smartphones typically have relatively low memory availability and computational capacity. To address these challenges, this paper proposes a computationally and memory-efficient FBO algorithm named MemFBO, and further applies MemFBO to a novel application of FBO in robust federated learning with potentially noisy data. Our specific contributions are summarized as follows.

Inspired by the Lagrangian approximation of bilevel optimization, we use simpler single-level minimax problems as effective substitutes for the original local and global bilevel problems, making them easier to manage. We then propose a simple and efficient FBO algorithm named MemFBO. As illustrated in Figure 1, MemFBO features a fully single-loop optimization structure and computes only gradient information at each local step for all participating clients.

The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

Figure 1: Algorithmic comparison of the federated hypergradient-based methods FedNest (Tarzanagh et al. 2022) (left), SimFBO (Yang, Xiao, and Ji 2023b) (middle), and our MemFBO (right). MemFBO has a single-loop structure similar to SimFBO, but performs more efficient updates using only first-order gradient information and without any projection.

We provide a comprehensive convergence analysis for MemFBO under both partial and full participation. We show that MemFBO achieves a linear speedup (i.e., the communication complexity improves linearly with respect to the number of sampled clients) under client sampling without replacement. Technically, we eliminate the restrictive assumption of SimFBO (Yang, Xiao, and Ji 2023b) on the third-order Lipschitz continuity of all local lower-level objective functions.
Compared to the analysis of Lagrangian methods in standard bilevel optimization, our characterizations need to 1) control the variance associated with local stochastic estimation and the errors from client sampling, 2) mitigate the impact of client drift on the final convergence, and 3) mitigate the negative impact of a large Lagrange multiplier, which amplifies both the client drift and the gradient estimation variance.

We further explore a new application of FBO: robust federated data cleaning via adaptive weight selection. We consider the setting where each client holds two sets of data: one clean and of high quality, while the other is possibly noisy and constitutes the majority. For example, on mobile devices, the noisy data often contain errors or variations due to measurement or transmission issues, such as sensor errors, network delays, or channel noise. To deal with this challenge, we propose a federated data cleaning approach through the lens of FBO, where the lower level minimizes a weighted loss over the noisy samples of all participating clients, and the upper level fine-tunes these weights to maximize prediction accuracy on the clean data. We implement MemFBO in this application and demonstrate that it is more robust in the noisy setting than multiple popular (personalized) federated learning baselines. Moreover, our experiments on federated hyper-representation also demonstrate improved efficiency and accuracy over existing FBO approaches.

2 Related Work

Bilevel optimization. Bilevel optimization has been extensively studied since it was first introduced by (Bracken and McGill 1973). Early works (Hansen, Jaumard, and Savard 1992; Shi, Lu, and Zhang 2005; Gould et al. 2016; Sinha, Malo, and Deb 2017) solved the bilevel problem from a constrained optimization perspective.
More recently, gradient-based bilevel methods have received intensive attention due to their efficiency and effectiveness in machine learning problems; they include approximate implicit differentiation (AID) (Domke 2012; Liao et al. 2018; Pedregosa 2016; Lorraine, Vicol, and Duvenaud 2020; Grazzi et al. 2020; Ji, Yang, and Liang 2021; Arbel and Mairal 2021; Hong et al. 2023) and iterative differentiation (ITD) (Finn, Abbeel, and Levine 2017; Franceschi et al. 2017; Shaban et al. 2019; Grazzi et al. 2020; Liu et al. 2021b; Ji, Yang, and Liang 2021) based approaches. Stochastic bilevel algorithms have been studied recently using techniques such as Neumann series (Chen, Sun, and Yin 2021; Ji, Yang, and Liang 2021; Arbel and Mairal 2021), variance reduction (Yang, Ji, and Liang 2021; Dagréou et al. 2022; Huang and Huang 2021; Guo et al. 2021), and finite-difference matrix-vector estimation (Yang, Xiao, and Ji 2023a). Another class of approaches formulates the lower-level problem as a value-function-based constraint (Sabach and Shtern 2017; Liu et al. 2020; Li, Gu, and Huang 2020; Liu et al. 2021b,a, 2022; Sow et al. 2022; Shen and Chen 2023). The most relevant works (Kwon et al. 2023; Wang et al. 2023) proposed first-order algorithms that solve a single-level Lagrangian-induced problem. Inspired by a similar idea, we propose effective single-level surrogates for the local and global problems in FBO, and further develop a novel efficient algorithm with strong resilience to client drift.

Federated bilevel optimization. Federated bilevel optimization has not been widely explored except in a few studies. (Gao 2022; Li, Huang, and Huang 2022) proposed momentum-based algorithms in homogeneous settings. (Tarzanagh et al. 2022; Huang, Zhang, and Ji 2023) proposed FedNest and FedMBO using an AID-based federated hypergradient estimator in heterogeneous settings.
(Xiao and Ji 2023) and (Yang, Xiao, and Ji 2023b) further improved the communication efficiency of AID-based methods via ITD-based hypergradient aggregation and simple single-loop optimization, respectively. (Huang 2022; Li, Huang, and Huang 2023) exploited momentum-based variance reduction to reduce the overall sample complexity. In addition, some studies have focused on other distributed settings, including decentralized bilevel optimization (Chen, Huang, and Ma 2022; Chen et al. 2023; Yang, Zhang, and Wang 2022; Lu et al. 2022; Dong et al. 2023), distributed network utility maximization (Ji and Ying 2023), and asynchronous optimization over directed networks (Yousefian 2021). In this paper, we propose a simple first-order FBO method and explore its performance in an important federated application.

3 Existing Methods and Challenges

3.1 Computing and Memory Challenges

To solve the problem in eq. (1) efficiently, the high-level idea of existing works like (Tarzanagh et al. 2022; Gao 2022; Li, Huang, and Huang 2022; Xiao and Ji 2023; Yang, Xiao, and Ji 2023b) is to approximate the following federated hypergradient (i.e., the gradient of the upper-level function Φ) induced by the implicit function theorem:

$$\nabla\Phi(x) = \nabla_x F^* - \nabla^2_{xy}G^*\big[\nabla^2_{yy}G^*\big]^{-1}\nabla_y F^* = \frac{1}{n}\sum_{i=1}^n \nabla_x f_i^* - \Big(\frac{1}{n}\sum_{i=1}^n \nabla^2_{xy}g_i^*\Big)\Big(\frac{1}{n}\sum_{i=1}^n \nabla^2_{yy}g_i^*\Big)^{-1}\Big(\frac{1}{n}\sum_{i=1}^n \nabla_y f_i^*\Big), \tag{2}$$

where we denote $f_i^* := f_i(x, y^*(x))$, $g_i^* := g_i(x, y^*(x))$, $F^* := F(x, y^*(x))$ and $G^* := G(x, y^*(x))$. As shown in eq. (2), one cannot use the aggregation of local hypergradients to approximate the global hypergradient ∇Φ(x), due to the nonlinearity introduced by the matrix multiplication and the inversion of the global Hessian matrix $\nabla^2_{yy}G^*$. Various approaches have been used to address this challenge. For example, (Tarzanagh et al.
2022; Gao 2022; Li, Huang, and Huang 2022) extended the idea of AID-based hypergradient estimation (Ghadimi and Wang 2018) to the federated setting and proposed AID-based FBO algorithms, which require each participating client to compute a large number of Hessian-vector products when estimating the global Hessian-inverse-vector product of the federated hypergradient. (Xiao and Ji 2023) proposed an ITD-based FBO algorithm to reduce the communication cost of AID-based methods, but it still requires numerous Hessian-vector computations in each communication round. Most recently, (Yang, Xiao, and Ji 2023b) proposed a more efficient single-loop FBO algorithm named SimFBO that updates the variables y, v, x simultaneously, with each local step given by

$$y_i^{k+1} = y_i^k - \eta_y \nabla_y g_i(x_i^k, y_i^k; \zeta_i^k), \quad v_i^{k+1} = v_i^k - \eta_v \nabla_v R_i(x_i^k, y_i^k, v_i^k; \psi_i^k), \quad x_i^{k+1} = x_i^k - \eta_x \bar\nabla f_i(x_i^k, y_i^k, v_i^k; \xi_i^k),$$

where $\bar\nabla f_i(x, y, v; \xi) = \nabla_x f_i(x, y; \xi) - \nabla^2_{xy} g_i(x, y; \xi)\, v$ is the local hypergradient estimate for client i, and $R_i(x, y, v) = \frac{1}{2}v^T \nabla^2_{yy} g_i(x, y)\, v - v^T \nabla_y f_i(x, y)$ is the loss function of client i for approximating the global Hessian-inverse-vector product in eq. (2). Then, SimFBO performs a local aggregation in each client i of the form

$$\sum_{k=0}^{\tau-1} a_i^k\, \nabla_v R_i(x_i^k, y_i^k, v_i^k; \psi_i^k), \tag{3}$$

and similar aggregations are also applied to the local updates of x and y. It can be seen from eq. (3) that the local aggregation of SimFBO on v and x needs to store τ (i.e., the number of local steps) Hessian- or Jacobian-vector products, which can incur substantial computing and memory costs, as further shown in Figure 4 in the experiments.

3.2 Our Solution: First-Order FBO

Inspired by recent advances in first-order bilevel optimization (Lin, Xu, and Ye 2014; Liu et al. 2021a; Kwon et al. 2023; Wang et al. 2023), we reformulate the FBO problem in eq. (1) into an equivalent constrained problem

$$\min_{x,y} F(x, y) \quad \text{s.t.} \quad G(x, y) - G\big(x, y^*(x)\big) \le 0,$$

where $G(x, y^*(x))$ is the optimal value function of the lower-level problem.
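To see why this reformulation admits purely first-order methods, one can penalize the value-function constraint with a multiplier λ and work with F(x, y) + λ(G(x, y) − G(x, z)), whose partial gradients involve no Hessians. The sketch below, on hypothetical quadratic clients (all values are assumptions, not from the paper), compares the x-gradient of this penalized objective at the inner solutions with the second-order hypergradient of eq. (2); consistent with the O(1/λ) gap discussed later, the difference shrinks as λ grows:

```python
import numpy as np

# Toy quadratic clients with assumed values (not from the paper):
#   f_i(x, y) = 0.5*(y - b_i)^2 + 0.5*c_i*x^2,  g_i(x, y) = 0.5*(y - a_i*x)^2,
# so the lower-level solution is y*(x) = mean(a) * x.
a = np.array([1.0, 2.0, 3.0])
b = np.array([0.5, 1.5, 2.5])
c = np.array([0.2, 0.4, 0.6])

def hypergradient(x):
    """Federated hypergradient of eq. (2), using second-order information."""
    y = a.mean() * x
    dxF = np.mean(c * x)        # (1/n) sum_i grad_x f_i
    dyF = np.mean(y - b)        # (1/n) sum_i grad_y f_i
    dxyG = np.mean(-a)          # (1/n) sum_i d^2 g_i / dx dy
    dyyG = 1.0                  # (1/n) sum_i d^2 g_i / dy^2
    return dxF - dxyG * (1.0 / dyyG) * dyF

def grad_penalized_x(x, lam):
    """x-gradient of F + lam*(G(x,y) - G(x,z)) at the inner solutions.
    Only first-order partial derivatives are needed."""
    z = a.mean() * x                                  # maximizer over z is y*(x)
    y = (b.mean() + lam * a.mean() * x) / (1 + lam)   # argmin_y of F + lam*G
    Fx = np.mean(c * x)
    Gx_y = np.mean(-a * (y - a * x))
    Gx_z = np.mean(-a * (z - a * x))
    return Fx + lam * (Gx_y - Gx_z)

x = 0.7
for lam in [1.0, 10.0, 100.0]:
    print(lam, abs(grad_penalized_x(x, lam) - hypergradient(x)))  # gap ~ O(1/lam)
```

For this toy instance the gap is exactly 0.2/(1 + λ), so a large multiplier makes the first-order surrogate gradient an accurate stand-in for the hypergradient.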
One effective way to deal with this constrained problem is to solve its Lagrangian problem:

$$\min_{x,y}\max_z L(x, y, z) = F(x, y) + \lambda\big(G(x, y) - G(x, z)\big),$$

where λ is the Lagrange multiplier. Note that the maximizer over z equals y*(x). The above Lagrangian approximation provides several advantages. First, differently from existing hypergradient-based methods, its gradients w.r.t. all variables x, y, z contain no second-order derivatives such as Hessian or Jacobian matrices. Second, existing methods include complex sub-loops to tackle the nested optimization structure of the original FBO problem; in contrast, this Lagrangian problem is a minimax problem with a single objective, and hence facilitates a simpler implementation in federated learning. Third, the Lagrangian function can be decomposed into local Lagrangian functions as

$$\min_{x,y}\max_z L(x, y, z) := \frac{1}{n}\sum_{i=1}^n L_i(x, y, z), \tag{4}$$

where each local Lagrangian is given by $L_i(x_i, y_i, z_i) := f_i(x_i, y_i) + \lambda\big(g_i(x_i, y_i) - g_i(x_i, z_i)\big)$. As a result, each client can focus on its local Lagrangian problem

$$\text{(Local Lagrangian:)} \quad \min_{x_i, y_i}\max_{z_i} L_i(x_i, y_i, z_i), \tag{5}$$

which also facilitates single-loop updates on x_i, y_i and z_i. For simplicity, we use the same Lagrange multiplier λ for all local problems. However, it is also feasible to use different multipliers λ_i, i = 1, ..., n for different local Lagrangian problems, because it can be shown that the gap between the hypergradient of the original problem and the gradient w.r.t. x of the Lagrangian function $L^*(x) = \min_y \max_z L(x, y, z)$ vanishes if we choose λ_i sufficiently large.

4 The MemFBO Algorithm

In this section, we introduce an efficient FBO algorithm named MemFBO, which features flexible updates and aggregation on both the client and server sides, as described in Algorithm 1.

Local updates. Specifically, at the beginning of each communication round, we sample a subset C_t of clients without replacement.
Then, each client in C_t performs τ local stochastic gradient descent (SGD) steps to optimize its local Lagrangian problem:

$$z_i^{t,k+1} = z_i^{t,k} - \eta_z \nabla_y g_i(x_i^{t,k}, z_i^{t,k}; \zeta_i^{t,k}), \quad y_i^{t,k+1} = y_i^{t,k} - \eta_y \nabla_y L_i(x_i^{t,k}, y_i^{t,k}, z_i^{t,k}; \psi_i^{t,k}), \quad x_i^{t,k+1} = x_i^{t,k} - \eta_x \nabla_x L_i(x_i^{t,k}, y_i^{t,k}, z_i^{t,k}; \xi_i^{t,k}), \tag{6}$$

where η_x, η_y, η_z are positive local stepsizes and the stochastic Lagrangian function is defined as $L_i(x, y, z; \xi) = f_i(x, y; \xi) + \lambda\big(g_i(x, y; \xi) - g_i(x, z; \xi)\big)$. Note from eq. (6) that the local updates on z, y and x are conducted simultaneously and hence can benefit greatly from hardware parallelism. In addition, differently from SimFBO (Yang, Xiao, and Ji 2023b), whose local updates involve second-order derivatives, all steps in eq. (6) use only first-order gradient information and hence are more computing- and memory-friendly. It is also worth mentioning that the local SGD steps can be extended to momentum-based updates to achieve improved theoretical complexity.

Local aggregation and communication. In this stage, each client i ∈ C_t aggregates all τ local updates:

$$h_{z,i}^t = \frac{1}{\tau}\sum_{k=0}^{\tau-1} \nabla_y g_i(x_i^{t,k}, z_i^{t,k}; \zeta_i^{t,k}), \quad h_{y,i}^t = \frac{1}{\tau}\sum_{k=0}^{\tau-1} \nabla_y L_i(x_i^{t,k}, y_i^{t,k}, z_i^{t,k}; \psi_i^{t,k}), \quad h_{x,i}^t = \frac{1}{\tau}\sum_{k=0}^{\tau-1} \nabla_x L_i(x_i^{t,k}, y_i^{t,k}, z_i^{t,k}; \xi_i^{t,k}). \tag{7}$$

These normalized aggregations $h_{z,i}^t$, $h_{y,i}^t$ and $h_{x,i}^t$ are then communicated to the server. Differently from existing AID-based and ITD-based FBO algorithms, which require multiple sub-loops and communication rounds to approximate the federated hypergradient at this stage, our updates are much simpler, with only a single communication round involved.

Global aggregation and updates. The server collects the local aggregations $h_{z,i}^t$, $h_{y,i}^t$ and $h_{x,i}^t$, which are then further aggregated into

$$\big(h_z^t, h_y^t, h_x^t\big) = \frac{1}{|C_t|}\sum_{i\in C_t}\big(h_{z,i}^t, h_{y,i}^t, h_{x,i}^t\big). \tag{8}$$

Based on the global aggregations {h_z^t, h_y^t, h_x^t}, the next step is to update the global variables x^t, y^t and z^t simultaneously as

$$z^{t+1} = z^t - \gamma_z h_z^t, \quad y^{t+1} = y^t - \gamma_y h_y^t, \quad x^{t+1} = x^t - \gamma_x h_x^t. \tag{9}$$
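A minimal end-to-end sketch of these local steps, normalized aggregations, and global updates, on hypothetical quadratic clients with full participation (every constant below, including the stepsizes and λ, is an illustrative assumption rather than the paper's choice), is:

```python
import numpy as np

# Minimal sketch of the update scheme of eqs. (6)-(9) under full participation,
# on illustrative quadratic clients (all functions and hyperparameters here are
# assumptions for the demo, not the paper's choices):
#   f_i(x, y) = 0.5*(y - b_i)^2 + 0.5*c_i*x^2,  g_i(x, y) = 0.5*(y - a_i*x)^2
a = np.array([1.0, 2.0, 3.0])
b = np.array([0.5, 1.5, 2.5])
c = np.array([0.2, 0.4, 0.6])
n, tau, lam = 3, 5, 50.0
eta_z, eta_y, eta_x = 1e-4, 1e-5, 1e-6   # local stepsizes (small, since lam is large)
gam_z, gam_y, gam_x = 0.5, 1e-2, 1e-3    # global stepsizes: z fastest, x slowest

x, y, z = 5.0, 0.0, 0.0
for t in range(3000):
    hz = hy = hx = 0.0
    for i in range(n):       # partial participation would sample a subset C_t here
        xi, yi, zi = x, y, z
        sz = sy = sx = 0.0
        for k in range(tau):                             # local updates, eq. (6)
            gz = zi - a[i] * xi                          # grad_z of g_i(x, z)
            gy = (yi - b[i]) + lam * (yi - a[i] * xi)    # grad_y of L_i
            gx = c[i] * xi + lam * a[i] * (zi - yi)      # grad_x of L_i
            zi -= eta_z * gz; yi -= eta_y * gy; xi -= eta_x * gx
            sz += gz; sy += gy; sx += gx
        hz += sz / tau; hy += sy / tau; hx += sx / tau   # local aggregation, eq. (7)
    hz, hy, hx = hz / n, hy / n, hx / n                  # global aggregation, eq. (8)
    z -= gam_z * hz; y -= gam_y * hy; x -= gam_x * hx    # global updates, eq. (9)

print(x)  # close to argmin Phi, which is 3/4.4 ~ 0.68 for this toy problem
```

Note that the sketch drives z and y with larger global stepsizes than x: separate per-variable stepsizes keep the fast inner variables tracking the slowly moving upper-level variable.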
Algorithm 1: MemFBO
1: Input: initialization x⁰, y⁰, z⁰; local learning rates η_z, η_y, η_x; global learning rates γ_z, γ_y, γ_x; number of local steps τ; number of communication rounds T.
2: for t = 0, 1, 2, ..., T − 1 do
3:   for each client i ∈ C_t in parallel do
4:     Set z_i^{t,0} = z^t, y_i^{t,0} = y^t, x_i^{t,0} = x^t
5:     for k = 0, 1, 2, ..., τ − 1 do
6:       Locally update z_i^{t,k}, y_i^{t,k} and x_i^{t,k} simultaneously via eq. (6)
7:     end for
8:     Client i computes the local aggregations h_{z,i}^t, h_{y,i}^t, h_{x,i}^t via eq. (7)
9:     Client i communicates {h_{z,i}^t, h_{y,i}^t, h_{x,i}^t} to the server
10:   end for
11:   The server computes the global aggregations {h_z^t, h_y^t, h_x^t} via eq. (8) and updates x^t, y^t, z^t via eq. (9)
12: end for

Note that the global updates above use a different set {γ_z, γ_y, γ_x} of stepsizes from the set {η_z, η_y, η_x} used in the local updates. This flexible design not only facilitates algorithmic deployment, because the local stepsizes are often less restrictive than the global ones, but also improves the overall communication efficiency in theory and practice.

5 Convergence Analysis

5.1 Assumptions and Definitions

Definition 5.1. A mapping f is L-Lipschitz continuous if $\|f(x_1) - f(x_2)\| \le L\|x_1 - x_2\|$ for any x_1, x_2.

Definition 5.2. We call $\bar{x}$ an ϵ-accurate stationary point of the objective function Φ(x) if $\mathbb{E}\|\nabla\Phi(\bar{x})\|^2 \le \epsilon$, where $\bar{x}$ is an output of an algorithm.

The following assumption characterizes the geometry of the objective functions.

Assumption 5.3. For any $x \in \mathbb{R}^{d_x}$, $y \in \mathbb{R}^{d_y}$ and i ∈ {1, 2, ..., n}, f_i(x, y) and g_i(x, y) are twice continuously differentiable, and g_i(x, y) is μ-strongly convex w.r.t. y.

We next impose Lipschitzness conditions on the functions f_i and g_i and their derivatives.

Assumption 5.4. For any $x \in \mathbb{R}^{d_x}$, $y \in \mathbb{R}^{d_y}$ and i, f_i(x, y) is L_{f,0}-Lipschitz continuous, and g_i(x, y) is L_{g,0}-Lipschitz continuous w.r.t. x; the gradients ∇f_i(x, y) and ∇g_i(x, y) are L_{f,1}- and L_{g,1}-Lipschitz continuous, respectively; and the second-order derivatives ∇²f_i(x, y) and ∇²g_i(x, y) are L_{f,2}- and L_{g,2}-Lipschitz continuous, respectively.
Note from Assumption 5.4 that we assume g_i(x, y) is L_{g,0}-Lipschitz continuous w.r.t. x only, rather than w.r.t. y, because the latter would contradict the fact that g_i(x, y) is μ-strongly convex w.r.t. y. Next, we assume that the stochastic gradients and second-order derivatives have bounded estimation variances.

Assumption 5.5. ∇f_i(x, y; ξ) and ∇g_i(x, y; ζ) are unbiased estimators of ∇f_i(x, y) and ∇g_i(x, y). Furthermore, there exist constants σ_f and σ_g such that

$$\mathbb{E}\|\nabla f_i(x, y) - \nabla f_i(x, y; \xi)\|^2 \le \sigma_f^2, \quad \mathbb{E}\|\nabla g_i(x, y) - \nabla g_i(x, y; \zeta)\|^2 \le \sigma_g^2.$$

The following assumption characterizes the degree of heterogeneity among the individual gradients ∇_y g_i(x, y), i ∈ {1, ..., n}.

Assumption 5.6. For any $x \in \mathbb{R}^{d_x}$, $y \in \mathbb{R}^{d_y}$, there exist constants β_{gh} ≥ 1 and σ_{gh} ≥ 0 such that

$$\frac{1}{n}\sum_{i=1}^n \mathbb{E}\|\nabla_y g_i(x, y)\|^2 \le \beta_{gh}^2\, \mathbb{E}\|\nabla_y G(x, y)\|^2 + \sigma_{gh}^2.$$

We have β_{gh} = 1 and σ_{gh} = 0 when all g_i's are identical. This assumption uses β_{gh} and σ_{gh} to describe the degree of heterogeneity among the gradients of the lower-level objective functions of all clients. Note that we do not directly impose the heterogeneity assumption on the target Lagrangian objective L in eq. (4), because such an assumption can be hard to justify due to the dependence of L on λ, which can be arbitrarily large. Instead, we impose the heterogeneity assumption only on the lower-level gradients, which do not involve λ, and then use it to prove a heterogeneity result for L (see eq. (18) in the appendix for more details).

5.2 Convergence and Complexity Analysis

For simplicity, we let |C_t| = P for all t. Let $L^*(x) := \min_y \max_z L(x, y, z)$ be the Lagrangian function of x obtained by taking the minimum over y and the maximum over z. Since it can be shown that the gradient ∇L*(x) has a gap of O(1/λ) from the gradient ∇Φ(x) of the original problem, we can focus on finding an ϵ-accurate stationary point of L*(x) for λ sufficiently large. We first provide an upper bound on $\mathbb{E}\|\nabla L^*(x^t)\|^2$.

Proposition 5.7.
Under Assumptions 5.4 and 5.5, the iterates of Algorithm 1 satisfy

$$\mathbb{E}\|\nabla L^*(x^t)\|^2 \le O(1/\gamma_x)\,\mathbb{E}\big[L^*(x^t) - L^*(x^{t+1})\big] + O(1)\,\mathbb{E}\|\tilde{h}_x^t\|^2 + O(\gamma_x)\,\mathbb{E}\|h_x^t\|^2 + O(\lambda^2)\big(\Delta_x^t + \Delta_y^t + \Delta_z^t\big) + O(\lambda^2)\,\mathbb{E}\big[\|z^t - y^*(x^t)\|^2 + \|y^t - y_\lambda^*(x^t)\|^2\big],$$

where we define $y_\lambda^*(x) := \arg\min_y \max_z L(x, y, z)$, $\tilde{h}_x^t := \mathbb{E}[h_x^t]$, and the client drift $\Delta_x^t := \frac{1}{n}\sum_{i=1}^n \frac{1}{\tau}\sum_{k=0}^{\tau-1} \mathbb{E}\|x_i^{t,k} - x^t\|^2$ ($\Delta_y^t$ and $\Delta_z^t$ are defined analogously).

The proof of Proposition 5.7 follows that of Theorem C.8 in the appendix. Proposition 5.7 shows that the gradient norm $\mathbb{E}\|\nabla L^*(x^t)\|^2$ can be effectively bounded by the gap of L*(·) at two consecutive iterates, the norm of the generalized gradient h_x^t, the client drifts Δ_x^t, Δ_y^t, Δ_z^t, and the distances of the global iterates z^t and y^t to their optimal solutions at each iteration t. Note that Proposition 5.7 also indicates an important trade-off in the selection of the Lagrange multiplier λ: a large λ guarantees a small gap between the surrogate L*(·) and the objective Φ(·) of the original problem, but also exacerbates the client drifts and the approximation errors of z^t and y^t. Thus, an appropriate choice of λ is crucial for good convergence and complexity, as shown later in our theorems. We next provide the following propositions to upper-bound the key quantities $\|h_x^t\|$, Δ_x^t, Δ_y^t, Δ_z^t, and $\mathbb{E}\|z^t - y^*(x^t)\|^2$, $\mathbb{E}\|y^t - y_\lambda^*(x^t)\|^2$ appearing in Proposition 5.7.

Proposition 5.8. Under Assumptions 5.3, 5.4 and 5.5, the client drift of y^t induced by its local updates satisfies

$$\Delta_y^t \le O(\eta_y^2\tau\lambda^2)(M_{se} + M_{cs}) + O(\eta_y^2\tau\lambda^2)\,\Delta_x^t + O(\eta_y^2\tau\lambda^2)\,\mathbb{E}\|y^t - y_\lambda^*(x^t)\|^2,$$

where the factors $M_{se} := \frac{1}{\lambda^2}\big(\sigma_f^2 + \lambda^2\sigma_g^2\big)$ and $M_{cs} := \frac{1}{\lambda^2}\big(L_{f,0}^2\big(1 + \frac{2L_{g,1}^2}{\mu^2}\big) + \lambda^2\sigma_{gh}^2\big)$ capture the variances from gradient estimation and client sampling, respectively.

The proof of Proposition 5.8 can be found in the proof of Theorem C.7 in the appendix. We can see from the above proposition that the client drift w.r.t. y can be bounded by three error terms. The first term, O(η_y²τλ²)(M_{se} + M_{cs}), is induced by the variances of the stochastic gradient estimators ∇f_i(x, y; ξ) and ∇g_i(x, y; ζ).
The second term, O(η_y²τλ²)Δ_x^t, correlates with the client drift induced by the local updates on x, because the local gradient estimator $\nabla_y L_i(x_i^{t,k}, y_i^{t,k}, z_i^{t,k}; \psi_i^{t,k}) = \nabla_y f_i(x_i^{t,k}, y_i^{t,k}; \psi_i^{t,k}) + \lambda \nabla_y g_i(x_i^{t,k}, y_i^{t,k}; \psi_i^{t,k})$ used to update $y_i^{t,k}$ also depends on the local iterate $x_i^{t,k}$. The third term, $O(\eta_y^2\tau\lambda^2)\,\mathbb{E}\|y^t - y_\lambda^*(x^t)\|^2$, is associated with the gap between the global iterate y^t and the corresponding optimal point $y_\lambda^*(x^t)$. Note that all three errors are scaled by the squared local stepsize η_y². Thus, by choosing properly small local stepsizes, we can mitigate the impact of client drift on the final convergence.

Proposition 5.9. Under Assumptions 5.3, 5.4 and 5.5, the iterates on y generated by Algorithm 1 satisfy

$$\mathbb{E}\|y^{t+1} - y_\lambda^*(x^{t+1})\|^2 \le \mathbb{E}\|y^t - y_\lambda^*(x^t)\|^2 - O(\gamma_y\lambda)\,\mathbb{E}\|y^t - y_\lambda^*(x^t)\|^2 + O(\gamma_y^2)\,\mathbb{E}\|h_y^t\|^2 + O(\gamma_x^2)\,\mathbb{E}\|h_x^t\|^2 + O\Big(\frac{\gamma_x^2}{\gamma_y\lambda}\Big)\mathbb{E}\|\tilde{h}_x^t\|^2 + O(\gamma_y\lambda)\big(\Delta_x^t + \Delta_y^t\big). \tag{10}$$

The proof of Proposition 5.9 follows that of Theorem C.10 in the appendix, and a similar result can be established for $\mathbb{E}\|z^t - y^*(x^t)\|^2$. It shows that when the stepsize γ_y is properly small, there is a descent in the optimality gap $\mathbb{E}\|y^t - y_\lambda^*(x^t)\|^2$, but the bound also contains several additional error terms involving the client drifts Δ_x^t, Δ_y^t, the norms of the generalized gradients h_y^t and h_x^t, and the expectation $\tilde{h}_x^t$. The client drifts can be controlled to be small by Proposition 5.8, and the gradient norm $\mathbb{E}\|h_y^t\|^2$ (similarly $\mathbb{E}\|h_x^t\|^2$) can be upper-bounded using the following proposition.

Proposition 5.10. Under Assumptions 5.4 and 5.5, the generalized stochastic gradient h_y^t on y satisfies

$$\mathbb{E}\|h_y^t\|^2 \le O(\lambda^2)\,\mathbb{E}\|y^t - y_\lambda^*(x^t)\|^2 + O(\lambda^2)\big(\Delta_x^t + \Delta_y^t\big) + O\Big(\frac{\lambda^2}{P\tau}\Big)M_{se} + O\Big(\frac{(n-P)\lambda^2}{P(n-1)}\Big)M_{cs}, \tag{11}$$

where M_{se} and M_{cs} are the same as in Proposition 5.8.

The proof of Proposition 5.10 follows that of Theorem C.9 in the appendix. Then, by incorporating the bound of Proposition 5.10 into Proposition 5.9, it can be seen that for a sufficiently small stepsize γ_y, the first error term in eq.
(11), after being scaled by γ_y², can be merged into the negative term $-O(\gamma_y\lambda)\,\mathbb{E}\|y^t - y_\lambda^*(x^t)\|^2$ in eq. (10) (and similarly for the other error terms). Also note that the last two error terms in eq. (11) are caused by data and client sampling, and decrease sublinearly w.r.t. the number τ of local steps and the number P of participating clients. This observation is key to establishing the linear speedup in the convergence and complexity of our algorithm, as shown in the following theorem. Based on the above propositions, we construct a Lyapunov function

$$\Psi(x^t) := L^*(x^t) + K_z\,\mathbb{E}\|z^t - y^*(x^t)\|^2 + K_y\,\mathbb{E}\|y^t - y_\lambda^*(x^t)\|^2$$

to prove the final convergence result of our algorithm, as shown below.

Theorem 5.11. Define Φ(x) = F(x, y*(x)) and suppose that Assumptions 5.3, 5.4, 5.5 and 5.6 hold. Under partial participation, the convergence rate of Algorithm 1 satisfies

$$\min_t \mathbb{E}\|\nabla\Phi(x^t)\|^2 = O\big(P^{-2/7}T^{-2/7}\big).$$

Under full participation, the convergence rate of Algorithm 1 satisfies

$$\min_t \mathbb{E}\|\nabla\Phi(x^t)\|^2 = O\big(P^{-2/3}T^{-2/3}\big).$$

In both cases, γ_x, γ_y, γ_z, λ, η_x, η_y, η_z in Algorithm 1 need to be chosen appropriately; their specific choices are given in eq. (47) and eq. (48) in the appendix. We can relax the requirement of P = n for full client participation to the milder condition $P \ge \min\big(\frac{(\tau-1)n}{\tau} + 1,\, n\big)$ and still achieve the same convergence rate.

A few remarks about the above result are in order. First, our method achieves a linear convergence speedup under both partial and full client participation, where clients are sampled without replacement in the partial-participation case. Second, under full client participation, the noise brought by client sampling no longer affects the convergence of our method, and in turn, increasing the number of local update steps helps to improve the convergence rate. Next, we analyze the communication and sample complexities of our proposed algorithm.

Corollary 5.12.
Under the setting of Theorem 5.11, we have the following results.

Partial participation: we can find an ϵ-accurate stationary point of Φ(x) after $T = O(P^{-1}\epsilon^{-3.5})$ global iterations, where the communication complexity (defined as the total number of communication rounds) is $O(P^{-1}\epsilon^{-3.5})$, and the overall sample complexity (defined as the total number of samples) is $P\tau T = O(\epsilon^{-3.5})$.

Full participation: our method finds an ϵ-accurate stationary point of Φ(x) after $T = O(\epsilon^{-1.5})$ global iterations, where the communication complexity is $O(\epsilon^{-1.5})$ and the overall sample complexity is $P\tau T = O(\epsilon^{-3.5})$.

Figure 2: Change of averaged weights on clean/noisy data with the number of global iterations (noise rate = 50%).

There are several remarks about Corollary 5.12. First, note that we choose the number τ of local steps differently in the partial and full client participation settings. This difference stems from the analysis of the last two error terms in eq. (11) of Proposition 5.10. For partial client participation, we set τ = O(1), because further increasing τ cannot improve the final convergence rate due to the dominant error $O\big(\frac{(n-P)\lambda^2}{P(n-1)}\big)M_{cs}$ induced by client sampling. In contrast, under full client participation this error vanishes, and hence increasing τ up to its maximum $\tau = O(P^{-1}T^{4/3})$ reduces the overall estimation variance and improves the final convergence rate. Second, differently from existing AID- and ITD-based FBO methods that compute a large number of Hessian- and Jacobian-vector products, our method uses only first-order gradient information, while achieving an improved communication complexity of $O(\epsilon^{-1.5})$ over the typical $O(\epsilon^{-2})$ result obtained by FedNest (Tarzanagh et al. 2022), AggITD (Xiao and Ji 2023) and FedMBO (Huang, Zhang, and Ji 2023).
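The client-sampling error factor (n − P)/(P(n − 1)) appearing in eq. (11) is the standard finite-population correction for sampling without replacement, and it can be checked numerically. The snippet below (with stand-in per-client values, purely illustrative) compares the empirical variance of a P-client average against this formula:

```python
import numpy as np

# Numerical check (illustrative) of the without-replacement sampling factor
# (n - P) / (P(n - 1)): the variance of the mean of P clients sampled without
# replacement equals Var * (n - P) / (P * (n - 1)), where Var is the population
# variance of the per-client values.
rng = np.random.default_rng(0)
vals = rng.normal(size=20)      # stand-in per-client quantities
n, P = len(vals), 5
pop_var = vals.var()            # population variance (ddof = 0)

est = np.array([rng.choice(vals, size=P, replace=False).mean()
                for _ in range(50000)])
predicted = pop_var * (n - P) / (P * (n - 1))
print(est.var(), predicted)     # the two values should agree closely
```

The factor vanishes at P = n, which is exactly why the dominant client-sampling error disappears under full participation.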
6 Experiments

We first conduct an experiment on federated data cleaning to compare our proposed MemFBO algorithm with multiple benchmark personalized federated learning algorithms, including FedRep (Collins et al. 2021), FedPer (Arivazhagan et al. 2019) and LG-Fed (Liang et al. 2020), as well as the popular federated learning methods FedAvg (McMahan et al. 2017) and FedProx (Li et al. 2020). We then perform a federated hyper-representation experiment to compare our method with other FBO algorithms, including FedNest (Tarzanagh et al. 2022), LFedNest (Tarzanagh et al. 2022), AggITD (Xiao and Ji 2023) and SimFBO (Yang, Xiao, and Ji 2023b). The former experiment tests the robustness of the different methods on the CIFAR-10 dataset with CNN backbones, following the same experimental setup as (Collins et al. 2021); the latter tests communication and memory efficiency on MNIST with MLP backbones, following the setup of (Tarzanagh et al. 2022). More details of all experiments can be found in Appendix A.

6.1 Robust Federated Data Cleaning

We consider a federated learning setting where edge devices hold two data sets: a clean, high-quality set and a noisy, majority set.

| Method | Noise rate 0% | 30% | 50% | 70% |
|---|---|---|---|---|
| FedRep | 86.01 | 70.56 | 61.68 | 56.73 |
| FedPer | 88.86 | 74.11 | 71.04 | 57.19 |
| LG-Fed | 88.93 | 76.27 | 57.37 | 49.72 |
| FedAvg | 44.86 | 22.52 | 16.74 | 10.34 |
| FedProx | 46.30 | 23.12 | 17.53 | 11.93 |
| SimFBO | 71.06 | 65.71 | 63.27 | 60.44 |
| MemFBO (ours) | 81.33 | 78.21 | 77.94 | 73.60 |

Table 1: Average test accuracy (%) comparison of different (personalized) federated algorithms with heterogeneous data on the CIFAR-10 dataset under different noise levels.

To develop a robust FL model, we propose a novel FBO problem where the lower level trains the model on the noisy data with learnable sample weights, and the upper level evaluates the model on the clean data to refine these weights. This FBO problem can be formulated as

$$\min_{w\in\mathbb{R}^{m\times n}} \Phi(w) = \frac{1}{n}\sum_{i=1}^n f_i\big(y^*(w)\big) \quad \text{s.t.}$$
$$y^*(w) := \arg\min_y \frac{1}{nm}\sum_{i=1}^n\sum_{j=1}^m w_{i,j}\, g_{i,j}(y),$$

where w contains the weights of the data samples of all clients, i is the client index and j is the sample index within a client, g_{i,j}(·) denotes the loss on the jth sample from the noisy data set of client i, and f_i(·) denotes the loss on the clean data of client i. Following our first-order approach, the local Lagrangian function takes the form

$$L_i(w_i, y_i, z_i) := f_i(y_i) + \lambda \sum_{j=1}^m w_{i,j}\big(g_{i,j}(y_i) - g_{i,j}(z_i)\big).$$

Experimental setting. For the baselines, we use 100 clients, with the CIFAR-10 dataset split into 500 training and 100 test samples per client. To simulate noise, the 500 training samples are divided into 450 noisy samples and 50 clean samples. A subset of the 450 samples is corrupted according to a noise rate (the proportion of corrupted data) and a flip rate. In this experiment, the noise rate is set to 0%, 30%, 50% and 70%. The flip rate determines the portion of sample labels that are reassigned to one of all labels uniformly at random; we fix the flip rate at 80%. Finally, we set up the data heterogeneity following the same procedure as FedRep (Collins et al. 2021), where each client is assigned 2 classes.

Results. The test accuracies of the different (personalized) federated learning methods under different noise levels are reported in Table 1. The baseline methods suffer severe performance degradation, while the two FBO methods are much more robust to data corruption; among them, the performance of SimFBO is notably lower than that of MemFBO. To better understand this robustness, we visualize the averaged weights learned by our FBO method on the noisy and clean data of a single client in Figure 2.

Figure 3: Test accuracy vs. the number of communication rounds on non-i.i.d. MNIST with MLP networks, for our MemFBO and related baselines.

Figure 4: Memory cost comparison between SimFBO and MemFBO.
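A hypothetical single-client sketch of this data-cleaning Lagrangian (a scalar model, squared losses, and one corrupted noisy target; the data values, stepsizes, and the simplex normalization of the weights are all illustrative assumptions, not the paper's exact setup) shows the weight of the corrupted sample being driven down:

```python
import numpy as np

# Hypothetical single-client sketch of the data-cleaning Lagrangian L_i:
# scalar model y, squared per-sample losses, one corrupted noisy target.
# All values below are illustrative assumptions, not the paper's setup.
t_noisy = np.array([1.0, 1.1, 0.9, 5.0])   # noisy set; the last target is corrupted
t_clean = 1.0                               # clean (validation) target
lam, m = 10.0, len(t_noisy)

def g(y):
    return 0.5 * (y - t_noisy) ** 2         # per-sample lower-level losses g_{i,j}

w = np.full(m, 1.0 / m)                     # sample weights (upper-level variable)
y, z = 0.0, 0.0
eta_w, eta_y, eta_z = 0.01, 0.02, 0.1
for _ in range(500):
    z -= eta_z * w @ (z - t_noisy)          # z: minimize the weighted noisy loss
    y -= eta_y * ((y - t_clean) + lam * w @ (y - t_noisy))   # y: descend L_i in y
    w -= eta_w * lam * (g(y) - g(z))        # w: descend L_i in the weights
    w = np.clip(w, 0.0, None); w /= w.sum() # keep weights on the simplex (illustrative)

print(w)  # the weight of the corrupted sample (last entry) is driven toward zero
```

The mechanism is visible in the weight update: a sample whose loss is higher at the clean-regularized model y than at the purely weighted-fit model z receives a positive gradient and loses weight, which is what happens to the corrupted target.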
Figure 2 clearly shows that the weights associated with the noisy data decrease dramatically, whereas the weights on the clean data increase slightly.

6.2 Federated Hyper-Representation

In this section, we compare the performance of the proposed MemFBO method with other FBO baselines on a hyper-representation problem. We present the comparison results in terms of test accuracy, communication rounds, and memory cost in Figure 3 and Figure 4. From Figure 3, it can be seen that our method achieves a faster convergence rate and higher accuracy than the second-order methods with sub-loops, including FedNest, LFedNest, and AggITD, while achieving performance comparable to the fully single-loop second-order method SimFBO. Moreover, Figure 4 shows that the memory cost of SimFBO increases dramatically with the number of local steps, whereas our MemFBO consistently maintains low memory usage.

7 Conclusion

We propose MemFBO, a memory- and computation-efficient federated bilevel algorithm with theoretical convergence guarantees under milder assumptions. Experiments validate its efficiency. Promising future directions include extensions to decentralized bilevel optimization and applications in hyperparameter tuning, meta-learning, and edge computing.

Acknowledgements

Yifan Yang, Peiyao Xiao and Kaiyi Ji were partially supported by NSF grants CCF-2311274 and ECCS-2326592. Shiqian Ma was partially supported by ONR grant N00014-24-1-2705 and NSF grants CCF-2311275 and ECCS-2326591.

References

Arbel, M.; and Mairal, J. 2021. Amortized implicit differentiation for stochastic bilevel optimization. arXiv preprint arXiv:2111.14580.
Arivazhagan, M. G.; Aggarwal, V.; Singh, A. K.; and Choudhary, S. 2019. Federated learning with personalization layers. arXiv preprint arXiv:1912.00818.
Bracken, J.; and McGill, J. T. 1973. Mathematical programs with optimization problems in the constraints. Operations Research, 21(1): 37–44.
Chen, L.; Ma, Y.; and Zhang, J. 2023.
Near-Optimal Fully First-Order Algorithms for Finding Stationary Points in Bilevel Optimization. arXiv preprint arXiv:2306.14853.
Chen, T.; Sun, Y.; and Yin, W. 2021. A single-timescale stochastic bilevel optimization method. arXiv preprint arXiv:2102.04671.
Chen, X.; Huang, M.; and Ma, S. 2022. Decentralized bilevel optimization. arXiv preprint arXiv:2206.05670.
Chen, X.; Huang, M.; Ma, S.; and Balasubramanian, K. 2023. Decentralized stochastic bilevel optimization with improved per-iteration complexity. In International Conference on Machine Learning, 4641–4671. PMLR.
Collins, L.; Hassani, H.; Mokhtari, A.; and Shakkottai, S. 2021. Exploiting Shared Representations for Personalized Federated Learning. arXiv preprint arXiv:2102.07078.
Dagréou, M.; Ablin, P.; Vaiter, S.; and Moreau, T. 2022. A framework for bilevel optimization that enables stochastic and global variance reduction algorithms. arXiv preprint arXiv:2201.13409.
Domke, J. 2012. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, 318–326. PMLR.
Dong, Y.; Ma, S.; Yang, J.; and Yin, C. 2023. A Single-Loop Algorithm for Decentralized Bilevel Optimization. arXiv preprint arXiv:2311.08945.
Fallah, A.; Mokhtari, A.; and Ozdaglar, A. 2020. Personalized federated learning: A meta-learning approach. arXiv preprint arXiv:2002.07948.
Feurer, M.; and Hutter, F. 2019. Hyperparameter optimization. Automated Machine Learning: Methods, Systems, Challenges, 3–33.
Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 1126–1135. PMLR.
Franceschi, L.; Donini, M.; Frasconi, P.; and Pontil, M. 2017. Forward and reverse gradient-based hyperparameter optimization. In International Conference on Machine Learning, 1165–1173. PMLR.
Franceschi, L.; Frasconi, P.; Salzo, S.; Grazzi, R.; and Pontil, M. 2018.
Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, 1568–1577. PMLR.
Gao, H. 2022. On the convergence of momentum-based algorithms for federated stochastic bilevel optimization problems. arXiv preprint arXiv:2204.13299.
Ghadimi, S.; and Wang, M. 2018. Approximation methods for bilevel programming. arXiv preprint arXiv:1802.02246.
Gould, S.; Fernando, B.; Cherian, A.; Anderson, P.; Cruz, R. S.; and Guo, E. 2016. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. arXiv preprint arXiv:1607.05447.
Grazzi, R.; Franceschi, L.; Pontil, M.; and Salzo, S. 2020. On the iteration complexity of hypergradient computation. In International Conference on Machine Learning, 3748–3758. PMLR.
Guo, Z.; Hu, Q.; Zhang, L.; and Yang, T. 2021. Randomized stochastic variance-reduced methods for multi-task stochastic bilevel optimization. arXiv preprint arXiv:2105.02266.
Hansen, P.; Jaumard, B.; and Savard, G. 1992. New branch-and-bound rules for linear bilevel programming. SIAM Journal on Scientific and Statistical Computing, 13(5): 1194–1217.
Hao, J.; Ji, K.; and Liu, M. 2023. Bilevel Coreset Selection in Continual Learning: A New Formulation and Algorithm. In Thirty-seventh Conference on Neural Information Processing Systems.
Hong, M.; Wai, H.-T.; Wang, Z.; and Yang, Z. 2023. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic. SIAM Journal on Optimization, 33(1): 147–180.
Huang, F. 2022. Fast Adaptive Federated Bilevel Optimization. arXiv preprint arXiv:2211.01122.
Huang, F.; and Huang, H. 2021. BiAdam: Fast adaptive bilevel optimization methods. arXiv preprint arXiv:2106.11396.
Huang, M.; Zhang, D.; and Ji, K. 2023. Achieving Linear Speedup in Non-IID Federated Bilevel Learning. arXiv preprint arXiv:2302.05412.
Huang, Y.; Lin, Q.; Street, N.; and Baek, S. 2022.
Federated learning on adaptively weighted nodes by bilevel optimization. arXiv preprint arXiv:2207.10751.
Ji, K.; Yang, J.; and Liang, Y. 2021. Bilevel optimization: Convergence analysis and enhanced design. In International Conference on Machine Learning, 4882–4892. PMLR.
Ji, K.; and Ying, L. 2023. Network Utility Maximization with Unknown Utility Functions: A Distributed, Data-Driven Bilevel Optimization Approach. arXiv preprint arXiv:2301.01801.
Kwon, J.; Kwon, D.; Wright, S.; and Nowak, R. D. 2023. A fully first-order method for stochastic bilevel optimization. In International Conference on Machine Learning, 18083–18113. PMLR.
Li, J.; Gu, B.; and Huang, H. 2020. Improved bilevel model: Fast and optimal algorithm with theoretical guarantee. arXiv preprint arXiv:2009.00690.
Li, J.; Huang, F.; and Huang, H. 2022. Local stochastic bilevel optimization with momentum-based variance reduction. arXiv preprint arXiv:2205.01608.
Li, J.; Huang, F.; and Huang, H. 2023. Communication-Efficient Federated Bilevel Optimization with Local and Global Lower Level Problems. arXiv preprint arXiv:2302.06701.
Li, T.; Sahu, A. K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; and Smith, V. 2020. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2: 429–450.
Liang, P. P.; Liu, T.; Ziyin, L.; Allen, N. B.; Auerbach, R. P.; Brent, D.; Salakhutdinov, R.; and Morency, L.-P. 2020. Think locally, act globally: Federated learning with local and global representations. arXiv preprint arXiv:2001.01523.
Liao, R.; Xiong, Y.; Fetaya, E.; Zhang, L.; Yoon, K.; Pitkow, X.; Urtasun, R.; and Zemel, R. 2018. Reviving and improving recurrent back-propagation. In International Conference on Machine Learning, 3082–3091. PMLR.
Lin, G.-H.; Xu, M.; and Ye, J. J. 2014. On solving simple bilevel programs with a nonconvex lower level program. Mathematical Programming, 144(1-2): 277–305.
Liu, B.; Ye, M.; Wright, S.; Stone, P.; and Liu, Q. 2022. BOME!
Bilevel optimization made easy: A simple first-order approach. Advances in Neural Information Processing Systems, 35: 17248–17262.
Liu, R.; Liu, X.; Yuan, X.; Zeng, S.; and Zhang, J. 2021a. A value-function-based interior-point method for non-convex bi-level optimization. In International Conference on Machine Learning, 6882–6892. PMLR.
Liu, R.; Liu, Y.; Zeng, S.; and Zhang, J. 2021b. Towards gradient-based bilevel optimization with non-convex followers and beyond. Advances in Neural Information Processing Systems, 34: 8662–8675.
Liu, R.; Mu, P.; Yuan, X.; Zeng, S.; and Zhang, J. 2020. A generic first-order algorithmic framework for bi-level programming beyond lower-level singleton. In International Conference on Machine Learning, 6305–6315. PMLR.
Lorraine, J.; Vicol, P.; and Duvenaud, D. 2020. Optimizing millions of hyperparameters by implicit differentiation. In International Conference on Artificial Intelligence and Statistics, 1540–1552. PMLR.
Lu, S.; Cui, X.; Squillante, M. S.; Kingsbury, B.; and Horesh, L. 2022. Decentralized bilevel optimization for personalized client learning. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5543–5547. IEEE.
McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; and y Arcas, B. A. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, 1273–1282. PMLR.
Nesterov, Y.; and Polyak, B. T. 2006. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1): 177–205.
Pedregosa, F. 2016. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, 737–746. PMLR.
Rajeswaran, A.; Finn, C.; Kakade, S. M.; and Levine, S. 2019. Meta-learning with implicit gradients. Advances in Neural Information Processing Systems, 32.
Roh, Y.; Lee, K.; Whang, S. E.; and Suh, C. 2021. FairBatch: Batch Selection for Model Fairness.
In International Conference on Learning Representations.
Sabach, S.; and Shtern, S. 2017. A first order method for solving convex bilevel optimization problems. SIAM Journal on Optimization, 27(2): 640–660.
Shaban, A.; Cheng, C.-A.; Hatch, N.; and Boots, B. 2019. Truncated back-propagation for bilevel optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, 1723–1732. PMLR.
Shen, H.; and Chen, T. 2023. On Penalty-based Bilevel Gradient Descent Method. arXiv preprint arXiv:2302.05185.
Shi, C.; Lu, J.; and Zhang, G. 2005. An extended Kuhn-Tucker approach for linear bilevel programming. Applied Mathematics and Computation, 162(1): 51–63.
Sinha, A.; Malo, P.; and Deb, K. 2017. A review on bilevel optimization: From classical to evolutionary approaches and applications. IEEE Transactions on Evolutionary Computation, 22(2): 276–295.
Sow, D.; Ji, K.; Guan, Z.; and Liang, Y. 2022. A constrained optimization approach to bilevel optimization with multiple inner minima. arXiv preprint arXiv:2203.01123.
Tarzanagh, D. A.; Li, M.; Thrampoulidis, C.; and Oymak, S. 2022. FedNest: Federated bilevel, minimax, and compositional optimization. In International Conference on Machine Learning, 21146–21179. PMLR.
Wang, X.; Pan, R.; Pi, R.; and Zhang, T. 2023. Effective Bilevel Optimization via Minimax Reformulation. arXiv preprint arXiv:2305.13153.
Xiao, P.; and Ji, K. 2023. Communication-Efficient Federated Hypergradient Computation via Aggregated Iterative Differentiation. arXiv preprint arXiv:2302.04969.
Xing, P.; Lu, S.; Wu, L.; and Yu, H. 2022. BiG-Fed: Bilevel Optimization Enhanced Graph-Aided Federated Learning. IEEE Transactions on Big Data.
Yang, J.; Ji, K.; and Liang, Y. 2021. Provably faster algorithms for bilevel optimization. Advances in Neural Information Processing Systems, 34: 13670–13682.
Yang, S.; Zhang, X.; and Wang, M. 2022. Decentralized gossip-based stochastic bilevel optimization over communication networks.
arXiv preprint arXiv:2206.10870.
Yang, Y.; Xiao, P.; and Ji, K. 2023a. Achieving $O(\epsilon^{-1.5})$ Complexity in Hessian/Jacobian-free Stochastic Bilevel Optimization. arXiv preprint arXiv:2312.03807.
Yang, Y.; Xiao, P.; and Ji, K. 2023b. SimFBO: Towards Simple, Flexible and Communication-efficient Federated Bilevel Learning. arXiv preprint arXiv:2305.19442.
Yousefian, F. 2021. Bilevel distributed optimization in directed networks. In 2021 American Control Conference (ACC), 2230–2235. IEEE.