# byzantinerobust_and_hessianfree_federated_bilevel_optimization__d4306bb1.pdf

Published in Transactions on Machine Learning Research (09/2025)

Byzantine-Robust and Hessian-Free Federated Bilevel Optimization

Shruti Maralappanavar mshruti32@gmail.com Department of Electrical, Electronics and Communication Engineering Indian Institute of Technology Dharwad Dharwad, India

B. N. Bharath bharathbn@iitdh.ac.in Department of Electrical, Electronics and Communication Engineering Indian Institute of Technology Dharwad Dharwad, India

Reviewed on Open Review: https: // openreview. net/ pdf? id= 5trmyvtkeo

In the last few years, Byzantine robust algorithms to solve a minimization problem in the Federated setup have received significant attention. Most of the existing works consider the problem of byzantine-robustness for single-level optimization or consider the federated bilevel optimization without Byzantine nodes. However, problem formulation such as federated bilevel optimization in the presence of byzantine nodes is unexplored. Recognizing the gap, for the first time, we propose a computationally efficient and robust algorithm for solving Federated Bilevel Optimization with Byzantine (Fed BOB) nodes that: ①Work under the assumption that the data across nodes are heterogeneous (non-iid), ②Consider the lowerlevel objective is non-convex and satisfies the Polyak-Łojasiewicz (PL)-inequality, and ③ Is fully first-order and does not rely on second order information. We achieve this by reformulating the federated bilevel problem into a single penalty problem. We provide the theoretical performance of the proposed algorithm and experimentally corroborate our theoretical findings.

1 Introduction

Deep Learning thrives on large datasets that are often distributed across multiple owners (Verbraeken et al., 2020). The challenge of training the model on distributed data sets was addressed using a paradigm called Federated Learning (FL), which enables clients (agents) to train models locally on private data while a central server (aggregator) combines them into a unified model (Mc Mahan et al., 2017; Smith et al., 2017). Although FL offers several benefits, it comes with certain risks. Since it is a collaborative mechanism, it opens the possibility of security threats (Karimireddy et al., 2020; Gorbunov et al., 2022). In particular, a few nodes in the FL setting can potentially be malicious, also known as Byzantine nodes, and send corrupt information which renders the updates at the central server useless. Recently, the question of byzantine robustness has received significant attention making it relatively well studied both in theory and practice for single-level minimization problems in the FL setting (Yin et al., 2018; Blanchard et al., 2017b; Chen et al., 2017; Karimireddy et al., 2021; 2020; Rammal et al., 2024). There are many real-world applications that cannot be modeled as single level minimization problems, e.g. robust learning Zhang et al. (2022); Khanduri et al. (2023), meta-learning Rajeswaran et al. (2019), hyperparameter optimization Franceschi et al. (2018), neural architecture search Xue et al. (2021), resource management Sun et al. (2021), and image denoising Crockett et al. (2022) and other problems encountered in machine learning and signal processing tasks Zhang et al. (2024), to name a few. Such problems follow a nested structure that goes beyond the scope of a standard single-level minimization structure, as in Karimireddy et al. (2021; 2020); Rammal et al.

Published in Transactions on Machine Learning Research (09/2025)

(2024). Towards solving such nested problems, bilevel optimization in non-FL as well as the FL setting has received great attention in the recent past Ghadimi & Wang (2018); Hong et al. (2023); Khanduri et al. (2021); Tarzanagh et al. (2022); Huang et al. (2023); Yang et al. (2024b). But these works cannot handle the presence of Byzantine clients. Recently, the authors in Abbas et al. (2024) proposed a byzantine-resilient bilevel federated optimization algorithm. However, they ①use second order information to update its model parameters making it computationally expensive, and ②make restrictive assumption such as strongly-convex lower level function.

A typical approach to solving the bilevel problems is to compute the gradient of the upper-level function (called hypergradient). To find the hypergradient there is a need to compute the second-order information of the lower-level function that makes the algorithm computationally complex (Ghadimi & Wang, 2018; Chen et al., 2021). This issue was resolved by using the first order method involving penalty formulation (Liu et al., 2022a; Kwon et al., 2023b; Shen & Chen, 2023; Chen et al., 2023) of the bilevel problem; albeit in the non-FL setting. However, none of the existing work considers a computationally efficient robust algorithm to solve a bilevel optimization problem in the FL setting leading to the following question.

Q: Is it possible to design a Robust and Low Complexity algorithm for solving a bilevel optimization problem in the FL setting in the presence of Byzantine nodes?

In this paper, we answer the above question by considering a federated version of the bilevel optimization problem in the presence of Byzantine nodes. We assume that there are N nodes denoted by N = G S B, where G is the set of good or legitimate nodes, and B is the set of Byzantine or bad nodes. Let |G| = G and |B| = B. The problem of Federated Bilevel Optimization with Byzantine (Fed BOB Problem) involves solving the following:

min x Rd1 f(x) := f (x, y (x)) := 1

k G fk (x, y (x))

arg min y Rd2 g(x, y) := 1

k G gk (x, y)

where fk : Rd1 Rd2 R and gk : Rd1 Rd2 R, k G are upper and lower level objective functions, respectively. The standard approach to solving the above problem involves computing second-order information such as Hessian, which is computationally expensive. In this paper, we take the first order approach by minimizing a penalty function defined as minx,y {hλ (x, y) := f(x) + λp(x, y)}, where λ > 0

and p(x, y) := g(x, y) miny g x, y (Shen & Chen, 2023; Liu et al., 2022a; Kwon et al., 2023b;a). Next,

we present challenges that we need to address in comparison with the existing work (Karimireddy et al., 2020; Shen & Chen, 2023; Blanchard et al., 2017a; Pillutla et al., 2022a).

Challenges: The penalty based method in the non-FL setting is very well studied (Shen & Chen, 2023). This work relies on the equivalence between a local solution to the penalty formulation (i.e., the point at which the gradient is zero) and a solution to the original problem. Unfortunately, the FL setting with heterogeneous and Byzantine nodes forces the solution to have non-zero gradients (see lower bounds in (Karimireddy et al., 2020)), and hence the equivalence in (Shen & Chen, 2023) cannot be directly used. We take a different approach where we analyze (i) convergence of the gradient, as in (Shen & Chen, 2023) and (b) a new notion called constraint violation that measures the average violation of the constraint in the lower-level problem, which is new. Keeping track of the average constraint violation is particularly important in settings such as adversarial learning (Khanduri et al., 2023). The average constraint violation gives information on how frequently the constraint is violated rather than capturing it implicitly in the convergence rate, as in (Shen

& Chen, 2023). In addition, minimizing the penalty function involves computing miny g x, y in a FL fashion in the presence of Byzantine. To make matters more challenging, the resulting lower-level optimum drifts away from y (x) due to heterogeneity in the data and is further alleviated due to the presence of Byzantine nodes, which makes the analysis difficult.

Contribution: In this paper, we address the above challenges and make the following contribution. ①We consider the Fed BOB problem in equation 1, and propose a robust and fully first-order federated al-

Published in Transactions on Machine Learning Research (09/2025)

gorithm by reformulating the federated bilevel problem into a single-level penalty-based optimization problem that makes it computationally efficient. Unlike the existing work on federated bilevel optimization (Tarzanagh et al., 2022; Huang et al., 2023; Yang et al., 2024b) where the lower level is strongly convex, we consider a class of non-convex functions satisfying the PL inequality, thus widening the scope of applicability.

②The proposed algorithm (Rob-Fed BOB) is shown to converge at a rate of O λ2

R + α(ζ2 f + λ2ζ2 g) , where R is the number of communication rounds, ζf and ζg are the inter-client heterogeneity terms corresponding to the upper and lower level objective functions, respectively, and α [0, 0.5] is the fraction of the Byzantine nodes. Additionally, the algorithm results in an average constraint violation that scales as

O (1+cαζ2 f ) λ2 + cαζ2 g . Our results demonstrate the trade-off between the convergence and the constraint violation that is captured through λ. Higher λ results in better constraint violation properties, but at the expense of lower convergence rate and vice versa. Furthermore, our bounds reveal that Byzantine nodes and heterogeneity act as a bottleneck in achieving good performance, as expected. We study the trade-off between convergence and the constraint violation, and provide insight on the choice of λ.

③Finally, we present experimental results for data hyperclearning application for various attacks and corroborate our theoretical findings. In particular, we consider the following attacks (i) Bit Flipping (BF), (ii) Label Flipping (LF), (iii) Inner Product Manipulation (IPM), and (iv) A Little is Enough (ALIE), and show that both gradient and constraint violation go down and converge to a constant with increasing number of communication rounds. The constants to which the gradient and the average violation converge are governed by the number of Byzantine nodes, and the inter-client heterogeneity of the good nodes.

1.1 Related Work

Robust Federated Learning: Over the recent years there has been a significant amount of work on byzantine robustness in case of single level optimization problems (Karimireddy et al., 2020; Gorbunov et al., 2022). Byzantine robustness is very well studied when the nodes have iid data distributions. One approach to obtaining robustness is to use robust aggregation strategies such as Coordinate wise median (Chen et al., 2017), KRUM (Blanchard et al., 2017a), geometric median (Pillutla et al., 2022a), use variance reduction techniques (Wu et al., 2020), filter byzantine nodes. Recently, there have been many works which consider byzantine robustness for non-iid data distributions such as: outlier based-robust clustering (Sattler et al., 2020), spectral methods for robust optimization (Data & Diggavi, 2021). Other interesting approaches include use of (a) a formal definition of robust aggregation with client momentum (Karimireddy et al., 2020), (b) random checks of computations (Rammal et al., 2024), (c) normalized gradient (Zuo et al., 2024), and (d) normalized momentum (Yang et al., 2024a). These works are limited to solving single level federated learning problem in the presence of Byzantine nodes.

Federated Bilevel Optimization (FBO): Recently some works have considered Federated Bilevel optimization as many machine learning applications can be formulated as a nested bilevel problem. The authors in (Tarzanagh et al., 2022) proposed a federated alternating stochastic gradient method that requires a federated hypergradient computation. They use variance reduction to handle lower-level heterogeneity. The results on the complexity of the sample were improved using momentum-based federated bilevel algorithms with a reduction in variance in (Li et al., 2022). Later, the work in (Huang et al., 2023) achieved linear speedup in the presence of non-iid data by using a novel client sampling scheme. Along similar lines, the authors in (Yang et al., 2024b) propose a communication efficient federated bilevel optimization algorithm. However, most of these works require the second-order information to perform the gradient update. Further, these works limit the lower-level objective function to strongly convex. Thus, the challenging problem of finding a computationally efficient algorithm that is robust to byzantine attacks in a federated bilevel setting is still an open problem. In this work, we have closed this gap. Concurrent to our work, the authors in (Abbas et al., 2024) proposed a byzantine-resilient bilevel federated optimization algorithm. They use the second order information such as the Hessians/Jacobians making it computationally expensive. In addition, we have included more experimental results compared to (Abbas et al., 2024).

Published in Transactions on Machine Learning Research (09/2025)

2 Problem Statement

We consider a federated version of the bilevel optimization problem in the presence of Byzantine nodes as in equation 1. The problem in equation 1 cannot be solved directly in the presence of Byzantine nodes. This is due to the fact that the Byzanine nodes can share corrupt information with the central server, which can make the updates potentially useless. The problem of Byzantine robust distributed optimization in various settings have been studied earlier (Blanchard et al., 2017a; Pillutla et al., 2022a; Wu et al., 2020; He et al., 2020). Clearly, as explained in the previous section, the problem of federated bilevel optimization with Byzantine is not at all addressed in the literature as it posses several challenges: ①The typical solution for the bilevel problem without Byzantine involves using second order methods thus making it computationally expensive (Ghadimi & Wang, 2018; Chen et al., 2021). ②Most of these works assume that the lower level function gk (x, y) for all k G is strongly convex making it more restrictive. ③The lower level optimum drifts away from y (x) due to heterogeneity in the data, and is further aggravated due to the presence of Byzantine nodes. We handle the above challenges by ①proposing a fully first order method of solving Fed BOB Problem, and ②using robust aggregation strategies that is resilient to the Byzantine attacks. Our method is motivated by reformulating the Fed BOB Problem into an approximate equivalent form as follows:

Approx-Fed BOB Problem: min x f (x, y)

such that p(x, y) := g (x, y) v(x) ϵ, (2)

where v(x) := miny g (x, y) and ϵ > 0. Letting ϵ = 0 in the above problem makes it equivalent to that of the Fed BOB Problem. Writing the Lagrangian function of the above problem results in

hλ (x, y) := 1

k=1 hλ,k (x, y) := f (x, y) + λp (x, y) (3)

for all x Rd1, y Rd2. In the above, λ > 0 and p (x, y) is the penalty term. Now, the dual problem is to solve the following in the presence of Byzantine nodes

Fed BOB Penalty Problem: min x,y hλ (x, y) . (4)

It turns out that solving the above problem is approximately equivalent to solving the original problem (see (Shen & Chen, 2023) for more details). Note that this boils down to a single level federated learning problem in the presence of Byzantine. Although single level federated optimization in the presence of Byzantine (see (Karimireddy et al., 2020; Gorbunov et al., 2022; Pillutla et al., 2022a; Wu et al., 2020)) has been studied in the literature, our framework is completely different in the following sense ①the impact of the Byzantine nodes and λ on the performance need to be studied, and does not follow directly from the existing literature. ②The solution obtained by our proposed algorithm should not only result in a good stationary point of equation 4 but also should exhibit good constraint violation properties this additional requirement is not explicit in the single level formulation. Proposing a low complexity robust algorithm and analyzing its performance is completely new.

2.1 Preliminaries

In this subsection, we discuss the assumptions and definitions required in the analysis of the proposed Robust Federated Bilevel Optimization with Byzantine (Rob-Fed BOB) algorithm. Assumption 1. (Smoothness): The upper-level objective function fk (x, y) is assumed to be Lf,k smooth for all good nodes k G, i.e.,

fk (x, y) fk (x, y ) 2 Lf,k y y 2

for all x, x Rd1, y, y Rd2. Further, we also assume that fk (x, y) is a lf,k Lipschitz, i.e.,

| fk (x, y) fk (x, y ) | lf,k y y 2

for all x, x Rd1, y, y Rd2 and k G. In addition, gk (x, ) is a Lg,k-smooth and lg,k lipschitz function for all k G.

Published in Transactions on Machine Learning Research (09/2025)

Assumption 2. (PL inequality): The lower-level objective function gk (x, y) for all k G satisfies the PL inequality, i.e., gk (x, y) 2 µg (gk (x, y) vk(x)) for some µg > 0 and for all x Rd1, y1, y2 Rd2. Here y k(x) miny gk(x, y). In addition, we consider the average lower level objective g (x, y) satisfies the PL inequality, i.e., g (x, y) 2 µg (g (x, y) v(x)) for some µg > 0 for all x Rd1, y Rd2.

Assumption 3. (Inter-client heterogeneity): The upper-level objective function fk (x, y) for all k G is said to satisfy inter-client heterogeneity, i.e., Ek G fk (x, y) f (x, y) 2 ζ2 f for some ζf > 0 and for all x Rd1, y Rd2. In addition, gk (x, y) also satisfies inter-client heterogeneity for some ζg > 0, i.e., Ek G gk (x, y) g (x, y) 2 ζ2 g.

Most of the assumptions above such as smoothness, inter-client heterogeneity are standard (Shen & Chen, 2023; Karimireddy et al., 2020; Rammal et al., 2024). The PL inequality is satisfied by most of the overparameterized neural networks (Liu et al., 2022b; Shen & Chen, 2023). The inter-client heterogeneity (Karimireddy et al., 2020; Gorbunov et al., 2022; Yu et al., 2018) restricts the data heterogeneity of good nodes. In fact, the lower bound in (Karimireddy et al., 2020) shows that the heterogeneity condition is inevitable. Note that ζf = 0 and ζg = 0 case imply that the data are homogeneous across all nodes.

3 Algorithm Design

It is well known that the aggregator plays an important role in the performance of any algorithm with Byzantines. Hence, we first define the class of aggregators, which is adapted from (Karimireddy et al., 2020).

Definition 1. ((α, c)-Robust Aggregator): Let us assume that we are given inputs {x1, x2, . . . , x N} such that there exists a subset G [N] of size | G |= G (1 α) N for α 0.5 such that E xi xj 2 δ2 for all i, j G and some δ 0. Then we say that ˆx satisfies (α, c)-Robust Aggregator (RAgg) if

E ˆx x 2 cαδ2, (5)

where x = 1 G PG k=1 xk. Here the expectation is taken with respect to possible randomness of good nodes {x1, x2, . . . , x G}.

Note that (α, c)-Robust Aggregator property is satisfied by many aggregators such as KRUM, Coordinatewise median (CM), Geometric median (RFA). We provide more details in Appendix H.

3.1 Rob-Fed BOB Algorithm

Given that the penalty reformulation is a single level problem, an obvious approach is to use the following gradient updates: xr+1 = x β xhλ (xr, yr) and yr+1 = y β yhλ (xr, yr). However, this computation involves finding p (xr, yr) which requires v (xr). In general, v (x) need not to smooth always. Further, xv(x) = xg (x, y ) at y . From assumptions 2 and 1, using Lemma A.5 of Nouiehed et al. (2019), we can see that v(x) = xg (x, y (x)). In addition to this, the gradient update requires y (x), which is a solution to the lower level problem with respect to y, and is unknown. More specifically, computing y (x) requires access to g(x, y) (see equation 1), which is not available at each node. One way to handle this is that each node k runs T number of GD steps on the lower level function gk (xr, y) for a given xr as follows (see line 9 of Algorithm 1)

yr,t+1 = yr,t γ ygag (xr, yr,t) ,

where yr,0 = yr 1 and t = 0, 1, . . . , T 1. Here,

ygag (xr, yr) = RAgg ( ygk (xr, yr) , k [N]) , (6)

where RAgg uses bucketing followed by geometric median aggregation (see (Karimireddy et al., 2020) for more details). Now, we can use yk,T as a proxy for y (xr) to get the following updates (see line 13 of

Published in Transactions on Machine Learning Research (09/2025)

Algorithm 1):

yr+1 = yr η yhλ,ag (xr, yr) , and (7)

xr+1 = xr β xˆhλ,ag (xr, yr) , (8)

where the gradients are defined as

yhλ,ag (xr, yr) = RAgg ( yhλ,k (xr, yr) , k [N]) ,

xˆhλ,ag (xr, yr) = RAgg xˆhλ,k (xr, yr) , k [N] .

In the above,

xˆhλ,k (xr, yr) := xfk (xr, yr) + λ ( xgk (xr, yr) xgk (xr, yr,T )) (9)

and yhλ,k (xr, yr) := yfk (xr, yr) + λ ygk (xr, yr) for all k G. In order to access whether the output of the algorithm satisfies the constraint or not, we propose the following notion of violation:

r=0 p(xr, yr), (10)

where xr and yr are the output of the algorithm in round r. Note that when Viol R/R converges to a small constant, it means that p(xr, yr) for sufficiently large r is close to the constant, and hence captures the violation performance of the algorithm. Note: The robust strategy involving binning followed by geometric median is known to satisfy the (α, c) robustness (see (Karimireddy et al., 2020)) property. However, in our case, the gradient of the penalty function is aggregated at the central server. This function depends on λ and the proxy for y (xr), which is obtained by using lines 5 to 11 in Algorithm 1. This makes it necessary for us to prove that the (α, c) robustness is still achieved for the penalty function, as shown in the following lemmas.

Lemma 1. For the robust aggregator RARgg, we have for some c > 0

yhλ,ag (xr, yr) yhλ (xr, yr) 2 cαρ2

where ρ2 := 8ζ2 f + 8λ2ζ2 g.

Proof: See Sec. B in Appendix.

The above lemma shows that the robustness depends on λ and the heterogeneity terms. In the following,

we prove the robustness of x ˆhλ (xr, yr) := 1

G PG k=1 xˆhλ,k (xr, yr) with δ2 := 8λ2L2 g,maxl2 g,max µ2g 1 γµg

8λ2L2 g,maxcαζ2 g µg + 6ζ2 f + 12λ2ζ2 g, where Lg,max := maxk G Lg,k and lg,max := maxk G lg,k.

Lemma 2. The robust aggregator ARgg satisfies

xˆhλ,ag (xr, yr) x ˆhλ (xr, yr) 2 cαδ2.

Proof: See Sec. C in Appendix.

The above lemma shows that the robustness depends on how close the proxy yr,T is to the actual optimal y (xr) captured through the first term in δ2. It also reveals that this can be reduced by increasing T.

Published in Transactions on Machine Learning Research (09/2025)

Algorithm 1 Rob-Fed BOB Algorithm

1: Initialize x0 Rd1, y0 Rd2

2: for r = 0, 1, 2, . . . , R 1 do

3: Send xr, yr to each node 4: Set zr,0 = yr 5: for t = 0, 1, . . . , T 1 do 6: for k N in parallel do 7: Send ygk (xr, yr,t), k N 8: end for

9: yr,t+1 = yr,t γ ygag (xr, yr,t) 10: Send yr,t+1 to all nodes 11: end for 12: Set yr,T = yr,T 13: Server updates xr+1 and yr+1 using equation 8 and equation 7, respectively. 14: end for Output: x R and y R

3.2 Convergence Analysis

In the following, we provide the first main result for the Rob-Fed BOB algorithm.

Theorem 1. Suppose assumptions 1-3 hold, then for the aggregator RAgg, Algorithm 1 achieves the following bound 1 R

r=0 hλ (xr, yr) 2 O λ2

R + α(ζ2 f + λ2ζ2 g)

for constant learning rates η 1 Lh , β 1 Lh , and T 2 γµg log 8λ2RL2 g,maxl2 g,max µ2g

Proof: See Sec. F in Appendix.

The above result reveals the effect of λ, the fraction of the Byzantine nodes α, and the heterogeneity terms ζf and ζg on the convergence rate. Note that when α = 0, i.e., when there are no byzantine nodes, we get a

convergence rate of O λ2

R . Thus, a smaller gradient can be achieved by choosing larger R. On the other hand, in the presence of Byzantine, i.e., α = 0, we have

r=0 hλ (xr, yr) 2 = O(α(ζ2 f + ζ2 g)). (11)

This cannot be made very small unless the heterogeneity terms ζ2 f and ζ2 g are zeros. A similar observation was made in Karimireddy et al. (2020) in the context of single level setting. They also showed a lower bound demonstrating that the result they obtain cannot be improved further. Since single level is a special case of the bilevel problem that we are considering, we believe that the bound in Theorem 1 cannot be improved. Next, we present a bound on the average violation that relates to the gradient of the penalty function.

Lemma 3. Suppose Assumptions 1 and 2 hold, and hλ (x, y) ψ for some ψ > 0, then the average violation in equation 10 is bounded as follows

R 2 l2 f + ψ2

µgλ2 . (12)

Proof: See Sec. D in Appendix.

Published in Transactions on Machine Learning Research (09/2025)

Note that the above shows that by choosing a large λ or if ψ is small results in a good constraint qualification properties. In order to prove a violation bound on the proposed algorithm, we use a bound on hλ (xr, yr) 2

for each r obtained in the proof of Theorem 1.

Theorem 2. Algorithm 1 satisfies the following average constraint violation provide the conditions in Theorem 1 are satisfied

(1 + cαζ2 f) λ2 + cαζ2 g

Proof: See Sec. G in Appendix.

We observe from the above theorem that larger values of λ result in a better violation qualification properties, as expected. In particular, as λ , average violation converges to cαζ2 g revealing the bottleneck due to the Byzantine nodes and the heterogeneity in the lower level function. Since the violation is with respect to the lower level function, it is expected that the heterogeneity in the upper level function impacts mildly on the average violation. However, larger λ results in a bad gradient bound leading to a trade-off between convergence (gradient bound) and violation. Convergence versus Violation trade-off: To better understand the trade-off, consider α = 0 condition, i.e., no Byzantine nodes. In this case, choosing λ = O 1 ϵ results in an average violation of O (ϵ) and

the average gradient converges at the rate of O 1

R recovering the results of (Huang et al., 2023; Tarzanagh et al., 2022) when the variance is zero. The presence of the Byzantine nodes (i.e., α = 0) changes the scenario completely. Choosing λ as above, and R results in an average violation of O cαζ2 g and the convergence rate of the average gradient decreases to

r=0 hλ (xr, yr) 2 O

ζ2 f + ζ2 g ϵ

This also reveals the impact of heterogeneity in the presence of Byzantine nodes.

Remark: Note that the ((α, c)-Robust Aggregator, initially proposed by Karimireddy et al. (2021; 2020) plays an important role in our algorithm. Most of the existing works that use this robust aggregator (Karimireddy et al., 2020; Rammal et al., 2024; Yang et al., 2024a) with single-level optimization problem. However, we consider a bilevel problem, which requires us to carefully combine the upper and lower level objective functions. Moreover, we need to obtain the optimal solution y (x) to the lower level problem, which adds complexity to our analysis. Also, our problem formulation requires proving convergence of the gradient and a bound on the average constraint violation, which is completely new. In addition, λ plays an important role in studying the trade-off between convergence of the gradient and the constraint violation guarantee; this makes the problem more challenging compared to existing work (Karimireddy et al., 2020; Shen & Chen, 2023).

Complexity: The total number of communication rounds for bilevel federated learning problems (with or without Byzantine) is O(R) (Huang et al., 2023). Our communication complexity is O(R log R), and orderwise matches with other existing algorithms except for a log R factor; this is done to ensure a good proxy for y (xr) (see lines 5 to 11 of Algorithm 1). In contrast to the existing algorithms in the federated bilevel learning problem (Tarzanagh et al., 2022; Huang et al., 2023), we require only first-order information, and as a consequence, the run time at each node is very low as demonstrated in our experimental results.

3.3 Proof Sketch of Theorem 1

In order to prove our main results, we first need the smoothness of the penalty problem.

Published in Transactions on Machine Learning Research (09/2025)

Lemma 4. Suppose assumption 1 hold, then the function h(x, y) is Lh smooth i.e.,

hλ(x, y) hλ(x, y ) Lh y y ,

where Lh := Lf + λLg.

Proof: See Sec. E in Appendix.

0 500 1,000 0

R PR 1 r=0 hλ(x, y) 2

α = 0.2 α = 0.4

0 500 1,000

α = 0.2 α = 0.4

0 500 1,000 0

α = 0.2 α = 0.4

0 500 1,000 0

α = 0.2 α = 0.4

Figure 1: Effect of α on the convergence of Rob-Fed BOB under BF (see (a)), LF (see (b)), IPM (see (c)) and ALIE (see (d)) attacks on the data hypercleaning application.

0 500 1,000

log(Viol R)

α = 0.2 α = 0.4

0 500 1,000 100

α = 0.2 α = 0.4

0 500 1,000

α = 0.2 α = 0.4

0 500 1,000

α = 0.2 α = 0.4

Figure 2: Effect of α on the violation under BF (see (a)), LF (see (b)), IPM (see (c)) and ALIE (see (d)) attacks on the data hypercleaning application.

Note that unlike vanilla FL problems, in the bilevel problems, we need extra rounds of communication (see lines 5 to 11 of the algorithm) to obtain a good proxy for the lower level problem, which is stated in the next lemma.

Lemma 5. Under assumptions 1-2 and choosing η 1 µg , the approximation error in Algorithm 1 is bounded as

d2 S(xr)(yr,T ) 2l2 g,max

µ2g γT + 2cαζ2 g µg ,

where γ := 1 γµg

Proof: See Sec. E in Appendix.

Note that by choosing T as in Theorem 1 ensures that d2 S(xr)(yr,T ) O 1 R + 2cαζ2 g µg

indicating that the

presence of Byzantines effect the proxy for the lower level problem as well. Since the gradients yhλ,k (xr, yr) and xˆhλ,k (xr, yr) are sent to the central server, and is aggregated using bucketing followed by geometric median, we first need to relate these gradients to the true average gradients of the good nodes. This is done using Lemmas 1 and 2. Once these results are established, we prove a bound on hλ (x, y) 2, which is

Published in Transactions on Machine Learning Research (09/2025)

summed over r = 0, 1, . . . , R 1 to get a bound on the total gradient as in Theorem 1. We establish a result that connects violation to the bound on hλ (x, y) 2 as in Lemma 3. Using this Lemma, we establish the rates for the average violation as in Lemma 2.

0 500 1,000

R PR 1 r=0 hλ(x, y) 2

α = 0.1 α = 0.2

0 500 1,000

α = 0.2 α = 0.3

0 500 1,000

α = 0.2 α = 0.3

Figure 3: Effect of α on the convergence of Rob-Fed BOB under BF (see (a)), IPM (see (b)) and ALIE (see (c)) attacks on the learnable regularization application.

0 500 1,000 10 4

log(Viol R)

α = 0.1 α = 0.2

0 500 1,000 10 4

α = 0.2 α = 0.3

0 500 1,000 10 4

α = 0.2 α = 0.3

Figure 4: Effect of α on the violation under BF (see (a)), LF (see (b)), IPM (see (c)) and ALIE (see (d)) attacks on the learnable regularization application.

4 Experimental Results

In this section, we experimentally validate our theoretical findings of Rob-Fed BOB. We evaluate the performance of Rob-Fed BOB on two ML applications.

Data Hyper-cleaning: We consider the Data Hyper-cleaning task to experimentally validate our theory (Shaban et al., 2019; Kwon et al., 2023b) for the following attacks (i) BF, (ii) LF, (iii) IPM, and (iv) ALIE. The goal of Data Hyper-cleaning is to solve the following bilevel problem in a federated manner:

i=1 lk,i y (x); Dval k

y (x) arg min y 1 Gn

i=1 σ(xk,i)lk,i y; Dtrain k + c y 2 , (14)

where, Dtrain k := {( ak,i, bk,i)}m i=1 is the noisy training data set and Dval k := {(ak,i, bk,i)}n i=1 denotes a clean validation data set. Here, σ(xk,i) is the sigmoid function, lk () is the cross entropy loss at each node k N and c is the regularization constant. The lower level problem aims to find the optimal model parameters y that minimizes the weighted average of the loss function (with regularization) on the noisy dataset Dtrain k for a fixed set of coefficients σ(xk,i), k G. On the other hand, the upper-level optimization problem aims to find the coefficients σ(xk,i) by minimizing the validation loss learned on the clean dataset Dval k for k G.

Published in Transactions on Machine Learning Research (09/2025)

Learnable Regularization: The goal of learnable regularization is to learn the optimal regularization coefficients for newsgroup dataset (Chen et al., 2023; Liu et al., 2022a). The federated bilevel optimization problem for this task can be posed as:

i=1 lk y (x); Dval k ,

y (x) arg min y 1 Gm

i=1 lk y, Dtrain k + Wxky 2 ,

where, Dtrain k := {( ak,i, bk,i)}m i=1 the training data set and Dval k := {(ak,i, bk,i)}n i=1 denotes the validation data set. The lower level function minimizes the loss on the training dataset across the good nodes by finding the optimal model parameters for a given regularization coefficient. The upper level objective finds the optimal regualrization coefficient by minimizing the validation dataset of good nodes.

We train a linear model on the MNIST dataset with N = 16 nodes. We divide the dataset into equal parts among G good nodes, and ensured that the data distribution is heterogeneous. Also, we consider Byzantine nodes which have access to the entire dataset. We have used the bucketing algorithm followed by geometric median aggregator. Next, we present the four different attacks that we consider: Bit Flipping (BF): In this attack, the Byzantine nodes change the sign of the gradient of the penalty function computed using the entire data set, i.e., it sends hλ(x, y) aiming to cancel the effect of the gradients shared by all the good nodes. See (Karimireddy et al., 2020; Rammal et al., 2024) for more details. Label Flipping (LF): Byzantine nodes modify the label of the MNIST dataset to 9 z, z {0, 1, . . . , 9}. Inner product Manipulation (IPM): The byzantine nodes will send G P

k G hλ,k(x, y), where decides the intensity of the attack (Xie et al., 2020). A Little is Enough (ALIE): The byzantine nodes send µG νσG to the server, where µG and σG are the mean and the standard deviations of the good nodes gradients, respectively. Here, ν dictates the intensity of the attack (Baruch et al., 2019).

Next, we present the experimental results for Rob-Fed BOB algorithm and corroborate our theoretical findings made in this paper for N = 16 nodes:

Effect of α and heterogeneity: Figure 1 and 3 show the convergence rate of the proposed algorithm versus the communication rounds R under different attacks for the application of the data hypercleaning and learnable regularization, respectively. We have chosen B = 3 and B = 6 for a total of N = 16 nodes which result in approximately α = 0.2 and α = 0.4, respectively. It is clear from the figure that the gradients does not converge to zero due to the presence of Byzantine nodes, as expected. The figures also demonstrate that the gradient converges to a constant that scales with α. This corroborates with Theorem 1, where we have shown that 1 R PR 1 r=0 hλ (xr, yr) 2 increases as α increases because of α(ζ2 f + λ2ζ2 g) term, provided λ and the inter-client heterogeneity are fixed. Figures 2 and 4 show the plot of the constraint violation (in log scale) versus the number of communication rounds R under different attacks for data hypercleaning and learnable regularization applications, respectively. In particular, the figure shows that by increasing α for a fixed λ = 1 results in a poor violation performance. This corroborates our theoretical findings (see Theorem 2), where we have proved that Viol R

R increases as α increases at the rate of α(ζ2 f + λ2ζ2 g).

In this subsection, we use the same setting as explained in our experimental results section (see 4). Effect of λ: Figures 5(a) and (b) show the plot of convergence rate (in log scale) and the constraint violation (in log scale) versus R by varying λ while fixing α = 0.2. The heterogeneity in the data causes the Rob-Fed BOB to convergence to a non-zero gradient and violation, as expected. Note that Theorem 1 shows that the average gradient hλ (xr, yr) 2 scales with λ as λ2

R . On the contrary, Fig. 5(b) shows that larger λ results in a better violation. This is in agreement with Theorem 2 where we have shown that Viol R decreases at the rate of 1 λ2 .

Effect of T: Figure 6(a) and (b) show the effect of T on the convergence and constraint violation of Rob-Fed BOB algorithm. We get O( 1

R) rate when T O(log(R)) rounds.

Published in Transactions on Machine Learning Research (09/2025)

0 200 400 600 800 1,000

R PR 1 r=0 hλ(x, y) 2

λ = 1 λ = 300

0 200 400 600 800 1,000 100

log(Viol R)

λ = 1 λ = 300

Figure 5: Effect of λ on the convergence of Rob-Fed BOB (see (a)) and violation (see (b)) under BF attack in the log scale for the data hyperclearning application.

0 200 400 600 800 1,000

R PR 1 r=0 hλ(x, y) 2

T = 1 T = 10

0 200 400 600 800 1,000

log(Viol R)

T = 1 T = 10

Figure 6: Effect of T on the convergence of Rob-Fed BOB (see (a)) and violation (see (b)) under LF attack in the log scale for the data hypercleaning application.

0 50 100 150

Rob-Fed BOB Bilantine

Figure 7: Performance comparison of Rob-Fed BOB with BILATINE.

Performance Comparison: The Figure 7 shows the performance comparison between our proposed method (Rob-Fed BOB) and Hessian-based federated bilevel optimization algorithm (BILANTINE) for the data hyper cleaning application. We have considered the bit flipping attack in both the cases. In Kwon et al. (2023b), the authors have demonstrated the superiority of the penalty based method over the second order (Hessian) method in the absence of Byzantine nodes in the centralized case. We on the other hand show superiority of our penalty method compared to BILATINE (second order method) in the FL setting in the presence of Byzantine nodes. As stated in Kwon et al. (2023b), the exact mathematical reason of why the penalty methods work better than the second order methods is not clear, and will be relagated to the future work.

Published in Transactions on Machine Learning Research (09/2025)

5 Conclusion

In this work, we considered the problem of bilevel optimization with Byzantine nodes where both upper and lower level objective functions are non-convex. Typically, bilevel problems are solved using an estimate of the hypergradient which requires second-order information about the lower level objective making it computationally expensive. Further, the presence of Byzantine nodes require the algorithm to be robust against attacks. We propose Rob-Fed BOB algorithm, a fully first order byzantine robust federated algorithm that (i) does not require second-order information making it computationally efficient, (ii) handles nonconvex lower level objective function satisfying PL-inequality and heterogeneous data, and (iii) aggregates the information from all the nodes in a robust manner making it resilient to Byzantine attacks. We prove theoretical performance of the proposed algorithm, and show that the rate at which the gradient and the average violation scale depends heavily on the fraction of the Byzantine nodes and the heterogeneity of the upper and lower level objective functions. In the absence of Byzantine nodes, we show that the gradient converges at the rate of 1/R, where R is the number of communication rounds while the violation can be made arbitrarily small. We have performed extensive experiments to corroborate our theoretical findings and to demonstrate that the complexity of Rob-Fed BOB is very low.

Acknowledgements: This work is in part supported by MEi TY, India AI Mission (2024-2026), SERBCRG (grant number: CRG/2021/007502) and SERB-MATRICS (grant number: MTR/2021/000575). The authors would like to thank the anonymous reviewers for their valuable comments.

Momin Abbas, Yi Zhou, Nathalie Baracaldo, Horst Samulowitz, Parikshit Ram, and Theodoros Salonidis. Byzantine-resilient bilevel federated learning. In 2024 IEEE 13rd Sensor Array and Multichannel Signal Processing Workshop (SAM), pp. 1 5, 2024. doi: 10.1109/SAM60225.2024.10636694.

Gilad Baruch, Moran Baruch, and Yoav Goldberg. A little is enough: Circumventing defenses for distributed learning. Advances in Neural Information Processing Systems, 32, 2019.

Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Machine learning with adversaries: byzantine tolerant gradient descent. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 17, pp. 118 128, Red Hook, NY, USA, 2017a. Curran Associates Inc. ISBN 9781510860964.

Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017b. URL https://proceedings.neurips.cc/paper_files/paper/2017/ file/f4b9ec30ad9f68f89b29639786cb62ef-Paper.pdf.

Lesi Chen, Yaohua Ma, and Jingzhao Zhang. Near-optimal fully first-order algorithms for finding stationary points in bilevel optimization. ar Xiv preprint ar Xiv:2306.14853, 2023.

Tianyi Chen, Yuejiao Sun, and Wotao Yin. Tighter analysis of alternating stochastic gradient method for stochastic nested problems. ar Xiv preprint ar Xiv:2106.13781, 2021.

Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):1 25, 2017.

Caroline Crockett, Jeffrey A Fessler, et al. Bilevel methods for image reconstruction. Foundations and Trends in Signal Processing, 15(2-3):121 289, 2022.

Deepesh Data and Suhas Diggavi. Byzantine-resilient sgd in high dimensions on heterogeneous data. In 2021 IEEE International Symposium on Information Theory (ISIT), pp. 2310 2315. IEEE, 2021.

Published in Transactions on Machine Learning Research (09/2025)

Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International conference on machine learning, pp. 1568 1577. PMLR, 2018.

Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming. ar Xiv preprint ar Xiv:1802.02246, 2018.

Eduard Gorbunov, Samuel Horv ath, Peter Richt arik, and Gauthier Gidel. Variance reduction is an antidote to byzantines: Better rates, weaker assumptions and communication compression as a cherry on the top. 2022.

Lie He, Sai Praneeth Karimireddy, and Martin Jaggi. Secure byzantine-robust machine learning. ar Xiv preprint ar Xiv:2006.04747, 2020.

Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic. SIAM Journal on Optimization, 33(1):147 180, 2023.

Minhui Huang, Dewei Zhang, and Kaiyi Ji. Achieving linear speedup in non-iid federated bilevel learning. In International Conference on Machine Learning, pp. 14039 14059. PMLR, 2023.

Sai Praneeth Karimireddy, Lie He, and Martin Jaggi. Byzantine-robust learning on heterogeneous datasets via bucketing. In International Conference on Learning Representations, 2020.

Sai Praneeth Karimireddy, Lie He, and Martin Jaggi. Learning from history for byzantine robust optimization. In International Conference on Machine Learning, pp. 5311 5319. PMLR, 2021.

Prashant Khanduri, Siliang Zeng, Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A nearoptimal algorithm for stochastic bilevel optimization via double-momentum. Advances in neural information processing systems, 34:30271 30283, 2021.

Prashant Khanduri, Ioannis Tsaknakis, Yihua Zhang, Jia Liu, Sijia Liu, Jiawei Zhang, and Mingyi Hong. Linearly constrained bilevel optimization: a smoothed implicit gradient approach. In Proceedings of the 40th International Conference on Machine Learning, ICML 23. JMLR.org, 2023.

Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, and Robert Nowak. On penalty methods for nonconvex bilevel optimization and first-order stochastic approximation. ar Xiv preprint ar Xiv:2309.01753, 2023a.

Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, and Robert D Nowak. A fully first-order method for stochastic bilevel optimization. In International Conference on Machine Learning, pp. 18083 18113. PMLR, 2023b.

Junyi Li, Feihu Huang, and Heng Huang. Local stochastic bilevel optimization with momentum-based variance reduction. ar Xiv preprint ar Xiv:2205.01608, 2022.

Bo Liu, Mao Ye, Stephen Wright, Peter Stone, and Qiang Liu. Bome! bilevel optimization made easy: A simple first-order approach. Advances in neural information processing systems, 35:17248 17262, 2022a.

Chaoyue Liu, Libin Zhu, and Mikhail Belkin. Loss landscapes and optimization in over-parameterized nonlinear systems and neural networks. Applied and Computational Harmonic Analysis, 59:85 116, 2022b.

Brendan Mc Mahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp. 1273 1282. PMLR, 2017.

Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D Lee, and Meisam Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. Advances in Neural Information Processing Systems, 32, 2019.

Published in Transactions on Machine Learning Research (09/2025)

Krishna Pillutla, Sham M. Kakade, and Zaid Harchaoui. Robust aggregation for federated learning. IEEE Transactions on Signal Processing, 70:1142 1154, 2022a. doi: 10.1109/TSP.2022.3153135.

Krishna Pillutla, Sham M Kakade, and Zaid Harchaoui. Robust aggregation for federated learning. IEEE Transactions on Signal Processing, 70:1142 1154, 2022b.

Aravind Rajeswaran, Chelsea Finn, Sham M Kakade, and Sergey Levine. Meta-learning with implicit gradients. Advances in neural information processing systems, 32, 2019.

Ahmad Rammal, Kaja Gruntkowska, Nikita Fedin, Eduard Gorbunov, and Peter Richtárik. Communication compression for byzantine robust learning: New efficient algorithms and improved rates. In International Conference on Artificial Intelligence and Statistics, pp. 1207 1215. PMLR, 2024.

Felix Sattler, Klaus-Robert Müller, Thomas Wiegand, and Wojciech Samek. On the byzantine robustness of clustered federated learning. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8861 8865, 2020. doi: 10.1109/ICASSP40776.2020.9054676.

Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated back-propagation for bilevel optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1723 1732. PMLR, 2019.

Han Shen and Tianyi Chen. On penalty-based bilevel gradient descent method. In International Conference on Machine Learning, pp. 30992 31015. PMLR, 2023.

Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. Federated multi-task learning. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/ 6211080fa89981f66b1a0c9d55c61d0f-Paper.pdf.

Haoran Sun, Wenqiang Pu, Minghe Zhu, Xiao Fu, Tsung-Hui Chang, and Mingyi Hong. Learning to continuously optimize wireless resource in episodically dynamic environment. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4945 4949. IEEE, 2021.

Davoud Ataee Tarzanagh, Mingchen Li, Christos Thrampoulidis, and Samet Oymak. Fednest: Federated bilevel, minimax, and compositional optimization. In International Conference on Machine Learning, pp. 21146 21179. PMLR, 2022.

Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, and Jan S Rellermeyer. A survey on distributed machine learning. Acm computing surveys (csur), 53(2):1 33, 2020.

Zhaoxian Wu, Qing Ling, Tianyi Chen, and Georgios B Giannakis. Federated variance-reduced stochastic gradient descent with robustness to byzantine attacks. IEEE Transactions on Signal Processing, 68:4583 4596, 2020.

Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Fall of empires: Breaking byzantine-tolerant sgd by inner product manipulation. In Uncertainty in Artificial Intelligence, pp. 261 270. PMLR, 2020.

Chao Xue, Xiaoxing Wang, Junchi Yan, Yonggang Hu, Xiaokang Yang, and Kewei Sun. Rethinking bilevel optimization in neural architecture search: A gibbs sampling perspective. In AAAI Conference on Artificial Intelligence, 2021. URL https://api.semanticscholar.org/Corpus ID:235349202.

Yi-Rui Yang, Chang-Wei Shi, and Wu-Jun Li. On the effect of batch size in byzantine-robust distributed learning. In The Twelfth International Conference on Learning Representations, 2024a. URL https: //openreview.net/forum?id=wri KDQqi OQ.

Yifan Yang, Peiyao Xiao, and Kaiyi Ji. Simfbo: Towards simple, flexible and communication-efficient federated bilevel learning. Advances in Neural Information Processing Systems, 36, 2024b.

Published in Transactions on Machine Learning Research (09/2025)

Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In International conference on machine learning, pp. 5650 5659. Pmlr, 2018.

Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted sgd with faster convergence and less communication: Demystifying why model averaging works for deep learning. In AAAI Conference on Artificial Intelligence, 2018. URL https://api.semanticscholar.org/Corpus ID:53285773.

Yihua Zhang, Guanhua Zhang, Prashant Khanduri, Mingyi Hong, Shiyu Chang, and Sijia Liu. Revisiting and advancing fast adversarial training through the lens of bi-level optimization. In International Conference on Machine Learning, pp. 26693 26712. PMLR, 2022.

Yihua Zhang, Prashant Khanduri, Ioannis Tsaknakis, Yuguang Yao, Mingyi Hong, and Sijia Liu. An introduction to bilevel optimization: Foundations and applications in signal processing and machine learning. IEEE Signal Processing Magazine, 41(1):38 59, 2024. doi: 10.1109/MSP.2024.3358284.

Shiyuan Zuo, Xingrun Yan, Rongfei Fan, Li Shen, Puning Zhao, Jie Xu, and Han Hu. Byzantine-resilient federated learning employing normalized gradients on non-iid datasets. ar Xiv preprint ar Xiv:2408.09539, 2024.

Published in Transactions on Machine Learning Research (09/2025)

A.1 Useful Lemma

In this subsection, we prove a lemma that relates the gradient of the lower level functions with and without aggregation.

Lemma 6. Suppose the aggregator RAgg is (α, c) robust aggregator (see definition 1), then

ygag (xr, yr,t) yg (xr, yr,t) 2 4cαζ2 g.

Proof: It follows from the definition 1 of (α, c)-Robust Aggregator that the above result can be shown provided we prove a bound on E ygi (xr, yr,t) ygj (xr, yr,t) 2, where the expectation is with respect to uniformly randomly chosen i, j G. Adding and subtracting the term yg (xr, yr) and using the inequality a + b 2 2 a 2 + 2 b 2, we get

E ygi (xr, yr,t) ygj (xr, yr,t) 2 2Ei G ygi (xr, yr) yg (xr, yr,t) 2 +

2Ej G ygj (xr, yr) yg (xr, yr,t) 2

= 4Ei G ygi (xr, yr) yg (xr, yr,t) 2 . (15)

Using the inter-client heterogeneity in Assumption 3, the above is bounded as

Ei G ygi (xr, yr) yg (xr, yr,t) 2 ζ2 g. (16)

Using this in equation 15, we get

E ygi (xr, yr,t) ygj (xr, yr,t) 2 4ζ2 g.

From the definition 1 of (α, c)-Robust Aggregator, it follows that

ygag (xr, yr,t) yg (xr, yr,t) 2 4cαζ2 g.

This completes the proof.

B Proof of Lemma 1

In this section, we prove a bound on yhλ,ag (xr, yr) yhλ (xr, yr) 2 to show robustness of the aggregator with respect to the penalty function. From Definition 1, we know that robustness can be proved by proving a bound on E yhλ,i (xr, yr) xhλ,j (xr, yr) 2, where expectation is with respect i, j uniformly sampled from G. Adding and subtracting the term yhλ (xr, yr) and using the inequality a + b 2 2 a 2 + 2 b 2, we get

E yhλ,i (xr, yr) yhλ,j (xr, yr) 2 2Ei G yhλ,i (xr, yr) yhλ (xr, yr) 2 +

2Ej G yhλ,j (xr, yr) yhλ (xr, yr) 2

= 4Ei G yhλ,i (xr, yr) yhλ (xr, yr) 2 . (17)

Consider the above equation, i.e., A := Ei G yhλ,i (xr, yr) yhλ (xr, yr) 2. Substituting yhλ,i (xr, yr) = yfi (xr, yr) + λ ygi (xr, yr) for all i G and yhλ (xr, yr) = yf (xr, yr) + λ yg (xr, yr), we get

A = Ei G yfi (xr, yr) + λ ygi (xr, yr) yf (xr, yr) λ yg (xr, yr) 2

(a) 2Ei G yfi (xr, yr) yf (xr, yr) 2 + 2Ei Gλ2 ygi (xr, yr) yg (xr, yr) 2 ,

2ζ2 f + 2λ2ζ2 g, (18)

Published in Transactions on Machine Learning Research (09/2025)

where (a) follows from the inequality, a + b 2 2 a 2 + 2 b 2, and the last inequality follows from the heterogeneity Assumption 3. Substituting the above in equation 17, we get

E yhλ,i (xr, yr) yhλ,j (xr, yr) 2 8ζ2 f + 8λ2ζ2 g.

From the definition 1 of (α, c)-Robust Aggregator,it is clear that the output of robust aggregator satisfies

yhλ,ag (xr, yr) yhλ (xr, yr) 2 cα 8ζ2 f + 8λ2ζ2 g .

This completes the proof.

C Proof of Lemma 2

In this section, we prove a bound on xˆhλ,ag (xr, yr) x ˆhλ (xr, yr) 2 . Towards this, we need to first

prove a bound on (see Definition 1) E xˆhλ,i (xr, yr) xˆhλ,j (xr, yr) 2 , where i, j are uniformly chosen

from G. Adding and subtracting the term xhλ (xr, yr) to the above and using the inequality a + b 2 2 a 2 + 2 b 2, we get

E xˆhλ,i (xr, yr) xˆhλ,j (xr, yr) 2 2Ei G xˆhλ,i (xr, yr) xhλ (xr, yr) 2 +

2Ej G xˆhλ,j (xr, yr) xhλ (xr, yr) 2

= 4Ei G xˆhλ,i (xr, yr) xhλ (xr, yr) 2 . (19)

The following lemma is required to prove a bound on the above equation.

Lemma 7. Under Assumption 3, the following bound is satisfied

Ei G xhλ,i (xr, yr) xhλ,j (xr, yr) 2 4λ2L2 g,maxd2 S(xr)(yr,T ) + 6ζ2 f + 12λ2ζ2 g. (20)

Proof: Towards bounding the above equation, let us add and subtract the term x ˆhλ (xr, yr) := 1 G P k G ˆhλ,j (xr, yr) (see Lemma 2)

Ei G xˆhλ,i (xr, yr) xhλ (xr, yr) 2 = Ei G xˆhλ,i (xr, yr) x ˆhλ (xr, yr) +

x ˆhλ (xr, yr) xhλ (xr, yr) 2

2 Ei G xˆhλ,i (xr, yr) x ˆhλ (xr, yr) 2

| {z } :=A1

2 x ˆhλ (xr, yr) xhλ (xr, yr) 2

| {z } :=A2

where the above follows from the inequality a + b 2 2 a 2 + 2 b 2. Consider

A1 = Ei G xˆhλ,i (xr, yr) x ˆhλ (xr, yr) 2 .

Published in Transactions on Machine Learning Research (09/2025)

Substituting xˆhλ,i (xr, yr) = xfi (xr, yr) + λ ( xgi (xr, yr) xgi (xr, yr,T )) from equation 9, and

x ˆhλ (xr, yr) = xf (xr, yr) + λ ( xg (xr, yr) xg (xr, yr,T )) in the above, we get

A1 = Ei G xfi (xr, yr) + λ ( xgi (xr, yr) xgi (xr, yr,T )) xf (xr, yr)

λ ( xg (xr, yr) xg (xr, yr,T )) 2

(a) 3Ei G xfi (xr, yr) xf (xr, yr) 2 + 3Ei Gλ2 xgi (xr, yr) xg (xr, yr) 2

+ 3Ei Gλ2 xgi (xr, yr,T ) xg (xr, yr,T ) 2 , (22)

where (a) follows from the inequality, a + b + c 2 3 a 2 + 3 b 2 + 3 c 2. Now, consider

A1 3Ei xfi (xr, yr) xf (xr, yr) 2 + 3λ2Ei xgi (xr, yr) xg (xr, yr) 2

+ 3λ2Ei xgi (xr, yr,T ) xg (xr, yr,T ) 2

3ζ2 f + 6λ2ζ2 g, (23)

where the last inequality follows from the inter-client heterogeneity Assumption 3, i.e., Ei G fi (x, y) f (x, y) 2 ζ2 f and Ei G gi (x, y) g (x, y) 2 ζ2 g. Let us consider the term A2 from equation 21

A2 = x ˆhλ (xr, yr) xh (xr, yr) 2 .

Substituting x ˆhλ (xr, yr) = xf (xr, yr) + λ ( xg (xr, yr) xg (xr, yr,T )) and xh (xr, yr) = xf (xr, yr) λ ( xg (xr, yr) xv (xr)) in the above, we get

A2 xf (xr, yr) + λ ( xg (xr, yr) xg (xr, yr,T )) xf (xr, yr) λ ( xg (xr, yr) xv (xr)) 2.

Simplifying further, we get

A2 λ2 xv (xr) xg (xr, yr,T ) 2 .

Using Jensen s inequality, we get

k=1 xvk (xr) xgk (xr, yr,T ) 2

k=1 L2 g,kd2 S(xr)(yr,T )

(b) λ2L2 g,maxd2 S(xr)(yr,T ), (24)

where (a) follows from the Assumption 1, and L2 g,max := maxk G L2 g,k. Substituting equation 23, equation 24 in equation 21, we get

Ei G xˆhλ,i (xr, yr) xhλ (xr, yr) 2 2λ2L2 g,maxd2 S(xr)(yr,T ) + 6ζ2 f + 12λ2ζ2 g. (25)

Using the results from equation 25 in equation 19, we get the desired result. This completes the proof of the Lemma.

To complete the proof of Lemma 2, we substitute the result from Lemma 5 in equation 20 to get

Ei G xhλ,i (xr, yr) xhλ,j (xr, yr) 2 8λ2L2 g,maxl2 g,max µ2g

T + 16λ2L2 g,maxcαζ2 g µg +6ζ2 f + 12λ2ζ2 g.

Published in Transactions on Machine Learning Research (09/2025)

Using the definition 1 of (α, c)-robust aggregator, the output of the robust aggregator satisfies the following

xˆhλ,ag (xr, yr) x ˆhλ (xr, yr) 2 cα

8λ2L2 g,maxl2 g,max µ2g

T + 16λ2L2 g,maxcαζ2 g µg

+6ζ2 f + 12λ2ζ2 g . (26)

This completes the proof.

D Proof of Lemma 3

Suppose (xr, yr) is an approximate stationary point of Fed BOB Penalty Problem, i.e.,

yhλ(xr, yr) = yf (xr, yr) + λ yg (xr, yr) ψr, (27)

where ψr > 0 is the approximation error. Then, for some vr such that vr ψr, the above can be written as

yf (xr, yr) = λ yg (xr, yr) + vr.

Rearranging the above, taking norms on both sides and applying triangle s inequality, we get

λ yg (xr, yr) yf (xr, yr) + ψr.

Squaring on both the sides, we get

yg (xr, yr) 2 (a) 2 λ2

yf (xr, yr) 2 + ψ2 r

k=0 yfk (xr, yr) 2 + ψ2 r

(c) 2 l2 f + ψ2 r

where (a) follows from the inequality a + b 2 2 a 2 + 2 b 2, (b) uses the Jensen s inequality, and (c) follows from Assumption 1 i.e., yfk (x, y) 2 l2 f for all k G. From Assumption 2, we know that

g (xr, yr) gt (xr, y (xr)) 1

µg yg (xr, yr) 2 . (29)

Substituting equation 28 in equation 29, we get

p (xr, yr) := g (xr, yr) g (xr, y (xr)) 2l2 f µgλ2 + 2ψ2 r µgλ2 .

Summing from r = 0 to R 1, and dividing by R, we get

r=0 p (xr, yr) 2l2 f µgλ2 + 2 µgλ2 1

r=0 ψ2 r. (30)

Note that the result in Lemma 3 can be obtained by choosing (xr, yr) = (x, y) for all r satisfying yhλ(x, y) ψ. This completes the proof.

E Proof of Lemma 4 and Lemma 5

In this section, we prove smoothness of the penalty function in Lemma 4 and a bound on d2 S(xr)(yr,T ) in Lemma E.0.2. Recall that yr,T is the output of the Algorithm 1 in step 12 which is used as a proxy for y arg miny g(xr, y). First, we state the proof of Lemma 4.

Published in Transactions on Machine Learning Research (09/2025)

E.0.1 Proof of Lemma 4

Recall from equation 3 that the penalty function hλ (x, y) := f (x, y) + λp (x, y). Consider the following

yhλ(x, y) yhλ(x, y ) = yf (x, y) + λ yp (x, y) yf (x, y ) λ yp (x, y )

yf (x, y) yf (x, y ) + λ yp (x, y) yp (x, y ) .

The above inequality follows from the triangle s inequality i.e., a + b a + b . From equation 2, we know that p (x, y ) = g (x, y ) g (x, y ), where y arg miny g(x, y). Using the fact that yg (x, y ) = 0 in the above, we get

hλ(x, y) hλ(x, y ) = yf (x, y) yf (x, y ) + λ yg (x, y) yg (x, y )

k=1 ( yfk (x, y) yfk (x, y ))

k=1 ( ygk (x, y) ygk (x, y ))

k=1 Lf,k y y + λ

k=1 Lg,k y y

(b) (Lf,max + λLg,max) y y .

(c) Lh y y ,

where (a) follows from the triangle s inequality and Assumption 1 and (b) uses the fact that Lf,max := maxk G Lf,k and Lg,max := maxk G Lg,k, and (c) results from Lh := Lf,max + λLg,max. This completes the proof.

E.0.2 Proof of Lemma 5

First, note that in Lemma 5, we have used yr,T = yr,T . Using smoothness property from the Assumption 1, we have

g (xr, yr,t+1) g (xr, yr,t) + yr,t+1 yr,t, yg (xr, yr,t) + Lg

2 yr,t+1 yr,t 2 .

Substituting the update in 6 i.e., yr,t+1 = yr,t γ ygag (xr, yr,t) in the above, we have

g (xr, yr,t+1) g (xr, yr,t) γ ygag (xr, yr,t) , yg (xr, yr,t) + γ2Lg

ygag (xr, yr,t) 2 .

Using the identity a, b = 1

2 a b 2, we get

g (xr, yr,t+1) g (xr, yr,t) γ

2 ygag (xr, yr,t) 2 γ

2 yg (xr, yr,t) 2

2 ygag (xr, yr,t) yg (xr, yr,t) 2 + γ2Lg

ygag (xr, yr,t) 2 . (31)

Choosing γ 1 Lg in the above, and ignoring the negative term, we get

g (xr, yr,t+1) g (xr, yr,t) γ

2 yg (xr, yr,t) 2 + γ

2 ygag (xr, yr,t) yg (xr, yr,t) 2. (32)

Using the result of Lemma 6, the third term in the above is bounded as

ygag (xr, yr,t) yg (xr, yr,t) 2 4cαζ2 g. (33)

Using the PL-inequality in Assumption 2 i.e., g (xr, yr,t) 2 µg (g (xr, yr,t) v(xr) and substituting equation 33 in equation 32, we get

g (xr, yr,t+1) g (xr, yr,t) γµg

2 (g (xr, yr,t) v(xr) + 2γcαζ2 g

Published in Transactions on Machine Learning Research (09/2025)

Adding and subtracting the term v (xr), we have

g (xr, yr,t+1) v (xr) 1 γµg

(g (xr, yr,t) v(xr)) + 2γcαζ2 g.

Applying recursion, we have

g (xr, yr,T ) v (xr) 1 γµg

T (g (xr, yr) v(xr) +

t 2γcαζ2 g.

It is important to note that the PL-inequality implies quadratic growth condition, i.e., g (xr, yr,T ) v (xr) µg

2 d2 S(xr)(yr,T ) for some µg > 0. Since g() satisfies the PL-inequality, invoking the quadratic growth condition, we get

d2 S(xr)(yr,T ) 2 µg

T (g (xr, yr) v (xr)) +

t 2γcαζ2 g.

Next, we can further bound the above using the fact that the lower level function satisfies the PL-inequality (see Assumption 2):

d2 S(xr)(yr,T ) 2 µ2g

T g (xr, yr) 2 +

t 2γcαζ2 g. (34)

Using Assumption 1, i.e., gk (xr, yr) 2 l2 g,k, we have g (xr, yr) 2 1

G PG k=1 gk (xr, yr) 2 l2 g,max. Using this in the above results in

d2 S(xr)(yr,T ) 2l2 g,max

t 2γcαζ2 g. (35)

The above is further bounded using geometric series as

d2 S(xr)(yr,T ) 2l2 g,max

T + 4cαζ2 g µg . (36)

This completes the proof.

F Proof of Theorem 1

First, we state and prove the following Lemma, which is also useful while proving the violation bound.

Lemma 8. Suppose assumptions 1-3 hold, then for the aggregator RAgg, Algorithm 1 achieves the following bound

hλ (xr, yr) 2 (f (xr, yr) f (xr+1, yr+1)) + λ (g (xr, yr) g (xr+1, yr+1))

λ (v (xr+1) v (xr))) + βcαδ2 + η

2cαρ2 + 2βλ2L2 g,maxl2 g,max µ2g

+2βλ2L2 g,maxcαζ2 g µg . (37)

for β 1/Lh and η 1/Lh, and for any T 1.

Published in Transactions on Machine Learning Research (09/2025)

Proof: Using the smoothness result from Lemma 4, we can write

hλ (xr+1, yr+1) hλ (xr, yr) + xr+1 xr, xhλ (xr, yr) + yr+1 yr, yhλ (xr, yr)

h xr+1 xr 2 + yr+1 yr 2i . (38)

Substituting for the update xr+1 xr = β xˆhλ,ag (xr, yr) and yr+1 yr = η yhλ,ag (xr, yr) from equation 8 and equation 7, we get

hλ (xr+1, yr+1) hλ (xr, yr) β D xˆhλ,ag (xr, yr) , xhλ (xr, yr) E η yhλ,ag (xr, yr) , yhλ (xr, yr)

xˆhλ,ag (xr, yr) 2 + η2Lh

yhλ,ag (xr, yr) 2 .

Using the identity a, b = 1

2 a b 2, and adding and subtracting the term x ˆhλ (xr, yr), we have

hλ (xr+1, yr+1) hλ (xr, yr) β

xˆhλ,ag (xr, yr) 2 β

2 xhλ (xr, yr) 2

xˆhλ,ag (xr, yr) x ˆhλ (xr, yr) + x ˆhλ (xr, yr) xhλ (xr, yr) 2

yhλ,ag (xr, yr) 2 η

2 yhλ (xr, yr) 2 + η

yhλ,ag (xr, yr) yhλ (xr, yr) 2

xˆhλ,ag (xr, yr) 2 + η2Lh

yhλ,ag (xr, yr) 2 .

Combining the common terms, choosing η 1 Lh and β 1 Lh , and applying the inequality a + b 2

2 a 2 + 2 b 2, we get

hλ (xr+1, yr+1) hλ (xr, yr) β

2 xhλ (xr, yr) 2 + β xˆhλ,ag (xr, yr) x ˆhλ (xr, yr) 2

+β x ˆhλ (xr, yr) xhλ (xr, yr) 2 η

2 yhλ (xr, yr) 2 +

yhλ,ag (xr, yr) yhλ (xr, yr) 2 . (39)

Using xˆhλ,k (xr, yr) := xfk (xr, yr) + λ ( xgk (xr, yr) xgk (xr, yr,T )) in the fourth term above, we get

x ˆhλ (xr, yr) xhλ (xr, yr) 2 xf (xr, yr) + λ

xg (xr, yr) 1

k=1 xgk (xr, yr,T )

xf (xr, yr) λ

xg (xr, yr) 1

k=1 xvk (xr)

Simplifying the above results in

x ˆhλ (xr, yr) xhλ (xr, yr) 2 λ2 1 G

k=1 xvk (xr) 1

k=1 xgk (xr, yr,T )

Using Jensen s inequality, the above is further bounded as

x ˆhλ (xr, yr) xhλ (xr, yr) 2 λ2

k=1 xvk (xr) xgk (xr, yr,T ) 2

k=1 L2 g,kd2 S(xr)(yr,T )

(b) λ2L2 g,maxd2 S(xr)(yr,T ),

Published in Transactions on Machine Learning Research (09/2025)

where (a) follows from the Assumption 1 and (b) uses the fact that L2 g,max := maxk G L2 g,k. Now using the result from Lemma 5 in equation 40, we get x ˆhλ (xr, yr) xhλ (xr, yr) 2 2λ2L2 g,maxl2 g,max µ2g

T + 4λ2L2 g,maxcαζ2 g µg . (40)

From Lemma 2, the third term in equation 39 can be bounded as xˆhλ,ag (xr, yr) x ˆhλ (xr, yr) 2 cαδ2, (41)

where δ2 := 8λ2L2 g,maxl2 g,max µ2g 1 γµg

2 T + 16λ2L2 g,maxcαζ2 g µg + 6ζ2 f + 12λ2ζ2 g. Similarly, from Lemma 1, the last term in equation 39 is bounded as yhλ,ag (xr, yr) yhλ (xr, yr) 2 cαρ2. (42)

where ρ2 := cα 8ζ2 f + 8λ2ζ2 g . Substituting equation 40, equation 41 and equation 42 in equation 39, we get

hλ (xr+1, yr+1) hλ (xr, yr) β

2 xhλ (xr, yr) 2 η

2 yhλ (xr, yr) 2 + βcαδ2 + η

+2βλ2L2 g,maxl2 g,max µ2g

T + 2βλ2L2 g,maxcαζ2 g µg .

Rearranging and using the fact that hλ (x, y) = f (x, y) + λ (g (x, y) v (x)), the above becomes

2 xhλ (xr, yr) 2 + η

2 yhλ (xr, yr) 2 f (xr, yr) + λ (g (xr, yr) v (xr))

f (xr+1, yr+1) λ (g (xr+1, yr+1) v (xr+1))

2cαρ2 + 2βλ2L2 g,maxl2 g,max µ2g

+2βλ2L2 g,maxcαζ2 g µg . (43)

Using the fact that, min n β 2 , η

2 o hλ (xr, yr) 2 β

2 xhλ (xr, yr) 2 + η

2 yhλ (xr, yr) 2 and rearranging results in the bound of Lemma 8. This completes the proof of the lemma.

Completing the Proof of Theorem 1: Now, summing both sides of the the bound of Lemma 8 results in

r=0 hλ (xr, yr) 2 (f (x0, y0) f (x R, y R)) + λ (g (x0, y0) g (x R, y R))

λ (v (x R)) v (x0)) + βRcαδ2 + η

2βRλ2L2 g,maxl2 g,max µ2g

T + 2βRλ2L2 g,maxcαζ2 g µg . (44)

Since we need η, β 1/Lh, we assume that both η and β are of the same order. This results in

r=0 hλ (xr, yr) 2 (f (x0, y0) f (x R, y R))

ηR + λ (g (x0, y0) g (x R, y R))

λ (v (x R)) v (x0))

ηR + cαδ2 + 1

2cαρ2 + 2λ2L2 g,maxl2 g,max µ2g

+2λ2L2 g,maxcαζ2 g µg . (45)

Published in Transactions on Machine Learning Research (09/2025)

We wish the second last term on the right hand side in the above equation to satisfy

2λ2L2 g,maxl2 g,max µ2g

The above condition is achieved by choosing T 2 γµg log 2Rλ2L2 g,maxl2 g,max µ2g

; this makes the communication complexity slighter higher than the conventional second order methods. This will be relaxed in the next result that we state. Substituting equation 46 in equation 45, we get

r=0 hλ (xr, yr) 2 (f (x0, y0) f (x R, y R))

ηR + λ (g (x0, y0) g (x R, y R))

λ (v (x R)) v (x0))

ηR + cαδ2 + 1

R + 2λ2L2 g,maxcαζ2 g µg .

Now, substituting for δ2 and ρ2, we get

r=0 hλ (xr, yr) 2 (f (x0, y0) f (x R, y R))

ηR + λ (g (x0, y0) g (x R, y R))

ηR λ (v (x R)) v (x0))

8λ2L2 g,maxl2 g,max µ2g

T + 8λ2L2 g,maxcαζ2 g µg + 6ζ2 f + 12λ2ζ2 g

2cα 8ζ2 f + 8λ2ζ2 g + 1

R + 2λ2L2 g,maxcαζ2 g µg .

Suppose, we choose T 2 γµg log 8Rλ2L2 g,maxl2 g,max µ2g

, the above is further bounded as

r=0 hλ (xr, yr) 2 (f (x0, y0) f (x R, y R))

ηR + λ (g (x0, y0) g (x R, y R))

λ (v (x0)) v (x R))

R + 8λ2L2 g,maxc2α2ζ2 g µg + 6cαζ2 f + 12cαλ2ζ2 g

+4cαζ2 f + 4cαλ2ζ2 g + 1

R + 2λ2L2 g,maxcαζ2 g µg .

Rearranging the above, multiplying by 2 on both sides, and using only those terms that depend on λ and R, we get the desired order result of the theorem. This completes the proof.

G Proof of Theorem 2

In order to prove the Theorem, we need a bound on the stationary point of the penalty function (see Lemma 3. More specifically, we need a result of the form hλ (xr, yr) 2 ψ2 r. Towards this, consider equation 37 of Lemma 8

hλ (xr, yr) 2 (f (xr, yr) f (xr+1, yr+1)) + λ (g (xr, yr) g (xr+1, yr+1))

λ (v (xr+1) v (xr))) + βcαδ2 + η

2βλ2L2 g,maxl2 g,max µ2g

T + 2βλ2L2 g,maxcαζ2 g µg . (47)

Published in Transactions on Machine Learning Research (09/2025)

Dividing by λ2 and using the fact that min n β 2 , η

2 o = 1 2Lh = 1 2(Lf,max+λLg,max) and further rearranging, we get

1 λ2 hλ (xr, yr) 2 2Lf,max (f (xr, yr) f (xr+1, yr+1))

λ2 + 2Lg,max (f (xr, yr) f (xr+1, yr+1))

2Lf,max (g (xr, yr) g (xr+1, yr+1))

λ + 2Lg,max (g (xr, yr) g (xr+1, yr+1))

2Lf,max (v (xr+1) v (xr)))

λ 2Lg,max (v (xr+1) v (xr))) + 2cαδ2

+cαρ2 + 2λ2L2 g,maxl2 g,max λ2µ2g

T + 2λ2L2 g,maxcαζ2 g λ2µg .

Choosing T 2 γµg log 2λ2L2 g,maxl2 g,max µ2g

1 λ2 hλ (xr, yr) 2 2Lf,max (f (xr, yr) f (xr+1, yr+1))

λ2 + 2Lg,max (f (xr, yr) f (xr+1, yr+1))

2Lf,max (g (xr, yr) g (xr+1, yr+1))

λ + 2Lg,max (g (xr, yr) g (xr+1, yr+1))

2Lf,max (v (xr+1) v (xr)))

λ 2Lg,max (v (xr+1) v (xr))) + 2cαδ2

+cαρ2 + 1 λ2R + 2λ2L2 g,maxcαζ2 g λ2µg .

Summing from r = 0 to R 1 and simplifying, we get

r=0 hλ (xr, yr) 2

2Lf,max (f (xr, yr) f (xr+1, yr+1))

2Lg,max (f (xr, yr) f (xr+1, yr+1))

2Lf,max (g (xr, yr) g (xr+1, yr+1))

λ + 2Lg,max

r=0 (g (xr, yr) g (xr+1, yr+1))

2Lf,max (v (xr+1) v (xr)))

r=0 (v (xr+1) v (xr)))

2λ2L2 g,maxcαζ2 g λ2µg . (48)

Using the telescopic sum, we get PR 1 r=0 (f (xr, yr) f (xr+1, yr+1)) = f (x0, y0) f (x R, y R), PR 1 r=0 (g (xr, yr) g (xr+1, yr+1)) = g (x0, y0) g (x R, y R) and PR 1 r=0 (v (xr+1) v (xr)) = v (x R) v (x0). Using these results in equation 48, we get

1 λ2 hλ (xr, yr) 2 2Lf,max

λ2 + 2Lg,max

(f (x0, y0) f (x R, y R)) +

λ + 2Lg,max

(g (x0, y0) g (x R, y R)) + 2Lf,max

λ + 2Lg,max

(v (x0) v (x R)) + 2Rcαδ2

1 λ2 + 2Rλ2L2 g,maxcαζ2 g λ2µg . (49)

From Lemma 3, we know that

Viol R 2Rl2 f µgλ2 + 2 µgλ2

r=0 ψ2 r. (50)

Published in Transactions on Machine Learning Research (09/2025)

Substituting equation 49 in equation 50, we get

Viol R 2Rl2 f µgλ2 + 2

λ2 + 2Lg,max

(f (x0, y0) f (x R, y R)) +

λ + 2Lg,max

(g (x0, y0) g (x R, y R)) +

λ + 2Lg,max

(v (x0) v (x R)) +

µgλ2 + 2Rcαρ2

λ2 + 2Rλ2L2 g,maxcαζ2 g λ2µg . (51)

Now substituting for δ2 and ρ2, we get

Viol R 2Rl2 f µgλ2 + 2

λ2 + 2Lg,max

(f (x0, y0) f (x R, y R)) +

λ + 2Lg,max

(g (x0, y0) g (x R, y R)) +

λ + 2Lg,max

(v (x0) v (x R)) +

8λ2L2 g,maxl2 g,max µ2g

T + 8λ2L2 g,maxcαζ2 g µg + 6ζ2 f + 12λ2ζ2 g

µgλ2 8ζ2 f + 8λ2ζ2 g + 1

λ2 + 2Rλ2L2 g,maxcαζ2 g λ2µg .

Now, choosing T 2 ηµg log 8Rλ2L2 g,maxl2 g,max µ2g

Viol R 2Rl2 f µgλ2 + 2

λ2 + 2Lg,max

(f (x0, y0) f (x R, y R)) +

λ + 2Lg,max

(g (x0, y0) g (x R, y R)) +

λ + 2Lg,max

(v (x0) v (x R)) +

1 R + 8λ2L2 g,maxcαζ2 g µg + 6ζ2 f + 12λ2ζ2 g

µgλ2 8ζ2 f + 8λ2ζ2 g + 1

λ2 + 2Rλ2L2 g,maxcαζ2 g λ2µg . (52)

Dividing on both sides of the above equation by R results in the following average violation

R 2l2 f µgλ2 + 2 Rµg

λ2 + 2Lg,max

(f (x0, y0) f (x R, y R)) +

λ + 2Lg,max

(g (x0, y0) g (x R, y R)) +

λ + 2Lg,max

(v (x0) v (x R)) +

4cα Rµgλ2 + 32L2 g,maxc2α2ζ2 g µ2g + 24cαζ2 f µgλ2 + 48cαζ2 g µg +

16cαζ2 f µgλ2 + 16cαζ2 g µg + 1 Rλ2 + 2L2 g,maxcαζ2 g µg . (53)

Published in Transactions on Machine Learning Research (09/2025)

Now, retaining terms that depend on λ, R and α, we get the order result stated in the Theorem. This completes the proof.

H Examples of (α, c) Robust Aggregators

Though many robust aggregators have been proposed in the literature (Chen et al., 2017; Pillutla et al., 2022b), they do not satisfy the definition 1. Also there are many new attacks where the methods using these aggregators fail to converge.

In this section, we present the aggregators which when used with the bucketing algorithm, introduced in Karimireddy et al. (2021; 2020) satisfy (α, c)-robust aggregator definition in 1.

Geometric Median: Also known as Robust Federated Averaging (RFA) (Chen et al., 2017; Pillutla et al., 2022b) where aggregation is performed using geometric median:

GM(x1, x2, . . . , x N) := arg min x Rd

Coordinate-wise Median: CM is an aggregator which performs co-ordinate wise median:

[CM(x1, x2, . . . , x N)]j := Median([x1]j, [x2]j, . . . , [x N]j),

where [x]j is the jth component of vector x.

Krum: Krum finds a vector xk which is closest to the mean of the input vectors when n |B| 2 vectors are excluded

Krum(x1, x2, . . . , xn) := arg min x {x1,...,xn}

j Sj xj xi 2 .