# Distributed Bilevel Optimization with Communication Compression

Yutong He*1, Jie Hu*1, Xinmeng Huang2, Songtao Lu3, Bin Wang4, Kun Yuan1,5,6

*Equal contribution. 1Peking University, 2University of Pennsylvania, 3IBM Research, 4Zhejiang University, 5National Engineering Laboratory for Big Data Analytics and Applications, 6AI for Science Institute, Beijing, China. Correspondence to: Kun Yuan. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Stochastic bilevel optimization tackles challenges involving nested optimization structures. Its fast-growing scale nowadays necessitates efficient distributed algorithms. In conventional distributed bilevel methods, each worker must transmit full-dimensional stochastic gradients to the server at every iteration, leading to significant communication overhead and thus hindering efficiency and scalability. To resolve this issue, we introduce the first family of distributed bilevel algorithms with communication compression. The primary challenge in algorithmic development is mitigating the bias in hypergradient estimation caused by the nested structure. We first propose C-SOBA, a simple yet effective approach with unbiased compression and provable linear-speedup convergence. However, it relies on a strong bounded-gradient assumption. To address this limitation, we explore the use of moving average, error feedback, and multi-step compression in bilevel optimization, resulting in a series of advanced algorithms with relaxed assumptions and improved convergence properties. Numerical experiments show that our compressed bilevel algorithms can achieve a 10× reduction in communication overhead without severe performance degradation.

1. Introduction

Large-scale optimization and learning have emerged as indispensable tools in numerous applications. Solving such large and intricate problems poses a formidable challenge, usually demanding hours or days to complete. Consequently, it is imperative to expedite large-scale optimization and learning with distributed algorithms. In distributed learning, multiple workers collaborate to solve a global problem through inter-worker communications. In most current implementations (Smola & Narayanamurthy, 2010; Li et al., 2014; Strom, 2015; Gibiansky, 2017), each worker transmits full-dimensional gradients to a central server for updating model parameters. Since the size of full-dimensional gradients is massive, communicating them per iteration incurs substantial overhead, which impedes algorithmic efficiency and scalability (Seide et al., 2014; Chilimbi et al., 2014). To mitigate this issue, communication compression (Alistarh et al., 2017; Bernstein et al., 2018; Stich et al., 2018; Richtárik et al., 2021; Huang et al., 2022) has been developed to reduce overhead. Instead of transmitting full gradient/model tensors, these strategies communicate compressed tensors with substantially smaller sizes per iteration. Two prevalent approaches to compression are quantization and sparsification. Quantization (Alistarh et al., 2017; Horvath et al., 2019; Seide et al., 2014) involves mapping input tensors from a large, potentially infinite, set to a smaller set of discrete values, such as 1-bit quantization (Seide et al., 2014) or natural compression (Horvath et al., 2019).
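As a concrete illustration of the quantization approach just described, below is a minimal NumPy sketch of stochastic multi-level quantization in the spirit of QSGD (Alistarh et al., 2017). The function name, the default number of levels, and the coding details are our own simplifications rather than the exact operator of any cited work; the rounding probabilities are chosen so that the output is an unbiased estimate of the input.

```python
import numpy as np

def qsgd_quantize(x, s=4, rng=None):
    """Stochastic s-level quantization (QSGD-style sketch, not the exact cited scheme).

    Each |x_i| / ||x||_2 is randomly rounded to one of s+1 levels so that
    E[output] = x, i.e. the compressor is unbiased.
    """
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    scaled = np.abs(x) / norm * s            # values in [0, s]
    lower = np.floor(scaled)                 # lower quantization level
    prob = scaled - lower                    # probability of rounding up
    level = lower + (rng.random(x.shape) < prob)
    return np.sign(x) * level * norm / s     # only sign, level index, and norm need transmission
```

In practice only the norm, the signs, and the small integer level indices are encoded and sent, which is where the bit savings come from.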
In contrast, sparsification (Wangni et al., 2018; Stich et al., 2018; Safaryan et al., 2021) entails dropping a certain number of entries to obtain sparse tensors for communication, such as the rand-K or top-K compressors (Stich et al., 2018). Both approaches have demonstrated strong empirical performance in communication savings.

Communication compression has widespread application in single-level stochastic optimization. However, many machine learning tasks, including adversarial learning (Madry et al., 2018), meta-learning (Bertinetto et al., 2019), hyperparameter optimization (Franceschi et al., 2018), reinforcement learning (Hong et al., 2023), neural architecture search (Liu et al., 2018), and imitation learning (Arora et al., 2020), involve upper- and lower-level optimization formulations that go beyond the conventional single-level paradigm. Addressing such nested problems has prompted substantial attention towards stochastic bilevel optimization (Ghadimi & Wang, 2018; Ji et al., 2021). While tremendous efforts have been made (Yang et al., 2022; Chen et al., 2022; Tarzanagh et al., 2022; Yang et al., 2023b) to solve distributed stochastic bilevel optimization, no existing algorithms, to the best of our knowledge, have been developed under communication compression. To fill this gap, this paper provides the first comprehensive study on distributed stochastic bilevel optimization with communication compression.

1.1. Distributed Bilevel Optimization

We consider distributed stochastic bilevel problems with the following nested upper- and lower-level structure:

$$\min_{x \in \mathbb{R}^{d_x}} \; \Phi(x) \triangleq \frac{1}{n}\sum_{i=1}^{n} f_i(x, y^\star(x)), \tag{1a}$$
$$\text{s.t.} \;\; y^\star(x) = \operatorname*{argmin}_{y \in \mathbb{R}^{d_y}} \Big\{ g(x, y) \triangleq \frac{1}{n}\sum_{i=1}^{n} g_i(x, y) \Big\}. \tag{1b}$$

Here, $n$ denotes the number of workers, where each worker $i$ privately owns its upper-level cost function $f_i : \mathbb{R}^{d_x} \times \mathbb{R}^{d_y} \to \mathbb{R}$, lower-level cost function $g_i : \mathbb{R}^{d_x} \times \mathbb{R}^{d_y} \to \mathbb{R}$, and local data distributions $\mathcal{D}_{f_i}, \mathcal{D}_{g_i}$ such that $f_i(x, y) \triangleq \mathbb{E}_{\phi \sim \mathcal{D}_{f_i}}[F(x, y; \phi)]$ and $g_i(x, y) \triangleq \mathbb{E}_{\xi \sim \mathcal{D}_{g_i}}[G(x, y; \xi)]$. The objective for all workers is to find a global solution to bilevel problem (1). Typical applications of problem (1) can be found in (Yang et al., 2021; Tarzanagh et al., 2022).

1.2. Challenges in Compressed Bilevel Optimization

Conceptually, if each worker $i$ could access the accurate oracle function $f_i^\star(x) \triangleq f_i(x, y^\star(x))$ without any sampling noise, a straightforward framework to solve (1) under compressed communication (with compressors $\{C_i\}_{i=1}^{n}$) is

$$x^{k+1} = x^k - \frac{1}{n}\sum_{i=1}^{n} C_i\big(\nabla f_i^\star(x^k)\big), \tag{2}$$

where each worker transmits the compressed hypergradient to the server to update model parameters. However, update (2) demands an accurate estimate of the hypergradient $\nabla f_i^\star(x)$, which can be written as (Ghadimi & Wang, 2018)

$$\nabla f_i^\star(x) = \nabla_x f_i(x, y^\star(x)) - \nabla^2_{xy} g(x, y^\star(x)) \big[\nabla^2_{yy} g(x, y^\star(x))\big]^{-1} \nabla_y f_i(x, y^\star(x)), \tag{3}$$

where

$$\nabla^2_{xy} g(x, y^\star(x)) = \frac{1}{n}\sum_{i=1}^{n} \nabla^2_{xy} g_i(x, y^\star(x)), \tag{4a}$$
$$\nabla^2_{yy} g(x, y^\star(x)) = \frac{1}{n}\sum_{i=1}^{n} \nabla^2_{yy} g_i(x, y^\star(x)). \tag{4b}$$

It is challenging to precisely evaluate $\nabla f_i^\star(x)$ through (3) in distributed bilevel optimization, particularly under compressed communication, for the following reasons:

Unavailable $y^\star(x)$. The solution $y^\star(x)$ to problem (1b) is not directly accessible. Existing literature (Ghadimi & Wang, 2018; Ji et al., 2021) often introduces iterative loops to approximately solve problem (1b), leading to expensive computation costs and biased estimates of $y^\star(x)$.

Inexact Hessian inversion. Even provided with the accurate $y^\star(x)$, it is cumbersome to evaluate the global $\nabla^2_{xy} g(x, y^\star(x))$ and $[\nabla^2_{yy} g(x, y^\star(x))]^{-1}$ through (4) as it incurs expensive matrix communication.
Recent works (Tarzanagh et al., 2022; Xiao & Ji, 2023) propose to communicate imprecise Hessian/Jacobian-vector products achieved by approximate implicit differentiation, which inevitably introduces bias in estimating f i (x). Compression-incurred distortion. As indicated by (3) and (4), specific Jacobean matrices shall be communicated to tackle sub-problem (1b). One may consider using the compressed proxies Ci( 2 yy gi(x, y (x))) and Ci( 2 xy gi(x, y (x))) to replace Jacobians matrices in (4). However, the compression incurs information distortion, which brings additional bias when evaluating f i (x). To summarize, practical bilevel algorithms with communication compression essentially perform xk+1 = xk 1 i=1 Ci f i (xk) + bias (5) rather than (2), where the bias originates from the nested bilevel structure of (1), as opposed to data sampling or communication compression. This bias term poses substantial challenges in developing distributed bilevel algorithms with communication compression. Most existing single-level compression techniques, including error feedback (Stich et al., 2018; Richt arik et al., 2021) and multi-step compression (Huang et al., 2022; He et al., 2023a), require unbiased estimates of gradients1 (i.e., f i (x)) every iteration, and are thus not directly applicable to bilevel problem (1). This calls for the urgent need to develop new algorithms that can effectively mitigate the bias incurred by the nested bilevel structure, as well as new analyses to clarify how this bias impacts the convergence of compressed bilevel algorithms. 1.3. Contributions and Main Results Contributions. This paper develops the first set of bilevel algorithms with communication compression. SOBA (Dagr eou et al., 2022) is a newly introduced singleloop bilevel algorithm with lightweight communication and computation. While SOBA still suffers from biased hypergradient estimates, we surprisingly find applying unbiased compression directly to SOBA yields a simple yet effective compressed bilevel algorithm, which is denoted as distributed SOBA with communication Compression, or C-SOBA for brevity. Under the strong assumption of bounded gradients, we establish its convergence as well as computational and communication complexities2. 1While error feedback and multi-step compression accommodate biased compressors, they need accurate or unbiased gradients. 2Throughout the paper, the computational and communication complexities are referred to in an asymptotic sense, see Table 1. Distributed Bilevel Optimization with Communication Compression Table 1. Comparison between distributed bilevel algorithms with communication compression. For simplicity, we unify the compression variance and heterogeneity bounds in both upper and lower levels. Notation n is the number of workers, ϵ is the target precision such that E Φ(ˆx) 2 2 ϵ, ω is compression-related parameter (see Assumption 2.4), σ2 is the variance of stochastic oracles, b2 bounds the gradient dissimilarity. We also list the best-known single-level compression algorithm in the bottom line for reference. Algorithms #A. Comp. #A. Comm.3 Single Loop Mechanism Heter. Asp. Implement C-SOBA (Alg. 1 green) (1+ω)σ2+ωb2 nϵ2 ωb2 nϵ2 + 1+ω/n CM-SOBA (Alg. 1 pink) (1+ω)σ2+ωb2 nϵ2 ωb2 nϵ2 + 1+ω/n +MSC (Alg. 4) σ2 nϵ2 1+ω ϵ MA + MSC BGD , EF-SOBA (Alg. 2) (1+ω)7σ2 nϵ2 1+ω+ω5/n ϵ EF + MA None / +MSC (Alg. 
5) σ2 nϵ2 1+ω ϵ EF + MSC None / Single-level EF21-SGDM (Fatkhullin et al., 2023) σ2 nϵ2 1+ω ϵ EF + MA None / 3 Asymptotic communication complexity: number of communication rounds when σ 0 (smaller is better). Asymptotic computational complexity: number of gradient/Jacobian evaluations per worker when ϵ 0 (smaller is better). Compression mechanisms: MA , EF , and MSC refer to as moving average, error feedback, and multi-step compression. Data heterogeneity assumptions (fewer/milder is better). BG and BGD denote bounded gradients (Assumption 3.2) and bounded gradient dissimilarity (Assumptions 3.1 and 4.1), respectively. BG is much more restrictive than BGD . Easy to implement or not. , indicates easy to implement and / indicates relatively harder to implement due to the EF mechanism. With gradient upper bound Bx (Assumption 3.2), a more precise complexity is ωb2 ω(n+ω)2b2B4x nϵ4/3 + (1+ω/n)(1+Bx) While commonly adopted in literature, the boundedgradient assumption of C-SOBA is restrictive. To address this limitation, we leverage Moving average to enhance the theoretical performance of C-SOBA, proposing the refined CM-SOBA method. CM-SOBA converges under the more relaxed assumption of bounded heterogeneity with improved complexities, compared to C-SOBA. The convergence of CM-SOBA still relies on the magnitude of data heterogeneity. When local data distributions Dfi and Dgi differ drastically across workers, the performance of C-SOBA and CM-SOBA substantially degrade. To mitigate this issue, we further incorporate Error Feedback into CM-SOBA, leading to the EF-SOBA algorithm. EF-SOBA does not rely on any assumptions regarding data heterogeneity, making it suitable for applications with severe data heterogeneity. Finally, owing to the bias in (5) incurred by the nested structure, the established communication and computation complexities of the aforementioned compressed bilevel algorithms are less favorable compared to the bestknown single-level compressed algorithms (Huang et al., 2022; Fatkhullin et al., 2023). Consequently, we utilize multi-step compression to enhance the convergence of C-SOBA and EF-SOBA, attaining the same complexities as the best-known single-level compressed algorithms. Results in Table 1. All established algorithms, along with their assumptions and complexities are listed in Table 1. It is noteworthy that the utilization of more advanced mechanisms relaxes assumptions and substantially improves complexities, albeit introducing increased intricacy in algorithmic structures and implementations. Furthermore, our algo- 0 25 50 75 100 125 150 175 200 Iteration Test accuracy 180 190 200 0.91 NC-SOBA C-SOBA CM-SOBA EF-SOBA 0 1 2 3 4 5 #Bits / n 1e7 Test accuracy (4.7e+07, 0.9) (2.2e+06, 0.9) NC-SOBA C-SOBA CM-SOBA EF-SOBA Figure 1. Hyper-representation on MNIST under homogeneous data distributions. NC-SOBA indicates non-compressed SOBA. rithms can achieve the same theoretical complexities as the best-known single-level compression algorithm (Fatkhullin et al., 2023), demonstrating its efficacy in overcoming the bias incurred by the nested structure in bilevel problems. Experiments. Our numerical experiments demonstrate that the proposed algorithms can achieve 10 reduction in communicated bits, compared to non-compressed distributed bilevel algorithms, see Fig. 1 and Sec. 7. Analysis. Our analysis also contributes new insights. They furnish convergence guarantees even when utilizing biased gradient estimates in compressors as shown in (5). 
Additionally, they elucidate how upperand lower-level compression exerts distinct influences on convergence, enlightening the compressor selection for upperand lower-level problems. 1.4. Related Work Bilevel optimization. A key challenge in bilevel optimization lies in accurately estimating the hypergradient Φ(x). Various algorithms have emerged to tackle this challenge, employing techniques such as approximate implicit differentiation (Domke, 2012; Ghadimi & Wang, 2018; Grazzi et al., Distributed Bilevel Optimization with Communication Compression 2020; Ji et al., 2021), iterative differentiation (Franceschi et al., 2018; Maclaurin et al., 2015; Domke, 2012; Grazzi et al., 2020; Ji et al., 2021), and Neumann series (Chen et al., 2021; Hong et al., 2023). However, these methods introduce additional inner loops that lead to increased computational overhead and deteriorated computation complexity. A recent work (Dagr eou et al., 2022) introduces SOBA, a novel single-loop framework, to enable simultaneous updates of the lowerand upper-level variables. Recent efforts have been made to develop distributed bilevel algorithms within the federated learning setup (Tarzanagh et al., 2022; Yang et al., 2021; Huang et al., 2023a) and decentralized scenarios (Yang et al., 2022; Chen et al., 2022; 2023a; Lu et al., 2022a; Gao et al., 2023). However, bilevel optimization with communication compression has not been studied in existing literature to our knowledge. Communication compression. Communication compression shows notable success in single-level distributed optimization (Alistarh et al., 2017; Bernstein et al., 2018; Stich et al., 2018). Two main approaches are quantization and sparsification. Quantization strategies include Sign-SGD (Seide et al., 2014; Bernstein et al., 2018), Turn Grad (Wen et al., 2017), and natural compression (Horvath et al., 2019). On the other hand, classical sparsification strategies involve rand-K and top-K (Stich, 2019; Wangni et al., 2018). Compression introduces information distortion, which slows down convergence and incurs more communication rounds to achieve desired solutions. Various advanced techniques such as error feedback (Richt arik et al., 2021; Stich et al., 2018), multi-step compression (Huang et al., 2022; He et al., 2023a), and momentum (Fatkhullin et al., 2023; Huang et al., 2023b) are developed to effectively mitigate the impact of compression-incurred errors. Furthermore, the optimal convergence for single-level distributed optimization with communication compression is established in (Huang et al., 2022; He et al., 2023a). However, none of these results have been established for bilevel stochastic optimization. 2. Preliminaries Notations. For a second-order continuously differentiable function f : Rdx Rdy R, we denote xf(x, y) and yf(x, y) as the partial gradients in terms of x and y, respectively. Correspondingly, 2 xyf(x, y) Rdx dy and 2 yyf(x, y) Rdy dy represent its Jacobian matrices. The full gradient of f is represented as f(x, y) xf(x, y) , yf(x, y) . Basic assumptions. We now introduce some basic assumptions needed throughout theoretical analysis. Assumption 2.1 (CONTINUITY). For any i (1 i n), function fi is Cf-Lipschitz continuous with respect to y; functions fi, gi, 2 xygi, 2 yygi are Lipschitz continuous with constants Lf, Lg, Lgxy, Lgyy, respectively. Assumption 2.2 (STRONG CONVEXITY). For any i (1 i n), gi is µg-strongly convex with respect to y. 
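Since the experiments in Sec. 7 use the scaled rand-K compressor, a minimal NumPy sketch of such an ω-unbiased compressor is given below; the function name is ours. For this operator, E[C(x)] = x and E||C(x) − x||² = (d/K − 1)||x||², so it satisfies Assumption 2.4 with ω = d/K − 1: a smaller K means more aggressive compression and a larger ω.

```python
import numpy as np

def scaled_rand_k(x, k, rng=None):
    """Scaled rand-K sparsifier: keep k uniformly chosen coordinates, rescale by d/k.

    Unbiased (E[C(x)] = x) with compression variance
    E||C(x) - x||^2 = (d/k - 1) ||x||^2, i.e. omega = d/k - 1 in Assumption 2.4.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = x[idx] * (d / k)
    return out
```

For example, with K = d/10 each worker transmits one tenth of the entries (plus their indices) per round and the compressor has ω = 9.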
In the above assumptions, we allow data heterogeneity to exist across different workers, i.e., every (fi, gi) differs from each other. Furthermore, we do not assume the Lipschitz continuity of fi with respect to x, which relaxes the assumptions used in prior works (Lu et al., 2022b; Yang et al., 2023a; Huang et al., 2023a; Chen et al., 2022; Lu et al., 2022a; Yang et al., 2022). Assumption 2.3 (STOCHASTIC NOISE). There exists σ 0 such that for any i (1 i n), and any x Rdx, y Rdy, the gradient oracles satisfy: Eϕ Dfi[ F(x, y; ϕ)] = fi(x, y), Eξ Dgi[ y G(x, y; ξ)] = ygi(x, y), h F(x, y; ϕ) fi(x, y) 2 2 i σ2, h y G(x, y; ξ) ygi(x, y) 2 2 i σ2; the Jacobian oracles satisfy: Eξ Dgi 2 xy G(x, y; ξ) = 2 xygi(x, y), Eξ Dgi 2 yy G(x, y; ξ) = 2 yygi(x, y), h 2 xy G(x, y; ξ) 2 xygi(x, y) 2 h 2 yy G(x, y; ξ) 2 yygi(x, y) 2 The following notion is for compressors. Assumption 2.4 (UNBIASED COMPRESSION). A compressor C( ) : Rd C Rd C is ω-unbiased (ω 0), if for any input x Rd C, we have E[C(x)] = x, and E h C(x) x 2 2 i ω x 2 2. Different compressors yield different values for ω. Generally speaking, a large ω indicates more aggressive compression and, consequently, induces more information distortion. Below, we also assume the conditional independence among all local compressors, i.e., the outputs of local compressors are mutually independent, conditioned on the inputs. 3. Compressed SOBA SOBA (Dagr eou et al., 2022) is a single-loop bilevel algorithm with lightweight communication and computational costs, originally devised for single-node optimization. In this section, we extend SOBA to address distributed bilevel optimization (1) and then incorporate communication compression, resulting in our first compressed bilevel algorithm. Non-compressed SOBA. To address the Hessian-inversion issue when evaluating the hypergradient Φ(x) for problem (1), SOBA introduces z [ 2 yyg(x, y (x))] 1 Distributed Bilevel Optimization with Communication Compression yf(x, y (x)), which can be viewed as the solution to the following distributed optimization problem: 2z 2 yygi(x, y (x)) z+z yfi(x, y (x)) . By simultaneously solving the lower-level problem, estimating the Hessian-inverse-vector product, and minimizing the upper-level problem, we achieve distributed recursions: xk+1 = xk α 2 xygi(xk, yk)zk + xfi(xk, yk) , yk+1 = yk β i=1 ygi(xk, yk), zk+1 = zk γ 2 yygi(xk, yk)zk + yfi(xk, yk) , where a central server collects local information to update global variables. We call this algorithm non-compressed SOBA (NC-SOBA). NC-SOBA accommodates stochastic variants by introducing noisy gradient/Jacobian oracles. Compressed SOBA. When each worker compresses its information before communicating with the central server, we obtain Compressed SOBA, or C-SOBA for short. To detail the algorithm, we let each worker i independently sample data ϕk i Dfi and ξk i Dgi, and calculate Dk x,i 2 xy G(xk, yk; ξk i )zk + x F(xk, yk; ϕk i ), (6a) Dk y,i y G(xk, yk; ξk i ), (6b) Dk z,i 2 yy G(xk, yk; ξk i )zk + y F(xk, yk; ϕk i ). (6c) Next, worker i transmits Cu i (Dk x,i), Cℓ i (Dk y,i), and Cℓ i (Dk z,i) to the central server where Cu i and Cℓ i are the upper-level ωuand lower-level ωℓ-unbiased compressors utilized by worker i. The implementation of C-SOBA is listed in Algorithm 1 where the update of xk+1 follows the green line. A Clip operation is conducted before updating z to boost the algorithmic performance, in which Clip( zk+1; ρ) min 1, ρ/ zk+1 2 zk+1. Intuition behind clipping operation. 
The variable z is intended to estimate z (x), which, under Assumptions 2.1 and 2.2, should not exceed a magnitude of Cf/µg (refer to Lemma B.2). However, during gradient descent steps, as in our algorithms, the update of z may surpass this limit. Therefore, it is natural to opt for the nearest neighbor of z with a magnitude no greater than Cf/µg instead. A rough estimation ρ Cf/µg suffices as an upper bound. Alternatively, we can directly confine z to the domain B(0, ρ), the closed ball of dimension dy centered at 0 with a radius of ρ, using projected gradient descent. Both approaches Algorithm 1 C-SOBA and CM-SOBA Input: α, β, γ, ρ, x0, y0, z0( z0 2 ρ), h0 x ; for k = 0, 1, , K 1 do on each worker: Compute Dk x,i, Dk y,i, Dk z,i as in (6); Send Cu i (Dk x,i), Cℓ i (Dk y,i), Cℓ i (Dk z,i) to the server; on server: xk+1 = xk (α/n) Pn i=1 Cu i (Dk x,i) (C-SOBA) ; hk+1 x = (1 θ)hk x + (θ/n) Pn i=1 Cu i (Dk x,i) ; xk+1 = xk α hk x (CM-SOBA) ; yk+1 = yk (β/n) Pn i=1 Cℓ i (Dk y,i); zk+1 = zk (γ/n) Pn i=1 Cℓ i (Dk z,i); zk+1 = Clip( zk+1; ρ); Broadcast xk+1, yk+1, zk+1 to all workers; end for result in the projection operation zk+1 = PB(0,ρ)( zk+1), equivalent to the clipping operation zk+1 = Clip( zk+1, ρ). Here, PΩ( ) denotes projection onto a closed convex set Ω. The justification for this operation in theory is straightforward; zk+1 is always closer than (or at an equal distance with) zk+1 to z (xk+1). In particular, assuming ρ Cf/µg, the non-expansiveness property of projection operators allows us to deduce: zk+1 z xk+1 2 = PB(0,ρ) zk+1 PB(0,ρ) z xk+1 2 zk+1 z xk+1 2 . Convergence. To establish the convergence for C-SOBA, we need more assumptions beyond those discussed in Sec. 2. Assumption 3.1 (BOUNDED HETEROGENEITY). There exist constants bf 0, bg 0, such that for any x Rdx, y Rdy, it holds that fi(x, y) f(x, y) 2 2 b2 f, ygi(x, y) yg(x, y) 2 2 b2 g, 2 xygi(x, y) 2 xyg(x, y) 2 2 b2 g, 2 yygi(x, y) 2 yyg(x, y) 2 For conciseness, we present results with the notation b2 max{b2 f, b2 g} in the main text and defer the detailed counterparts associated with b2 f and b2 g to Appendix B. Assumption 3.2 (BOUNDED GRADIENTS). There exists constant Bx 0, such that for any (xk, yk, zk) generated Distributed Bilevel Optimization with Communication Compression by C-SOBA(Alg. 1), we have 2 xyg(xk, yk)zk + xf(xk, yk) 2 2 B2 x. (8) It is noteworthy that the above assumption is milder than the Lfx-Lipschitz continuity of f with respect to x (Lu et al., 2022b; Yang et al., 2023a; Huang et al., 2023a; Chen et al., 2022; Lu et al., 2022a; Yang et al., 2022), which implies Bx 2Lg Cf/µg + Assumption 3.3 (2ND-ORDER SMOOTHNESS). Jacobian matrices 2 xyg, 2 yyg are Lgxy - and Lgyy -smooth, 2f is Lff-Lipschitz continuous. Theorem 3.4. Under Assumptions 2.1 2.4 and 3.1 3.3, if we set the hyperparameters as in Appendix B.1, C-SOBA converges as k=0 E h Φ(xk) 2 (1 + ωℓ+ ωu) σ + p + 3 4 ((1 + ωℓ)σ2 + ωℓb2) 1 4 p 1 + ωu/n Bx n1/4K 3 4 + 3 4 (1 + ωℓ)σ2 + ωℓb2 1 4 (1 + ωu)σ2 + ωub2 1 (1 + ωu)(1 + ωℓ/n) σ + p ωu(1 + ωℓ/n)b n K (1 + ωℓ/n)(1 + ωu/n) Bx + (1 + ωℓ/n + ωu/n) where max{Φ(x0), y0 y (x0) 2 2, z0 z (x0) 2 2}. Asymptotic complexities. C-SOBA achieves asymptotic linear speedup with a rate of O(1/ n K) as the number of iterations K . This corresponds to an asymptotic sampling/computational complexity of O(1/(nϵ2)) as ϵ 0. Furthermore, C-SOBA asymptotically requires O(ω/(nϵ2)) communication rounds to achieve an ϵ-accurate solution when ϵ 0, see more details in Table 1. Consistency with non-compression methods. 
The leading term in (9) reduces to O( n K) when ωℓ= ωu = 0, which is consistent with non-compression algorithms. A recommended choice of ωu and ωl. According to (He et al., 2023b), an ω-unbiased compressor C(x) with input dimension d will transmit at least O(d/(1 + ω)) bits per communication round. If ωu + ωℓis bounded away from 0, C-SOBA will asymptotically transmit (ωℓ+ ωu) (σ2 + b2) nϵ2 | {z } asym. comm. rounds dx 1 + ωu + dy 1 + ωℓ | {z } bits per round bits to achieve an ϵ-accurate solution. To minimize the communicated bits, a recommended choice for ωℓand ωu satisfies (1 + ωu)/(1 + ωℓ) = Θ q When using rand-K compressors, we can set different values of K for lowerand upper-level compression to achieve the recommended relation of ωℓand ωu that satisfy (10). We refer the readers to the ablation experiments on different choices in Appendix D.4. 4. CM-SOBA Algorithm Although simple and effective, C-SOBA relies on strong assumptions, particularly Assumption 3.2 on bounded gradients. Moreover, the typically large upper bound Bx significantly hampers the convergence performance. These strong assumptions and inferior convergence complexities are attributed to the bias introduced by the nested bilevel structure, as elucidated in (5). CM-SOBA. To enhance the convergence properties, we introduce a momentum procedure hk+1 x = (1 θ)hk x + (θ/n) Pn i=1 Cu i (Dk x,i) to tweak the descent direction of x, i.e., xk+1 = xk αhk x. We refer to this new algorithm as Compressed SOBA with Momentum, abbreviated as CM-SOBA. The implementation is outlined in Algorithm 1, where the update of xk+1 is indicated by the pink color. CM-SOBA is inspired by the momentum-based algorithm (Chen et al., 2023b) which eliminates the dependence on Assumption 3.2 in the single-node scenario. Convergence. With an additional momentum step, CMSOBA converges with more relaxed assumptions. In particular, it removes the strong assumption on bounded gradients. Assumption 4.1 (POINT-WISE BOUNDED HETEROGENEITY). There exist constants bf 0, bg 0, such that for any x Rdx, y = y (x), (7) holds. Assumption 4.1 is weaker than Assumption 3.1 since it only assumes bounds at (x, y (x)), as opposed to arbitrary (x, y). By writing b2 max{b2 f, b2 g}, we have the follows. Theorem 4.2. Under Assumptions 2.1 2.4 and 4.1, if we set the hyperparameters as in Appendix B.2, CM-SOBA converges as k=0 E h Φ(xk) 2 Distributed Bilevel Optimization with Communication Compression (1 + ωℓ+ ωu) σ + p + (1 + ωℓ/n + ωu/n) in which max{Φ(x0), h0 x Φ(x0) 2 2, y0 y (x0) 2 2, z0 z (x0) 2 2}. Improved complexities. CM-SOBA achieves an asymptotic linear speedup rate under more relaxed assumptions, compared to C-SOBA. Furthermore, by eliminating the influence of the gradient upper bound Bx on the convergence, we observe from (11) that CM-SOBA enjoys a faster rate, especially when Bx is large. 5. EF-SOBA Algorithm Despite the simplicity, C-SOBA and CM-SOBA rely on the restrictive Assumptions 3.1 and 4.1 concerning bounded data heterogeneity. When local data distributions Dfi and Dgi differ drastically across workers, the bounded-dataheterogeneity assumption can be violated, significantly deteriorating the convergence performance of C-SOBA and CMSOBA. This section is devoted to developing compressed bilevel algorithms that are robust to data heterogeneity. Error feedback to upper-level compressors. When updating x in the upper-level optimization, each worker i in C-SOBA or CM-SOBA transmits Cu i (Dk x,i) to the central server at each iteration k. 
Since Dk x,i does not approach zero due to the sampling randomness and data heterogeneity, Cu i (Dk x,i) does not converge to Dk x,i either. This reveals that compression-incurred distortion persists even when k , explaining why C-SOBA or CM-SOBA necessitates Assumption 3.1 or 4.1 to bound compression distortions. Inspired by (Fatkhullin et al., 2023), we employ error feedback to alleviate the impact of data heterogeneity when solving the upper-level optimization. Consider recursions hk+1 x,i = (1 θ)hk x,i + θ Dk x,i (12a) mk+1 x,i = mk x,i + δu Cu i (hk+1 x,i mk x,i), (12b) ˆhk+1 x = ˆhk x + δu i=1 Cu i (hk+1 x,i mk x,i), (12c) xk+1 = xk α ˆhk x, (12d) in which δu is a positive scaling coefficient, mk x,i is an auxiliary variable to track hk x,i, and ˆhk x = (1/n) Pn i=1 mk x,i holds for any k 0. In each iteration k, it is the difference hk x,i mk x,i that is compressed and transmitted instead of hk x,i itself. When mk x,i converges to a fixed point as k , we have mk x,i hk x,i from (12b) and hence ˆhk x (1/n) Pn i=1 hk x,i. In other words, error feedback (12) removes the compression-incurred distortion asymptotically, making x update along the exact momentum direction even when data heterogeneity exists. Error feedback to lower-level compressors. One may naturally wonder whether the same error feedback technique (12) can be used for the lower-level compressors, i.e., mk+1 y,i = mk y,i + δℓ Cℓ i (Dk y,i mk y,i), (13a) mk+1 z,i = mk z,i + δℓ Cℓ i (Dk z,i mk z,i), (13b) mk+1 y = mk y + δℓ i=1 Cℓ i (Dk y,i mk y,i), (13c) mk+1 z = mk z + δℓ i=1 Cℓ i (Dk z,i mk z,i), (13d) and let y and z update along the direction my and mz. The answer, however, is negative. Since the hypergradient Dx,i used in x-update relies heavily on the accurate values of y and z as shown in (6a), a more refined error feedback is needed for lower-level compressors. As illustrated in Appendix A, y and z must be updated following an unbiased estimate of their gradient direction. However, my and mz provide biased estimates under the presence of δu. To address this issue, we propose using ˆDk y =mk y + 1 i=1 Cℓ i (Dk y,i mk y,i) (14a) ˆDk z =mk z + 1 i=1 Cℓ i (Dk z,i) mk z,i). (14b) to update y and z for the unbiasedness (i.e., E[ ˆDk y] = E[Dk y] = y G(xk, yk) and the similar applies to z). The resulting algorithm, termed as compressed SOBA with Error Feedback, or EF-SOBA for short, is listed in Algorithm 2. Theorem 5.1. Under Assumptions 2.1 2.4, if hyperparameters are set as in Appendix B.3, EF-SOBA converges as k=0 E h Φ(xk) 2 (1 + ωu)2(1 + ωℓ)3/2 + ω1/3 u (1 + ωu)1/3 2/3σ2/3 K2/3 + (1 + ωu) + (1 + ωu)2p ωℓ(1 + ωℓ) n K + (1 + ωu)2ω3 ℓ n K where represents algorithmic initialization constants detailed in Appendix B.3. 
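To make the upper-level error-feedback mechanism concrete, the following single-process NumPy sketch simulates one iteration of recursions (12a)–(12d). The function name, the list-of-arrays representation of workers, and the generic `compress` argument (e.g., an ω-unbiased compressor such as scaled rand-K) are illustrative assumptions of ours, not the paper's implementation.

```python
import numpy as np

def ef_soba_upper_step(x, h_loc, m_loc, h_hat, D_loc, alpha, theta, delta_u, compress):
    """One upper-level EF-SOBA step following (12a)-(12d), simulated in-process.

    x      : current iterate x^k
    h_loc  : list of local momentum buffers h^k_{x,i}
    m_loc  : list of local error-feedback states m^k_{x,i}
    h_hat  : server-side direction, maintained equal to mean(m_loc)
    D_loc  : list of local stochastic hypergradient estimates D^k_{x,i} (as in (6a))
    compress : an omega-unbiased compressor, e.g. lambda v: scaled_rand_k(v, k)
    """
    n = len(D_loc)
    # (12d): server moves x along the tracked direction h_hat^k
    x_new = x - alpha * h_hat
    msgs = []
    for i in range(n):
        # (12a): moving average (momentum) of the local hypergradient estimate
        h_loc[i] = (1.0 - theta) * h_loc[i] + theta * D_loc[i]
        # only the innovation h^{k+1}_{x,i} - m^k_{x,i} is compressed and transmitted
        c = compress(h_loc[i] - m_loc[i])
        msgs.append(c)
        # (12b): local error-feedback state tracks the momentum buffer
        m_loc[i] = m_loc[i] + delta_u * c
    # (12c): server update preserves the invariant h_hat = mean(m_loc)
    h_hat = h_hat + delta_u * sum(msgs) / n
    return x_new, h_loc, m_loc, h_hat
```

When the states m_{x,i} stabilize, the transmitted innovations vanish and h_hat tracks the exact average momentum, which is why this update tolerates arbitrary data heterogeneity.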
According to the above theorem, EF-SOBA achieves linear speedup convergence without relying on Assumption Distributed Bilevel Optimization with Communication Compression Algorithm 2 EF-SOBA Input: α, β, γ, θ, ρ, δu, δℓ, x0, y0, z0( z0 2 ρ), {m0 x,i}, {m0 y,i}, {m0 z,i}, {h0 x,i}, ˆh0 x = 1 n Pn i=1 m0 x,i, m0 y = 1 n Pn i=1 m0 y,i, m0 z = 1 n Pn i=1 m0 z,i; for k = 0, 1, , K 1 do on worker: Compute Dk x,i, Dk y,i, Dk z,i as in (6); Update hk+1 x,i and mk+1 x,i as in (12a) and (12b); Update mk+1 y,i and mk+1 z,i as in (13a) and (13b); Send Cu i (hk+1 x,i mk x,i), Cℓ i (Dk y,i mk y,i), Cℓ i (Dk z,i mk z,i) to the server; on server: Update ˆDk y and ˆDk z as in (14a) and (14b); xk+1 = xk α ˆhk x; yk+1 = yk β ˆDk y; zk+1 = zk γ ˆDk z; zk+1 = Clip( zk+1, ρ); Update ˆhk+1 x , mk+1 y , mk+1 z as in (12c), (13c), (13d); Broadcast xk+1, yk+1, zk+1 to all workers; end for 3.1 or 4.1. Furthermore, the convergence of EF-SOBA is unaffected by data heterogeneity b2. 6. Convergence Acceleration While C-SOBA, CM-SOBA, and EF-SOBA can converge, their computational and communication complexities are inferior to those of the best-known single-level compression algorithms, such as EF21-SGDM (Loizou & Richt arik, 2020) and NEOLITHIC (Huang et al., 2022). In Appendix C.2, we leverage Multi-Step Compression (Huang et al., 2022) to expedite CM-SOBA and EF-SOBA, leading to CMSOBA-MSC and EF-SOBA-MSC, respectively, to achieve the same complexities as EF21-SGDM and NEOLITHIC, see Table 1. EF-SOBA-MSC converges as follows: Theorem 6.1. Under Assumptions 2.1 2.4, with hyperparameters in Appendix C.2, EF-SOBA-MSC converges as n T + (1 + ωℓ+ ωu) Θ(1) where T denotes the total number of iterations (i.e., #outerloop recursion #inner rounds in MSC) of EF-SOBA-MSC, and Θ hides logarithm terms independent of T. Similarly, CM-SOBA-MSC can converge at the same rate as given in (16). For details on algorithmic development and convergence properties, please refer to Appendix C. 7. Experiments In this section, we evaluate the performance of the proposed compressed bilevel algorithms on two problems: hyper- representation and hyperparameter optimization. To demonstrate the impact of data heterogeneity on the compared algorithms, we follow the approach in (Hsu et al., 2019) by partitioning the dataset using a Dirichlet distribution with a concentration parameter α = 0.1. We use the unbiased compressor, scaled rand-K, to compress communication, and K is specified differently in various experiments. Hyper-representation. Hyper-representation can be formulated as bilevel optimization in which the upper-level problem optimizes the intermediate representation parameter to obtain better feature representation on validation data, while the lower-level optimizes the weights of the downstream tasks on training data. We conduct the experiments on MNIST dataset with MLP and CIFAR-10 dataset with CNN. The detailed problem formulation and experimental setup can be found in Appendix D.1. Figure 1 compares C-SOBA, CM-SOBA, and EF-SOBA with non-compressed distributed SOBA (NC-SOBA) under homogeneous data distributions. It is observed that compressed algorithms can achieve a 10 reduction in communicated bits without substantial performance degradation. Figure 2 illustrates the performance under heterogeneous data distributions. It is observed that C-SOBA and CMSOBA deteriorate in this scenario. 
However, error feedback significantly benefits convergence: both the convergence (in terms of iterations) and the test accuracy of EF-SOBA are close to those of NC-SOBA. This is consistent with the theory that EF-SOBA is more robust to data heterogeneity. To justify its efficiency on various datasets and models, additional results on CIFAR-10 with CNN are provided in Appendix D.3. Furthermore, for more detailed discussions regarding the adjustment of the momentum parameter, please consult Appendix D.3.

Hyperparameter optimization. Hyperparameter optimization aims to enhance the performance of learning models by optimizing hyperparameters. In our experiments on the MNIST dataset using an MLP model, the left two figures in Fig. 3 depict the test accuracy under homogeneous data distributions. It is observed that all compressed bilevel algorithms perform on par with the non-compressed algorithm while achieving a 10× reduction in communicated bits. The right two figures illustrate the test accuracy under heterogeneous data distributions, further corroborating the superiority of EF-SOBA in resisting data heterogeneity. The problem formulation, experimental setup, and additional numerical results can be found in Appendix D.2.

Runtime comparison. Recognizing the importance of evaluating the practical trade-offs between computation and communication, we performed a runtime comparison to complement our theoretical findings with empirical evidence. The results of our hyper-representation experiment on the MNIST dataset with the MLP backbone are presented in Fig. 4, conducted under both homogeneous and heterogeneous data distributions.

[Figure 2. Hyper-representation on MNIST under heterogeneous data distributions: test accuracy versus iterations and versus communicated bits per worker (#Bits / n) for NC-SOBA, C-SOBA, CM-SOBA, and EF-SOBA.]

[Figure 3. Hyperparameter optimization on MNIST under homogeneous (left two figures) and heterogeneous (right two figures) data distributions: test accuracy versus iterations and versus communicated bits per worker.]

[Figure 4. Running time comparison under homogeneous (left) and heterogeneous (right) data distributions: test accuracy versus wall-clock time.]

It is evident that convergence with respect to running time closely aligns with that observed for communicated bits. Additionally, we observed that the computation time across all compared algorithms closely matches. This result highlights the computational efficiency of our proposed compression techniques, which introduce minimal computational overhead. Our experiments unequivocally demonstrate that communication time is the dominant factor affecting total runtime in a distributed scenario.

8.
Conclusion and Limitation This paper introduces the first set of distributed bilevel algorithms with communication compression and establishes their convergence guarantees. In experiments, these algorithms achieve a 10 reduction in communication overhead. However, our developed algorithms are only compatible with unbiased compressors, excluding biased but contractive compressors such as Top-K. In future work, we will explore bilevel algorithms with contractive compressors and investigate their convergence properties. Impact Statement This paper centers on the theoretical analysis of machine learning algorithm convergence. We do not foresee any significant societal consequences arising from our work, none of which we believe need to be explicitly emphasized. Acknowlegement The work of Kun Yuan is supported by Natural Science Foundation of China under Grants 12301392 and 92370121. Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. Qsgd: Communication-efficient sgd via gradient quantization and encoding. In Advances in Neural Information Processing Systems, 2017. Distributed Bilevel Optimization with Communication Compression Arora, S., Du, S., Kakade, S., Luo, Y., and Saunshi, N. Provable representation learning for imitation learning via bi-level optimization. In International Conference on Machine Learning, pp. 367 376. PMLR, 2020. Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anandkumar, A. Signsgd: compressed optimisation for nonconvex problems. In International Conference on Machine Learning, 2018. Bertinetto, L., Henriques, J., Torr, P., and Vedaldi, A. Metalearning with differentiable closed-form solvers. In International Conference on Learning Representations (ICLR), 2019. International Conference on Learning Representations, 2019. Chen, T., Sun, Y., and Yin, W. Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems. Advances in Neural Information Processing Systems, 34:25294 25307, 2021. Chen, X., Huang, M., and Ma, S. Decentralized bilevel optimization. ar Xiv preprint ar Xiv:2206.05670, 2022. Chen, X., Huang, M., Ma, S., and Balasubramanian, K. Decentralized stochastic bilevel optimization with improved per-iteration complexity. In International Conference on Machine Learning, pp. 4641 4671. PMLR, 2023a. Chen, X., Xiao, T., and Balasubramanian, K. Optimal algorithms for stochastic bilevel optimization under relaxed smoothness conditions. ar Xiv preprint ar Xiv:2306.12067, 2023b. Chilimbi, T., Suzue, Y., Apacible, J., and Kalyanaraman, K. Project adam: Building an efficient and scalable deep learning training system. In 11th USENIX symposium on operating systems design and implementation (OSDI 14), pp. 571 582, 2014. Dagr eou, M., Ablin, P., Vaiter, S., and Moreau, T. A framework for bilevel optimization that enables stochastic and global variance reduction algorithms. Advances in Neural Information Processing Systems, 35:26698 26710, 2022. Domke, J. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, pp. 318 326. PMLR, 2012. Fatkhullin, I., Tyurin, A., and Richt arik, P. Momentum provably improves error feedback! ar Xiv preprint ar Xiv:2305.15155, 2023. Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. Bilevel programming for hyperparameter optimization and meta-learning. In International conference on machine learning, pp. 1568 1577. PMLR, 2018. Gao, H., Gu, B., and Thai, M. T. 
On the convergence of distributed stochastic bilevel optimization algorithms over a network. In International Conference on Artificial Intelligence and Statistics, pp. 9238 9281. PMLR, 2023. Ghadimi, S. and Wang, M. Approximation methods for bilevel programming. ar Xiv preprint ar Xiv:1802.02246, 2018. Gibiansky, A. Bringing hpc techniques to deep learning. Baidu Research, Tech. Rep., 2017. Grazzi, R., Franceschi, L., Pontil, M., and Salzo, S. On the iteration complexity of hypergradient computation. In International Conference on Machine Learning, pp. 3748 3758. PMLR, 2020. He, Y., Huang, X., Chen, Y., Yin, W., and Yuan, K. Lower bounds and accelerated algorithms in distributed stochastic optimization with communication compression. ar Xiv preprint ar Xiv:2305.07612, 2023a. He, Y., Huang, X., and Yuan, K. Unbiased compression saves communication in distributed optimization: When and how much? ar Xiv preprint ar Xiv:2305.16297, 2023b. Hong, M., Wai, H.-T., Wang, Z., and Yang, Z. A twotimescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actorcritic. SIAM Journal on Optimization, 33(1):147 180, 2023. Horvath, S., Ho, C.-Y., Horvath, L., Sahu, A. N., Canini, M., and Richt arik, P. Natural compression for distributed deep learning. Ar Xiv, abs/1905.10988, 2019. Hsu, T. H., Qi, H., and Brown, M. Measuring the effects of non-identical data distribution for federated visual classification. Co RR, abs/1909.06335, 2019. URL http://arxiv.org/abs/1909.06335. Huang, M., Zhang, D., and Ji, K. Achieving linear speedup in non-iid federated bilevel learning. ar Xiv preprint ar Xiv:2302.05412, 2023a. Huang, X., Chen, Y., Yin, W., and Yuan, K. Lower bounds and nearly optimal algorithms in distributed learning with communication compression. Advances in Neural Information Processing Systems, 35:18955 18969, 2022. Huang, X., Li, P., and Li, X. Stochastic controlled averaging for federated learning with communication compression. ar Xiv preprint ar Xiv:2308.08165, 2023b. Ji, K., Yang, J., and Liang, Y. Bilevel optimization: Convergence analysis and enhanced design. In International conference on machine learning, pp. 4882 4892. PMLR, 2021. Distributed Bilevel Optimization with Communication Compression Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. Scaling distributed machine learning with the parameter server. In 11th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 14), pp. 583 598, 2014. Liu, H., Simonyan, K., and Yang, Y. Darts: Differentiable architecture search. ar Xiv preprint ar Xiv:1806.09055, 2018. Loizou, N. and Richt arik, P. Momentum and stochastic momentum for stochastic gradient, newton, proximal point and subspace descent methods. Computational Optimization and Applications, 77(3):653 710, 2020. Lu, S., Cui, X., Squillante, M. S., Kingsbury, B., and Horesh, L. Decentralized bilevel optimization for personalized client learning. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5543 5547. IEEE, 2022a. Lu, S., Zeng, S., Cui, X., Squillante, M., Horesh, L., Kingsbury, B., Liu, J., and Hong, M. A stochastic linearized augmented lagrangian method for decentralized bilevel optimization. 
Advances in Neural Information Processing Systems, 35:30638 30650, 2022b. Maclaurin, D., Duvenaud, D., and Adams, R. Gradientbased hyperparameter optimization through reversible learning. In International conference on machine learning, pp. 2113 2122. PMLR, 2015. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. Pedregosa, F. Hyperparameter optimization with approximate gradient. In International conference on machine learning, pp. 737 746. PMLR, 2016. Richt arik, P., Sokolov, I., and Fatkhullin, I. Ef21: A new, simpler, theoretically better, and practically faster error feedback. Ar Xiv, abs/2106.05203, 2021. Safaryan, M. H., Shulgin, E., and Richt arik, P. Uncertainty principle for communication compression in distributed and federated learning and the search for an optimal compressor. Information and Inference: A Journal of the IMA, 2021. Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In INTERSPEECH, 2014. Smola, A. and Narayanamurthy, S. An architecture for parallel topic models. Proceedings of the VLDB Endowment, 3(1-2):703 710, 2010. Stich, S. U. Local sgd converges fast and communicates little. In International Conference on Learning Representations (ICLR), 2019. Stich, S. U., Cordonnier, J.-B., and Jaggi, M. Sparsified sgd with memory. In Advances in Neural Information Processing Systems, 2018. Strom, N. Scalable distributed dnn training using commodity gpu cloud computing. Interspeech 2015, 2015. Tarzanagh, D. A., Li, M., Thrampoulidis, C., and Oymak, S. Fednest: Federated bilevel, minimax, and compositional optimization. In International Conference on Machine Learning, pp. 21146 21179. PMLR, 2022. Wangni, J., Wang, J., Liu, J., and Zhang, T. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, 2018. Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., and Li, H. H. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, 2017. Xiao, P. and Ji, K. Communication-efficient federated hypergradient computation via aggregated iterative differentiation. ar Xiv preprint ar Xiv:2302.04969, 2023. Yang, J., Ji, K., and Liang, Y. Provably faster algorithms for bilevel optimization. Advances in Neural Information Processing Systems, 34:13670 13682, 2021. Yang, S., Zhang, X., and Wang, M. Decentralized gossipbased stochastic bilevel optimization over communication networks. Advances in Neural Information Processing Systems, 35:238 252, 2022. Yang, Y., Xiao, P., and Ji, K. Achieving O(ϵ 1.5) complexity in hessian/jacobian-free stochastic bilevel optimization. ar Xiv preprint ar Xiv:2312.03807, 2023a. Yang, Y., Xiao, P., and Ji, K. Simfbo: Towards simple, flexible and communication-efficient federated bilevel learning. ar Xiv preprint ar Xiv:2305.19442, 2023b. Distributed Bilevel Optimization with Communication Compression A. Necessity of Unbiased Lower Level Gradients In this section, we demonstrate why the lower-level variables, y and z are supposed to be updated following an unbiased estimate of their gradient direction. First of all, it s noteworthy that the lower level in a bilevel optimization problem is not equivalent to a single level one. 
Recall the final goal of our bilevel algorithm is to optimize Φ(x), and y, z are actually auxiliary variables to help improve the estimate precision of the hypergradient Φ(x). Consequently, by following Lemma B.4, yk, zk are expected to be good estimation of y (xk) and z (xk), respectively. Thus, in single-loop algorithms, where lower/upper level variables are updated alternatively, the lower level cannot be regarded as a single-level optimization problem since xk varies in each iteration. Even if we assume the stability of upper-level solution xk, upper-level algorithms, e.g., EF21-SGDM which we use in EF-SOBA, cannot work well in the strongly-convex lower-level scenario. When we modify the non-convex objectives into a strongly-convex one, it s natural to believe that the algorithm can guarantee at least the same convergence rate as before, since the assumption has become stronger. However, it s worth noting that the convergence metric also varies in different settings. Specifically, the metric for a non-convex objective f(x) is usually f( x K) 2 2 with x K = 1 K+1 PK k=0 xk, while that for strongly-convex f(x) should be x K x 2 2 or f(x K) f(x ), where x is the minimum of f(x). In the typical SOBA structure, yk y (xk) 2 2 is the actually concerned metric, which negates applying EF21-SGDM for lower-level design. Next, we go through some technical issues to demonstrate the role of unbiased gradient estimates in this different optimization problem. Consider the following recursion for optimizing a strongly-convex objective f(x): xk+1 = xk αgk, where gk is a stochastic estimator of f(xk). When trying to establish descent inequalities with respect to E h xk x 2 2 i , we consider E h xk+1 x 2 i =E h xk αgk x 2 i = E h xk x 2 i 2αE xk x , gk + α2E h gk 2 If gk is an unbiased estimate of f(xk), E xk x , gk = E xk x , f(xk) can be directly used in proving descent inequalities, leaving α2E h gk 2 2 i the only term including noise. Otherwise, if gk is a biased one, an additional term 2αE xk x , gk f(xk) which is upper bounded by α2 t E h xk x 2 2 i + αt E h gk f(xk) 2 2 i will either include bigger noise (if t < 2) or deteriorate contraction of E h xk x 2 2 i (if t 2). Remark. Based on the above discussions, it is possible that non-convex algorithms like EF21-SGDM without unbiasedness can be applied to the lower level by using yk, zk for hypergradient estimation. Exploration of this type of algorithms is left to future work. B. Convergence Analysis In this section, we provide proofs for the theoretical results in Sections 3, 4, 5 and 6. Throughout this section, we have the following notations. yk y (xk), zk z (xk); Fk: the σ-field of random variables already generated at the beginning of iteration k; Ek[ ] E[ | Fk]; F k i fi(xk, yk), F k f(xk, yk), F k ,i f(xk, yk ), F k f(xk, yk ), Gk i gi(xk, yk), Gk g(xk, yk), Gk ,i gi(xk, yk ), Gk g(xk, yk ); Distributed Bilevel Optimization with Communication Compression X k + E h xk+1 xk 2 2 i , Yk + E h yk+1 yk 2 2 i , Zk + E h zk+1 zk 2 2 i , Yk E h yk yk 2 2 E h zk zk 2 2 n Pn i=1 Dk x,i, Dk y 1 n Pn i=1 Dk y,i, Dk z 1 n Pn i=1 Dk z,i, ˆDk x 1 n Pn i=1 Cu i (Dk x,i); ˆDk y,i and ˆDk z,i denote compressed local update estimations, which depend on different lower-level compression mechanisms: ( Cℓ i (Dk y,i), in Alg. 1 and 4, mk y,i + Cℓ i (Dk y,i mk y,i), in Alg. 2 and 5, ˆDk z,i ( Cℓ i (Dk z,i), in Alg. 1 and 4, mk z,i + Cℓ i (Dk z,i mk z,i), in Alg. 2 and 5. The following lemmas are frequently used in our analysis. 
Lemma B.1 ((Chen et al., 2023b), Lemma B.1). Under Assumptions 2.1, 2.2, if β < 2 Lg+µg , the following inequality holds: yk β y Gk yk 2 (1 βµg) yk yk 2. Lemma B.2 ((Chen et al., 2023b), Lemma B.2). Under Assumptions 2.1, 2.2, there exist positive constants L Φ, Ly , Lz , such that Φ(x), y (x), z (x) are L Φ, Ly , Lz -Lipschitz continuous, respectively. Moreover, we have z (x) 2 Cf µg for all x Rdx. Notation. For convenience, we define the following constants: L2 1 L2 f + L2 gyyρ2, L2 2 L2 f + L2 gxyρ2, L2 3 L2 g(1 + 3L2 y ), L2 4 L2 1(1 + 3L2 y ) + 3L2 g L2 z , L2 5 L2 2(1 + 3L2 y ) + 3L2 g L2 z , σ2 1 σ2(1 + ρ2), σ2 σ2/R, σ2 1 σ2 1/R, ω1 1 + 6ωℓ(1 + ωℓ), ω2 1 + 36ωu(1 + ωu), ωℓ ωℓ(ωℓ/(1 + ωℓ))R , ωu ωu (ωu/(1 + ωu))R . Lemma B.3 (Variance Bounds). Under Assumption 2.3, we have the following variance bounds for Alg. 1 and 2: Var Dk y,i | Fk σ2, Var Dk y | Fk σ2/n; Var Dk x,i | Fk σ2 1, Var Dk x | Fk σ2 1/n; Var Dk z,i | Fk σ2 1, Var Dk z | Fk σ2 1/n. Proof. Note that zk 2 ρ, Lemma B.3 is a direct consequence of Assumption 2.3 and the definition of σ1. Lemma B.4 (Gradient Bias). Under Assumptions 2.1, 2.2, 2.3, if ρ Cf/µg, the following inequality holds for C-SOBA, CM-SOBA (Alg. 1) and EF-SOBA (Alg. 2): k=0 E h Ek(Dk x) Φ(xk) 2 k=0 Yk + 3L2 g k=0 Zk. (18) Proof. By Assumption 2.3, we have Ek(Dk x) = 2 xy Gkzk + x F k. Since Φ(xk) = 2 xy Gk zk + x F k , we have E h Ek(Dk x) Φ(xk) 2 =E h ( x F k x F k ) + ( 2 xy Gk 2 xy Gk )zk + 2 xy Gk(zk zk ) 2 3E h x F k x F k 2 i + 3E h ( 2 xy Gk 2 xy Gk )zk 2 i + 3E h 2 xy Gk(zk zk ) 2 3L2 f E h yk yk 2 i + 3L2 gxyρ2E h yk yk 2 i + 3L2 g E h zk zk 2 where the first inequality uses Cauchy-Schwarz inequality and the second inequality uses Assumption 2.1, Lemma B.2 and ρ Cf/µg. Summing (19) from k = 0 to K 1 we achieve (18). Distributed Bilevel Optimization with Communication Compression Lemma B.5 (Local Vanilla Compression Error in Lower Level). Under Assumptions 2.1, 2.2, 2.3, 2.4, and 4.1(or 3.1), the following inequality holds for C-SOBA and CM-SOBA (Alg. 1): i=1 E ˆDk y,i Dk y,i 2 k=0 Yk + Kωℓσ2 + 2Kωℓb2 g, (20) i=1 E ˆDk z,i Dk z,i 2 k=0 Yk + 6ωℓL2 g k=0 Zk + Kωℓσ2 1 + 4Kωℓb2 f + 4KωℓC2 f µ2g b2 g. (21) Proof. By definition, we have ˆDk y,i = Cℓ i (Dk y,i), thus E ˆDk y,i Dk y,i 2 ωℓE h Dk y,i 2 i ωℓE h y Gk i 2 2ωℓE h y Gk ,i 2 i + 2ωℓL2 g E h yk yk 2 i + ωℓσ2, (22) where the first inequality uses Assumption 2.4, the second inequality uses Lemma B.3, the third inequality uses y Gk i = y Gk ,i + ( y Gk i y Gk ,i), Cauchy-Schwarz inequality and Assumption 2.1. Averaging (22) from i = 1 to n, we obtain i=1 E ˆDk y,i Dk y,i 2 i=1 E h y Gk ,i y Gk 2 i + 2ωℓL2 g E h yk yk 2 2ωℓb2 g + 2ωℓL2 g E h yk yk 2 i + ωℓσ2, (23) where the first inequality uses y Gk = 0, and the second inequality uses Assumption 4.1 (or Assumption 3.1). Summing (23) from k = 0 to K 1 we obtain (20). Similarly, we have E ˆDk z,i Dk z,i 2 ωℓE h 2 yy Gk i zk + y F k i 2 6ωℓE h ( 2 yy Gk i 2 yy Gk ,i)zk 2 i + 6ωℓE h 2 yy Gk ,i(zk zk ) 2 i + 6ωℓE h y F k i y F k ,i 2 + 4ωℓE h ( 2 yy Gk ,i 2 yy Gk )zk 2 i + 4ωℓE h y F k ,i y F k 2 6ωℓ L2 f + L2 gyyρ2 E h yk yk 2 i + 6ωℓL2 g E h zk zk 2 i + 4ωℓC2 f µ2g E h 2 yy Gk ,i 2 yy Gk 2 + 4ωℓE h y F k ,i y F k 2 i + ωℓσ2 1, (24) where the first inequality uses Assumption 2.4 and Lemma B.3, the second inequality uses 2 yy Gk zk + y F k = 0 and Cauchy-Schwarz inequality, the third inequality uses Assumption 2.1 and Lemma B.2. Averaging (24) from i = 1 to n, summing from k = 0 to K 1, we achieve (21). 
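The next lemma aggregates the local compression errors across workers using the conditional independence of the compressors, turning the per-worker bound ω||x||² into a bound that shrinks by a factor of 1/n. The following self-contained NumPy check (synthetic vectors standing in for the local directions D^k_{y,i}; not the paper's code) numerically illustrates both effects for the scaled rand-K compressor sketched earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

def scaled_rand_k(x, k, rng):
    # omega-unbiased sparsifier with omega = d/k - 1 (Assumption 2.4)
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = x[idx] * (d / k)
    return out

n, d, k, trials = 8, 100, 10, 20000
omega = d / k - 1
D = rng.normal(size=(n, d))          # synthetic stand-ins for the local vectors
D_bar = D.mean(axis=0)

local_err = np.zeros(n)              # estimates E||C_i(D_i) - D_i||^2
global_err = 0.0                     # estimates E||(1/n) sum_i C_i(D_i) - D_bar||^2
for _ in range(trials):
    comp = np.stack([scaled_rand_k(D[i], k, rng) for i in range(n)])
    local_err += ((comp - D) ** 2).sum(axis=1) / trials
    global_err += ((comp.mean(axis=0) - D_bar) ** 2).sum() / trials

print(local_err / (omega * (D ** 2).sum(axis=1)))   # each ratio is close to 1
print(global_err * n / local_err.mean())            # close to 1: independent errors average as 1/n
```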
Lemma B.6 (Global Vanilla Compression Error in Lower Level). Under Assumptions 2.1, 2.2, 2.3, 2.4 and 4.1(or 3.1), the following inequalities hold for C-SOBA and CM-SOBA (Alg. 1): k=0 E y Gk ˆDk y 2 k=0 Yk + (1 + ωℓ)Kσ2 n + 2ωℓKb2 g n , (25) k=0 E Ek(Dk z) ˆDk z 2 k=0 Yk + 6ωℓL2 g n k=0 Zk + (1 + ωℓ)Kσ2 1 n + 4ωℓKb2 f n + 4ωℓC2 f Kb2 g nµ2g . (26) Proof. By Assumption 2.3, we have E y Gk ˆDk y 2 =E ( ˆDk y Dk y) + (Dk y y Gk) 2 = E ˆDk y Dk y 2 + E h Dk y y Gk 2 Distributed Bilevel Optimization with Communication Compression Thus, by Assumption 2.4 and conditional independence of the compressors, we have E y Gk ˆDk y 2 i=1 ( ˆDk y,i Dk y,i) + E h Dk y y Gk 2 i=1 E ˆDk y,i Dk y,i 2 + E h Dk y y Gk 2 i=1 E ˆDk y,i Dk y,i 2 where the inequality uses Lemma B.3. Summing (27) from k = 0 to K 1 and applying (20), we obtain (25). Similarly, we have E Ek(Dk z) ˆDk z 2 i=1 E ˆDk z,i Dk z,i 2 nσ2 1. (28) Summing (28) from k = 0 to K 1 and applying (21), we obtain (26). B.1. Proof of Theorem 3.4 Before proving Theorem 3.4, we need a few additional lemmas. Lemma B.7 ((Dagr eou et al., 2022), Lemma C.1 and C.2). Under Assumptions 2.1, 2.2, 3.3, there exists positive constants Lyx and Lzx, such that y (x) and z (x) are Lyx and Lzx-smooth, respectively. Lemma B.8 (Bounded Second Moment of Vanilla Compressed Update Directions). Under Assumptions 2.1, 2.2, 2.3, 2.4 and 3.1, the following inequalities hold for C-SOBA (Alg. 1): k=0 E ˆDk x 2 k=0 E h Ek(Dk x) 2 i + (1 + ωu)Kσ2 1 n + 2ωu Kb2 f n + 2ρ2ωu Kb2 g n , (29) k=0 E ˆDk y 2 k=0 Yk + (1 + ωℓ)Kσ2 n + 2ωℓKb2 g n , (30) k=0 E ˆDk z 2 k=0 Yk + 3 + 6ωℓ k=0 Zk + (1 + ωℓ)Kσ2 1 n + 4ω1Kb2 f n + 4C2 fωℓKb2 g nµ2g . (31) Proof. Let Fk x denote the σ-field of {Dk x,i | 1 i n} and random variables already generated at the beginning of iteration k. By Assumption 2.3 and Lemma B.3, we have i = E h Ek(Dk x) 2 i + Var Dk x | Fk E h Ek(Dk x) 2 i + σ2 1 n . (32) E h Dk x,i 2 i E h Ek(Dk x,i) 2 i + σ2 1. (33) Averaging (33) from i = 1 to n, we obtain i=1 E h Dk x,i 2 i=1 E h Ek(Dk x,i) 2 =E h Ek(Dk x) 2 i=1 E h Ek(Dk x,i) Ek(Dk x) 2 E h Ek(Dk x) 2 i=1 E h ( 2 xy Gk i 2 xy Gk)zk 2 i=1 E h x F k i x F k 2 Distributed Bilevel Optimization with Communication Compression E h Ek(Dk x) 2 i + 2ρ2b2 g + 2b2 f + σ2 1, (34) where the second inequality uses Cauchy-Schwarz inequality, and the third inequality uses Assumption 3.1. For ˆDk x, we have =E h E h ˆDk x 2 Fk x ii =E h Dk x 2 i + E E ˆDk x Dk x 2 Fk x =E h Dk x 2 i=1 E E ˆDk x,i Dk x,i 2 Fk x i=1 E h Dk x,i 2 where the inequality uses Assumption 2.4. Applying (32)(34) to (35) and summing from k = 0 to K 1, we obtain (29). By Assumption 2.1, we have E h Ek(Dk y) 2 i = E h y Gk y Gk 2 i L2 g E h yk yk 2 Consequently, =E h Ek(Dk y) 2 i + E ˆDk y Ek(Dk y) 2 L2 g E h yk yk 2 i + E ˆDk y y Gk 2 Summing (36) from k = 0 to K 1 and applying Lemma B.6, we obtain (30). Similarly, we have E h Ek(Dk z) 2 i =E h ( 2 yy Gk 2 yy Gk )zk + 2 yy Gk (zk zk ) + ( y F k y F k ) 2 3(L2 f + L2 gyyρ2)E h yk yk 2 i + 3L2 g E h zk zk 2 =E h Ek(Dk z) 2 i + E ˆDk z Ek(Dk z) 2 3L2 1E h yk yk 2 i + 3L2 g E h zk zk 2 Summing (37) from k = 0 to K 1 and applying Lemma B.6, we obtain (31). Lemma B.9 (Lower Level Convergence). Under Assumptions 2.1, 2.2, 2.3, 2.4, 3.1, 3.2, 3.3 and assume β < min n 2 µg+Lg , µg 8(1+4ωℓ/n)L2g o , γ min n 1 Lg , µg 12(1+4ωℓ/n)L2g v u u t µg L2 y β L2yx B2x(1 + ωu/n) + (1 + ωu)σ2 1/n + 2ωub2 f/n + 2ρ2ωub2g/n , the following inequalities hold for C-SOBA (Alg. 
1): µgβ + 16(1 + ωℓ)Kβσ2 µgn + 24L2 y (1 + ωu)Kα2σ2 1 µgnβ 24L2 y α2(1 + ωu/n) µgβ + 16L2 y α2 k=0 E h Ek(Dk x) 2 Distributed Bilevel Optimization with Communication Compression + 48L2 y ωu Kα2b2 f µgnβ + nµg + 48L2 y ρ2ωu Kα2 µgγ + 136L2 1Y0 µ3gβ + 272(1 + ωℓ)L2 1Kβσ2 408L2 1L2 y (1 + ωu)α2 µ3gnβ + 8(1 + ωℓ)γ µgn + 8(L2 z + Lzxρ)(1 + ωu)α2 408L2 1L2 y (1 + ωu/n)α2 µ3gβ + 272L2 1L2 y α2 µ4gβ2 + 8(L2 z + Lzxρ)(1 + ωu/n)α2 k=0 E h Ek(Dk x) 2 816L2 1L2 y ωu Kα2 µ3gnβ + 32ωℓKγ nµg + 16(L2 z + Lzxρ)ωu Kα2 544L2 1ωℓKβ nµ3g + 816L2 1L2 y ρ2ωu Kα2 nµ3gβ + 16ρ2ωℓKγ nµg + 8(L2 z + Lzxρ)ρ2ωu Kα2 Proof. We first separate Yk+1 into five parts: Yk+1 =E h yk+1 yk 2 i + E h yk+1 yk 2 i 2E yk+1 yk , yk+1 yk =E h yk+1 yk 2 i + E h yk+1 yk 2 i 2E yk yk , yk+1 yk + 2βE h D ˆDk y, yk+1 yk Ei =E h yk+1 yk 2 i + E h yk+1 yk 2 i 2E yk yk , y (xk)(xk+1 xk) 2E yk yk , yk+1 yk y (xk)(xk+1 xk) + 2βE h D ˆDk y, yk+1 yk Ei , (40) where the existence of y (xk) is guaranteed by Lemma B.7. For the first part, by Assumptions 2.3 and 2.4, ˆDk y is an unbiased estimator of y Gk, thus E h yk+1 yk 2 i =E yk β ˆDk y yk 2 = E (yk yk β y Gk) β( ˆDk y y Gk) 2 =E h yk yk β y Gk 2 i + E β( ˆDk y y Gk) 2 (1 βµg)2Yk + β2E ˆDk y y Gk 2 (1 βµg)Yk + β2E ˆDk y y Gk 2 where the first inequality uses Lemma B.1. For the second part, Lemma B.2 implies E h yk+1 yk 2 i L2 y X k + = L2 y α2E ˆDk x 2 For the third part, we have 2E yk yk , y (xk)(xk+1 xk) =2αE yk yk , y (xk)Ek(Dk x) βµg L2 y E h Ek(Dk x) 2 where the equality uses the unbiasedness of ˆDk x and the inequality uses Young s inequality and Lemma B.2. For the fourth part, we have 2E yk yk , yk+1 yk y (xk)(xk+1 xk) Lyx E yk yk xk+1 xk 2 Distributed Bilevel Optimization with Communication Compression L2 yx 4L2 y E Ek yk yk 2 xk+1 xk 2 + L2 y X k + 4L2 y E yk yk 2Ek ˆDk x 2 + L2 y α2E ˆDk x 2 L2 yx B2 x(1 + ωu/n) + (1 + ωu)σ2 1/n + 2ωub2 f/n + 2ρ2ωub2 g/n α2 4L2 y Yk + L2 y α2E ˆDk x 2 4 Yk + L2 y α2E ˆDk x 2 where the first inequality uses Taylor s expansion, Cauchy s inequality and Lemma B.7, the second inequality uses Young s inequality, the third inequality uses the definition of xk+1, the fourth inequality is due to ˆDk x 2 = Ek(Dk x) 2 + Ek h Dk x Ek(Dk x) 2i + Ek E ˆDk x Dk x 2 Fk x B2 x + σ2 1 n + ωu i=1 Ek h Dk x,i 2i B2 x + σ2 1 n + ωu n B2 x + σ2 1 + 2b2 f + 2ρ2b2 g , and the last inequality is due to α r L2yx(B2x(1+ωu/n)+(1+ωu)σ2 1/n+2ωub2 f /n+2ρ2ωub2g/n). For the fifth part, by Young s inequality and Lemma B.2 we have 2βE h D ˆDk y, yk+1 yk Ei β2E ˆDk y 2 + L2 y α2E ˆDk x 2 Summing (40)(41)(42)(43)(44)(45) from k = 0 to K 1 and applying Lemma B.6 we obtain βµg + 4(1 + ωℓ)Kβσ2 nµg + 12L2 y α2 k=0 E ˆDk x 2 k=0 E h Ek(Dk x) 2 + 8ωℓL2 gβ nµg k=0 Yk + 4β k=0 E ˆDk y 2 + 8ωℓKβb2 g nµg βµg + 8(1 + ωℓ)Kβσ2 nµg + 12L2 y (1 + ωu)Kα2σ2 1 µgnβ + 12L2 y α2(1 + ωu/n) βµg + 8L2 y α2 k=0 E h Ek(Dk x) 2 + 4(1 + 4ωℓ/n)L2 gβ µg k=0 Yk + 24L2 y ωu Kα2b2 f µgnβ + nµg + 24L2 y ρ2ωu Kα2 where the second inequality uses Lemma B.8. Using β µg 8(1+4ωℓ/n)L2g , (46) implies (38). Similarly, we separate Zk+1 into five parts: Zk+1 E h zk+1 zk+1 2 =E h zk+1 zk 2 i + E h zk+1 zk 2 i 2E zk zk , z (xk)(xk+1 xk) 2E zk zk , zk+1 zk z (xk)(xk+1 xk) + 2γE h D ˆDk z, zk+1 zk Ei , (47) Distributed Bilevel Optimization with Communication Compression where the inequality is due to Lemma B.2 and ρ Cf/µg. 
For the first part, we have E h zk+1 zk 2 i =E h zk γEk(Dk z) zk 2 i + γ2E Ek(Dk z) ˆDk z 2 =E h (zk zk ) γ 2 yy Gk(zk zk ) γ( 2 yy Gkzk + y F k) 2 i + γ2E Ek(Dk z) ˆDk z 2 (1 + γµg)(1 γµg)2Zk + 1 + 1 γµg γ2 2L2 1Yk + γ2E Ek(Dk z) ˆDk z 2 where the inequality uses Young s inequality, I γ 2 yy Gk 2 1 γµg and E h ( 2 yy Gk 2 yy Gk )zk + ( y F k y F k ) 2 i 2 L2 f + L2 gyyρ2 Yk. For the second part, Lemma B.2 implies E h zk+1 zk 2 i L2 z X k + = L2 z α2E ˆDk x 2 For the third part, applying Young s inequality along with Lemma B.2 gives 2E zk zk , z (xk)(xk+1 xk) γµg γµg L2 z E h Ek(Dk x) 2 For the fourth part, we have 2E zk zk , zk+1 zk z (xk)(xk+1 xk) Lzx E zk zk xk+1 xk 2 2Lzxρα2E ˆDk x 2 where the first inequality uses Taylor s expansion, Cauchy-Schwarz inequality and Lemma B.7, and the second inequality uses Lemma B.2 and ρ Cf/µg. For the fifth part, by Young s inequality and Lemma B.2 we have 2γE h D ˆDk z, zk+1 zk Ei γ2E ˆDk z 2 + L2 z α2E ˆDk x 2 Summing (47)(48)(49)(50)(51)(52) from K = 0 to K 1 and applying Lemma B.6 we achieve µgγ + 2(1 + ωℓ)Kγσ2 1 µgn + 4L2 z α2 µgγ + 4Lzxρα2 k=0 E ˆDk x 2 k=0 E h Ek(Dk x) 2 µg E ˆDk z 2 + 8L2 1 µ2g + 12ωℓL2 1γ µgn k=0 Yk + 12ωℓL2 gγ µgn k=0 Zk + 8ωℓKγb2 f nµg + 8C2 fωℓKγb2 g nµ3g µgγ + 4(1 + ωℓ)Kγσ2 1 µgn + 4(L2 z + Lzxρ)(1 + ωu)Kα2σ2 1 µgnγ µ2gγ2 + 4(L2 z + Lzxρ)(1 + ωu/n)α2 k=0 E h Ek(Dk x) 2 + 8L2 1 µ2g + 6(1 + 4ωℓ/n)L2 1γ µg k=0 Yk + 6(1 + 4ωℓ/n)L2 gγ µg nµg + 8(L2 z + Lzxρ)ωu Kα2 b2 f + 16ρ2ωℓKγ nµg + 8(L2 z + Lzxρ)ρ2ωu Kα2 where the second inequality uses Lemma B.8 and ρ Cf/µg. Using γ µg 12(1+4ωℓ/n)L2 g , we obtain µgγ + 8(1 + ωℓ)Kγσ2 1 µgn + 8(L2 z + Lzxρ)(1 + ωu)Kα2σ2 1 µgnγ Distributed Bilevel Optimization with Communication Compression µ2gγ2 + 8(L2 z + Lzxρ)(1 + ωu/n)α2 k=0 E h Ek(Dk x) 2 + 17L2 1 µ2g k=0 Yk + 32ωℓKγ nµg + 16(L2 z + Lzx Cf/µg)ωu Kα2 nµg + 16(L2 z + Lzxρ)ρ2ωu Kα2 Applying (38) to (54) achieves (39). Now we are ready to prove 3.4. We first restate the theorem in a more detailed way. Theorem B.10 (Convergence of C-SOBA). Under the conditions that Assumptions 2.1, 2.2, 2.3, 2.4, 3.1, 3.2, 3.3 hold and β < min n 2 µg+Lg , µgn 8(1+4ωℓ/n)L2g o , γ min n 1 Lg , µg 12(1+4ωℓ/n)L2g o , ρ Cf/µg, ( 1 5L Φ(1 + ωu/n), µgβ 360L2 y (L2 2 + 17κ2g L2 1)(1 + ωu/n), r µgγ 120L2g(L2 z + Lzxρ)(1 + ωu/n), 240L2 y (L2 2 + 17κ2g L2 1), 120L2g L2 z , v u u t µg L2 y β L2yx B2x(1 + ωu/n) + (1 + ωu)σ2 1/n + 2ωub2 f/n + 2ρ2ωub2g/n C-SOBA (Alg. 1) converges as k=0 E h Φ(xk) 2 2 0 Φ Kα + 24(L2 2 + 17κ2 g L2 1) 0 y µg Kβ + 12L2 g 0 z µg Kγ + 48(L2 2 + 17κ2 g L2 1)(1 + ωℓ)βσ2 L Φ(1 + ωu)α n + 24L2 g(1 + ωℓ)γ + 72L2 y (L2 2 + 17κ2 g L2 1)(1 + ωu)α2 µgnβ + 24L2 g(L2 z + Lzxρ)(1 + ωu)α2 n + 96L2 gωℓγ µgn + 144L2 y (L2 2 + 17κ2 g L2 1)ωuα2 µgnβ + 48L2 g(L2 z + Lzxρ)ωuα2 n + 96(L2 2 + 17κ2 g L2 1)ωℓβ µgn + 48L2 gρ2ωℓγ µgn + 144L2 y ρ2(L2 2 + 16κ2 g L2 1)ωuα2 + 24L2 gρ2(L2 z + Lzxρ)ωuα2 If we further choose parameters as 2 + 8(1 + 4ωℓ/n)L2 g µg + 2K (1 + ωℓ)σ2 + 2ωℓb2g Lg + 12(1 + 4ωℓ/n)L2 g µg + v u u t2K (1 + ωℓ)σ2 1 + 4ωℓb2 f + 2(C2 f/µ2g)ωℓb2g Distributed Bilevel Optimization with Communication Compression 5L Φ(1 + ωu/n) + 360L2 y (L2 2 + 17κ2g L2 1)(1 + ωu/n) µgβ + 120L2g(L2 z + Lzx Cf/µg)(1 + ωu/n) 240L2 y (L2 2 + 17κ2g L2 1) µ2gβ2 + 120L2g L2 z µ2gγ2 v u u t L2yx B2x(1 + ωu/n) + (1 + ωu)σ2 1/n + 2ωub2 f/n + 2C2 fωub2g/(µ2gn) C-SOBA (Alg. 
1) converges as order k=0 E h Φ(xk) 2 (1 + ωℓ+ ωu) σ + p (ωℓ+ ωu) (bf + bg) + 3/4( 4 1 + ωℓ σ + 4 ωℓ p bg) 1 + ωuσ + ωu(bf + bg) + n + ωu Bx (1 + ωu)(1 + ωℓ/n) σ + p ωu(1 + ωℓ/n)(bf + bg) n K (1 + ωℓ/n)(1 + ωu/n) Bx K + (1 + ωℓ/n + ωu/n) where 0 Φ Φ(x0), 0 y y0 y0 2 2, 0 z z0 z0 2 2, and max{ 0 Φ, 0 x, 0 y, 0 z}. Proof. By L Φ-smoothness of Φ (Lemma B.2), we have E Φ(xk) + E Φ(xk), xk+1 xk + L Φ =E Φ(xk) αE h Ek h D Φ(xk), ˆDk x Eii + L Φα2 2 E ˆDk x 2 2 E h Φ(xk) 2 2 E h Ek(Dk x) 2 2 E h Φ(xk) Ek(Dk x) 2 2 E ˆDk x 2 Summing (57) from k = 0 to K 1, we obtain k=0 E h Φ(xk) 2 k=0 E h Ek(Dk x) 2 k=0 E h Φ(xk) Ek(Dk x) 2 k=0 E ˆDk x 2 Applying Lemma B.4, B.8, B.9 to (58), we obtain k=0 E h Φ(xk) 2 2 0 Φ Kα + 24(L2 2 + 17κ2 g L2 1) 0 y µg Kβ + 12L2 g 0 z µg Kγ + 48(L2 2 + 17κ2 g L2 1)(1 + ωℓ)βσ2 L Φ(1 + ωu)α n + 24L2 g(1 + ωℓ)γ + 72L2 y (L2 2 + 17κ2 g L2 1)(1 + ωu)α2 µgnβ + 24L2 g(L2 z + Lzxρ)(1 + ωu)α2 k=0 E h Ek(Dk x) 2 Distributed Bilevel Optimization with Communication Compression n + 96L2 gωℓγ µgn + 144L2 y (L2 2 + 17κ2 g L2 1)ωuα2 µgnβ + 48L2 g(L2 z + Lzxρ)ωuα2 n + 96(L2 2 + 17κ2 g L2 1)ωℓβ µgn + 48L2 gρ2ωℓγ µgn + 144L2 y ρ2(L2 2 + 16κ2 g L2 1)ωuα2 + 24L2 gρ2(L2 z + Lzxρ)ωuα2 L Φ(1 + ωu/n)α + 72L2 y (L2 2 + 17κ2 g L2 1)(1 + ωu/n)α2 µgβ + 24L2 g(L2 z + Lzxρ)(1 + ωu/n)α2 + 48L2 y (L2 2 + 17κ2 g L2 1)α2 µ2gβ2 + 24L2 g L2 z α2 Note that (55) implies C1 0, (56) is a direct result of (60). B.2. Proof of Theorem 4.2 Before proving Theorem 4.2, we need a few additional lemmas. Lemma B.11 (Lower Level Convergence). Under Assumptions 2.1, 2.2, 2.3, 2.4, 4.1, when β < min n 2 µg+Lg , nµg 8ωℓL2g γ min n 1 Lg , nµg 36ωℓL2g o , ρ Cf/µg, the following inequalities hold for CM-SOBA (Alg. 1): βµg + 4βK(1 + ωℓ)σ2 nµg + 8βKωℓb2 g nµg + 4L2 y β2µ2g k=0 X k +, (61) γµg + 50L2 1Y0 βµ3g + 100βK(1 + ωℓ)L2 1σ2 nµ3g + 6K(1 + ωℓ)γσ2 1 µgn 100L2 y L2 1 β2µ4g + 12L2 z γ2µ2g k=0 X k + + (200βL2 1 + 24γρ2µ2 g)Kωℓb2 g nµ3g + 24Kωℓγb2 f µgn . (62) Proof. By Assumptions 2.3 and 2.4, ˆDk y is an unbiased estimator of y Gk, thus E h yk+1 yk 2 i =E yk β ˆDk y yk 2 = E (yk yk β y Gk) β( ˆDk y y Gk) 2 =E h yk yk β y Gk 2 i + E β( ˆDk y y Gk) 2 (1 βµg)2Yk + β2E ˆDk y y Gk 2 where the inequality uses Lemma B.1. Consequently, we have Yk+1 (1 + βµg)E h yk+1 yk 2 i + 1 + 1 βµg E h yk+1 yk 2 (1 βµg)Yk + 2β2E ˆDk y y Gk 2 + 2L2 y βµg X k +, (64) where the first inequality uses Young s inequality, and the second inequality uses (63), β < 1/µg and Lemma B.2. Summing (64) from k = 0 to K 1 and applying Lemma B.6, we obtain k=0 E ˆDk y y Gk 2 + 2L2 y β2µ2g Distributed Bilevel Optimization with Communication Compression βµg + 2βK(1 + ωℓ)σ2 nµg + 4βKωℓb2 g nµg + 2L2 y β2µ2g k=0 X k + + 4ωℓL2 gβ nµg k=0 Yk. (65) By β nµg 8ωℓL2 g , (65) implies (61). Similarly, E h zk+1 zk 2 i E h zk+1 zk 2 i = E h zk γEk(Dk z) zk 2 i + γ2E Ek(Dk z) ˆDk z 2 =E h (zk zk ) γ 2 yy Gk(zk zk ) γ( 2 yy Gkzk + y F k) 2 i + γ2E Ek(Dk z) ˆDk z 2 (1 + γµg)(1 γµg)2Zk + 1 + 1 γµg γ2 2L2 1Yk + γ2E Ek(Dk z) ˆDk z 2 where the first inequality uses ρ Cf/µg and Lemma B.2, the second inequality uses Young s inequality, I γ 2 yy Gk 2 1 γµg and E h ( 2 yy Gk 2 yy Gk )zk + ( y F k y F k ) 2 L2 f + L2 gyy C2 f µ2g Consequently, Zk+1 1 + γµg E h zk+1 zk 2 i + 1 + 2 γµg E h zk+1 zk 2 Zk + 6L2 1γ µg Yk + 3γ2 2 E ˆDk z Ek(Dk z) 2 + 3L2 z γµg X k +, (67) where the first inequality uses Young s inequality and the second inequality uses (66), γ 1/µg and Lemma B.2. 
Summing (67) from K = 0 to K 1, we achieve γµg + 12L2 1 µ2g k=0 Yk + 3γ k=0 E ˆDk z Ek(Dk z) 2 + 6L2 z γ2µ2g γµg + 3K(1 + ωℓ)γσ2 1 µgn + 6L2 z γ2µ2g k=0 X k + + 12L2 1 µ2g + 18ωℓL2 1γ nµg k=0 Yk + 18ωℓL2 gγ nµg k=0 Zk + 12Kωℓγb2 f µgn + 12ρ2Kωℓγb2 g nµg γµg + 3K(1 + ωℓ)γσ2 1 µgn + 6L2 z γ2µ2g k=0 X k + + 25L2 1 2µ2g k=0 Zk + 12Kωℓγb2 f µgn + 12ρ2Kωℓγb2 g nµg , (68) where the second inequality uses Lemma B.6 and ρ Cf/µg, the third inequality uses γ nµg 36ωℓL2 g . (68) further implies γµg + 6K(1 + ωℓ)γσ2 1 µgn + 12L2 z γ2µ2g k=0 X k + + 25L2 1 µ2g k=0 Yk + 24Kωℓγb2 f µgn + 24ρ2Kωℓγb2 g nµg . (69) Applying (61) to (69), we achieve (62). Lemma B.12 (Global Vanilla Compression Error in Upper Level). Under Assumptions 2.1, 2.2, 2.3, 2.4, 4.1, the following inequality holds for CM-SOBA (Alg. 1): k=0 E ˆDk x Ek(Dk x) 2 (1 + ωu)Kσ2 1 n + 6ωu k=0 E h Φ(xk) 2 i + 6ωu L2 2 n k=0 Yk + 6ωu L2 g n + 6ωu Kb2 f n + 6ρ2ωu Kb2 g n . (70) Distributed Bilevel Optimization with Communication Compression Proof. By Assumptions 2.3 and 2.4, we have E ˆDk x Ek(Dk x) 2 =E ( ˆDk x Dk x) + (Dk x Ek(Dk x)) 2 = E ˆDk x Dk x 2 + E h Dk x Ek(Dk x) 2 i=1 E h Cu i (Dk x,i) Dk x,i 2 i=1 E h Dk x,i Ek(Dk x,i) 2 i=1 E h Dk x,i 2 i + σ2 1 n , (71) where the inequality uses Assumption 2.4 and Lemma B.3. We next bound the second moment of Dk x,i. E h Dk x,i 2 i =E h (Dk x,i Ek(Dk x,i)) + Ek(Dk x,i) 2 =E h Dk x,i Ek(Dk x,i) 2 i + E h Ek(Dk x,i) 2 σ2 1 + 6E h Φ(xk) 2 i + 6E h x F k ,i x F k 2 i + 6E h ( 2 xy Gk ,i 2 xy Gk )zk 2 + 6E h ( 2 xy Gk i 2 xy Gk ,i)zk 2 i + 6E h 2 xy Gk (zk zk ) 2 i + 6E h x F k i x F k ,i 2 σ2 1 + 6E h Φ(xk) 2 i + 6E h x F k ,i x F k 2 i + 6ρ2E h 2 xy Gk ,i 2 xy Gk 2 i + 6L2 2Yk + 6L2 g Zk, (72) where the first inequality uses Lemma B.3 and Cauchy-Schwarz inequality, and the second inequality uses Assumption 2.1 and Lemma B.2. Combining (71) (72) and applying Assumption 4.1, we obtain E ˆDk x Ek(Dk x) 2 (1 + ωu)σ2 1 n + 6ωu n E h Φ(xk) 2 i + 6ωu L2 2 n Yk + 6ωu L2 g n Zk + 6ωub2 f n + 6ρ2ωub2 g n . (73) Summing (73) from k = 0 to K 1 achieves (70). Lemma B.13 (Momentum-Gradient Bias). Under Assumptions 2.1, 2.2, 2.3, 2.4, 4.1, assuming ρ Cf/µg, the following inequality holds for CM-SOBA (Alg. 1): k=0 E h hk x Φ(xk) 2 i h0 x Φ(x0) 2 θ + (1 + ωu)Kθσ2 1 n + 6ωuθ k=0 E h Φ(xk) 2 k=0 Yk + 6L2 g k=0 Zk + 2L2 Φ θ2 + 6ωu Kθb2 f n + 6ρ2ωu Kθb2 g n . (74) Further assuming β < min n 2 µg+Lg , nµg 8ωℓL2 g o , γ min n 1 Lg , nµg 36ωℓL2 g o , θ min n 1, n 12ωu o and applying Lemma B.11, we have k=0 E h hk x Φ(xk) 2 i h0 x Φ(x0) 2 θ + 13(L2 2 + 25κ2 g L2 1)Y0 βµg + 26L2 g Z0 + 26(L2 2 + 25κ2 g L2 1)(1 + ωℓ)Kβσ2 n + 39L2 g(1 + ωℓ)γ 2L2 Φ θ2 + 26L2 y (L2 2 + 25κ2 g L2 1) β2µ2g + 78L2 z L2 g γ2µ2g Distributed Bilevel Optimization with Communication Compression n + 52(L2 2 + 25κ2 g L2 1)ωℓβ nµg + 156L2 gρ2ωℓγ nµg n + 156L2 gωℓγ nµg k=0 E h Φ(xk) 2 Proof. Note that ˆDk x and xk+1 are mutually independent conditioned on Fk, we have E h hk+1 x Φ(xk+1) 2 =E (1 θ)(hk x Φ(xk)) + θ( ˆDk x Ek(Dk x)) + θ(Ek(Dk x) Φ(xk)) + ( Φ(xk) Φ(xk+1) 2 =E h (1 θ)(hk x Φ(xk)) + θ(Ek(Dk x) Φ(xk)) + ( Φ(xk) Φ(xk+1) 2 i + θ2E ˆDk x Ek(Dk x) 2 By Jensen s inequality, E h (1 θ)(hk x Φ(xk)) + θ(Ek(Dk x) Φ(xk)) + ( Φ(xk) Φ(xk+1) 2 (1 θ)E h hk x Φ(xk) 2 " (Ek(Dk x) Φ(xk)) + 1 θ ( Φ(xk) Φ(xk+1)) (1 θ)E h hk x Φ(xk) 2 i + 2θE h Ek(Dk x) Φ(xk) 2 θE h Φ(xk) Φ(xk+1) 2 (1 θ)E h hk x Φ(xk) 2 i + 2θE h Ek(Dk x) Φ(xk) 2 i + 2L2 Φ θ X k +, (76) where the second inequality uses Cauchy-Schwarz inequality, and the third inequality uses Lemma B.2. 
Summing (75)(76) from k = 0 to K 1 and applying Lemma B.4 and B.12, we obtain (74). Now we are ready to prove Theorem 4.2. We first restate the theorem in a more detailed way. Theorem B.14 (Convergence of CM-SOBA). Under Assumptions 2.1, 2.2, 2.3, 2.4, 4.1 and assuming β < min n 2 µg+Lg , µgn 8ωℓL2g o , γ min n 1 Lg , µgn 36ωℓL2g o , θ min n 1, n 12ωu o , ρ Cfµg, α min{ 1 2L Φ , C2} with 2L2 Φ θ2 + 26(L2 2 + 25κ2 g L2 1)L2 y β2µ2g + 78L2 g L2 z γ2µ2g CM-SOBA (Alg. 1) converges as k=0 E h Φ(xk) 2 4 0 Φ Kα + 2 0 x Kθ + 26(L2 2 + 25κ2 g L2 1) 0 y µg Kβ + 52L2 g 0 z µg Kγ + 52(1 + ωℓ)(L2 2 + 25κ2 g L2 1)β µgn σ2 n + 78L2 g(1 + ωℓ)γ n + 104(L2 2 + 25κ2 g L2 1)ωℓβ nµg + 312L2 gρ2ωℓγ nµg n + 312L2 gωℓγ nµg If we further choose parameters as α = 1 2L Φ + C 1 2 , Distributed Bilevel Optimization with Communication Compression 2 + 8ωℓL2 g µgn + 2K (1 + ωℓ)σ2 + 2ωℓb2g Lg + 36ωℓL2 g µgn + v u u t3K (1 + ωℓ)σ2 1 + 4ωℓb2 f + 4ωℓ(C2 f/µ2g)b2g v u u t K (1 + ωu)σ2 1 + 6ωub2 f + 6ωu(C2 f/µ2g)b2g CM-SOBA (Alg. 1) converges as order k=0 E h Φ(x K) 2 2 (1 + ωu + ωℓ) σ + p (ωu + ωℓ) (bf + bg) n K + (1 + ωu/n + ωℓ/n) where we define 0 Φ Φ(x0), 0 x h0 x Φ(x0) 2 2, 0 y y0 y0 2 2, 0 z z0 z0 2 2, and max{ 0 Φ, 0 x, 0 y, 0 z}. Proof. By L Φ-smoothness of Φ (Lemma B.2), we have E Φ(xk) + E Φ(xk), xk+1 xk + L Φ 2 E h xk+1 xk 2 =E Φ(xk) + E hk x 2 , xk+1 xk + E Φ(xk) hk x 2 , xk+1 xk + L Φ 2 E h xk+1 xk 2 2 E h Φ(xk) hk x 2 2 E h Φ(xk) 2 Summing (78) from k = 0 to K 1, we have k=0 E h Φ(xk) 2 k=0 X k + + k=0 E h hk x Φ(xk) 2 By the choice of α, we have thus by applying Lemma B.13 to (79) we obtain (77). B.3. Proof of Theorem 5.1 Lemma B.15 (Bounded Lower Level Updates). Under Assumptions 2.1, 2.2, the following inequalities hold for Alg. 2: k=0 Yk + 3L2 y k=0 X k +, (80) k=0 Zk + 3L2 z k=0 X k +. (81) Distributed Bilevel Optimization with Communication Compression Proof. By Cauchy-Schwarz inequality and Ly -Lipschitz continuity of y (x) (Lemma B.2), we have Yk + =E h (yk+1 yk+1 ) + (yk+1 yk ) + (yk yk) 2 3Yk+1 + 3E h yk+1 yk 2 3Yk+1 + 3Yk + 3L2 y X k +. (82) Summing (82) from k = 0 to K 1 obtains (80). Similary, (81) is achieved by applying Cauchy-Schwarz inequality and Lz -Lipschitz continuity of z (x). The following Lemma describes the contractive property when multiplying (1 + ω) 1 to an ω-unbiased compressor. Lemma B.16 ((He et al., 2023a), Lemma 1). Assume C : Rd C Rd C is an ω-unbiased compressor, then for any x Rd C, it holds that " 1 1 + ω C(x) x Lemma B.17 (Bounded Difference of Local Update Directions in Lower Level). Under Assumptions 2.1 and 2.3, the following inequalities hold for EF-SOBA (Alg. 2): E h Dk+1 y,i Dk y,i 2 i 2L2 g X k + + 2L2 g Yk + + 3σ2, (83) E h Dk+1 z,i Dk z,i 2 i 6L2 1X k + + 6L2 1Yk + + 6L2 g Zk + + 3σ2 1. (84) Proof. For the first inequality, we have E h Dk y,i Dk 1 y,i 2 i =E Ek Dk y,i Dk 1 y,i 2 E h Ek(Dk y,i) Dk 1 y,i 2 2E h Ek(Dk y,i) Ek 1(Dk 1 y,i ) 2 i + 2E h Dk 1 y,i Ek 1(Dk 1 y,i ) 2 2L2 g X k 1 + + 2L2 g Yk 1 + + 3σ2, (85) which is exactly (83), where the first inequality uses Lemma B.3, the second inequality uses Cauchy-Schwarz inequality, the third inequality uses Assumption 2.1 and Lemma B.3. Similarly, we have E h Dk z,i Dk 1 z,i 2 i 2E h Ek(Dk z,i) Ek 1(Dk 1 z,i ) 2 i + 3σ2 1. (86) Ek(Dk z,i) Ek 1(Dk 1 z,i ) = 2 yy Gk i (zk zk 1) + ( 2 yy Gk i 2 yy Gk 1 i )zk 1 + ( y F k i y F k 1 i ), by Cauchy-Schwarz inequality and Assumption 2.1 we have E h Ek(Dk z,i) Ek 1(Dk 1 z,i ) 2 i 3L2 g Zk 1 + + 3L2 gyyρ2 + 3L2 f X k 1 + + Yk 1 + , which together with (86) leads to (84). 
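Lemma B.16 states that scaling an ω-unbiased compressor by (1 + ω)⁻¹ makes it contractive, and the EF-SOBA memory recursions analyzed in the next lemmas rely on this to track the slowly moving local directions. A minimal sketch follows, assuming a rand-K compressor (same construction as in the sketch after Lemma B.5) and a fixed, noise-free target direction D so that only the contraction of Lemma B.16 is at play; the names and constants are ours, not the paper's.

```python
import numpy as np

def rand_k(x, K, rng):
    # unbiased rand-K compressor with omega = d/K - 1
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=K, replace=False)
    out[idx] = (d / K) * x[idx]
    return out

rng = np.random.default_rng(1)
d, K = 200, 10
omega = d / K - 1
delta = 1.0 / (1.0 + omega)     # the step delta_l = (1 + omega)^{-1} used by EF-SOBA

D = rng.standard_normal(d)      # a fixed local update direction (noise-free for illustration)
m = np.zeros(d)                 # compression memory m^0
for k in range(201):
    # for a fixed target, Lemma B.16 gives
    # E||m^{k+1} - D||^2 <= (1 - 1/(1 + omega)) * E||m^k - D||^2
    m = m + delta * rand_k(D - m, K, rng)
    if k % 50 == 0:
        print(k, np.linalg.norm(m - D))   # decays geometrically
```

The additional (1/2)-factors and the terms involving D^k − D^{k−1} in Lemmas B.17–B.18 account for the fact that in the actual algorithm the target direction moves from one iteration to the next.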
Lemma B.18 (Memory Bias in Lower Level). Under Assumptions 2.1, 2.2, 2.3, 2.4, and assuming δℓ= (1 + ωℓ) 1, the following inequalities hold for EF-SOBA (Alg. 2): i=1 E h mk+1 y,i Dk y,i 2 E(D0 y,i) m0 y,i 2 2 + 12ωℓ(1 + ωℓ)L2 3 k=0 X k + + 72ωℓ(1 + ωℓ)L2 g + 18ωℓ(1 + ωℓ)(K 1)σ2, (87) i=1 E h mk+1 z,i Dk z,i 2 E(D0 z,i) m0 z,i 2 2 + 36ωℓ(1 + ωℓ)L2 4 k=0 X k + + 216ωℓ(1 + ωℓ)L2 1 + 216ωℓ(1 + ωℓ)L2 g k=0 Zk + 18ωℓ(1 + ωℓ)(K 1)σ2 1. (88) Distributed Bilevel Optimization with Communication Compression Proof. By the choice of δℓ, we have E h mk+1 y,i Dk y,i 2 " mk y,i + 1 1 + ωℓ Cℓ i (Dk y,i mk y,i) Dk y,i E h Dk y,i mk y,i 2 1 1 2(1 + ωℓ) E h mk y,i Dk 1 y,i 2 i + 3ωℓE h Dk y,i Dk 1 y,i 2 where the first inequality uses Lemma B.16, and the second inequality uses Young s inequality. Applying Lemma B.17 to (89), we obtain E h mk+1 y,i Dk y,i 2 i 1 1 2(1 + ωℓ) E h mk y,i Dk 1 y,i 2 i + 6ωℓL2 g X k 1 + + 6ωℓL2 g Yk 1 + + 9ωℓσ2. (90) For the initial term, we have E h m1 y,i D0 y,i 2 i 1 1 1 + ωℓ E h D0 y,i m0 y,i 2 i ωℓ 1 + ωℓ E h E(D0 y,i) m0 y,i 2 1 + ωℓ , (91) where the first inequality uses Lemma B.16, and the second inequality uses Lemma B.3. Averaging (90) from i = 1 to n and summing from k = 1 to K 2 and applying (91), we reach i=1 E h mk+1 y,i Dk y,i 2 E(D0 y,i) m0 y,i 2 2 + 12ωℓ(1 + ωℓ)L2 g + 12ωℓ(1 + ωℓ)L2 g k=0 Yk + + 18ωℓ(1 + ωℓ)Kσ2. (92) Applying Lemma B.15 to (92), we reach (87). Similarly, we have E h mk+1 z,i Dk z,i 2 i 1 1 2(1 + ωℓ) E h mk z,i Dk 1 z,i 2 i + 3ωℓE h Dk z,i Dk 1 z,i 2 Applying Lemma B.17 to (93) leads to E h mk+1 z,i Dk z,i 2 i 1 1 2(1 + ωℓ) E h mk z,i Dk 1 z,i 2 i + 18ωℓL2 g Zk 1 + + 18ωℓL2 1 X k 1 + + Yk 1 + + 9ωℓσ2 1. For the initial term, we have E h m1 z,i D0 y,i 2 i 1 1 1 + ωℓ E h D0 z,i m0 z,i 2 i ωℓ 1 + ωℓ E h E(D0 z,i) m0 z,i 2 i + ωℓσ2 1 1 + ωℓ , (95) where the first inequality uses Lemma B.16, and the second inequality uses Lemma B.3. Averaging from i = 1 to n, summing from k = 1 to K 2, applying (95) and Lemma B.15, we reach (88). Lemma B.19 (Local EF21 Compression Error in Lower Level). Under Assumptions 2.1, 2.2, 2.3, 2.4, assuming δℓ= (1 + ωℓ) 1, the following inequalities hold for EF-SOBA (Alg. 2): i=1 E ˆDk y,i Dk y,i 2 ωℓ(1 + 4ωℓ) E(D0 y,i) m0 y,i 2 2 + 4ωℓω1L2 3 k=0 X k + + 24ωℓω1L2 g + 6ωℓω1Kσ2. (96) i=1 E ˆDk z,i Dk z,i 2 ωℓ(1 + 4ωℓ) E(D0 z,i) m0 z,i 2 2 + 12ωℓω1L2 4 k=0 X k + + 72ωℓω1L2 1 + 72ωℓω1L2 g k=0 Zk + 6ωℓω1Kσ2 1. (97) Distributed Bilevel Optimization with Communication Compression Proof. By the definition of ˆDk y,i, we have E ˆDk y,i Dk y,i 2 =E h mk y,i + C(Dk y,i mk y,i) Dk y,i 2 i ωℓE h Dk y,i mk y,i 2 2ωℓE h mk y,i Dk 1 y,i 2 i + 2ωℓE h Dk y,i Dk 1 y,i 2 2ωℓE h mk y,i Dk 1 y,i 2 i + 4ωℓL2 g X k 1 + + 4ωℓL2 g Yk 1 + + 6ωℓσ2, (98) where the first inequality uses Assumption 2.4, the second inequality uses Cauchy-Schwarz inequality, the third inequality uses Cauchy-Schwarz inequality and Lemma B.17. For the initial term, Assumption 2.3 implies E ˆD0 y,i D0 y,i 2 (1 + ωℓ)E h m1 y,i D0 y,i 2 i ωℓ E(D0 y,i) m0 y,i 2 2 + ωℓσ2. (99) Averaging (98) from i = 1 to n, summing from k = 1 to K 1, applying (99), Lemma B.15 and B.18, we obtain (96). Similarly, we have E ˆDk z,i Dk z,i 2 2ωℓE h mk z,i Dk 1 z,i 2 i + 12ωℓL2 1X k 1 + + 12ωℓL2 1Yk 1 + + 12ωℓL2 g Zk 1 + + 6ωℓσ2 1, (100) E ˆD0 z,i D0 z,i 2 ωℓ E(D0 z,i) m0 z,i 2 2 + ωℓσ2 1. (101) Averaging (100) from i = 1 to n, summing from k = 1 to K 1, applying (101), Lemma B.15 and B.18, we obtain (97). Lemma B.20 (Global EF21 Compression Error in Lower Level). 
Under Assumptions 2.1, 2.2, 2.3, 2.4, assuming δℓ= (1 + ωℓ) 1, the following inequalities hold for EF-SOBA (Alg. 2): k=0 E y Gk ˆ Dky 2 ωℓ(1 + 4ωℓ) E(D0 y,i) m0 y,i 2 2 + (1 + 6ωℓω1)Kσ2 n + 4ωℓω1L2 3 n + 24ωℓω1L2 g n k=0 Yk, (102) k=0 E Ek(Dk z) ˆDk z 2 ωℓ(1 + 4ωℓ) E(D0 z,i) m0 z,i 2 2 + (1 + 6ωℓω1)Kσ2 1 n + 12ωℓω1L2 4 n + 72ωℓω1L2 1 n k=0 Yk + 72ωℓω1L2 g n k=0 Zk. (103) Proof. By Assumption 2.3, we have E y Gk ˆDk y 2 =E ( ˆDk y Dk y) + (Dk y y Gk) 2 = E ˆDk y Dk y 2 + E h Dk y y Gk 2 i=1 E ˆDk y,i Dk y,i 2 Summing (104) from k = 0 to K 1 and applying Lemma B.19, we obtain (102). Similarly, we have E Ek(Dk z) ˆDk z 2 i=1 E ˆDk z,i Dk z,i 2 nσ2 1. (105) Summing (105) from k = 0 to K 1 and applying Lemma B.19, we obtain (103). Distributed Bilevel Optimization with Communication Compression Lemma B.21 (Lower Level Convergence). Under Assumptions 2.1, 2.2, 2.3, 2.4, assuming β < min n 2 µg+Lg , µgn 96ωℓω1L2g γ min n 1 Lg , µgn 432ωℓω1L2g o , ρ Cf/µg, δℓ= (1 + ωℓ) 1, the following inequalities hold for EF-SOBA (Alg. 2): βµg + 4βωℓ(1 + 4ωℓ) E(D0 y,i) m0 y,i 2 2 + 16ωℓL2 3ω1β µgn + 4Ly2 + 4βK(1 + 6ωℓω1)σ2 µgn , (106) γµg + 50L2 1Y0 βµ3g + 6γωℓ(1 + 4ωℓ) E(D0 z,i) m0 z,i 2 2 + 100L2 1βωℓ(1 + 4ωℓ) E(D0 y,i) m0 y,i 2 + 100K(1 + 6ωℓω1)L2 1βσ2 µ3gn + 6K(1 + 6ωℓω1)γσ2 1 µgn 400ωℓω1L2 1L2 3β µ3gn + 100L2 1L2 y β2µ4g + 72ωℓω1L2 4γ µgn + 12L2 z γ2µ2g k=0 X k +. (107) Proof. By Assumptions 2.3 and 2.4, ˆDk y is an unbiased estimator of y Gk, thus we have E h yk+1 yk 2 i =E yk β ˆDk y yk 2 = E (yk yk β y Gk) β( ˆDk y y Gk) 2 =E h yk yk β y Gk 2 i + E β( ˆDk y y Gk) 2 (1 βµg)2Yk + β2E ˆDk y y Gk 2 where the inequality uses Lemma B.1. Consequently, Yk+1 (1 + βµg)E h yk+1 yk 2 i + 1 + 1 βµg E h yk+1 yk 2 (1 βµg)Yk + 2β2E ˆDk y y Gk 2 + 2L2 y βµg X k +, (109) where the first inequality uses Young s inequality, the second inequality uses (109), Lemma B.2 and β < 1/µg. Summing (109) from k = 0 to K 1 and applying Lemma B.20, we obtain k=0 E ˆDk y y Gk 2 + 2L2 y β2µ2g βµg + 2βωℓ(1 + 4ωℓ) E(D0 y,i) m0 y,i 2 2 + 2βK(1 + 6ωℓω1)σ2 8ωℓω1L2 3β nµg + 2L2 y β2µ2g + 48ωℓω1L2 gβ nµg k=0 Yk. (110) By β nµg 96ωℓω1L2g , (110) implies (106). Similarly, by ρ Cf/µg and Lemma B.2, we have E h zk+1 zk 2 i E h zk+1 zk 2 i = E h zk γEk(Dk z) zk 2 i + γ2E Ek(Dk z) ˆDk z 2 =E h (zk zk ) γ 2 yy Gk(zk zk ) γ( 2 yy Gkzk + y F k) 2 i + γ2E Ek(Dk z) ˆDk z 2 (1 + γµg)(1 γµg)2Zk + 1 + 1 γµg γ2 2L2 1Yk + γ2E Ek(Dk z) ˆDk z 2 Distributed Bilevel Optimization with Communication Compression where the last inequality uses Young s inequality, I γ 2 yy Gk 2 1 γµg and E h ( 2 yy Gk 2 yy Gk )zk + ( y F k y F k ) 2 L2 f + L2 gyy C2 f µ2g Consequently, Zk+1 1 + γµg E h zk+1 zk 2 i + 1 + 2 γµg E h zk+1 zk 2 Zk + 6L2 1γ µg Yk + 3γ2 2 E ˆDk z Ek(Dk z) 2 + 3L2 z γµg X k +, (112) where the first inequality uses Young s inequality, the second inequality uses (111), Lemma B.2 and γ 1/µg. Summing (112) from K = 0 to K 1 and applying Lemma B.20 we achieve γµg + 12L2 1 µ2g k=0 Yk + 3γ k=0 E ˆDk z Ek(Dk z) 2 + 6L2 z γ2µ2g γµg + 3γωℓ(1 + 4ωℓ) E(D0 z,i) m0 z,i 2 2 + 3K(1 + 6ωℓω1)γσ2 1 µgn + 36ωℓω1L2 4γ µgn + 6L2 z γ2µ2g + 12L2 1 µ2g + 216ωℓω1L2 1γ nµg k=0 Yk + 216ωℓω1L2 gγ nµg γµg + 3γωℓ(1 + 4ωℓ) E(D0 z,i) m0 z,i 2 2 + 3K(1 + 6ωℓω1)γσ2 1 µgn + 36ωℓω1L2 4γ µgn + 6L2 z γ2µ2g + 25L2 1 2µ2g k=0 Zk, (113) where the last inequality uses γ µgn 432ωℓω1L2g . This further implies γµg + 6γωℓ(1 + 4ωℓ) E(D0 z,i) m0 z,i 2 2 + 6K(1 + 6ωℓω1)γσ2 1 µgn + 72ωℓω1L2 4γ µgn + 12L2 z γ2µ2g + 25L2 1 µ2g k=0 Yk. (114) Applying (106) to (114), we achieve (107). 
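Lemma B.21 tracks the two lower-level recursions of EF-SOBA. In expectation, the z-recursion is a stochastic iteration for the linear system ∇²_{yy}g · z = −∇_y f, and because ‖z*(x)‖ ≤ C_f/µ_g (Lemma B.2), clipping to the radius ρ = C_f/µ_g chosen in the theorems never excludes the solution. The following single-machine sketch, on a synthetic strongly convex quadratic with our own constants, illustrates the resulting contraction at rate roughly (1 − γµ_g) down to a noise floor proportional to γσ; it is an illustration under these assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
d, mu_g, L_g = 20, 0.5, 4.0

# synthetic strongly convex lower level: SPD Hessian H with spectrum in [mu_g, L_g]
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag(np.linspace(mu_g, L_g, d)) @ Q.T
b = rng.standard_normal(d)           # plays the role of grad_y f(x, y*(x))
z_star = -np.linalg.solve(H, b)      # z* = -H^{-1} b, with ||z*|| <= ||b|| / mu_g

rho = np.linalg.norm(b) / mu_g       # clip radius, mirroring rho = C_f / mu_g (here ||b|| plays C_f)
def clip(z, rho):
    nz = np.linalg.norm(z)
    return z if nz <= rho else (rho / nz) * z

gamma, sigma = 1.0 / L_g, 0.5
z = np.zeros(d)
for k in range(401):
    D_z = H @ z + b + sigma * rng.standard_normal(d)   # stochastic estimate of the z-direction
    z = clip(z - gamma * D_z, rho)
    if k % 100 == 0:
        # converges to a neighborhood of z_star whose radius scales with gamma * sigma
        print(k, np.linalg.norm(z - z_star))
```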
Lemma B.22 (Momentum-Gradient Bias). Under Assumptions 2.1, 2.2, 2.3, assuming ρ Cf/µg, the following inequality holds for EF-SOBA (Alg. 2): k=0 E h hk x Φ(xk) 2 i h0 x Φ(x0) 2 θ + Kθσ2 1 n + 6L2 2 k=0 Yk + 6L2 g k=0 Zk + 2L2 Φ θ2 k=0 X k +. (115) Proof. Note that xk+1 is conditionally independent of Dk x given filtration Fk, we have E h hk+1 x Φ(xk+1) 2 =E h (1 θ)hk x + θDk x Φ(xk+1) 2 Distributed Bilevel Optimization with Communication Compression =E Ek (1 θ)(hk x Φ(xk)) + θ(Dk x Ek(Dk x)) + θ(Ek(Dk x) Φ(xk)) + ( Φ(xk) Φ(xk+1)) =E h (1 θ)(hk x Φ(xk)) + θ(Ek(Dk x) Φ(xk)) + ( Φ(xk) Φ(xk+1)) 2 i + θ2E h Dk x Ek(Dk x) 2 (1 θ)E h hk x Φ(xk) 2 " (Ek(Dk x) Φ(xk)) + 1 θ( Φ(xk) Φ(xk+1)) (1 θ)E h hk x Φ(xk) 2 i + 2θE h Ek(Dk x) Φ(xk) 2 i + 2L2 Φ θ X k + + θ2σ2 1 n , (116) where the first inequality uses Jensen s inequality and Lemma B.3, the second inequality uses Cauchy-Schwarz inequality and Lemma B.2. Summing (116) from k = 0 to K 1 and applying Lemma B.4, we achieve (115). Lemma B.23 (Local Momentum Bias). Under Assumptions 2.1, 2.2, 2.3, the following inequality holds for EF-SOBA (Alg. 2): i=1 E h hk x,i Ek(Dk x,i) 2 i=1 h0 x,i E(D0 x,i) 2 + 6L2 5 θ2 k=0 X k + + 36L2 2 θ2 k=0 Yk + 36L2 g θ2 + 2Kσ2 1. (117) Proof. By the update rule of hk x,i, we have E h hk x,i Ek(Dk x,i) 2 =E h (1 θ)(hk 1 x,i Ek 1(Dk 1 x,i )) + θ(Dk 1 x,i Ek 1(Dk 1 x,i )) + (Ek 1(Dk 1 x,i ) Ek(Dk x,i)) 2 (1 θ)E h hk 1 x,i E(Dk 1 x,i ) 2 i + 2θσ2 1 + 2 θE h Ek(Dk x,i) Ek 1(Dk 1 x,i ) 2 where the inequality uses Jensen s inequality, Cauchy-Schwarz inequality and Lemma B.3. For the last term, we have E h Ek(Dk x,i) Ek 1(Dk 1 x,i ) 2 =E h ( 2 xy Gk i 2 xy Gk 1 i )zk + 2 xy Gk 1 i (zk zk 1) + ( F k i F k 1 i ) 2 3L2 gxyρ2 + 3L2 f X k 1 + + Yk 1 + + 3L2 g Zk 1 + , (119) where the inequality uses Cauchy-Schwarz inequality and Assumption 2.1. Averaging (118) from i = 1 to n, summing from k = 1 to K 1 and applying (119), we obtain i=1 E h hk x,i Ek(Dk x,i) 2 i=1 h0 x,i E(D0 x,i) 2 + 2Kσ2 1 + 6L2 2 θ2 k=0 X k + + 6L2 2 θ2 k=0 Zk +. (120) By applying Lemma B.15 to (120), we obtain (117). Lemma B.24 (Global EF21 Compression Error in Upper Level). Under Assumptions 2.1, 2.2, 2.3, 2.4, assuming δu = (1 + ωu) 1, and m0 x,i = h0 x,i for i = 1, , n, the following inequality holds for EF-SOBA (Alg. 2): k=0 E ˆhk x hk x 2 6ωu(1 + ωu)θ i=1 h0 x,i E(D0 x,i) 2 + 36ωu(1 + ωu)L2 5 k=0 X k + + 216ωu(1 + ωu)L2 2 + 216ωu(1 + ωu)L2 g k=0 Zk + 18ωu(1 + ωu)Kθ2σ2 1, (121) where we define hk x 1 n Pn i=1 hk x,i. Distributed Bilevel Optimization with Communication Compression Proof. By Lemma B.16, we have E h mk+1 x,i hk+1 x,i 2 mk x,i + Cu i (hk+1 x,i mk x,i) 1 + ωu hk+1 x,i E h hk+1 x,i mk x,i 2 For E h hk+1 x,i mk x,i 2 i , we have E h hk+1 x,i mk x,i 2 =E h (hk x,i mk x,i) + θ(Dk x,i E(Dk x,i)) + θ(E(Dk x,i) hk x,i) 2 1 + 1 2(1 + ωu) E h mk x,i hk x,i 2 i + (1 + 2(1 + ωu))θ2σ2 1 + (1 + 2(1 + ωu))θ2E h hk x,i E(Dk x,i) 2 where the inequality uses Young s inequality and Assumption 2.3. Combining (122)(123) achieves E h mk+1 x,i hk+1 x,i 2 i 1 1 2(1 + ωu) E h mk x,i hk x,i 2 i + 3ωuθ2σ2 1 + 3ωuθ2E h hk x,i E(Dk x,i) 2 Averaging (124) from i = 1 to n and summing from k = 0 to K 1 gives i=1 E h mk x,i hk x,i 2 i 2(1 + ωu) i=1 m0 x,i h0 x,i 2 + 6ωu(1 + ωu)Kθ2σ2 1 + 6ωu(1 + ωu)θ2 K 1 X i=1 E h hk x,i E(Dk x,i) 2 Applying Lemma B.23 to (125) and noting that m0 x,i = h0 x,i, we obtain i=1 E h mk x,i hk x,i 2 18ωu(1 + ωu)Kθ2σ2 1 + 6ωu(1 + ωu)θ i=1 h0 x,i E(D0 x,i) 2 + 36ωu(1 + ωu)L2 5 + 216ωu(1 + ωu)L2 2 k=0 Yk + 216ωu(1 + ωu)L2 g k=0 Zk. 
(126) By the update rules we have ˆhk x = 1 n Pn i=1 mk x,i, thus (121) is a direct result of (126) by applying the following inequality: k=0 E ˆhk x hk x 2 i=1 E h mk x,i hk x,i 2 Lemma B.25 (Update Direction Bias in Upper Level). Under Assumptions 2.1, 2.2, 2.3, 2.4, assuming ρ Cf/µg, δu = (1 + ωu) 1, and m0 x,i = h0 x,i for i = 1, , n, the following inequality holds for EF-SOBA (Alg. 2): k=0 E ˆhk x Φ(xk) 2 2 h0 x Φ(x0) 2 θ + 12ωu(1 + ωu)θ i=1 h0 x,i E(D0 x,i) 2 + 4L2 Φ θ2 + 72ωu(1 + ωu)L2 5 k=0 X k + + 12ω2L2 2 Distributed Bilevel Optimization with Communication Compression k=0 Zk + 2Kθ n + 36ωu(1 + ωu)Kθ2 σ2 1. (127) Further assume β < min n 2 µg+Lg , µgn 96ωℓω1L2 g o , γ min n 1 Lg , µgn 432ωℓω1L2 g o and applying Lemma B.21, we have k=0 E ˆhk x Φ(xk) 2 θ h0 x Φ(x0) 2 + 12ωu(1 + ωu)θ i=1 h0 x,i E(D0 x,i) 2 + 24ω2(L2 2 + 25κ2 g L2 1) µgβ Y0 + 48ω2L2 g µgγ Z0 + 48ω2(1 + 6ωℓω1)(L2 2 + 25κ2 g L2 1)Kβ µgn σ2 + n + 36ωu(1 + ωu)θ2 + 72ω2(1 + 6ωℓω1)L2 gγ µgn + 48ω2ωℓ(1 + 4ωℓ)(L2 2 + 25κ2 g L2 1)β µgn2 E(D0 y,i) m0 y,i 2 2 + 72ω2ωℓ(1 + 4ωℓ)L2 gγ µgn2 E(D0 z,i) m0 z,i 2 " 4L2 Φ θ2 + 72ωu(1 + ωu)L2 5 + 12ω2(L2 2 + 25κ2 g L2 1) 4L2 y β2µ2g + 16ωℓω1L2 3β µgn 12L2 z γ2µ2g + 72ωℓω1L2 4γ µgn k=0 X k +. (128) Proof. By Cauchy-Schwarz inequality, we have E ˆhk x Φ(xk) 2 2E ˆhk x hk x 2 + 2E h hk x Φ(xk) 2 Summing (129) from k = 0 to K and applying Lemma B.22 and B.24, we obtain (127). Now we are ready to prove Theorem 5.1. We first restate the theorem in a more detailed way. Theorem B.26 (Convergence of EF-SOBA). Under Assumptions 2.1, 2.2, 2.3, 2.4, assuming β < min n 2 µg+Lg , µgn 96ωℓω1L2g γ min n 1 Lg , µgn 432ωℓω1L2 g o , ρ Cf/µg, δℓ= (1 + ωℓ) 1, δu = (1 + ωu) 1, m0 x,i = h0 x,i for i = 1, , n, α min{ 1 2L Φ , C3} with " 4L2 Φ θ2 + 72ωu(1 + ωu)L2 5 + 12ω2(L2 2 + 25κ2 g L2 1) 4L2 y β2µ2g + 16ωℓω1L2 3β µgn 12L2 z γ2µ2g + 72ωℓω1L2 4γ µgn EF-SOBA (Alg. 2) converges as k=0 E h Φ(xk) 2 i 2 0 Φ Kα + 2 0 x Kθ + 12ωu(1 + ωu)θ 0 h K + 24ω2(L2 2 + 25κ2 g L2 1) 0 y µg Kβ + 48ω2L2 g 0 z µg Kγ + 48ω2(1 + 6ωℓω1)(L2 2 + 25κ2 g L2 1)β µgn σ2 n + 36ωu(1 + ωu)θ2 + 72ω2(1 + 6ωℓω1)L2 gγ µgn + 48ω2ωℓ(1 + 4ωℓ)(L2 2 + 25κ2 g L2 1)β µg Kn2 E(D0 y,i) m0 y,i 2 Distributed Bilevel Optimization with Communication Compression + 72ω2ωℓ(1 + 4ωℓ)L2 gγ µg Kn2 E(D0 z,i) m0 z,i 2 If we further choose parameters as α = 1 2L Φ + C 1 3 , ρ = Cf/µg, δℓ= 1 1 + ωℓ , δu = 1 1 + ωu , 2 + 96ωℓω1L2 g µgn + 2ωℓ(1 + 4ωℓ) 0my 2Kσ2(1 + 6ωℓω1) Lg + 432ωℓω1L2 g µgn + 3ωℓ(1 + 4ωℓ) 0mz Kσ2 1(1 + 6ωℓω1) 6ωu(1 + ωu) 0 h 0x + Kσ2 1 n 0x + 3 s 18ωu(1 + ωu)Kσ2 1 0x EF-SOBA (Alg. 2) converges as order k=0 E h Φ(xk) 2 (1 + ωu)2(1 + ωℓ)3/2 n K + ω1/3 u (1 + ωu)1/3 2/3σ2/3 K2/3 + (1 + ωu) + (1 + ωu)2p ωℓ(1 + ωℓ) n K + (1 + ωu)2ω3 ℓ n K where we define 0 Φ Φ(x0), 0 x h0 x Φ(x0) 2 2, 0 h 1 n Pn i=1 h0 x,i E(D0 x,i) 2 2, 0 y y0 y0 2 2, 0 z z0 z0 2 2, 0 my 1 n Pn i=1 E(D0 y,i) m0 y,i 2 2, 0 mz 1 n Pn i=1 E(D0 z,i) m0 z,i 2 2, and max{ 0 Φ, 0 x, 0 h, 0 y, 0 z, 0 my, 0 mz}. Proof. By L Φ-smoothness of Φ (Lemma B.2), we have E Φ(xk) + E Φ(xk), xk+1 xk + L Φ 2 E h xk+1 xk 2 =E Φ(xk) + E 1 2 ˆhk x, xk+1 xk + E Φ(xk) 1 2 ˆhk x, xk+1 xk + L Φ 2 E h xk+1 xk 2 2 E Φ(xk) ˆhk x 2 2 E h Φ(xk) 2 Summing (131) from k = 0 to K 1, we have k=0 E h Φ(xk) 2 k=0 X k + + k=0 E ˆhk x Φ(xk) 2 By the choice of α, we have thus by applying Lemma B.25 to (132) we obtain (130). C. Convergence Acceleration In this section, we present the algorithmic design as well as convergence results of the two variants mentioned in Sec. 6, namely CM-SOBA-MSC and EF-SOBA-MSC. 
C.1. Algorithm Design

In this subsection, we propose two variants of the proposed algorithms that further improve the convergence rate. Both variants are based on the multi-step compression (MSC) technique (He et al., 2023a), shown in Alg. 3.

Algorithm 3 MSC Module
Input: vector $x$, $\omega$-unbiased compressor $\mathcal{C}$, communication rounds $R$;
Initialize $v_0 = 0$;
for $r = 1, \dots, R$ do
    Send the compressed vector $\mathcal{C}(x - v_{r-1})$ to the receiver;
    Update $v_r = v_{r-1} + (1+\omega)^{-1}\,\mathcal{C}(x - v_{r-1})$;
end for
Output: $\mathrm{MSC}(x;\mathcal{C},R) \triangleq v_R / \big(1 - (\omega/(1+\omega))^R\big)$.

Since MSC contains a loop of length $R$, we may also introduce an $R$-fold sampling step to balance the computation and communication complexities. Specifically, we use the following steps instead of (6):
$$D^k_{x,i} = \frac{1}{R}\sum_{r=1}^{R}\Big(\nabla^2_{xy} G(x^k, y^k;\xi^{k,r}_i)\, z^k + \nabla_x F(x^k, y^k;\phi^{k,r}_i)\Big),\qquad
D^k_{y,i} = \frac{1}{R}\sum_{r=1}^{R}\nabla_y G(x^k, y^k;\xi^{k,r}_i),$$
$$D^k_{z,i} = \frac{1}{R}\sum_{r=1}^{R}\Big(\nabla^2_{yy} G(x^k, y^k;\xi^{k,r}_i)\, z^k + \nabla_y F(x^k, y^k;\phi^{k,r}_i)\Big). \tag{133}$$
Intuitively, by replacing $\mathcal{C}^\ell_i(\cdot)$, $\mathcal{C}^u_i(\cdot)$ with $\mathrm{MSC}(\cdot\,;\mathcal{C}^\ell_i,R)$, $\mathrm{MSC}(\cdot\,;\mathcal{C}^u_i,R)$ and (6) with (133), the outer loop of the double-loop variants is equivalent to the original algorithm, except that the gradient/Jacobian sampling variance $\sigma^2$ is reduced by a factor of $R$ and the compression variances $\omega_u, \omega_\ell$ are reduced to $\bar\omega_u \triangleq \omega_u(\omega_u/(1+\omega_u))^R$ and $\bar\omega_\ell \triangleq \omega_\ell(\omega_\ell/(1+\omega_\ell))^R$; see Lemma C.2. For convenience, we name the so-generated variants of CM-SOBA and EF-SOBA as CM-SOBA-MSC and EF-SOBA-MSC, respectively. Detailed descriptions of CM-SOBA-MSC and EF-SOBA-MSC are given in Algorithms 4 and 5, respectively.

Algorithm 4 CM-SOBA-MSC Algorithm
Input: $\alpha,\beta,\gamma,\theta,\rho,R$, $x^0, y^0, z^0$ (with $\|z^0\|_2\le\rho$), $h^0_x$;
for $k = 0, 1, \dots, K-1$ do
    on worker:
        Compute $D^k_{x,i}, D^k_{y,i}, D^k_{z,i}$ as in (133);
        Send $\mathrm{MSC}(D^k_{x,i};\mathcal{C}^u_i,R)$, $\mathrm{MSC}(D^k_{y,i};\mathcal{C}^\ell_i,R)$, and $\mathrm{MSC}(D^k_{z,i};\mathcal{C}^\ell_i,R)$ to the server;
    on server:
        $\hat D^k_x = \frac{1}{n}\sum_{i=1}^{n}\mathrm{MSC}(D^k_{x,i};\mathcal{C}^u_i,R)$, $\hat D^k_y = \frac{1}{n}\sum_{i=1}^{n}\mathrm{MSC}(D^k_{y,i};\mathcal{C}^\ell_i,R)$, $\hat D^k_z = \frac{1}{n}\sum_{i=1}^{n}\mathrm{MSC}(D^k_{z,i};\mathcal{C}^\ell_i,R)$;
        $x^{k+1} = x^k - \alpha h^k_x$; $y^{k+1} = y^k - \beta\hat D^k_y$; $\tilde z^{k+1} = z^k - \gamma\hat D^k_z$; $z^{k+1} = \mathrm{Clip}(\tilde z^{k+1},\rho)$; $h^{k+1}_x = (1-\theta)h^k_x + \theta\hat D^k_x$;
        Broadcast $x^{k+1}, y^{k+1}, z^{k+1}$ to all workers;
end for

Algorithm 5 EF-SOBA-MSC Algorithm
Input: $\alpha,\beta,\gamma,\theta,\rho,\delta_u,\delta_\ell$, $x^0, y^0, z^0$ (with $\|z^0\|_2\le\rho$), $\{m^0_{x,i}\}, \{m^0_{y,i}\}, \{m^0_{z,i}\}, \{h^0_{x,i}\}$, $\hat h^0_x = \frac{1}{n}\sum_{i=1}^n m^0_{x,i}$, $m^0_y = \frac{1}{n}\sum_{i=1}^n m^0_{y,i}$, $m^0_z = \frac{1}{n}\sum_{i=1}^n m^0_{z,i}$;
for $k = 0, 1, \dots, K-1$ do
    on worker:
        Compute $D^k_{x,i}, D^k_{y,i}, D^k_{z,i}$ as in (133);
        $h^{k+1}_{x,i} = (1-\theta)h^k_{x,i} + \theta D^k_{x,i}$;
        $m^{k+1}_{x,i} = m^k_{x,i} + \delta_u\,\mathrm{MSC}(h^{k+1}_{x,i} - m^k_{x,i};\mathcal{C}^u_i,R)$, $m^{k+1}_{y,i} = m^k_{y,i} + \delta_\ell\,\mathrm{MSC}(D^k_{y,i} - m^k_{y,i};\mathcal{C}^\ell_i,R)$, $m^{k+1}_{z,i} = m^k_{z,i} + \delta_\ell\,\mathrm{MSC}(D^k_{z,i} - m^k_{z,i};\mathcal{C}^\ell_i,R)$;
        Send $\mathrm{MSC}(h^{k+1}_{x,i} - m^k_{x,i};\mathcal{C}^u_i,R)$, $\mathrm{MSC}(D^k_{y,i} - m^k_{y,i};\mathcal{C}^\ell_i,R)$, $\mathrm{MSC}(D^k_{z,i} - m^k_{z,i};\mathcal{C}^\ell_i,R)$ to the server;
    on server:
        $\hat D^k_y = m^k_y + \frac{1}{n}\sum_{i=1}^{n}\mathrm{MSC}(D^k_{y,i} - m^k_{y,i};\mathcal{C}^\ell_i,R)$, $\hat D^k_z = m^k_z + \frac{1}{n}\sum_{i=1}^{n}\mathrm{MSC}(D^k_{z,i} - m^k_{z,i};\mathcal{C}^\ell_i,R)$;
        $x^{k+1} = x^k - \alpha\hat h^k_x$; $y^{k+1} = y^k - \beta\hat D^k_y$; $\tilde z^{k+1} = z^k - \gamma\hat D^k_z$; $z^{k+1} = \mathrm{Clip}(\tilde z^{k+1},\rho)$;
        $\hat h^{k+1}_x = \hat h^k_x + \frac{\delta_u}{n}\sum_{i=1}^{n}\mathrm{MSC}(h^{k+1}_{x,i} - m^k_{x,i};\mathcal{C}^u_i,R)$; $m^{k+1}_y = m^k_y + \frac{\delta_\ell}{n}\sum_{i=1}^{n}\mathrm{MSC}(D^k_{y,i} - m^k_{y,i};\mathcal{C}^\ell_i,R)$; $m^{k+1}_z = m^k_z + \frac{\delta_\ell}{n}\sum_{i=1}^{n}\mathrm{MSC}(D^k_{z,i} - m^k_{z,i};\mathcal{C}^\ell_i,R)$;
        Broadcast $x^{k+1}, y^{k+1}, z^{k+1}$ to all workers;
end for

C.2. Convergence of CM-SOBA-MSC and EF-SOBA-MSC

Similar to Lemma B.3, we have the following lemma for the gradient accumulation mechanism.

Lemma C.1 (Reduced Variance). Under Assumption 2.3, we have the following variance bounds for CM-SOBA-MSC (Alg. 4) and EF-SOBA-MSC (Alg. 5): $\mathrm{Var}[D^k_{y,i}\mid\mathcal{F}^k]\le\bar\sigma^2$, $\mathrm{Var}[D^k_{y}\mid\mathcal{F}^k]\le\bar\sigma^2/n$; $\mathrm{Var}[D^k_{x,i}\mid\mathcal{F}^k]\le\bar\sigma_1^2$, $\mathrm{Var}[D^k_{x}\mid\mathcal{F}^k]\le\bar\sigma_1^2/n$; $\mathrm{Var}[D^k_{z,i}\mid\mathcal{F}^k]\le\bar\sigma_1^2$, $\mathrm{Var}[D^k_{z}\mid\mathcal{F}^k]\le\bar\sigma_1^2/n$, where $\bar\sigma^2 = \sigma^2/R$ and $\bar\sigma_1^2 = \sigma_1^2/R$.
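The MSC module of Alg. 3 can be sketched in a few lines. In the sketch below, the rand-K compressor is our own example of an ω-unbiased compressor, the sender's and receiver's states are simulated together, and the Monte-Carlo check at the end only illustrates the variance reduction quantified in Lemma C.2 (the measured relative error falls well below both ω and ω̄ = ω(ω/(1+ω))^R); it is not the paper's implementation.

```python
import numpy as np

def rand_k(x, K, rng):
    # unbiased rand-K compressor, omega = x.size / K - 1
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=K, replace=False)
    out[idx] = (x.size / K) * x[idx]
    return out

def msc(x, compress, omega, R):
    """Multi-step compression (Alg. 3): R rounds of compressing the residual.

    Each round the sender transmits compress(x - v_{r-1}) and both sides update
    v_r = v_{r-1} + (1 + omega)^{-1} * compress(x - v_{r-1}); the output is
    rescaled by 1 / (1 - (omega/(1+omega))^R) so that it remains unbiased.
    """
    v = np.zeros_like(x)
    for _ in range(R):
        v = v + compress(x - v) / (1.0 + omega)
    return v / (1.0 - (omega / (1.0 + omega)) ** R)

rng = np.random.default_rng(3)
d, K, R = 200, 50, 10
omega = d / K - 1
x = rng.standard_normal(d)

err_1 = np.mean([np.sum((rand_k(x, K, rng) - x) ** 2) for _ in range(5000)])
err_R = np.mean([np.sum((msc(x, lambda v: rand_k(v, K, rng), omega, R) - x) ** 2)
                 for _ in range(5000)])
# single-shot relative error ~ omega; MSC relative error is far smaller
print(err_1 / np.sum(x ** 2), err_R / np.sum(x ** 2), omega)
```

The price is R compressed messages per communicated vector, which is exactly the communication/computation trade-off discussed in the ablation of Appendix D.4.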
The following lemma describes the property of the MSC module (Alg. 3). Lemma C.2 ((He et al., 2023a), Lemma 2). Assume C is an ω-unbiased compressor, and R is any positive integer. MSC ( ; C, R) is then an ω-unbiased compressor with ω = ω ω 1 + ω Consequently, MSC ( ; Cu i , R) is equivalent to an ωu-unbiased compressor. Similarly, MSC ; Cℓ i , R is equivalent to an ωℓ-unbiased compressor. The following is a technical lemma. Lemma C.3 ((He et al., 2023a), Lemma 11). For R 4(1 + ω) ln (4(1 + ω)), it holds that We have the following convergence result for CM-SOBA-MSC (Alg. 4). Distributed Bilevel Optimization with Communication Compression Theorem C.4 (Convergence of CM-SOBA-MSC). Under Assumptions 2.1, 2.2, 2.3, 2.4, 4.1 and assuming β < min n 2 µg+Lg , µgn 8 ωℓL2g o , γ min n 1 Lg , µgn 36 ωℓL2g o , θ min n 1, n 12 ωu o , ρ Cf/µg, α min{ 1 2L Φ , C2} with 2L2 Φ θ2 + 26(L2 2 + 25κ2 g L2 1)L2 y β2µ2g + 78L2 g L2 z γ2µ2g CM-SOBA-MSC (Alg. 4) converges as k=0 E h Φ(xk) 2 4 0 Φ Kα + 2 0 x Kθ + 26(L2 2 + 25κ2 g L2 1) 0 y µg Kβ + 52L2 g 0 z µg Kγ + 52(1 + ωℓ)(L2 2 + 25κ2 g L2 1)β µgn σ2 n + 78L2 g(1 + ωℓ)γ n + 104(L2 2 + 25κ2 g L2 1) ωℓβ nµg + 312L2 gρ2 ωℓγ nµg n + 312L2 g ωℓγ nµg b2 f. (134) If we further choose parameters as 4(1 + ωu + ωℓ) ln 4(1 + ωu + ωℓ) + (1 + ωu + ωℓ)2(b2 f + b2 g)2 α = 1 2L Φ + C 1 2 , 2 + 8 ωℓL2 g µgn + 2K (1 + ωℓ) σ2 + 2 ωℓb2g Lg + 36 ωℓL2 g µgn + v u u t3K (1 + ωℓ) σ2 1 + 4 ωℓb2 f + 4 ωℓ(C2 f/µ2g)b2g v u u t K (1 + ωu) σ2 1 + 6 ωub2 f + 6 ωu(C2 f/µ2g)b2g CM-SOBA-MSC (Alg. 4) converges as order k=0 E h Φ(xk) 2 n T + (1 + ωu + ωℓ) Θ(1) where T KR is the total number of iterations of CM-SOBA-MSC (Alg. 4), Θ hides logarithmic terms independent of T, and is as defined in Theorem B.14. Proof. By Lemma C.1 and C.2, the outer loop of CM-SOBA-MSC (Alg. 4) is equivalent to CM-SOBA (Alg. 1), except for using gradient/Jacobian oracles and unbiased compressors with variance reduced by a factor of R. Thus, (134) is a direct corollary of Theorem B.14. By applying Lemma C.3, the choice of R implies ωu 1, ωℓ 1, R ωu σ2 b2 f + b2g , R ωℓ σ2 b2 f + b2g . Distributed Bilevel Optimization with Communication Compression Consequently, by the choice of α, β, γ, θ, ρ and R, CM-SOBA-MSC (Alg. 4) converges as k=0 E h Φ(xk) 2 (1 + ωu + ωℓ) σ + p R( ωu + ωℓ) (bf + bg) n T + (1 + ωu/n + ωℓ/n) R n T + (1 + ωu + ωℓ) Θ(1) For EF-SOBA-MSC (Alg. 5), we have the following notations for convenience: ω1 1 + 6 ωℓ(1 + ωℓ), ω2 1 + 36 ωu(1 + ωu). Now we are ready to prove Theorem 6.1. The detailed result is as follows. Theorem C.5 (Convergence of EF-SOBA-MSC). Under assumptions 2.1, 2.2, 2.3, 2.4 and assuming ρ Cf/µg, β < min n 2 µg+Lg , µgn 96 ωℓ ω1 L2 g o , γ min n 1 Lg , µgn 432 ωℓ ω1L2g o , δℓ= (1 + ωℓ) 1, δu = (1 + ωu) 1, m0 x,i = h0 x,i for i = 1, , n, α min n 1 2L Φ , C4 o with " 4L2 Φ θ2 + 72 ωu(1 + ωu)L2 5 + 12 ω2(L2 2 + 25κ2 g L2 1) 4L2 y β2µ2g + 16 ωℓ ω1L2 3β µgn + 12 ω2L2 g 12L2 z γ2µ2g + 72 ωℓ ω1L2 4γ µgn EF-SOBA-MSC (Alg. 
5) converges as k=0 E h Φ(xk) 2 i 2 0 Φ Kα + 2 0 x Kθ + 12 ωu(1 + ωu)θ 0 h K + 24 ω2(L2 2 + 25κ2 g L2 1) 0 y µg Kβ + 48 ω2L2 g 0 z µg Kγ + 48 ω2(1 + 6 ωℓ ω1)(L2 2 + 25κ2 g L2 1)β µgn σ2 n + 36 ωu(1 + ωu)θ2 + 72 ω2(1 + 6 ωℓ ω1)L2 gγ µgn + 48 ω2 ωℓ(1 + 4 ωℓ)(L2 2 + 25κ2 g L2 1)β µg Kn2 E(D0 y,i) m0 y,i 2 + 72 ω2 ωℓ(1 + 4 ωℓ)L2 gγ µg Kn2 E(D0 z,i) m0 z,i 2 If we further choose parameters as R = 4(1 + ωu + ωℓ) ln 4(1 + ωu + ωℓ) + (1 + ωu)2 n3 α = 1 2L Φ + C 1 4 , ρ = Cf/µg, δℓ= 1 1 + ωℓ , δu = 1 1 + ωu , 2 + 96 ωℓ ω1L2 g µgn + 2 ωℓ(1 + 4 ωℓ) 0my 2K σ2(1 + 6 ωℓ ω1) Lg + 432 ωℓ ω1L2 g µgn + 3 ωℓ(1 + 4 ωℓ) 0mz K σ2 1(1 + 6 ωℓ ω1) Distributed Bilevel Optimization with Communication Compression 6 ωu(1 + ωu) 0 h 0x + K σ2 1 n 0x + 3 s 18 ωu(1 + ωu)K σ2 1 0x EF-SOBA-MSC (Alg. 5) converges as order k=0 E h Φ(xk) 2 n T + (1 + ωu + ωℓ) Θ(1) where T : KR is the total number of iterations of EF-SOBA-MSC (Alg. 5), Θ hides logarithmic terms independent of T, and is as defined in Theorem B.26. Proof. By Lemma C.1 and C.2, the outer loop of EF-SOBA-MSC (Alg. 5) is equivalent to EF-SOBA (Alg. 2), except for using gradient/Jacobian oracles and unbiased compressors with variance reduced by a factor of R. Thus, (135) is a direct corollary of Theorem B.26. By applying Lemma C.3, the choice of R implies ωu 1, ωℓ 1, R ωu σ Consequently, by the choice of δu, δℓ, α, β, γ, θ, ρ and R, EF-SOBA-MSC (Alg. 5) converges as k=0 E h Φ(xk) 2 (1 + ωu)2(1 + ωℓ)3/2 n T + (R ωu)1/3(1 + ωu)1/3 2/3σ2/3 T 2/3 + (1 + ωu) R + (1 + ωu)2p ωℓ(1 + ωℓ) R n T + (1 + ωu)2 ω3 ℓ R n T n T + (1 + ωu + ωℓ) Θ(1) D. Experimental Specifications D.1. Hyper-Representation Problem formulation. Following (Franceschi et al., 2018), the hyper-representation problem can be formulated as: min λ L(λ) = 1 |Dv| ξ Dv L(ω (λ), λ; ξ) s.t. ω (λ) = arg min ω 1 |Dτ| η Dτ L(ω, λ; η) where L stands for the cross entropy loss here, Dv and Dτ denote the validation set and training set, respectively. Hyperrepresentation consists of two nested problems, where the upper-level optimizes the intermediate representation parameter λ to obtain better feature representation on validation data, and the lower-level optimizes the weights ω of the downstream tasks on training data. Datasets and model architecture. For MNIST, we use a 2-layer multilayer perceptron (MLP) with 200 hidden units. Therefore, the upper problem optimizes the hidden layer with 157,000 parameters, and the lower problem optimizes the output layer with 2,010 parameters. For CIFAR-10, we train the 7-layer Le Net (Lecun et al., 1998), where we treat the last fully connected layer s parameters as lower-level variables and the rest layers parameters as upper-level variables. Hyperparameter settings. According to the optimal relation shown in (10), we set the compression parameter K = 200 for lower-level and K = 2000 for upper-level. The dataset is partitioned to 10 workers both under homogeneous and heterogeneous distributions. The batch size of workers stochastic oracle is 512 for MNIST and 1000 for CIFAR-10. The moving average parameter θ of CM-SOBA and EF-SOBA is 0.1. We optimize the stepsizes for all compared algorithms via grid search, each ranging from [0.001, 0.05, . . . , 0.5], which is summarized in Table 2. Distributed Bilevel Optimization with Communication Compression Table 2. 
Stepsize selection for experiments of hyper-representation.

Algorithm   Dataset     Stepsize [α, β, γ]
NC-SOBA     MNIST       [0.5, 0.1, 0.01]
C-SOBA      MNIST       [0.1, 0.1, 0.01]
CM-SOBA     MNIST       [0.1, 0.1, 0.01]
EF-SOBA     MNIST       [0.5, 0.1, 0.01]
NC-SOBA     CIFAR-10    [0.1, 0.001, 0.001]
C-SOBA      CIFAR-10    [0.05, 0.001, 0.001]

D.2. Hyperparameter Optimization

Problem formulation. Hyperparameter optimization can be formulated as
$$\min_{\lambda}\; \mathcal{L}(\lambda) = \frac{1}{|\mathcal{D}_v|}\sum_{\xi\in\mathcal{D}_v} L(\omega^\star(\lambda);\xi)\quad \text{s.t.}\quad \omega^\star(\lambda) = \arg\min_{\omega}\frac{1}{|\mathcal{D}_\tau|}\sum_{\eta\in\mathcal{D}_\tau}\big(L(\omega;\eta) + \mathcal{R}(\omega,\lambda)\big),$$
where $L$ is the loss function, $\mathcal{R}(\omega,\lambda)$ is a regularizer, and $\mathcal{D}_v$ and $\mathcal{D}_\tau$ denote the validation set and the training set, respectively. To perform regularized logistic regression, following (Pedregosa, 2016; Grazzi et al., 2020; Chen et al., 2022), we define $L = \log(1 + e^{-y x^\top\omega})$ and $\mathcal{R} = \frac{1}{2}\sum_{i=1}^{p} e^{\lambda_i}\omega_i^2$ on the synthetic dataset. For MNIST, the model parameter is $\omega\in\mathbb{R}^{p\times c}$ with $p = 784$ and $c = 10$; following (Grazzi et al., 2020), we set $L$ to the cross-entropy loss and $\mathcal{R} = \frac{1}{cp}\sum_{j=1}^{c}\sum_{i=1}^{p} e^{\lambda_i}\omega_{ij}^2$.

Datasets. We construct synthetic heterogeneous data with the linear model $y = \mathrm{sign}(x^\top\omega + \epsilon z)$, where $\epsilon = 0.1$ is the noise rate and $z\in\mathbb{R}$ is noise sampled from the standard normal distribution. The distribution of $x\in\mathbb{R}^{100}$ on worker $i$ is $\mathcal{N}(0, i^2)$ if $i\%2 = 0$ and $\chi^2(i)$ otherwise. There are 5 workers, each holding 500 training and 500 validation samples. For MNIST, we partition the data across 10 workers under both homogeneous and heterogeneous data distributions.

Hyperparameter settings. Since the upper- and lower-level variables share the same compressed dimension in these experiments, we use a uniform compression parameter for both levels: K = 10 for MNIST and K = 20 for the synthetic dataset. The batch size is 50 for the synthetic dataset and 512 for MNIST. We tune the stepsizes of all compared algorithms via grid search over [0.001, 0.05, . . . , 0.5]; the selected values are summarized in Table 3.

Table 3. Stepsize selection for experiments of hyperparameter optimization.

Algorithm   Dataset     Stepsize [α, β, γ]
NC-SOBA     Synthetic   [0.1, 0.01, 0.001]
C-SOBA      Synthetic   [0.05, 0.01, 0.001]
CM-SOBA     Synthetic   [0.05, 0.01, 0.001]
EF-SOBA     Synthetic   [0.1, 0.01, 0.001]
NC-SOBA     MNIST       [0.1, 0.1, 0.1]
C-SOBA      MNIST       [0.1, 0.1, 0.1]
CM-SOBA     MNIST       [0.1, 0.1, 0.1]
EF-SOBA     MNIST       [0.1, 0.1, 0.1]

Additional results on synthetic data. Figure 5 shows that, under the heterogeneous data distribution, our proposed EF-SOBA attains a similar convergence rate while communicating far fewer bits. These results are consistent with those on MNIST in Figure 3, indicating that our algorithms apply broadly across datasets and problem setups.

Figure 5. Hyperparameter optimization on synthetic heterogeneous data (plotted against iteration and against #Bits/n for NC-SOBA, C-SOBA, CM-SOBA, and EF-SOBA).

D.3. Additional Results

Results on CIFAR-10. Under the experimental setup in Appendix D.1, we evaluate the compressed algorithms on CIFAR-10 under homogeneous data distributions. We only plot C-SOBA against NC-SOBA here because, from the MNIST results, the other algorithms perform worse than C-SOBA under homogeneous data distributions. From Figure 6, with nearly 10× savings in communication bits, our compressed algorithm converges to the same test accuracy as the non-compressed algorithm. This validates the effectiveness of the proposed algorithms even with a more complicated model architecture and a larger dataset.
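For completeness, the synthetic heterogeneous dataset described in the Datasets paragraph of Appendix D.2 above can be generated as in the following minimal sketch. The shared ground-truth model, the random seeds, and reading $\mathcal{N}(0, i^2)$ as a normal with standard deviation $i$ are our own choices for illustration, not specifications from the paper.

```python
import numpy as np

def make_worker_data(i, n_samples=1000, d=100, eps=0.1, seed=0):
    """Synthetic heterogeneous data for worker i (workers indexed 1..5):
    features are N(0, i^2) on even-indexed workers and chi^2(i) on odd ones,
    labels are y = sign(x^T w + eps * z) with z ~ N(0, 1)."""
    rng = np.random.default_rng(seed + i)
    w = np.random.default_rng(seed).standard_normal(d)   # ground-truth model shared by all workers
    if i % 2 == 0:
        X = rng.normal(0.0, i, size=(n_samples, d))       # std i, i.e. variance i^2
    else:
        X = rng.chisquare(i, size=(n_samples, d))         # chi-square with i degrees of freedom
    y = np.sign(X @ w + eps * rng.standard_normal(n_samples))
    return X, y

# 5 workers, 500 training + 500 validation samples each, as in the synthetic setup
data = {i: make_worker_data(i) for i in range(1, 6)}
train = {i: (X[:500], y[:500]) for i, (X, y) in data.items()}
valid = {i: (X[500:], y[500:]) for i, (X, y) in data.items()}
```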
Figure 6. Hyper-representation on CIFAR-10 under homogeneous data distributions (test accuracy vs. iteration and vs. #Bits/n for NC-SOBA and C-SOBA).

Notice that the backbone test accuracy is not satisfactory here; we suspect this is because the bilevel structure of the hyper-representation problem makes the training itself challenging.

More comparison baselines. In our study, we evaluate our compressed algorithms for distributed stochastic bilevel optimization against the SOBA algorithm, which serves as our non-compression baseline (referred to as NC-SOBA). We further include FedNest (Tarzanagh et al., 2022) as another non-compression baseline for comparison with SOBA. We conduct hyper-representation experiments on MNIST with an MLP backbone under homogeneous data distributions, implementing FedNest from its publicly available source code. As illustrated in Figure 7, NC-SOBA converges faster than FedNest in terms of communicated bits.

Figure 7. Hyper-representation on MNIST under homogeneous data distributions (test accuracy vs. #Bits/n for FedNest and NC-SOBA).

Tuning the momentum parameter in CM-SOBA. In Figure 2, CM-SOBA performs worse than C-SOBA because the momentum parameter θ was fixed to 0.1 without further tuning, which can be sub-optimal. To address this, we conducted additional experiments with a more careful selection of the momentum parameter. The results in Figure 8 show that, with an appropriately tuned momentum parameter, CM-SOBA indeed outperforms C-SOBA, underscoring the importance of tuning θ for an effective implementation of CM-SOBA.

Figure 8. Hyper-representation on MNIST under heterogeneous data distributions (test accuracy vs. iteration for NC-SOBA, C-SOBA, CM-SOBA, and EF-SOBA).

D.4. Ablation Studies

Ablation on MSC rounds. In Sec. 6 we propose algorithm variants that improve the theoretical convergence rate by utilizing multi-step compression and gradient accumulation. It is worth noting that CM-SOBA (Alg. 1) is a special case of CM-SOBA-MSC (Alg. 4) with R = 1, and likewise EF-SOBA (Alg. 2) is a special case of EF-SOBA-MSC (Alg. 5). A natural question is therefore how to select R in practice, and whether R > 1 can be more effective than R = 1. To address this, we conduct ablation experiments on the hyperparameter optimization task on MNIST; the problem formulation, data, and stepsizes are consistent with Appendix D.2. Figure 9 displays the loss curves of CM-SOBA-MSC (left) and EF-SOBA-MSC (right) with different values of R on MNIST under heterogeneous data distributions. With R = 2, both algorithms perform better than with R = 1, 4, or 5. When R is too small, the gradient bias induced by compression error and sampling randomness slows down convergence, whereas a much larger R spends additional communication and computation on update directions that improve only marginally, making it less effective. In general, there is a trade-off in the selection of R, and we recommend choosing a suitable R by cross validation.
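The recommendation above, choosing R by cross validation, amounts to a simple selection loop such as the one sketched below. Everything here is hypothetical scaffolding: the function names, the candidate grid, and the toy stand-in for the training routine are ours, not an interface from the paper's code.

```python
def select_R(run_algorithm, candidates=(1, 2, 4, 5), bit_budget=2e5):
    """Pick the MSC round number R with the best validation score under a fixed
    communication budget. `run_algorithm` is a user-supplied callable (hypothetical)
    that trains, e.g., CM-SOBA-MSC with the given R until the budget is spent and
    returns the validation loss."""
    scores = {}
    for R in candidates:
        # with R sampling/compression rounds per iteration, the same bit budget
        # allows proportionally fewer outer iterations
        scores[R] = run_algorithm(R=R, bit_budget=bit_budget)
    best = min(scores, key=scores.get)
    return best, scores

# toy stand-in for illustration only (its minimum is placed at R = 2 by construction)
demo = lambda R, bit_budget: abs(R - 2) + 0.1 * R
print(select_R(demo))
```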
One can also observe from Figure 9 that EF-SOBA-MSC with R = 1 (which is exactly EF-SOBA) performs worse than CM-SOBA-MSC with R = 1 (which is exactly CM-SOBA), even though the data is constructed heterogeneously. This phenomenon is consistent with our convergence results: EF-SOBA is more susceptible to large ω than CM-SOBA. Consequently, we recommend using EF-SOBA-MSC with R > 1 when severely aggressive compressors are applied.

Figure 9. Ablation on MSC rounds R for CM-SOBA-MSC (left) and EF-SOBA-MSC (right), conducted on the hyperparameter optimization task with MNIST heterogeneous data (loss vs. #Bits/n for R = 1, 2, 4, 5).

Table 4. Compressor choices under different strategies.

Strategy                                K for lower-level rand-K   K for upper-level rand-K   Communicated entries per iter
1: ω_u/ω_ℓ ≈ √(d_x/(2d_y))              6                          68                         80
2: ω_u ≈ ω_ℓ (1:1)                      1                          78                         80

Ablation on compressor choices. We evaluate different compressor choices while keeping the per-round communication cost fixed. In Figure 10, we compare the performance of C-SOBA on the hyper-representation problem on MNIST, where d_x = 157000 and d_y = 2010. All runs use identical stepsizes: α = 2e-2, β = 8e-3, and γ = 8e-4. The rand-K compressors, listed in Table 4, are selected according to two strategies: Strategy 1 chooses the ratio ω_u/ω_ℓ = Θ(√(d_x/(2d_y))), as recommended in (10), while Strategy 2 chooses ω_u ≈ ω_ℓ (a 1:1 ratio). It is evident from Figure 10 that Strategy 1 (as recommended in (10)) outperforms Strategy 2.

Figure 10. Ablation on compressor choices, conducted on the hyper-representation task with MNIST heterogeneous data (test accuracy vs. #Bits/n).
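To make the per-iteration budget of Table 4 concrete, the sketch below counts the uplink traffic of one worker under the two strategies: each worker sends one compressed upper-level vector (D_x) and two compressed lower-level vectors (D_y and D_z), so the number of communicated entries is K_up + 2·K_low. The 32-bit value payload and ceil(log2 d) index bits per transmitted entry are our assumptions, not a specification from the paper, and server broadcasts are not counted.

```python
import math

def rand_k_bits(d, K, value_bits=32):
    # cost of one rand-K message: K values plus K coordinate indices
    return K * (value_bits + math.ceil(math.log2(d)))

def per_iter_cost(K_low, K_up, dx=157000, dy=2010):
    entries = K_up + 2 * K_low                      # matches the last column of Table 4
    bits = rand_k_bits(dx, K_up) + 2 * rand_k_bits(dy, K_low)
    omega_u, omega_l = dx / K_up - 1, dy / K_low - 1
    return entries, bits, omega_u, omega_l

print(per_iter_cost(K_low=6, K_up=68))   # Strategy 1: 80 entries, omega_u/omega_l ~ sqrt(dx/(2*dy))
print(per_iter_cost(K_low=1, K_up=78))   # Strategy 2: 80 entries, omega_u ~ omega_l
```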