# LASER: Linear Compression in Wireless Distributed Optimization

Ashok Vardhan Makkuva*1, Marco Bondaschi*1, Thijs Vogels1, Martin Jaggi1, Hyeji Kim2, Michael Gastpar1

Abstract

Data-parallel SGD is the de facto algorithm for distributed optimization, especially for large-scale machine learning. Despite its merits, the communication bottleneck is one of its persistent issues. Most compression schemes designed to alleviate it either assume noiseless communication links or fail to achieve good performance on practical tasks. In this paper, we close this gap and introduce LASER: LineAr CompreSsion in WirEless DistRibuted Optimization. LASER capitalizes on the inherent low-rank structure of gradients and transmits them efficiently over noisy channels. While enjoying theoretical guarantees similar to those of classical SGD, LASER shows consistent gains over baselines on a variety of practical benchmarks. In particular, it outperforms state-of-the-art compression schemes on challenging computer vision and GPT language modeling tasks. On the latter, we obtain a 50-64% improvement in perplexity over our baselines for noisy channels. Code is available at https://github.com/Bond1995/LASER.

1. Introduction

Distributed optimization is one of the most widely used frameworks for training large-scale deep learning models (Bottou et al., 2018; Dean et al., 2012; Tang et al., 2020). In particular, data-parallel SGD is the workhorse algorithm for this task. Underpinning this approach is the communication of large gradient vectors between the workers and the central server, which performs their aggregation. While these methods harness the inherent parallelism to reduce the overall training time, their communication cost is a major bottleneck that limits scalability to large models. Design of communication-efficient distributed algorithms is thus a must for reaping the full benefits of distributed optimization (Xu et al., 2020).

Existing approaches to reduce the communication cost can be broadly classified into two themes: (i) compressing the gradients before transmission; or (ii) utilizing the communication link for native over-the-air aggregation (averaging) across workers. Along (i), a number of gradient compression schemes have been designed, such as quantization (Bernstein et al., 2018; Vargaftik et al., 2022), sparsification (Aji & Heafield, 2017; Isik et al., 2022), hybrid methods (Jiang et al., 2018; Basu et al., 2019), and low-rank compression (Wang et al., 2018; Vogels et al., 2019). These methods show gains over full-precision SGD in various settings (Xu et al. (2020) is a detailed survey). Notwithstanding their merits, their key shortcoming is the assumption of a noiseless communication link between the clients and the server. In settings such as federated learning with differential privacy or wireless communication, these links are noisy. Making them noiseless requires error-correcting codes, which exacerbates the latency, as the server needs to wait until it receives the gradient from each worker before aggregating (Guo et al., 2020).

*Equal contribution. 1School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland. 2Department of Electrical and Computer Engineering, UT Austin, Austin, TX, USA. Correspondence to: Ashok Vardhan Makkuva. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).
Under theme (ii), the communication cost is reduced by harnessing the physical-layer aspects of (noisy) communication. In particular, the superposition nature of wireless channels is exploited to perform over-the-air averaging of gradients across workers, which reduces the latency; see e.g. (Shi et al., 2020) and the references therein. Notable works include A-DSGD (Amiri & Gündüz, 2020b), analog gradient aggregation (Guo et al., 2020; Zhu et al., 2019), channel-aware quantization (Chang & Tandon, 2020), etc. However, to the best of our knowledge, the majority of these approaches are restricted to synthetic datasets and shallow neural networks (often a single layer) and do not scale well to practical neural network models (which we verify in Sec. 4). This leads to a natural question:

Can we design efficient and practical gradient compression schemes for noisy communication channels?

In this work, we precisely address this question and propose LASER, a principled gradient compression scheme for distributed training over noisy wireless channels. Specifically, we make the following contributions:

- Capitalizing on the inherent low-rank structure of the gradients, LASER efficiently computes these low-rank factors and transmits them reliably over the noisy channel while allowing the gradients to be averaged in transit (Sec. 3).
- We show that LASER enjoys a convergence rate similar to that of classical SGD for both quasi-convex and non-convex functions, except for a small additive constant depending on the channel degradation (Thm. 1).
- We empirically demonstrate the superiority of LASER over the baselines on the challenging tasks of (i) language modeling with GPT-2 on WIKITEXT-103 and (ii) image classification with RESNET18 (CIFAR10, CIFAR100) and a 1-LAYER NN (MNIST). With high gradient compression (165×), LASER achieves a 50-64% perplexity improvement in the low and moderate power regimes on WIKITEXT-103. To the best of our knowledge, LASER is the first to exhibit such gains for GPT language modeling (Sec. 4).

Figure 1: Final test perplexity after 20k iterations (lower is better) vs. power budget for GPT-2 language modeling on WIKITEXT-103. LASER consistently requires orders of magnitude less power than other methods for the same perplexity.

Table 1: Power required (lower is better) to reach the target perplexity on WIKITEXT-103. Z-SGD sends the uncompressed gradients directly, while LASER sends a rank-4 approximation. LASER requires 16× less power than Z-SGD to achieve the target perplexity over a wide interval. In the very-high-power regime, with perplexity close to that of the noiseless SGD, we see no power gains.

  Target perplexity   Power (Z-SGD)   Power (LASER)   Reduction
  80                  160 K           10 K            16×
  50                  640 K           40 K            16×
  40                  2560 K          160 K           16×
  35                  2560 K          160 K           16×

Notation. Euclidean vectors and matrices are denoted by bold letters x, y, M, etc. ∥·∥ denotes the Frobenius norm for matrices and the ℓ2-norm for Euclidean vectors. O(·) is an upper bound subsuming universal constants, whereas Õ(·) hides any logarithmic problem-variable dependencies.

2. Background

Distributed optimization. Consider the (synchronous) data-parallel distributed setting where we minimize an objective $f: \mathbb{R}^d \to \mathbb{R}$ defined as the empirical loss on a global dataset $\mathcal{D} = \{(x_j, y_j)\}_{j=1}^N$:

$$\min_{\theta \in \mathbb{R}^d} f(\theta), \qquad f(\theta) \triangleq \frac{1}{N} \sum_{j=1}^{N} \ell(x_j, y_j; \theta),$$

where $\ell(\cdot)$ evaluates the loss for each data sample $(x_j, y_j)$ on model $\theta$.
In this setup, there are k (data-homogeneous) training clients, where the ith client has access to a stochastic gradient oracle gi, e.g. mini-batch gradient on a set of samples randomly chosen from D, such that E[gi|θ] = f(θ) for all θ Rd. In distributed SGD (Robbins & Monro, 1951; Bottou et al., 2018), the server aggregates all gis and performs the following updates: θt+1 = θt γt 1 i=1 g(t) i , E[g(t) i |θt] = f(θt), t 0, where {γt}t 0 is a stepsize schedule. Implicit here is the assumption that the communication link between the clients and the server is noiseless, which we expound upon next. Communication model. For the communication uplink from the clients to the server, we consider the standard wireless channel for over-the-air distributed learning (Amiri & Gündüz, 2020a; Guo et al., 2020; Zhu et al., 2019; Chang & Tandon, 2020; Wei & Shen, 2022a): the additive slow-fading channel, e.g., the classical multiple-access-channel (Nazer & Gastpar, 2007). The defining property of this family is the superposition of incoming wireless signals (enabling over-the-air computation) possibly corrupted together with an independent channel noise (Shi et al., 2020). Specifically, we denote the channel as a (random) mapping ZP ( ) that LASER: Linear Compression in Wireless Distributed Optimization transforms the set of (time-varying) messages transmitted by the clients {xi}i [k] Rd to its noisy version y Rd received by the server: y = ZP ({xi}) i=1 xi + Z, where the noise Z Rd is independent of the channel inputs and has zero mean and unit variance per dimension, i.e. E Z 2 = d. The power constraint on each client xi 2 Pt at time t serves as a communication cost (and budget), while the power policy {Pt} allots the total budget P over T epochs as per the average power constraint (Wei & Shen, 2022b; Amiri & Gündüz, 2020b). A key metric that captures the channel degradation quality is the signalto-noise ratio per coordinate (SNR), defined as the ratio between the average signal energy (P) and that of the noise (d), i.e. SNR P/d. The larger it is the better the signal fidelity. The power budget P encourages the compression of signals: if each client can transmit the same information xi via fewer entries (smaller d), they can utilize more power per entry (higher SNR) and hence a more faithful signal. The downlink communication from the server to the clients is usually modeled as a standard broadcast channel (Cover, 1972): for input x with x 2 Pb, the output yi = x+Zi, one for each of the clients. Usually in practice, Pb P and therefore we set Pb = , though our results readily extend to finite Pb. In the rest of the paper by channel we mean the uplink channel. The channel model in Eq. (1) readily generalizes to the fast fading setup as discussed in Sec. 4.4. Gradient transmission over the channel. In the distributed optimization setting the goal is to communicate the (timevarying) local gradients gi Rd to the central server over the noisy channel in Eq. (1). Here we set the messages xi as linear scaling of gradients (as we want to estimate the gradient average), i.e. xi = ai gi with the scalars ai R enforcing the power constraints: i=1 ai gi + Z, ai gi 2 Pt. (2) Now the received signal is a weighted sum of the gradients corrupted by noise, whereas we need the sum of the gradients P i gi (upto zero mean additive noise) for the model training. 
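To make the channel model concrete, here is a minimal NumPy sketch of one aggregation round in the spirit of Eq. (2): each client scales its gradient to meet the power budget, the channel superposes the scaled inputs and adds unit-variance noise, and the server rescales to recover a noisy gradient average. The dimensions, power level, and synthetic gradients are ours, chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, P = 4, 10_000, 1_000.0          # clients, gradient dimension, power budget P_t

# Local stochastic gradients (stand-ins for real mini-batch gradients).
grads = [rng.standard_normal(d) for _ in range(k)]

# Uniform scaling a_i = sqrt(P) / max_j ||g_j|| so that ||a_i g_i||^2 <= P for every client.
a = np.sqrt(P) / max(np.linalg.norm(g) for g in grads)

# Multiple-access channel: superposition of the scaled inputs plus unit-variance noise.
y = sum(a * g for g in grads) + rng.standard_normal(d)   # E||Z||^2 = d

# The server rescales to obtain a noisy estimate of the gradient average.
g_avg_hat = y / (a * k)
g_avg = sum(grads) / k
print("per-coordinate SNR:", P / d)
print("relative error:", np.linalg.norm(g_avg_hat - g_avg) / np.linalg.norm(g_avg))
```

Raising P (fewer entries per unit power, or more power per entry) shrinks the relative error, which is exactly the compression incentive described above.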
Towards this goal, a common mild technical assumption is that the gradient norms $\{\|g_i\|\}$ are known at the receiver at each communication round (Chang & Tandon, 2020; Guo et al., 2020); this can be relaxed in practice (Sec. 4). The optimal scalars are then given by $a_i = \sqrt{P_t}/(\max_j \|g_j\|)$ for all $i \in [k]$, i.e. uniform across all the clients (App. E.1). Substituting this $a_i$ into Eq. (2) and rearranging, the effective channel can be written as

$$\tilde{y} = \widetilde{Z}_P(\{g_i\}) \triangleq \frac{1}{k} \sum_{i=1}^{k} g_i + \frac{\max_i \|g_i\|}{k\sqrt{P_t}}\, Z. \qquad \text{(noisy channel)}$$

Equivalently, we can view this as the actual channel model, where the server receives the gradient average corrupted by a zero-mean noise proportional to the gradient norms. Note that the noise magnitude decays over time as the gradients converge to zero. We denote $\widetilde{Z}_P(\cdot)$ simply as $Z_P(\cdot)$ henceforth, as the two mappings are equivalent.

Z-SGD. Recall that SGD aggregates the uncompressed gradients directly. In the presence of the noisy channel, it naturally modifies to

$$\theta_{t+1} = \theta_t - \gamma_t\, Z_P(\{g_i^{(t)}\}). \qquad \text{(Z-SGD)}$$

Thus Z-SGD is a canonical baseline to compare against. It has two sources of stochasticity: one stemming from the stochastic gradients and the other from the channel noise. While the gradient in the Z-SGD update still has the same conditional mean as in the noiseless case (the channel noise in (noisy channel) is zero mean), it has higher variance due to the additive noise term. When $P = \infty$, Z-SGD reduces to SGD.

3. LASER: Novel Linear Compression cum Transmission Scheme

In this section we describe our main contribution, LASER, a novel method to compress gradients and transmit them efficiently over noisy channels. The central idea underpinning our approach is that, given the channel power constraint in Eq. (1), we can obtain a more faithful gradient signal at the receiver by transmitting an appropriately compressed version (fewer entries sent and hence more power per entry), as opposed to naively sending the full gradient as in Z-SGD. This raises a natural question: what is a good compression scheme that facilitates this? To address it, we posit that we can capitalize on the inherent low-rank structure of the gradient matrices (Martin & Mahoney, 2021; Mazumder et al., 2010; Yoshida & Miyato, 2017) for efficient gradient compression and transmission. Indeed, as illustrated below and in Thm. 1, we obtain a variance reduction of the order of the smaller matrix dimension when the gradient matrices are approximately low-rank.

More concretely, let us consider the single-worker case where the goal is to transmit the stochastic gradient $g \in \mathbb{R}^{m \times m}$ (viewed as a matrix) to the server with constant power $P_t = P$. Further, let us suppose that $g$ is approximately rank-one, i.e. $g \approx pq^\top$, with the factors $p, q \in \mathbb{R}^m$ known. If we transmit $g$ uncompressed over the noisy channel, as in Z-SGD, the server receives $y_{\text{Z-SGD}} = g + (\|g\|/\sqrt{P})\, Z \in \mathbb{R}^{m \times m}$. On the other hand, if we capitalize on the low-rank structure of $g$ and instead transmit the factors $p$ and $q$ with power $P/2$ each, the server receives

$$y_p = p + \frac{\|p\|}{\sqrt{P/2}}\, Z_p, \qquad y_q = q + \frac{\|q\|}{\sqrt{P/2}}\, Z_q,$$

where $Z_p$ and $Z_q$ are the channel noises. We then reconstruct the stochastic gradient as

$$y_{\text{LASER}} \triangleq y_p\, y_q^\top = \Big(p + \frac{\|p\|}{\sqrt{P/2}}\, Z_p\Big)\Big(q + \frac{\|q\|}{\sqrt{P/2}}\, Z_q\Big)^{\top}. \qquad (3)$$

Conditioned on the gradient $g$, the received signal $y$ has the same mean $g$ under both Z-SGD and LASER, but for Z-SGD the variance is $\mathbb{E}\|y_{\text{Z-SGD}} - g\|^2 = \|g\|^2/\mathrm{SNR}$ with $\mathrm{SNR} \triangleq P/m^2$, whereas that of LASER is roughly $\|g\|^2 \cdot \frac{4}{m\,\mathrm{SNR}}\big(1 + \frac{1}{m\,\mathrm{SNR}}\big)$, as further elaborated in Definition 1.
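The variance comparison above is easy to check numerically. The sketch below transmits an exactly rank-one g both ways (uncompressed as in Z-SGD, and as two factors with power P/2 each as in Eq. (3)) and compares the Monte-Carlo mean squared errors with the two expressions just quoted; the dimension, power level, and trial count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, trials = 100, 2000
P = 2.0 * m**2                            # per-coordinate SNR = P / m^2 = 2
p, q = rng.standard_normal(m), rng.standard_normal(m)
g = np.outer(p, q)                        # exactly rank-one gradient
snr = P / m**2

mse_zsgd, mse_laser = 0.0, 0.0
for _ in range(trials):
    # Z-SGD: send g directly with power P.
    y_zsgd = g + (np.linalg.norm(g) / np.sqrt(P)) * rng.standard_normal((m, m))
    # LASER: send the factors p and q with power P/2 each, then take the outer product.
    y_p = p + (np.linalg.norm(p) / np.sqrt(P / 2)) * rng.standard_normal(m)
    y_q = q + (np.linalg.norm(q) / np.sqrt(P / 2)) * rng.standard_normal(m)
    mse_zsgd += np.sum((y_zsgd - g) ** 2) / trials
    mse_laser += np.sum((np.outer(y_p, y_q) - g) ** 2) / trials

g2 = np.sum(g ** 2)
print("Z-SGD :", mse_zsgd, " predicted:", g2 / snr)
print("LASER :", mse_laser, " predicted:", g2 * 4 / (m * snr) * (1 + 1 / (m * snr)))
```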
When the SNR is of constant order Ω(1), the variance for LASER is thus roughly O(m) times smaller than that of Z-SGD, which is significant given that the variance directly affects the convergence speed of stochastic-gradient-based methods (Bottou et al., 2018). More generally, even if the gradients are not inherently low-rank and we only know their rank factors approximately, standard techniques like error feedback (Seide et al., 2014) let us naturally generalize the above procedure, which is the basis for LASER. Alg. 1 below details LASER and Thm. 1 establishes its theoretical justification. While LASER works with any power policy {P_t} in (noisy channel), it suffices to consider the constant law P_t = P, as justified in Sec. 4.2.

3.1. Algorithm

For distributed training of neural network models, we apply Alg. 1 to each layer independently. Further, we use it only for the weight matrices (fully connected layers) and the convolutional filters (after reshaping the multi-dimensional tensors to matrices), and transmit the bias vectors uncompressed. We now delineate the two main components of LASER: (i) gradient compression + error feedback (EF), and (ii) power allocation + channel transmission.

Algorithm 1 LASER
 1: input: initial model parameters θ ∈ R^{m×n}, learning rate γ, compression rank r, power budget P
 2: output: trained parameters θ
 3: at each worker i = 1, . . . , k do
 4:   initialize memory e_i ← 0 ∈ R^{m×n}
 5:   for each iterate t = 0, . . . do
 6:     compute a stochastic gradient g_i ∈ R^{m×n}
 7:     M_i ← e_i + γ g_i
 8:     P_i, Q_i ← C_r(M_i)
 9:     e_i ← M_i − DECOMPRESS(C_r(M_i))
10:     α, β ← POWERALLOC({C_r(M_j), M_j})
11:     Y_p, Y_q ← Z_α({P_j}), Z_β({Q_j})
12:     ĝ ← DECOMPRESS(Y_p, Y_q)
13:     θ ← θ − ĝ
14:   end for
15: end at each worker

Gradient compression and error feedback (lines 7-9). Since we transmit low-rank gradient approximations, we use error feedback (EF) to incorporate the previous errors into the current gradient update. This ensures convergence of SGD with biased compressed gradients (Karimireddy et al., 2019). For the rank-r compression C_r(M) of the updated gradient M, we use the PowerSGD algorithm from Vogels et al. (2019), a linear compression scheme that computes the left and right singular components P ∈ R^{m×r} and Q ∈ R^{n×r}, respectively. PowerSGD uses a single step of subspace iteration (Stewart & Miller, 1975) with a warm start from the previous updates to compute these factors. The approximation error, M − P Q^⊤, is then used to update the error feedback for the next iteration. Note that the clients do not have access to the channel output and only include the local compression errors in their feedback. The decompression function in line 9 is given by DECOMPRESS(P, Q) ≜ P Q^⊤ ∈ R^{m×n}.

Power allocation and channel transmission (lines 10-11). This block is similar to Eq. (3) seen earlier, but generalized to multiple workers and higher rank. For each client, to transmit the rank-r factors P and Q over the noisy channel, we compute the corresponding power-allocation vectors α, β ∈ R^r_+, given by α, β = POWERALLOC(P, Q, M). This allocation is uniform across all the clients. Given these power scalars, all the clients synchronously transmit the corresponding left factors over the channel, which results in Y_p ∈ R^{m×r}; similarly for Y_q ∈ R^{n×r}. Finally, the stochastic gradient for the model update is reconstructed as ĝ = Y_p Y_q^⊤. For brevity, we defer the full details to App. E.1.

3.2. Theoretical Results

We now provide theoretical justification for LASER for learning parameters in R^{m×n} with m ≤ n (without loss of generality).
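Before turning to the analysis, here is a minimal single-worker sketch of one LASER step in the spirit of Alg. 1: error feedback, a PowerSGD-style rank-r compression via one warm-started subspace iteration, transmission of the two factors over the noisy channel, and reconstruction at the server. The even power split over the two factors is a simplification of POWERALLOC for illustration only, and all shapes and constants are ours.

```python
import numpy as np

rng = np.random.default_rng(2)

def compress_rank_r(M, Q_prev):
    """One step of warm-started subspace iteration (PowerSGD-style): M ~ P_fac @ Q_fac.T."""
    P_fac = M @ Q_prev
    P_fac, _ = np.linalg.qr(P_fac)            # orthonormalize the left factor
    Q_fac = M.T @ P_fac
    return P_fac, Q_fac

def send_over_channel(X, power, rng):
    """Transmit a factor with the given power; the receiver sees it plus scaled noise."""
    scale = np.linalg.norm(X) / np.sqrt(power)
    return X + scale * rng.standard_normal(X.shape)

m, n, r, gamma, P = 256, 512, 4, 0.1, 1e6
theta = rng.standard_normal((m, n))
e = np.zeros((m, n))                           # error-feedback memory
Q_warm = rng.standard_normal((n, r))           # warm start for subspace iteration

g = rng.standard_normal((m, n))                # stand-in for a stochastic gradient
M = e + gamma * g                              # line 7: fold in the error feedback
P_fac, Q_fac = compress_rank_r(M, Q_warm)      # line 8: rank-r factors
e = M - P_fac @ Q_fac.T                        # line 9: update local memory
Y_p = send_over_channel(P_fac, P / 2, rng)     # lines 10-11: transmit both factors
Y_q = send_over_channel(Q_fac, P / 2, rng)
g_hat = Y_p @ Y_q.T                            # line 12: reconstruct at the server
theta = theta - g_hat                          # line 13: model update
Q_warm = Q_fac                                 # reuse as warm start next round
```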
While our algorithm works for any number of clients, for the theory we consider k = 1 to illustrate the primary gains with our approach. Our results readily extend to the multiple clients setting following Cordonnier (2018). Specifically, Thm 1 below highlights that the asymptotic convergence rate of LASER is almost the same as that of the classical SGD, except for a small additive constant λLASER which is O(m) times smaller than that of Z-SGD. LASER: Linear Compression in Wireless Distributed Optimization Our results hold for both quasi-convex and arbitrary nonconvex functions. We start with the preliminaries. Definition 1 (Channel influence factor). For any compression cum transmission algorithm ALG, let y ALG(g) be the reconstructed gradient at the server after transmitting g over the noisy channel. Then the channel influence factor λALG is defined as λALG EZ y ALG(g) g 2 The influence factor gauges the effect of the channel on the variance of the final gradient y ALG: if the original stochastic gradient g has variance σ2 with respect to the actual gradient f, then y ALG has (1 + λALG)σ2. Note that this variance directly affects the convergence speed of the SGD and hence the smaller λALG is, the better the compression scheme is. In view of this, the following fact ( B.2) illustrates the crucial gains of LASER compared to Z-SGD, which are roughly of order O(m): λLASER 4 (m/r)SNR 1 + 1 (n/r)SNR 1 SNR = λZ-SGD. (5) In the low-rank (Vogels et al., 2019) and constant-order SNR regime where r = O(1) and SNR = Ω(1), we observe that λLASER is roughly O(m) times smaller than λZ-SGD. In other words, the effective SNR seen by LASER roughly gets boosted to O(m SNR) due to capitalizing on the lowrank factors whereas Z-SGD perceives only the standard factor SNR. Constant-order SNR, i.e. P/mn = Ω(1), means that the energy used to transmit each coordinate is roughly a constant, analogous to the constant-order bits used in quantization schemes (Vargaftik et al., 2021). In fact, a weaker condition that P/4r2 > 1 suffices ( E.3). With a slight abuse of notation, we denote the first upper bounding quantity in Eq. (5) as λLASER too and DECOMPRESS(Cr( )) as Cr( ) for brevity. We briefly recall the standard assumptions for SGD convergence following the framework in Bottou et al. (2018) and Stich & Karimireddy (2019). Assumption 1. The objective f : Rm n R is differentiable and µ-quasi-convex for a constant µ 0 with respect to θ , i.e. f(θ) f(θ ) + µ 2 θ θ 2 f(θ), θ θ , θ Rm n. Assumption 2. f is L-smooth for some L > 0, i.e. f(θ ) f(θ) + f(θ), θ θ + L 2 θ θ 2, θ, θ Rm n. Assumption 3. For any θ, a gradient oracle g(θ, ξ) = f(θ) + ξ, and conditionally independent noise ξ, there exist scalars (M, σ2) 0 such that E [ξ|θ] = 0, E[ ξ 2|θ] M f(θ) 2 + σ2. Assumption 4. The compressor Cr( ) satisifes the δr- compression property: there exists a δr [0, 1] such that ECr Cr(M) M 2 (1 δr) M 2, M Rm n. δr-compression is a standard assumption in the convergence analysis of Error Feedback SGD (EF-SGD) (Stich & Karimireddy, 2020). It ensures that the norm of the feedback memory remains bounded. We make the following assumption on the influence factor λLASER, which ensures that the overall composition of the channel and compressor mappings, ZP (Cr( )), still behaves nicely. Assumption 5. The channel influence factor λLASER satisfies λLASER 1/(10(2/δr + M)). 
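Assumption 4 is easy to probe empirically for a given compressor. The sketch below estimates the contraction ratio for plain rank-r SVD truncation on a synthetic, approximately low-rank matrix; this is an idealized stand-in used only to illustrate the δ_r-compression property, not the paper's PowerSGD compressor, and the matrix model is ours.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 128, 512, 4

# Synthetic, approximately low-rank gradient: rank-r signal plus small dense noise.
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n)) \
    + 0.1 * rng.standard_normal((m, n))

# Rank-r truncation via SVD (idealized stand-in for the compressor C_r).
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_r = (U[:, :r] * s[:r]) @ Vt[:r, :]

# Assumption 4 requires E||C_r(M) - M||^2 <= (1 - delta_r) ||M||^2.
# For top-r SVD the worst case of this ratio over all M is 1 - r/min(m, n);
# for an approximately low-rank M the observed ratio is far smaller.
ratio = np.sum((M_r - M) ** 2) / np.sum(M ** 2)
print(f"observed ||C_r(M) - M||^2 / ||M||^2 = {ratio:.4f}")
print(f"worst-case ratio for top-r SVD      = {1 - r / min(m, n):.4f}")
```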
We note that a similar assumption is needed for convergence even in the hypothetical ideal scenario when the clients have access to the channel output ( B.2), which we do not have. This bound can be roughly interpreted as λLASER = O(δr). We are now ready to state our main result. Theorem 1 (LASER convergence). Let {θt}t 0 be the LASER iterates (Alg. 1) with constant stepsize schedule {γt = γ}t 0 and suppose Assumptions 2-5 hold. Denote θ argminθ f(θ), f f(θ ), and τ 10L 2 δr + M . Then for k = 1, (i) if f is µ-quasi convex for µ > 0, there exists a stepsize γ 1 τ(1+λLASER) such that Ef(θout) f = τ(1 + λLASER) θ0 θ 2 exp µT τ(1 + λLASER) + σ2(1 + λLASER) where θout is chosen from {θ}T 1 t=0 such that θout = θt with probability (1 µγ/2) t. (ii) if f is µ-quasi convex for µ = 0, there exists a stepsize γ 1 τ(1+λLASER) such that Ef(θout) f = O τ θ0 θ 2(1 + λLASER) where θout is chosen uniformly at random from {θ}T 1 t=0 . (iii) if f is an arbitrary non-convex function, there exists a LASER: Linear Compression in Wireless Distributed Optimization stepsize γ 1 τ(1+λLASER) such that E f(θout) 2 = O τ f(θ0) f 2(1 + λLASER) L(f(θ) f )(1 + λLASER) where θout is chosen uniformly at random from {θ}T 1 t=0 . (iv) Z-SGD obeys the convergence bounds (i)-(iii) with δr = 1 and λLASER replaced by λZ-SGD. LASER vs. Z-SGD. Thus the asymptotic rate of LASER is dictated by the timescale (1 + λLASER)/T, very close to the 1/T rate for the classical SGD. In contrast, Z-SGD has the factor (1 + λZ-SGD)/T with λZ-SGD = O(m) λLASER. Multiple clients. As all the workers in LASER (Alg. 1) apply the same linear operations for gradient compression (via Power SGD), Thm 1 can be extended to (homogenous) multiple workers by shrinking the constants σ2, SNR, λLASER, and λZ-SGD by a factor of k, following Cordonnier (2018). Proof. (Sketch) First we write the LASER iterates {θt}t 0 succinctly as θt+1 = θt Z(Cr(et + γtgt)), et+1 = (et + γtgt) Cr(et + γtgt). First we establish a bound on the gap to the optimum, E θt+1 θ 2, by the descent lemma (Lemma 11). This optimality gap depends on the behavior of the error updates via E et 2, which we characterize by the error-control lemma (Lemma 12). When f is quasi-convex, these two lemmas help us establish a recursive inequality between the optimality gap Ef(θt+1) f at time t + 1 and with that of at time t: Ef(θt) f . Upon unrolling this recursion and taking a weighted summation, Lemma 3 establishes the desired result. In the case of non-convexity, the same idea helps us to control E f(θt) 2 in a similar fashion and when combined with Lemma 6, yields the final result. The proof for Z-SGD is similar. 4. Experimental Results We empirically demonstrate the superiority of LASER over state-of-the-art baselines on a variety of benchmarks, summarized in Table 2. Setup. We consider four challenging tasks of practical interest: (i) GPT language modeling on WIKITEXT-103, and (ii, iii, iv) image classification on MNIST, CIFAR10 and CIFAR100. For the language modeling, we use the GPT-2 like architecture following Pagliardini (2023) ( F). RESNET18 is used for the CIFAR datasets. For MNIST, we Table 2: Benchmarks for evaluating LASER. Baseline refers to the noiseless SGD. Model Dataset Metric Baseline GPT-2 (123.6 M) WIKITEXT Perplexity 19.2 RESNET18 (11.2 M) CIFAR10 Top-1 accuracy 93.0% CIFAR100 73.1% 1-LAYER NN (7850) MNIST 92.3% Power budget LASER Z-SGD Sketching Random-K Noiseless SGD Figure 2: Test accuracy (higher the better) for a given power budget on CIFAR-10 for different algorithms. 
LASER demonstrates consistent accuracy gains over the baselines over a wide range of power levels. use a 1-hidden-layer network for a fair comparison with Amiri & Gündüz (2020b). For distributed training of these models, we consider k = 4 clients for language modeling and k = 16 for image classification. We simulate the noisy channel by sampling Z N(0, Id). To gauge the performance of algorithms over a wide range of noisy conditions, we vary the power P geometrically in the range [0.1, 10] for MNIST, [250, 128000] for CIFAR10 and CIFAR100, and [10000, 1024 10000] for WIKITEXT-103. The chosen ranges can be roughly split into low-moderate-high power Table 3: Power required (lower the better) to reach the given target accuracy on CIFAR-10. LASER requires 16 lesser power than the Z-SGD to achieve the same targetaccuracy. Equivalently, LASER tolerates more channel noise than the Z-SGD for the same target accuracy as is partly supported by our theoretical analysis. Target Power required Reduction LASER Z-SGD 88% 250 4000 16 89% 500 8000 16 90% 1000 16000 16 91% 2000 32000 16 LASER: Linear Compression in Wireless Distributed Optimization regimes. Recall from noisy channel that the smaller the power, the higher the noise in the channel. Baselines. We benchmark LASER against three different sets of baselines: (i) Z-SGD, (ii) SIGNUM, RANDOM-K, SKETCHING, and (iii) A-DSGD. Z-SGD sends the uncompressed gradients directly over the noisy channel and acts as a canonical baseline. The algorithms in (ii) are state-ofthe-art distributed compression schemes for noiseless communication (Vogels et al., 2019). SIGNUM (Bernstein et al., 2018) transmits the gradient sign followed by the majority vote and SKETCHING (Rothchild et al., 2020; Haddadpour et al., 2020) uses a Count Mean Sketch to compress the gradients. We omit comparison with quantization methods (Vargaftik et al., 2022) given the difference in our objectives and the settings (noisy channel). A-DSGD (Amiri & Gündüz, 2020b) is a popular compression scheme for noisy channels, relying on Top-K and random sketching. However A-DSGD does not scale to tasks of the size we consider and hence we benchmark against it only on MNIST. SGD serves as the noiseless baseline (Table 2). All the compression algorithms use the error-feedback, and use the compression factor (compressed-gradient-size/original-size) 0.2, the optimal in the range [0.1, 0.8]. We report the best results among 3 independent runs for all the baselines ( F). 4.1. Results on Language Modeling and Image Classification For GPT language modeling, Fig. 1 in Sec. 1 highlights that LASER outperforms the baselines over a wide range of power levels. To the best of our knowledge, this is the first result of its kind to demonstrate gains for GPT training over noisy channels. Specifically, we obtain 64% improvement in perplexity over Z-SGD (76 vs. 212) in the low power regime (P = 10 K) and 50% (35 vs. 71) for the moderate one (P = 160 K). This demonstrates the efficacy of LASER especially in the limited power environment. Indeed, Table 1 illustrates that for a fixed target perplexity, LASER requires 16 less power than the second best, ZSGD. In the very high power regime, we observe no clear gains (as expected) compared to transmitting the uncompressed gradients directly via the Z-SGD. 
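LASER's gains come with far fewer transmitted entries: a rank-r scheme sends r(m+n) entries per m×n weight matrix instead of mn, which is the source of the large communication-cost reductions reported later (cf. Table 5). A quick back-of-the-envelope computation, for a hypothetical layer shape of our choosing:

```python
# Uplink payload of a rank-r factorization vs. the full gradient, for a hypothetical layer.
m, n, r = 1024, 4096, 4
full = m * n                      # entries sent by Z-SGD
low_rank = r * (m + n)            # entries sent by LASER (P is m x r, Q is n x r)
print(f"full: {full:,} entries, rank-{r}: {low_rank:,} entries, "
      f"reduction ~ {full / low_rank:.0f}x")
```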
We note that for the language modeling task, the popular optimization algorithm is Adam W (Loshchilov & Hutter, 2017) and hence Z-SGD (LASER) here refers to the noisy transmission via ZSGD (noisy channel) and subsequent gradient based update by Adam W. We observe a similar trend for CIFAR10 classification, as Fig. 2 and Table 3 demonstrate the superiority of LASER over other compression schemes; RANDOM-K does better than the other baselines till moderate power levels after which Z-SGD dominates. SIGNUM is considerably worse than others, as it hasn t converged yet after 150 epochs, and Power budget P Accuracy after 150 epochs LASER, constant LASER, linear LASER, step Z-SGD, constant Z-SGD, linear Z-SGD, step Figure 3: Accuracy vs. budget P for various laws. Constant is the best for both LASER and Z-SGD. hence omitted. With regards to power reduction, Table 3 highlights that LASER requires just (1/16)th the power compared to Z-SGD to reach any target accuracy till 91%. We observe similar gains for CIFAR100 ( F). Table 4 compares the performance of LASER against various compression algorithms on MNIST. In the very noisy regime (P = 0.1), RANDOM-K is slightly better than LASER and outperforms the other baselines, whereas in the moderate (P = 1) and high power (P = 10) regimes, LASER is slightly better than the other algorithms. On the other hand, we observe that A-DSGD performs worse than even simple compression schemes like RANDOM-K in all the settings. 4.2. Power Control: Static vs. Dynamic Policies The formulation in noisy channel allows for any power control law Pt as long as it satisfies the average power constraint: P t(Pt/T) P. This begs a natural question: what s the best power scheme for LASER? To answer this, for CIFAR10 classification, under a fixed budget P we consider different power policies with both increasing and decreasing power across epochs: the constant, piecewise constant and linear schemes. Fig. 3 illustrates the results for the decreasing power laws, while Fig. 7 their increasing counterparts. These results highlight that the constant power policy achieves the best performance for both LASER and Z-SGD, compared to the time-varying ones. Further LASER attains significant accuracy gains over Z-SGD for all the power control laws. Interestingly LASER performs the same with all the power schemes. We posit this behavior to the fact that the noisy channel already contains a time-varying noise due to the term maxi gi Pt . Since the gradients decay over time, this inherently allows for an implicit power/SNR-control law even with a constant Pt, thus LASER: Linear Compression in Wireless Distributed Optimization Table 4: Test accuracy (higher the better) after 50 epochs on MNIST for low, moderate, and high power regimes. Algorithm Test accuracy P = 0.1 P = 1 P = 10 Z-SGD 81.3% 87.9% 91.9% SIGNUM 76.7% 83.2% 85.4% RANDOM-K 86.1% 89.3% 91.5% SKETCHING 81.9% 88.2% 91.7% A-DSGD 81.6% 86.9% 87.3% LASER 84.3% 89.9% 92.3% Table 5: Communication cost (lower the better) for GPT language modeling on WIKITEXT-103. LASER transmits the lowest volume of data during training. Algorithm Data sent per iteration Z-SGD 496 MB (1 ) SIGNUM 15 MB (33 ) RANDOM-K 99 MB (5 ) SKETCHING 99 MB (5 ) A-DSGD n/a n/a LASER 3 MB (165 ) enabling the constant power scheme to fare as good as the others. Hence, without loss of generality, we consider the static power schedule for our theory and experiments. We refer to F.7 for a detailed discussion. 4.3. 
Computational Complexity and Communication Cost Recall from Alg. 1 that the two critical components of LASER are gradient compression and channel transmission. To gauge their efficacy we analyze them via two important metrics: (i) computational complexity of compression and (ii) communication cost of transmission. For (ii), recall from Eq. (1) that the power constraint indirectly serves as a communication cost and encourages compression. Table 5 quantitatively measures the total data sent by clients for each training iteration (doesn t change with the power P) for GPT language modeling on WIKITEXT-103. As illustrated, LASER incurs the lowest communication cost among all the baselines with 165 cost reduction as compared to the Z-SGD, followed by SIGNUM which obtains 33 reduction. Interestingly, LASER also achieves the best perplexity scores as highlighted in Fig. 1. For these experiments, we let rank r = 4 for LASER and the best compression factor 0.2 for the baselines (as detailed earlier). SIGNUM does not require any compression factor. For (i), since LASER relies on Power SGD for the rank decomposition, it inherits the same low-complexity benefits: Tables 3-7 of Vogels et al. (2019) demonstrate that Power SGD is efficient with significantly lower computational needs and has much smaller processing time/batch as compared to baselines without any accuracy drop. In fact, it is the core distributed algorithm behind the recent breakthrough DALL-E ( E in Ramesh et al. (2021)). 4.4. Slow and fast fading channels The slow/non-fading model in Eq. (1) readily generalizes to the popular fast fading channel (Guo et al., 2020; Amiri & Gündüz, 2020a): y = P i γixi + Z, where γi are the channel fading coefficients. A standard technique here in the literature is to assume that channel-state-information (CSI) is known in the form of fading coefficients or their statistics, which essentially reduces the problem to a non-fading one. Likewise LASER can be extended to the fast fading channel as well. 5. Related Work In relation to our work, the existing literature can be broadly classified into two categories: (i) Compression schemes with noiseless communication. Assuming a noiseless bit pipe from clients to the server, quantization methods (Dettmers, 2015; Alistarh et al., 2017; Horvóth et al., 2022; Li et al., 2018; Wen et al., 2017; Yu et al., 2019; Vargaftik et al., 2021) quantize each coordinate and send as fewer bits as possible. Sparsification techniques (Ivkin et al., 2019; Stich et al., 2018; Sun et al., 2019; Tsuzuku et al., 2018; Wangni et al., 2018) send a reduced number of coordinates, based on criteria such as Top/Random-K, as opposed to sending the full gradient directly. Hybrid methods (Dryden et al., 2016; Lim et al., 2019) combine both. Rank compression methods (Yu et al., 2018; Cho et al., 2019; Wang et al., 2018) spectrally decompose gradient matrix (often via SVD) and transmit these factors. Since SVD is computationally prohibitive, we rely on the state-of-the-art light-weight compressor Power SGD (Vogels et al., 2019). (ii) Compression schemes for noisy channels. The main idea here is to enable over-the-air-aggregation of gradients via the superposition nature of wireless channels (Nazer & Gastpar, 2007) thus reducing the communication latency and bandwidth. The popular A-DSGD (Amiri & Gündüz, 2020b) relies on Top-K sparsification and random sketching. However, being memory intensive, A-DSGD is restricted to MNIST with 1-layer NN and doesn t scale beyond. Guo et al. 
(2020) propose an analog-gradient-aggregation scheme but it is limited to shallow neural networks. Chang & Tandon (2020) design a digital quantizer for training over Gaussian MAC channels. (iii) Power laws. In the absence of explicit power constraints, Wei & Shen (2022a) show that LASER: Linear Compression in Wireless Distributed Optimization O(1/t2) noise-decay ensures the standard 1/T convergence rate for noisy FED-AVG whereas Saha et al. (2022) propose a t0.8 increase in SNR for the decentralized setup. 6. Conclusion We propose a principled gradient compression scheme, LASER, for wireless distributed optimization over additive noise channels. LASER attains significant gains over its baselines on a variety of metrics such as accuracy/perplexity, complexity and communication cost. It is an interesting avenue of future research to extend LASER to channels with downlink noise and fast fading without CSI. Acknowledgements Ashok would like to thank Ananda Theertha Suresh for helpful discussions about the project. This work was partly supported by Swiss National Science Foundation under Grant 200364, ARO Award W911NF2310062, ONR Award N00014-21-1-2379, and NSF Award CNS-2008824. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Aji, A. F. and Heafield, K. Sparse communication for distributed gradient descent. ar Xiv preprint ar Xiv:1704.05021, 2017. Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. Qsgd: Communication-efficient sgd via gradient quantization and encoding. Advances in neural information processing systems, 30, 2017. Amiri, M. M. and Gündüz, D. Federated learning over wireless fading channels. IEEE Transactions on Wireless Communications, 19(5):3546 3557, 2020a. Amiri, M. M. and Gündüz, D. Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air. IEEE Transactions on Signal Processing, 68:2155 2169, 2020b. Basu, D., Data, D., Karakus, C., and Diggavi, S. Qsparselocal-sgd: Distributed sgd with quantization, sparsification and local computations. Advances in Neural Information Processing Systems, 32, 2019. Bernstein, J., Zhao, J., Azizzadenesheli, K., and Anandkumar, A. signsgd with majority vote is communication effi- cient and fault tolerant. ar Xiv preprint ar Xiv:1810.05291, 2018. Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM review, 60(2):223 311, 2018. Chang, W.-T. and Tandon, R. Mac aware quantization for distributed gradient descent. In GLOBECOM 20202020 IEEE Global Communications Conference, pp. 1 6. IEEE, 2020. Cho, M., Muthusamy, V., Nemanich, B., and Puri, R. Gradzip: Gradient compression using alternating matrix factorization for large-scale deep learning. In Neur IPS. 2019. Cordonnier, J.-B. Convex optimization using sparsified stochastic gradient descent with memory. Technical report, 2018. Cover, T. Broadcast channels. IEEE Transactions on Information Theory, 18(1):2 14, 1972. doi: 10.1109/TIT. 1972.1054727. Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., et al. Large scale distributed deep networks. Advances in neural information processing systems, 25, 2012. Dettmers, T. 8-bit approximations for parallelism in deep learning. ar Xiv preprint ar Xiv:1511.04561, 2015. Dryden, N., Moon, T., Jacobs, S. A., and Van Essen, B. 
Communication quantization for data-parallel training of deep neural networks. In 2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC), pp. 1 8. IEEE, 2016. Guo, H., Liu, A., and Lau, V. K. Analog gradient aggregation for federated learning over wireless networks: Customized design and convergence analysis. IEEE Internet of Things Journal, 8(1):197 210, 2020. Haddadpour, F., Karimi, B., Li, P., and Li, X. Fedsketch: Communication-efficient and private federated learning via sketching. Co RR, abs/2008.04975, 2020. URL https://arxiv.org/abs/2008.04975. Horvóth, S., Ho, C.-Y., Horvath, L., Sahu, A. N., Canini, M., and Richtárik, P. Natural compression for distributed deep learning. In Mathematical and Scientific Machine Learning, pp. 129 141. PMLR, 2022. Isik, B., Pase, F., Gunduz, D., Weissman, T., and Zorzi, M. Sparse random networks for communication-efficient federated learning. ar Xiv preprint ar Xiv:2209.15328, 2022. LASER: Linear Compression in Wireless Distributed Optimization Ivkin, N., Rothchild, D., Ullah, E., Stoica, I., Arora, R., et al. Communication-efficient distributed sgd with sketching. Advances in Neural Information Processing Systems, 32, 2019. Jiang, J., Fu, F., Yang, T., and Cui, B. Sketchml: Accelerating distributed machine learning with data sketches. In Proceedings of the 2018 International Conference on Management of Data, pp. 1269 1284, 2018. Karimireddy, S. P., Rebjock, Q., Stich, S., and Jaggi, M. Error feedback fixes signsgd and other gradient compression schemes. In International Conference on Machine Learning, pp. 3252 3261. PMLR, 2019. Li, Y., Park, J., Alian, M., Yuan, Y., Qu, Z., Pan, P., Wang, R., Schwing, A., Esmaeilzadeh, H., and Kim, N. S. A network-centric hardware/algorithm co-design to accelerate distributed training of deep neural networks. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 175 188. IEEE, 2018. Lim, H., Andersen, D. G., and Kaminsky, M. 3lc: Lightweight and effective traffic compression for distributed machine learning. Proceedings of Machine Learning and Systems, 1:53 64, 2019. Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. ar Xiv preprint ar Xiv:1711.05101, 2017. Martin, C. H. and Mahoney, M. W. Implicit selfregularization in deep neural networks: Evidence from random matrix theory and implications for learning. The Journal of Machine Learning Research, 22(1):7479 7551, 2021. Mazumder, R., Hastie, T., and Tibshirani, R. Spectral regularization algorithms for learning large incomplete matrices. The Journal of Machine Learning Research, 11: 2287 2322, 2010. Nazer, B. and Gastpar, M. Computation over multipleaccess channels. IEEE Transactions on information theory, 53(10):3498 3516, 2007. Pagliardini, M. GPT-2 modular codebase implementation. https://github.com/epfml/llm-baselines, 2023. Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot textto-image generation. In International Conference on Machine Learning, pp. 8821 8831. PMLR, 2021. Robbins, H. and Monro, S. A stochastic approximation method. The annals of mathematical statistics, pp. 400 407, 1951. Rothchild, D., Panda, A., Ullah, E., Ivkin, N., Stoica, I., Braverman, V., Gonzalez, J., and Arora, R. Fetchsgd: Communication-efficient federated learning with sketching. In International Conference on Machine Learning, pp. 8253 8265. PMLR, 2020. Saha, R., Rini, S., Rao, M., and Goldsmith, A. J. 
Decentralized optimization over noisy, rate-constrained networks: Achieving consensus by communicating differences. IEEE Journal on Selected Areas in Communications, 40(2):449 467, 2022. doi: 10.1109/JSAC.2021. 3118428. Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In Fifteenth annual conference of the international speech communication association, 2014. Shi, Y., Yang, K., Jiang, T., Zhang, J., and Letaief, K. B. Communication-efficient edge AI: Algorithms and systems. IEEE Communications Surveys & Tutorials, 22(4): 2167 2191, 2020. Stewart, G. and Miller, J. Methods of simultaneous iteration for calculating eigenvectors of matrices. Topics in Numerical Analysis II, 2, 1975. Stich, S. U. and Karimireddy, S. P. The error-feedback framework: Better rates for SGD with delayed gradients and compressed communication. ar Xiv preprint ar Xiv:1909.05350, 2019. Stich, S. U. and Karimireddy, S. P. The error-feedback framework: Better rates for SGD with delayed gradients and compressed updates. The Journal of Machine Learning Research, 21(1):9613 9648, 2020. Stich, S. U., Cordonnier, J.-B., and Jaggi, M. Sparsified SGD with memory. Advances in Neural Information Processing Systems, 31, 2018. Sun, H., Shao, Y., Jiang, J., Cui, B., Lei, K., Xu, Y., and Wang, J. Sparse gradient compression for distributed sgd. In Database Systems for Advanced Applications: 24th International Conference, DASFAA 2019, Chiang Mai, Thailand, April 22 25, 2019, Proceedings, Part II, pp. 139 155. Springer, 2019. Tang, Z., Shi, S., Chu, X., Wang, W., and Li, B. Communication-efficient distributed deep learning: A comprehensive survey. ar Xiv preprint ar Xiv:2003.06307, 2020. Tsuzuku, Y., Imachi, H., and Akiba, T. Variance-based gradient compression for efficient distributed deep learning. ar Xiv preprint ar Xiv:1802.06058, 2018. LASER: Linear Compression in Wireless Distributed Optimization Vargaftik, S., Ben-Basat, R., Portnoy, A., Mendelson, G., Ben-Itzhak, Y., and Mitzenmacher, M. Drive: One-bit distributed mean estimation. Advances in Neural Information Processing Systems, 34:362 377, 2021. Vargaftik, S., Basat, R. B., Portnoy, A., Mendelson, G., Itzhak, Y. B., and Mitzenmacher, M. Eden: Communication-efficient and robust distributed mean estimation for federated learning. In International Conference on Machine Learning, pp. 21984 22014. PMLR, 2022. Vogels, T., Karimireddy, S. P., and Jaggi, M. Power SGD: Practical low-rank gradient compression for distributed optimization. Advances in Neural Information Processing Systems, 32, 2019. Wang, H., Sievert, S., Liu, S., Charles, Z., Papailiopoulos, D., and Wright, S. Atomo: Communication-efficient learning via atomic sparsification. Advances in Neural Information Processing Systems, 31, 2018. Wangni, J., Wang, J., Liu, J., and Zhang, T. Gradient sparsification for communication-efficient distributed optimization. Advances in Neural Information Processing Systems, 31, 2018. Wei, X. and Shen, C. Federated learning over noisy channels: Convergence analysis and design examples. IEEE Transactions on Cognitive Communications and Networking, 8(2):1253 1268, 2022a. Wei, X. and Shen, C. Federated learning over noisy channels: Convergence analysis and design examples. IEEE Transactions on Cognitive Communications and Networking, 8(2):1253 1268, 2022b. Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., and Li, H. 
Terngrad: Ternary gradients to reduce communication in distributed deep learning. Advances in neural information processing systems, 30, 2017. Xu, H., Ho, C.-Y., Abdelmoniem, A. M., Dutta, A., Bergou, E. H., Karatsenidis, K., Canini, M., and Kalnis, P. Compressed communication for distributed deep learning: Survey and quantitative evaluation. Technical report, 2020. Yoshida, Y. and Miyato, T. Spectral norm regularization for improving the generalizability of deep learning. ar Xiv preprint ar Xiv:1705.10941, 2017. Yu, M., Lin, Z., Narra, K., Li, S., Li, Y., Kim, N. S., Schwing, A., Annavaram, M., and Avestimehr, S. Gradiveq: Vector quantization for bandwidth-efficient gradient aggregation in distributed cnn training. Advances in Neural Information Processing Systems, 31, 2018. Yu, Y., Wu, J., and Huang, J. Exploring fast and communication-efficient algorithms in large-scale distributed networks. ar Xiv preprint ar Xiv:1901.08924, 2019. Zhu, G., Wang, Y., and Huang, K. Broadband analog aggregation for low-latency federated edge learning. IEEE Transactions on Wireless Communications, 19(1):491 506, 2019. LASER: Linear Compression in Wireless Distributed Optimization 1 Introduction 1 2 Background 2 3 LASER: Novel Linear Compression cum Transmission Scheme 3 3.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.2 Theoretical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 4 Experimental Results 6 4.1 Results on Language Modeling and Image Classification . . . . . . . . . . . . . . . . . . . . . . . . . . 7 4.2 Power Control: Static vs. Dynamic Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 4.3 Computational Complexity and Communication Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 4.4 Slow and fast fading channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 5 Related Work 8 6 Conclusion 9 A Error feedback and SGD convergence toolbox 13 B Technical lemmas for LASER convergence 15 B.1 Power allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 B.2 Channel influence factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 B.3 Optimality gap and error bounds for LASER iterates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 C Proof of Thm 1 18 D Proof of technical lemmas 19 D.1 Proof of Lemma 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 D.2 Proof of Lemma 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 D.3 Proof of Lemma 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 D.4 Proof of Lemma 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 D.5 Proof of Lemma 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 D.6 Proof of Lemma 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 D.7 Proof of Lemma 13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 E Additional details about noisy channel and LASER 23 E.1 Channel transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 E.2 Detailed steps for Alg. 1 . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 LASER: Linear Compression in Wireless Distributed Optimization E.3 Constant-order SNR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 F Experimental details 25 F.1 WIKITEXT-103 experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 F.2 CIFAR10 experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 F.3 CIFAR100 experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 F.4 MNIST experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 F.5 Rank-accuracy tradeoff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 F.6 Power allocation across workers and neural network parameters . . . . . . . . . . . . . . . . . . . . . . 31 F.7 Static vs. dynamic power policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 F.8 Baselines implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 F.8.1 Count-Mean Sketching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 F.8.2 Random K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 F.8.3 Signum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 A. Error feedback and SGD convergence toolbox In this section we briefly recall the main techniques for the convergence analysis of SGD with error feedback (EF-SGD) from (Stich & Karimireddy, 2020). We consider k = 1 clients with a compressor Cr( ) and without any channel communication noise ZP (Sec. 2): θt+1 = θt Cr(et + γtgt) et+1 = (et + γtgt) Cr(et + γtgt). (EF-SGD) Now we define the virtual iterates {eθt}t 0 which are helpful for the convergence analysis: eθt θt et. (6) Hence eθt+1 = θt et γtgt = eθt γtgt. First we consider the case when f is quasi-convex followed by the non-convex setting. In all the results below, we assume that the objective f is L-smooth, gradient oracle g has (M, σ2)-bounded noise, and that Cr( ) satisfies the δr compression property (Assumptions 2, 3, and 4). f is quasi-convex: The following lemma gives a handle on the gap to optimality E eθt θ 2. Lemma 1 ((Stich & Karimireddy, 2020), Lemma 8). Let {θt, et}t 0 be defined as in EF-SGD. Assume that f is µ-quasi convex for some µ 0. If γt 1 4L(1+M) for all t 0, then for {eθt}t 0 defined in Eq. (6), E eθt+1 θ 2 1 µγt E eθt θ 2 γt 2 E(f(θt) f ) + γ2 t σ2 + 3Lγt E θt eθt 2 . (7) The following lemma bounds the squared norm of the error, i.e. E et 2, appearing in Eq. (7). Recall that a positive sequence {at}t 0 is τ-slow decreasing for parameter τ 1 if at+1 at and at+1(1 + 1/2τ) at. The sequence {at}t 0 is τ-slow increasing if {a 1 t }t 0 is τ-slow decreasing (Stich & Karimireddy, 2020), Definition 10. Lemma 2 ((Stich & Karimireddy, 2020), Lemma 22). Let et be as in (EF-SGD) for a δr-approximate compressor Cr and LASER: Linear Compression in Wireless Distributed Optimization stepsizes {γt}t 0 with γt+1 1 10L(2/δr+M), t 0 and {γ2 t }t 0 2 δr -slow decaying. Then E 3L et+1 2 δr t i E f(θt i) 2 + γtσ2 . 
(8) Furthermore, for any 4 δr -slow increasing non-negative sequence {wt}t 0 it holds: t=0 wt E et 2 1 t=0 wt E f(θt) 2 + σ2 T X The following result controls the summations of the optimality gap that appear when combining Lemma 1 and Lemma 2. Lemma 3 ((Stich & Karimireddy, 2020), Lemma 13). For every non-negative sequence {rt}t 0 and any parameters d a > 0, c 0, T 0, there exists a constant γ 1 d, such that for constant stepsizes {γt = γ}t 0 and weights wt := (1 aγ) (t+1) it holds γt (1 aγt) rt wt γt rt+1 + cγtwt = e O dr0 exp a T Combining the above lemmas, we obtain the following result for the convergence rate of EF-SGD. Theorem 2 ((Stich & Karimireddy, 2020), Theorem 22). Let {θt}t 0 denote the iterates of the error compensated stochastic gradient descent (EF-SGD) with constant stepsize {γt = γ}t 0 and with a δr-approximate compressor on a differentiable function f : Rd R under Assumptions 2 and 3. Then, if f satisfies Assumption 1 for µ > 0, then there exists a stepsize γ 1 10L(2/δr+M) (chosen as in Lemma 3) such that where the output θout {θt}T 1 t=0 is chosen to be θt with probability proportional to (1 µγ/2) t. satisfies Assumption 1 for µ = 0, then there exists a stepsize γ 1 10L(2/δr+M) (chosen as in Lemma 3) such that Ef(θout) f = O L(1/δr + M) θ0 θ 2 where the output θout {θt}T 1 t=0 is chosen uniformly at random from the iterates {θt}T 1 t=0 . f is non-convex: Now we consider the case where f is an arbitrary non-convex function. The above set of results extend in a similar fashion to this setting too as described below: Lemma 4 ((Stich & Karimireddy, 2020), Lemma 9). Let {θt, et}t 0 be defined as in EF-SGD. If γt 1 2L(1+M) for all t 0, then for {eθt}t 0 defined in Eq. (6), E[f(eθt+1)] E[f(eθt)] γt 4 E f(θt) 2 + γ2 t Lσ2 2 E θt eθt 2 . (9) Lemma 5 ((Stich & Karimireddy, 2020), Lemma 22). Let et be as in (EF-SGD) for a δr-approximate compressor Cr and stepsizes {γt}t 0 with γt+1 1 10L(2/δr+M), t 0 and {γ2 t }t 0 2 δr -slow decaying. Then E 3L et+1 2 δr t i E f(θt i) 2 + γtσ2 . (10) LASER: Linear Compression in Wireless Distributed Optimization Furthermore, for any 4 δr -slow increasing non-negative sequence {wt}t 0 it holds: t=0 wt E et 2 1 t=0 wt E f(θt i) 2 + σ2 T X Lemma 6 ((Stich & Karimireddy, 2020), Lemma 14). For every non-negative sequence {rt}t 0 and any parameters d 0, c 0, T 0, there exists a constant γ 1 d, such that for constant stepsizes {γt = γ}t 0 it holds: ΨT := 1 T + 1 dr0 T + 1 + 2 cr0 Now we have the final convergence result for the non-convex setting. Theorem 3 ((Stich & Karimireddy, 2020), Theorem 22). Let {θt}t 0 denote the iterates of the error compensated stochastic gradient descent (EF-SGD) with constant stepsize {γt = γ}t 0 and with a δr-approximate compressor on a differentiable function f : Rd R under Assumptions 2 and 3. Then, if f is an arbitrary non-convex function, there exists a stepsize γ 1 10L(1/δr+M) (chosen as in Lemma 6), such that E f(θout) 2 = O L(1/δr + M)(f(θ0) f ) L(f(θ0) f ) where the output θout {θt}T 1 t=0 is chosen uniformly at random from the iterates {θt}T 1 t=0 . B. Technical lemmas for LASER convergence Towards the convergence analysis of LASER for k = 1, we rewrite the Alg. 
1 succinctly as: θt+1 = θt Z(α,β) (Cr(et + γtgt)) et+1 = (et + γtgt) Cr(et + γtgt) , (LASER) where the channel corrupted gradient approximation Z(α,β)( ) is given by Z(α,β)(Cr(et + γtgt) | {z } =P Q ) pi + pi αi Z(i) m qi + qi βi Z(i) n and α = (αi)r i=1 and β = (βi)r i=1 are appropriate power allocations to transmit the respective left and right factors P = [p1, . . . , pr] Rm r and Q = [q1, . . . , qr] Rn r for the decomposition Cr(et + γtgt) = P Q . Z(i) m Rm and Z(i) n Rn denote the independent channel noises for each factor i [r]. Thus we observe from LASER that it has an additional channel corruption in the form of Z(α,β)( ) as compared to the EF-SGD. Now in the remainder of this section, we explain how to choose the power allocation (α, β) (App. B.1), how to control the influence of the channel Z(α,β)( ) on the convergence of LASER (App. B.2), and utilize these results to establish technical lemmas along the lines of App. A for LASER (App. B.3). B.1. Power allocation In this section, we introduce the key technical lemmas about power allocation that are crucial for the theoretical results. We start with the rank one case. Lemma 7 (Rank-1 power allocation). For a power P > 0 and m, n N with m n, define the function f P : R+ R+ R+ as f P (α, β) 1 + m and the constraint set SP {(α, β) : α 0, β 0, α + β = P}. Then for the minimizer (α , β ) = LASER: Linear Compression in Wireless Distributed Optimization argmin(α,β) SP f P (α, β), we have f P (α , β ) 1 + 4 m SNR 1 + 1 n SNR Further the minimizer is given by Lemma 8 (Rank-r power allocation). For a power P > 0, m, n, r N with m n, and positive scalars κ1, . . . , κr > 0 with P i κi = 1, define the function f P : (R+)r (R+)r R+ as , α = (αi)r i=1, β = (βi)r i=1, and the constraint set SP {{(α, β) : α 0, β 0, P i(αi + βi) = P}. Then there exists a power allocation scheme (α , β ) SP such that min (α,β) SP f P (α, β) f P (α , β ) 1 + 4 (m/r) SNR 1 + 1 (n/r) SNR where SNR P mn. Further (α , β ) is given by Pi/2, m = n β i = Pi α i , Remark 1. In other words, we first divide the power P proportional to κi for each i [r] and further allocate this Pi amongst α i and β i as per the optimal rank one allocation scheme in Lemma 7. B.2. Channel influence factor In this section we establish the bounds for the channel influence defined in Eq. (4) for both Z-SGD and LASER. This helps us give a handle to control the second moment of the gradient corrupted by channel noise. Lemma 9 (Channel influence on Z-SGD). For the Z-SGD algorithm that sends the uncompressed gradients directly over the noisy channel with power constraint P, we have λZ-SGD = 1 SNR, (12) where SNR = P mn. Lemma 10. For the LASER algorithm with the optimal power allocation (α, β) (chosen as in Lemma 8), we have λLASER 4 (m/r) SNR 1 + 1 (n/r) SNR where SNR = P mn. Remark 2. Note that for the optimal power allocation via Lemma 8, we need the positive scalars κ1, . . . , κr. In the context of LASER, we will later see in the proof in App. D that κi pi 2. LASER: Linear Compression in Wireless Distributed Optimization Thus Lemma 9 and Lemma 10 establish that λLASER 4 (m/r) SNR 1 + 1 (n/r) SNR 1 SNR = λZ-SGD. In the low-rank (Vogels et al., 2019) and constant-order SNR regime where r = O(1) and SNR = Ω(1), we observe that λLASER is roughly O(m) times smaller than λZ-SGD. Note on assumption between λLASER and δr. Recall from LASER that the local memory et has only access to the compressed gradients and not the channel output. 
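Reading the objective off the proof in App. D.1 as f_P(α, β) = (1 + m/α)(1 + n/β), the bound in Lemma 7 can be sanity-checked numerically: a brute-force search over the power split α + β = P should stay below 1 + 4/(m·SNR)(1 + 1/(n·SNR)) with SNR = P/(mn). This is a verification sketch under that reading, with arbitrary (m, n, P), not part of the paper's code.

```python
import numpy as np

def f_P(alpha, beta, m, n):
    # Objective from Lemma 7 (as read off the proof in App. D.1).
    return (1.0 + m / alpha) * (1.0 + n / beta)

m, n, P = 64, 256, 4096.0
snr = P / (m * n)

# Brute-force minimization over the power split alpha + beta = P.
alphas = np.linspace(1e-3, P - 1e-3, 200_000)
vals = f_P(alphas, P - alphas, m, n)
best, a_star = vals.min(), alphas[np.argmin(vals)]

bound = 1.0 + 4.0 / (m * snr) * (1.0 + 1.0 / (n * snr))
print(f"alpha* ~ {a_star:.1f}, min f_P = {best:.4f}, Lemma 7 bound = {bound:.4f}")
```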
In an hypothetical scenario, where it has access to the same, it follows that EZ Z(α,β)(Cr(M)) M 2 (1 (δr λLASER)) M 2. Hence for the compression property in this ideal scenario, we need λLASER δr. B.3. Optimality gap and error bounds for LASER iterates In this section, we characterize the gap to the optimality and the error norm for the LASER iterates {θt}t 0 (similar to Lemmas 1, 2, 2 and 5 for EF-SGD). Towards the same, first we define the virtual iterates { eθt}t 0 as follows: eθt θt et . (14) eθt+1 = θt+1 et+1 = eθt γtgt + Cr(et + γtgt) Z(α,β) (Cr(et + γtgt)) . (15) The following lemma controls the optimality gap E eθt θ 2 when f is quasi-convex. Lemma 11 (Descent for quasi-convex). Let {θt, et}t 0 be defined as in LASER. Assume that f is µ-quasi convex for some µ 0 and that Assumptions 2 and 3 hold. If γt 1 4L(1+M) 1 2λLASER for all t 0, then for {eθt}t 0 defined in Eq. (14), E eθt+1 θ 2 1 µγt E eθt θ 2 γt 2 E(f(θt) f ) + γ2 t σ2(1 + λLASER) + (3Lγt(1 + λLASER) + λLASER)E θt eθt 2 . (16) Notice that Lemma 11 is similar to Lemma 1 for noiseless EF-SGD except for an additional channel influence factor λLASER. The following result bounds the error norm. Lemma 12 (Error control). Let et be as in (LASER) for a δr-approximate compressor Cr and stepsizes {γt}t 0 with γt 1 10L(2/δr+M)(1+λLASER), t 0 and {γ2 t }t 0 2 δr -slow decaying. Further suppose that Assumption 5 holds. Then 3L(1 + λLASER) + λLASER E et+1 2 δr t i E f(θt i) 2 + γtσ2(1 + λLASER) . Furthermore, for any 4 δr -slow increasing non-negative sequence {wt}t 0 it holds: 3L(1 + λLASER) + λLASER t=0 wt E et 2 1 t=0 wt E f(θt) 2 + σ2(1 + λLASER) The following lemma establishes the progress in the descent for non-convex case. Lemma 13 (Descent for non-convex). Let {θt, et}t 0 be defined as in LASER and that Assumptions 2 and 3 hold. If LASER: Linear Compression in Wireless Distributed Optimization γt 1 4L(1+M)(1+λLASER) for all t 0, then for {eθt}t 0 defined in Eq. (14), E[f(eθt+1)] E[f(eθt)] γt 4 E f(θt) 2 + γ2 t Lσ2(1 + λLASER) + E θt eθt 2 L2γt 2 + LλLASER C. Proof of Thm 1 Proof. We prove the bounds in (i) and (ii) when f is quasi-convex, (iii) when f is an arbitrary non-convex function, and (iv) for Z-SGD. (i), (ii) f is µ-quasi-convex: Observe that the assumptions of Thm 1 automatically satisfy the conditions of Lemma 11. Denoting rt E eθt+1 θ 2 and st E(f(θt) f ), for any wt > 0 we obtain 2 st (16) wt γt rt+1 + γtwtσ2(1 + λLASER) + 3wt(L(1 + λLASER) + λLASER γt )E et 2 . Taking summation on both sides and invoking Lemma 2 (assumption on wt verified below), γt rt+1 + 2γtwtσ2(1 + λLASER) + 1 t=0 wt E f(θt) 2 . Since f is L-smooth, we have f(θt) 2 2L(f(θt) f ). Now rewriting the above inequality, we have γt rt+1 + 2γtwtσ2(1 + λLASER) . Substituting WT PT t=0 wt, t=0 wtst 6 WT γt rt+1 + 2γtwtσ2(1 + λLASER) =: ΞT . Now it remains to derive the estimate for ΞT . Towards this, (i) if µ > 0 and with constant stepsize γt = γ 1 10L( 2 δr +M)(1+λLASER), we observe that (1 µγ 16 and by Example 1 in (Stich & Karimireddy, 2020), the weights wt = 1 µγ 2 (t+1) are 2τ-slow increasing with τ = 2 δr . Hence the claim in (i) follows by applying Lemma 3 and observing that the sampling probablity to choose θout from {θt}T 1 t=0 is same as wt. For (ii) with constant stepsize and µ = 0, we apply Lemma 6 by setting the weights wt = 1. (iii) f is non-convex The proof in this case is very similar to that of the above. 
Denoting rt 4E[f(eθt) f ], st E f(θt) 2, c = 4Lσ2(1 + λLASER), and wt = 1, we have from Lemma 13 that (19) rt 4γt rt+1 2 3L(1 + λLASER), multiplying both sides of the above inequality by wt and taking summation, we obtain t=0 wtst (18) 1 WT which upon rearranging gives t=0 wtst 12 LASER: Linear Compression in Wireless Distributed Optimization Now invoking Lemma 6 yields the final result in (iii). Z-SGD: Recall from Z-SGD that the iterates {θt}t 0 are given by θt+1 = θt γt ZP (gt). Thus Z-SGD can be thought of as a special case of EF-SGD with no compression, i.e. δr = 1, and hence we can utilize the same convergence tools. It remains to estimate the first and second moments of the stochastic gradient ZP (gt). Recall from the definition of ZP in the noisy channel that ZP (gt) = gt + gt P Zt, where Zt is a zero-mean independent channel noise, and from Assumption 3 that gt = f(θt) + ξt with a (M, σ2)-bounded noise ξt. Hence E [ZP (gt)|θt] = E[gt|θt] = f(θt), E ZP (gt) f(θt) 2|θt = E ZP (gt) gt + gt f(θt) 2|θt = E ZP (gt) gt 2|θt + E gt f(θt) 2|θt 4= E λZ-SGD gt 2|θt + E ξt 2 = λZ-SGD f(θt) 2 + (1 + λZ-SGD)E ξt 2 (M + 1)(1 + λZ-SGD) f(θt) 2 + (1 + λZ-SGD)σ2. Thus Z-SGD satifies the (f M, eσ2)-bounded noise condition in Assumption 3 with f M = (M + 1)(1 + λZ-SGD) and eσ2 = (1+λZ-SGD)σ2. Thus the claim (iv) follows from applying Thm 2 and Thm 3 with the constants δr 1, M f M, σ2 eσ2. Finally, Lemma 9 and Lemma 10 establish the relation between the channel influence factors λZ-SGD and λLASER. D. Proof of technical lemmas D.1. Proof of Lemma 7 Proof. Since log( ) is a monotonic function, minimizing f P (α, β) over SP = {(α, β) : α 0, β 0, α + β = P} is equivalent to minimizing log f P (α, β) = log 1 + m α + log 1 + n β . Define the Lagrangian L(α, β, λ) as L(α, β, λ) log 1 + m + log 1 + n + λ(α + β P). Letting αL = βL = 0, we obtain that m α(m+α) = n β(n+β). Now constraining α + β = P, we obtain the following quadratic equation: If m = n, the solution is given by α = β = P/2. If m = n, the solution is given by It is easy to verify that (α , β ) is the unique minimizer to f P since it s convex over SP . Now it remains to show the upper bound for f P (α , β ). Without loss of generality, in the reminder of the proof we assume m < n and denote α by simply α. Rewriting the optimal α in Eq. (20) in terms of SNR = P/mn, we obtain (1 + n SNR)(1 + m SNR) (1 + m SNR) LASER: Linear Compression in Wireless Distributed Optimization Now substituting this α and corresponding β in f P (α, β) = 1 + m β and rearranging the terms, we get f P (α, β) = 1 + 1 SNR 1 1 2α mn SNR = 1 + 1 n SNR n m 1 1 2α mn SNR n < 1. Now we study the behavior of α in Eq. (21) as a function of γ. In particular, define g(γ) 1 + n SNR 1 + nγ SNR. Observe that g(1) = 1 + n SNR and g (1) = n SNR 2 . Rewriting Eq. (21) as a function of γ, we get α mn = g(γ) (1 + nγ SNR) = g(1) + g (1)(γ 1) (1 + nγ SNR) + g 2 (γ 1)2 + g 3! (γ 1)3 + . . . n(1 γ) 3! (1 γ)2 + . . . . Utilizing the fact that g (1) = 1 4 n2SNR2 1+n SNR, g (1) = 3 8 n3SNR3 (1+n SNR)2 and so forth, we obtain 1 2α mn SNR = 2(1 γ) 2 1 4 n2 SNR2 1 + n SNR + 1 3! 3 8 n3 SNR3 (1 + n SNR)2 (1 γ) + . . . n SNR 1 2 1 4 n2 SNR2 4 n SNR 1 + n SNR. Substituting this bound back in the experssion for f P yields the final bound: f P (α, β) 1 + 4 nγ SNR 1 + 1 n SNR = 1 + 4 m SNR 1 + 1 n SNR D.2. Proof of Lemma 8 Proof. To minimize f P (α, β) over SP = {{(α, β) : α 0, β 0, P i(αi + βi) = P}, we consider a slightly relaxed version that serves as an upper bound to this problem. 
In particular, first we divide the power P into P1, . . . , Pr such that P i Pi = P and Pi 0. Then for each Pi we find the optimal αi and βi from rank-1 allocation scheme in Lemma 7 and compute the corresponding objective value. In the end, we find a tractable scheme for division of power P among P1, . . . , Pr minimizing this objective. Mathematically, min (α,β) SP f P (α, β) min {P i Pi=P } min {(αi,βi):αi+βi=Pi,i [r]} i κi min (αi,βi):αi+βi=Pi (Lemma 7) min {P 1 + 4 m SNRi 1 + 1 n SNRi κi SNRi + 4 mn LASER: Linear Compression in Wireless Distributed Optimization Choosing SNRi κi, i.e. SNRi = SNR κi P j κj , and substituting this allocation above, we obtain min (α,β) SP f P (α, β) 1 + 4 m SNR + 4 mn SNR2 R 1 + 4 (m/r) SNR 1 + 4 (n/r) SNR where we used the inequality P i κi 2 r together with the fact that P D.3. Proof of Lemma 9 Proof. Recall from Z-SGD that the stochastic gradient reconstructed at the receiver after transmitting g is y Z-SGD(g) ZP (g) = g + g P Z, where Z is a zero-mean independent channel noise in Rm n. Thus λZ-SGD = 1 g 2 EZ y Z-SGD(g) g 2 = 1 g 2 g 2 P E Z 2 = mn D.4. Proof of Lemma 10 Proof. In view of LASER, denote the error compensated gradient at time t as M = et + γtgt and its compression as M r = Cr(M) = Pr i=1 piq i with orthogonal factors {pi} and orthonormal {qi} (without loss of generality). After transmitting these factors of M r via the noisy channel, we obtain y LASER(M r) = Z(α,β)(M r) = pi + pi αi Z(i) m qi + qi βi Z(i) n Denote epi pi + pi αi Z(i) m , eqi qi + qi βi Z(i) n , and Z = (Z(i) m , Z(i) n )r i=1. We observe that EZ[y LASER(M r)] = M r. Hence EZ y LASER(M r) M r 2 = EZ X i epieq i 2 M r 2 i EZ epi 2 EZ eqi 2 X i p 2 q 2 1 + m (Lemma 8) = M r 2 (f P (α, β) 1) , where we set κi = pi 2/ M r 2. Now choosing (α, β) = (α , β ) as in Lemma 8 yields the desired result. D.5. Proof of Lemma 11 Proof. From Eq. (15), we have that eθt+1 = eθt γtgt + Cr(et + γtgt) Z(α,β) (Cr(et + γtgt)) . LASER: Linear Compression in Wireless Distributed Optimization Denoting Error Z = Cr(et + γtgt) Z(α,β) (Cr(et + γtgt)), we observe that EZ[Error Z] = 0 and EZ Error Z 2 λLASER Cr(et + γtgt) 2 λLASER et + γtgt 2 (see App. D.4). Thus E eθt+1 θ 2 = E eθt θ γtgt 2 + E Error Z 2 = E eθt θ 2 2γt E gt, eθt θ + γ2 t E gt 2 + E Error Z 2 E eθt θ 2 2γt E gt, θt θ + 2γt E gt, θt eθt + γ2 t E gt 2 + λLASERE et + γtgt 2 = E eθt θ 2 2γt E gt, θt θ + 2γt E gt, θt eθt (1 + λLASER) + γ2 t E gt 2(1 + λLASER) + λLASERE et 2 (Assump. 3) E eθt θ 2 2γt E f(θt), θt θ + 2γt E f(θt), θt eθt (1 + λLASER) + (M + 1)(1 + λLASER)γ2 t E f(θt) 2 + γ2 t σ2(1 + λLASER) + λLASERE et 2. (22) Now we closely follow the steps as in the proof of (Stich & Karimireddy, 2020), Lemma 8. Since f is L-smooth, we have f(θt) 2 2L(f(θt) f . Further, by Assumption 1, 2 f(θt), θt θ µ θt θ 2 2(f(θt) f ) , and since 2 a, b α a 2 + α 1 b 2 for α > 0, a, b Rd, we have 2 f(θt), eθt θt 1 2L f(θt) 2 + 2L θt eθt 2 f(θt) f + 2L θt eθt 2 . And by a + b 2 (1 + β) a 2 + (1 + β 1) b 2 for β > 0 (via Jensen s inequality), we observe 2 eθt θ 2 + θt eθt 2 . Plugging these inequalities in Eq. (22), we obtain that E eθt+1 θ 2 E eθt θ 2 γt (1 λLASER 2L(M + 1)(1 + λLASER)γt) E(f(θt) f ) + γ2 t σ2(1 + λLASER) + (µγt + 2Lγt(1 + λLASER))E et 2. Utilizing the fact that γt 1 2λLASER 4L(M+1)(1+λLASER) and µ L yields the desired claim. D.6. Proof of Lemma 12 Proof. The proof of Lemma 12 is very similar to that of Lemma 2 for EF-SGD. In that proof, a key step is to establish that (3L(2/δ + M)γ2 t ) δ 64L and (3Lγt 4/δ) 1. 
In our setting, γt 1 10L(2/δr+M)(1+λLASER) and λLASER 1 10(2/δr+M). Thus 3L(1 + λLASER) + λLASER δr + M (1 + λLASER)γt γt + λLASER 10 1 10L( 2 δr + M)(1 + λLASER) LASER: Linear Compression in Wireless Distributed Optimization 4 δr (3L(1 + λLASER)γt + λLASER) = 3L(1 + λLASER) 4 δr γt + λLASER 4 δr D.7. Proof of Lemma 13 Proof. From Eq. (15), we have that eθt+1 = eθt γtgt + Cr(et + γtgt) Z(α,β) (Cr(et + γtgt)) . Denoting Error Z = Cr(et + γtgt) Z(α,β) (Cr(et + γtgt)), we observe that EZ[Error Z] = 0 and EZ Error Z 2 λLASER Cr(et + γtgt) 2 λLASER et + γtgt 2 (see App. D.4). Using the smoothness of f, f(eθt+1) f(eθt) γt f(eθt), gt + f(eθt), Error Z + L 2 γtgt + Error Z 2 Taking expectation on both sides, Ef(eθt+1) Ef(eθt) γt E f(eθt), f(θt) + L 2 γ2 t E gt 2 + λLASERE et + γtgt 2 . Rewriting f(eθt), f(θt) = f(θt) 2 + f(eθt) f(θt), f(θt) and using a, b 1 2 b 2, we can simplify the expression as f(eθt) f(θt), f(θt) 1 2 f(θt) f(eθt) 2 + 1 2 θt eθt 2 + 1 2 f(θt) 2 . Pluggin this inequality back together with E gt 2 (M + 1)E f(θt) 2 + σ2, we get Ef(eθt+1) Ef(eθt) γt 2 (1 2γt L(M + 1)(1 + λLASER)) E f(θt) 2 + Lγ2 t σ2(1 + λLASER) Now utilizing the fact γt 1 4L(M+1)(1+λLASER) establishes the desired result. E. Additional details about noisy channel and LASER E.1. Channel transformation Recall from Eq. (2) in Sec. 2 that the server first obtains y = Pk i=1 aigi + Z, where aigi 2 P (note that we use the constant scheme Pt = P as justified in Sec. 4.2). Now we want to show that for estimating the gradient sum P i gi through a linear transformation on y, the optimal power scalars are given by ai = P maxj gj , i [k], which yields the channel model in (noisy channel). Towards this, first let k = 2 (the proof for general k is similar). Thus our objective is min a1,a2,b E y b g1 g2 2 . LASER: Linear Compression in Wireless Distributed Optimization For any a1, a2, b, we have that b g1 g2 2 = min a1,a2,b: aigi 2 P E g1 a1 b 1 + g2 a2 = min a1,a2,b: aigi 2 P E f(θ)( 1 + 2) + 1 ξ1 + 2 ξ2 + Z = min a1,a2,b: aigi 2 P f(θ) 2( 1 + 2)2 + 2 1 E ξ1 2 + 2 2 E ξ2 2 + E Z 2 where we used the fact that g1 = f(θ) + ξ1 and g2 = f(θ) + ξ2 with zero-mean and independent ξ1, ξ2, and Z. We now observe that for any fixed b the optimal ai s are given by a1 = a2 = b, i.e. 1 = 2 = 0. To determine the optimal b, we have to solve max b s.t. b gi 2 P, which yields b = P/maxi gi . The proof for general k is similar. E.2. Detailed steps for Alg. 1 Recall from Alg. 1 that power allocation among clients is done via the function POWERALLOC({Cr(M j), M j}). The theoretically optimal power allocation is discussed in App. B.1, and given explicitly in Lemma 8. However we empirically observe that we can relax this allocation scheme and even simpler schemes suffice to beat the other considered baselines. This is detailed in App. F.6. LASER: Linear Compression in Wireless Distributed Optimization E.3. Constant-order SNR As discussed in Sec. 3.2 and established in Lemmas 9 and 10 of App. B.2, we have that λLASER 4 (m/r) SNR 1 + 1 (n/r) SNR 1 SNR = λZ-SGD. In the low-rank (Vogels et al., 2019) and constant-order SNR regime where r = O(1) and SNR = Ω(1), we observe that λLASER is roughly O(m) times smaller than λZ-SGD. Note that this is only a sufficient theoretical condition to ensure that the ratio between λLASER and λZ-SGD is smaller than one. In fact, a much weaker condition that P/4r2 > 1 suffices. 
To establish this, we note λZ-SGD = 4r 1 + r n SNR The first term is usually negligible since we always fix the rank r = 4, which is much smaller compared to m in the architectures we consider. Thus if P/4r2 > 1, we see that the above ratio is smaller than one. Note that the constant-order SNR assumption already guarantees this: SNR = Ω(1) P mn P r2, since r is smaller than both m and n. On the other hand, for the RESNET18 architecture with L = 61 layers and r = 4, the power levels P = 250, 500 violate the above condition as P/(Lr2) < 4 (note that the budget P here is for the entire network and hence replaced by P/L). But empirically we still observe the accuracy gains in this low-power regime (Fig. 2 in the paper). F. Experimental details We provide technical details for the experiments demonstrated in Sec. 4. F.1. WIKITEXT-103 experimental setup This section concerns the experimental details used to obtain Fig. 1 and Table 1 in the main text. Table 6 collects the settings we adopted to run our code. Table 7 describes the model architecture, with its parameters, their shape and their uncompressed size. Table 6: Default experimental settings for the GPT-2 model used to learn the WIKITEXT-103 task. Dataset WIKITEXT-103 Architecture GPT-2 (as implemented in (Pagliardini, 2023)) Number of workers 4 Batch size 15 per worker Accumulation steps 3 Optimizer Adam W (β1 = 0.9, β2 = 0.95) Learning rate 0.001 Scheduler Cosine # Iterations 20000 Weight decay 1 10 3 Dropout 0.2 Sequence length 512 Embeddings 768 Transformer layers 12 Attention heads 12 Power budget 6 levels: 10k, 40k, 160k, 640k, 2560k, 10240k Power allocation Proportional to norm of compressed gradients (uncompressed gradients for Z-SGD) Compression Rank 4 for LASER; 0.2 compression factor for other baselines Repetitions 1 LASER: Linear Compression in Wireless Distributed Optimization Table 7: Parameters in the GPT-2 architecture, with their shape and uncompressed size. Parameter Gradient tensor shape Matrix shape Uncompressed size transformer.wte 50304 768 50304 768 155 MB transformer.wpe 512 768 512 768 1573 KB transformer.h.ln_1 ( 12) 768 768 1 (12 ) 3 KB transformer.h.attn.c_attn ( 12) 2304 768 2304 768 (12 ) 7078 KB transformer.h.attn.c_proj ( 12) 768 768 768 768 (12 ) 2359 KB transformer.h.ln_2 ( 12) 768 768 1 (12 ) 3 KB transformer.h.mlp.c_fc ( 12) 3072 768 3072 768 (12 ) 9437 KB transformer.h.mlp.c_proj ( 12) 768 3072 768 3072 (12 ) 9437 KB transformer.ln_f 768 768 1 3 KB Total 496 MB LASER: Linear Compression in Wireless Distributed Optimization F.2. CIFAR10 experimental setup This section concerns the experimental details used to obtain Fig. 2 and Table 3 in the main text. Table 8 collects the settings we adopted to run our code. Table 9 describes the model architecture, with its parameters, their shape and their uncompressed size. Table 8: Default experimental settings for the RESNET18 model used to learn the CIFAR10 task. 
Dataset CIFAR10 Architecture RESNET18 Number of workers 16 Batch size 128 per worker Optimizer SGD Momentum 0.9 Learning rate Grid-searched in {0.001, 0.005, 0.01, 0.05} for each power level # Epochs 150 Weight decay 1 10 4, 0 for Batch Norm parameters Power budget 10 levels: 250, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000, 128000 Power allocation Proportional to norm of compressed gradients (uncompressed gradients for Z-SGD) Compression Rank 4 for LASER; 0.2 compression factor for other baselines Repetitions 3, with varying seeds Table 9: Parameters in the Res Net18 architecture, with their shape and uncompressed size. Parameter Gradient tensor shape Matrix shape Uncompressed size layer4.1.conv2 512 512 3 3 512 4608 9437 KB layer4.0.conv2 512 512 3 3 512 4608 9437 KB layer4.1.conv1 512 512 3 3 512 4608 9437 KB layer4.0.conv1 512 256 3 3 512 2304 4719 KB layer3.1.conv2 256 256 3 3 256 2304 2359 KB layer3.1.conv1 256 256 3 3 256 2304 2359 KB layer3.0.conv2 256 256 3 3 256 2304 2359 KB layer3.0.conv1 256 128 3 3 256 1152 1180 KB layer2.1.conv2 128 128 3 3 128 1152 590 KB layer2.1.conv1 128 128 3 3 128 1152 590 KB layer2.0.conv2 128 128 3 3 128 1152 590 KB layer4.0.shortcut.0 512 256 1 1 512 256 524 KB layer2.0.conv1 128 64 3 3 128 576 295 KB layer1.1.conv1 64 64 3 3 64 576 147 KB layer1.1.conv2 64 64 3 3 64 576 147 KB layer1.0.conv2 64 64 3 3 64 576 147 KB layer1.0.conv1 64 64 3 3 64 576 147 KB layer3.0.shortcut.0 256 128 1 1 256 128 131 KB layer2.0.shortcut.0 128 64 1 1 128 64 33 KB linear 10 512 10 512 20 KB conv1 64 3 3 3 64 27 7 KB Bias vectors (total) 38 KB Total 45 MB LASER: Linear Compression in Wireless Distributed Optimization F.3. CIFAR100 experimental results This section concerns experimental results on CIFAR100. We used the same RESNET18 architecture as for CIFAR10 (except for the final layer, adapted to the 100-class dataset). We once again compared LASER to the usual baselines. Fig. 4 and Table 12 collect the results that we obtained. It can be seen that LASER outperforms the other algorithms with an even wider margin compared to the CIFAR10 and WIKITEXT-103 tasks, with a power gain of around 32 across different accuracy targets. SIGNUM is much more sensitive to noise and performs much worse than the other algorithms; therefore, we decided to leave out its results in order to improve the quality of the plot. Table 10 collects the settings we adopted to run our code. Table 11 describes the model architecture, with its parameters, their shape and their uncompressed size. Table 10: Default experimental settings for the RESNET18 model used to learn the CIFAR100 task. Dataset CIFAR100 Architecture RESNET18 Number of workers 16 Batch size 128 per worker Optimizer SGD Momentum 0.9 Learning rate Grid-searched in {0.001, 0.005, 0.01, 0.05} for each power level LR decay /10 at epoch 150 # Epochs 200 Weight decay 1 10 4 0 for Batch Norm parameters Power budget 10 levels: 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000, 128000, 256000 Power allocation Proportional to norm of compressed gradients (uncompressed gradients for Z-SGD) Repetitions 3, with varying seeds Compression Rank 4 for LASER; 0.2 compression factor for other baselines Table 11: Parameters in the Res Net18 architecture, with their shape and uncompressed size. 
Parameter Gradient tensor shape Matrix shape Uncompressed size layer4.1.conv2 512 512 3 3 512 4608 9437 KB layer4.0.conv2 512 512 3 3 512 4608 9437 KB layer4.1.conv1 512 512 3 3 512 4608 9437 KB layer4.0.conv1 512 256 3 3 512 2304 4719 KB layer3.1.conv2 256 256 3 3 256 2304 2359 KB layer3.1.conv1 256 256 3 3 256 2304 2359 KB layer3.0.conv2 256 256 3 3 256 2304 2359 KB layer3.0.conv1 256 128 3 3 256 1152 1180 KB layer2.1.conv2 128 128 3 3 128 1152 590 KB layer2.1.conv1 128 128 3 3 128 1152 590 KB layer2.0.conv2 128 128 3 3 128 1152 590 KB layer4.0.shortcut.0 512 256 1 1 512 256 524 KB layer2.0.conv1 128 64 3 3 128 576 295 KB layer1.1.conv1 64 64 3 3 64 576 147 KB layer1.1.conv2 64 64 3 3 64 576 147 KB layer1.0.conv2 64 64 3 3 64 576 147 KB layer1.0.conv1 64 64 3 3 64 576 147 KB layer3.0.shortcut.0 256 128 1 1 256 128 131 KB layer2.0.shortcut.0 128 64 1 1 128 64 33 KB linear 100 512 100 512 205 KB conv1 64 3 3 3 64 27 7 KB Bias vectors (total) 38 KB Total 45 MB LASER: Linear Compression in Wireless Distributed Optimization Power budget LASER Z-SGD Sketching Random-K Noiseless SGD Figure 4: Test accuracy (higher the better) for a given power budget on CIFAR-100 for different algorithms. The advantage of LASER is evident across the entire power spectrum. Table 12: Power required (lower the better) to reach the given target accuracy on CIFAR-100. LASER requires 16 32 lesser power than the Z-SGD to achieve the same targetaccuracy. Equivalently, LASER tolerates more channel noise than the Z-SGD for the same target accuracy as is partly supported by our theoretical analysis. Target Power required Reduction LASER Z-SGD 65% 500 8000 16 68% 1000 32000 32 70% 2000 64000 32 71% 8000 256000 32 LASER: Linear Compression in Wireless Distributed Optimization F.4. MNIST experimental setup This section concerns the experimental details used to obtain Table 4 in the main text. Table 13 collects the settings we adopted to run our code. Table 13: Default experimental settings for the 1-LAYER NN used to learn the MNIST task. Dataset MNIST Architecture 1-LAYER NN Number of workers 16 Batch size 128 per worker Optimizer SGD Momentum 0.9 Learning rate 0.01 # Epochs 50 Weight decay 1 10 4, Power budget 3 levels: 0.1, 1, 10 Power allocation Proportional to norm of compressed gradients (uncompressed gradients for Z-SGD) Repetitions 3, with varying seeds Compression Rank 2 for LASER; 0.1 compression factor for other baselines F.5. Rank-accuracy tradeoff There exists an inherent tradeoff between the decomposition rank r (and hence the compression factor δr) and the final model accuracy. In fact, a small rank r implies aggressive compression and hence the compression noise dominates the channel noise. Similarly, for a high decomposition rank, the channel noise overpowers the compression noise as the power available per each coordinate is small. We empirically investigate this phenomenon for CIFAR10 classification over various power regimes in Fig. 5. 1 2 4 8 16 32 Rank Test accuracy after 150 epochs P=250 P=4000 P=128000 Figure 5: Final accuracy vs. compression rank tradeoff for CIFAR-10 classification, for low, medium and high power regimes. Rank4/Rank-8 compression is optimal for all the three regimes. It reveals two interesting insights: (i) performance is uniformly worse in all the regimes with overly aggressive rank-one compression, and (ii) higher rank compression impacts low power regime more significantly than the medium and high-power counterparts. 
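To make this tension concrete, the toy NumPy sketch below (our illustration, not the code behind Fig. 5; the layer shape, spectrum decay, and power budget are arbitrary assumptions) contrasts the two competing error sources from the analysis: the relative energy discarded by a rank-r approximation, which shrinks as r grows, and the channel-influence bound of Lemma 10, λ_LASER ≤ [4/((m/r)·SNR)]·[1 + 1/((n/r)·SNR)] with SNR = P/(mn), which grows with r.

```python
import numpy as np

def rank_r_error(M, r):
    # Relative energy lost by the best rank-r approximation of M (via SVD).
    s = np.linalg.svd(M, compute_uv=False)
    return 1.0 - (s[:r] ** 2).sum() / (s ** 2).sum()

def channel_influence_bound(m, n, r, P):
    # Upper bound on lambda_LASER from Lemma 10 (m <= n assumed).
    snr = P / (m * n)
    return 4.0 / ((m / r) * snr) * (1.0 + 1.0 / ((n / r) * snr))

m, n, P = 512, 4608, 128000            # illustrative layer shape and power budget
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((m, m)))   # orthonormal left factor
V, _ = np.linalg.qr(rng.standard_normal((n, m)))   # orthonormal columns
spectrum = np.exp(-0.2 * np.arange(m))             # fast-decaying singular values
M = (U * spectrum) @ V.T                           # synthetic "low-rank-ish" gradient

for r in [1, 2, 4, 8, 16, 32]:
    print(f"rank {r:2d}: "
          f"compression error {rank_r_error(M, r):.3f}, "
          f"channel influence bound {channel_influence_bound(m, n, r, P):.3f}")
```

For such a fast-decaying spectrum the first term flattens out after a handful of components while the second keeps growing, which is consistent with intermediate ranks (4 to 8) being the sweet spot in Fig. 5.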
Both observations confirm the intuition that at low power (and hence a noisier channel), it is better to concentrate the limited power budget on a few essential rank components rather than spreading it thinly over many; as Fig. 5 shows, rank-4 or rank-8 compression is optimal in all three power regimes. This phenomenon can be explained theoretically by characterizing the compression factor δr as a function of the rank r and its effect on the model convergence. While a precise expression for δr is technically challenging to obtain, given the inherent difficulty in analyzing the PowerSGD algorithm (Vogels et al., 2019), we believe that a tractable characterization of this quantity (via upper bounds, etc.) can offer fruitful insights into the fundamental rank-accuracy tradeoff at play.

Figure 6: Fraction of energy in the top 8 components of the gradients of three layers in the network: the first and last hidden layers, and one central layer.

To further shed light on this phenomenon, we trained the noiseless SGD on CIFAR10 and tracked, across epochs, the fraction of energy contained in the top eight components of each gradient matrix. As illustrated in Fig. 6, for the first and last hidden layers roughly 80% of the energy is already captured by these eight components, whereas for the middle layer this fraction is around 55%. It would be interesting to further explore this behavior for GPT models and other tasks.

F.6. Power allocation across workers and neural network parameters

The choice of power allocation over the layers of the network is perhaps the most important optimization required in our experimental setup. Notice that, because of Eq. (2), all clients must allocate the same power to a given gradient, since otherwise it would be impossible to recover the correct average gradient. However, workers have a degree of freedom in choosing how to distribute the power budget among gradients, i.e. among the layers of the network, and this power allocation can change over the iterations of model training. App. B.1 analyzes power allocation optimality from a theoretical point of view. On the experimental side, simpler schemes are enough to obtain significant gains over the other baselines. Concretely, we used the following scheme in our experiments: at each iteration, each worker determines locally how to allocate its power budget across the gradients. We assume that this power allocation choice is communicated by the client to the server noiselessly. The server then averages the power allocation choices and communicates the final power allocation to the clients, which use it to send the gradients to the server via the noisy channel; a minimal sketch of this exchange is given below.
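In this sketch (our own simplification; the function names, the single aggregation round, and the toy shapes are illustrative assumptions, not the code used in the experiments), each worker proposes per-layer weights proportional to the norms of its compressed gradients, and the server averages the proposals and rescales them to meet the total budget P.

```python
import numpy as np

def worker_proposal(compressed_grads):
    # Each worker proposes per-layer weights proportional to the norms of
    # its *compressed* gradients (the quantities actually transmitted).
    norms = np.array([np.linalg.norm(g) for g in compressed_grads])
    return norms / norms.sum()

def server_allocation(proposals, P):
    # The server averages the (noiselessly received) proposals and rescales
    # them so that the per-layer powers sum to the total budget P.
    avg = np.mean(proposals, axis=0)
    return P * avg / avg.sum()

# Toy example: 3 workers, 4 layers, total power budget P = 1000.
rng = np.random.default_rng(0)
workers = [[rng.standard_normal((8, 8)) for _ in range(4)] for _ in range(3)]
proposals = np.stack([worker_proposal(w) for w in workers])
per_layer_power = server_allocation(proposals, P=1000.0)
print(per_layer_power, per_layer_power.sum())
```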
For the determination of each worker's power allocation, three schemes were considered: uniform power for each gradient; power proportional to the Frobenius norm (or its square) of the gradients; and power proportional to the norm of the compressed gradients (i.e., the norm of what is actually communicated to the server). For Z-SGD, where there is no gradient compression, the best power allocation turned out to be the one proportional to the norm of the gradients, independently of the power constraint imposed. For all the other algorithms, the best choice is power proportional to the norm of the compressed gradients.

F.7. Static vs. dynamic power policy

As discussed in Sec. 4.2, we analyzed different power allocation schemes across iterations, given a fixed budget in terms of average power over the epochs. Fig. 3 shows the results for decreasing power allocations, while Fig. 7 here shows their increasing counterparts. We observe that LASER exhibits similar gains over Z-SGD for all the power control laws. Further, constant power remains the best policy for both LASER and Z-SGD. Whilst matching the constant-power performance, the power-decreasing control performs better than its increasing counterpart for Z-SGD, especially in the low-power regime, where the accuracy gains are roughly 4-5%.

Figure 7: Final accuracy vs. power budget P with various power control schemes, for distributed training across 16 workers with RESNET18 on CIFAR10. For each budget P, we consider three increasing power control laws, as studied in the literature [1], that satisfy the average power constraint: (i) constant power, Pt = P, (ii) piecewise constant, with power levels Pt ∈ {P/3, 2P/3, P, 4P/3, 5P/3}, and (iii) a linear law between the levels P/3 and 5P/3. The performance of the increasing power allocation schemes is equal to or worse than that of their decreasing counterparts in Fig. 3.

F.8. Baselines implementation

In this section we describe our implementation of the baselines considered in the paper.

F.8.1. COUNT-MEAN SKETCHING

Algorithm 2 COUNT-MEAN SKETCHING
function COMPRESS(gradient matrix M ∈ R^{n×m})
  Treat M as a vector of length nm.
  Set the number of buckets b to nm × (compression factor); if the result is less than 1, set b = 1.
  Sample a set of nm indices I i.i.d. between 0 and b − 1, using the same seed on all workers.
  Sample a set of nm signs S ∈ {+1, −1} i.i.d., using the same seed as for I.
  Ĉ ← 0 ∈ R^b
  for j = 0, . . . , nm − 1 do
    Ĉ(I(j)) ← Ĉ(I(j)) + S(j) · M(j)
  end for
  return Ĉ
function AGGREGATE+DECOMPRESS(workers' sketches Ĉ1, . . . , Ĉk)
  Sample I and S as before, using the same seed.
  M̂ ← 0 ∈ R^{n×m}
  M̂(I) ← (1/k) Σ_{i=1}^{k} Ĉi(I) ⊙ S
  return M̂

Power is allocated proportionally to the norms of the compressed gradients. The algorithm is implemented without local error feedback, since error feedback causes it to diverge. The compression factor was grid-searched in {0.1, 0.2, 0.5, 0.8}, and 0.2 was finally chosen as the overall best; a NumPy rendering of this procedure is sketched below.
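The sketch renders Algorithm 2 under our own conventions (the function names, the shared-seed RNG, and the toy round trip are ours, not the implementation used in the experiments); bucket collisions make the recovery lossy by design.

```python
import numpy as np

def cms_compress(M, compression=0.2, seed=0):
    # Hash every coordinate of M into one of b buckets with a random sign and
    # accumulate; b = compression * (number of entries), at least 1.
    x = M.reshape(-1)
    b = max(1, int(compression * x.size))
    rng = np.random.default_rng(seed)            # same seed on every worker
    idx = rng.integers(0, b, size=x.size)
    sign = rng.choice([-1.0, 1.0], size=x.size)
    sketch = np.zeros(b)
    np.add.at(sketch, idx, sign * x)
    return sketch

def cms_decompress(sketches, shape, compression=0.2, seed=0):
    # Average the workers' sketches and read each coordinate back out through
    # the same (index, sign) hash.
    size = int(np.prod(shape))
    b = max(1, int(compression * size))
    rng = np.random.default_rng(seed)            # reproduce the same hash
    idx = rng.integers(0, b, size=size)
    sign = rng.choice([-1.0, 1.0], size=size)
    avg = np.mean(sketches, axis=0)
    return (avg[idx] * sign).reshape(shape)

# Toy round trip with two workers holding the same gradient.
g = np.arange(12.0).reshape(3, 4)
print(cms_decompress([cms_compress(g), cms_compress(g)], g.shape))
```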
F.8.2. RANDOM-K

Algorithm 3 RANDOM-K
function COMPRESS(gradient matrix M ∈ R^{n×m})
  Treat M as a vector of length nm.
  Set the number of samples b to nm × (compression factor); if the result is less than 1, set b = 1.
  Sample a set of b indices I without replacement, using the same seed on all workers.
  return the looked-up values S = M(I)
function AGGREGATE+DECOMPRESS(workers' values S1, . . . , Sk)
  M̂ ← 0 ∈ R^{n×m}
  M̂(I) ← (1/k) Σ_{i=1}^{k} Si
  return M̂

Power is allocated proportionally to the norms of the compressed gradients. The algorithm is implemented with local error feedback. The compression factor was grid-searched in {0.1, 0.2, 0.5, 0.8}, and 0.2 was finally chosen as the overall best.

F.8.3. SIGNUM

Algorithm 4 SIGNUM
function COMPRESS(gradient matrix M ∈ R^{n×m})
  Compute the signs S ∈ {−1, +1}^{n×m} of M
  return S
function AGGREGATE+DECOMPRESS(workers' signs S1, . . . , Sk)
  return SIGN(Σ_{i=1}^{k} Si)

We implemented SIGNUM following (Bernstein et al., 2018) and run it in its original form, without error feedback. Power is allocated proportionally to the norms of the compressed gradients; since the compressed gradients are simply the sign matrices, this amounts to allocating power proportionally to the square root of the number of parameters in each layer, √(mn). Unlike the other baselines, SIGNUM does not require a compression factor.
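For completeness, here is a minimal NumPy sketch of the SIGNUM compress-and-majority-vote step described above; the channel noise and the per-layer power scaling are deliberately omitted, and the function names are our own.

```python
import numpy as np

def signum_compress(M):
    # Each worker transmits only the signs of its gradient entries.
    return np.sign(M)

def signum_aggregate(signs):
    # Majority vote across workers: the sign of the sum of the sign matrices.
    return np.sign(np.sum(signs, axis=0))

# Toy example with 5 workers and a 4x4 gradient matrix.
grads = [np.random.default_rng(i).standard_normal((4, 4)) for i in range(5)]
update = signum_aggregate([signum_compress(g) for g in grads])
print(update)
```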