Private Federated Learning with Autotuned Compression

Enayat Ullah *1,2, Christopher A. Choquette-Choo *3, Peter Kairouz *3, Sewoong Oh *3,4

Abstract. We propose new techniques for reducing communication in private federated learning without the need for setting or tuning compression rates. Our on-the-fly methods automatically adjust the compression rate based on the error induced during training, while maintaining provable privacy guarantees through the use of secure aggregation and differential privacy. Our techniques are provably instance-optimal for mean estimation, meaning that they can adapt to the hardness of the problem with minimal interactivity. We demonstrate the effectiveness of our approach on real-world datasets by achieving favorable compression rates without the need for tuning.

1. Introduction

Federated Learning (FL) is a form of distributed learning whereby a shared global model is trained collaboratively by many clients under the coordination of a central service provider. Often, clients are entities like mobile devices which may contain sensitive or personal user data. FL has a favorable construction for privacy-preserving machine learning, since user data never leaves the device. Building on top of this can provide strong trust models with rigorous user-level differential privacy (DP) guarantees (Dwork et al., 2010a), which has been studied extensively in the literature (Dwork et al., 2010b; McMahan et al., 2017b; 2022; Kairouz et al., 2021b). More recently, it has become evident that secure aggregation (SecAgg) techniques (Bonawitz et al., 2016; Bell et al., 2020) are required to prevent honest-but-curious servers from breaching user privacy (Fowl et al., 2022; Hatamizadeh et al., 2022; Suliman and Leith, 2022). Indeed, SecAgg

*Equal contribution. 1The Johns Hopkins University. 2Work completed while on internship at Google. 3Google Research. 4University of Washington.
and DP give a strong trust model for privacy-preserving FL (Kairouz et al., 2021a; Chen et al., 2022a; Agarwal et al., 2021; Chen et al., 2022b; Xu et al., 2023). However, SecAgg can introduce significant communication burdens, especially in the large-cohort setting which is preferred for DP. In the extreme, this can significantly limit the scalability of DP-FL with SecAgg. This motivated the study of privacy-utility-communication tradeoffs by Chen et al. (2022a), where they found that significant communication reductions could be attained essentially for free (see Figure 1) by using a variant of the Count Sketch linear data structure. However, the approach of Chen et al. (2022a) suffers from a major drawback: it is unclear how to set the sketch size (i.e., compression rate) a priori. Indeed, using a small sketch may lead to a steep degradation in utility whereas a large sketch may hurt bandwidth. Further, as demonstrated in Figure 1, the optimal compression rate can differ substantially for different datasets, tasks, model architectures, and noise multipliers (altogether, DP configurations).

Correspondence to: <{cchoquette,sewoongo,kairouz}@google.com>. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

[Figure 1 plot: best compression rate at a 1% relative accuracy drop vs. noise multiplier z, for the F-EMNIST, Stack Overflow Next Word Prediction, and Shakespeare tasks.]

Figure 1. The optimal compression rates differ highly between different tasks. For each noise multiplier, we run compression rates of 2^i for i ∈ [1, 13], and report the highest compression rate with at most a 1% relative drop in accuracy from the same model without compression. We interpolate accuracies for fine-grained results. See Sec. 5 for task descriptions.
This motivates the following question, which we tackle in this paper: can we devise adaptive (autotuning) compression techniques which find favorable compression rates in an instance-specific manner?

Naive approaches to tuning the compression rate. Before introducing our proposed approaches, we discuss the challenges of using approaches based on standard hyperparameter tuning. The two most common approaches are:

1. Grid search (Genie): Here, we a priori fix a set of compression rates, try all of them, and choose the one with the best compression-utility (accuracy) tradeoff. The problem is that this requires trying most or all compression rates, and would likely lead to even more communication in totality.

2. Doubling trick: This involves starting with an initial optimistic guess for the compression rate, using it, and evaluating the utility of this choice. The process is then repeated, with the compression rate halved each time, until a predetermined stopping criterion is met. While this method ensures that the required communication is no more than twice the optimal amount, it can be difficult to determine a principled stopping criterion (see Figure 1, where we show that optimal compression rates depend on the task and model used). It may be desirable to maintain a level of utility close to that of the uncompressed approach, but this requires knowledge of the utility of the uncompressed method, which is unavailable.

In our approach, instead of treating the optimization procedure as a closed box that takes the compression rate as input, we open it up to identify the core component, mean estimation, which is more amenable to adaptive tuning.

The proxy of mean estimation. A core component of first-order optimization methods is estimating the mean of gradients at every step.
From an algorithmic perspective, this mean-estimation view has been fruitful in incorporating modern constraints, such as privacy (Bassily et al., 2014; Abadi et al., 2016; Choquette-Choo et al., 2022), robustness (Diakonikolas et al., 2019; Prasad et al., 2018), and compression (Alistarh et al., 2017; Chen et al., 2022a; Suresh et al., 2022), by simply designing appropriate mean estimation procedures while reusing the rest of the optimization method as is. In FL, the popular Federated Averaging (FedAvg) algorithm (McMahan et al., 2017a), which we will use, computes a mean of client updates at every round.

Our theoretical results. Previously, Chen et al. (2022a) studied federated mean estimation under the constraints of SecAgg and (ϵ, δ)-DP, and showed that with n clients (data points) in d dimensions, the optimal (worst-case) communication cost is min(d, n²ϵ²) bits such that the mean squared error is of the same order as the optimal error under (central) DP, without any compression or SecAgg. However, in practice, instances are rarely worst-case, and one would desire a method with more instance-specific performance. Towards this, we introduce two fine-grained descriptions of instance classes, parameterized by (a) the norm of the mean (Sec. 3.1), motivated by the fact that in FL the mean vector is the average per-round gradient and, from classical optimization wisdom, its norm goes down as training progresses, and (b) the tail-norm of the mean (Sec. 3.2), which captures the setting where the mean is approximately sparse, motivated by empirical observations of this phenomenon on gradients in deep learning (Micikevicius et al., 2017; Shi et al., 2019). For both of these settings, we design adaptive procedures which are oblivious to the values of the norm and tail-norm, yet yield performance competitive with a method with complete knowledge of these.
Specifically, for the norm-of-mean setting, our proposed procedure has a per-client communication complexity of roughly O(n²ϵ²M²) bits, where M ≤ 1 is the ratio of the norm of the mean to the worst-case norm bound on data points. For the tail-norm setting, we get an improved communication complexity given by the generalized sparsity of the mean; see Sec. 3.2 for details. Further, for both settings, we show that our proposed procedures achieve (a) optimal error under DP, without communication constraints, and (b) optimal communication complexity, under the SecAgg constraint, up to poly-logarithmic factors. We note that adaptivity to tail-norm implies adaptivity to norm, but this comes at the price of ⌈log(d)⌉, rather than two, rounds of communication, which may be less favorable, especially in FL settings. We also show that interaction is necessary for adaptively achieving (nearly) optimal communication complexity. Finally, even without the need for adaptivity, e.g., in centralized settings, our results yield, as by-products, optimal rates for DP mean estimation for the above fine-grained instance classes, parameterized by the norm or tail-norm of the mean, which could be of independent interest.

Our techniques. Our compression technique, count-median-of-means sketching, is based on linear sketching and generalizes the count-mean sketch used in Chen et al. (2022a). Our proposed protocol for federated mean estimation (FME) comprises multiple rounds of communication in which each participating client sends two (as opposed to one) sketches. The first sketch is used to estimate the mean, as in prior work, whereas the second sketch, which is much smaller, is used to track certain statistics for adaptivity. The unifying guiding principle behind all the proposed methods is to set the compression rate such that the total compression error does not overwhelm the DP error.

Experimental evaluation.
We map our mean estimation technique to the FedAvg algorithm and test it on three standard FL benchmark tasks: a character/digit recognition task on the F-EMNIST dataset and next-word prediction on the Shakespeare and Stack Overflow datasets (see Sec. 5 for details). We find that our proposed technique can obtain favorable compression rates without tuning. In particular, we find that our one-shot approach tracks the potentially unachievable Genie baseline (shown in Figure 1), with no harm to model utility beyond a small slack (≤ 1%) which we allow for any compression method, including the Genie.

Related work. Mean estimation is one of the most widely studied problems in the DP literature. A standard setting deals with bounded data points (Steinke and Ullman, 2015; Kamath and Ullman, 2020), which is what we focus on in our work. However, mean estimation has also been studied under probabilistic assumptions on the data generating process (Karwa and Vadhan, 2017; Bun and Steinke, 2019; Kamath et al., 2020; Biswas et al., 2020; Liu et al., 2021; 2022). The work most related to ours is that of Chen et al. (2022a), which showed that, in the worst case, O(min(d, n²ϵ²)) bits of per-client communication is sufficient and necessary for achieving the optimal error rate for SecAgg-compatible distributed DP mean estimation. Besides this, Agarwal et al. (2018; 2021) and Kairouz et al. (2021a) also study compression under DP in FL settings, but rely on quantization and thus incur Ω(d) per-client communication. Finally, many works, such as Feldman and Talwar (2021), Chen et al. (2020), Asi et al. (2022), Duchi et al. (2018), and Girgis et al. (2021), study mean estimation, with and without compression, under local DP (Warner, 1965; Kasiviswanathan et al., 2011). However, we focus on SecAgg-compatible (distributed) DP. There has been a significant amount of research on optimization with compression in distributed and federated settings.
The most common compression techniques are quantization (Alistarh et al., 2017; Wen et al., 2017) and sparsification (Aji and Heafield, 2017; Stich et al., 2018). Moreover, the works of Makarenko et al. (2022), Jhunjhunwala et al. (2021), and Chen et al. (2018) consider adaptive compression wherein the compression rate is adjusted across rounds. However, the compression techniques used in the aforementioned works are non-linear, and thus are not SecAgg-compatible. The compression techniques most related to ours are those of Ivkin et al. (2019), Rothchild et al. (2020), and Haddadpour et al. (2020), which use linear sketching. However, they do not consider adaptive compression or privacy. Finally, the works of Arora et al. (2022; 2023) and Bu et al. (2021) use random projections akin to sketching in DP optimization, but in a centralized setting, for improved utility or run-time.

Organization. We consider two tasks, Federated Mean Estimation (FME) and Federated Optimization (FO). The former primarily serves as a subroutine for the latter, by considering the vector z at a client to be the model update at that round of FedAvg. In Sec. 3, we propose two algorithms for FME: AdaptNorm, in Alg. 1 (with the formal claim of adaptivity in Thm. 3.1), and AdaptTail, in Alg. 2 (with the formal claim of adaptivity in Thm. 3.4), which adapt to the norm of the mean and the tail-norm of the mean, respectively. In Sec. 4, we show how to extend our AdaptNorm FME protocol to the FO setting in Alg. 3. Finally, in Sec. 5, we evaluate the performance of the AdaptNorm approach for FO on benchmark tasks in FL.

2. Preliminaries

Definition 2.1 ((ϵ, δ)-Differential Privacy). An algorithm A satisfies (ϵ, δ)-differential privacy if for all datasets D and D′ differing in one data point, and all events E in the range of A, we have P(A(D) ∈ E) ≤ e^ϵ P(A(D′) ∈ E) + δ.

Secure Aggregation (SecAgg).
SecAgg is a cryptographic technique that allows multiple parties to compute an aggregate value, such as a sum or average, without revealing their individual contributions to the computation. In the context of FL, the works of Bonawitz et al. (2016) and Bell et al. (2020) proposed practical SecAgg schemes. We assume SecAgg as default, as is the case in typical FL systems.

Count-mean sketching. We describe the compression technique of Chen et al. (2022a), which is based on the sparse Johnson-Lindenstrauss (JL) random matrix / count-mean sketch data structure (Kane and Nelson, 2014). The sketching operation is a linear map, denoted as S : R^d → R^{PC}, where P, C ∈ N are parameters. The corresponding unsketching operation is denoted as U : R^{PC} → R^d. To explain the sketching operation, we begin by introducing the count-sketch data structure. A count sketch is a linear map, which for j ∈ [P] is denoted as S_j : R^d → R^C. It is described using two hash functions, a bucketing hash h_j : [d] → [C] and a sign hash s_j : [d] → {−1, 1}, mapping the bucket containing the q-th coordinate z_q to Σ_{i=1}^d s_j(i) 1(h_j(i) = h_j(q)) z_i. The count-mean sketch stacks P count sketches to get S : R^d → R^{PC}, mapping z as S(z) = (1/√P)(S_1 z, S_2 z, …, S_P z). The above, being a JL matrix, approximately preserves norms of z, i.e., ‖S(z)‖ ≈ ‖z‖, which is useful in controlling sensitivity, thus enabling application of DP techniques. The unsketching operation is simply U(S(z)) = Sᵀ S(z). This gives an unbiased estimate, E[Sᵀ S(z)] = z, whose variance scales as d‖z‖²/(PC). This captures the trade-off between compression rate, d/(PC), and error.

3. Instance-Optimal Federated Mean Estimation (FME)

A central operation in standard federated learning (FL) algorithms is averaging the client model updates in a distributed manner. This can be posed as a standard distributed mean estimation (DME) problem with n clients, each with a vector z_i ∈ R^d sampled i.i.d. from an unknown distribution D with population mean µ.
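To make the construction concrete, here is a minimal NumPy sketch of the count-mean sketch and its unsketching map, under the definitions above. The 1/√P scaling (so that unsketching is unbiased) and the helper name `make_count_mean_sketch` are our reading of the construction for illustration, not the authors' code:

```python
import numpy as np

def make_count_mean_sketch(d, P, C, seed=0):
    """Sample P independent count sketches: bucket hashes h_j and sign hashes s_j."""
    rng = np.random.default_rng(seed)
    h = rng.integers(0, C, size=(P, d))       # h_j : [d] -> [C]
    s = rng.choice([-1.0, 1.0], size=(P, d))  # s_j : [d] -> {-1, +1}

    def sketch(z):
        # S(z) in R^{P x C}: bucket b of sketch j accumulates s_j(i) z_i over
        # all i with h_j(i) = b; the 1/sqrt(P) scaling makes U(S(z)) unbiased.
        out = np.zeros((P, C))
        for j in range(P):
            np.add.at(out[j], h[j], s[j] * z)
        return out / np.sqrt(P)

    def unsketch(y):
        # U(y) = S^T y: coordinate i reads back its (signed) bucket value,
        # averaged over the P independent sketches.
        est = np.zeros(d)
        for j in range(P):
            est += s[j] * y[j, h[j]]
        return est / np.sqrt(P)

    return sketch, unsketch
```

Because the map is linear, sketches can be summed by SecAgg before unsketching; averaging over independent sketches drives the variance down as d‖z‖²/(PC).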
The goal of the server is to estimate µ while communicating only a small number of bits with the clients at each round. Once we have a communication-efficient scheme, it can be readily integrated into the learning algorithm of choice, such as FedAvg. In order to provide privacy and security of the clients' data, mean estimation for FL has additional requirements: we can only access the clients via SecAgg (Bonawitz et al., 2016) and we need to satisfy DP (Dwork et al., 2014). We refer to this problem as Federated Mean Estimation (FME). To bound the sensitivity of the empirical mean, µ̂({z_i}_{i=1}^n) := (1/n) Σ_{i=1}^n z_i, the data is assumed to be bounded: ‖z_i‖ ≤ G. Since gradient norm clipping is almost always used in the DP FL setting, we assume G is known. The first result characterizing the communication cost of FME is by Chen et al. (2022a), who propose an unbiased estimator, based on count-mean sketching, that satisfies (ϵ, δ)-differential privacy and (order) optimal error of

E[‖µ̂ − µ‖²] = Θ(G²(d/(n²ϵ²) + 1/n))    (1)

with an (order) optimal communication complexity of O(n²ϵ²). The error rate in (1) is optimal as it matches the information-theoretic lower bound of Steinke and Ullman (2015), which holds even without any communication constraints. The communication cost of O(n²ϵ²) cannot be improved for worst-case data, as it matches the lower bound in Chen et al. (2022a). However, it might be possible to improve on communication for specific instances of µ. We thus ask the fundamental question of how much communication is necessary and sufficient to achieve the optimal error rate as a function of the hardness of the problem.

3.1.
Adapting to the Instance's Norm

Our more fine-grained error analysis of the scheme in Chen et al. (2022a) shows that, with a sketch size of O(PC), which costs O(PC) in per-client communication, one can achieve

E[‖µ̂ − µ‖²] = O(G²(d(M² + 1/n)/(PC) + d/(n²ϵ²) + 1/n)),    (2)

where we define the normalized norm of the mean M = ‖µ‖/G ∈ [0, 1], and G is the maximum ‖z‖ as chosen for a sensitivity bound under DP FL. The first error term captures how the sketching error scales with the norm of the mean, MG. When M is significantly smaller than one, as motivated by our application to FL in Sec. 1, a significantly smaller choice of the communication cost, PC = O(min(n²ϵ², nd)(M² + 1/n)), is sufficient to achieve the optimal error rate of Eq. (1). The dominant term is smaller than the standard sketch size of PC = O(n²ϵ²) by a factor of M² ∈ [0, 1]. However, selecting this sketch size requires knowledge of M. This necessitates a scheme that adapts to the current instance by privately estimating M with a small communication cost. This leads to the design of our proposed interactive Alg. 1. Precisely, we call an FME algorithm instance-optimal with respect to M if it achieves the optimal error rate of Eq. (1) with communication complexity O(M²n²ϵ²) for every instance whose norm of the mean is bounded by GM. We propose a novel adaptive compression scheme in Section 3.1.1 and show instance optimality in Section 3.1.2.

Algorithm 1 AdaptNorm FME
Require: Sketch sizes C₁, P, C̃, P̃; noise scales σ, σ̃; margin γ; client data {z_c}_c
1: for j = 1 to 2 do
2:   Select n random clients {z_c^(j)}_{c=1}^n and broadcast sketching operators S_j and S̃ of sizes (C_j, P) and (C̃, P̃), respectively.
3:   SecAgg: ν_j = SecAgg({Q_j^(c)}_{c=1}^n) + N(0, σ²), ν̃_j = SecAgg({Q̃_j^(c)}_{c=1}^n), where Q_j^(c) ← clip_B(S_j(z_c^(j))), Q̃_j^(c) ← clip_B(S̃(z_c^(j)))
4:   Unsketch DP mean: compute µ̂_j = U_j(ν_j)
5:   Estimate DP norm: n̂_j = clip_B(‖ν̃_j‖) + Laplace(σ̃)
6:   C_{j+1} = max(⌈min(n²ϵ²/log(1/δ), nd) (n̂_j + γ)²/B²⌉, 1)
7: end for

3.1.1.
INSTANCE-OPTIMAL FME FOR NORM M

We present a two-round procedure that achieves the optimal error in Eq. (1) with instance-optimal communication complexity of O(min(d, M²n²ϵ²/log(1/δ), M²nd)), without prior knowledge of M. The main idea is to use the count-mean sketch and, in the first round, to construct a private yet accurate estimate of M. This is enabled by the fact that the count-mean sketch approximately preserves norms. In the second round, we set the sketch size based on this estimate. Such interactivity is (i) provably necessary for instance-optimal FME, as we show in Thm. 3.3, and (ii) needed in other problems that target instance-optimal bounds with DP (Berrett and Butucea, 2020). The client protocol (line 3 in Alg. 1) is similar to that in Chen et al. (2022a); each client computes a sketch, clips it, and sends it to the server. However, it crucially differs in that each client sends two sketches. The second sketch is used for estimating the statistic that ensures we can achieve instance-optimal compression-utility tradeoffs, as outlined in (i) and (ii) above. We remark that even though only one sketch is needed for the AdaptNorm approach, since the statistic can be directly estimated from the (first) sketch S without a second sketch, allowing for an additional minor communication optimization, we standardize our presentation on this two-sketch approach to keep it in line with the AdaptTail approach presented in Sec. 3.2, which requires both sketches. The server protocol in Alg. 1 aggregates the sketches using SecAgg (line 3). In the first round, the Q_j's are not used, i.e., C₁ = 0. Only the Q̃_j's are used to construct a private estimate of the norm of the mean, n̂₁ (line 5). We do not need to unsketch the Q̃_j's, as the norm is preserved under our sketching operation.
We use this estimate to set the next round's sketch size, C_{j+1} (line 6), so as to ensure a compression error of the same order as the privacy and statistical error; this is the same choice we made when we assumed oracle knowledge of M (see the discussion after Eq. (2)). In the second round and onwards, we use the first sketch, which is aggregated using SecAgg and privatized by adding appropriately scaled Gaussian noise (line 3). Finally, the Q_j's are unsketched to estimate the mean. Note that the clients need not send Q̃_j's after the first round.

3.1.2. THEORETICAL ANALYSIS

We show that the AdaptNorm approach (Alg. 1) achieves instance-optimality with respect to M; the optimal error rate of Eq. (1) is achieved for every instance of the problem with an efficient communication complexity of O(M²n²ϵ²) (Theorem 3.1). The optimality follows from a matching lower bound in Corollary 3.2. We further establish that interactivity, a key feature of Algorithm 1, is critical to achieving instance-optimality. This follows from a lower bound in Theorem 3.3 that proves a fundamental gap between interactive and non-interactive algorithms.

Theorem 3.1. For any choice of failure probability 0 < β < min(log(1/δ)/(n²ϵ²), 1), Alg. 1 with B = 2G, C₁ = 0, P = P̃ = Θ(log(4/β)), C̃ = 2, σ² = 256B² log(1/δ)/(nϵ²), and γ = 2B log(8/β) log(16/β)/n satisfies (ϵ, δ)-DP, the output µ̂₂ is an unbiased estimate of the mean, and its error is bounded as

E[‖µ̂₂ − µ‖²] ≤ O(G²(d log(1/δ)/(n²ϵ²) + 1/n)).

Finally, the number of rounds is two and, with probability at least 1 − β, the total per-client communication complexity is

O(min(n²ϵ²/log(1/δ), nd)(M² + 1/(n²ϵ²) + 1/n)).

We provide a proof in App. C. The error rate matches Eq. (1) and cannot be improved in general. Compared to the target communication complexity of O((M² + 1/n)n²ϵ²), the above communication complexity has an additional term 1/(nϵ)², which stems from the fact that the norm can only be accessed privately.
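As an illustration of how line 6 turns the private norm estimate into the round-two sketch size, here is a small Python helper. The rule, the floor of 1, and the argument names are our reading of Alg. 1 as reconstructed above, not the authors' code:

```python
import numpy as np

def next_sketch_width(norm_est, gamma, n, eps, delta, d, B):
    """Round-two sketch width from a noisy norm estimate (cf. Alg. 1, line 6).

    The width is chosen so the sketching error, which scales as d*M^2/(P*C),
    does not overwhelm the DP error; (norm_est + gamma)/B serves as an
    upper-confidence bound on the normalized norm M.
    """
    m_sq = ((norm_est + gamma) / B) ** 2
    budget = min(n**2 * eps**2 / np.log(1.0 / delta), n * d)
    return max(int(np.ceil(budget * m_sq)), 1)
```

A small estimated norm (early in training, or near convergence) yields a small sketch, while the min(n²ϵ²/log(1/δ), nd) budget caps the width at the non-adaptive worst case.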
In the interesting regime where the error is strictly less than the trivial M²G², the extra 1/(nϵ)² + 1/n is smaller than M². The resulting communication complexity is O(M²n²ϵ²/log(1/δ)). This nearly matches the oracle communication complexity under knowledge of M. In the following, we make this precise: Alg. 1 is instance-optimal with respect to M. The next theorem shows that even with knowledge of M, no unbiased procedure under a SecAgg constraint with optimal error can have smaller communication complexity than the above procedure. We provide a proof in App. E.

Corollary 3.2 ((Chen et al., 2022a), Theorem 5.3). Let K, d, n ∈ N, M, G, ϵ, δ ≥ 0, and P₁(d, G, M) := {D over R^d : ‖z‖ ≤ G for z ∼ D and ‖µ(D)‖ ≤ MG}. For any K, any K-round unbiased SecAgg-compatible protocol A (see App. E for details) such that E_{D∼Dⁿ}[‖A(D) − µ‖²] = O(min(M²G², G²/n + G²d log(1/δ)/(n²ϵ²))) for all D ∈ P₁(d, G, M), there exists a distribution D ∈ P₁(d, G, M) such that, on dataset D ∼ Dⁿ, w.p. 1, the total per-client communication complexity is Ω(min(d, M²n²ϵ²/log(1/δ), M²nd)) bits.

Interaction is necessary. A key feature of our algorithm is interactivity: the norm of the mean estimated in the prior round is used to determine the sketch size in the next round. We show that at least two rounds of communication are necessary for any algorithm with instance-optimal communication complexity. This proves a fundamental gap between interactive and non-interactive approaches to solving FME. We provide a proof in App. E.

Theorem 3.3. Let K, d, n ∈ N, n ≥ 2, G, ϵ, δ > 0, and P₂(d, G) := {D over R^d : ‖z‖ ≤ G for z ∼ D}. Let A be a K-round unbiased SecAgg-compatible protocol with E_{D∼Dⁿ}[‖A(D) − µ(D)‖²] = O(min(‖µ(D)‖², G²/n + G²d log(1/δ)/(n²ϵ²))) and total per-client communication complexity of O(min(d, M(D)²n²ϵ²/log(1/δ), M(D)²nd)) bits, where M(D) = ‖µ(D)‖/G, with probability > 1 − 1/n, point-wise over D ∈ P₂(d, G). Then K ≥ 2.

3.2. Adapting to the Instance's Tail-norm

The key idea in norm-adaptive compression is interactivity.
On top of interactivity, another key idea in tail-norm-adaptive compression is count-median-of-means sketching.

Count-median-of-means sketching. Our new sketching technique takes R ∈ N independent count-mean sketches of Chen et al. (2022a) (see Sec. 2). Let S^(i) : R^d → R^{PC} denote the i-th count-mean sketch. Our sketching operation, S : R^d → R^{RPC}, is defined as the concatenation S(z) = Sz = ((S^(1)z)ᵀ, (S^(2)z)ᵀ, …, (S^(R)z)ᵀ)ᵀ. Our median-of-means unsketching takes the median over the R unsketched estimates: U(S(z))_j = Median({(S^(i)ᵀ S^(i) z)_j}_{i=1}^R). This median trick boosts the confidence to get a high-probability guarantee of success (as opposed to a guarantee in expectation) and to get tail-norm (as opposed to norm) based bounds (Charikar et al., 2002). However, in non-private settings, it suffices to take multiple copies of a count sketch, which in our notation corresponds to setting P = 1. On the other hand, we set P > 1 to get bounded sensitivity, which is useful for DP, yielding our (novel) count-median-of-means sketching technique. We remark that in some places we augment the unsketching operation with Top_k, which returns the top-k coordinates of a vector in magnitude.

Algorithm 2 AdaptTail FME
Require: Sketch sizes R, C₁, P, R̃, C̃, P̃; noise scales σ, σ̃; {k_j}_j, {γ_j}_j; rounds K; client data {z_c}_c
1: for j = 1 to K do
2:   Select n random clients {z_c^(j)}_{c=1}^n and broadcast sketching operators S_j and S̃_j of sizes (R, C_j, P) and (R̃, C̃, P̃), respectively.
3:   SecAgg: ν_j = SecAgg({Q_j^(c)}_{c=1}^n) + N(0, σ²), ν̃_j = SecAgg({Q̃_j^(c)}_{c=1}^n), where Q_j^(c) ← [clip_B(S_j^(i)(z_c^(j)))]_{i=1}^R, Q̃_j^(c) ← [clip_B(S̃_j^(i)(z_c^(j)))]_{i=1}^R̃
4:   Unsketch DP mean: compute µ̂_j = Top_{k_j}(U_j(ν_j))
5:   Estimate error: e_j = ‖S̃_j(µ̂_j) − ν̃_j‖
6:   γ̂_j = γ_j + Laplace(σ̃) and ê_j = e_j + Laplace(2σ̃)
7:   If ê_j ≤ γ̂_j, set µ̂ = µ̂_j and break
8:   C_{j+1} = 2C_j
9: end for

Approximately-sparse setting. We expect improved guarantees when µ is approximately sparse. This is captured by the tail-norm.
Given z ∈ R^d with ‖z‖ ≤ G and k ∈ [d], the normalized tail-norm is

‖z‖²_{n-tail(k)} := (1/G²) min_{z′ : ‖z′‖₀ ≤ k} ‖z − z′‖².

This measures the error of the best k-sparse approximation. We show that the mean estimate via the count-median-of-means sketch is still unbiased (Lemma C.7) and, for C = Ω(Pk), has error bounded as

‖µ̂ − µ‖² = O(G²(d(‖µ‖²_{n-tail(k)} + 1/n)/(PC) + d/(n²ϵ²) + 1/n)).    (3)

If all tail-norms are known, then we can set the sketch size PC = min_k max(k, min(nd, n²ϵ²)(‖µ‖²_{n-tail(k)} + 1/n)) and achieve the error rate in Eq. (1). When k = 0, this recovers the previous communication complexity of the count-mean sketch in Sec. 3.1. For the optimal choice of k, this communication complexity can be smaller. We aim to achieve it adaptively, without the (unreasonable) assumption of knowledge of all the tail-norms.

3.2.1. INSTANCE-OPTIMAL FME FOR TAIL-NORM

Previously, we estimated the norm of the mean in one round. Now, we need multiple tail-norms. We propose a doubling-trick-based scheme. Starting from an optimistic guess of the sketch size, we progressively double it until an appropriate stopping criterion on the error is met. The main challenge is in estimating the error of the first sketch S (for the current choice of sketch size), as naively this would require the uncompressed true vector z to be sent to the server. To this end, we show how to use the second sketch S̃ to obtain a reasonable estimate of this error while using significantly less communication than transmitting the original z. We re-sketch the unsketched estimate, U(S(z)), of z with S̃. The reason is that this can now be compared with the second sketch S̃(z) to get an estimate of the error: ‖S̃(U(S(z))) − S̃(z)‖² = ‖S̃(U(S(z)) − z)‖² ≈ ‖U(S(z)) − z‖², where we used the linearity and norm-preservation properties of the count sketch. Alg. 2 presents the pseudo-code for AdaptTail. The client protocol remains the same; each participating client sends two independent (count-median-of-means) sketches to the server.
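The error-tracking step above can be simulated directly: sketch a vector, unsketch it, and compare the re-sketched estimate against a second, smaller sketch. For brevity the snippet below uses plain count-mean sketches in both roles rather than the full count-median-of-means construction; the dimensions, seeds, and `sketch_pair` helper are ours for illustration:

```python
import numpy as np

def sketch_pair(d, P, C, seed):
    """A count-mean sketch S and its unsketch U (compact variant for this demo)."""
    rng = np.random.default_rng(seed)
    h = rng.integers(0, C, size=(P, d))
    s = rng.choice([-1.0, 1.0], size=(P, d))
    def S(z):
        out = np.zeros((P, C))
        for j in range(P):
            np.add.at(out[j], h[j], s[j] * z)
        return out / np.sqrt(P)
    def U(y):
        return np.stack([s[j] * y[j, h[j]] for j in range(P)]).sum(0) / np.sqrt(P)
    return S, U

d = 200
z = np.random.default_rng(0).normal(size=d)
S, U = sketch_pair(d, P=4, C=64, seed=1)   # first (compression) sketch
S2, _ = sketch_pair(d, P=8, C=32, seed=2)  # second, smaller sketch for error tracking

z_hat = U(S(z))
true_err = np.linalg.norm(z_hat - z)
# By linearity, S2(z_hat) - S2(z) = S2(z_hat - z); since S2 approximately
# preserves norms, its norm tracks the true unsketching error.
est_err = np.linalg.norm(S2(z_hat) - S2(z))
```

In the protocol, the server only ever sees the aggregated second sketch S̃(z), so this comparison never requires transmitting the uncompressed z.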
The server, starting from an optimistic choice of initial sketch size PC₁, obtains the aggregated sketches from the clients via SecAgg and adds noise to the first sketch for DP (line 4). It then unsketches the first sketch to get an estimate of the mean, µ̂_j (line 4); we note that Top_{k_j} is not applied (i.e., k_j = d) for the upcoming result, but will be useful later. We then sketch µ̂_j with the second sketch S̃ and compute the norm of the difference with the aggregated second sketch from the clients (line 5); this gives an estimate of the error of µ̂_j. Finally, we want to stop if the error is sufficiently small, as dictated by the threshold γ_j, which will be set to the target error in Eq. (1). To preserve privacy at this step, we use the well-known AboveThreshold algorithm (Dwork et al., 2014), which adds noise to both the error and the threshold (line 6) and stops if the noisy error is smaller than the noisy threshold (line 7). If this criterion is not met, then we double the first sketch size (line 8) and repeat.

3.2.2. THEORETICAL ANALYSIS

Given a non-decreasing function g : [d] → R, we define the generalized sparsity as k_tail(g(k); µ) := min_{k ∈ [d]} {k : ‖µ‖²_{n-tail(k)} ≤ g(k)}. In the special case when g(k) ≡ 0, this recovers the sparsity of µ. We now present our result for the proposed AdaptTail approach.

Theorem 3.4. For any choice of failure probability 0 < β < 1, Alg. 2 with B = 2G, k_j = d, K = ⌈log(d)⌉, σ̃ = 4B/(nϵ), σ² = 256R(log(d))²B² log(1/δ)/ϵ², R = 2⌈log(8d⌈log(d)⌉/β)⌉, C₁ = 8P, P = Θ(2 log(8R⌈log(d)⌉/β)), C̃ = 2P̃, R̃ = 1, P̃ = Θ(2 log(4d⌈log(d)⌉/β)), γ_j = γ = B√(log(8⌈log(d)⌉/β)/n) + α, and α = 32B(log(⌈log(d)⌉) + log(8/β))/(nϵ), satisfies (ϵ, δ)-DP and outputs an unbiased estimate of the mean. With probability at least
We provide the proof of Thm. 3.4 in App. C. First, we argue that the communication complexity is of the same order as of this method with prior knowledge of all tail norms plugging in g(k) in the definition of ktail(g(k); µ) gives us that ktail(g(k); µ) is the smallest k such that k min nd, n2ϵ2 µn-tail(k) 2 + 1 n . This is what we obtained before, which further is no larger than the result on adapting to norm of mean (Thm. 3.1) see discussion after Eq. (3). However, this algorithm requires log (d) rounds of interaction, as opposed to two in Thm. 3.1. It therefore depends on the use-case which of the two, total communication or number of rounds, is more important. The error rate in Thm. 3.4 matches that in Eq. (1), known to be optimal in the worst case. However, it may be possible to do better for specific values of tail-norms. Under the setting where the algorithm designer is given k < d and γ and is promised that µn-tail(k) γ, we give a lower bound of Ω G2 min 1, γ2 + k n2ϵ2 + 1 n, d n2ϵ2 + 1 n on error of (central) DP mean estimation (see Thm. B.2). The second term here is new, and we show that a simple tweak to our procedure underlying Thm. 3.4 adding a Topk operation, with exponentially increasing k, to the unsketched estimate suffices to achieve this error (see Thm. B.3). The resulting procedure is biased and has a communication complexity of O(ktail(γ2; µ)), which we show to the optimal under Sec Agg constraint among all, potentially biased, multiround procedures (see Thm. B.4). Due to space constraints, we defer formal descriptions of these results to App. B.2. 4. Federated Optimization/Learning In the previous section, we showed how to adapt the compression rate to the problem instance for FME. In this section, we will use our proposed FME procedure for autotuning the compression rate in Federated Optimization (FO). 
We use the ubiquitous FedAvg algorithm (McMahan et al., 2017a), an iterative procedure that computes an average of the client updates at every round. It is this averaging step that we replace with our corresponding FME procedures (Alg. 1 or Alg. 2). We remind the reader that, when moving from FME to FO, the client data $z_c$ is now the (difference of) model(s) updated via local training at the client. However, recall that our proposed procedures for FME require multiple rounds of communication.

Algorithm 3 AdaptNorm FL
Require: Sketch sizes $L_1 = RPC_1$ and $\tilde L = \tilde R \tilde P \tilde C$, noise multiplier $\sigma$, model dimension $d$, adaptation method adapt, a constant $c_0$, clipping threshold $B$, rounds $K$.
1: for $j = 1$ to $K$ do
2:   Sample $n$ clients; broadcast current model and sketching operators $S_j, \tilde S_j$ of sizes $L_j = RPC_j$ and $\tilde L$.
3:   SecAgg: $\nu_j = \mathrm{SecAgg}(\{Q_j^{(c)}\}_{c=1}^n) + \mathcal N\big(0, \frac{\sigma^2 B^2}{0.9n} I_{PC_j}\big)$, $\tilde\nu_j = \mathrm{SecAgg}(\{\tilde Q_j^{(c)}\}_{c=1}^n)$, where $Q_j^{(c)} = [\mathrm{clip}_B(S_j^{(i)}(z_c^{(j)}))]_{i=1}^{R}$, $\tilde Q_j^{(c)} = [\mathrm{clip}_B(\tilde S_j^{(i)}(z_c^{(j)}))]_{i=1}^{\tilde R}$
4:   Unsketch DP mean: $\mu_j = U_j(\nu_j)$
5:   Second sketch: $n_j = \|\tilde\nu_j\|$
6:   $\gamma^2 = 20\tilde\sigma^2$, $\tilde\sigma = \sigma/\sqrt{0.1}$
7:   $\hat n_j = \mathrm{clip}_B(n_j) + \mathcal N(0, \tilde\sigma^2 B^2)$
8:   $C_{j+1} = c_0 (\hat n_j + \gamma)^2 \big/ (B^2 P \sigma^2)$
9: end for

While multiple rounds can be accommodated in FO by freezing the model for the interactive FME rounds, this is undesirable in a real-world application of FL, as it would increase the total wall-clock time of the training process. Thus, we propose an additional heuristic for our AdaptNorm and AdaptTail algorithms when used in FO/FL: use a stale estimate of the privately estimated statistic from the second sketch. We discuss this for the AdaptNorm procedure and defer details and challenges surrounding AdaptTail to App. H.

Two Stage method (Alg. 4). First, we describe a simple approach based on the observation from the Genie that a single fixed compression rate works well. We assume that the norm of the updates remains relatively stable throughout training. To estimate it, we run $W$ warm-up rounds as the first stage.
Then, using this estimate, we compute a fixed compression rate by balancing the errors incurred due to compression and privacy, akin to AdaptNorm in FME, which we then use for the second stage. Because the first stage is run without compression, it is important that we minimize $W$, which may be possible through prior knowledge of the statistic, e.g., from proxy data distributions or other hyperparameter tuning runs.

AdaptNorm (Alg. 3). Our main algorithm uses two sketches at every round: one to estimate the mean for FL and the other to compute an estimate of its norm, which is used to set the (first) sketch size for the next round. This is akin to the corresponding FME Alg. 1, with the exception that it uses a stale estimate of the norm, from the prior round, to set the sketch size in the current round; in our experiments, we find that this heuristic still provides accurate estimates of the norm at the current round. Further, we split the privacy budget between the mean and norm estimation parts heuristically in the ratio $9:1$, and set sketch size parameters $\tilde R = 1$, $R = P = \tilde P = \lceil\log d\rceil$. Finally, the constant $c_0$ is set such that the total error in FME, at every round, is at most $1.1$ times the DP error.

Algorithm 4 Two Stage FL
Require: Sketch sizes $L_1 = RPC_1$ and $\tilde L = \tilde R \tilde P \tilde C$, noise multiplier $\sigma$, model dimension $d$, adaptation method adapt, a constant $c_0$, clipping threshold $B$, rounds $K$.
1: for $j = 1$ to $W$ do
2:   SecAgg: $\nu_j = \mathrm{SecAgg}(\{Q_j^{(c)}\}_{c=1}^n) + \mathcal N\big(0, \frac{\sigma^2 B^2}{0.9n} I_{PC}\big)$, where $Q_j^{(c)} = [\mathrm{clip}_B(S_j^{(i)}(z_c^{(j)}))]_{i=1}^{R}$
3:   $n_j = \|\mathrm{SecAgg}(\{\tilde Q_j^{(c)}\}_{c=1}^n)\|$
4:   Unsketch DP mean: $\mu_j = U_j(\nu_j)$
5:   $\hat n_j = \mathrm{clip}_B(n_j) + \mathcal N(0, \sigma^2 B^2/0.1)$
6: end for
7: $\hat n = \frac{1}{W}\sum_{j=1}^{W} \hat n_j$, $C = c_0 (\hat n + \gamma)^2 \big/ (B^2 P \sigma^2)$
8: for $j = W+1$ to $K$ do
9:   Select $n$ random clients and broadcast the current model and sketching operator $S_j$ of size $L = RPC$.
10:  Receive $\{Q_j^{(c)}\}_{c=1}^n$ from the clients.
11:  $\nu_j = \mathrm{SecAgg}(\{Q_j^{(c)}\}_{c=1}^n) + \mathcal N\big(0, \frac{\sigma^2 B^2}{0.9n} I_{PC}\big)$
12:  Unsketch DP mean: $\mu_j = U_j(\nu_j)$
13: end for
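The second-stage computation above can be sketched as follows: average the $W$ noisy warm-up norm estimates, then choose a fixed sketch width so that the (estimated) compression error is at most a $c_0$-fraction of the DP error. The proportionality below is an illustrative stand-in, not the exact constant in Alg. 4:

```python
import math

def two_stage_sketch_size(norm_estimates, c0, P, sigma, gamma=0.0):
    """Stage 1: average the W noisy norm estimates from warm-up rounds.
    Stage 2: a fixed sketch width proportional to (n_hat + gamma)^2,
    scaled so compression error is roughly a c0-fraction of DP error.
    The exact constant in the paper's rule differs; this is a sketch."""
    n_hat = sum(norm_estimates) / len(norm_estimates)
    return max(1, math.ceil((n_hat + gamma) ** 2 / (c0 * P * sigma ** 2)))
```

For instance, three warm-up estimates of 2.0 with $c_0 = 0.1$, $P = 10$, and $\sigma = 1$ yield a sketch width of 4; a larger $c_0$ (tolerating more compression error) shrinks the sketch.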
We also note some minor changes between Alg. 1 (FME) and its adaptation in Alg. 3 (FO), made only for simplification, explained below. We replace the Laplace noise added to the norm with Gaussian noise for ease of privacy accounting in practice. Further, the expression for the sketch size (line 8) may look different; however, for all practically relevant regimes of parameters, it is the same as line 7 in Alg. 1.

5. Empirical Analysis on Federated Learning

In this section, we present an experimental evaluation of the methods proposed in Sec. 4 for federated optimization on standard FL benchmarks. We define the average compression rate of a $T$-round procedure as the relative decrease in the average bits communicated, i.e., $\frac{dT}{\sum_{t=1}^{T}(L_t + \tilde L_t)}$, where $d$ is the model dimensionality and $L_t$ and $\tilde L_t$ are the sizes of the first and second sketches at round $t$. This is equivalently the harmonic average of the per-round compression rates.

Setup. We focus on three common FL benchmarks: Federated EMNIST (F-EMNIST) and Shakespeare represent two relatively easier tasks, and Stack Overflow Next Word Prediction (SONWP) represents a relatively harder task. F-EMNIST is an image classification task, whereas Shakespeare and SONWP are language modeling tasks. We follow the exact same setup (model architectures, hyperparameters) as Chen et al. (2022a) except where noted below. A full description can be found in App. F. Unlike Chen et al. (2022a), we use fixed clipping (instead of adaptive clipping) and tune the server's learning rate for each noise multiplier $z$. As noted by Choquette-Choo et al. (2022), we also zero out updates with high $\ell_1$ norm (above 100) for improved utility and use their hyperparameters for SONWP.

Defining Feasible Algorithms. Note that introducing compression into a DP FL algorithm introduces more noise into its optimization procedure. Thus, we may expect that our algorithms will perform worse than the baseline.
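The average compression rate defined in Sec. 5 is the harmonic mean of the per-round rates; a small helper (names are ours) computes it from the per-round sketch sizes:

```python
def average_compression_rate(d, sizes):
    """sizes: list of (L_t, Ltilde_t), the first- and second-sketch sizes
    sent at round t.  Returns d*T / sum_t (L_t + Ltilde_t), i.e. the
    harmonic mean of the per-round rates d / (L_t + Ltilde_t)."""
    T = len(sizes)
    total = sum(l + lt for l, lt in sizes)
    return d * T / total
```

For example, with $d = 100$ and per-round totals of 50 and 25 coordinates, the per-round rates are 2 and 4, and the average compression rate is their harmonic mean, $8/3$; a few uncompressed rounds therefore drag the average down sharply.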
We define a tolerance to be the maximum allowed relative drop in utility (validation accuracy) compared to the baseline without compression. Then, our set of feasible algorithms is those that achieve at least a $(1 - \text{tolerance})$ fraction of the baseline performance. This lets us study the privacy-utility tradeoff at a fixed utility and closely matches real-world constraints: often a practitioner will desire an approach that does not significantly impact model utility, and the tolerance captures how much degradation they will accept.

Algorithms. Our baseline is the Genie, which is the method of Chen et al. (2022a) run on a grid of exponentially increasing compression rates $2^b$, $b \in [0, 13]$. This requires computation that is logarithmic in the interval size and significantly more communication bandwidth in total; thus, it is unachievable in practice but serves as our best-case upper bound. Our proposed adaptive algorithms are AdaptNorm and the Two Stage method, described in Sec. 4.

Achieving Favorable Compression Without Tuning. Because the optimal Genie compression rates may be unattainable without significantly increased communication and computation, we instead aim to achieve nontrivial compression rates, as close to this genie as possible, without any (significant) additional computation, while also ensuring that the impact on utility remains negligible. Thus, to ensure minimal impact on utility, we run our adaptive protocols so that the error introduced by compression is much smaller than that introduced by DP. We heuristically require that the error from compression be at most 10% of the DP error: this threshold is small enough for the compression error to be dominated by the DP error, yet large enough to yield substantial compression rates. We emphasize that we chose this value heuristically and do not tune it. In extensive experiments across all three datasets (Figure 2), we find that our methods achieve models of essentially the same utility (as seen by the colored text).
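The feasibility rule can be made concrete: given measured accuracies at each compression rate, the best feasible rate is the largest one within the tolerated relative drop. A minimal sketch with hypothetical names:

```python
def best_feasible_rate(results, baseline_acc, tolerance):
    """results: {compression_rate: validation_accuracy}.  A rate is feasible
    if its accuracy is at least (1 - tolerance) * baseline_acc; return the
    largest feasible rate, or None if none qualifies."""
    floor = (1 - tolerance) * baseline_acc
    feasible = [r for r, acc in results.items() if acc >= floor]
    return max(feasible) if feasible else None
```

With a baseline accuracy of 80.0 and a 1% tolerance (floor 79.2), rates whose accuracy falls below 79.2 are discarded and the largest remaining rate is returned; this mirrors how the Genie's grid of $2^b$ rates is reduced to a single reported number per noise multiplier.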
We further find that our methods achieve nontrivial and favorable communication reduction: though still far from the genie, they already recover a significant fraction of the compression rate. This represents a significant gain in computation: to calculate this genie, we ran tens of jobs per noise multiplier; for our proposed methods, we run but one.

Figure 2. Our adaptive approaches achieve favorable compression rates without tuning. Panels (a) F-EMNIST, (b) SONWP, and (c) Shakespeare plot the average compression rate against the noise multiplier $z$ for the Genie, Two Stage, and AdaptNorm methods. The Genie represents the optimal fixed compression rate obtained via significant hyperparameter tuning. Our adaptive approaches are all 1-shot, using a fixed constant of 0.1. Text labels give the final validation accuracy, with the Genie at a 1% tolerance.

Comparing our methods, we find that the AdaptNorm approach performs best. It achieves favorable compression rates consistently across all datasets, with compression rates that adapt well to the specific error introduced by DP at a given noise multiplier. For example, we find on the Shakespeare dataset in Figure 2(c) that its compression-privacy curve follows that of the genie. In contrast, the Two Stage approach typically performs much worse. This is in part by construction: it requires a significant number of warmup rounds (we use $W = 75$) to get an estimate of the gradient norm. Because these warmup rounds use no compression, they drastically lower the (harmonic) average compression rate.
We remark that in scenarios where strong priors exist, e.g., when tuning a model on a similar proxy dataset locally prior to deploying in production, it may be possible to significantly reduce or even eliminate $W$ so as to make this approach more competitive. However, our fully adaptive AdaptNorm approach performs well even without any prior knowledge of such statistics.

Benchmarking computation overhead. We remark that sketching is computationally efficient, requiring computation similar to clipping for encoding and decoding the update. Though our adaptive approaches do introduce minor additional computation (e.g., a second sketch), these do not significantly impact the runtime. In benchmark experiments on F-EMNIST, we found that standard DP-FedAvg takes 3.63 s/round, DP-FedAvg with non-adaptive sketching takes 3.63 s/round, and our AdaptNorm approach takes 3.69 s/round. Because our methods may impact convergence, we also fix the total number of rounds a priori in all experiments; this provides a fair comparison for all methods, where any slowdown in convergence is captured by the utility decrease.

Choosing the relative error from compression ($c_0$). Though we chose $c_0 = 10\%$ heuristically and without tuning, we next run experiments selecting other values of $c_0$ to show that our intuition of keeping the relative error much smaller leads to many viable values of this hyperparameter. We sweep values representing $\{25, 5, 1\}\%$ error and compare the utility-compression tradeoff in App. A. We find that $c_0$ well captures the utility-compression tradeoff, where smaller values consistently lead to higher-utility models with less compression, and vice versa. We find that the range of suitable values is quite large; e.g., even as low as $c_0 = 1\%$, nontrivial compression can be attained.

6. Conclusion and Discussion

We design auto-tuned compression techniques for private federated learning that are compatible with secure aggregation.
We accomplish this by creating provably optimal autotuning procedures for federated mean estimation and then mapping them to the federated optimization (learning) setting. We analyze the proposed mean estimation schemes and show that they achieve order-optimal error rates with order-optimal communication complexity, adapting to the norm of the true mean (Section 3.1) and to the tail-norm of the true mean (Section 3.2). Our results show that we can attain favorable compression rates that recover much of the optimal Genie, in one shot, without any additional computation. Although the $\ell_2$ error in mean estimation is a tractable proxy for auto-tuning the compression rate in federated learning, we found that it may not always correlate well with downstream model accuracy. In particular, in our adaptation of AdaptTail to federated learning (in App. H), we found that the procedure attains very high compression rates for the federated mean estimation problem, with little overhead in $\ell_2$ error relative to the DP error. However, these compression rates are too high to result in useful model accuracy. A natural direction for future work is to design procedures that improve upon this proxy in federated learning settings.

References

Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308-318, 2016.

Naman Agarwal, Ananda Theertha Suresh, Felix Xinnan X Yu, Sanjiv Kumar, and Brendan McMahan. cpSGD: Communication-efficient and differentially-private distributed SGD. Advances in Neural Information Processing Systems, 31, 2018.

Naman Agarwal, Peter Kairouz, and Ziyu Liu. The Skellam mechanism for differentially private federated learning. Advances in Neural Information Processing Systems, 34:5052-5064, 2021.
Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021, 2017.

Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. Advances in Neural Information Processing Systems, 30, 2017.

Raman Arora, Raef Bassily, Cristóbal A Guzmán, Michael Menart, and Enayat Ullah. Differentially private generalized linear models revisited. In Advances in Neural Information Processing Systems, 2022.

Raman Arora, Raef Bassily, Tomás González, Cristóbal Guzmán, Michael Menart, and Enayat Ullah. Faster rates of convergence to stationary points in differentially private optimization. In International Conference on Machine Learning, 2023.

Hilal Asi, Vitaly Feldman, and Kunal Talwar. Optimal algorithms for mean estimation under local differential privacy. arXiv preprint arXiv:2205.02466, 2022.

Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 464-473. IEEE, 2014.

Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. Advances in Neural Information Processing Systems, 32, 2019.

James Henry Bell, Kallista A Bonawitz, Adrià Gascón, Tancrède Lepoint, and Mariana Raykova. Secure single-server aggregation with (poly)logarithmic overhead. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, pages 1253-1269, 2020.

Thomas Berrett and Cristina Butucea. Locally private non-asymptotic testing of discrete distributions is faster using interactive mechanisms. Advances in Neural Information Processing Systems, 33:3164-3173, 2020.

Sourav Biswas, Yihe Dong, Gautam Kamath, and Jonathan Ullman. CoinPress: Practical private mean and covariance estimation.
Advances in Neural Information Processing Systems, 33:14475-14485, 2020.

Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for federated learning on user-held data. arXiv preprint arXiv:1611.04482, 2016.

Jean Bretagnolle and Catherine Huber. Estimation des densités: risque minimax. Séminaire de probabilités de Strasbourg, 12:342-363, 1978.

Zhiqi Bu, Sivakanth Gopi, Janardhan Kulkarni, Yin Tat Lee, Hanwen Shen, and Uthaipon Tantipongpipat. Fast and memory efficient differentially private-SGD via JL projections. Advances in Neural Information Processing Systems, 34:19680-19691, 2021.

Mark Bun and Thomas Steinke. Average-case averages: Private algorithms for smooth sensitivity and mean estimation. Advances in Neural Information Processing Systems, 32, 2019.

Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming, pages 693-703. Springer, 2002.

Tianyi Chen, Georgios Giannakis, Tao Sun, and Wotao Yin. LAG: Lazily aggregated gradient for communication-efficient distributed learning. Advances in Neural Information Processing Systems, 31, 2018.

Wei-Ning Chen, Peter Kairouz, and Ayfer Özgür. Breaking the communication-privacy-accuracy trilemma. Advances in Neural Information Processing Systems, 33:3312-3324, 2020.

Wei-Ning Chen, Christopher A Choquette-Choo, and Peter Kairouz. Communication efficient federated learning with secure aggregation and differential privacy. In NeurIPS 2021 Workshop Privacy in Machine Learning, 2021.

Wei-Ning Chen, Christopher A Choquette-Choo, Peter Kairouz, and Ananda Theertha Suresh. The fundamental price of secure aggregation in differentially private federated learning. arXiv preprint arXiv:2203.03761, 2022a.

Wei-Ning Chen, Ayfer Özgür, and Peter Kairouz.
The Poisson binomial mechanism for unbiased federated learning with secure aggregation. In International Conference on Machine Learning, pages 3490-3506. PMLR, 2022b.

Christopher A Choquette-Choo, H Brendan McMahan, Keith Rush, and Abhradeep Thakurta. Multi-epoch matrix factorization mechanisms for private machine learning. arXiv preprint arXiv:2211.06530, 2022.

Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. In International Conference on Machine Learning, pages 1596-1606. PMLR, 2019.

John Duchi. Lecture notes for Statistics 311/Electrical Engineering 377. URL: https://stanford.edu/class/stats311/Lectures/full notes.pdf. Last visited 2016.

John C Duchi, Michael I Jordan, and Martin J Wainwright. Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association, 113(521):182-201, 2018.

Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N Rothblum. Differential privacy under continual observation. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, pages 715-724, 2010a.

Cynthia Dwork, Moni Naor, Toniann Pitassi, Guy N Rothblum, and Sergey Yekhanin. Pan-private streaming algorithms. In ICS, pages 66-80, 2010b.

Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014.

Vitaly Feldman and Kunal Talwar. Lossless compression of efficient private local randomizers. In International Conference on Machine Learning, pages 3208-3219. PMLR, 2021.

Liam H Fowl, Jonas Geiping, Wojciech Czaja, Micah Goldblum, and Tom Goldstein. Robbing the fed: Directly obtaining private data in federated learning with modified models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=fwzUgo0FM9v.
Antonious M Girgis, Deepesh Data, Suhas Diggavi, Peter Kairouz, and Ananda Theertha Suresh. Shuffled model of federated learning: Privacy, accuracy and communication trade-offs. IEEE Journal on Selected Areas in Information Theory, 2(1):464-478, 2021.

Farzin Haddadpour, Belhal Karimi, Ping Li, and Xiaoyun Li. FedSketch: Communication-efficient and private federated learning via sketching. arXiv preprint arXiv:2008.04975, 2020.

Ali Hatamizadeh, Hongxu Yin, Pavlo Molchanov, Andriy Myronenko, Wenqi Li, Prerna Dogra, Andrew Feng, Mona G Flores, Jan Kautz, Daguang Xu, et al. Do gradient inversion attacks make federated learning unsafe? arXiv preprint arXiv:2202.06924, 2022.

Nikita Ivkin, Daniel Rothchild, Enayat Ullah, Ion Stoica, Raman Arora, et al. Communication-efficient distributed SGD with sketching. Advances in Neural Information Processing Systems, 32, 2019.

Divyansh Jhunjhunwala, Advait Gadhikar, Gauri Joshi, and Yonina C Eldar. Adaptive quantization of model updates for communication-efficient federated learning. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3110-3114. IEEE, 2021.

Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I Jordan. A short note on concentration inequalities for random vectors with subgaussian norm. arXiv preprint arXiv:1902.03736, 2019.

Peter Kairouz, Ziyu Liu, and Thomas Steinke. The distributed discrete Gaussian mechanism for federated learning with secure aggregation. In International Conference on Machine Learning, pages 5201-5212. PMLR, 2021a.

Peter Kairouz, Brendan McMahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, and Zheng Xu. Practical and private (deep) learning without sampling or shuffling. In International Conference on Machine Learning, pages 5213-5225. PMLR, 2021b.

Gautam Kamath and Jonathan Ullman. A primer on private statistics. arXiv preprint arXiv:2005.00010, 2020.

Gautam Kamath, Vikrant Singhal, and Jonathan Ullman.
Private mean estimation of heavy-tailed distributions. In Conference on Learning Theory, pages 2204-2235. PMLR, 2020.

Daniel M Kane and Jelani Nelson. Sparser Johnson-Lindenstrauss transforms. Journal of the ACM (JACM), 61(1):1-23, 2014.

Vishesh Karwa and Salil Vadhan. Finite sample differentially private confidence intervals. arXiv preprint arXiv:1711.03908, 2017.

Shiva Prasad Kasiviswanathan, Homin K Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793-826, 2011.

Xiyang Liu, Weihao Kong, Sham Kakade, and Sewoong Oh. Robust and differentially private mean estimation. Advances in Neural Information Processing Systems, 34:3887-3901, 2021.

Xiyang Liu, Weihao Kong, and Sewoong Oh. Differential privacy and robust statistics in high dimensions. In Conference on Learning Theory, pages 1167-1246. PMLR, 2022.

Maksim Makarenko, Elnur Gasanov, Rustem Islamov, Abdurakhmon Sadiev, and Peter Richtárik. Adaptive compression for communication-efficient distributed training. arXiv preprint arXiv:2211.00188, 2022.

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273-1282. PMLR, 2017a.

Brendan McMahan, Keith Rush, and Abhradeep Guha Thakurta. Private online prefix sums via optimal matrix factorizations. arXiv preprint arXiv:2202.08312, 2022.

H Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963, 2017b.

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.

Ilya Mironov. Rényi differential privacy.
In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263-275. IEEE, 2017.

Rasmus Pagh and Mikkel Thorup. Improved utility analysis of private CountSketch. arXiv preprint arXiv:2205.08397, 2022.

Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.

Eric Price and David P Woodruff. (1+eps)-approximate sparse recovery. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, pages 295-304. IEEE, 2011.

Daniel Rothchild, Ashwinee Panda, Enayat Ullah, Nikita Ivkin, Ion Stoica, Vladimir Braverman, Joseph Gonzalez, and Raman Arora. FetchSGD: Communication-efficient federated learning with sketching. In International Conference on Machine Learning, pages 8253-8265. PMLR, 2020.

Shaohuai Shi, Xiaowen Chu, Ka Chun Cheung, and Simon See. Understanding top-k sparsification in distributed deep learning. arXiv preprint arXiv:1911.08772, 2019.

Thomas Steinke and Jonathan Ullman. Between pure and approximate differential privacy. arXiv preprint arXiv:1501.06095, 2015.

Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. Advances in Neural Information Processing Systems, 31, 2018.

Mohamed Suliman and Douglas Leith. Two models are better than one: Federated learning is not private for Google GBoard next word prediction. arXiv preprint arXiv:2210.16947, 2022.

Ananda Theertha Suresh, Ziteng Sun, Jae Hun Ro, and Felix Yu. Correlated quantization for distributed mean estimation and optimization. arXiv preprint arXiv:2203.04925, 2022.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

Stanley L Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63-69, 1965.
Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. Advances in Neural Information Processing Systems, 30, 2017.

Zheng Xu, Yanxiang Zhang, Galen Andrew, Christopher A. Choquette-Choo, Peter Kairouz, H. Brendan McMahan, Jesse Rosenstock, and Yuanbo Zhang. Federated learning of GBoard language models with differential privacy, 2023.

A. Additional Figures

F-EMNIST (Relative Error, $c_0$; each pair of rows corresponds to one value of $c_0$):

Metric                  z=0.1   z=0.2   z=0.3   z=0.5
Validation Accuracy     81.23   79.70   77.68   74.92
Compression rate, r       339     341    1443    2391
Validation Accuracy     78.87   75.53   77.01   74.97
Compression rate, r       101     212     284     332
Validation Accuracy     81.34   78.53   77.81   75.36
Compression rate, r        47      94     117     134
Validation Accuracy     81.84   79.04   78.43   75.66
Compression rate, r        24      47      59      68
Validation Accuracy     82.51   79.59   78.67   75.56
Compression rate, r         5       9      12      13

Table 1. AdaptNorm is stable with respect to choices in the relative error constant $c_0$. We observe that as this constant is increased so that the compression error is larger with respect to the privacy error, the model's utility degrades and the compression rates obtained increase, and vice versa; this aligns with our expectations, as the total error in the FME estimate increases (decreases). However, all choices of the constant below 0.1 lead to viable models with non-trivial compression rates. Results for F-EMNIST, with values displaying the mean across 5 runs. Bolded rows represent configurations used in the rest of our experiments.
Shakespeare (Relative Error, $c_0$; each pair of rows corresponds to one value of $c_0$):

Metric                  z=0.1   z=0.2   z=0.3   z=0.5   z=0.7
Validation Accuracy     55.41   53.91   52.48   50.44   48.68
Compression rate, r       133     210     342     460     511
Validation Accuracy     55.71   53.93   52.26   49.12   46.96
Compression rate, r        24      64     117     235     367
Validation Accuracy     55.77   54.17   52.57   49.87   47.84
Compression rate, r        12      29      49      98     154
Validation Accuracy     55.89   54.29   52.82   50.23   48.12
Compression rate, r         7      17      27      50      75
Validation Accuracy     55.83   54.28   52.85   50.66   48.87
Compression rate, r         2       5       8      13      18

Table 2. AdaptNorm is stable with respect to choices in the relative error constant $c_0$. We observe that as this constant is increased so that the compression error is larger with respect to the privacy error, the model's utility degrades and the compression rates obtained increase, and vice versa; this aligns with our expectations, as the total error in the FME estimate increases (decreases). However, all choices of the constant below 0.1 lead to viable models with non-trivial compression rates. Results for Shakespeare, with values displaying the mean across 5 runs. Bolded rows represent configurations used in the rest of our experiments.

SONWP (Relative Error, $c_0$; each pair of rows corresponds to one value of $c_0$):

Metric                  z=0.1   z=0.3   z=0.5   z=0.7
Validation Accuracy     23.77   22.36   22.07   21.33
Compression rate, r        15      33      45      48
Validation Accuracy     23.97   22.59   22.06   21.15
Compression rate, r         2       4      12      16
Validation Accuracy     24.03   22.70   22.12   21.31
Compression rate, r         1       3       7      10
Validation Accuracy     23.93   22.39   22.14   21.35
Compression rate, r         1       2       4       6
Validation Accuracy     24.10   22.57   22.17   21.38
Compression rate, r         1       1       2       3

Table 3. AdaptNorm is stable with respect to choices in the relative error constant $c_0$.
We observe that as this constant is increased so that the compression error is larger with respect to the privacy error, the model's utility degrades and the compression rates obtained increase, and vice versa; this aligns with our expectations, as the total error in the FME estimate increases (decreases). However, all choices of the constant below 0.1 lead to viable models with non-trivial compression rates. Results for SONWP, with values displaying the mean across 5 runs. Bolded rows represent configurations used in the rest of our experiments.

B. Missing details from Section 3

In this section, we provide additional details on FME omitted from the main text due to space constraints.

B.1. Instance optimal FME with bounded norm of the mean

Optimal error rate with bounded mean norm. We consider the problem of DP mean estimation when the data is at the server, but under the assumption that $\|\mu\| \le M$ for a known $M$. First, we investigate the statistical complexity of the problem, without DP, showing a lower bound on the error of $\Omega(G^2 \min(M^2, 1/n))$; see Thm. D.1 in Appendix D.1. Secondly, in order to understand the complexity under the differential privacy constraint, we study the empirical version of the problem, wherein for a fixed dataset the goal is to estimate its empirical mean. Our main result is the following.

Theorem B.1. For any $d, n \in \mathbb N$ and $G, M > 0$, define the instance class
$\hat{\mathcal P}_1(n, d, G, M) = \big\{ \{z_i\}_{i=1}^n : z_i \in \mathbb R^d,\ \|z_i\| \le G,\ \|\hat\mu(\{z_i\}_{i=1}^n)\| \le MG \big\}$.
For any $\epsilon = O(1)$ and $2^{-\Omega(n)} \le \delta \le \frac{1}{n^{1+\Omega(1)}}$, we have
$\min_{A:\,A \text{ is } (\epsilon,\delta)\text{-DP}}\ \max_{D \in \hat{\mathcal P}_1(n,d,G,M)} \mathbb E\,\|A(D) - \hat\mu(D)\|^2 = \Theta\Big(G^2 \min\Big(M^2,\ \frac{d\log(1/\delta)}{n^2\epsilon^2}\Big)\Big)$.
We convert the lower bound for the empirical problem to the statistical problem via a known re-sampling trick (see (Bassily et al., 2019), Appendix C). The proof of Thm.
B.1 is based on a reduction to (standard) DP mean estimation, wherein the above assumption on the bound on the norm of the mean is absent and for which the optimal error is known (Steinke and Ullman, 2015; Kamath and Ullman, 2020). Our result above shows that this additional restriction of $\|\hat\mu(\{z_i\}_{i=1}^n)\| \le MG$ does not make the problem much easier in terms of achievable error. In the interesting regime when $G > MG\sqrt{d\log(1/\delta)}/(n\epsilon)$, this yields the (same) optimal error with potentially significantly smaller communication.

Instance-specific tightness: We show that for any dataset with mean $\hat\mu$, the mean-squared error of the count-mean sketch (i.e., $R = 1$) with added noise of variance $\sigma^2$ is exactly $\frac{d-1}{PC}\|\hat\mu\|^2 + \frac{d\sigma^2}{n^2}$; see Lemma C.6 for the statement. Therefore, $\|\hat\mu\|$ is the only statistic that controls the MSE of this procedure. Adapting to the norm of the mean thus provides a tight instance-specific communication complexity for this method.

B.2. Instance optimal FME with bounded tail-norm of the mean

Lower bound on error: The error rate in Thm. 3.4 matches the rate in Eq. (1), known to be optimal in the worst case, i.e., for general values of the tail-norms. However, it may be possible to do better in special cases. Towards this, we establish lower bounds for (central) DP mean estimation where the algorithm designer is given $k < d$ and $\gamma$ and is promised that $\|\mu_{n\text{-}\mathrm{tail}(k)}\| \le \gamma$. For this setting, we present a lower bound on the error of $\Omega\big(G^2\min\big\{1,\ \gamma^2 + \frac{k}{n^2\epsilon^2} + \frac{1}{n},\ \frac{d}{n^2\epsilon^2} + \frac{1}{n}\big\}\big)$. The first and third terms are standard, achieved by outputting zero and by our procedure (or simply the Gaussian mechanism), respectively. The second term is new and is absent from the guarantees of our procedure. To prove the lower bound, we first show, in Theorem D.2 in App. D.1, that the statistical error, without DP, is unchanged: $\Omega(\min\{G^2, G^2/n\})$.
Secondly, in order to establish the fundamental limits under differential privacy, we study the empirical version of the problem, where the goal is to estimate the empirical mean, and give the following result.

Theorem B.2. For any $n, d, k \in \mathbb{N}$ and $G, \gamma \ge 0$, define the instance class
$$\widehat{\mathcal{P}}_2(n, d, G, k, \gamma) = \left\{ \{z_i\}_{i=1}^n : z_i \in \mathbb{R}^d,\ \|z_i\| \le G,\ \|\widehat{\mu}(D)\|_{\text{n-tail}(k)}^2 \le \gamma^2 \right\}.$$
For $\epsilon = O(1)$ and $2^{-\Omega(n)} \le \delta \le \frac{1}{n^{1+\Omega(1)}}$, we have
$$\min_{A:\,A \text{ is } (\epsilon,\delta)\text{-DP}}\ \max_{D \in \widehat{\mathcal{P}}_2(n,d,G,k,\gamma)} \mathbb{E}\,\|A(D) - \widehat{\mu}(D)\|^2 = \Omega\!\left(G^2 \min\!\left(1,\ \gamma^2 + \frac{k \log(1/\delta)}{n^2\epsilon^2},\ \frac{d \log(1/\delta)}{n^2\epsilon^2}\right)\right).$$

We convert the lower bound for the empirical problem to the statistical problem via a known re-sampling trick (see (Bassily et al., 2019), Appendix C). We provide the proof of Thm. B.2 in App. D.2.

Achieving the optimal error with small communication: We show that a simple tweak to the procedure underlying Thm. 3.4 (adding a Top-$k$ operation to the unsketched estimate, with exponentially increasing $k$) suffices to achieve the error in Thm. B.2. The procedure is adaptive in $k$, but requires prior knowledge of $\gamma$ and is a biased estimator of the mean. In the following, we abuse notation and write $k_{\text{tail}}(\gamma^2; \mu)$, which corresponds to plugging $g(k) = \gamma^2$ into its definition.

Theorem B.3. For any $\gamma > 0$, Alg. 2 with the same parameter settings as in Thm. 3.4, except with $k_j = 2^j$ and thresholds $\gamma_j$ incorporating the concentration term $\sqrt{\log(8\lceil\log(d)\rceil/\beta)/n}$ and the slack $\alpha$, satisfies $(\epsilon, \delta)$-DP. With probability at least $1 - \beta$, the final output $\widetilde{\mu}$ satisfies
$$\|\widetilde{\mu} - \mu\|^2 = O\!\left(G^2\left(\gamma^2 + \frac{1}{n} + \frac{k_{\text{tail}}(\gamma^2; \mu)\log(1/\delta)}{n^2\epsilon^2}\right)\right),$$
the total per-client communication complexity is $\widetilde{O}(k_{\text{tail}}(\gamma^2; \mu))$, and the number of rounds is $\lceil\log(d)\rceil$. We provide the proof of Thm. B.3 in App. C.

Optimal communication complexity: Next, we investigate the communication complexity of multi-round SecAgg-compatible, potentially biased, schemes with prior knowledge of $k$ and $\gamma$ such that $\|\mu\|_{\text{n-tail}(k)} \le \gamma$. Our main result is the following.

Theorem B.4. Let $d, n, k, K \in \mathbb{N}$ with $d \ge 2k$, and $\epsilon, \delta, G, \alpha, \gamma > 0$. Define
$$\mathcal{P}_2(d, G, \gamma, k) = \left\{ \mathcal{D} \text{ over } \mathbb{R}^d : \|z\| \le G\ \text{for } z \sim \mathcal{D}, \text{ and } \|\mu(\mathcal{D})\|_{\text{n-tail}(k)}^2 \le \gamma^2 \right\}.$$
For any $K$, any protocol $A$ in the class of $K$-round, SecAgg-compatible protocols (see App. E for details) such that its MSE satisfies $\mathbb{E}_{D \sim \mathcal{D}^n}[\|A(D) - \mu\|^2] \le \alpha^2$ for all $\mathcal{D} \in \mathcal{P}_2(d, G, \gamma, k)$, there exists a distribution $\mathcal{D} \in \mathcal{P}_2(d, G, \gamma, k)$ such that on a dataset $D \sim \mathcal{D}^n$, with probability 1, the total per-client communication complexity is $\Omega(k \log(Gd/(k\alpha)))$.

Plugging in the error $\alpha^2$ from Thm. B.3 establishes that the communication complexity of Thm. B.3 is tight up to poly-logarithmic factors.

Instance-specific tightness. We show, in Lemma C.5 in App. C, that for any $P$, $C = \Omega(Pk)$ and $R = \Theta(\log(d))$, for any dataset with mean $\widehat{\mu}$, the error of the count median-of-means sketch underlying Thm. B.3 with added noise of variance $\sigma^2$ is, with high probability, $\Omega(\|\widehat{\mu}_{\text{tail}(k)}\|^2 + \sigma^2 k)$.

Discussion on Thm. 3.4 vs Thm. B.3: While the error of the procedure defined in Thm. 3.4 may be larger than that in Thm. B.3, the former has some attractive properties that make it useful in practice. Specifically, it is unbiased, which is desirable in stochastic optimization, and it does not require knowledge of the new hyper-parameter $\gamma$, which is unclear how to adapt to. Further, the additional Top-$k$ operation, which is the sole difference between the two procedures, does not seem to provide significant benefits in our downstream FL experiments. Consequently, we use the procedure underlying Thm. 3.4 in our experimental settings, and defer a detailed investigation of the method in Thm. B.3 to future work.

C. Proofs of Error Upper Bounds for FME

C.1. Useful Lemmas

In this section, we collect and state lemmas that will be used in the proofs of the main results.

Lemma C.1. (Kane and Nelson, 2014) For any $i \in [R]$, a Count Sketch matrix $S^{(i)} \in \mathbb{R}^{PC \times d}$ with $P = \Omega(\tau^{-2}\log(1/\beta))$ satisfies that, for any $z \in \mathbb{R}^d$, with probability at least $1 - \beta$,
$$(1 - \tau)\|z\|^2 \le \|S^{(i)}z\|^2 \le (1 + \tau)\|z\|^2.$$

The result below shows that the count-median-of-means sketch preserves norms and heavy hitters.

Lemma C.2. Let $0 < \alpha, \tau, \beta < 1$ and $k, d \in \mathbb{N}$ with $k \le d$.
For P = Θ τ 1 log (2R/β) , C = max 8Pk, k 8P α, 1 τ and R = 2 log (2d/β) , the sketch S satisfies that for any z Rd, with probability at least 1 β, 1. JL property: (1 τ) z 2 S(i)z 2 (1 + τ) z 2 i [R] Private Federated Learning with Autotuned Compression 2. ℓ guarantee: Let z = U(S(z)), then, ( zq zq)2 α ztail(k) 2 3. Sparse recovery: (a) Let z = Topk( z) z z 2 2 1 + 7 α) ztail(k) 2 (b) Let bz = Top2k( z) bz z 2 2 (1 + 10α) ztail(k) 2 where ztail(k) = min z: z 0 k z z 2. Proof. P = 1 corresponds to the standard Countsketch based method. For P > 1, we modify the analysis as follows. The first part directly follows from the result of (Kane and Nelson, 2014), stated as Lemma C.1, which gave a construction of a sparse Johnson-Lindenstrauss transform based on Count Sketch. We now proceed to the second part. Let Hk [d] denote the indices of k largest co-ordinates of z. We first consider the estimate of zq based on one row (i-th row), which is give as, z(i) q = S(i) S(i)z p=1 s(i) p (j)s(i) p (q)1 h(i) p (j) = h(i) p (q) zj p=1 s(i) p (j)s(i) p (q)1 h(i) p (j) = h(i) p (q) zj From moment assumptions, E[E] = 0 which gives us that E[ z(i) q ] = zq. We now decompose the error into two terms, p=1 s(i) p (j)s(i) p (q)1 h(i) p (j) = h(i) p (q) zj p=1 s(i) p (j)s(i) p (q)1 h(i) p (j) = h(i) p (q) zj By construction, P (EHk = 0) P j Hk, p [P] : h(i) p (j) = h(i) p (q) Pk for C 8Pk. For the other term, we directly compute its second moment, E[E2 Tk] = E p=1 s(i) p (j)s(i) p (q)1 h(i) p (j) = h(i) p (q) zj p=1 E h 1 h(i) p (j) = h(i) p (q) i z2 j + 0 Private Federated Learning with Autotuned Compression Therefore for C k 8P α, from Chebyshev s inequality, ETk k with probability at least 7/8. 
Combining, for C = max k 8P α, 8Pk , with probability at least 3/4, z(i) q zq 2 α ztail(k) 2 With R = 2 log (1/β ) , and zq = median n z(i) q o R i=1; using the standard boosting guarantee based on a median of estimates, we have, with probability at least 1 β , ( zq zq)2 α ztail(k) 2 Setting β = β 2d and doing a union bound on all coordinates gives that with probability at least 1 β, for all q [d], ( zq zq)2 α ztail(k) 2 The third item in the Lemma statement is based on converting the ℓ to ℓ2 guarantee; specifically, instantiating Lemma C.3 with 2 = α ztail(k) 2 k gives us, z z 2 1 + 5α + 2 α ztail(k) 2 1 + 7 α ztail(k) 2 . Similarly, from Lemma C.3, we get, bz z 2 (1 + 10α) ztail(k) 2 , which completes the proof. Lemma C.3. Let z Rd and z Rd such that z z 2 2. Define z = Topk( z) and bz = Top2k( z). Then, z z 2 2 5k 2 + ztail(k) 2 + 2 z bz 2 2 10k 2 + ztail(k) 2 Proof. This is a fairly standard fact, though typically not presented in the form above (see, for instance (Price and Woodruff, 2011)). We give a proof for completeness and its use in many parts of the manuscript. Let I denote the indices of top k co-ordinates of z and let I and b I denote the top k and top 2k co-ordinates of z. We proceed with the first part of the lemma. Note that, z z 2 = z z I 2 = (z z) I 2 + z Ic 2 (4) where Ic denotes the complement set of I. The first term is bounded as (z z) I 2 I z z 2 = k 2. We decompose the second term z Ic 2 as follows, z Ic 2 = z I\ I 2 + z Ic 2 z I\I 2 Let a = maxi I\ I zi and b = mini I\I zi. From the ℓ guarantee and the fact that the sets I and I are indices of top k elements of z and z respectively, we have that (a b)2 4 z z 2 4 2. We note consider two cases (a). b 0 and Private Federated Learning with Autotuned Compression b > 0. In the first case, we have that a 2 . Plugging all these simplifications in Eqn. (4), we get, z z 2 k 2 + z Ic 2 + z I\ I 2 z I\I 2 k 2 + z Ic 2 + I\ I a2 k 2 + z Ic 2 + 4k 2 = z Ic 2 + 5k 2 In the other case (b). 
b > 0, we get, z z 2 k 2 + z Ic 2 + z I\ I 2 z I\I 2 k 2 + z Ic 2 + I\ I a2 I\ I b2 = k 2 + z Ic 2 + I\ I a2 b2 k 2 + z Ic 2 + I\ I (b + 2 )2 b2 k 2 + z Ic 2 + 2 I\ I b + 4 I\ I 2 k 2 + z Ic 2 + 2 r I\ I z Ic + 4k 2 5k 2 + z Ic 2 + 2 where the first equality follows since |I| = I = k, and the second last inequality follows since r I\ I b z Ic and b is value of the the minimal element in I\ I. Finally, note that z Ic = ztail(k) . Hence, combining the two cases yields the statement claimed in the lemma. The second part follows analogously. In particular, repeating the initial steps gives us z bz 2 2k 2 + z Ic 2 + z I\b I 2 zb I\I 2 . Defining a and b as before and considering the two cases give us that in the first case (a). b 0, we have, z bz 2 2k 2 + z Ic 2 + z I\b I 2 2k 2 + z Ic 2 + I\b I a2 2k 2 + z Ic 2 + 4k 2 = 6k 2 + z Ic 2 . In the second case, let bk = I\b I ; note that b I\I = k + bk. Therefore, z bz 2 2k 2 + z Ic 2 + z I\b I 2 zb I\I 2 2k 2 + z Ic 2 + bka2 (k + bk)b2 2k 2 + z Ic 2 + 4bk 2 + 4bkb kb2 k + 2k 2 + 4bk 2 + z Ic 2 10k 2 + z Ic 2 . where in the last inequality, we used that bk k. Combining the two cases finishes the proof. Private Federated Learning with Autotuned Compression The following result gives a heavy hitter guarantee for count-median-of-means sketch with noise. Lemma C.4. Let z Rd, R, C, P be parameters of Count-median-of-means sketch S and let ξ N(0, σ2I). Let z = U (S(z) + ξ). For C max 8Pk, k 8P α and R = 2 log (2d/β) gives us that with probability at least 1 β, z z 2 2α ztail(k) 2 Let z = Topk( z) and bz = Top2k( z). With probability at least 1 β, we have z z 2 2 2 1 + 7 α ztail(k) 2 + 5kσ2 bz z 2 2 (1 + 20α) ztail(k) 2 + 10kσ2 Proof. The proof extends the analysis in (Pagh and Thorup, 2022) which was limited to count-sketch (P = 1). Specifically, we apply Lemma 3.4 in (Pagh and Thorup, 2022) plugging in an estimate of one row error obtained from our sketch. 
In the proof of Lemma C.2, for C max 8Pk, k 8P α , for one row estimate z(i), we have that with probability 3/4, α ztail(k) 2 For a fixed coordinate q, let z(i) q denote its estimate from the i-th row of the count sketch. We have, z(i) q = zq + 1 p=1 s(i) p (j)s(i) p (q)1 h(i) p (j) = h(i) p (q) zj p=1 s(i) p (q)ξi p The reason that the second term has P in the denominator and not P is because the noise is added after sketching, which itself performs division by P operation. Now, A is the original non-private estimate and B is the additional noise. Since ξj p N(0, σ2) and s(i) p are random signs, the random variable B N(0, σ2). Applying Lemma 3.4 from (Pagh and Thorup, 2022) gives us that for 2 > α ztail(k) 2 k , the mean estimate z, with probability 1 2d exp R min 1, 2 satisfies z(i) z 2 We set 2 = max 2α ztail(k) 2 k , σ2 note that with this setting, the success probability is at least 1 2d exp R Hence, setting R = 2 log (2d/β) yields the claimed ℓ guarantee with probability 1 β. For the Top k and Top 2k guarantees, we apply Lemma C.3 with the above , which yields, 2α ztail(k) 2 + ztail(k) 2 + 2 k ztail(k) 2 p = 1 + 14 α ztail(k) 2 + 5kσ2 + 2 k ztail(k) σ 2 1 + 7 α ztail(k) 2 + 5kσ2, where the last inequality follows from AM-GM inequality. Similarly, from Lemma C.3, we get bz z 2 (1 + 20α) ztail(k) 2 + 10kσ2, which finishes the proof. In the following, we give a lower bound on the error of count median-of-means sketch with noise, for any instance. Private Federated Learning with Autotuned Compression Lemma C.5. Let R, C, P be parameters of Count-median-of-means sketch S and let ξi = [ξ1 i , ξ2 i , , ξR i ] , where ξj i N(0, σ2IP C) for all i [n], j [R]. Consider any dataset D = {z1, z2, , zn} and let µ = Topk U 1 n Pn i=1 S(zi) + ξi . For C 100Pk, and R = 800 log (2d/β) with probability at least 1 β, bµtail(k) 2 2 + 0.125σ2k Proof. Since the sketching operation is linear, it suffices to show the result for sketching the mean of the dataset, bµ := 1 n Pn i=1 zi. 
For a fixed coordinate q, let µ(i) q denote its estimate from the i-th row of the count-median-of-means sketch. We have, µ(i) q bµq 2 = p=1 s(i) p (j)s(i) p (q)1 h(i) p (j) = h(i) p (q) bµj p=1 s(i) p (q)ξi p = A2 + B2 + 2AB B2 2 |A| |B| Since ξj p N(0, σ2) and s(i) p are random signs, the random variable B N(0, σ2). Using the CDF table for the normal distribution, we have that with P B2 4σ2 = 2PX N(0,1) (X 1) 0.04 P B2 0.25σ2 = 1 P B2 > 0.25σ2 = 1 2PX N(0,1) (X > 0.5) 0.4 With C max 100Pk, k 8P α as in the proof of Lemma C.2, |A| α bµ tail(k) k with probability at least 0.99. Hence, with probability at least 0.55, we have that µ(i) q bµq 2 0.25σ2 4σ α bµtail(k) We now argue amplification for the median: Define Ii q = 1 bµq µi q bµq + ε . Note that EIi q 0.45. Then, P (bµq µq bµq + ε) P Ii q EIi q 0.05R by setting of R. Now, similarly, we have P (bµq µq bµq ε) β Combining both gives us that with probability at least 1 β, for all q [d] ( µq bµq)2 ε2 (5) Now, let µ = Topk U 1 n Pn i=1 S(zi) + ξi and let I be the set of coordinates achieving its Top k. We have, µ bµ 2 = µ I bµ I + bµ Ic 2 = ( µ bµ) I 2 + bµ Ic 2 ( µ bµ) I 2 + bµtail(k) 2 For the other term, from Eqn. (5), we have ( µ bµ) I 2 kε2 Combining, we get, µ bµ 2 bµtail(k) 2 + kε2 = bµtail(k) 2 + 0.25σ2k 4σ k α bµtail(k) Consider two cases: Private Federated Learning with Autotuned Compression Case 1 : bµtail(k) 8σ k α. Note that bµtail(k) 2 4σ k α bµtail(k) bµtail(k) 2 2 bµtail(k) 8σ In this case, the bound becomes, bµtail(k) 2 2 + 0.25σ2k Case 2 : bµtail(k) 0.125σ k 4 α . Note that k α bµtail(k) 0.125σ2k bµtail(k) 0.125σ In this case, the bound becomes, µ bµ 2 bµtail(k) 2 + 0.125σ2k We now set α to make sure the two cases exhaust all the possibilities. In particular, k α = 0.125σ k 4 α α = 1 128 Combining, this gives us, bµtail(k) 2 2 + 0.125σ2k which completes the proof. In the following, we compute the mean-squared error of count-mean sketch with noise, for any instance. Lemma C.6. 
For the count-median-of-means sketching operation described in Section 3.2, with $R = 1$ and $\xi_i \sim N(0, \sigma^2 I_{PC})$ for $i \in [n]$, we have that for any dataset $D = \{z_i\}_{i=1}^n$ with $z_i \in \mathbb{R}^d$,
$$\mathbb{E}\left\| U\!\left(\frac{1}{n}\sum_{i=1}^n S(z_i) + \xi_i\right) - \widehat{\mu}(D)\right\|^2 = \frac{d-1}{PC}\|\widehat{\mu}(D)\|^2 + \frac{d\sigma^2}{n^2}.$$

Proof. Since the sketching operation is linear, it suffices to show the result for sketching the mean of the dataset, $\widehat{\mu} := \frac{1}{n}\sum_{i=1}^n z_i$, together with the correspondingly averaged noise; below, $\sigma^2$ denotes the per-coordinate variance of this averaged noise. Further, since $R = 1$, the sketching and unsketching operations simplify, and we get
$$\mathbb{E}\big\|S^{(1)\top}\big(S^{(1)}\widehat{\mu} + \xi\big) - \widehat{\mu}\big\|^2 = \mathbb{E}\big\|S^{(1)\top}S^{(1)}\widehat{\mu} - \widehat{\mu}\big\|^2 + \mathbb{E}\big\|S^{(1)\top}\xi\big\|^2.$$
For the first term, we look at the error in every coordinate; recall that $h^{(1)}_j : [d] \to [C]$ denotes the bucketing hash function, and $C$ and $P$ denote the number of columns and rows of the count-mean sketch. We have
$$\mathbb{E}\Big[\big(S^{(1)\top}S^{(1)}\widehat{\mu} - \widehat{\mu}\big)_j^2\Big] = \frac{1}{PC}\sum_{q=1, q \ne j}^{d} \widehat{\mu}_q^2,$$
where in the above we use that $\mathbb{P}\big(h^{(1)}(q) = h^{(1)}(j)\big) = \frac{1}{C}$ for $q \ne j$. Summing over $j \in [d]$ gives
$$\mathbb{E}\big\|S^{(1)\top}S^{(1)}\widehat{\mu} - \widehat{\mu}\big\|^2 = \frac{d-1}{PC}\|\widehat{\mu}\|^2.$$
For the other term, by direct computation, we have
$$\mathbb{E}\big\|S^{(1)\top}\xi\big\|^2 = \sum_{j=1}^d \mathbb{E}\Big[\big(S^{(1)\top}\xi\big)_j^2\Big] = d\sigma^2,$$
where the above uses the fact that the sketching matrix and the Gaussian noise vector are independent and that the diagonal entries of $S^{(1)\top}S^{(1)}$ have expectation 1. Combining the above (and accounting for the $1/n$ scaling of the aggregated noise) yields the statement in the lemma.

The following result shows that the count median-of-means sketch, (even) with added noise, is an unbiased estimator.

Lemma C.7. For the count-median-of-means sketching and unsketching operations described in Section 3.2, for any $P, R, C \ge 1$, $\sigma^2 \ge 0$, and any $z \in \mathbb{R}^d$, we have $\mathbb{E}[U(S(z) + \xi)] = z$, where $\xi = [\xi^{(1)}, \xi^{(2)}, \ldots, \xi^{(R)}]$ with $\xi^{(i)} \sim N(0, \sigma^2 I_{PC})$ for $i \in [R]$.

Proof. The $j$-th coordinate is $U(S(z) + \xi)_j = \mathrm{Median}\big(\big\{S^{(i)\top}\big(S^{(i)}z + \xi^{(i)}\big)\big\}_i\big)_j$. Let $\widetilde{z}^{(i)} = S^{(i)\top}\big(S^{(i)}z + \xi^{(i)}\big)$ and let $\widetilde{z} = U(S(z) + \xi)$. Note that $\widetilde{z}_j = \mathrm{Median}\big\{\widetilde{z}^{(1)}_j, \widetilde{z}^{(2)}_j, \ldots, \widetilde{z}^{(R)}_j\big\}$.
Further, the $j$-th coordinate of $\widetilde{z}^{(i)}$ is
$$\widetilde{z}^{(i)}_j = z_j + \underbrace{\frac{1}{P}\sum_{p=1}^{P}\sum_{q \ne j} s^{(i)}_p(q)\, s^{(i)}_p(j)\, \mathbb{1}\big\{h^{(i)}_p(q) = h^{(i)}_p(j)\big\}\, z_q}_{A^{(i)}_j} + \underbrace{\frac{1}{P}\sum_{p=1}^{P} s^{(i)}_p(j)\, \xi^{(i)}_p}_{B^{(i)}_j}.$$
Hence
$$\widetilde{z}_j = \mathrm{Median}\big\{\widetilde{z}^{(1)}_j, \ldots, \widetilde{z}^{(R)}_j\big\} = z_j + \mathrm{Median}\big\{A^{(1)}_j + B^{(1)}_j, \ldots, A^{(R)}_j + B^{(R)}_j\big\},$$
and thus $\mathbb{E}[\widetilde{z}_j] = z_j + \mathbb{E}\big[\mathrm{Median}\{A^{(i)}_j + B^{(i)}_j\}_{i=1}^R\big]$. Observe that the random variables $A^{(i)}_j$ and $B^{(i)}_j$, and thus $A^{(i)}_j + B^{(i)}_j$, are symmetric about zero by construction. Therefore,
$$\mathbb{E}\big[\mathrm{Median}\{A^{(i)}_j + B^{(i)}_j\}_{i=1}^R\big] = \mathbb{E}\big[\mathrm{Median}\{-A^{(i)}_j - B^{(i)}_j\}_{i=1}^R\big] = -\mathbb{E}\big[\mathrm{Median}\{A^{(i)}_j + B^{(i)}_j\}_{i=1}^R\big].$$
Hence $\mathbb{E}\big[\mathrm{Median}\{A^{(i)}_j + B^{(i)}_j\}_{i=1}^R\big] = 0$. Repeating this for all coordinates completes the proof.

In the following, we show that the count median-of-means sketch, with added noise, is an unbiased estimator even when its size is estimated using data.

Lemma C.8. Let $S : \mathbb{R}^d \to \mathbb{R}^{R \cdot PC}$ and $U : \mathbb{R}^{R \cdot PC} \to \mathbb{R}^d$ be the count-median-of-means sketching and unsketching operations described in Section 3.2. Let $z \in \mathbb{R}^d$ be a random variable with mean $\mu \in \mathbb{R}^d$. For any functions $f_1, f_2, f_3$ with range $\mathbb{N}$, any random variable $\widetilde{\xi}$, and $\sigma^2 \ge 0$, such that the sketch size is set as $P = f_1(z, \widetilde{\xi})$, $R = f_2(z, \widetilde{\xi})$, $C = f_3(z, \widetilde{\xi})$, we have $\mathbb{E}[U(S(z) + \xi)] = \mu$, where $\xi = [\xi^{(1)}, \ldots, \xi^{(R)}]$ with $\xi^{(i)} \sim N(0, \sigma^2 I_{PC})$ for $i \in [R]$.

Proof. The proof follows simply from the law of total expectation. We have
$$\mathbb{E}[U(S(z) + \xi)] = \mathbb{E}_{z, \widetilde{\xi}}\Big[\mathbb{E}_{S, \xi}\big[U(S(z) + \xi) \,\big|\, \widetilde{\xi}, z\big]\Big] = \mathbb{E}_{z, \widetilde{\xi}}[z] = \mu,$$
where the second equality follows from Lemma C.7, since conditioned on $(\widetilde{\xi}, z)$ the sketch size is fixed.

We state the guarantee for the Above Threshold mechanism, with the slight modification that we want to stop when the query output is below (as opposed to above) a threshold $\gamma$.

Lemma C.9.
((Dwork et al., 2010a), Theorem 3.4) Given a sequence of $T$ $B$-sensitive queries $\{q_t\}_{t=1}^T$, a dataset $D$, and a threshold $\gamma$, the Above Threshold mechanism guarantees that with probability at least $1 - \beta$:

1. If the algorithm halts at time $t$, then $q_t(D) \le \gamma + \frac{8B(\log(T) + \log(2/\beta))}{\epsilon}$.
2. If the algorithm does not halt at time $t$, then $q_t(D) \ge \gamma - \frac{8B(\log(T) + \log(2/\beta))}{\epsilon}$.

C.2. Proof of Theorem 3.1

We start with the privacy analysis. Note that in the first round, only the second sketch is used, and in the second round, only the first sketch is used. In both cases, the clipping operation ensures that the sensitivity is bounded. To elaborate: in the first round, as in (Chen et al., 2022a), the sensitivity of $\nu_1 = \frac{1}{n}\sum_{c=1}^n Q^{(c),\text{clipped}}_1$ is at most $\frac{2B}{n}$. Similarly, in the second round, let $\widetilde{\nu}_2$ and $\widetilde{\nu}'_2$ be the norm estimates on neighbouring datasets. The sensitivity is bounded as
$$\big|\|\widetilde{\nu}_2\| - \|\widetilde{\nu}'_2\|\big| \le \|\widetilde{\nu}_2 - \widetilde{\nu}'_2\| = \frac{1}{n}\big\|\mathrm{clip}_B(\widetilde{Q}^{(1)}) - \mathrm{clip}_B(\widetilde{Q}'^{(1)})\big\| \le \frac{2B}{n},$$
where $\widetilde{Q}^{(1)}$ and $\widetilde{Q}'^{(1)}$ denote the differing client's sketched contributions. Hence, using the guarantees of the Gaussian and Laplace mechanisms (Dwork et al., 2014) with the stated noise variances, the first two rounds satisfy $(\frac{\epsilon}{2}, \delta)$-DP and $(\frac{\epsilon}{2}, 0)$-DP, respectively. Finally, applying standard composition, we have that the algorithm satisfies $(\epsilon, \delta)$-DP. The unbiasedness claim follows from linearity of expectation and Lemma C.8, wherein $\widetilde{\xi}$ is the random second sketch $\widetilde{S}$ used to set $C$ in the sketch size.

We now proceed to the utility analysis. Let $\widehat{\mu}_1$ and $\widehat{\mu}_2$ denote the empirical means of the clients' data sampled in the first and second rounds. From concentration of the empirical mean of i.i.d. bounded random vectors (see Lemma 1 in (Jin et al., 2019)), we have that with probability at least $1 - \frac{\beta}{4}$, for $j \in \{1, 2\}$,
$$\|\widehat{\mu}_j - \mu\|^2 \le \frac{2B^2\log(16/\beta)}{n}. \quad (6)$$
Now, from linearity and the $\ell_2$-approximation property of Count Sketch with the prescribed sketch size (see Lemma C.2), we have that with probability at least $1 - \frac{\beta}{4}$,
$$\tfrac{1}{2}\|\widehat{\mu}_1\|^2 \le n_1^2 = \|S_1(\widehat{\mu}_1)\|^2 \le 2\|\widehat{\mu}_1\|^2. \quad (7)$$
Define $n_1 := \|\nu_1\|$, $\widehat{n}_1 = \mathrm{clip}_B(n_1) + \mathrm{Laplace}(\widetilde{\sigma})$, and $\overline{n}_1 := \widehat{n}_1 + \gamma$.
Note that from the setting of B and the above ℓ2-approximation guarantee, with probability at least 1 β 4 , no clipping occurs. Further, from concentration of Laplace random variables (see Fact 3.7 in (Dwork et al., 2014)), we have that with probability at least 1 β/4, |bn1 n1| σ log (8/β) = 2B log (8/β) Therefore, with probability at least 1 3β 4 , we have, n1 = bn1 + γ n1 |n1 bn1| + γ 1 2 bµ1 2B log (8/β) 2 bµ1 bµ2 2B log (8/β) where the last inequality follows from the setting of γ. We now decompose the error of the output µ := µ2 as, µ µ 2 2 µ bµ2 2 + 2 bµ2 µ 2 = 2 µ bµ2 2 + 4B2 log (16/β) where the last inequality holds with probability at least 1 β/4 from Eqn. (6). Let ξ2 N(0, σ2IP C) be the Gaussian noise added to the sketch. For the first term above, µ µ2 2 U2(ν2) bµ2 2 = c=1 clip B(Q(c) 2 ) + ξ2 c=1 Q(c) 2 + ξ2 where the last equality follows from the ℓ2-approximation property of count sketch (Lemma C.2) which, with the setting of B, ensures that no clipping occurs with probability at least 1 β/4. Now, since we are only using one row of the sketch, the sketching and unsketching operations simplify to yield, µ µ 2 = (S2) (S2bµ2 + ξ2) bµ2 2 . We now use Lemma C.6 to get, E µ bµ2 2 = E d 1 PC2 bµ2 2 + dσ2 C2 min n2ϵ2 log (1/δ), nd bµ2 2 C2 > min n2ϵ2 log (1/δ), nd bµ2 2 n + d G2 log (1/δ) n2ϵ2 + dβG2 + dσ2 n + G2 log (1/δ) where the first inequality follows from the setting of C2 and Eqn (9) which gives us that C2 min n2ϵ2 log(1/δ), nd bµ2 2 with probability at least 1 3β 4 and that C2, P 1; the last inequality follows from setting of β log(1/δ) Private Federated Learning with Autotuned Compression The communication complexity is P C + PC2. Note that C2 = max min n2ϵ2 log(1/δ), nd n2 1 G2P , 2 and, n1 = bn1 + γ n1 + |n1 bn1| + γ 2 bµ1 + 2B log (8/β) 2 bµ1 µ + 2B log (8/β) B log (16/β) n + 2B log (8/β) where the first and third inequality holds with probability at least 1 3β 4 from Eqn. (6) (7) and (8), and the last inequality follows from setting of γ. 
This gives a total communication complexity of $\max\!\big(4P,\ 16\min\!\big(\tfrac{n^2\epsilon^2}{\log(1/\delta)},\, nd\big)\tfrac{M^2G^2 + \gamma^2}{G^2}\big)$ with probability at least $1 - \beta$. Plugging in the values of $P$ and $\gamma$ gives the claimed statement.

C.3. Proof of Thm. 3.4

We start with the privacy analysis. There are two steps in each round of interaction which access data: sketching the vectors, and the error estimate. For the first access, by the clipping operation, the $\ell_2$ sensitivity of each row of the combined sketch $\nu_j = \frac{1}{n}\sum_{c=1}^n \mathrm{clip}_B(Q^{(c)}_j)$ is bounded by $\frac{2B}{n}$. We apply the Gaussian mechanism guarantee in terms of Rényi Differential Privacy (RDP) (Mironov, 2017), together with composition over rows and the number of rounds. Converting the RDP guarantee to an approximate DP guarantee gives us that, with the stated noise variance, the quantity $\{\nu_j\}_{j=1}^{\lceil\log(d)\rceil}$ satisfies $(\frac{\epsilon}{2}, \delta)$-DP. For the second access, we compute the $\ell_2$ sensitivity of the error estimate as follows. Consider two neighbouring datasets $D$ and $D'$ such that, w.l.o.g., the datapoint $z_1$ in $D$ is replaced by $z'_1$ in $D'$. Let $\nu_j$ and $\nu'_j$ be the corresponding quantities for $D$ and $D'$. Since $\widetilde{\mu}_j$ is private, we fix it and compute the sensitivity as
$$\big| \|S_j(\widetilde{\mu}_j) - \nu_j\| - \|S_j(\widetilde{\mu}_j) - \nu'_j\| \big| \le \|\nu_j - \nu'_j\| = \frac{1}{n}\big\|\mathrm{clip}_B\big(S^{(i)}_j(z_1)\big) - \mathrm{clip}_B\big(S^{(i)}_j(z'_1)\big)\big\| \le \frac{2B}{n},$$
where the first and second steps follow from the triangle inequality, the third from the fact that clipped vectors have norm at most $B$, and the last from the setting of $R$. From the Above Threshold guarantee (Lemma C.9), the prescribed settings of the noise added to the threshold (of standard deviation $\widetilde{\sigma}$) and to the query $e_j$ (of standard deviation $2\widetilde{\sigma}$) imply that the error estimation step satisfies $(\frac{\epsilon}{2}, 0)$-DP. We remark that in our case the threshold $\gamma_j$ changes with the $j$-th query, whereas in standard Above Threshold the threshold is fixed. However, note that $\gamma_j = \gamma$ (a fixed value) $+\ 16\sqrt{k_j}\sigma$ (changing). The changing part of the threshold can be absorbed into the query itself, without changing its sensitivity, thereby reducing it to standard Above Threshold with a fixed threshold.
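The below-threshold variant of Above Threshold used here can be sketched as follows. This is a hedged illustration of the standard sparse-vector construction (Laplace noise of scale $2B/\epsilon$ on the threshold and $4B/\epsilon$ on each query, with the comparison flipped), not the paper's exact implementation; the changing threshold $\gamma_j$ would be folded into the queries as described above.

```python
import numpy as np

def below_threshold(queries, dataset, gamma, sensitivity, eps, rng):
    """Sparse-vector ("Above Threshold") mechanism, flipped to halt at the
    first query whose noisy value falls BELOW the noisy threshold gamma.
    The entire run satisfies eps-DP for B-sensitive queries."""
    noisy_gamma = gamma + rng.laplace(scale=2.0 * sensitivity / eps)
    for t, q in enumerate(queries):
        if q(dataset) + rng.laplace(scale=4.0 * sensitivity / eps) <= noisy_gamma:
            return t  # halt: q_t(D) <= gamma + O(B log(T/beta)/eps) w.h.p.
    return None       # no query fell below the threshold

# usage: halts at the first query that drops far below the threshold
rng = np.random.default_rng(1)
queries = [lambda D, c=c: float(np.sum(D)) - c for c in (0.0, 0.0, 1e9)]
t_halt = below_threshold(queries, np.ones(10), gamma=0.0,
                         sensitivity=1.0, eps=10.0, rng=rng)
print(t_halt)
```

The design point is that only the halting time is released, which is why the privacy cost does not grow with the number of queries answered before halting.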
Finally, combining the above DP guarantees using basic composition of differential privacy gives us (ϵ, δ)-DP. The unbiasedness claim follows from linearity of expectation and Lemma C.8 wherein ξ is the random second sketch S used to set C in the sketch size. We now proceed to the utility proof. The proof consists of two parts. First, we show that when the algorithm stops, it guarantees that the error of the output is small. Then, we give a high probability bound on the stopping time. We start with the first part. Recall that µ is the true mean and let bµj denote the empirical mean of the cohort selected in step j of the algorithm. We first decompose the error into statistical and empirical error as follows; for any j log (d) µ µ 2 2 bµj µ 2 + 2 µ bµj 2 (10) We bound the first term bµj µ 2 by standard concentration arguments (see Lemma 1 in (Jin et al., 2019)). Specifically, Private Federated Learning with Autotuned Compression with probability at least 1 β 4 , for all j log (d) , we have, bµj µ 2 2B2 log (8 log (d) /β) We now bound the second term in Eqn. (10). Let bej be the error in sketching with Sj, defined as, bej = µj bµj . Note that ej, defined in Algorithm 2, is an estimate of bej. Specifically, fixing the random {Sj}j, from linearity and ℓ2-approximation property of Count Sketch with the prescribed sketch size (see Lemma C.2) we have that with probability 1 β 4 , for all j log (d) , we have, e2 j = Sj ( µj) νj 2 = Sj ( µj bµj) 2 1 µj bµj 2 = 1 Let bj be the guess on which the algorithm stops. Using the utility guarantee of Above Threshold mechanism (Lemma C.9), with probability at least 1 β ebj γbj + α (13) where α = 32B(log( log(d) )+log(8/β)) nϵ are as stated in the Theorem statement. Combining Eqn. (12) and Eqn. (13), we have that when the algorithm halts, with probability at least 1 β γbj + α 2 = be2 bj 3 To compute the error of the output µ, from Eqn. 
(10), and above, we have that with probability at least 1 β, µ µ 2 16G2 log (8 log (d) /β) γbj + α 2 (14) G2 log (8 log (d) /β) n + d G2 log (1/δ) (log (8d/β))3 n2ϵ2 + G2 (log ( log (d) ) + log (8/β))2 We now give a high-probability bound on the stopping time. Given a vector z Rd, define ztail(k) 2 = min z: z 0 k z z 2. The error is bounded as, c=1 clip B(Q(c) j ) d Uj(Sj(bµj) + ξj) bµj 2 (18) (bµj)tail(kj) 2 + 10dσ2 where the second equality holds using the JL property in Lemma C.2 which shows that with probability at least 1 β 4 , norms of all vectors are preserved upto a relative tolerance [ 1 2], hence no clipping is done in all rounds. The bound on the second term follows from the sketch recovery guarantee in Lemma C.4 (with α = 1 in the lemma). Let γ2 = max 1 n, σ2d n2G2 and g(k) = γ2 d k 8 log(8 log(d)/β) n . By construction, our strategy uses guesses j and j such that 2j ktail(g(k); µ) 2j, and kj = 2j = 2 2j 2ktail(g(k); µ). Let I be the set of indices corresponding tail(ktail(g(k); µ)) of µ i.e. the d ktail(g(k); µ) smallest coordinates of µ and let Private Federated Learning with Autotuned Compression µI be a vector such that (µI)i = µi if i I, otherwise (µI)i = 0. From monotonicity of the errors, we have that (bµj)tail(k(j) 2 (bµj)tail(|I|) 2 (bµj)I 2 2 (bµj)I µI 2 + 2 µtail(ktail(g(k);µ)) 2 2 (bµj µ)I 2 + 2 µtail(ktail(g(k);µ)) 2 2 bµj µ 2 + 2 µtail(ktail(g(k);µ)) 2 4B2 log (8 log (d) /β) n + 2 µtail(ktail(g(k);µ)) 2 2γ2G2ktail(g(k); µ) where the second last inequality follows from vector concentration bound in Eqn. (11) which holds with probability at least 1 β 4 for all j log (d) , and the last inequality holds from the definition of g(k). From a union bound over the success of sketching and error estimation and using Eqn. 
(12), with probability at least 1 β, (bµj)tail(k(j)) 2 + 10σ2d (bµj)tail(k(j)) 2 + 15dσ2 96γ2G2 + 15dσ2 n2 121γ2G2 (20) Finally, from the Above Threshold guarantee (using contra-positive of second part of Lemma C.9), since ej γ α = 15γG, with probability at least 1 β 4 , it halts. The communication complexity now follows by the setting of the two sketch sizes. The total communication for sketches {Sj}j is 2j = Θ ktail(g(k); µ) log (8d log (d) /β) log2 (16 log (8d log (d) /β) log (d) /β) where we use that j 2ktail(g(k); µ), with holds with probability at least 1 β, as argued above. Similarly, the total communication for sketches n Sj o j=1 R C P = j2 P 2 = Θ (log (ktail(g(k); µ)) log (4d log (d) /β)) (22) The total communication is the sum of the two terms in Eqn. (21) and (22) which is dominated by the first term. This completes the proof. C.4. Proof of Thm. B.3 The privacy analysis follows as in Thm. 3.4. However, we note that in our application of Above Threshold here, the threshold γj changes for j-th query, but in standard Above Threshold, the threshold is fixed. In our case, γj = γ(some fixed value) + 16 p kjσ(changing) this changing part of the threshold can be absorbed in the query itself, without changing its sensitivity, thereby reducing it to standard Above Threhsold with fixed threshold. We now proceed to the utility proof, which again consists of two parts. First, we show that when the algorithm stops, it guarantees that the error of the output is small. Secondly, we give a high probability bound on the stopping time. We start with the first part. The proof is identical to that of Thm. B.3 up to Eqn. (14). Recall that µ is the true mean and bµj denotes the empirical mean of the cohort selected in step j of the algorithm. Let kj = 2j and bej be the error in sketching Private Federated Learning with Autotuned Compression with Sj, defined as, bej = µj bµj . Further, let bj be the guess on which the algorithm stops. From Eqn. (14) in Thm. 
B.3, With probability at least 1 β, µ µ 2 16G2 log (8 log (d) /β) γbj + 2 α 2 16G2 log (8 log (d) /β) log (8 log (d) /β) n + q We now give a high probability bound on bj, where the algorithm stops, which we will plug-in in the above bound. Given a vector z Rd, define ztail(k) 2 = min z: z 0 k z z 2. The error is bounded as, c=1 clip B(Q(c) j ) = Topkj (Uj(Sj(bµj) + ξj)) bµj 2 32 (bµj)tail(kj) 2 + 10σ2kj where the second equality holds using the JL property in Lemma C.2 which shows that with probability at least 1 β 4 , norms of all vectors are preserved upto a relative tolerance [ 1 2], hence no clipping is done in all rounds. The bound on the second term follows from the sketch recovery guarantee in Lemma C.4 (with α = 1 in the lemma). By construction, our strategy uses guesses j and j such that 2j ktail(γ2; µ) 2j, and kj = 2j = 2 2j 2ktail(γ2; µ). Let I be the set of indices corresponding tail(ktail(γ2; µ)) of µ i.e. the d ktail(γ2; µ) smallest coordinates of µ and let µI be a vector such that (µI)i = µi if i I, otherwise (µI)i = 0. We have that (bµj)tail(k(j) 2 (bµj)tail(|I|) 2 (bµj)I 2 2 (bµj)I µI 2 + 2 µtail(ktail(γ2;µ)) 2 2 (bµj µ)I 2 + 2 µtail(ktail(γ2;µ)) 2 2 bµj µ 2 + 2 µtail(ktail(γ2;µ)) 2 4B2 log (8 log (d) /β) n + 2 µtail(ktail(γ2;µ)) 2 where the second last inequality follows from vector concentration bound in Eqn. (11) which holds with probability at least 1 β 4 for all j log (d) . From a union bound over the success of sketching and error estimation and using Eqn. (12), with probability at least 1 β, j 32 (bµj)tail(k(j)) 2 + 10σ2kj 48 (bµj)tail(k(j)) 2 + 15kjσ2 48 (bµj)tail(|I|) 2 + 15kjσ2 196B2 log (8 log (d) /β) n + 96 µtail(ktail(γ2;µ)) 2 + 15kjσ2 This gives us that log (8 log (d) /β) n + q Private Federated Learning with Autotuned Compression Finally, from the Above Threshold guarantee (using contra-positive of second part of Lemma C.9), since ej γj α = log(8 log(d)/β) n + q , with probability at least 1 β 4 , it halts. 
Thus, kbj kj 2ktail(γ2; µ). Plugging this in Eqn. (23) gives the claimed error bound. The communication complexity now follows by the setting of the two sketch sizes. The total communication for sketches {Sj}j is kj = Θ ktail(γ2; µ) log (8d log (d) /β) log2 (16 log (8d log (d) /β) log (d) /β) (25) where we use that j 2ktail(γ2; µ), with holds with probability at least 1 β, as argued above. Similarly, the total communication for sketches n Sj o j=1 R C P = j2 P 2 = Θ log ktail(γ2; µ) log (4d log (d) /β) (26) The total communication is the sum of the two terms in Eqn. (25) and (26) which is dominated by the first term. This completes the proof. D. Proofs of Error Lower Bounds for FME D.1. Non-private statistical error lower bounds We first recall some notation: for a probability distribution D, let µ(D) := Ez D[z] denote its mean, when it exists. Theorem D.1. Let d N, G and M such that 0 < M 1, define the instance class, P1(d, G, M) = Probability distribution D over Rd : z G for z D and µ(D) MG Then, for any n N, we have, min A:(Rd)n Rd max D P1(d,G,M) E D Dn,A A(D) µ(D) 2 = Θ G2 min M 2, 1 Proof. The upper bound is achieved by the best of the following two estimators: M 2G2 for the trivial estimator which outputs zero regardless of the problem instance, and G2 n for the sample mean estimator. For the lower bound, first note that it suffices to consider d = 1 since the right hand side doesn t explicit depend on d and one can always embed a one-dimensional instance in any higher dimensions simply by appending zero co-ordinates to the data points. Our lower bound uses the (symmetric) Bernoulli mean estimation problem, the proof of which is classical (see Example 7.7 in (Duchi, 2016)). However, in its standard formulation, there is no constraint on the mean that µ(D) MG. Consequently, we modify the proof to incorporate it and we present it (almost) in entirety, below. 
As is standard in minimax lower bounds, we reduce the estimation problem to (binary) hypothesis testing: let $\mathcal{Z} = \{-G, G\}$ and consider two distributions $D_1$ and $D_2$ over $\mathcal{Z}$ defined as follows,
$$\mathbb{P}_{z\sim D_1}(z = G) = \frac{1+\tau}{2}, \qquad \mathbb{P}_{z\sim D_2}(z = G) = \frac{1-\tau}{2},$$
where $\tau \ge 0$ is a parameter to be set later. Note that $|\mu(D_1)| = |\mu(D_2)| = \tau G$; since we need $D_1, D_2 \in \mathcal{P}_1(1,G,M)$, this gives us the constraint $\tau \le M$.

The minimax risk is lower bounded as
$$\min_{A}\;\max_{D\in\mathcal{P}_1(d,G,M)} \mathbb{E}\,\|A(\mathcal{D})-\mu(D)\|^2 \ge \min_A \max_{D\in\{D_1,D_2\}} \mathbb{E}\,\|A(\mathcal{D})-\mu(D)\|^2 \ge \min_A \mathbb{E}_{D\sim\mathrm{Unif}\{D_1,D_2\}}\,\mathbb{E}_{\mathcal{D}\sim D^n,A}\,\|A(\mathcal{D})-\mu(D)\|^2 \ge G^2\tau^2 \min_{\Psi} \mathbb{P}_{D\sim\mathrm{Unif}\{D_1,D_2\},\,\mathcal{D}\sim D^n}\bigl(\Psi(\mathcal{D}) \ne D\bigr) \ge \frac{G^2\tau^2}{2}\bigl(1 - \mathrm{TV}(D_1^n, D_2^n)\bigr),$$
where $\mathrm{TV}(D_1^n, D_2^n)$ denotes the total variation distance between the product distributions. In the above, the third inequality is the reduction from estimation to (Bayesian) testing (see Proposition 7.3 in (Duchi, 2016)), with $\Psi$ being the test function: herein, nature first chooses $D \sim \mathrm{Unif}\{D_1, D_2\}$; conditioned on this choice, we observe $n$ i.i.d. samples from $D$, and the goal is to infer $D$. Further, the last inequality follows from Proposition 2.17 in (Duchi, 2016). We now upper bound the total variation distance by the KL divergence using the Bretagnolle–Huber inequality (Bretagnolle and Huber, 1978); this is the key step which differs from the proof in the standard setup, wherein a weaker bound based on Pinsker's inequality suffices. We get
$$\mathrm{TV}(D_1^n, D_2^n)^2 \le 1 - \exp\bigl(-\mathrm{KL}(D_1^n \,\|\, D_2^n)\bigr) = 1 - \exp\bigl(-n\,\mathrm{KL}(D_1\|D_2)\bigr) = 1 - \exp\Bigl(-n\tau\log\Bigl(\frac{1+\tau}{1-\tau}\Bigr)\Bigr) \le 1 - \exp\bigl(-3n\tau^2\bigr),$$
where the first equality uses the chain rule for KL divergence, the second equality follows from direct computation, and the last inequality holds for $\tau \le \frac{1}{2}$; note that in our setup we already have the constraint $\tau \le M$, which is not violated by additionally imposing $\tau \le \frac{1}{2}$.
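The single-sample facts the display above relies on — the closed form $\mathrm{KL}(D_1\|D_2) = \tau\log\frac{1+\tau}{1-\tau}$, the bound $\mathrm{KL} \le 3\tau^2$ for $\tau \le \frac{1}{2}$, and the Bretagnolle–Huber inequality for these two-point distributions — can be checked numerically. A small sanity check (illustrative only, not part of the proof; function names are ours):

```python
import math

def kl_symmetric_bernoulli(tau):
    # KL(D1 || D2) with P_{D1}(G) = (1+tau)/2 and P_{D2}(G) = (1-tau)/2;
    # direct computation collapses this to tau * log((1+tau)/(1-tau)).
    p, q = (1 + tau) / 2, (1 - tau) / 2
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def tv_symmetric_bernoulli(tau):
    # TV distance of the two two-point distributions is |p - q| = tau.
    return tau

for tau in [0.05, 0.1, 0.25, 0.5]:
    kl = kl_symmetric_bernoulli(tau)
    # closed form used in the proof
    assert abs(kl - tau * math.log((1 + tau) / (1 - tau))) < 1e-12
    # KL <= 3*tau^2 for tau <= 1/2
    assert kl <= 3 * tau ** 2
    # Bretagnolle-Huber (single sample, n = 1): TV^2 <= 1 - exp(-KL)
    assert tv_symmetric_bernoulli(tau) ** 2 <= 1 - math.exp(-kl)
```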
This gives us
$$\min_{A}\max_{D\in\mathcal{P}_1(d,G,M)} \mathbb{E}\,\|A(\mathcal{D})-\mu(D)\|^2 \ge \frac{G^2\tau^2}{2}\exp\bigl(-3n\tau^2\bigr).$$
Finally, setting $\tau^2 = \min\bigl\{M^2, \frac{1}{3n}\bigr\}$ yields
$$\min_{A}\max_{D\in\mathcal{P}_1(d,G,M)} \mathbb{E}\,\|A(\mathcal{D})-\mu(D)\|^2 \ge \frac{G^2\min\{M^2, \frac{1}{3n}\}}{2}\exp(-1) \ge \frac{G^2}{12}\min\Bigl\{M^2, \frac{1}{n}\Bigr\},$$
which completes the proof. ∎

Theorem D.2. Let $k, d \in \mathbb{N}$ such that $k < d$, and $G, \gamma > 0$ such that $\gamma \le 1$. Define the instance class
$$\mathcal{P}_2(d,G,\gamma,k) = \bigl\{\text{probability distributions } D \text{ over } \mathbb{R}^d : \|z\|\le G \text{ for } z\sim D, \text{ and } \|\mu(D)_{\mathrm{tail}(k)}\|^2 \le \gamma^2 G^2\bigr\}.$$
Then, for any $n \in \mathbb{N}$, we have
$$\min_{A:(\mathbb{R}^d)^n\to\mathbb{R}^d}\;\max_{D\in\mathcal{P}_2(d,G,k,\gamma)} \mathbb{E}_{\mathcal{D}\sim D^n,A}\,\|A(\mathcal{D})-\mu(D)\|^2 = \Theta\Bigl(\min\Bigl\{G^2, \frac{G^2}{n}\Bigr\}\Bigr).$$
Before presenting the proof, we note that the above statement excludes $k = d$. This is because for $k = d$ the tail constraint reduces to a constraint on the full mean, $\|\mu(D)\| \le \gamma G$, and hence $\mathcal{P}_2(d,G,\gamma,d) = \mathcal{P}_1(d,G,\gamma)$; the optimal rate then follows from Theorem D.1.

Proof. The upper bound follows from the guarantees in the standard setting of mean estimation with bounded data, ignoring the additional structure of the bound on the tail norm. Specifically, the bound $G^2$ is achieved by outputting zero, and $\frac{G^2}{n}$ is achieved by the sample-mean estimator.

For the lower bound, we will establish the result for $\gamma = 0$; the main claim then follows since $\mathcal{P}_2(d,G,k,0) \subseteq \mathcal{P}_2(d,G,k,\gamma)$ for any $\gamma \ge 0$. Further, assume that $d \ge 2$; otherwise, the claim follows from Theorem D.1. Suppose for contradiction that there exist $G > 0$ and $d, n, k \in \mathbb{N}$ with $k < d$ such that there exists an algorithm $A : (\mathbb{R}^d)^n \to \mathbb{R}^d$ with the guarantee
$$\max_{D\in\mathcal{P}_2(d,G,k,0)} \mathbb{E}_{\mathcal{D}\sim D^n,A}\,\|A(\mathcal{D})-\mu(D)\|^2 = o\Bigl(\min\Bigl\{G^2, \frac{G^2}{n}\Bigr\}\Bigr).$$
Now, consider an instance $\widetilde{D} \in \mathcal{P}_1(1, G, 1)$, defined in Theorem D.1. We will use the algorithm $A$ to break the lower bound in Theorem D.1, hence establishing a contradiction. Given $\widetilde{\mathcal{D}} \sim \widetilde{D}^n$, we simply append $d-1$ zero coordinates to every data point in $\widetilde{\mathcal{D}}$; let $\mathcal{D}$ denote this constructed dataset. Note that every data point in $\mathcal{D}$ can be regarded as a sample from the product distribution $D = \widetilde{D}\otimes(D_0)^{d-1}$, where the distribution $D_0$ is a point mass at zero. Further, note that $D \in \mathcal{P}_2(d,G,k,0)$.
Applying the algorithm $A$ yields
$$\mathbb{E}_{\mathcal{D}\sim D^n,A}\,\|A(\mathcal{D})-\mu(D)\|^2 = o\Bigl(\min\Bigl\{G^2, \frac{G^2}{n}\Bigr\}\Bigr).$$
However, note that $\mu(D) = [\mu(\widetilde{D}), 0, \ldots, 0]^{\top}$. Hence, post-processing $A$ by taking only its first coordinate — defining $\widetilde{A}$ as $\widetilde{A}(\widetilde{\mathcal{D}}) := (A(\mathcal{D}))_1$ — gives us
$$\|A(\mathcal{D}) - \mu(D)\|^2 \ge \|\widetilde{A}(\widetilde{\mathcal{D}}) - \mu(\widetilde{D})\|^2.$$
Plugging this in above, we get
$$\max_{\widetilde{D}\in\mathcal{P}_1(1,G,1)} \mathbb{E}_{\widetilde{\mathcal{D}}\sim\widetilde{D}^n,\widetilde{A}}\,\|\widetilde{A}(\widetilde{\mathcal{D}})-\mu(\widetilde{D})\|^2 = o\Bigl(\min\Bigl\{G^2,\frac{G^2}{n}\Bigr\}\Bigr).$$
This contradicts the statement in Theorem D.1, establishing the claimed lower bound. ∎

D.2. Lower bounds on error under differential privacy

Proof of Thm. B.1. The upper bound is established by the known guarantees of the better of the following procedures: outputting zero for the $M^2G^2$ bound, and the Gaussian mechanism for the $\frac{G^2 d\log(1/\delta)}{n^2\epsilon^2}$ bound.

For the lower bound, we consider three cases: (a) $M \ge 1$, (b) $M \le \frac{4}{n}$, and (c) $\frac{4}{n} < M < 1$. The first case, $M \ge 1$, is straightforward: we simply ignore the bound $M$ on the norm of the mean and apply known lower bounds for DP mean estimation with bounded data (see (Steinke and Ullman, 2015; Kamath and Ullman, 2020)). Specifically, define the following instance class,
$$\widehat{\mathcal{P}}(n,d,G) = \bigl\{\{z_1, \ldots, z_n\} : z_i \in \mathbb{R}^d,\; \|z_i\| \le G\bigr\}.$$
The above instance class is the standard setting of mean estimation with bounded data, for which it is known (see (Steinke and Ullman, 2015; Kamath and Ullman, 2020)) that for $\epsilon = O(1)$ and $2^{-\Omega(n)} \le \delta \le \frac{1}{n^{1+\Omega(1)}}$, we have
$$\min_{A:\,A \text{ is } (\epsilon,\delta)\text{-DP}}\;\max_{\mathcal{D}\in\widehat{\mathcal{P}}(n,d,G)} \mathbb{E}\,\|A(\mathcal{D})-\widehat{\mu}(\mathcal{D})\|^2 = \Theta\Bigl(\min\Bigl\{G^2, \frac{G^2 d\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr). \qquad (27)$$
Using the fact that for $M \ge 1$ we have $\widehat{\mathcal{P}}_1(n,d,G,M) = \widehat{\mathcal{P}}(n,d,G)$, we get
$$\min_{A:\,A \text{ is }(\epsilon,\delta)\text{-DP}}\;\max_{\mathcal{D}\in\widehat{\mathcal{P}}_1(n,d,G,M)} \mathbb{E}\,\|A(\mathcal{D})-\widehat{\mu}(\mathcal{D})\|^2 = \Omega\Bigl(\min\Bigl\{G^2, \frac{G^2 d\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr) = \Omega\Bigl(\min\Bigl\{M^2G^2, \frac{G^2 d\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr).$$
For the second case, $M \le \frac{4}{n}$: suppose to the contrary that there exist $d, n \in \mathbb{N}$ and $G, M > 0$ satisfying $M \le \frac{4}{n}$, and $\epsilon = O(1)$, $2^{-\Omega(n)} \le \delta \le \frac{1}{n^{1+\Omega(1)}}$, such that there exists an $(\epsilon,\delta)$-DP algorithm $A$ with the following guarantee:
$$\max_{\mathcal{D}\in\widehat{\mathcal{P}}_1(n,d,G,M)} \mathbb{E}\,\|A(\mathcal{D})-\widehat{\mu}(\mathcal{D})\|^2 = o\Bigl(\min\Bigl\{M^2G^2, \frac{G^2 d\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr).$$
We will use the above algorithm to break a known lower bound, hence implying a contradiction.
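The embedding argument used in Theorems D.1 and D.2 — lift a one-dimensional instance to $\mathbb{R}^d$ by appending zeros, then project the $d$-dimensional estimate back to the first coordinate — can be sketched concretely. Below, `sample_mean` stands in for a hypothetical $d$-dimensional algorithm $A$ (names and values are ours, for illustration):

```python
# Sketch of the lift-and-project reduction: a 1-d dataset is embedded
# in R^d by appending zeros, a d-dimensional estimator is applied, and
# its first coordinate is kept as the 1-d estimate.

def lift(samples_1d, d):
    return [[z] + [0.0] * (d - 1) for z in samples_1d]

def reduce_to_1d(mean_estimator, samples_1d, d):
    estimate_d = mean_estimator(lift(samples_1d, d))
    return estimate_d[0]  # post-process: keep the first coordinate

def sample_mean(data):
    # coordinate-wise sample mean, playing the role of A
    n, d = len(data), len(data[0])
    return [sum(row[j] for row in data) / n for j in range(d)]

# With the exact d-dimensional sample mean as A, the reduction
# recovers the 1-d sample mean exactly.
est = reduce_to_1d(sample_mean, [1.0, 2.0, 3.0], d=4)
assert est == 2.0
```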
Consider the instance class $\widehat{\mathcal{P}}(1, d, MG)$, i.e., $n = 1$, defined above. From the DP mean estimation lower bound (Eqn. (27)), we have that for $\epsilon = O(1)$ and $\delta = \Theta(1)$,
$$\min_{A:\,A \text{ is }(\epsilon,\delta)\text{-DP}}\;\max_{\mathcal{D}\in\widehat{\mathcal{P}}(1,d,MG)} \mathbb{E}\,\|A(\mathcal{D})-\widehat{\mu}(\mathcal{D})\|^2 = \Theta\Bigl(\min\Bigl\{M^2G^2, \frac{M^2G^2 d\log(1/\delta)}{\epsilon^2}\Bigr\}\Bigr) = \Theta\bigl(G^2\min\{M^2, M^2 d\}\bigr) = \Omega(M^2G^2), \qquad (28)$$
where the second and third equalities follow from the conditions on $\epsilon$ and $\delta$. We now modify $A$ to solve DP mean estimation for datasets in $\widehat{\mathcal{P}}(1, d, MG)$. Towards this, given $\widetilde{\mathcal{D}} \in \widehat{\mathcal{P}}(1,d,MG)$, i.e., $\widetilde{\mathcal{D}} = \{z\}$, construct a dataset $\mathcal{D}$ of $n$ samples as follows: $\mathcal{D} = \bigl\{\frac{nz}{4}, 0, \ldots, 0\bigr\}$. It is easy to see that $\mathcal{D} \in \widehat{\mathcal{P}}_1(n,d,G,M)$ (the norm bound on $\frac{nz}{4}$ uses $M \le \frac{4}{n}$). Applying the $(\epsilon,\delta)$-DP algorithm $A$ to $\mathcal{D}$ gives us
$$\mathbb{E}\,\|A(\mathcal{D})-\widehat{\mu}(\mathcal{D})\|^2 = o\Bigl(G^2\min\Bigl\{M^2, \frac{d\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr).$$
We now post-process $A(\mathcal{D})$ to obtain a solution for $\widetilde{\mathcal{D}}$. Firstly, define $\widetilde{\delta} = \frac{1}{2} - \delta$. With probability $\widetilde{\delta}$, output $z$ as the mean of the dataset $\widetilde{\mathcal{D}}$; in the other case, output $A(\mathcal{D})$; call this output $\widetilde{\mu}$. Note that in the first case we do not preserve privacy, whereas the other case satisfies $(\epsilon,\delta)$-DP by the DP property of $A$ and post-processing. Hence, overall, the reduced method satisfies $(\epsilon, \frac{1}{2})$-DP. We remark that this is done since the usual lower bound (Eqn. (27)) assumes $\delta = \Theta(1)$, as therein $n = 1$. The accuracy is
$$\mathbb{E}\,\|\widetilde{\mu} - \widehat{\mu}(\widetilde{\mathcal{D}})\|^2 \lesssim \delta\,\mathbb{E}\,\|A(\mathcal{D})-\widehat{\mu}(\mathcal{D})\|^2 = o\Bigl(G^2\min\Bigl\{M^2, \frac{d\,\delta\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr) = o\Bigl(G^2\min\Bigl\{M^2, \frac{d}{n^2\epsilon^2}\Bigr\}\Bigr) = o\bigl(G^2\min\{M^2, d\,M^2\}\bigr) = o(M^2G^2),$$
where the second equality follows since $\delta\log(1/\delta) \le 1$ for $2^{-\Omega(n)} \le \delta \le \frac{1}{n^{1+\Omega(1)}}$, the third follows since $\epsilon = O(1)$, and in the final equality we use the assumption that $M \le \frac{4}{n}$. This contradicts the lower bound in Eqn. (28). We now proceed to the final case, $\frac{4}{n} < M < 1$, wherein we proceed as in case two.
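The dataset construction in case two — padding $\{z\}$ to $n$ points as $\{\frac{nz}{4}, 0, \ldots, 0\}$ — preserves both constraints of $\widehat{\mathcal{P}}_1(n,d,G,M)$ exactly when $M \le \frac{4}{n}$. A quick numerical check (toy values, ours; illustrative only):

```python
# Toy check of the case-(b) reduction: z has norm <= M*G, and the
# padded dataset {n*z/4, 0, ..., 0} must satisfy (i) every point has
# norm at most G and (ii) the sample mean z/4 has norm at most M*G.
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

n, G = 20, 1.0
M = 4 / n                      # boundary of case (b): M <= 4/n
z = [M * G, 0.0]               # a point with ||z|| = M*G
padded = [[n * x / 4 for x in z]] + [[0.0, 0.0]] * (n - 1)

assert all(norm(p) <= G + 1e-12 for p in padded)          # (i)
mean = [sum(p[j] for p in padded) / n for j in range(2)]  # equals z/4
assert norm(mean) <= M * G + 1e-12                        # (ii)
```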
Suppose to the contrary that there exist $d, n \in \mathbb{N}$ and $G, M > 0$ satisfying $\frac{4}{n} < M < 1$, and $\epsilon = O(1)$, $2^{-\Omega(n)} \le \delta \le \frac{1}{n^{1+\Omega(1)}}$, such that there exists an $(\epsilon,\delta)$-DP algorithm $A$ with the following guarantee:
$$\max_{\mathcal{D}\in\widehat{\mathcal{P}}_1(n,d,G,M)} \mathbb{E}\,\|A(\mathcal{D})-\widehat{\mu}(\mathcal{D})\|^2 = o\Bigl(G^2\min\Bigl\{M^2, \frac{d\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr).$$
Define $N := \bigl\lceil\frac{Mn}{2}\bigr\rceil$ and consider the instance class $\widehat{\mathcal{P}}(N, d, G)$, as defined above. Again, from the DP mean estimation lower bound, we have that for $\epsilon = O(1)$ and $2^{-\Omega(N)} \le \delta \le \frac{1}{N^{1+\Omega(1)}}$,
$$\min_{A:\,A \text{ is }(\epsilon,\delta)\text{-DP}}\;\max_{\mathcal{D}\in\widehat{\mathcal{P}}(N,d,G)} \mathbb{E}\,\|A(\mathcal{D})-\widehat{\mu}(\mathcal{D})\|^2 = \Theta\Bigl(\min\Bigl\{G^2, \frac{G^2 d\log(1/\delta)}{N^2\epsilon^2}\Bigr\}\Bigr). \qquad (29)$$
We now modify $A$ to solve DP mean estimation for datasets in $\widehat{\mathcal{P}}(N,d,G)$. Towards this, given $\widetilde{\mathcal{D}} \in \widehat{\mathcal{P}}(N,d,G)$, construct a dataset $\mathcal{D}$ of $n$ samples by appending $n - N$ samples of $0 \in \mathbb{R}^d$ to $\widetilde{\mathcal{D}}$. Note that the norm of the data points in $\mathcal{D}$ is upper bounded by $G$ and $\|\widehat{\mu}(\mathcal{D})\| = \frac{N}{n}\|\widehat{\mu}(\widetilde{\mathcal{D}})\| \le \frac{M}{2}G \le MG$. Hence, we have that $\mathcal{D} \in \widehat{\mathcal{P}}_1(n,d,G,M)$. Applying the $(\epsilon,\delta)$-DP algorithm $A$ to $\mathcal{D}$ gives us
$$\mathbb{E}\,\|A(\mathcal{D})-\widehat{\mu}(\mathcal{D})\|^2 = o\Bigl(G^2\min\Bigl\{M^2, \frac{d\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr).$$
We now post-process $A(\mathcal{D})$ to obtain a solution for the dataset $\widetilde{\mathcal{D}}$. With probability $\frac{1}{N^{1+\Omega(1)}} - \delta$, output $\widehat{\mu}(\widetilde{\mathcal{D}})$ as the solution; in the other case, output $\frac{n}{N}A(\mathcal{D})$; call this solution $\widetilde{\mu}$. As in the previous case, the first branch does not preserve privacy, while the second satisfies $(\epsilon,\delta)$-DP from the DP property of $A$ and post-processing. Hence, the combined method satisfies $\bigl(\epsilon, \frac{1}{N^{1+\Omega(1)}}\bigr)$-DP. Further, the accuracy guarantee is
$$\mathbb{E}\,\|\widetilde{\mu}-\widehat{\mu}(\widetilde{\mathcal{D}})\|^2 \lesssim \delta\,\mathbb{E}\,\Bigl\|\frac{n}{N}A(\mathcal{D})-\widehat{\mu}(\widetilde{\mathcal{D}})\Bigr\|^2 \le \delta\,\frac{n^2}{N^2}\,\mathbb{E}\,\|A(\mathcal{D})-\widehat{\mu}(\mathcal{D})\|^2 = o\Bigl(\frac{G^2 n^2}{N^2}\min\Bigl\{\delta M^2, \frac{d\,\delta\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr) = o\Bigl(G^2\min\Bigl\{1, \frac{d\log(N)}{N^2\epsilon^2}\Bigr\}\Bigr),$$
where the last equality follows from the setting of $N$ and from $\delta\log(1/\delta) \le \frac{1}{\log(N)}$, using that $N \ge 2$ when $Mn > 4$. This contradicts the lower bound in Eqn. (29) for $\delta = \frac{1}{N^{1+\Omega(1)}}$. Combining the three cases finishes the proof. ∎

Proof of Thm. B.2. The upper bound is obtained by the best of the following procedures: outputting zero for the $G^2$ term, our proposed method (Thm. 3.4) for the $\gamma^2G^2 + \frac{kG^2\log(1/\delta)}{n^2\epsilon^2}$ term, and the Gaussian mechanism for the $\frac{dG^2\log(1/\delta)}{n^2\epsilon^2}$ term.
For the lower bound, we first define the following instance classes:
$$\widehat{\mathcal{P}}_1(n,d,G,M) = \bigl\{\{z_1,\ldots,z_n\} : z_i\in\mathbb{R}^d,\; \|z_i\|\le G,\; \|\widehat{\mu}(\{z_i\}_{i=1}^n)\| \le MG\bigr\},$$
$$\widehat{\mathcal{P}}(n,d,G) = \bigl\{\{z_1,\ldots,z_n\} : z_i\in\mathbb{R}^d,\; \|z_i\|\le G\bigr\}.$$
The second instance class is the standard setting of mean estimation with bounded data, and it is known (see (Steinke and Ullman, 2015; Kamath and Ullman, 2020)) that for $\epsilon = O(1)$ and $2^{-\Omega(n)} \le \delta \le \frac{1}{n^{1+\Omega(1)}}$, we have
$$\min_{A:\,A \text{ is }(\epsilon,\delta)\text{-DP}}\;\max_{\mathcal{D}\in\widehat{\mathcal{P}}(n,d,G)} \mathbb{E}\,\|A(\mathcal{D})-\widehat{\mu}(\mathcal{D})\|^2 = \Theta\Bigl(\min\Bigl\{G^2, \frac{G^2 d\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr). \qquad (30)$$
For the first instance class, we showed a lower bound in Thm. B.1. We will now use these two results to construct a reduction argument for the lower bound claimed in the theorem statement. Suppose to the contrary that there exist $k,d,n\in\mathbb{N}$, $G,\gamma > 0$, and $\epsilon = O(1)$, $2^{-\Omega(n)} \le \delta \le \frac{1}{n^{1+\Omega(1)}}$, such that there exists an algorithm $A$ with the following guarantee:
$$\max_{\mathcal{D}\in\widehat{\mathcal{P}}(n,d,G,k,\gamma)} \mathbb{E}\,\|A(\mathcal{D})-\widehat{\mu}(\mathcal{D})\|^2 = o\Bigl(\min\Bigl\{G^2,\; \gamma^2G^2 + \frac{kG^2\log(1/\delta)}{n^2\epsilon^2},\; \frac{dG^2\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr).$$
Consider the instance classes $\widehat{\mathcal{P}}_1\bigl(n, d-k, \frac{G}{2}, \gamma\bigr)$ and $\widehat{\mathcal{P}}\bigl(n, k, \frac{G}{2}\bigr)$. Given datasets $\mathcal{D}_1 \in \widehat{\mathcal{P}}_1\bigl(n, d-k, \frac{G}{2}, \gamma\bigr)$ and $\mathcal{D}_2 \in \widehat{\mathcal{P}}\bigl(n, k, \frac{G}{2}\bigr)$, we will use the algorithm $A$ to solve DP mean estimation for both these datasets. Specifically, construct a dataset $\mathcal{D}$ by stacking each sample from $\mathcal{D}_2$ onto the corresponding sample from $\mathcal{D}_1$. Note that each sample in $\mathcal{D}$ has norm bounded by $G$. Further, $\widehat{\mu}(\mathcal{D}) = [\widehat{\mu}(\mathcal{D}_1), \widehat{\mu}(\mathcal{D}_2)]^{\top}$. Therefore, we have
$$\|\widehat{\mu}(\mathcal{D})_{\mathrm{tail}(k)}\| = \min_{\mu':\,\mu' \text{ is } k\text{-sparse}} \|\widehat{\mu}(\mathcal{D})-\mu'\| \le \bigl\|\widehat{\mu}(\mathcal{D}) - [\,0, \widehat{\mu}(\mathcal{D}_2)]^{\top}\bigr\| = \|\widehat{\mu}(\mathcal{D}_1)\| \le \gamma G.$$
We therefore have that $\mathcal{D} \in \widehat{\mathcal{P}}(n,d,G,k,\gamma)$. Apply the $(\epsilon,\delta)$-DP algorithm $A$ to get $A(\mathcal{D}) = [\widetilde{\mu}_1, \widetilde{\mu}_2]$. The accuracy guarantee of $A$ gives us
$$\mathbb{E}\,\|A(\mathcal{D})-\widehat{\mu}(\mathcal{D})\|^2 = \mathbb{E}\,\|\widetilde{\mu}_1-\widehat{\mu}(\mathcal{D}_1)\|^2 + \mathbb{E}\,\|\widetilde{\mu}_2-\widehat{\mu}(\mathcal{D}_2)\|^2 = o\Bigl(\min\Bigl\{G^2,\; \gamma^2G^2+\frac{kG^2\log(1/\delta)}{n^2\epsilon^2},\; \frac{dG^2\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr).$$
Note that both the outputs $\widetilde{\mu}_1$ and $\widetilde{\mu}_2$ satisfy $(\epsilon,\delta)$-DP, from the DP guarantee of $A$ and the post-processing property.
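The stacking step and its tail-norm bound can be sanity-checked numerically: zeroing out the $\mathcal{D}_2$ block is one feasible $k$-sparse approximation of the stacked mean, so the tail norm is at most $\|\widehat{\mu}(\mathcal{D}_1)\|$. A toy check (dimensions and values are ours, for illustration):

```python
# Numerical check of the stacking step in the proof of Thm. B.2:
# stacking samples from D1 (dimension d-k) and D2 (dimension k) gives
# a dataset whose mean's tail(k) norm is at most ||mean(D1)||.

def tail_norm_sq(z, k):
    # ||z_tail(k)||^2: sum of squares of all but the k largest magnitudes
    mags = sorted((abs(v) for v in z), reverse=True)
    return sum(v * v for v in mags[k:])

d1 = [[0.1, 0.05], [0.1, -0.05]]   # low-norm mean (plays D1, d-k = 2)
d2 = [[2.0], [4.0]]                # unconstrained mean (plays D2, k = 1)
stacked = [a + b for a, b in zip(d1, d2)]

mean = [sum(col) / len(stacked) for col in zip(*stacked)]  # [0.1, 0.0, 3.0]
mean1 = mean[:2]                   # equals mean(D1)
k = 1
norm1_sq = sum(v * v for v in mean1)
# Zeroing the D2 block is a feasible k-sparse approximation, so:
assert tail_norm_sq(mean, k) <= norm1_sq + 1e-12
```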
From the above, we get the following accuracy guarantees:
$$\mathbb{E}\,\|\widetilde{\mu}_1-\widehat{\mu}(\mathcal{D}_1)\|^2 = o\Bigl(\min\Bigl\{G^2, \gamma^2G^2+\frac{kG^2\log(1/\delta)}{n^2\epsilon^2}, \frac{dG^2\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr), \qquad (31)$$
$$\mathbb{E}\,\|\widetilde{\mu}_2-\widehat{\mu}(\mathcal{D}_2)\|^2 = o\Bigl(\min\Bigl\{G^2, \gamma^2G^2+\frac{kG^2\log(1/\delta)}{n^2\epsilon^2}, \frac{dG^2\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr). \qquad (32)$$
We now consider three cases: (a) $k > d/2$; (b) $k \le d/2$ and $\gamma^2 < \frac{k\log(1/\delta)}{n^2\epsilon^2}$; and (c) $k \le d/2$ and $\gamma^2 \ge \frac{k\log(1/\delta)}{n^2\epsilon^2}$.

For the first case, (a) $k > d/2$, from Eqn. (32) we get
$$\mathbb{E}\,\|\widetilde{\mu}_2-\widehat{\mu}(\mathcal{D}_2)\|^2 = o\Bigl(\min\Bigl\{G^2, \gamma^2G^2+\frac{kG^2\log(1/\delta)}{n^2\epsilon^2}, \frac{dG^2\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr) = o\Bigl(\min\Bigl\{G^2, G^2+\frac{kG^2\log(1/\delta)}{n^2\epsilon^2}, \frac{2kG^2\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr) = o\Bigl(\min\Bigl\{G^2, \frac{kG^2\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr),$$
where the second equality uses $\gamma^2 \le 1$ and the third equality uses that $d < 2k$.

For the second case, (b) $k \le d/2$ and $\gamma^2 < \frac{k\log(1/\delta)}{n^2\epsilon^2}$, we again use Eqn. (32) to get
$$\mathbb{E}\,\|\widetilde{\mu}_2-\widehat{\mu}(\mathcal{D}_2)\|^2 = o\Bigl(\min\Bigl\{G^2, \gamma^2G^2+\frac{kG^2\log(1/\delta)}{n^2\epsilon^2}, \frac{dG^2\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr) = o\Bigl(\min\Bigl\{G^2, \frac{2kG^2\log(1/\delta)}{n^2\epsilon^2}, \frac{dG^2\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr) = o\Bigl(\min\Bigl\{G^2, \frac{kG^2\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr),$$
where the second equality uses $\gamma^2 < \frac{k\log(1/\delta)}{n^2\epsilon^2}$ and the third equality uses $2k \le d$. Combining cases (a) and (b) together contradicts the lower bound for DP mean estimation, mentioned in Eqn. (30).

For the third case, (c) $k \le d/2$ and $\gamma^2 \ge \frac{k\log(1/\delta)}{n^2\epsilon^2}$, from Eqn. (31) we get
$$\mathbb{E}\,\|\widetilde{\mu}_1-\widehat{\mu}(\mathcal{D}_1)\|^2 = o\Bigl(\min\Bigl\{G^2, \gamma^2G^2+\frac{kG^2\log(1/\delta)}{n^2\epsilon^2}, \frac{dG^2\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr) = o\Bigl(\min\Bigl\{G^2, 2\gamma^2G^2, \frac{dG^2\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr) = o\Bigl(\min\Bigl\{\gamma^2G^2, \frac{(k+(d-k))G^2\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr) = o\Bigl(\min\Bigl\{\gamma^2G^2, \frac{2(d-k)G^2\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr) = o\Bigl(\min\Bigl\{\gamma^2G^2, \frac{(d-k)G^2\log(1/\delta)}{n^2\epsilon^2}\Bigr\}\Bigr),$$
where the second equality uses that $\gamma^2 \ge \frac{k\log(1/\delta)}{n^2\epsilon^2}$, and the second-to-last equality uses that $k \le d/2$ implies $k \le d-k$. This contradicts the lower bound for DP mean estimation in Thm. B.1, since $\mathcal{D}_1 \in \widehat{\mathcal{P}}_1\bigl(n, d-k, \frac{G}{2}, \gamma\bigr)$. Combining all the above cases finishes the proof. ∎

E. Proofs of Communication Lower Bounds for FME

Multi-round protocols with SecAgg: We set up some notation for multi-round protocols. For $K \in \mathbb{N}$, a $K$-round protocol $A$ consists of a sequence of encoding schemes, denoted $\{E_t\}_{t=1}^{K}$, and a decoding scheme $U$.
The encoding scheme in the $t$-th round, $E_t$, maps the SecAgg outputs $\{O_j\}_{j<t}$ of the previous rounds together with the client's data point to the $t$-th round message. For a set $P \subseteq \mathbb{R}^d$ and $\gamma > 0$, we use $N(P, \gamma, \|\cdot\|)$ and $N^{\mathrm{ext}}(P, \gamma, \|\cdot\|)$ to denote its covering and exterior covering numbers at scale $\gamma$; see Section 4.2 in (Vershynin, 2018) for definitions.

Proof of Corollary 3.2. This follows from the one-round compression lower bound for unbiased schemes, Theorem 5.3 in (Chen et al., 2022a). Specifically, (Chen et al., 2022a) showed that for any given $d, M$, and any unbiased encoding and decoding scheme with $\ell_2$ error at most $\alpha^2$, there exists a data point $z \in \mathbb{R}^d$ with $\|z\| \le M$ such that its encoding requires $\Omega\bigl(\frac{dM^2}{\alpha^2}\bigr)$ bits. We first extend the same lower bound to multi-round protocols. Assume to the contrary that there exist a $K > 0$ and a $K$-round protocol with encoding and decoding schemes $\{E_i\}_{i=1}^K$ and $U$, respectively, such that the total size of messages communicated is $\sum_{i=1}^K b_i = o\bigl(\frac{dM^2}{\alpha^2}\bigr)$ bits. Now, note that the set of possible encoding schemes is range- (or communication-) restricted, but otherwise arbitrary. Thus, we can construct a one-round encoding scheme which simulates the $K$-round scheme $\{E_i\}_{i=1}^K$. Specifically, it first encodes the data point using $E_1$ to get a message of size $b_1$, then uses $E_2$ on the output and the data to get a message of size $b_2$, and so on, until it has used all $K$ encoding schemes. Finally, it concatenates all the messages and sends them to the server, which decodes them using $U$. The total size of the messages is thus $\sum_{i=1}^K b_i = o\bigl(\frac{dM^2}{\alpha^2}\bigr)$, which contradicts the result in (Chen et al., 2022a), and hence establishes the same lower bound for multi-round schemes.

We now extend the compression lower bound by reducing FME under SecAgg to compression. We simply define the probability distribution to be supported on a single data point $z$, so that all clients have the same data point. Due to the SecAgg constraint, the decoder simply observes the average of the messages, i.e., the same number of bits as communicated by each client.
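The simulation step above — a single client running all $K$ encoders locally, feeding each one the previous messages, and concatenating everything into one message — can be sketched as follows. The two toy encoders standing in for $E_1, E_2$ are hypothetical (ours, for illustration only):

```python
# Sketch of the one-round simulation of a K-round protocol: one client
# runs all K round-encoders locally, feeding each the previous rounds'
# messages, then concatenates the messages into a single message.

def simulate_one_round(encoders, z):
    messages = []
    for enc in encoders:            # E_1, ..., E_K in order
        messages.append(enc(tuple(messages), z))
    return tuple(messages)          # the concatenated single message

# Hypothetical 2-round scheme: round 1 sends the sign of z, round 2
# sends a refinement that depends on the round-1 message.
e1 = lambda prev, z: 1 if z >= 0 else -1
e2 = lambda prev, z: z - prev[0]    # residual after the round-1 guess

msg = simulate_one_round([e1, e2], 1.25)
assert msg == (1, 0.25)
# The server decodes the concatenation exactly as U would round by round:
decode = lambda m: m[0] + m[1]
assert decode(msg) == 1.25
```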
Hence, if there exists a multi-round, SecAgg-compatible scheme for FME on this instance with total per-client communication of $o\bigl(\frac{dM^2}{\alpha^2}\bigr)$ bits, then we can simulate it to solve the (one-client) compression problem with the same communication complexity, implying a contradiction. Finally, plugging in the optimal error $\alpha^2 = \min\bigl\{M^2, \frac{G^2}{n} + \frac{G^2 d\log(1/\delta)}{n^2\epsilon^2}\bigr\}$, whose optimality is established in Thm. B.1, gives us the claimed bound. ∎

Proof of Thm. 3.3. It suffices to show that a one-round (i.e., $K = 1$), SecAgg-compatible, unbiased communication protocol cannot be adaptive, i.e., cannot have optimal error and optimal communication complexity simultaneously for all values of the norm of the mean of the instance. Note that such a protocol $A$ gets as inputs $n, d, G, \epsilon, \delta$, and outputs an encoding scheme $E : \mathbb{R}^d \to O$ and a decoding scheme $U : O \to \mathbb{R}^d$, where $O$ is the output space of the encoding scheme and can be identified with the finite field over which SecAgg operates. In the proof of Corollary 3.2, we showed that the communication complexity for optimal error, even under knowledge of $M = \|\mu(D)\|$, is
$$b = \Omega\Bigl(\min\Bigl\{d,\; \frac{M^2 n^2\epsilon^2}{G^2\log(1/\delta)},\; \frac{M^2 n d}{G^2}\Bigr\}\Bigr) \text{ bits.}$$
To show that, we proved a compression lower bound: for any protocol $A$, there exists a distribution $D$ supported on a single point $z$ such that $\|\mu(D)\| = M$, and $E(z)$ has size at least $b$ bits w.p. 1. We modify this by defining a distribution $\widetilde{D}$ supported on $\{0, z\}$ such that $\mathbb{P}_{\widetilde{D}}(z) = \frac{\ln n}{n}$. Note that the norm of the mean shrinks to $\frac{M\ln(n)}{n} \ll M$, and hence the optimal communication complexity for such a distribution is smaller. By direct computation, it follows that upon sampling $n$ i.i.d. points from $\widetilde{D}$, with probability at least $1 - \frac{1}{n}$, there is at least one client with data point $z$. However, the encoding function $E$, oblivious to the knowledge of the data points of the other clients, still produces an encoding of $z$ in $b$ bits, and thus fails to be communication-optimal with probability at least $1 - \frac{1}{n}$. ∎

Theorem E.1.
For any $K > 0$, set $\mathcal{M} \subseteq \mathbb{R}^d$, $K$-round encoding schemes $\{E_i\}_{i=1}^K$, and decoding scheme $U$: if the following holds,
$$\max_{D:\,\mu(D)\in\mathcal{M}} \mathbb{E}_{\{E_i\},U,\,\mathcal{D}\sim D^n}\,\bigl\|U\bigl(\{A_{E_i}(\mathcal{D})\}_{i\le K}\bigr)-\mu(D)\bigr\|^2 \le \alpha^2,$$
then the total communication satisfies $\sum_{i=1}^K b_i \ge \log\bigl(N^{\mathrm{ext}}(\mathcal{M}, \alpha, \|\cdot\|_2)\bigr)$.

Proof. The proof follows the covering argument in (Chen et al., 2022a). Given a dataset $\mathcal{D} = \{z_i\}_{i=1}^n$ distributed among $n$ clients, define the outputs of SecAgg in the multi-round scheme recursively as $O_1 = \mathrm{SecAgg}\bigl(\{E_1(z_i)\}_{i=1}^n\bigr)$ and $O_t = \mathrm{SecAgg}\bigl(\bigl\{E_t\bigl(\{O_j\}_{j<t}, z_i\bigr)\bigr\}_{i=1}^n\bigr)$.