Privacy Amplification via Random Check-Ins

Borja Balle, Peter Kairouz, H. Brendan McMahan, Om Thakkar, Abhradeep Thakurta

DeepMind. borja.balle@gmail.com
Google. {kairouz, mcmahan, omthkkr, athakurta}@google.com

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Differentially Private Stochastic Gradient Descent (DP-SGD) forms a fundamental building block in many applications for learning over sensitive data. Two standard approaches, privacy amplification by subsampling and privacy amplification by shuffling, permit adding lower noise in DP-SGD than via naïve schemes. A key assumption in both these approaches is that the elements in the data set can be uniformly sampled, or be uniformly permuted; these constraints may become prohibitive when the data is processed in a decentralized or distributed fashion. In this paper, we focus on conducting iterative methods like DP-SGD in the setting of federated learning (FL), wherein the data is distributed among many devices (clients). Our main contribution is the random check-in distributed protocol, which crucially relies only on randomized participation decisions made locally and independently by each client. It has privacy/accuracy trade-offs similar to privacy amplification by subsampling/shuffling. However, our method does not require server-initiated communication, or even knowledge of the population size. To our knowledge, this is the first privacy amplification tailored for a distributed learning framework, and it may have broader applicability beyond FL. Along the way, we improve the privacy guarantees of amplification by shuffling and show that, in practical regimes, this improvement allows for similar privacy and utility using data from an order of magnitude fewer users.

1 Introduction

Modern mobile devices and web services benefit significantly from large-scale machine learning, often involving training on user (client) data. When such data is sensitive, steps must be taken to ensure privacy, and a formal guarantee of differential privacy (DP) [16, 15] is the gold standard. For this reason, DP has been adopted by companies including Google [20, 10, 18], Apple [2], Microsoft [13], and LinkedIn [31], as well as the US Census Bureau [26]. Other privacy-enhancing techniques can be combined with DP to obtain additional benefits. In particular, cross-device federated learning (FL) [27] allows model training while keeping client data decentralized (each participating device keeps its own local dataset, and only sends model updates or gradients to the coordinating server). However, existing approaches to combining FL and DP make a number of assumptions that are unrealistic in real-world FL deployments such as [11].

To highlight these challenges, we must first review the state of the art in centralized DP training, where differentially private stochastic gradient descent (DP-SGD) [33, 9, 1] is ubiquitous. It achieves optimal error for convex problems [9], and can also be applied to non-convex problems, including deep learning, where the privacy amplification offered by randomly subsampling data to form batches is critical for obtaining meaningful DP guarantees under computational constraints [25, 9, 1, 5, 35]. Attempts to combine FL and the above lines of DP research have been made previously; notably, [28, 3] extended the approach of [1] to FL and user-level DP.
However, these works and others in the area sidestep a critical issue: the DP guarantees require very specific sampling or shuffling schemes, assuming, for example, that each client participates in each iteration with a fixed probability. While possible in theory, such schemes are incompatible with the practical constraints and design goals of cross-device FL protocols [11]; to quote [23], a comprehensive recent FL survey, such a sampling procedure is nearly impossible in practice.³ The fundamental challenge is that clients decide when they will be available for training and when they will check in to the server, and by design the server cannot index specific clients. In fact, it may not even know the size of the participating population. Our work targets these challenges.

³ In cross-silo FL applications [23], an enumerated set of addressable institutions or data silos participate in FL, and so explicit server-mediated subsampling or shuffling using existing techniques may be feasible.

Our primary goal is to provide strong central DP guarantees for the final model released by FL-like protocols, under the assumption of a trusted⁴ orchestrating server. This is accomplished by building upon recent work on amplification by shuffling [19, 12, 18, 22, 6] and combining it with new analysis techniques targeting FL-specific challenges (e.g., client-initiated communications, a non-addressable global population, and constrained client availability). We propose the first privacy amplification analysis specifically tailored for distributed learning frameworks. At the heart of our result is a novel technique, called random check-in, that relies only on randomness independently generated by each individual client participating in the training procedure. We show that distributed learning protocols based on random check-ins can attain privacy gains similar to privacy amplification by subsampling/shuffling (see Table 1 for a comparison), while requiring minimal coordination from the server. While we restrict our exposition to distributed DP-SGD within the FL framework for clarity and concreteness (see Figure 1 for a schematic of one of our protocols), we note that the techniques used in our analyses are broadly applicable to any distributed iterative method and might be of interest in other applications.⁵

Figure 1: A schematic of the Random Check-Ins protocol with Fixed Windows (Section 3.1) for Distributed DP-SGD (Algorithm 1). For the central DP guarantee, all solid arrows represent communication over privileged channels not accessible to any external adversary. (a) n clients performing random check-ins with a fixed window of m time steps. X denotes that the client randomly chose to abstain from participating. (b) A time step at the server, where for training time i ∈ [m], the server selects a client j from those who checked in for time i, requests an update for model θi, and then updates the model to θi+1 (or the gradient accumulator if using minibatches).

Contributions. The main contributions of this paper can be summarized as follows:

1. We propose random check-ins, the first privacy amplification technique for distributed systems with minimal server-side overhead. We also instantiate three distributed learning protocols that use random check-ins, each addressing different natural constraints that arise in applications.

2. We provide formal privacy guarantees for our protocols, and show that random check-ins attain similar rates of privacy amplification as subsampling and shuffling while reducing the need for
server-side orchestration. We also provide utility guarantees for one of our protocols in the convex case that match the optimal privacy/accuracy trade-offs for DP-SGD in the central setting [8].

3. As a byproduct of our analysis, we improve privacy amplification by shuffling [19] on two fronts. For ε0-DP local randomizers, we improve the dependency of the final central DP ε by O(e^{0.5ε0}). Figure 2 provides a numerical comparison of the bound from [19] with our bound; for typical parameter values, this improvement allows us to provide similar privacy guarantees while reducing the number of required users by one order of magnitude. We also extend the analysis to the case of (ε0, δ0)-DP local randomizers, including Gaussian randomizers that are widely used in practice.

⁴ Notably, our guarantees are obtained by amplifying the privacy provided by local DP randomizers; we treat this use of local DP as an implementation detail in accomplishing the primary goal of central DP. As a byproduct, our approach offers (weaker) local DP guarantees even in the presence of an untrusted server.

⁵ In particular, the Federated Averaging [27] algorithm, which computes an update based on multiple local SGD steps rather than a single gradient, can immediately be plugged into our framework.

Figure 2: Values of ε (for δ = 10⁻⁶) after amplification by shuffling of ε0-DP local randomizers obtained from: Theorem 5.1 (solid lines) and [19, Theorem 7] (dotted lines). The grey line represents the threshold of no amplification (ε = ε0); after crossing the line, amplification bounds become vacuous. Observe that our bounds with n = 10³ and n = 10⁴ are similar to the bounds from [19] with n = 10⁴ and n = 10⁵, respectively.

Related work. Our work considers the paradigm of federated learning as a stylized example throughout the paper. We refer the reader to [23] for an excellent overview of the state of the art in federated learning, along with a suite of interesting open problems. There is a rich literature studying differentially private ERM via DP-SGD [33, 9, 1, 36, 34, 30]. However, constraints such as limited availability in distributed settings restrict direct applications of existing techniques. There is also a growing line of work on privacy amplification by shuffling [10, 19, 12, 4, 6, 22, 18] that focuses on various ways in which protocols can be designed using trusted shuffling primitives. Lastly, privacy amplification by iteration [21] is another recent advancement that can be applied in an iterative distributed setting, but it is limited to convex objectives.

2 Background and Problem Formulation

Differential Privacy. We first define neighboring data sets. We refer to a pair of data sets D, D′ ∈ D^n as neighbors if D′ can be obtained from D by modifying one sample di for some i ∈ [n].

Definition 2.1 (Differential privacy [16, 15]). A randomized algorithm A : D^n → 𝒮 is (ε, δ)-differentially private if, for any pair of neighboring data sets D, D′ ∈ D^n, and for all events S ⊆ 𝒮 in the output range of A, we have Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S] + δ.

For meaningful central DP guarantees (i.e., when n > 1), ε is assumed to be a small constant, and δ ≪ 1/n. The case δ = 0 is often referred to as pure DP (in which case, we just write ε-DP). Specializing Definition 2.1 to the case n = 1 gives what we call a local randomizer, which provides a local DP guarantee.
Local randomizers are the typical building blocks of local DP protocols, where individuals privatize their data before sending it to an aggregator for analysis [25]. Adaptive DP mechanisms occur naturally when constructing complex DP algorithms, e.g., DP-SGD. In addition to the dataset D, adaptive mechanisms also receive as input the output of other DP mechanisms. Formally, we say that an adaptive mechanism A : S′ × D^n → S is (ε, δ)-DP if the mechanism A(s′, ·) is (ε, δ)-DP for every s′ ∈ S′.

Problem Setup. The distributed learning setup we consider here involves n clients, where each client j ∈ [n] holds a data record⁶ dj ∈ D, forming a distributed data set D = (d1, . . . , dn). We assume a coordinating server wants to train the parameters θ ∈ Θ of a model by using the dataset D to perform stochastic gradient descent steps according to some loss function ℓ : D × Θ → R+. The server's goal is to protect the privacy of all the individuals in D by providing strong DP guarantees against an adversary that can observe the final trained model as well as all the intermediate model parameters. We assume the server is trusted, all devices are honest-but-curious (i.e., they adhere to the prescribed protocol, and there are no malicious users), and all server-client communications are privileged (i.e., they cannot be detected or eavesdropped on by an external adversary).

⁶ Each client is identified as a user. In a general FL setting, each dj can correspond to a local data set [11].

The server starts with model parameters θ1 and produces a sequence of model parameters θ2, . . . , θm+1 over a sequence of m time slots. Our random check-ins technique allows clients to independently decide when to offer their contributions for a model update. If and when a client's contribution is accepted by the server, she uses the current parameters θ and her data d to send a privatized gradient of the form Aldp(∇θ ℓ(d, θ)) to the server, where Aldp is a DP local randomizer (e.g., performing gradient clipping and adding Gaussian noise [1]).

Our results consider three different setups inspired by practical applications [11]: (1) The server uses m ≤ n time slots, where at most one user's update is used in each slot, for a total of m/b minibatch SGD iterations. It is assumed all n users are available for the duration of the protocol, but the server does not have enough bandwidth to process updates from every user (Section 3.1); (2) The server uses m ≈ n/b time slots, and all n users are available for the duration of the protocol (Section 4.1). On average, b users contribute updates to each time slot, and so we take m minibatch SGD steps; (3) As with (2), but each user is only available during a small window of time relative to the duration of the protocol (Section 4.2).
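The protocols below treat Aldp abstractly. As a point of reference, the following minimal Python sketch shows one common way such a local randomizer can be realized, by L2-clipping the gradient and adding Gaussian noise. The function name, parameter values, and the textbook Gaussian-mechanism calibration σ = C √(2 ln(1.25/δ0)) / ε0 are illustrative assumptions of ours, not the paper's prescribed implementation; the analysis in this paper only requires that Aldp be an (ε0, δ0)-DP local randomizer.

# Illustrative sketch of a local randomizer A_ldp: clip, then add Gaussian noise.
import numpy as np

def a_ldp(grad, clip_norm=1.0, eps0=1.0, delta0=1e-6, rng=None):
    """Privatize one gradient: L2-clip to clip_norm, then add Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(grad)
    clipped = grad * min(1.0, clip_norm / max(norm, 1e-12))   # sensitivity <= clip_norm
    # Standard Gaussian-mechanism calibration for (eps0, delta0)-DP (assumed, eps0 < 1).
    sigma = clip_norm * np.sqrt(2.0 * np.log(1.25 / delta0)) / eps0
    return clipped + rng.normal(scale=sigma, size=grad.shape)

if __name__ == "__main__":
    g = np.array([3.0, -4.0])       # raw gradient with norm 5
    print(a_ldp(g, clip_norm=1.0, eps0=0.5))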
3 Distributed Learning with Random Check-Ins

This section presents the random check-ins technique for privacy amplification in the context of distributed learning. We formally define the random check-ins procedure, describe a fully distributed DP-SGD protocol with random check-ins, and analyze its privacy and utility guarantees.

3.1 Random Check-Ins with a Fixed Window

Consider the distributed learning setup described in Section 2, where each client is willing to participate in the training procedure as long as their data remains private. To boost the privacy guarantees provided by the local randomizer Aldp, we will let clients volunteer their updates at a random time slot of their choosing. This randomization has an effect on the uncertainty about the use of an individual's data in a particular update similar to that provided by uniform subsampling or shuffling. We formalize this concept using the notion of random check-in, which can be informally expressed as a client in a distributed iterative learning framework randomizing their instant of participation, and determining with some probability whether to participate in the process at all.

Definition 3.1 (Random check-in). Let A be a distributed learning protocol with m check-in time slots. For a set Rj ⊆ [m] and probability pj ∈ [0, 1], client j performs an (Rj, pj)-check-in in the protocol if with probability pj she requests the server to participate in A at a time step I chosen u.a.r. from Rj, and otherwise abstains from participating. If pj = 1, we alternatively denote it as an Rj-check-in.

Our first distributed learning protocol based on random check-ins is presented in Algorithm 1. Client j independently decides in which of the possible time steps (if any) she is willing to participate by performing an (Rj, pj)-check-in. We set Rj = [m] for all j ∈ [n], and assume⁷ all n clients are available throughout the duration of the protocol. On the server side, at each time step i ∈ [m], a random client Ji among all the ones that checked in at time i is queried: this client receives the current model θi, locally computes a gradient update ∇θ ℓ(dJi, θi) using their data dJi, and returns to the server a privatized version of the gradient obtained using a local randomizer Aldp. Clients checked in at time i that are not selected do not participate in the training procedure. If at time i no client is available, the server adds a dummy gradient to update the model.

⁷ We make this assumption only for utility; the privacy guarantees are independent of this assumption.

Server-side protocol:
    parameters: local randomizer Aldp, number of steps m
    Initialize model θ1 ∈ Θ
    Initialize gradient accumulator ḡ1 ← 0^p
    for i ∈ [m] do
        Si ← {j : User(j) checked in at time i}
        if Si is empty then
            gi ← Aldp(0^p)    // Dummy gradient
        else
            Sample Ji u.a.r. from Si
            Request User(Ji) for update to model θi
            Receive gi from User(Ji)
        (θi+1, ḡi+1) ← ModelUpdate(θi, ḡi + gi, i)
        Output θi+1

Client-side protocol for User(j):
    parameters: check-in window Rj, check-in probability pj, loss function ℓ, local randomizer Aldp
    private inputs: datapoint dj ∈ D
    if a pj-biased coin returns heads then
        Check in with the server at time I chosen u.a.r. from Rj
    if a request for an update to model θI is received then
        gI ← Aldp(∇θ ℓ(dj, θI))
        Send gI to the server

ModelUpdate(θ, ḡ, i):
    parameters: batch size b, learning rate η
    if i mod b = 0 then
        return (θ − (η/b) · ḡ, 0^p)    // Gradient descent step
    else
        return (θ, ḡ)    // Skip update

Algorithm 1: Afix, Distributed DP-SGD with random check-ins (fixed window).
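To make the mechanics of Algorithm 1 concrete, the following self-contained Python sketch simulates Afix on a toy mean-estimation problem: every client performs an ([m], p0)-check-in locally, and the server uses one randomly selected checked-in client per time step (with batch size b = 1), or a dummy gradient when a step is empty. The toy loss, the clip-and-Gaussian-noise randomizer, and all hyperparameter values are illustrative choices of ours, not part of the protocol specification.

# Toy simulation of A_fix (Algorithm 1) with R_j = [m] and check-in probability p0.
import numpy as np

rng = np.random.default_rng(0)
n, m, p0, eta, sigma, clip = 10_000, 1_000, 0.1, 0.05, 1.0, 1.0
data = rng.normal(size=(n, 2))                    # d_j in R^2; goal: estimate the mean
theta = np.zeros(2)

# Client side: each client flips a p0-biased coin and picks a slot u.a.r. from [m].
checkins = {i: [] for i in range(m)}
for j in range(n):
    if rng.random() < p0:
        checkins[int(rng.integers(m))].append(j)

def a_ldp(grad):
    """Illustrative local randomizer: L2-clip and add Gaussian noise."""
    grad = grad * min(1.0, clip / max(np.linalg.norm(grad), 1e-12))
    return grad + rng.normal(scale=sigma, size=grad.shape)

dummies = 0
for i in range(m):                                # server side: one update per slot
    if not checkins[i]:
        g = a_ldp(np.zeros(2))                    # dummy gradient
        dummies += 1
    else:
        j = rng.choice(checkins[i])               # sample J_i u.a.r. from S_i
        g = a_ldp(theta - data[j])                # gradient of 0.5 * ||theta - d_j||^2
    theta = theta - eta * g                       # ModelUpdate with b = 1
print("final theta:", theta, "| dummy updates:", dummies)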
3.2 Privacy Analysis

From a privacy standpoint, Algorithm 1 shares an important pattern with DP-SGD: each model update uses noisy gradients obtained from a random subset of the population. However, two key factors make the privacy analysis of our protocol more challenging than the corresponding analyses via subsampling/shuffling. First, unlike in the case of uniform sampling, where the randomness in each update is independent, here there is a correlation induced by the fact that clients that check in to one step cannot check in to a different step. Second, in shuffling there is also a similar correlation between updates, but there we can ensure each update uses the same number of datapoints, while here the server does not control the number of clients that will check in to each individual step. Nonetheless, the following result shows that random check-ins provide privacy amplification comparable to these techniques.

Theorem 3.2 (Amplification via random check-ins into a fixed window). Suppose Aldp is an ε0-DP local randomizer. Let Afix : D^n → Θ^m be the protocol from Algorithm 1 with check-in probability pj = p0 and check-in window Rj = [m] for each client j ∈ [n]. For any δ ∈ (0, 1), algorithm Afix is (ε, δ)-DP with

    ε = p0 (e^{ε0} − 1) √(2 e^{ε0} log(1/δ) / m) + p0² e^{ε0} (e^{ε0} − 1)² / (2m).

In particular, for ε0 < 1 and δ < 1/100, we get ε ≤ 7 p0 ε0 √(log(1/δ) / m). Furthermore, if Aldp is (ε0, δ0)-DP with δ0 ≤ (1 − e^{−ε0}) δ1 / (4 e^{ε0} (2 + ln(2/δ1) / ln(1/(1 − e^{−5ε0})))), then Afix is (ε′, δ′)-DP with

    ε′ = p0² e^{8ε0} (e^{8ε0} − 1)² / (2m) + p0 (e^{8ε0} − 1) √(2 e^{8ε0} log(1/δ) / m)    and    δ′ = δ + m (e^{ε′} + 1) δ0.

Remark 1. We can always increase privacy in the above statement by decreasing p0. However, this will also increase the number of dummy updates, which suggests choosing p0 = Θ(m/n). With such a choice, we obtain an amplification factor of √m/n. Critically, however, exact knowledge of the population size is not required to have a precise DP guarantee above.

Remark 2. At first look, the amplification factor of √m/n may appear stronger than the typical 1/√n factor obtained via uniform subsampling/shuffling. Note that one run of our technique provides m updates (as opposed to n updates via the other methods). When the server has sufficient capacity, we can set m = n to recover a 1/√n amplification. The primary advantage of our approach is that we can benefit from amplification in terms of n even if only a much smaller number of updates are actually processed. We can also extend our approach to recover the 1/√n amplification even when the server is rate limited (p0 = m/n), by repeating the protocol Afix adaptively n/m times to get Corollary 3.3 from Theorem 3.2 and applying advanced composition for DP [17].

Corollary 3.3. For algorithm Afix : D^n → Θ^m in Theorem 3.2, suppose Aldp is an ε0-DP local randomizer s.t. ε0 ≤ 2 log(n/(8√m))/3, and n ≥ (e^{ε0} − 1)² e^{ε0} √(m log(1/β)). Setting p0 = m/n, and running n/m repetitions of Afix, results in a total of n updates and an overall central (ε, δ)-DP guarantee with ε = O(e^{1.5ε0}/√n) and δ ∈ (0, 1), where O(·) hides polylog factors in 1/β and 1/δ.

Comparison to Existing Privacy Amplification Techniques. Table 1 provides a comparison of the bound in Corollary 3.3 to other existing techniques, for performing one epoch of training (i.e., using one update from each client). For this comparison, we assume that ε0 > 1, since for ε0 ≤ 1 all the shown amplification bounds can be written as O(ε0/√n). "None" denotes a naive scheme (with no privacy amplification) where each client is used once in any arbitrary order. Also, in general, the guarantees via privacy amplification by subsampling/shuffling apply only under the assumption of complete participation availability⁸ of each client. Thus, they define the upper limits of achieving such amplifications. Lastly, even though the bound in Corollary 3.3 appears better than amplification via shuffling, our technique does include dummy updates, which do not occur in the other techniques. For linear optimization problems, it is easy to see that our technique will add a factor of e more noise as compared to the other two amplification techniques at the same privacy level.
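For concreteness, the closed-form bound of Theorem 3.2 for pure-DP local randomizers is easy to evaluate numerically. The small sketch below does so using the choice p0 = m/n suggested in Remark 1; the formula is transcribed from the theorem statement above, and the specific values of m, n, and δ are illustrative.

# Evaluate the central (eps, delta) guarantee of Theorem 3.2 (pure-DP case).
import math

def eps_fixed_window(eps0, p0, m, delta):
    """Central eps for A_fix with ([m], p0)-check-ins and an eps0-DP local randomizer."""
    a = math.exp(eps0)
    return (p0 * (a - 1) * math.sqrt(2 * a * math.log(1 / delta) / m)
            + p0 ** 2 * a * (a - 1) ** 2 / (2 * m))

if __name__ == "__main__":
    m, n, delta = 1_000, 100_000, 1e-6
    p0 = m / n                                    # Remark 1 with c = 1
    for eps0 in (0.25, 0.5, 1.0, 2.0):
        print(f"eps0 = {eps0:<4}  ->  central eps ~ {eps_fixed_window(eps0, p0, m, delta):.4f}")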
    Source of Privacy Amplification                                  | ε for Central DP
    None [14, 32]                                                    | ε0
    Uniform subsampling [25, 9, 1]                                   | O(e^{ε0}/√n)
    Shuffling [19]                                                   | O(e^{3ε0}/√n)
    Shuffling (Theorem 5.1, this paper)                              | O(e^{2.5ε0}/√n)
    Random check-ins with a fixed window (Theorem 3.2, this paper)   | O(e^{1.5ε0}/√n)

Table 1: Comparison with existing amplification techniques for a data set of size n, running n iterations of DP-SGD with a batch size of 1 and ε0-DP local randomizers. For ease of exposition, we assume (e^{ε0} − 1) ≈ e^{ε0}, and hide polylog factors in n and 1/δ.

Proof Sketch for Theorem 3.2. Here, we provide a summary of the argument⁹ used to prove Theorem 3.2 in the case δ0 = 0. First, note that it is enough to argue about the privacy of the sequence of noisy gradients g1:m, by post-processing. Also, the role each client plays in the protocol is symmetric, so w.l.o.g. we can consider two datasets D, D′ differing in the first position. Next, we imagine that the last n − 1 clients make the same random check-in choices in Afix(D) and Afix(D′). Letting ci denote the number of such clients that check in to step i ∈ [m], we model these choices by a pair of sequences F = (d̄1:m, w1:m), where d̄i ∈ D ∪ {⊥} is the data record of an arbitrary client who checked in to step i (with ⊥ representing a dummy data record if no client checked in), and wi = 1/(ci + 1) represents the probability that client 1's data will be picked to participate in the protocol at step i if she checks in to step i. Conditioned on these choices, the noisy gradients g1:m produced by Afix(D) can be obtained by: (1) initializing a dataset D̄ = (d̄1:m); (2) sampling I u.a.r. from [m], and replacing d̄I with d1 in D̄ with probability p0 wI; (3) producing the outputs g1:m by applying a sequence of ε0-DP adaptive local randomizers to D̄ by setting gi = A(i)(d̄i, g1:i−1). Here each of the A(i) uses all past gradients to compute the model θi and returns gi = Aldp(∇θ ℓ(d̄i, θi)). The final step involves a variant of the amplification-by-swapping technique [19, Theorem 8], which we call amplification by random replacement. The key idea is to reformulate the composition of the A(i) applied to the random dataset D̄ as a composition of mechanisms of the form gi = B(i)(d1, F, g1:i−1). Mechanism B(i) uses the gradient history to compute qi = Pr[I = i | g1:i−1] and returns A(i)(d1, g1:i−1) with probability p0 wi qi, and A(i)(d̄i, g1:i−1) otherwise. Note that before the process begins, we have Pr[I = i] = 1/m for every i; our analysis shows that the posterior probability after observing the first i − 1 gradients is not too far from the prior: qi ≤ e^{ε0} / (m e^{ε0} − (e^{ε0} − 1)(i − 1)). The desired bound is then obtained by using the overlapping-mixtures technique [5] to show that B(i) is log(1 + p0 qi (e^{ε0} − 1))-DP with respect to changes in d1, and heterogeneous advanced composition [24] to compute the final ε of composing the B(i) adaptively.

⁸ By complete participation availability for a client, we mean that the client should be available to participate when requested by the server for any time step(s) of training.

⁹ Full proofs for every result in the paper are provided in the full version [7].

3.3 Utility Analysis

Proposition 3.4 (Dummy updates in random check-ins with a fixed window). For algorithm Afix : D^n → Θ^m described in Theorem 3.2, the expected number of dummy updates performed by the server is at most m (1 − p0/m)^n. For c > 0, if p0 = cm/n, we get at most m/e^c expected dummy updates.
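Proposition 3.4 is also easy to check empirically. The short Monte Carlo sketch below simulates the check-in pattern and compares the average number of empty slots against m(1 − p0/m)^n and the m/e^c approximation; the parameter values and number of trials are illustrative.

# Monte Carlo check of the dummy-update count in Proposition 3.4.
import numpy as np

rng = np.random.default_rng(1)
n, m, c = 50_000, 500, 1.0
p0 = c * m / n

def empty_slots():
    counts = np.zeros(m, dtype=int)
    participating = rng.random(n) < p0                       # p0-biased coins
    slots = rng.integers(m, size=int(participating.sum()))   # u.a.r. check-in slots
    np.add.at(counts, slots, 1)
    return int((counts == 0).sum())

trials = 200
avg = np.mean([empty_slots() for _ in range(trials)])
print(f"empirical avg empty slots: {avg:.1f}")
print(f"m*(1 - p0/m)^n = {m * (1 - p0 / m) ** n:.1f},   m/e^c = {m / np.e ** c:.1f}")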
Utility for Convex ERMs. We now instantiate our amplification result (Theorem 3.2) in the context of DP empirical risk minimization (ERM). For convex ERMs, we will show that DP-SGD [33, 9, 1] in conjunction with Theorem 3.2 is capable of achieving optimal privacy/accuracy trade-offs [9].

Theorem 3.5 (Utility guarantee). Suppose in algorithm Afix : D^n → Θ^m described in Theorem 3.2 the loss ℓ : D × Θ → R+ is L-Lipschitz and convex in its second parameter, and the model space Θ has dimension p and diameter R, i.e., sup_{θ,θ′∈Θ} ∥θ − θ′∥ ≤ R. Furthermore, let 𝒟 be a distribution on D, define the population risk L(𝒟; θ) = E_{d∼𝒟}[ℓ(d; θ)], and let θ* = argmin_{θ∈Θ} L(𝒟; θ). If Aldp is a local randomizer that adds Gaussian noise with variance σ², and the learning rate for a model update at step i ∈ [m] is set to be ηi = R (1 − 2e^{−np0/m}) / √((pσ² + L²) i), then the output θm of Afix(D) on a dataset D containing n i.i.d. samples from 𝒟 satisfies¹⁰

    E_{D,θm}[L(𝒟; θm)] − L(𝒟; θ*) = Õ( √(pσ² + L²) · R / ((1 − 2e^{−np0/m}) √m) ).

¹⁰ Here, Õ(·) hides a polylog factor in m.

Remark 3. Note that as m → n, it is easy to see for p0 = Ω(m/n) that Theorem 3.5 achieves the optimal population risk trade-off [9, 8].

4 Variations: Thrifty Updates, and Sliding Windows

This section presents two variants of the main protocol from the previous section. The first variant makes better use of the updates provided by each user, at the expense of a small increase in the privacy cost. The second variant allows users to check in to a sliding window, to model the case where different users might be available during different time windows.

4.1 Leveraging Updates from Multiple Users

Server-side protocol:
    parameters: total update steps m
    Initialize model θ1 ∈ R^p
    for i ∈ [m] do
        Si ← {j : User(j) checks in for index i}
        if Si is empty then
            θi+1 ← θi
        else
            gi ← 0
            for j ∈ Si do
                Request User(j) for update to model θi
                Receive gi,j from User(j)
                gi ← gi + gi,j
            θi+1 ← θi − (η / |Si|) · gi
        Output θi+1

Algorithm 2: Aavg, Distributed DP-SGD with random check-ins (averaged updates).

Now, we present a variant of Algorithm 1 which, at the expense of a mild increase in the privacy cost, removes the need for dummy updates, and for discarding all but one of the clients checked in at every time step. The server-side protocol of this version is given in Algorithm 2 (the client-side protocol is identical to that of Algorithm 1). Here, if no client checked in at some step i ∈ [m], the server skips the update. Furthermore, if at some step multiple clients checked in, the server requests gradients from all of the clients, and performs an update using the average of the submitted noisy gradients. These changes have the advantage of reducing the noise in the model from dummy updates, and of increasing the algorithm's data efficiency by using gradients provided by all available clients.

The privacy analysis here becomes more challenging, as (1) the adversary gains information about the time steps where no clients checked in, and (2) the server uses the potentially non-private count |Si| of clients checked in at time i when performing the model update. Nonetheless, we show that the privacy guarantees of Algorithm 2 are similar to those of Algorithm 1 with an additional O(e^{3ε0/2}) factor, and the restriction of non-collusion among the participating clients. For simplicity, we only analyze the case where each client has check-in probability pj = 1.
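Before stating the formal guarantee for Aavg, the following toy Python sketch illustrates the server-side behavior of Algorithm 2: every client checks in (pj = 1) to a uniformly random slot, empty slots are skipped, and each update averages the privatized gradients of all clients checked in at that slot. As with the earlier sketch, the loss, dimensions, and hyperparameter values are illustrative assumptions of ours.

# Toy simulation of A_avg (Algorithm 2) with R_j = [m] and p_j = 1.
import numpy as np

rng = np.random.default_rng(2)
n, m, eta, sigma, clip = 5_000, 200, 0.2, 1.0, 1.0
data = rng.normal(size=(n, 2))                     # d_j in R^2; goal: estimate the mean
theta = np.zeros(2)
slots = rng.integers(m, size=n)                    # every client checks in exactly once

def a_ldp(grad):
    """Illustrative local randomizer: L2-clip and add Gaussian noise."""
    grad = grad * min(1.0, clip / max(np.linalg.norm(grad), 1e-12))
    return grad + rng.normal(scale=sigma, size=grad.shape)

for i in range(m):
    S_i = np.flatnonzero(slots == i)               # clients checked in at step i
    if S_i.size == 0:
        continue                                   # skip the update (no dummy gradient)
    g = np.mean([a_ldp(theta - data[j]) for j in S_i], axis=0)
    theta = theta - eta * g                        # averaged model update
print("final theta:", theta, "| true mean:", data.mean(axis=0))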
Theorem 4.1 (Amplification via random check-ins with averaged updates). Suppose Aldp is an ε0-DP local randomizer. Let Aavg : D^n → Θ^m be the protocol from Algorithm 2 performing m averaged model updates with check-in probability pj = 1 and check-in window Rj = [m] for each user j ∈ [n]. Algorithm Aavg is (ε, δ + δ2)-DP with

    ε = e^{4ε0} (e^{ε0} − 1)² ε1² / 2 + e^{2ε0} (e^{ε0} − 1) ε1 √(2 log(1/δ)),    where ε1 = √(… / n).

In particular, for ε0 ≤ 1 we get ε = O(ε0 √(log(1/δ) / m)). Furthermore, if Aldp is (ε0, δ0)-DP with δ0 ≤ (1 − e^{−ε0}) δ1 / (4 e^{ε0} (2 + ln(2/δ1) / ln(1/(1 − e^{−5ε0})))), then Aavg is (ε′, δ′)-DP with ε′ = e^{32ε0} (e^{8ε0} − 1)² ε1² / 2 + e^{16ε0} (e^{8ε0} − 1) ε1 √(2 log(1/δ)) and δ′ = δ + δ2 + m (e^{ε′} + 1) δ1.

We provide a utility guarantee for Aavg in terms of the excess population risk for convex ERMs (similar to Theorem 3.5).

Theorem 4.2 (Utility guarantee of Algorithm 2). Suppose in algorithm Aavg : D^n → Θ^m described in Theorem 4.1 the loss ℓ : D × Θ → R+ is L-Lipschitz and convex in its second parameter, and the model space Θ has dimension p and diameter R, i.e., sup_{θ,θ′∈Θ} ∥θ − θ′∥ ≤ R. Furthermore, let 𝒟 be a distribution on D, define the population risk L(𝒟; θ) = E_{d∼𝒟}[ℓ(d; θ)], and let θ* = argmin_{θ∈Θ} L(𝒟; θ). If Aldp is a local randomizer that adds Gaussian noise with variance σ², and the learning rate for a model update at step i ∈ [m] is set to be ηi = R √n / √((mpσ² + nL²) i), then the output θm of Aavg(D) on a dataset D containing n i.i.d. samples from 𝒟 satisfies

    E_{D,θm}[L(𝒟; θm)] − L(𝒟; θ*) = Õ( √(mpσ² + nL²) · R / √(nm) ).

Furthermore, if the loss ℓ is β-smooth in its second parameter and we set the step size ηi = R / √(n(L² + pσ²)), then we have E_{D,θ1,...,θm}[L(𝒟; θ̄)] − L(𝒟; θ*) = Õ(…).

Comparison of the utility of Algorithm 2 to that of Algorithm 1: Recall that in Afix we can achieve a small fixed ε by taking p0 = m/n and σ = O(p0/(ε√m)), in which case the excess risk bound in Theorem 3.5 becomes O(√(1/m + p/(ε²n²))). On the other hand, in Aavg we can obtain a fixed small ε by taking σ = O(1/(ε√m)). In this case the excess risks in Theorem 4.2 are bounded by O(√(1/m + p/(ε²nm))) in the convex case, and by O(√(1/n + p/(ε²nm)) + 1/m) in the convex and smooth case. Thus, we observe that all the bounds recover the optimal population risk trade-offs from [9, 8] as m → n, and for m ≪ n and a non-smooth loss Afix provides a better trade-off than Aavg, while on smooth losses Aavg and Afix are incomparable. Note that Afix (with b = 1) will not attain a better bound on smooth losses because each update is based on a single data point. Setting b > 1 will reduce the number of updates to m/b for Afix, whereas obtaining an excess risk bound for Afix on smooth losses where more than one data point is sampled at each time step would require extending the privacy analysis to incorporate the change, which is beyond the scope of this paper.

4.2 Random Check-Ins with a Sliding Window

The second variant we consider removes the need for all clients to be available throughout the training period. Instead, we assume that the training period comprises n time steps, and each client j ∈ [n] is only available during a window of m time steps. Clients perform a random check-in to provide the server with an update during their window of availability. For simplicity, we assume clients wake up in order, one every time step, so client j ∈ [n] will perform a random check-in within the window Rj = {j, . . . , j + m − 1}. The server will perform n − m + 1 updates starting at time m, to provide a warm-up period during which the first m clients perform their random check-ins.
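The sketch below simulates sliding-window check-ins with Rj = {j, . . . , j + m − 1} and counts how many of the n − m + 1 update slots end up empty, for comparison with the (n − m + 1)/e bound given in Proposition 4.4 below; the parameter values and number of trials are illustrative.

# Monte Carlo estimate of empty update slots under sliding-window check-ins.
import numpy as np

rng = np.random.default_rng(3)
n, m, trials = 10_000, 100, 50

def empty_update_slots():
    hits = np.zeros(n + m, dtype=int)
    checkins = np.arange(n) + rng.integers(m, size=n)   # client j picks a slot u.a.r. in R_j
    np.add.at(hits, checkins, 1)
    # Update slots are time steps m-1, ..., n-1 (0-indexed): n - m + 1 of them.
    return int((hits[m - 1:n] == 0).sum())

avg = np.mean([empty_update_slots() for _ in range(trials)])
print(f"empirical avg empty update slots: {avg:.1f}  vs  (n - m + 1)/e = {(n - m + 1) / np.e:.1f}")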
Theorem 4.3 (Amplification via random check-ins with sliding windows). Suppose Aldp is an ε0-DP local randomizer. Let Asldw : D^n → Θ^{n−m+1} be the distributed algorithm performing n − m + 1 model updates with check-in probability pj = 1 and check-in window Rj = {j, . . . , j + m − 1} for each user j ∈ [n]. For any m ∈ [n], algorithm Asldw is (ε, δ)-DP with

    ε = e^{ε0} (e^{ε0} − 1)² / (2m) + (e^{ε0} − 1) √(2 e^{ε0} log(1/δ) / m).

For ε0 < 1 and δ < 1/100, we get ε ≤ 7 ε0 √(log(1/δ) / m). Furthermore, if Aldp is (ε0, δ0)-DP with δ0 ≤ (1 − e^{−ε0}) δ1 / (4 e^{ε0} (2 + ln(2/δ1) / ln(1/(1 − e^{−5ε0})))), then Asldw is (ε′, δ′)-DP with ε′ = e^{8ε0} (e^{8ε0} − 1)² / (2m) + (e^{8ε0} − 1) √(2 e^{8ε0} log(1/δ) / m) and δ′ = δ + m (e^{ε′} + 1) δ1.

Remark 4. We can always increase privacy in the statement above by increasing m. However, that also increases the number of clients who do not participate in training because their scheduled check-in time is before the process begins or after it terminates. Moreover, the number of empty slots where the server introduces dummy updates will also increase, which we would want to minimize for good accuracy. Thus, m introduces a trade-off between accuracy and privacy.

Proposition 4.4 (Dummy updates in random check-ins with sliding windows). For algorithm Asldw : D^n → Θ^{n−m+1} described in Theorem 4.3, the expected number of dummy gradient updates performed by the server is at most (n − m + 1)/e.

5 Improvements to Amplification via Shuffling

Here, we provide an improvement on privacy amplification by shuffling. It is obtained using two technical lemmas (deferred to the full version [7]) to tighten the analysis of amplification by swapping, a central component in the analysis of amplification by shuffling given in [19].

Theorem 5.1 (Amplification via Shuffling). Let A(i) : S(1) × · · · × S(i−1) × D → S(i), i ∈ [n], be a sequence of adaptive ε0-DP local randomizers. Let Asl : D^n → S(1) × · · · × S(n) be the algorithm that, given a dataset D = (d1, . . . , dn) ∈ D^n, samples a uniform random permutation π over [n], sequentially computes si = A(i)(s1:i−1, dπ(i)), and outputs s1:n. For any δ ∈ (0, 1), algorithm Asl satisfies (ε, δ)-DP with

    ε = e^{3ε0} (e^{ε0} − 1)² / (2n) + e^{3ε0/2} (e^{ε0} − 1) √(2 log(1/δ) / n).

Furthermore, if A(i), i ∈ [n], is (ε0, δ0)-DP with δ0 ≤ (1 − e^{−ε0}) δ1 / (4 e^{ε0} (2 + ln(2/δ1) / ln(1/(1 − e^{−5ε0})))), then Asl satisfies (ε′, δ′)-DP with ε′ = e^{24ε0} (e^{8ε0} − 1)² / (2n) + e^{12ε0} (e^{8ε0} − 1) √(2 log(1/δ) / n) and δ′ = δ + n (e^{ε′} + 1) δ1.

For comparison, the guarantee in [19, Theorem 7] in the case δ0 = 0 results in

    ε = 2 e^{2ε0} (e^{ε0} − 1) (e^{2 exp(2ε0)(e^{ε0} − 1)/n} − 1) + 2 e^{2ε0} (e^{ε0} − 1) √(2 log(1/δ) / n).
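The two shuffling bounds above can be compared numerically, in the spirit of Figure 2. The sketch below evaluates the Theorem 5.1 expression and the expression derived from [19, Theorem 7], both as transcribed above, on an illustrative grid of ε0 and n with δ = 10⁻⁶; the grid and δ are our choices.

# Compare the amplification-by-shuffling bounds of Theorem 5.1 and [19, Theorem 7].
import math

def eps_theorem_51(eps0, n, delta):
    """Central eps after shuffling n adaptive eps0-DP local randomizers (Theorem 5.1)."""
    a = math.exp(eps0)
    return (math.exp(3 * eps0) * (a - 1) ** 2 / (2 * n)
            + math.exp(1.5 * eps0) * (a - 1) * math.sqrt(2 * math.log(1 / delta) / n))

def eps_efmrtt19(eps0, n, delta):
    """Bound obtained from [19, Theorem 7] in the delta0 = 0 case, as stated above."""
    a = math.exp(eps0)
    c = 2 * math.exp(2 * eps0) * (a - 1)
    return c * (math.exp(c / n) - 1) + c * math.sqrt(2 * math.log(1 / delta) / n)

if __name__ == "__main__":
    delta = 1e-6
    for n in (10 ** 3, 10 ** 4, 10 ** 5):
        for eps0 in (0.25, 0.5, 1.0):
            print(f"n = {n:>6}, eps0 = {eps0:<4}:  "
                  f"Thm 5.1 -> {eps_theorem_51(eps0, n, delta):6.3f}   "
                  f"[19, Thm 7] -> {eps_efmrtt19(eps0, n, delta):6.3f}")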
6 Discussion

Proposed framework's utility for non-convex problems: Our analytical arguments (Theorems 3.5 and 4.2) formally demonstrate order-optimal utility/privacy trade-offs for convex models. While we agree that this line of research should eventually demonstrate empirical evidence of efficacy, theoretical conclusions do act as guiding principles, and the novel privacy amplification results are interesting and relevant on their own. One can view this line of reasoning as analogous to privacy-preserving deep learning research, in that the theoretical guarantees for convex models acted as guiding principles for non-convex (deep learning) models [9, 1, 29].

7 Broader Impacts

The rapid growth in connectivity and information sharing has been accelerating the adoption of tighter privacy regulations and better privacy-preserving technologies. Therefore, training machine learning models on decentralized data using mechanisms with formal guarantees of privacy is highly desirable. However, despite the rapid acceleration of research on both DP and FL, only a tiny fraction of production ML models are trained using either technology. This work takes an important step in addressing this gap.

Our work highlights the fact that proving DP guarantees for distributed or decentralized systems can be substantially more challenging than for centralized systems, because in the distributed world it becomes much harder to precisely control and characterize the randomness in the system, and this precise characterization and control of randomness is at the heart of DP guarantees. Specifically, production FL systems do not satisfy the assumptions that are typically made under state-of-the-art privacy accounting schemes, such as privacy amplification via subsampling. Without such accounting schemes, service providers cannot give DP statements with small ε's. This work, though largely theoretical in nature, proposes a method shaped by the practical constraints of distributed systems that allows for rigorous privacy statements under realistic assumptions.

Nevertheless, there is more to do. Our theorems are sharpest in the high-privacy regime (small ε's), which may be too conservative to provide sufficient utility for some applications. While significantly relaxed from previous work, our assumptions will still not hold in all real-world systems. Thus, we hope this work encourages further collaboration between distributed-systems and DP-theory researchers in establishing protocols that address the full range of possible systems constraints, as well as improving the full breadth of the privacy vs. utility Pareto frontier.

Acknowledgements

The authors would like to thank Vitaly Feldman for suggesting the idea of privacy accounting in DP-SGD via shuffling, and for help in identifying and fixing a mistake in the way a previous version of this paper handled (ε0, δ0)-DP local randomizers. No third-party funding was received by any of the authors to pursue this work.

References

[1] M. Abadi, A. Chu, I. J. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In Proc. of the 2016 ACM SIGSAC Conf. on Computer and Communications Security (CCS'16), pages 308-318, 2016.

[2] Differential Privacy Team, Apple. Learning with privacy at scale, 2017.

[3] S. Augenstein, H. B. McMahan, D. Ramage, S. Ramaswamy, P. Kairouz, M. Chen, R. Mathews, et al. Generative models for effective ML on private, decentralized datasets. arXiv preprint arXiv:1911.06679, 2019.

[4] V. Balcer and A. Cheu. Separating local & shuffled differential privacy via histograms. CoRR, abs/1911.06879, 2019.

[5] B. Balle, G. Barthe, and M. Gaboardi. Privacy amplification by subsampling: Tight analyses via couplings and divergences. In S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 6280-6290, 2018.

[6] B. Balle, J. Bell, A. Gascón, and K. Nissim. The privacy blanket of the shuffle model. In Advances in Cryptology (CRYPTO), 2019.

[7] B. Balle, P. Kairouz, H. B. McMahan, O. Thakkar, and A. Thakurta. Privacy amplification via random check-ins. CoRR, abs/2007.06605, 2020.

[8] R. Bassily, V. Feldman, K. Talwar, and A. G. Thakurta. Private stochastic convex optimization with optimal rates. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. B. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 11279-11288, 2019.

[9] R. Bassily, A.
Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Proc. of the 2014 IEEE 55th Annual Symp. on Foundations of Computer Science (FOCS), pages 464-473, 2014.

[10] A. Bittau, U. Erlingsson, P. Maniatis, I. Mironov, A. Raghunathan, D. Lie, M. Rudominer, U. Kode, J. Tinnés, and B. Seefeld. Prochlo: Strong privacy for analytics in the crowd. In Proceedings of the 26th Symposium on Operating Systems Principles, Shanghai, China, October 28-31, 2017, pages 441-459. ACM, 2017.

[11] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konečný, S. Mazzocchi, H. B. McMahan, et al. Towards federated learning at scale: System design. arXiv preprint arXiv:1902.01046, 2019.

[12] A. Cheu, A. Smith, J. Ullman, D. Zeber, and M. Zhilyaev. Distributed differential privacy via mixnets. CoRR, abs/1808.01394, 2018.

[13] B. Ding, J. Kulkarni, and S. Yekhanin. Collecting telemetry data privately. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 3571-3580, 2017.

[14] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, 26-29 October 2013, Berkeley, CA, USA, pages 429-438. IEEE Computer Society, 2013.

[15] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In Advances in Cryptology (EUROCRYPT), pages 486-503, 2006.

[16] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proc. of the Third Conf. on Theory of Cryptography (TCC), pages 265-284, 2006.

[17] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014.

[18] U. Erlingsson, V. Feldman, I. Mironov, A. Raghunathan, S. Song, K. Talwar, and A. Thakurta. Encode, shuffle, analyze privacy revisited: Formalizations and empirical evaluation. CoRR, abs/2001.03618, 2020.

[19] U. Erlingsson, V. Feldman, I. Mironov, A. Raghunathan, K. Talwar, and A. Thakurta. Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2468-2479. SIAM, 2019.

[20] U. Erlingsson, V. Pihur, and A. Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proc. of the 2014 ACM Conf. on Computer and Communications Security (CCS'14), pages 1054-1067. ACM, 2014.

[21] V. Feldman, I. Mironov, K. Talwar, and A. Thakurta. Privacy amplification by iteration. In 59th Annual IEEE Symp. on Foundations of Computer Science (FOCS), pages 521-532, 2018.

[22] B. Ghazi, R. Pagh, and A. Velingker. Scalable and differentially private distributed aggregation in the shuffled model. CoRR, abs/1906.08320, 2019.

[23] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, R. G. L. D'Oliveira, S. E. Rouayheb, D. Evans, J. Gardner, Z. Garrett, A. Gascón, B. Ghazi, P. B. Gibbons, M. Gruteser, Z. Harchaoui, C. He, L. He, Z. Huo, B. Hutchinson, J. Hsu, M. Jaggi, T. Javidi, G. Joshi, M. Khodak, J. Konečný, A. Korolova, F. Koushanfar, S. Koyejo, T. Lepoint, Y. Liu, P. Mittal, M. Mohri, R. Nock, A. Özgür, R. Pagh, M. Raykova, H. Qi, D. Ramage, R. Raskar, D. Song, W.
Song, S. U. Stich, Z. Sun, A. T. Suresh, F. Tramèr, P. Vepakomma, J. Wang, L. Xiong, Z. Xu, Q. Yang, F. X. Yu, H. Yu, and S. Zhao. Advances and open problems in federated learning. CoRR, abs/1912.04977, 2019.

[24] P. Kairouz, S. Oh, and P. Viswanath. The composition theorem for differential privacy. IEEE Trans. Inf. Theory, 63(6):4037-4049, 2017.

[25] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. D. Smith. What can we learn privately? In 49th Annual IEEE Symp. on Foundations of Computer Science (FOCS), pages 531-540, 2008.

[26] Y. Kuo, C. Chiu, D. Kifer, M. Hay, and A. Machanavajjhala. Differentially private hierarchical count-of-counts histograms. PVLDB, 11(11):1509-1521, 2018.

[27] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, pages 1273-1282, 2017.

[28] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang. Learning differentially private language models without losing accuracy. CoRR, abs/1710.06963, 2017.

[29] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang. Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963, 2017.

[30] V. Pichapati, A. T. Suresh, F. X. Yu, S. J. Reddi, and S. Kumar. AdaCliP: Adaptive clipping for private SGD. arXiv preprint arXiv:1908.07643, 2019.

[31] R. Rogers, S. Subramaniam, S. Peng, D. Durfee, S. Lee, S. K. Kancha, S. Sahay, and P. Ahammad. LinkedIn's audience engagements API: A privacy preserving data analytics system at scale, 2020.

[32] A. Smith, A. Thakurta, and J. Upadhyay. Is interaction necessary for distributed private learning? In 2017 IEEE Symposium on Security and Privacy (SP), pages 58-77. IEEE, 2017.

[33] S. Song, K. Chaudhuri, and A. D. Sarwate. Stochastic gradient descent with differentially private updates. In 2013 IEEE Global Conference on Signal and Information Processing, pages 245-248. IEEE, 2013.

[34] O. Thakkar, G. Andrew, and H. B. McMahan. Differentially private learning with adaptive clipping. CoRR, abs/1905.03871, 2019.

[35] Y. Wang, B. Balle, and S. P. Kasiviswanathan. Subsampled Rényi differential privacy and analytical moments accountant. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, pages 1226-1235, 2019.

[36] X. Wu, F. Li, A. Kumar, K. Chaudhuri, S. Jha, and J. F. Naughton. Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In S. Salihoglu, W. Zhou, R. Chirkova, J. Yang, and D. Suciu, editors, Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD, 2017.