# shuffle_private_stochastic_convex_optimization__9446dbb6.pdf Published as a conference paper at ICLR 2022 SHUFFLE PRIVATE STOCHASTIC CONVEX OPTIMIZATION Albert Cheu Georgetown University ac2305@georgetown.edu Matthew Joseph Google Research mtjoseph@google.com Jieming Mao Google Research maojm@google.com Binghui Peng Columbia University bp2601@columbia.edu In shuffle privacy, each user sends a collection of randomized messages to a trusted shuffler, the shuffler randomly permutes these messages, and the resulting shuffled collection of messages must satisfy differential privacy. Prior work in this model has largely focused on protocols that use a single round of communication to compute algorithmic primitives like means, histograms, and counts. We present interactive shuffle protocols for stochastic convex optimization. Our protocols rely on a new noninteractive protocol for summing vectors of bounded ℓ2 norm. By combining this sum subroutine with mini-batch stochastic gradient descent, accelerated gradient descent, and Nesterov s smoothing method, we obtain loss guarantees for a variety of convex loss functions that significantly improve on those of the local model and sometimes match those of the central model. 1 INTRODUCTION In stochastic convex optimization, a learner receives a convex loss function ℓ: Θ X R mapping pairs of parameters and data points to losses, and the goal is to use data samples x1, . . . , xn to find a parameter θ to minimize population loss Ex D [ℓ(θ, x)] over unknown data distribution D. The general applicability of this framework has motivated a long line of work studying stochastic convex optimization problems under the constraint of differential privacy (Dwork et al., 2006b), which guarantees that the solution found reveals little information about the data used during optimization. This has led to guarantees for both the central and local models of differential privacy. Users in central differential privacy must trust a central algorithm operator, while users in local differential privacy need not trust anyone outside their own machine. These different protections lead to different utility guarantees. When optimizing from n samples over a d-dimensional parameter space with privacy parameter ε, the optimal loss term due to privacy is O( d/(εn)) in the central model (Bassily et al., 2013; 2019; Feldman et al., 2020). In the local model, the optimal private loss term is O( d/(ε n)) for sequentially interactive protocols (Duchi et al., 2018) and a class of compositional fully interactive protocols (Lowy and Razaviyayn, 2021). A similar utility gap between central and local privacy also appears in many other problems (Kasiviswanathan et al., 2011; Chan et al., 2012; Duchi et al., 2018; Duchi and Rogers, 2019; Joseph et al., 2019). This has motivated the exploration of models achieving a finer trade-off between privacy and utility. One such notion of privacy is shuffle privacy (Cheu et al., 2019; Erlingsson et al., 2019). Here, users send randomized messages to a shuffler, which permutes the collection of messages before they are viewed by any other party. In conjunction with work on building shufflers via onion routing, multi-party computation, and secure computation enclaves, this simple model has led to a now-substantial body of work on shuffle privacy (Bittau et al., 2017; Cheu et al., 2019; Erlingsson et al., 2019; Balle et al., 2019; Ghazi et al., 2021; Balcer and Cheu, 2020; Feldman et al., 2021; Balcer et al., 2021), including steps toward real-world deployment (Google, 2021). However, comparatively little is known about shuffle private stochastic convex optimization. Published as a conference paper at ICLR 2022 1.1 OUR CONTRIBUTIONS 1. We introduce sequentially interactive (each user participates in a single protocol round) and fully interactive (each user participates in arbitrarily many protocol rounds) variants of the shuffle model of differential privacy1. This distinction is useful because, while full interactivity offers the strongest possible asymptotic guarantees, the practical obstacles to making multiple queries to a user over the duration of a protocol (Kairouz et al., 2021b; 2019) make sequentially interactive protocols more realistic. 2. We construct a noninteractive shuffle private protocol for privately computing a sum with bounded ℓ2 sensitivity. By using multiple messages, our protocol is substantially more accurate than existing results (see Related Work). 3. We use our ℓ2 summation protocol to derive several new shuffle private SCO guarantees (Table 1) via techniques including acceleration, Nesterov s smoothing method in the sequential interactive model, and noisy gradient descent in the fully interactive model. Loss Function Sequential Full Convex O d1/3 ε2/3n2/3 (Thm. 4.7) O d εn (Thm. 4.9) Convex and smooth O d2/5 ε4/5n4/5 (Thm. 4.3) Strongly convex O d2/3 ε4/3n4/3 (Thm. 4.8) O d ε2n2 (Thm. 4.9) Strongly convex and smooth O d ε2n2 (Thm. 4.8) Table 1: The guarantees proved in this paper. d is the dimension of the parameter space Θ, ε is the privacy parameter, and n is the number of data points. For neatness, this table omits the nonprivate 1/ n term, logarithmic terms, and other parameters like Lipschitz constants, parameter space diameter, and smoothness. Full statements appear in the referenced theorems. 1.2 ADDITIONAL RELATED WORK The vast majority of work on shuffle privacy studies the noninteractive model, but a few exceptions exist. Erlingsson et al. (2019) and Feldman et al. (2021) study a variant in which the shuffler is a layer that accepts randomizers from the analyst, assigns them randomly to users, and sends the permuted messages back to the analyzer. Feldman et al. (2021) also construct a basic protocol for SGD in this model. This model is similar to other work that achieves shuffle privacy guarantees by amplifying the local privacy guarantees of local randomizers (Balle et al., 2019). Erlingsson et al. (2020) also studied empirical risk minimization using shuffle-amplified sequentially interactive locally private protocols. The main difference between that work and ours is that we only require that the view of the shuffled outputs across rounds satisfies differential privacy, and can therefore avoid relying on amplification. Another exception is the work of Beimel et al. (2020), where the shuffler broadcasts the permuted messages to all parties. Users then execute key exchange protocols, and Beimel et al. show that a first round of key exchange enables the use of multi-party computation (MPC) to simulate any central DP algorithm in the second round. Our results depart from theirs in two ways. First, we only assume that the shuffler sends outputs to the analyzer, not to individual users. Second, the MPC approach given by Beimel et al. requires an honest majority. Without an honest majority, the differential privacy guarantees of their simulation fail. In contrast, the privacy guarantees of our shuffle protocols degrade smoothly with the fraction of honest users, which can be anything larger than an arbitrary constant (see discussion at the end of Section 3.1). Simultaneous independent work by Tenenbaum et al. (2021) also studies a sequentially interactive variant of the shuffle model in the context of multi-arm bandits. Their model of sequential interactivity is identical to ours, albeit for a different problem. 1These terms are borrowed from local differential privacy (Duchi et al., 2018; Joseph et al., 2019). Published as a conference paper at ICLR 2022 Girgis et al. (2021) provide a fully interactive shuffle private protocol for empirical risk minimization (ERM). In contrast, we provide both sequentially and fully interactive protocols for SCO, which optimizes population rather than empirical loss. Their ℓ2 summation protocol requires one message from each user, and its second moment guarantee for the average gradient has a 1/b dependence on batch size b (their Lemma 4). In contrast, our multi-message protocol obtains a better 1/b2 dependence. This is possible because our multi-message protocol does not rely on privacy amplification, and in particular circumvents their single-message lower bound (their Theorem 3). Kairouz et al. (2021b) study DP-FTRL, which uses a single pass, extends to nonconvex losses, does not rely on shuffling or amplification, and satisfies central DP. It is possible to adapt their algorithm to sequentially interactive shuffle privacy using our vector sum protocol, though their O(d1/4/ n) SCO guarantee is weaker than our O(d1/3/n2/3) SCO guarantee (Theorem 4.7). Concurrent work by Lowy and Razaviyayn (2021) study fully interactive shuffle private protocols with SCO guarantees matching ours (Section 4.2). There are two key differences between their work and ours. First, they do not provide sequentially interactive protocols. Second, their fully interactive protocols assume that users can send messages consisting of real vectors. This assumption is difficult to satisfy in practice (Canonne et al., 2020; Kairouz et al., 2021a). As a result, our protocols (and most of those in the shuffle privacy literature) rely only on discrete messages. Finally, several shuffle private protocols are known for the simpler problem of summing scalars (Cheu et al., 2019; Balle et al., 2019; Ghazi et al., 2019; Balcer and Cheu, 2020) and recent work has studied vector addition in the secure aggregation model (Kairouz et al., 2021a), which assumes the existence of a trusted aggregator that can execute modular arithmetic on messages. We expand on this work by introducing a shuffle private protocol for summing vectors with bounded ℓ2 norm, with noise standard deviation proportional to the ℓ2 sensitivity. 2 PRELIMINARIES Differential Privacy. Throughout, let X be a data universe, and suppose that each of n users has one data point from X. Two datasets X, X X n are neighbors, denoted X X , if they differ in the value of at most one user. Differential privacy is defined with respect to neighboring databases. Definition 2.1 (Dwork et al. (2006b)). An algorithm A satisfies (ε, δ)-differential privacy if, for every X X and every event Z, PA [A(X) Z] eε PA [A(X ) Z] + δ. Because Definition 2.1 assumes that the algorithm A has central access to compute on the entire raw dataset, we call this central privacy for brevity. For brevity, we will often use differential privacy and DP interchangeably. At times, it will also be useful to rephrase DP as a divergence constraint. Definition 2.2. The δ-Approximate Max Divergence between distributions X and Y is Dδ (X||Y ) = max Z supp(X):Pr[X Z] δ log Pr[X Z] δ Fact 2.3. Algorithm A is (ε, δ)-DP if and only if, for all X X , Dδ (A(X)||A(X )) ε. A key property of DP is closure under post-processing. Fact 2.4 (Proposition 2.1 (Dwork and Roth, 2014)). Fix any function f. If A is (ε, δ)-differentially private, then the composition f A is also (ε, δ)-differentially private. Shuffle Privacy. The shuffle model (Bittau et al., 2017; Cheu et al., 2019; Erlingsson et al., 2019) does not require users to trust the operator of A. Instead, we make the narrower assumption that there is a trusted shuffler: users pass messages to the shuffler, the shuffler permutes them, and the algorithm A operates on the resulting shuffled output, thus decoupling messages from the users that sent them. Definition 2.5. A one-round shuffle protocol P is specified by a local randomizer R : X Y and analyzer A : Y Z. In an execution of P on X, each user i computes R(Xi) possibly using public randomness and sends the resulting messages yi,1, yi,2, . . . to the shuffler S. S reports y to an analyzer, where y is a uniformly random permutation of all user messages. The final output of the protocol is A( y). We now define our privacy objective in the shuffle model, often shorthanded as shuffle privacy . Published as a conference paper at ICLR 2022 Definition 2.6 (Cheu et al. (2019); Erlingsson et al. (2019)). P = (R, A) is (ε, δ)-shuffle differentially private if the algorithm (S Rn)(X) := S(R(X1), . . . , R(Xn)), i.e. the analyzer s view of the shuffled messages, is (ε, δ)-differentially private. The privacy guarantee is only over the internal randomness of the users randomizers and the shuffler. A drawback of the preceding definition of shuffle protocols is that it limits communication to one round. It is not possible, for example, to adjust the randomizer assigned to user i based on the messages reported from user i 1. This is an obstacle to implementing adaptive algorithms like gradient descent. We therefore extend the shuffle model with the following sequentially interactive and fully interactive variants. Sequential interactivity breaks the data into batches and runs a different shuffle protocol on each batch in (possibly adaptive) sequence. Definition 2.7. Let Tr denote the universe of shuffle protocol transcripts, and let R denote the universe of randomizers. A sequentially interactive shuffle protocol P consists of initial local randomizer R0, analyzer A, and update function U : Tr [n] R. Let Trt denote the transcript after t rounds of the protocol. In each round t of P, the analyzer computes (nt, Rt) = U(Trt 1), nt new users apply Rt to their data points, and S shuffles the result and relays it back to A. At the conclusion of a T-round protocol, the analyzer releases final output A(Tr T ). This formalizes the idea that the analyzer looks at the past shuffled outputs to determine how many and what kind of randomizers to pass to the next batch of users. Note that, in a sequentially interactive protocol, each user only participates in one computation. More generally, we can further allow the analyzer to repeatedly query users. This leads us to the fully interactive model. Definition 2.8. A fully interactive shuffle protocol P is identical to a sequentially interactive shuffle protocol, except the update function U : Tr 2[n] R selects an arbitrary subset of users in each round. In particular, a user may participate in multiple rounds. The main advantage of full interactivity is that it offers the strongest formal utility guarantees (see Section 4); the main advantage of sequential interactivity is that it is typically difficult to query a user multiple times in practice (Kairouz et al., 2021b; 2019). For sequentially and fully interactive shuffle protocols, the definition of privacy is identical: the view of the transcript satisfies DP. The noninteractive definition is a special case of this general definition. Definition 2.9. Given shuffle protocol P, let MP : X Tr denote the central algorithm that simulates P on an input database and outputs the resulting transcript. Then P is (ε, δ)-shuffle differentially private if MP is (ε, δ)-differentially private. Stochastic Convex Optimization. Let closed convex Θ Rd with diameter D be our parameter space. We always assume loss functions ℓ(θ, x) that are convex and L-Lipschitz in Θ. Definition 2.10. Loss function ℓ: Θ X R is convex in Θ if, for all x X, for all t [0, 1] and θ, θ Θ, ℓ(tθ + (1 t)θ , x) tℓ(θ, x) + (1 t)ℓ(θ , x). ℓis L-Lipschitz in Θ if for all x X and θ, θ Θ, |ℓ(θ, x) ℓ(θ , x)| L θ θ 2. In some cases, we also assume that our loss function ℓis β-smooth and/or λ-strongly convex. Definition 2.11. Loss function ℓ: Θ X R is β-smooth over Θ if, for every x X and for every θ, θ Θ, ℓ(θ, x) ℓ(θ , x) 2 β θ θ 2. ℓis λ-strongly convex in Θ if, for all x X, for all t [0, 1] and θ, θ Θ, ℓ(tθ + (1 t)θ , x) tℓ(θ, x) + (1 t)ℓ(θ , x) λt(1 t) Our goal is to minimize population loss. We view each user as a sample from D, and our protocols learns through repeated interaction with the shuffler rather than direct access to the samples. Definition 2.12. Let D be a distribution over X, and let ℓ: Θ X R be a loss function. Then algorithm A: X Θ has population loss EA,D ℓ( θ, D) = Eθ A(Dn) [Ex D [ℓ(θ, x)]]. 3 VECTOR SUMMATION In this section, we provide a shuffle private protocol PVEC for summing vectors of bounded ℓ2 norm. This contrasts with existing work, which focuses on ℓ1 sensitivity. PVEC will be useful for the gradient descent steps of our algorithms in Section 4. PVEC relies on d invocations of a scalar sum protocol Published as a conference paper at ICLR 2022 P1D, one for each dimension. We describe this scalar summation subroutine in Section 3.1, then apply it to vector summation in Section 3.2. Proofs for results in this section appear in Appendix A 3.1 SCALAR SUM SUBROUTINE We start with the pseudocode for P1D, a shuffle protocol for summing scalars. As is generally the case for shuffle protocols, the overall protocol decomposes P1D = (R1D, A1D) into randomizer and analyzer components: each user i computes a collection of messages yi R1D(xi), the shuffler S aggregates and permutes these messages to produce y {0, 1}(g+b)n, and the analyzer takes the result as input for its final analysis A1D( y). P1D uses the fixed-point encoding presented by Cheu et al. (2019) and ensures privacy using a generalization of work by Balcer and Cheu (2020). Algorithm 1 P1D, a shuffle protocol for summing scalars 1: Parameters: Scalar database X = (x1, . . . , xn) [0, ]n; g, b N; p (0, 1/2) 2: procedure LOCAL RANDOMIZER R1D(x) 3: Set x xg/ 4: Sample rounding value η1 Ber(xg/ x) 5: Set ˆx x + η1 6: Sample privacy noise value η2 Bin(b, p) 7: Report ˆx + η2 copies of 1 and g + b (ˆx + η2) copies of 0 8: end procedure 9: procedure ANALYZER A1D ( y) 10: Output estimator g ((P(g+b)n i=1 yi) pbn) 11: end procedure P1D s privacy guarantee is instance-specific: for every pair of inputs, we bound the approximate maxdivergence between output distributions as a function of the change in inputs. Previously proposed by Chatzikokolakis et al. (2013), this property will be essential when proving that PVEC is private. Lemma 3.1. Fix any number of users n, ε 15, and 0 < δ < 1/2. Let g n, b > 180g2 ln(2/δ) and p = 90g2 ln(2/δ) bε2n . Then 1) for any neighboring databases X X [0, ]n that differ on user u, Dδ ((S Rn 1D)(X)||(S Rn 1D)(X )) ε 2 g + |xu x u| , and 2) for any input database X [0, ]n, P1D(X) is an unbiased estimate of Pn i=1 xi and has variance O( 2 We defer the proof to Appendix A. Claim 1 follows first from the observation that the adversary s view is equivalent to the sum of the message bits. This sum has binomial noise, which was first analyzed by Dwork et al. (2006a) in the context of central DP. We adapt work on a shuffle private variant due to Ghazi et al. (2021) to incorporate a dependence on the per-instance distance between databases. Claim 2 follows as a straightforward calculation. Lemma 3.1 holds when all n users follow the protocol. We call these users honest. It is possible to prove a more general version of Lemma 3.1 whose parameters smoothly degrade with the fraction of honest users (see Appendix A.3). This robust variant of shuffle privacy was originally defined in work by Balcer et al. (2021) and naturally extends to PVEC. The remaining analysis uses P1D in a black-box manner. It is therefore possible to replace it with a lower-communication subroutine for scalar sum (Balle et al., 2019). However, this improved communication requires a more complicated protocol with an additional subroutine. Our work focuses on sample complexity guarantees, so we use P1D for clarity. Nonetheless, we note that each user sends g + b n + 2 log(1/δ)/ε2 bits in P1D, and the analyzer processes these in time O((g + b)n). Scaling these quantities by d yields guarantees for PVEC in the next subsection. Moreover, it is possible to remove the communication dependence on by having users scale down their inputs by before running P1D, running P1D as if = 1, and then scaling up the final output by to compensate; this does not affect the overall unbiasedness or variance guarantees. Improving the communication efficiency of these protocols may be an interesting direction for future work. Published as a conference paper at ICLR 2022 3.2 VECTOR SUM Below, we present the pseudocode for PVEC, again decomposing into randomizer and analyzer components. Note that we view the vector y to be the collection of all shuffled messages and, since the randomizers labels these messages by coordinate, yj consists of the messages labelled j. The accompanying guarantee for PVEC appears in Theorem 3.2. Algorithm 2 PVEC, a shuffle protocol for vector summation 1: Input: database of d-dimensional vectors X = ( x1, , xn); privacy parameters ε, δ; 2. 2: procedure LOCAL RANDOMIZER RVEC( xi) 3: for coordinate j [d] do 4: Shift data to enforce non-negativity: wi,j xi,j + ( 2, . . . , 2) 5: mj R1D( wi,j) 6: end for 7: Output labeled messages {(j, mj)}j [d] 8: end procedure 9: procedure ANALYZER AVEC( y) 10: for coordinate j [d] do 11: Run analyzer on j s messages zj A1D(j, yj) 12: Re-center: oj zj n 2 13: end for 14: Output the vector of estimates o 15: end procedure Theorem 3.2. For any 0 < ε 15, 0 < δ < 1/2, d, n N, and 2 > 0, there are choices of parameters b, g N and p (0, 1/2) for P1D (Algorithm 3.1) such that, for inputs X = ( x1, . . . , xn) of vectors with maximum norm || xi||2 2, 1) PVEC is (ε, δ)-shuffle private and, 2) PVEC( X) is an unbiased estimate of Pn i=1 xi and has bounded variance = O d 2 2 ε2 log2 d The proof relies on a variant of advanced composition (Lemma 3.3). Although the proof follows many of the same steps as prior work, the statement is somewhat nonstandard because it uses the divergence between two specific algorithm executions, rather than the worst-case divergence of a generic differential privacy guarantee. This enables us to use Lemma 3.1. Lemma 3.3. Fix any γ (0, 1) and neighboring databases X, X X n. For all j [d], suppose algorithm Aj : X n Y use independent random bits and satisfies Dδj (Aj(X)||Aj(X )) εj. Let j=1 εj(eεj 1) + 2 v u u tlog(1/γ) j=1 ε2 j and δ = j=1 δj + γ. Then Dδ (A1(X), . . . , Ad(X)||A1(X ) . . . , Ad(X )) ε . We defer the proof of Theorem 3.2 to Appendix A.2 but sketch the outline here. For any neighboring pair of datasets, the j-th invocation of P1D by PVEC produces one of two message distributions such that the divergence between them is roughly proportional to the gap in the j-th coordinate of the input. This is immediate from Lemma 3.1. The fact that we know a bound on sensitivity allows us to bound the (squared) ℓ2 norm of the vector of divergences. This norm is implicit within Lemma 3.3, so the target theorem follows by re-scaling of parameters and substitution. 4 CONVEX OPTIMIZATION We now apply PVEC to convex optimization. Section 4.1 describes shuffle protocols that are sequentially interactive and solicit one input from each user.2 Section 4.2 offers stronger utility guarantees 2Since sequentially interactive protocols process data online, our sequentially interactive shuffle protocols also have straightforward pan-private (Dwork et al., 2010a) analogues with the same guarantees; see Appendix C. Published as a conference paper at ICLR 2022 via fully interactive protocols that query users multiple times. Complete proofs of all results in this section, along with communication and runtime guarantees, appear in Appendix B. 4.1 SEQUENTIALLY INTERACTIVE PROTOCOLS Our first sequentially interactive protocol is a simple instantiation of noisy mini-batch SGD, PSGD. This baseline algorithm achieves population loss O(d1/4/ εn) (Theorem 4.1), which already improves on the locally private guarantee. However, in a departure from the central and local models, we then show that a still better O(d1/3/(εn)2/3) loss guarantee is possible (Theorem 4.7) using a more complex algorithm PAGD that smooths the loss function and employs accelerated SGD. Convex ℓ We start with PSGD. Because PSGD is sequentially interactive, for a fixed number of users n, the batch size b is inversely proportional to the number of iterations T. As a result, Theorem 4.1 chooses b to balance the noise added to each gradient step (which decreases with b) with the number of steps taken (which also decreases with b). Algorithm 3 PSGD, Sequentially interactive shuffle private SGD Require: Number of users n, batch size b, privacy parameter ε, δ, step size η, Lipschitz parameter L 1: Initialize parameter estimate θ0 Θ, and set number of iterations T = n/b 2: for t = 1, 2, . . . , T do 3: Compute gradient at θt 1, gt 1 b PVEC ( θℓ(θt 1, xt,1), , θℓ(θt 1, xt,b); ε, δ, L) 4: Update parameter estimate θt πΘ (θt 1 + η gt) 5: end for 6: Output 1 T PT 1 t=0 θt Theorem 4.1. Let loss function ℓbe convex, L-Lipschitz over a closed convex set Θ Rd of diameter D, θ = 1 T PT 1 t=0 θt, b = ε , T = n/b, η = εb D L d L log(d/δ)), and ε 15. Then PSGD is (ε, δ)-shuffle private and has population loss EPSGD,D ℓ( θ, D) min θ Θ ℓ(θ, D) + O d1/4DL log1/2(d/δ) εn Privacy for PSGD and our other protocols, which all rely on PVEC follows from the fact that the L-Lipschitz assumption ensures that a loss gradient has ℓ2 norm at most L. The utility proof proceeds by viewing each noisy gradient step as a call to a noisy gradient oracle. This enables us to use a standard result for gradient oracle SGD. Lemma 4.2 (Bubeck et al. (2015)). Suppose PSGD queries an LG-noisy gradient oracle at each iteration, then its output satisfies E ℓ( θ, D) minθ Θ ℓ(θ, D) + D2 2ηT + ηL2 G 2 . In our case, we can bound the noise level LG as a function of our problem and privacy parameters, including the batch size. We then choose the batch size to minimize the overall loss guarantee. Convex and smooth ℓ Next, we obtain a better loss guarantee than Theorem 4.1 when the loss function is also smooth. The key insight is that by using acceleration, a better trade-off between noise level and number of iterations is possible. The resulting algorithm, PAGD, can be seen as a private version of the accelerated stochastic approximation algorithm (AC-SA) of Lan (2012). Theorem 4.3. Let ℓbe convex, β-smooth, and L-Lipschitz over closed convex set Θ Rd of diameter D. Denote θ = θag t+1. Let batch size b = n3/5d1/5L2/5 log2/5(d/δ) ε2/5β2/5D2/5 , Lt = 1 t+1((T + 2)3/2 σ D + β), αt = 2 t+2, and ε 15. Then PAGD is (ε, δ)-shuffle private with population loss EPAGD,D ℓ( θ, D) min θ Θ ℓ(θ, D) + O DL n + d2/5β1/5D6/5L4/5 log4/5(d/δ) The proof of Theorem 4.3 is similar to that of Theorem 4.1. The main difference is that we rely on an oracle utility guarantee that is specific to accelerated gradient descent: Published as a conference paper at ICLR 2022 Algorithm 4 PAGD, Sequentially interactive shuffle private AC-SA Require: Number of users n, batch size b, privacy parameters ε, δ, Lipchitz parameter L, learning rate sequence {Lt}, {αt} 1: Initialize parameter estimate θag 1 = θ1 Θ, and set number of iterations T = n/b 2: for t = 1, 2, . . . , T do 3: Update middle parameter estimate θmd t αtθt + (1 αt)θag t . 4: Estimate gradient at θmd t , gt 1 b PVEC θℓ(θmd t , xt,1), , θℓ(θmd t , xt,b); ε, δ, L 5: Update parameter estimate θt+1 arg minθ Θ gt, θ θt + Lt 6: Update aggregated parameter estimate θag t+1 αtθt+1 + (1 αt)θag t 7: end for 8: Output θag T +1 Lemma 4.4 (Theorem 2 (Lan, 2012)). Suppose PAGD receives a noisy gradient oracle with variance (at most) σ in each iteration. Taking Lt = 1 t+1((T + 2)3/2 σ D + β), αt = 2 t+2, then the expected error of PAGD can be bounded as E ℓ( θ, D) min θ Θ ℓ(θ, D) + O βD2 Smoothing for convex non-smooth ℓ Perhaps surprisingly, when the target loss function is convex and non-smooth, the SGD approach of algorithm PSGD is suboptimal. We improve the convergence rate by combining the accelerated gradient descent method PAGD with Nesterov s smoothing technique. In particular, for a non-smooth loss function, we approximate it with a smooth function by exploiting Moreau-Yosida regularization. We then optimize the smooth function via PAGD. This is not the first application of Moreau-Yosida regularization in the private optimization literature (Bassily et al., 2019), but we depart from past work by combining it with acceleration to attain a better guarantee. We start by recalling the notion of the β-Moreau envelope, using the language of Bassily et al. (2019). Definition 4.5. For β > 0 and convex f : θ Rd, the β-Moreau envelope fβ : θ Rd of f is fβ(θ) = min θ Θ The following properties of the β-Moreau envelope will be useful. Lemma 4.6 (Nesterov (2005)). Let f : θ Rd be a convex function, and define the proximal operator proxf(θ) = arg minθ Θ f(θ ) + 1 2 θ θ 2 2 . For β > 0: 1. fβ is convex, 2L-Lipschitz and β-smooth. 2. θ Θ, fβ(θ) = β(θ proxf/β(θ)). 3. θ Θ, fβ(θ) f(θ) fβ(θ) + L2 Informally, we use Lemma 4.6 as follows. First, claim 1 enables us to optimize fβ using PAGD. Inside PAGD, we use claim 2 to compute the desired gradients of fβ and obtain a loss guarantee for fβ using Theorem 4.3. Finally, claim 3 bounds the change in loss between fβ and the true function f as a function of β. We then choose β to minimize the overall loss guarantee. Theorem 4.7. Let ℓbe convex and L-Lipschitz over a closed convex set Θ Rd of diameter D and ε 15. Then combining PAGD with the above smoothing method yields an (ε, δ)-shuffle private algorithm with population loss E ℓ( θ, D) min θ Θ ℓ(θ, D) + O DL n + d1/3DL log2/3(d/δ) Strongly convex function When the function is strongly convex, we use a folklore reduction from the convex setting: start from arbitrary point θ0 Θ, and apply a convex optimization algorithm Published as a conference paper at ICLR 2022 k = O(log log n) times, where the j-th (j [k]) application uses the output from previous phase as the initial point and nj = n/k samples. In the strongly convex but possibly non-smooth case, we apply the smoothing version of PAGD used in Theorem 4.7. In the strongly convex and smooth case, we apply the version of PAGD used in Theorem 4.3. A similar approach appears in the work of Feldman et al. (2020); the adaptation for our setting requires, among other modifications, a more careful analysis of the optimization trajectory. Theorem 4.8. Let ℓbe λ-strongly convex and L-Lipschitz over a closed convex set Θ Rd of diameter D and ε 15. Then combining PAGD with the reduction above yields an (ε, δ)-shuffle private algorithm with population loss E ℓ( θ, D) min θ Θ ℓ(θ, D) + O λn + d2/3L2 log4/3(d/δ) If the loss function is also β-smooth, then the population loss can be improved to E ℓ( θ, D) min θ Θ ℓ(θ, D) + O L2 λn + dβ1/2L2 log2(d/δ) 4.2 FULLY INTERACTIVE PROTOCOL We conclude with fully interactive protocols, which allow a user to participate in multiple shuffles. In this expanded setting, a version of PSGD obtains loss guarantees that match those of the central model of DP, up to logarithmic factors. The pseudocode of our protocol PGD is shown in Algorithm 5. For strongly convex functions, we further divide users into disjoint groups and apply a fully interactive protocol PGD to each group in sequence, using the parameter estimate learned from one group as the initial parameter estimate for the next. Algorithm 5 PGD, Fully interactive shuffle private gradient descent Require: Number of users n, privacy parameters ε, δ, Lipschitz parameter L, number of iterations T, step size η 1: Initialize parameter estimate θ0 0 2: for t = 1, 2, . . . , T do 3: Compute the gradient at θt, θℓ(θt, x1), , θℓ(θt, xn); ε 2T log(1/δ) , δ T + 1, L 4: Compute and store the parameter θt+1 πΘ (θt + η gt) 5: end for 6: Output final estimate θ 1 T PT 1 t=0 θt Theorem 4.9. Let ℓbe convex and L-Lipschitz over a closed convex set Θ Rd of diameter D and ε 15. Then the protocol PGD is (ε, δ)-shuffle private with population loss E ℓ( θ, D) min θ Θ ℓ(θ, D) + O DL n + d1/2DL log3/2(d/δ) If ℓis λ-strongly convex, we can further improve the population loss to E ℓ( θ, D) min θ Θ ℓ(θ, D) + O L2 n + d log3(d/δ) Details appear at the end of Appendix B; since the algorithm and analysis combines previouslydiscussed components, we sketch them here. The analysis employs the same Moreau envelope approach used in Theorem 4.7. However, unlike in the analysis of Theorem 4.7, we no longer require acceleration. Advanced composition and an appropriate setting of the parameters completes the result. For strongly convex function, we need to further split users into log log n disjoint groups and apply PGD to these disjoint groups of users in sequence. Published as a conference paper at ICLR 2022 ACKNOWLEDGEMENTS We thank the anonymous ICLR reviewers for their helpful comments. Binghui Peng is supported by Christos Papadimitriou s NSF grant CCF-1763970 AF, CCF-1910700 AF, DMS-2134059 and a softbank grant, and by Xi Chen s NSF grant CCF-1703925 and IIS-1838154. Albert Cheu is supported in part by a gift to Georgetown University. Part of his work was done while he was a Ph D student at Northeastern University. Published as a conference paper at ICLR 2022 Kareem Amin, Matthew Joseph, and Jieming Mao. Pan-private uniformity testing. In Conference on Learning Theory (COLT), 2020. Hilal Asi, John Duchi, and Omid Javidbakht. Element level differential privacy: The right granularity of privacy. ar Xiv preprint ar Xiv:1912.04042, 2019. Victor Balcer and Albert Cheu. Separating local & shuffled differential privacy via histograms. In Information-Theoretic Cryptography (ITC) 2020, 2020. Victor Balcer, Albert Cheu, Matthew Joseph, and Jieming Mao. Connecting robust shuffle privacy and pan-privacy. In Symposium on Discrete Algorithms (SODA), 2021. Borja Balle, James Bell, Adrià Gascón, and Kobbi Nissim. Differentially private summation with multi-message shuffling. Arxiv, abs/1906.09116, 2019. Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Foundations of Computer Science (FOCS), 2013. Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In Neural Information Processing Systems (Neur IPS), 2019. Amos Beimel, Iftach Haitner, Kobbi Nissim, and Uri Stemmer. On the round complexity of the shuffle model. In Rafael Pass and Krzysztof Pietrzak, editors, Theory of Cryptography Conference (TCC), 2020. Andrea Bittau, Úlfar Erlingsson, Petros Maniatis, Ilya Mironov, Ananth Raghunathan, David Lie, Mitch Rudominer, Ushasree Kode, Julien Tinnes, and Bernhard Seefeld. Prochlo: Strong privacy for analytics in the crowd. In Symposium on Operating Systems Principles (SOSP), 2017. Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends R in Machine Learning, 2015. Clément Canonne, Gautam Kamath, and Thomas Steinke. The discrete gaussian for differential privacy. In Neural Information Processing Systems (Neur IPS), 2020. TH Hubert Chan, Elaine Shi, and Dawn Song. Optimal lower bound for differentially private multi-party aggregation. In European Symposium on Algorithms (ESA), 2012. Konstantinos Chatzikokolakis, Miguel E. Andrés, Nicolás Emilio Bordenabe, and Catuscia Palamidessi. Broadening the scope of differential privacy using metrics. In Privacy Enhancing Technologies Symposium (PETS), 2013. Albert Cheu, Adam Smith, Jonathan Ullman, David Zeber, and Maxim Zhilyaev. Distributed differential privacy via shuffling. In Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT), 2019. John Duchi and Ryan Rogers. Lower bounds for locally private estimation via communication complexity. In Conference on Learning Theory (COLT), 2019. John C Duchi, Michael I Jordan, and Martin J Wainwright. Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association, 2018. Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends R in Theoretical Computer Science, 2014. Cynthia Dwork, Krishnaram Kenthapadi, Frank Mc Sherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, 2006a. Cynthia Dwork, Frank Mc Sherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference (TCC), 2006b. Published as a conference paper at ICLR 2022 Cynthia Dwork, Moni Naor, Toniann Pitassi, Guy N Rothblum, and Sergey Yekhanin. Pan-private streaming algorithms. In Innovations in Computer Science (ICS), 2010a. Cynthia Dwork, Guy N Rothblum, and Salil Vadhan. Boosting and differential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 51 60. IEEE, 2010b. Úlfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Abhradeep Thakurta. Amplification by shuffling: From local to central differential privacy via anonymity. In Symposium on Discrete Algorithms (SODA), 2019. Úlfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Shuang Song, Kunal Talwar, and Abhradeep Thakurta. Encode, shuffle, analyze privacy revisited: Formalizations and empirical evaluation. ar Xiv preprint ar Xiv:2001.03618, 2020. Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in linear time. In Symposium on the Theory of Computing (STOC), 2020. Vitaly Feldman, Audra Mc Millan, and Kunal Talwar. Hiding among the clones: A simple and nearly optimal analysis of privacy amplification by shuffling. In Symposium on the Foundations of Computer Science (FOCS), 2021. Badih Ghazi, Rasmus Pagh, and Ameya Velingker. Scalable and differentially private distributed aggregation in the shuffled model. Arxiv, 2019. Badih Ghazi, Noah Golowich, Ravi Kumar, Rasmus Pagh, and Ameya Velingker. On the power of multiple anonymous messages. In Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT), 2021. Antonious Girgis, Deepesh Data, Suhas Diggavi, Peter Kairouz, and Ananda Theertha Suresh. Shuffled model of differential privacy in federated learning. In Artificial Intelligence and Statistics (AISTATS), 2021. Google. Cobalt: Telemetry with built-in privacy. https://fuchsia.googlesource.com/ cobalt, 2021. Accessed: 05-21-2021. Matthew Joseph, Jieming Mao, Seth Neel, and Aaron Roth. The role of interactivity in local differential privacy. In Symposium on the Foundations of Computer Science (FOCS), 2019. Peter Kairouz, H Brendan Mc Mahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. ar Xiv preprint ar Xiv:1912.04977, 2019. Peter Kairouz, Ziyu Liu, and Thomas Steinke. The distributed discrete gaussian mechanism for federated learning with secure aggregation. In Neural Information Processing Systems (Neur IPS), 2021a. Peter Kairouz, Brendan Mc Mahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, and Zheng Xu. Practical and private (deep) learning without sampling or shuffling. In International Conference on Machine Learning (ICML), 2021b. Shiva Prasad Kasiviswanathan, Homin K Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM Journal on Computing, 2011. Guanghui Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 2012. Andrew Lowy and Meisam Razaviyayn. Private federated learning without a trusted server: Optimal algorithms for convex losses. ar Xiv preprint ar Xiv:2106.09779, 2021. Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 2005. Jay Tenenbaum, Haim Kaplan, Yishay Mansour, and Uri Stemmer. Differentially private multi-armed bandits in the shuffle model. ar Xiv preprint ar Xiv:2106.02900, 2021. Published as a conference paper at ICLR 2022 A PROOFS FOR SUM The goal of this section is to prove Lemma 3.1 and Theorem 3.2. A.1 PROOFS FOR SCALAR SUM We first state a technical lemma adapted from Lemma 4.12 in the work of Ghazi et al. (2021). Our lemma expresses divergence in terms of the per-instance distance between databases, which is necessary for the per-instance statement in Lemma 3.1. Lemma A.1. For any function f : X n Z and parameters p (0, 1/2] and m N, let Mm,p,f : X n Z be the algorithm that samples η Bin(m, p) and returns f(X) + η. For any t N, X X such that |f(X) f(X )| t, and α (0, 1) where αmp 2t, Dδ (Mm,p,f(X)||Mm,p,f(X )) ε where ε = |f(X) f(X )| ln 1+α 1 α and δ = 2 exp( α2mp/10). Proof. Fix any set of integers Z. We will show that P [Mm,p,f(X) Z] exp |f(X) f(X )| ln 1+α 1 α P [Mm,p,f(X) Z] + 2 exp( α2mp/10) which is equivalent to the divergence bound. For any integer i, we use the notation Z i to denote the set {z i | z Z}. Then P [Mm,p,f(X) Z] = X z Z Pη Bin(m,p) [η = z f(X)] ˆz Z f(X) Pη Bin(m,p) [η = ˆz] ˆz (Z f(X)) Q Pη Bin(m,p) [η = ˆz] + X ˆz / Q Pη Bin(m,p) [η = ˆz] . (1) The last inequality holds for any set Q. Our specific Q is Q := q Z | Pη Bin(m,p) [η = q] Pη Bin(m,p) [η = q + f(X) f(X )] < exp |f(X) f(X )| ln 1+α Recall that we supposed |f(X) f(X )| t αmp 2 . The following fact comes from Lemma 4.12 in Ghazi et al. (2021): for any k [ t, t], PY Bin(m,p) Pη Bin(m,p) [η = Y ] Pη Bin(m,p) [η = Y + k] exp |k| ln 1+α 1 α 2 exp( α2mp/10). Then by substitution, ˆz (Z f(X)) Q exp |f(X) f(X )| ln 1+α 1 α Pη Bin(m,p) [η = ˆz + f(X) f(X )] + 2 exp( α2mp/10) ˆz (Z f(X )) Q exp |f(X) f(X )| ln 1+α 1 α Pη Bin(m,p) [η = ˆz] + 2 exp( α2mp/10) exp |f(X) f(X )| ln 1+α 1 α P [Mm,p,f(X ) Z] + 2 exp( α2mp/10). With Lemma A.1 in hand, we now turn to proving our main result for scalar sum, Lemma 3.1. Published as a conference paper at ICLR 2022 Lemma A.2 (Restatement of Lemma 3.1). Fix any number of users n, ε 15, and 0 < δ < 1/2. Let g n, b > 180g2 ln(2/δ) ε2n , and p = 90g2 ln(2/δ) bε2n . Then 1) for any neighboring databases X X [0, ]n that differ on user u, Dδ ((S Rn 1D)(X)||(S Rn 1D)(X )) ε 2 g + |xu x u| , and 2) for any input database X [0, ]n, the P1D(X) is an unbiased estimate of Pn i=1 xi and has variance O( + 2 Proof. We first prove item 1. On input X, the shuffler produces a uniformly random permutation of the (g + b)n bits y1, . . . y(g+b)n. For each user i, let zi be the sum of the messages sent by user i, zi = Pg+b j=1 yi,j. Then the random variable for permutations of y1, . . . y(g+b)n can also be viewed as a post-processing of the random variable Z = Pn i=1 zi. Let Z be the analogue of Z for input database X , differing in user u. Then by the data-processing inequality, it suffices to prove Dδ (Z||Z ) ε (2/g + |xu x u|/ ). By the definition of R1D, Z is distributed as ˆxu + P i =u ˆxu + Bin(bn, p) while Z is distributed as ˆx u + P i =u ˆxu + Bin(bn, p). We have therefore reduced the analysis to that of the binomial mechanism. Our next step is to justify that Lemma A.1 applies to our construction. Specifically, we must identify α, m, t and show p < 1/2, |ˆxu ˆx u| t, and αmp 2t. We set m = bn, t = g, and α = ε/3g. Our assumption on b implies that p < 1/2. And the fact that ˆxi, ˆx i both lie in the interval [0, g] implies |ˆxi ˆx i| g = t. So it remains to prove αmp 2t: αmp = α bn 90g2 ln(2/δ) = α 90g2 ln(2/δ) = 30g ln(2/δ) ε > 2g ln(2/δ) (ε < 15) > 2t. (δ < 1/2, t = g) Next, we derive the target bound on the divergence. Via Lemma A.1, we have that D ˆδ ((S Rn 1D)(X)||(S Rn 1D)(X )) ˆε, ˆε = |ˆxu ˆx u| ln 1 + α 1 α |ˆxu ˆx u| 3α (α small) g |ˆxu ˆx u| g 2 + |xu x u| g (Discretization) g + |xu x u| ˆδ = 2 exp( α2mp/10) 2 exp α2 9 g2 ln(2/δ) Moving to item 2, observe that the sum of the messages produced by user i is xi + η1 + η2. Since η1 was drawn from Ber(xig/ xi), E [η1] = xig/ xi and Var [η1] 1/4 since any Bernoulli random variable has variance 1/4. Meanwhile, η2 is a sample from Bin(b, p) so E [η2] = bp and Var [η2] = bp(1 p) bp. Recall our definition above of zi = Pb+g j=1 yi,j. Then E [zi] = xig/ +bp Published as a conference paper at ICLR 2022 and, by the independence of η1 and η2, Var [zi] 1/4 + bp. Extending this over all n users yields E [Pn i=1 zi] = g Pn i=1 xi + bnp and Var [Pn i=1 zi] n/4 + bnp. A1D shifts and rescales this quantity, so its final output is unbiased and, by our assumption that g n and choice of p, 4 + bnp = O 1 + 2bnp A.2 PROOFS FOR VECTOR SUM Having completed our analysis of P1D, we now turn to PVEC. The first step in this direction is proving the following variant of advanced composition: Lemma A.3 (Restatement of Lemma 3.3). Fix any γ (0, 1) and neighboring databases X, X X n. For all j [d], suppose algorithm Aj : X n Y use independent random bits and satisfies Dδj (Aj(X)||Aj(X )) εj. Let j=1 εj(eεj 1) + 2 v u u tlog(1/γ) j=1 ε2 j and δ = j=1 δj + γ. Then Dδ (A1(X), . . . , Ad(X)||A1(X ) . . . , Ad(X )) ε . Our proof follows the steps in (Dwork et al., 2010b; Asi et al., 2019) with some minor adaptations. We provide the detailed proof for completeness. We first provide some useful definitions. Definition A.4. Given random variables Y and Z, the KL Divergence between Y and Z is D(Y ||Z) = Ey Y log Pr[Y = y] The Max Divergence between Y and Z is D (Y ||Z) = max S supp(Y ) log Pr[Y S] The total variation distance between Y and Z is (Y, Z) = max S supp(Y ) |PY (S) PZ(S)|. The following lemma relating these notions comes from Dwork and Roth (2014). Lemma A.5 (Lemma 3.17 and 3.18 (Dwork and Roth, 2014)). Given random variables Y and Z, 1. We have both Dδ (Y ||Z) ε and Dδ (Z||Y ) ε if and only if there exists random variables Y , Z such that (Y, Y ) δ/(eε + 1) and (Z, Z ) δ/(eε + 1), and D (Y ||Z ) ε, and D (Z ||Y ) ε. 2. If D (Y ||Z) ε and D (Z||Y ) ε, then D(Y ||Z) ε(eε 1). For the first claim, Dwork and Roth (2014) only explicitly proves that D (Y ||Z ) ε, but their construction of Y and Z also implies D (Z ||Y ) ε. We now have the tools to prove our variant of advanced composition, Lemma A.3. Published as a conference paper at ICLR 2022 Proof of Lemma A.3. For any neighboring datasets X and X , let Y = (Y1, Y2, . . . , Yd) and Z = (Z1, Z2, . . . , Zd) denote the outcomes of running A1, . . . , Ad on X, X respectively. By Fact 2.3 and the first part of Lemma A.5, there exist random variables Y = (Y 1, . . . , Y d) and Z = (Z 1, . . . , Z d) such that, for any j [d], the following four inequalities hold: (i) (Yj, Y j ) δj/(eεj + 1), (ii) (Zj, Z j) δj/(eεj + 1), (iii) D (Y j ||Z j) εj, and (iv) D (Z j||Y j ) εj. Then (Y, Y ), (Z, Z ) δj eεj + 1 < 1 Consider the outcome set B = {v Rd : Pr[Y = v] eε Pr[Z = v]}. It remains to prove Pr[Y B] γ. For any fixed realization v = (v1, . . . , vd) Rd, we have log Pr[Y = v] Pr[Y j = vj] Pr[Z j = vj] where the first step follows from independence and the last step defines cj. Since D (Y j ||Z j) εj and D (Z j||Y j ) εj, we know that |cj| εj, and by the second part of Lemma A.5 Evj Yj [cj] = Evj Yj Pr[Y j = vj] Pr[Z j = vj] Thus we have Pr [Y B] = Pr j=1 εj(eεj 1) + 2 v u u tlog(1/γ) by a Chernoff-Hoeffding bound. Tracing back, P [Y B] γ implies Dγ (Y ||Z ) ε . Our previous bounds on (Y, Y ) and (Z, Z ) then give D Pd j=1 δj+γ (Y ||Z) ε . Finally, we can now prove Theorem 3.2, our primary guarantee for vector summation. Theorem A.6 (Restatement of Theorem 3.2). For any 0 < ε 15, 0 < δ < 1/2, d, n N, and 2 > 0, there are choices of parameters b, g N and p (0, 1/2) for P1D (Algorithm 3.1) such that, for inputs X = ( x1, . . . , xn) of vectors with maximum norm || xi||2 2, 1) PVEC is (ε, δ)-shuffle private and, 2) PVEC( X) is an unbiased estimate of Pn i=1 xi and has bounded variance = O d 2 2 ε2 log2 d Proof. We set g = max(2 2 n, d, 4). We will choose b, p later in the proof. Fix neighboring databases of vectors X and X . We assume without loss of generality that the databases differ only in the first user, x1 = x 1. Then the ℓ2 sensitivity is 2 = x1 x 1 2 2 2. (2) From our definition of wi,j in line 4 of the pseudocode for PVEC and the fact that xi 2 2, wi,j 0 and wi,j xi + 2 xi 2 + 2 2 2. This means the sensitivity of the value of wi,j is 2 2. For each coordinate j [d], we use aj to denote the absolute difference in the j-th coordinate, aj := | x1,j x 1,j| = | w1,j w 1,j| Published as a conference paper at ICLR 2022 Let γ, ˆδ = δ/(d + 1) and ˆε = ε/18 p log(1/γ). Because g 2 2 n, we may use Lemma 3.1 to set b, p such that D ˆδ ((S Rn 1D)(Wj)||(S Rn 1D)(W j)) ˆε 2 where we use Wj to denote (w1,j, . . . , wn,j) and the last step defines εj. Notice that j [d] ε2 j = ˆε2 X 2g + a2 j 4 2 2 g2 + 2 x1 x 1 1 2g + x1 x 1 2 2 4 2 2 (Def of aj) d x1 x 1 2 2g + x1 x 1 2 2 4 2 2 ˆε2 (4 + 4 + 1) = 9ˆε2 (3) The final inequality comes from the fact that g d. By our choices of ˆε, γ, and ˆδ, we apply Lemma 3.3 to conclude that Dδ ((S Rn VEC)( X)||(S Rn VEC)( X )) j=1 εj(eεj 1) + 2 v u u tlog(1/γ) j=1 2ε2 j + 2 v u u tlog(1/γ) 18ˆε2 + 6ˆε p log(1/γ) (Via (3)) ε (Choice of ˆε) which is precisely (ε, δ)-differential privacy by Fact 2.3. Note that the second inequality holds because g 4 implies εj ˆε ε/18. Since ε 15, we get εj < 1, and eεj 1 2εj. The estimator produced by PVEC is unbiased because each invocation of P1D yields unbiased estimates. And we have, by linearity and the variance bound of P1D, = O d 2 2 ˆε2 log 1 The theorem statement follows by substitution. A.3 ROBUST PRIVACY OF SUM We now discuss the robustness of our privacy guarantees, as mentioned at the end of Section 3.1. In more detail, we slightly generalize the privacy analysis of P1D to account for the case where a γ fraction of the users are honest. Because P1D is used as a building block for the other protocols we describe, this robustness carries over. We first present the definition of robust shuffle privacy originally given by Balcer et al. (2021). Definition A.7 (Balcer et al. (2021)). Fix γ (0, 1]. A protocol P = (R, A) is (ε, δ, γ)-robustly shuffle differentially private if, for all n N and γ γ, the algorithm S Rγ n is (ε, δ)-differentially private. In other words, P guarantees (ε, δ)-shuffle privacy whenever at least a γ fraction of users follow the protocol. Although the above definition only explicitly handles drop-out attacks by 1 γ malicious users, this is without loss of generality: any messages sent to the shuffler from the malicious users constitute a post-processing of S Rγn. Published as a conference paper at ICLR 2022 Now, we show that reducing the number of participants in P1D loosens the bound on the divergence in a smooth fashion. Because P1D is the building block of our other protocols, they inherit its robustness. The statement requires γ 1/3, but this constant is arbitrary; changing it to a different value will only influence the lower bound on g. Claim A.8. Fix any γ [1/3, 1], any number of users n, ε 1, and 0 < δ < 1. If g 3, b > 180 g2 δ, and p = 90 g2 δ then, for any neighboring databases X X [0, ]n that differ on user u, Dδ ((S Rγn 1D)(X)||(S Rγn 1D)(X )) ε g + |xu x u| This implies P1D satisfies (O(ε/γ), δ, γ)-robust shuffle privacy. Proof. The proof proceeds almost identically with that of Item 2 in Lemma 3.1. The key change is that privacy noise is now drawn from Bin(γbn, p). This means we take m = γbn. We again set t = g but now α = ε/3γg. Note that α < 1/3 because ε 1 and γg > 3γ > 1. As before, our assumption on b implies that p < 1/2 and we have |ˆxu ˆx u| g = t. So it remains to prove αmp 2t in order to apply Lemma A.1: αmp = α γbn 90g2 ln(2/δ) bε2n = α γ 90g2 ln(2/δ) ε2 = 30g ln(2/δ) Now we derive the target bound on the divergence. Via Lemma A.1, we have that D ˆδ ((S Rn 1D)(X)||(S Rn 1D)(X )) ˆε, where ˆε = |ˆxu ˆx u| ln 1 + α 1 α |ˆxu ˆx u| 3α (α small) g |ˆxu ˆx u| g 2 + |xu x u| g (Discretization) g + |xu x u| and ˆδ = 2 exp( α2mp/10) = 2 exp( α2 9γ g2 ln(2/δ) 2 exp( ln(2/δ)) = δ This concludes the proof. When we replace ε in the proof of Theorem 3.2 with ε/γ, we can invoke Claim A.8 to derive the following statement about the robust shuffle privacy of PVEC. Corollary A.9. For any 0 < ε 1, 0 < δ < 1, d, n N, and 2 > 0, there are choices of parameters b, g N and p (0, 1/2) for P1D (Algorithm 3.1) such that, for inputs X = ( x1, . . . , xn) of vectors with maximum norm || xi||2 2 and any γ [1/3, 1], PVEC is (ε/γ, δ, γ)-robustly shuffle private Because PVEC serves as a primitive for our optimization protocols, each round of those protocols ensures privacy in the face of two-thirds corruptions. For comparison, the generic construction by Beimel et al. (2020) is able to simulate arbitrary centrally private algorithms but requires an honest majority. Published as a conference paper at ICLR 2022 B PROOFS FOR SECTION 4 B.1 PROOFS FOR SECTION 4.1 We start by proving the guarantee for PSGD, Theorem 4.1. The first step is to formally define the gradient oracle setup that will be necessary for several of our results. Definition B.1. Given function ℓover Θ Rd, we say G: Θ Rd is an LG-noisy gradient oracle for ℓif for every θ Θ, (1) EG [G(θ)] ℓ(θ), and (2) EG h ||G(θ)||2 2 i L2 G. Furthermore, we say the gradient oracle G has variance at most σ2 if EG G(θ) EG [G(θ)] 2 2 σ2. The following lemma is standard for (noisy) stochastic gradient descent. Lemma B.2 (Restatement of Lemma 4.2). Suppose PSGD queries an LG-noisy gradient oracle at each iteration, then its output satisfies E ℓ( θ, D) min θ Θ ℓ(θ, D) + D2 2ηT + ηL2 G 2 . We now have the tools to prove Theorem 4.1. Proof of Theorem 4.1. The privacy guarantee follows directly from the privacy guarantee of the vector summation protocol (see Theorem 3.2), and the fact that the algorithm queries each user at most once. For the accuracy guarantee, our algorithm directly optimizes the excess population loss. For each time step t {0, 1, . . . , T 1}, we write the gradient gt = 1 b Pb i=1 θℓ(θt 1, xt,i) + Zt. By Theorem 3.2, we know that Ext,1,...,xt,b,Zt [ gt] = Ex D [ θℓ(θt 1, x)] and Ext,1,...,xt,b,Zt h || gt||2 2 i = Ext,1,...,xt,b,Zt i=1 θℓ(θt 1, xt,i) + Zt = Ext,1,...,xt,b i=1 θℓ(θt 1, xt,i) L2 + O d L2 log2(d/δ) where the second equality uses the independence of the data and noise, and the inequality follows from ℓbeing L-Lipschitz and Theorem 3.2. The gradient gt therefore qualifies as a noisy gradient oracle call with L2 G = L2 + O d L2 log2(d/δ) By Lemma 4.2 and the fact that there are n/b iterations, we get E ℓ( θ, D) min θ Θ ℓ(θ, D) + D2 2ηT + ηL2 G 2 min θ Θ ℓ(θ, D) + DLG T (AM-GM ineq.) = min θ Θ ℓ(θ, D) + D L2 + O d L2 log2(d/δ) = min θ Θ ℓ(θ, D) + O b n + d log2(d/δ) = min θ Θ ℓ(θ, D) + O d1/4DL log1/2(d/δ) εn Published as a conference paper at ICLR 2022 The final step uses our choice of b = Next, we prove the guarantee for PAGD, Theorem 4.3. Proof of Theorem 4.3. The privacy guarantee follows directly from the privacy guarantee of the vector summation protocol (see Theorem 3.2), and the fact that the algorithm queries each user at most once, and the remaining computation is post-processing. We now prove the accuracy guarantee. For any iteration t [T], let Zt be the noise introduced by the summation protocol PVEC and let gt = 1 b Pb i=1 θℓ(θt 1, xt,i) + Zt. We first bound the variance: E G(θt 1) E [G(θt 1)] 2 2 = Ext,1,...,xt,b,Zt i=1 θℓ(θt 1, xt,i) + Zt Ex D [ ℓ(θt 1, x)] = Ext,1,...,xt,b i=1 ℓ(θt 1, xt,i) Ex D [ ℓ(θt 1, x)] i=1 ℓ(θt 1, xt,i) b Var [ ℓ(θt 1, xt,1)] + E Zt 2 2 b Ext D h ℓ(θt 1, xt) Ex [ ℓ(θt 1, x)] 2 2 i + E Zt 2 2 b + d L2 log2(d/δ) The second equality comes from the fact that the data and noise are independent, the fourth equality uses the independence of xt,1, , xt,b and (5) follows from Theorem 3.2. Next, we use the following result from Lan (2012). Lemma B.3 (Restatement of Lemma 4.4). Suppose PAGD receives a noisy gradient oracle with variance (at most) σ2 in each iteration. Taking Lt = 1 t+1((T + 2)3/2 σ D + β), αt = 2 t+2, then the population loss of PAGD can be bounded as E ℓ( θ, D) min θ Θ ℓ(θ, D) + O βD2 In our case, we have σ2 = L2 b + d L2 log2(d/δ) b2ε2 so that E ℓ( θ, D) min θ Θ ℓ(θ, D) + O βD2 min θ Θ ℓ(θ, D) + O b + d L2 log2(d/δ) min θ Θ ℓ(θ, D) + O d DL log(d/δ) = min θ Θ ℓ(θ, D) + O n2 + DL n + d DL log(d/δ) = min θ Θ ℓ(θ, D) + O d2/5β1/5D6/5L4/5 log4/5(d/δ) ε4/5n4/5 + DL n The first step follows from Lemma 4.4, and the second step comes from Eq. (5). In the last step, we substitute b = n3/5d1/5L2/5 log2/5(d/δ) ε2/5β2/5D2/5 . Published as a conference paper at ICLR 2022 Next, we prove our improved guarantee for convex losses by combining PAGD and smoothing (Theorem 4.7). Proof of Theorem 4.7. We directly optimize the loss function ℓβ with PAGD. PAGD only requires the gradient ℓβ(θmd t , x) (t [T]). By the second item in Lemma 4.6, this can be obtained by computing β(θmd t proxℓ/β(θmd t )). Moreover, since ℓβ is 2L-Lipschitz, PAGD still guarantees (ε, δ)-differential privacy with a constant scaling of the noise. For accuracy, we have that E ℓ( θ, D) min θ Θ ℓ(θ, D) E ℓβ( θ, D) min θ Θ ℓβ(θ, D) + L2 d2/5β1/5D6/5L4/5 log4/5(d/δ) DL n + d1/3DL log2/3(d/δ) The first step follows from the third item in Lemma 4.6, the second step follows from the guarantee of PAGD (Theorem 4.3) and the fact that ℓβ is β-smooth, and the last step comes from our choice of β = ε2/3n2/3L d1/3D log2/3(d/δ). Finally, we prove Theorem 4.8, which applies to strongly convex ℓand is the final result about sequentially interactive protocols. Proof of Theorem 4.8. The privacy guarantee follows from the previously-proven privacy guarantee for PAGD and the fact that we split the users into disjoint subsets. The rest of the proof focuses on the population loss. We begin with the non-smooth setting. The high level idea for this case is that both the population loss and the distance to optimal solution shrink in each iteration. By recursively applying the shuffle private convex algorithm and solving the recursion, one can get improved convergence rate. Let θi be the output of the i-th phase and θ = arg minθ Θ ℓ(θ, D). Denote D2 i = E θi θ 2 2 and i = E ℓ( θi, D) ℓ(θ , D) . Due to the λ-strong convexity, we have D2 i 2 i λ . Since the guarantee of Theorem 4.7 uses D as a bound on the distance between the starting point and optimum, we have Di L ni + d1/3Di L log2/3(d/δ) L ni + d1/3L log2/3(d/δ) Here C is an universal constant, and the last step defines E. Hence, one has We know that 1 2L2/λ (due to strong convexity), E2 2L2/λn (due to definition), and therefore, 1/E2 O(n). Hence, after k = O(log log n) phases, we have k/E2 2, and the population loss is E ℓ( θ, D) min θ Θ ℓ(θ, D) + O L nk + d1/3L log2/3(d/δ) n2/3 k ε2/3 min θ Θ ℓ(θ, D) + O λn + d2/3L2 log4/3(d/δ) Published as a conference paper at ICLR 2022 This completes the proof for the non-smooth setting. Turning to the setting where ℓis also β-smooth, let A = d2/5β1/5L4/5 log4/5(d/δ) ε4/5 . Then our algorithm guarantees AD6/5 i n4/5 i + LDi ni A 3/5 i λ3/5n4/5 i + L i λni for some constant C > 1. First, suppose there exists i [k] such that i is already small, i.e. Then we claim t 210C4 L2 λni + A5/2 holds for all t i. We divide into cases. Case 1. Suppose L2 λ3/2n2 i . Then we have λ2/3n2 i+1 + 25C3 LA5/4 λ5/4n3/2 i+1 4C 26C3 A5/2 λ2/3n2 i + 25C3 A5/2 The first step follows from Eq. (6), ni = ni+1 = n/k, the second step follows our assumption. Case 2. Suppose L2 λ3/2n2 i . Then we have λ6/5n7/5 i+1 + 25C3 L2 λni + 25C3 L2 The first step follows from Eq. (6), ni = ni+1 = n/k, the second step follows our assumption. Next, suppose i T = 210C4 L2 λni + A5/2 , then we claim that i+1 is decreasing, i.e., i+1 i. Suppose this does not hold, since we have one of the following should hold 2C A 3/5 i λ3/5n4/5 i+1 i 2 or 2C L i p We argue this can not happens, since the former one implies i 25C5/2 A5/2 λ3/2n2 i and the later implies Combining the above argument, we know that i is decreasing, until it goes below a threshold T, and after that, the value i will always below such the threshold T. Now, suppose after k = O(log log n) iterations, k is still above the threshold T, then we know { i}i [k] is decreasing and we divide the algorithmic procedure into two phases. In the first phase, we have L i λn A 3/5 i λ3/5n4/5 i , this happens when i L10λ A10n3 i , and the algorithm satisfies i+1 4C A 3/5 i λ3/5n4/5 i+1 = 4C A 3/5 i λ3/5n4/5 i . Published as a conference paper at ICLR 2022 For the second phase, we have L i λn A 3/5 i λ3/5n4/5 and this happens when this happens when i L10λ A10n3 i , and the algorithm satisfies i+1 4C L i p λni+1 = 4C L i λni . We know one of the two phases will run for more than Ω(log log n) iterations. Suppose the former case holds, i.e., i+1 4C A 3/5 i λ3/5n4/5 i . Let E = 4CA λ3/5n4/5 i 5/2 25C5/2 A5/2 λ3/2n2 i . Then, we have It is easy to verify that 1/E poly(n), and therefore, after Ω(log log n) iterations, we know that i will drop below O(E) = 25C5/2 A5/2 λ3/2n2 i T. When later case holds, i.e., i+1 4C L i λni . Let E = 16C2 L2 λni , then we know that It is easy to verify that 1/E poly(n), and therefore, after Ω(log log n) iterations, we know that i will drop below O(E) = 16C2 L2 In summary, taking k = Ω(log log n), we know that E ℓ( θk, D) min θ Θ ℓ(θ, D) = k T = 210C4 L2 λn + dβ1/2L2 log(d/δ)2 This concludes the proof. B.2 PROOFS FOR SECTION 4.2 We now turn to proofs for our fully interactive protocol, PGD. The first step will be proving Lemma B.6, which provides a privacy and population loss guarantee for PGD when the loss function ℓis smooth. We will later combine this analysis for smooth losses with the smoothing trick used in the previous section to obtain guarantees for non-smooth losses. Since PGD passes the training data more than once and does not directly optimize the excess population loss, the proof of Lemma B.6 proceeds by bounding the empirical loss and the generalization error separately. First, we recall a result bounding empirical loss. Lemma B.4 (Bubeck et al. (2015)). Let ℓbe convex and L-Lipschitz over closed convex set Θ Rd of diameter D. Let σ2 denote the variance of the privacy noise in the gradient update step of PGD. For any set of data points S and η > 0, the output of PGD satisfies E ℓ( θ, S) min θ Θ ℓ(θ, S) D2 2ηT + η(L2 + σ2) Next, we bound generalization error. Lemma B.5 (Lemma 3.4 (Bassily et al., 2019)). Suppose ℓis convex, L-Lipschitz, and β-smooth over closed convex set Θ Rd of diameter D and η 2 β . Then PGD is α-uniformly stable with n . In other words, it satisfies ES Dn |ℓ( θ, S) ℓ( θ, D)| L2 Tη Published as a conference paper at ICLR 2022 With these two tools, we can state and prove Lemma B.6 Lemma B.6. Let ℓbe convex, L-Lipschitz, and β-smooth over a closed convex set Θ Rd of diameter D. Then PGD is (ε, δ)-shuffle private with population loss E ℓ( θ, D) min θ Θ ℓ(θ, D) + O DL n + d1/2DL log3/2(d/δ) Proof. For the privacy guarantee, since each iteration guarantees ( ε 2 2T log(1/δ), δ T +1)-differential privacy, and the algorithm runs for T iterations in total, advanced composition implies (ε, δ)-shuffle privacy overall. We now focus on the loss guarantee. First, by Lemma B.4, the empirical loss satisfies E ℓ( θ; S) min θ Θ ℓ(θ; S) + D2 2ηT + ηL2 G 2 min θ Θ ℓ(θ; S) + D2 2 + O ηd TL2 log3(nd/δ) where the second step follows from L2 G L2 + O( d T L2 log3(nd/δ) n2ε2 ) (see Eq. (4)). Moreover, by Lemma B.5, we know the generalization error is most L2 T η n . Hence, for S Dn, E ℓ( θ; D) min θ Θ ℓ(θ, D) E |ℓ( θ, D) ℓ( θ, S)| + E ℓ( θ, S) min θ Θ ℓ(θ, S) 2 + ηd TL2 log3(nd/δ) 2n2ε2 + L2 Tη d log3(nd/δ) The second step follows from Lemma B.5 and Lemma 4.2. Especially, we take expectation over S Dn in Lemma 4.2. The last step follows by taking η = D L T and T = min{n, ε2n2 d log2(nd/δ)}. We use δ 1/n for the final statement. We can now prove Theorem 4.9. As done previously, we use Moreau envelope smoothing to apply the guarantee of Lemma B.6. For the strongly convex case, we apply the same reduction to the convex case as used by Feldman et al. (2020). Proof of Theorem 4.9. The privacy guarantees are inherited from Lemma B.6. We employ Moreau envelope smoothing, as described in Lemma 4.6 and used earlier in the proof of Theorem 4.7, to construct ℓβ with β = L D min n, εn d1/2 log3/2(nd/δ) . Optimizing ℓβ yields ES Dn ℓ( θ; D) min θ Θ ℓ( θ, D) ES Dn ℓβ( θ; D) min θ Θ ℓβ( θ, D) + L2 DL n + d1/2DL log3/2(d/δ) 1 2 n + d1/2 log3/2(d/δ) DL n + d1/2DL log3/2(d/δ) The first step follows from the third item in Lemma 4.6, and the second step follows from Lemma B.6 and the definition of β. For strongly convex losses, we use the same reduction as (Feldman et al., 2020). Published as a conference paper at ICLR 2022 Lemma B.7 (Theorem 5.1 in (Feldman et al., 2020)). Suppose for any loss function that is convex and L-Lipschitz over a closed convex set Θ Rd of diameter D, there is an (ε, δ)-differentially private algorithm that achieves excess population loss of order Then for any λ-strongly convex, L-Lipschitz loss function, there is an (ε, δ)-differentially private algorithm that achieves excess population loss of order We note that this reduction works for the fully interactive model and keeps the privacy guarantee. In particular, one can divide users into the same set of groups as (Feldman et al., 2020) and apply PGD sequentially to these groups. By taking ρ = ε log3/2(nd/δ), we get the desired guarantee for strongly convex functions. B.3 SCO COMMUNICATION AND RUNTIME Recall from the discussion at the end of Section 3.1 that, with the scaling by trick, each user sends O(d[ n + log(1/δ)/ε2]) bits in each n-user invocation of PVEC, and the analyzer processes these in time O(dn[ n + log(1/δ)/ε2]). Communication and runtime guarantees each for SCO algorithm then follow as simple corollaries based on the number of rounds and the batch size in each round. Corollary B.8. In the algorithm described in Theorem 4.1, each user sends O(d[ b + log(1/δ)/ε2]) bits and the total runtime is b + log(1/δ) ε + log(1/δ) Corollary B.9. In the algorithm described in Theorem 4.3, each user sends O(d[ b + log(1/δ)/ε2]) bits and the total runtime is b + log(1/δ) " d1/10n3/10L1/5 log1/5(d/δ) ε1/5β1/5D1/5 + log(1/δ) Corollary B.10. In the algorithm described in Theorem 4.7, each user sends O(d[ b+log(1/δ)/ε2]) bits and the total runtime is b + log(1/δ) " d1/6n1/6 log1/3(d/δ) ε1/3 + log(1/δ) Note that the preceding corollary omits the time complexity of the smoothing step. A short discussion of doing this efficiently appears in Section 4 of the work of Bassily et al. (2019). Corollary B.11. In the algorithm described in Theorem 4.9, each user sends O d T n + log(1/δ) = min n, ε2n2 d log2(nd/δ) O d n + log(1/δ) bits in total, and the total runtime is O dn T n + log(1/δ) = min n, ε2n2 d log2(nd/δ) O dn n + log(1/δ) C PAN PRIVATE CONVEX OPTIMIZATION Finally, we turn to pan-privacy, where we assume users/data arrive in a stream, and the goal is to guarantee differential privacy against a single intrusion into the stream-processing algorithm s internal state. We formalize the model and provide definition in Section C.1, and provide algorithms in Section C.2. Pan-privacy was introduced by Dwork et al. (2010a); we use the presentation given by Amin et al. (2020) and Balcer et al. (2021). Published as a conference paper at ICLR 2022 C.1 MODELS AND DEFINITION Our focus will be on online algorithms. They receive raw data one element at a time in a stream. At each step in the stream, the algorithm receives a data point, updates its internal state based on this data point, and then proceeds to the next element. The only way the algorithm remembers past elements is through its internal state. The formal definition we use is below. Definition C.1. An online algorithm Q is defined by an internal algorithm QI and an output algorithm QO. Q processes a stream of elements through repeated application of QI : X I I, which (with randomness) maps a stream element and internal state to an internal state. At the end of the stream, Q publishes a final output by executing QO on its final internal state. As in the case of datasets, we say that two streams X and X are neighbors if they differ in at most one element. Pan-privacy requires the algorithm s internal state and output to be differentially private with regard to neighboring streams. Definition C.2 (Dwork et al. (2010a); Amin et al. (2020)). Given an online algorithm Q, let QI(X) denote its internal state after processing stream X, and let X t be the first t elements of X. We say Q is (ε, δ)-pan-private if, for every pair of neighboring streams X and X , every time t and every set of internal state, output state pairs T I O, PQ QI(X t), QO(QI(X)) T eε PQ QI(X t), QO(QI(X )) T + δ. (7) When δ = 0, we say Q is ε-pan-private. C.2 ALGORITHMS Due to the structural similarity between sequential shuffle privacy and pan-privacy, all the results developed in Section 4.1 have pan-private counterparts. Theorem C.3. Suppose a convex loss function ℓis L-Lipschitz over a closed convex set Θ Rd, then there exists an algorithm with output θ that guarantees (ε, δ)-pan privacy, and: (1) When the loss function is non-smooth, then the population loss satisfies ℓ( θ, D) min θ Θ ℓ(θ, D) + O DL n + d1/3DL log1/3(1/δ) (2) When the loss function is β-smooth, then the population loss satisfies ℓ( θ, D) min θ Θ ℓ(θ, D) + O DL n + d2/5β1/5D6/5L4/5 log2/5(1/δ) (3) When the loss function is λ-strongly convex, then the population loss satisfies ℓ( θ, D) min θ Θ ℓ(θ, D) + O λn + d2/3L2 log2/3(1/δ) (4) When the loss function is λ-strongly convex and β-smooth then the the population loss satisfies ℓ( θ, D) min θ Θ ℓ(θ, D) + O L2 λn + dβ1/2L2 log(1/δ) Proof. We analyze the pseudocode presented in Algorithm 6. Although it does not explicitly obey the syntax of an online algorithm given in Definition C.1, it is straightforward to see that the internal state is the tuple (θt, θag t , θmd t , gt) and an update to gt occurs when any data point is read. The updates to the θ variables also occur after the last data point of a batch is read. We first prove Algorithm 6 is (ε, δ)-pan private. Lemma C.4. For 0 < ε, δ < 1, Algorithm 6 is (ε, δ)-pan-private. Published as a conference paper at ICLR 2022 Algorithm 6 Pan-private AC-SA Require: Batch size b, privacy parameter ε, stream length n, learning rate sequence {Lt}, {αt} 1: Initialize noise parameter ζ2 8L2 log(2/δ) b2ε2 2: Initialize parameter estimate θag 1 = θ1 Θ, and set number of iterations T = n/b 3: for t = 1, 2, . . . , T do 4: Draw fresh noise Zt N 0, ζ2Id 5: Initialize gradient estimate gt Zt 6: for i = 1, 2, . . . , b do 7: Update gradient estimate gt gt + 1 b θℓ(θmd t , xt,i) 8: end for 9: gt gt + N 0, ζ2Id 10: Update parameter estimate θt+1 arg minθ Θ gt, θ θt + Lt 11: Update aggregated parameter estimate θag t+1 αtθt+1 + (1 αt)θag t 12: Update middle parameter estimate θmd t+1 αtθt+1 + (1 αt)θag t . 13: end for 14: Output θag T +1 Proof. Let t (resp. i ) be the batch number (resp. position within the batch) where the adversary intrudes. We will show that Dδ (θt , θag t , θmd t , gt , θag T +1||θ t , θag t , θmd t , g t , θag T +1 ) ε where the indicates variables generated from an execution on neighboring stream X . For neighboring streams X, X , let t denote the batch number and let i denote the number of the sample within the batch where the two streams differ. We will perform case analysis around t. Case 1: t < t. Here, the tuple (θt , θag t , θmd t , gt ) is identically distributed with (θ t , θag t , θmd t , g t ) which means we only need to prove Dδ (θag T +1||θag T +1 ) ε. Observe that θag T +1 (resp. θag T +1 ) is a post-processing of (θt, θag t , θmd t , gt) (resp. (θ t, θag t , θmd t , g t)), which means that we only need to bound Dδ (θt, θag t , θmd t , gt||θ t, θag t , θmd t , g t). Furthermore, the parameter estimates are generated from a post-processing of the gradient updates, it suffices to bound Dδ (gt||g t) after line 9. Recall the Gaussian mechanism: Lemma C.5. For ε < 1 and function f : An Bd on size-n databases with ℓ2-sensitivity 2(f) = max neighboring a,a An ||f(a) f(a )||2 f(a) + N 0, 2 2(f)2 log(2/δ) where Id is the d-dimensional identity matrix, is (ε, δ)-centrally private. For a proof, refer to Theorem A.1 in the survey of Dwork and Roth (Dwork and Roth, 2014). Since ℓis L-Lipschitz in θ, for any x we have || θℓ( , x)||2 L. Thus max x,x X || θℓ( , x) θℓ( , x )||2 2L. Thus if we define f(xt,1, . . . , xt,b) = 1 b Pb i=1 θℓ(θt, xt,i), we get 2(f) 2L b . By Lemma C.5, it follows that our addition of Zt N 0, 8L2 log(2/δ) b2ε2 noise guarantees (ε, δ)-privacy Case 2: t = t. We break into more cases around i . When i i < b, the adversary only observes a change to the gradient (line 7). Specifically, the tuple (θt , θag t , θmd t , θag T +1) is identically distributed with (θ t , θag t , θmd t , θag T +1 ) which means it suffices to prove Dδ (gt||g t) ε after line 7. This follows from the privacy of the Gaussian mechanism. Published as a conference paper at ICLR 2022 When i b = i , observe that the parameter updates are post-processing of the gradient update. We again use the privacy of the Gaussian mechanism. When i < i, we re-use the arguments from Case 1. The noise from line 9 provide privacy. Case 3: t > t. Observe that θag T +1 (resp. θag T +1 ) is a post-processing of (θt , θag t , θmd t , gt ) (resp. (θ t , θag t , θmd t , g t )), which means that we only need to bound Dδ (θt , θag t , θmd t , gt ||θ t , θag t , θmd t , g t ). But the random variables in question are obtained by post-processing (θt, θag t , θmd t , gt) and (θ t, θag t , θmd t , g t). We use the same line of reasoning as Case 1 to bound the divergence. The main tool for our utility guarantees is bounding the variance of the noise added for privacy, analogous to the variance guarantee of Theorem 3.2. Once we have that result, the remaining analyses are essentially unchanged from their shuffle counterparts, apart from slightly different choices for parameters b and β, which are straightforward algebraic consequences of the different variance guarantee. As a result, we only bound the variance. For each time step t [T], the gradient gt = 1 b Pb i=1 θℓ(θt 1, xt,i) + Zt,a + Zt,b, where Zt,a, Zt,b N 0, ζ2Id . It is clearly an unbiased estimator and the variance satisfies Ext,1,...,xt,b,Zt,a,Zt,b i=1 θℓ(θt 1, xt,i) + Zt,a + Zt,b Ex D [ ℓ(θt 1, x)] = Ext,1,...,xt,b i=1 ℓ(θt 1, xt,i) Ex D [ ℓ(θt 1, x)] + E Zt,a|2 2 + E Zt,b 2 2 b Ext D h ℓ(θt 1, xt) Ex [ ℓ(θt 1, x)] 2 2 i + E Zt,a 2 2 + E Zt,b 2 2 b + d L2 log(1/δ) The first step follows from the fact that the noise Zt,a, Zt,b is independent of the data xt,1, , xt,b, and the second step follows from the independence of xt,1, , xt,b. The last step follows from Zt,a, Zt,b N 0, ζ2Id and ζ2 8L2 log(2/δ)