# concurrent_shuffle_differential_privacy_under_continual_observation__7e8a2ed4.pdf Concurrent Shuffle Differential Privacy Under Continual Observation Jay Tenenbaum 1 Haim Kaplan 2 1 Yishay Mansour 2 1 Uri Stemmer 2 1 We introduce the concurrent shuffle model of differential privacy. In this model we have multiple concurrent shufflers permuting messages from different, possibly overlapping, batches of users. Similarly to the standard (single) shuffle model, the privacy requirement is that the concatenation of all shuffled messages should be differentially private. We study the private continual summation problem (a.k.a. the counter problem) and show that the concurrent shuffle model allows for significantly improved error compared to a standard (single) shuffle model. Specifically, we give a summation algorithm with error O(n1/(2k+1)) with k concurrent shufflers on a sequence of length n. Furthermore, we prove that this bound is tight for any k, even if the algorithm can choose the sizes of the batches adaptively. For k = log n shufflers, the resulting error is polylogarithmic, much better than Θ(n1/3) which we show is the smallest possible with a single shuffler. We use our online summation algorithm to get algorithms with improved regret bounds for the contextual linear bandit problem. In particular we get optimal O( n) regret with k = Ω(log n) concurrent shufflers. 1. Introduction Differential privacy, introduced in the seminal work of Dwork et al. (2006), is a formal notion of privacy that enables the release of statistical information about a set of users, without compromising the privacy of any individual user. Briefly, differential privacy requires that any change in a single user s data changes the probability of any output of the computation by only a limited amount. Differential 1Google Research 2Blavatnik School of Computer Science, Tel Aviv University. Correspondence to: Jay Tenenbaum , Haim Kaplan , Yishay Mansour , Uri Stemmer . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). privacy has been extensively studied under many different sub-models of privacy. On one end of the spectrum lies the centralized model where the users trust the server who holds a database of their data and is liable to protect it. This server receives a (maybe interactive) sequence of queries to the database, and must make sure that the published aggregated statistics respect the privacy constraints. On the other end of the spectrum lies the (strictly stronger) local model of differentialy privacy (LDP), where each user privatizes its own data prior to sending it to the server. In the middle between the local and the centralized models lies the shuffle model (Bittau et al., 2017; Cheu et al., 2019; Ghazi et al., 2019; Balle et al., 2019; Erlingsson et al., 2019; 2020). This model introduces a trusted shuffler (or shuffler for short) that receives (encoded) messages from the users and permutes them (i.e., disassociates a message from its sender) before they are delivered to the server.1 Intuitively, protocols in the shuffle model ensure that sufficiently many reports are collected and shuffled together so that any one report can hide in a shuffled batch. Motivated by real-world applications, such as monitoring systems or search trends which output continual statistics, Dwork et al. (2010a) studied differential privacy under continual observation . 
They focus on the fundamental problem of private summation under continual observation (PSCO) in which the server continuously reports an approximation of an accumulating sum of bounded values in [0, 1], one from each new user. They guarantee event-level privacy, which protects the content of a single user in a single interaction with the server. Over n users, Dwork et al. (2010a) and Chan et al. (2011) reduced the trivial O( n/ε) error achieved by advanced composition (and is optimal in LDP Vadhan (2017)), to O(polylog(n)/ε) using a tree-based algorithm. Briefly, they maintain a binary tree with leaves 1, . . . , n, where the t th leaf contains the value of the t th user, and each internal node (not a leaf or the root) contains the sum of the leaves in its subtree with added noise Lap( Θ(1/ε)) to ensure privacy. The error bound follows 1For privacy analysis, we assume that the shuffle is perfectly secure, i.e., its output contains no information about which user generated each of the messages. This is traditionally achieved by the shuffler stripping implicit metadata from the messages (e.g., timestamps, routing information), and frequently forwarding this data to remove time and order information. Concurrent Shuffle Differential Privacy Under Continual Observation since any partial sum is computed with log n internal nodes.2 Note that private (non-continual) summation has optimal Θ(1/ε) (Dwork et al., 2006) and Θ( n/ε) (Beimel et al., 2008; Chan et al., 2012) errors in the centralized and local models, respectively.3 In this paper, we apply the shuffle model in the continual observation setting, to get smaller errors. The first technicality that must be addressed is that, traditionally, shuffle model protocols assume only a single interaction with the shuffler. That is, most existing protocols assume that all of the users submit their (encoded) messages to the shuffler simultaneously. This is, of course, unsuitable for online or interactive settings in which the users arrive one by one in a sequential manner. To address this and to adapt the shuffle model to online learning settings, Tenenbaum et al. (2021) and Cheu et al. (2022) considered a model in which the arriving users are partitioned into consecutive batches, and each batch of users interacts with the shuffler separately. Specifically, in their model, when a user arrives it submits its (encoded) message to the shuffler, that stores all the (encoded) messages it receives until the current batch ends. When the batch ends, the shuffler reveals to the server a random permutation of all the (encoded) massages it received from the users of this batch, and the server postprocesses this in order to estimate the desired statistics. We call this the sequential shuffle model. In this model, using the techniques above and selecting the optimal batch size for the PSCO problem, gives an algorithm with an error of O(n1/3). 1.1. Our Contributions The concurrent shuffle model: We ask can we leverage a small number of shufflers working concurrently to design an improved algorithm for the PSCO problem (with error less than O(n1/3))? We introduce a novel model of differential privacy under continual observation, which we call the Concurrent Shuffle model. In this model, we have k N different shufflers working concurrently. At each time, a user arrives and sends information to a subset of the k shufflers. Once a shuffler is full, i.e., all its users have arrived, it sends its data to the server in a random order. 
Then this shuffler is reset and reused for subsequent users.

Algorithm for PSCO: A weakness of the sequential shuffle model of Tenenbaum et al. (2021); Cheu et al. (2022) is that the server does not get any information about the users in the current batch until it fills up. Thus, intuitively, the error increases with the size of the batch. On the other hand, when we try to address this by reducing the size of the batch, the number of batches increases. This intuitively also increases the error, because for each batch we must introduce sufficient noise to hide any single user of this batch, and hence the overall noise grows with the number of batches. The O(n^{1/3}) bound for the PSCO problem comes from balancing these two sources of error. [Footnote 2: Advanced composition of differential privacy: the combination of k (ε, δ)-differentially private algorithms is (O(kε^2 + ε√(2k log(1/δ'))), kδ + δ')-differentially private for any δ' > 0 (Dwork et al., 2010b). Footnote 3: The Θ̃(·) notation hides polylogarithmic terms in n and in the remaining parameters.]

We show how to improve this tradeoff using concurrent shufflers. Even with k = 2 concurrent shufflers we reduce the error significantly. The idea is to use two concurrent shufflers of different sizes: the larger one lets us cover a substantial part of the already-arrived prefix using a small number of batches, while the smaller one reduces the number of users we are blind to because we have not yet received the reports of the currently active shufflers. As k increases we can control this tradeoff more finely. Consequently, for PSCO with k concurrent shufflers we derive an optimal algorithm with worst-case error Õ(k^{3/2} n^{1/(2k+1)}) with high probability. In particular, with k = log(n) concurrent shufflers we recover the same polylog(n) error achievable in the centralized model of DP. This is a significant improvement over the O(n^{1/3}) error achievable in the sequential shuffle model (which we show is optimal for that model). One can view our algorithm as a variation of the tree-based algorithm of Dwork et al. (2010a), with different internal node degrees and a different method of approximating the running sum.

Lower bound for binary summation: For private summation of bounded values, and specifically binary values, we show a lower bound which matches the performance of our algorithm for any value of k. That is, for any number k of concurrent shufflers, we prove that no concurrent shuffle mechanism can have smaller error than what we achieve. This lower bound holds even for algorithms that choose the mechanisms (encoders and sizes) adaptively. In particular, our proof implies that the algorithm with O(n^{1/3}) error is optimal in the sequential shuffle model (in which k = 1).

Application to linear contextual bandits: We use our private summation algorithm to devise a new algorithm for contextual linear bandits with arbitrary contexts (even adversarial) in the concurrent shuffle model. Our algorithm is based on LinUCB (Abbasi-Yadkori et al., 2011) and achieves regret Õ(k^{3/4} n^{(k+1)/(2k+1)} / √ε) with k concurrent shufflers. Specifically, for k = log n, this regret is Õ(√n / √ε), which has optimal asymptotic dependence on n even without privacy. This improves over the result of Chowdhury & Zhou (2022b), who obtained regret Õ(n^{3/5} / √ε) in the sequential shuffle model; a short numerical comparison of these rates appears below.
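To make these rates concrete, the following small Python snippet (an illustration only; it uses just the exponents of n stated above and ignores constants, logarithmic factors, and the dependence on ε) tabulates the summation-error exponent 1/(2k+1) and the regret exponent (k+1)/(2k+1) for a few values of k.

```python
import math

def summation_error_exponent(k: int) -> float:
    # Summation error of the concurrent-shuffle algorithm scales as n^(1/(2k+1)).
    return 1 / (2 * k + 1)

def regret_exponent(k: int) -> float:
    # Regret of the LinUCB-based algorithm scales as n^((k+1)/(2k+1)).
    return (k + 1) / (2 * k + 1)

n = 10**6  # illustrative horizon
for k in (1, 2, 3, int(math.log2(n))):
    print(f"k={k:2d}: error ~ n^{summation_error_exponent(k):.3f}"
          f" = {n ** summation_error_exponent(k):10.1f},"
          f" regret ~ n^{regret_exponent(k):.3f}"
          f" = {n ** regret_exponent(k):12.1f}")
```

For k = 1 the error exponent is 1/3, matching the sequential shuffle model, and as k grows the error exponent approaches 0 (polylogarithmic error) while the regret exponent approaches 1/2.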
That is, we reduce the regret to optimal, from O(n3/5/ ε) to O( n/ ε), while essentially maintaining the same trust model (small number Concurrent Shuffle Differential Privacy Under Continual Observation of concurrent shufflers instead of just one). This answers positively the open question of Chowdhury & Zhou (2022b). We believe that the concurrent shuffle model may allow to get improved utility (compared to the sequential shuffle model) for additional problems. All missing proofs and algorithms appear in the Appendix. 1.2. Further Related Work A large body of research has recently studied the PSCO problem with horizon n in several different settings. In Local DP (LDP), Vadhan (2017) give an algorithm and a matching lower bound for error O( n/ε). In the centralized model, Henzinger & Upadhyay (2022) study finegrained error bounds, and Ghazi et al. (2022) give upper and lower bounds for more general tree structures. Other works considered ℓp extensions, high dimensional variants, or improvements and applications of PSCO.4 Several works have studied the private multi-armed bandit problem (Mishra & Thakurta, 2015; Tossou & Dimitrakakis, 2017; Sajed & Sheffet, 2019; Ren et al., 2020a; Chen et al., 2020; Zhou & Tan, 2021; Dubey, 2021), the private contextual linear bandit problem (Shariff & Sheffet, 2018; Zheng et al., 2020; Han et al., 2020; Ren et al., 2020b; Garcelon et al., 2022), and the more general private reinforcement learning (Vietri et al., 2020; Garcelon et al., 2021; Chowdhury & Zhou, 2022a) problem, in both local and centralized models of privacy. The regret gap between the two models (when the contexts are arbitrary, not stochastic (Han et al., 2021)) has shrunk using the intermediate sequential shuffle model (Tenenbaum et al., 2021; Chowdhury & Zhou, 2022b; Garcelon et al., 2022). See Section 5 for further discussion of these results for private contextual linear bandits. If we are not in the continual observation privacy model then it is known that several shuffle mechanisms can run in parallel using a single shuffler (Cheu et al., 2022; Cheu, 2021). This is not true when the adversary continually observes reports of the algorithm. In fact we show that using k concurrent shufflers substantially improves accuracy under continual observation. 2. Background and Preliminaries In this paper, we study the power of using concurrent shufflers for problems with continual reporting, focusing specifically on private summation under continual observation (PSCO). We first define the problem of PSCO, then present 4Bolot et al. (2013); Smith et al. (2017); Fichtenberger et al. (2021); Kairouz et al. (2021); Upadhyay & Upadhyay (2021); Upadhyay et al. (2021); Denissov et al. (2022); Jain et al. (2022); Cardoso & Rogers (2022); Epasto et al. (2023); Henzinger et al. (2023) Figure 1. The tree-based mechanism. Each internal node vj i contains an estimate Svj i of the sum Pj l=i bl. At the 3 rd time (when b3 participates), we approximate P3 i=1 bi by Sv2 1 + Sv3 3. The approximation errors in each Svj i are iid η L( log n the tree-based mechanism that solves it with polylogarithmic error, and finally review the shuffle model of differential privacy. 2.1. Private Summation Under Continual Observation Motivated by monitoring event occurrences over time, Dwork et al. (2010a) studied the problem of private continual reporting the running sum of bounded real values that arrive online. 
Specifically, they considered binary values and reported the number of 1s that arrived by time t, while protecting the value reported at each particular time. Recall the definition of Dwork et al. (2010a), which bounds the additive errors of the released sum estimates over all times:

Definition 2.1. A randomized streaming algorithm yields an (n, α, β) summation algorithm if for every input b = (b_1, ..., b_n) it holds that Pr[ ∀t = 1, ..., n : |Ŝ_t − Σ_{i=1}^t b_i| ≤ α ] ≥ 1 − β, where Ŝ_t denotes the random estimate released by the algorithm after observing b_1, ..., b_t.

2.2. The Tree-based Mechanism

For the task of private summation in the centralized model, Dwork et al. (2010a) devised a tree-based mechanism. They assumed for simplicity that n = 2^i, and let T be a complete binary tree whose leaf nodes are associated with the values b_1, ..., b_n. Each node x ∈ T stores the sum of all the leaf nodes in the subtree rooted at x. They observed that one can compute any partial sum Σ_{j=1}^t b_j using the sums stored in at most p := log(n) nodes of T, and that for any two neighboring data sequences D and D', the sums stored at no more than p nodes of T differ. They also observed that the error at each time is a sum of at most p = log(n) noises. Hence, to make sure that the entire tree is (ε, δ)-DP, by simple composition it suffices to ensure that each node preserves (ε/p)-DP. By applying a concentration inequality over the p noises and a union bound over the horizon n, they conclude that this algorithm yields a summation algorithm with an error term α which grows like Õ(1/ε). An illustration of the tree-based mechanism appears in Figure 1.

2.3. Shuffle-model Privacy

In the well-studied shuffle model of privacy, there are n users, each with data x_i ∈ X. Each user applies some encoder E : X → Y* to their data and sends the messages E(x_i) = (y_{i,1}, ..., y_{i,p}) to a shuffler S. The shuffler then shuffles all the messages y_{i,j} from all the users and outputs them in a uniformly random order to the server. Thus, the shuffle mechanism is defined by the pair (E, n) (see footnote 5). We say that such a mechanism M is (ε, δ)-shuffle differentially private (or (ε, δ)-SDP for short) if the shuffler's output is (ε, δ)-differentially private; more formally: a mechanism M = (E, n) is (ε, δ)-SDP if for any pair of inputs {x_i}_{i=1}^n and {x'_i}_{i=1}^n which differ in at most one value, we have for all B ⊆ Y*:

Pr[S((E(x_i))_{i=1}^n) ∈ B] ≤ e^ε · Pr[S((E(x'_i))_{i=1}^n) ∈ B] + δ,

where (z_i)_{i=1}^n denotes the concatenation of the values z_1, ..., z_n.

3. Concurrent Shuffle Differential Privacy

To adapt the shuffle model to adaptive algorithms (e.g., bandits, sum estimates, etc.) under continual observation, Tenenbaum et al. (2021); Cheu et al. (2022) and Chowdhury & Zhou (2022b) divide the users into consecutive batches and run a shuffle-DP (SDP) mechanism over each batch separately. When a new batch starts, the server selects the next shuffle mechanism (encoder and size), possibly as a function of the outputs of the previous shuffle mechanisms (i.e., it may be adaptive). We refer to this as the sequential shuffle model. We consider a generalized model of privacy under continual observation which has similar privacy guarantees, but is more powerful and enables improved accuracy. In this model, a set of k ∈ N different shufflers are executed concurrently throughout the run of the algorithm, and each user can participate in several different shuffle mechanisms (each individual shuffler still runs an ordinary shuffle mechanism as in Section 2.3; an illustrative sketch of one such mechanism follows).
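As a concrete (and deliberately simple) illustration of a single shuffle mechanism (E, m) for binary summation, here is a minimal Python sketch based on binary randomized response. This is not the paper's mechanism M_bin^sum (which is Algorithm 3 of Tenenbaum et al. (2021)); the flip probability gamma would have to be calibrated to (ε, δ) via amplification-by-shuffling arguments, which we omit, and all names are ours.

```python
import random

def encode_bit(b, gamma):
    """Encoder E for one user: report the true bit with prob. 1 - gamma,
    otherwise a uniformly random bit (binary randomized response).
    Returns a tuple of messages (here, a single message)."""
    if random.random() < gamma:
        return (random.randint(0, 1),)
    return (b,)

def shuffle_and_release(all_messages):
    """The shuffler: output all collected messages in a uniformly random order."""
    out = list(all_messages)
    random.shuffle(out)
    return out

def estimate_sum(shuffled, m, gamma):
    """Server-side analysis: each message has expectation (1-gamma)*b + gamma/2,
    so E[sum of messages] = (1-gamma)*S + gamma*m/2; solve for S to debias."""
    return (sum(shuffled) - gamma * m / 2) / (1 - gamma)

# Usage on one batch of m users:
m, gamma = 1000, 0.2   # gamma must be calibrated to (epsilon, delta); omitted here
bits = [random.randint(0, 1) for _ in range(m)]
messages = [msg for b in bits for msg in encode_bit(b, gamma)]
estimate = estimate_sum(shuffle_and_release(messages), m, gamma)
print(estimate, sum(bits))
```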
We can reuse each shuffler again and again, each time possibly with a different mechanism. Each shuffler i = 1, . . . , k, at each time, can either be active or inactive. The active shufflers are in the process of accumulating messages from the users. Once an active shuffler fills up (i.e. all its users have sent it their encoded data), it is executed and its data is sent to the server in a random order. Then it is marked inactive, until the server decides to reuse it on a new batch. Note that in this model each user t can participate in the (up to k) active shufflers 5Traditionally, the algorithm which analyzes the shuffled output was part of the mechanism, but in this paper the server analyzes the outputs of many shufflers together. at time t. As a simplifying assumption, we assume that the server is deterministic, i.e., the decisions to allocate new mechanisms and their sizes, and the statistic estimates are all deterministic functions of the outputs of the previously executed shuffle mechanisms. (The shuffle mechanisms are randomized though). We formalize this in Algorithm 1 below. Algorithm 1 Concurrent Shuffle Differential Privacy Under Continual Observation with k Shufflers Input: k shufflers. Processing: for each time t = 1, . . . , n: 1. The server activates zero or more inactive shufflers. It assigns each of them a shuffle mechanism (encoder E and size m) that may depend on the outputs of previously executed shuffle mechanisms. 2. For each active shuffle mechanism M = (E, m), the t th user sends the encoding E(bt) of its private value bt to the shuffler. 3. Each shuffle mechanism that has become full, sends the server its shuffled received data, and is deactivated. 4. The server outputs an estimate ˆyt. Privacy. Recall that the private data of the t th user is its value bt. In the sequential shuffle model (Tenenbaum et al., 2021; Cheu et al., 2022; Chowdhury & Zhou, 2022b), since each user participates in exactly one shuffle mechanism, we ensure (ε, δ) differential privacy of the entire algorithm by making each executed shuffle mechanism (ε, δ)-SDP. In our model, a user can participate in several different shuffle mechanisms. Thus ensuring that each shuffle mechanism is (ε, δ)-SDP is not enough to guarantee that the overall algorithm is (ε, δ)-DP. Specifically, for a given input {xi}n i=1, consider the (adaptive) sequence of shuffle mechanisms (Ej, mj) and shufflers Sj, run on batches of users Bj (|Bj| = mj) by some algorithm A. Let A({xi}n i=1) Y be the vector Sj((Ej(u))u Bj) generated by concatenating the outputs of all these shuffle mechanisms.6 We say that A is (ε, δ)-concurrent shuffle differentially private (CSDP) if for any pair of inputs {xi}n i=1 and {x i}n i=1 which differ in at most one value, we have for all B Y : Pr[A({xi}n i=1) B] eε Pr[A({x i}n i=1) B] + δ. The main trade-off we address in this paper is how the error (the smaller the better) depends on the number of shufflers k that the algorithm uses. We demonstrate the power of the concurrent shuffle model 6Note that output of the algorithm throughout all times is a post-processing of A({xi}n i=1). Concurrent Shuffle Differential Privacy Under Continual Observation on the specific problem of private summation, where each user has a private bounded value, and we estimate the running sum. 3.1. Private Summation for Concurrent Shufflers Consider the PSCO problem, where the value of the t th user is [0, 1], and in every time t we estimate the sum Pt s=1 bs while protecting the values of the users. 
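Before comparing error rates in the different models, the following Python skeleton summarizes the interaction pattern of Algorithm 1 that the algorithms below instantiate: the server adaptively activates shufflers with an encoder and a batch size, every arriving user feeds all currently active shufflers, and each shuffler that fills up releases its messages in a uniformly random order and becomes available again. The class and callback names (Shuffler, plan, analyze) are our own illustrative choices rather than notation from the paper, and the server's strategy and encoders are left abstract.

```python
import random

class Shuffler:
    """One of the k concurrent shufflers: collects encoded messages until its batch is full."""
    def __init__(self):
        self.active = False
        self.encoder = None
        self.batch_size = 0
        self.users_seen = 0
        self.messages = []

    def activate(self, encoder, batch_size):
        self.active, self.encoder, self.batch_size = True, encoder, batch_size
        self.users_seen, self.messages = 0, []

    def receive(self, value):
        # A participating user contributes the (possibly multi-message) encoding of its value.
        self.messages.extend(self.encoder(value))
        self.users_seen += 1

    def is_full(self):
        return self.active and self.users_seen == self.batch_size

    def release(self):
        # Output the collected messages in a uniformly random order, then become inactive.
        random.shuffle(self.messages)
        self.active = False
        return list(self.messages)


def run_concurrent_shuffle(values, k, server):
    """Skeleton of Algorithm 1. `server` supplies two callbacks (illustrative interface):
    plan(t, outputs) -> list of (shuffler_index, encoder, batch_size) to activate at time t;
    analyze(outputs) -> current estimate, computed only from released shuffler outputs."""
    shufflers = [Shuffler() for _ in range(k)]
    outputs, estimates = [], []
    for t, b_t in enumerate(values, start=1):
        # 1. The server activates inactive shufflers, adaptively in previously released outputs.
        for idx, encoder, size in server.plan(t, outputs):
            assert not shufflers[idx].active
            shufflers[idx].activate(encoder, size)
        # 2. The arriving user sends its encoded value to every active shuffler.
        for s in shufflers:
            if s.active:
                s.receive(b_t)
        # 3. Full shufflers release a random permutation of their messages and are deactivated.
        for s in shufflers:
            if s.is_full():
                outputs.append(s.release())
        # 4. The server outputs an estimate based only on released (shuffled) outputs.
        estimates.append(server.analyze(outputs))
    return estimates
```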
In the sequential shuffle model (one shuffler), running a standard shuffle-model summation mechanism over O(n^{2/3}) batches of size O(n^{1/3}) attains error O(n^{1/3}) (see Section 3.2). In Section 4, we show that this is optimal for the sequential shuffle model. We now give an algorithm in the concurrent shuffle model with lower (polylogarithmic) error, assuming we have k = Θ(log n) shufflers. In Section 3.2, we extend this algorithm to any k ≤ log n shufflers and analyze its error.

3.1.1. TREE-BASED ALGORITHM IN THE CONCURRENT SHUFFLE MODEL

In this section, we adapt the well-known tree-based algorithm (Dwork et al., 2010a; Chan et al., 2011), originally built for private summation in the centralized model of differential privacy, to private summation in the concurrent shuffle model, to give an algorithm with additive error term α = Õ(1/ε). We have a balanced binary tree in which the t-th leaf corresponds to the data of the user arriving at time t. With each internal node, we associate a batch that contains all users in its subtree. These batches are processed by k = log n − 1 shufflers, one for all nodes in a single level of the tree. We compute a noisy sum of the batch associated with node v using a shuffle-DP (SDP) summation mechanism M_v^sum. We assume that by analyzing the output of M_v^sum, we can get an unbiased estimate of the sum with sub-gaussian additive error, which does not depend on the input.

Definition 3.1. For t = 1, ..., n, let V_t be the set of all highest-level internal nodes in the tree whose inputs have all arrived and whose associated shuffle mechanism has completed. This is the set of all left siblings of internal nodes on the path from the root to the (t + 1)-th leaf.

To estimate the sum at time t, we sum the estimates of the set of nodes V_t. Algorithm 2 gives a formal description, and Figure 2 illustrates it and the definition of V_t. We bound the error of Algorithm 2 in Theorem 3.2. The proof is similar to the proof of Theorem 4.1 in Dwork et al. (2010a). It uses the fact that the additive error of M_v^sum has zero mean and is sub-gaussian, and requires an upper bound V_{n,ε,δ} on its variance over batches of size at most n for privacy parameters ε and δ.

Figure 2. The tree-based mechanism for k = 2 shufflers. Each internal node v_i^j contains an estimate S_{v_i^j} of the sum Σ_{l=i}^j b_l, computed using a shuffle mechanism for summation. Internal nodes of height 1 (v_1^2, v_3^4, v_5^6, and v_7^8) and of height 2 (v_1^4 and v_5^8) apply shufflers of size 2 and 4, respectively, to the data at their leaves. At times 6 and 7, V_6 = V_7 = {v_1^4, v_5^6}, hence the sum approximation is S_{v_1^4} + S_{v_5^6}; at time 8, V_8 = {v_1^4, v_5^8}, giving the approximation S_{v_1^4} + S_{v_5^8}.

Theorem 3.2. Consider a run of Algorithm 2 with parameters n ∈ N and ε, δ > 0. At each time t the error of the sum estimate is sub-gaussian with variance at most O(V_{n, ε/log n, δ/log n} · log n). Moreover, the algorithm is (ε, δ)-CSDP and an (n, α, β) summation algorithm for α = O( √(V_{n, ε/log n, δ/log n}) · (√(log(1/β) · log n) + log n) ), for any β ∈ (0, 1).

We apply Theorem 3.2 to two specific instances of private summation in the shuffle model. The first is for private binary summation, with Algorithm 3 of Tenenbaum et al. (2021) as the shuffle mechanism, denoted by M_bin^sum. For any ε < 1 and δ > 0, it has unbiased sub-gaussian error with variance V_{n,ε,δ} = O(log(1/δ)/ε^2).
The second is for privately summing whole vectors and matrices with bounded l2 norms, with the mechanism from Appendix C of (Chowdhury & Zhou, 2022b) as the shuffle mechanism, denoted by M sum vec . It relies on an efficient and accurate mechanism of Cheu et al. (2022), and for any ε < 15 and δ < 1/2 it gets unbiased sub-gaussian error with variance Vn,ε,δ = O (log(d/δ))2 ε2 for each entry separately. Corollary 3.3. Consider a run of Algorithm 2 on a binary input with parameters n N, ε < 1, δ < 1/2 and using M sum bin . At each time t, the error is sub-gaussian with variance at most O 1 ε2 . Moreover, the algorithm is (ε, δ)-CSDP and an (n, α, β) summation algorithm for α = O 1 ε . The same holds for bounded vector inputs using M sum vec , for each entry separately. From Figure 2, we can see that Algorithm 2 requires k = log n 1 concurrent shufflers, one for each internal level in the tree. Note that we can optimize the memory of the algorithm to O(log n), by observing that there is no use in memorizing the sum estimate for two siblings, since the sum estimate of their parent replaces them and has lower Concurrent Shuffle Differential Privacy Under Continual Observation Algorithm 2 CSTA (Concurrent Shuffle Tree-Based Algorithm) Tree. Instantiate a binary tree with leaves corresponding to the numbers 1, . . . , n. Internal Nodes. Each internal node (not the root or a leaf) v in the binary tree, is associated with the set Lv of its leaf descendants. Processing. For every time step t {1, . . . , n} corresponding to the t th leaf in the tree: 1. For every internal node v for which t is its leftmost descendant, allocate a new shuffle mechanism M sum v for private summation with privacy parameters (ε/ log n, δ/ log n) and shuffle size |Lv|, to the shuffler associated with the level of v. 2. The leaf node t receives value bt. 3. Let the leaf node t participate in all the shuffle mechanisms M sum v such that t Lv. 4. Execute all the mechanisms of the full shufflers, i.e., the mechanisms M sum v such that t is the rightmost descendant of v, and mark the corresponding shufflers inactive. For each such v, we analyze the output of the corresponding shuffler to get an estimate Sv of the sum of values of Lv. 5. Output P v V t Sv, where V t is the set of at most log n internal nodes defined in Definition 3.1. 3.2. Constrained Number of Shufflers Recall that in order to get polylogarithmic error, Algorithm 2 used k = log n 1 shufflers and every user participated in k = log n 1 different shuffle mechanisms. A natural question is: What if we cannot support log n shufflers, but only k log n shufflers? . Can we still get better error guarantees than in the sequential shuffle model? We point out two natural modifications of Algorithm 2 to k log n which give bad results. The first is to use only the k lowest levels of the tree, i.e., shufflers of sizes {2i}k i=1. This solution has a larger error since for many queries it has to aggregate results from a large number of shuffle mechanisms each giving a noisy sum estimate with an independent noise. We call this kind of error an inter-batch error. Specifically, the best estimate we get for the sum at each time is by summing the estimates from all the previous roughly n/2k batches of size 2k, incurring an error proportional to p n/2k. (In general if we aggregate results from s shuffle mechanisms we would get inter-batch error Ω( s).) The second natural modification is to take the k highest levels of the tree, i.e., mechanisms of sizes {2i}log n i=log n k. 
Unfortunately, this solution is bad due to what we call a high intra-batch error. Specifically, since each mechanism is of size at least 2log n k, and its shuffler reports only when it fills up, the error after the (2log n k 1) th user can be Ω(2log n k) in the worst case. In general, if the smallest shuffle mechanism is of size b we can get Ω(b) intra-batch error. To get small error for some fixed k N, we have to change the structure of the tree T so that the number of children of any node is d = Θ(n 2 2k+1 ) except for nodes at the lowest level (of internal nodes, which is level 1) that are of degree dlow = n/dk = Θ(n 1 2k+1 ), to ensure that the tree has dkdlow = n leaves. To ease the presentation assume that dlow and therefore d are integers. Note that the root is not used since a shuffler for all users is useless. The major difference of this tree from the binary tree, is in the way we estimate the sum in each time. As a warm-up, consider the sequential shuffle model in which there is only a single shuffler. i.e. k = 1. The resulting tree has n leaves, d internal nodes over dlow = n/d leaves each, and a single root above all the internal nodes. For this tree, the shuffler is run only for the internal level, and the intra-batch error is dlow 1 = O(n/d), whereas the inter-batch error at the last time n, scales like d, since the best estimate for the sum is to sum the outputs of all the previous d executed mechanisms. We balance the two error terms by setting d = Θ(n2/3) and dlow = Θ(n1/3). This gives an optimal algorithm with error term α = Θ(n1/3). We extend this analysis to an arbitrary k log n. For each t = 1, . . . , n we evaluate the sum by adding up the estimates associated with the nodes of V t as defined in Definition 3.1 (which holds for a general tree). For the inter-batch error, note that each node of height i = 1, . . . , k of the tree has at most d 1 siblings, so by the definition of V t , the height i can contribute at most d 1 estimates. Therefore, each time t, we sum a total of at most m = Pk i=1(d 1) = k(d 1) = O k n 2 2k+1 sum estimates. Since each user participates in at most k mechanisms (matching the internal nodes that are its ancestors), it follows by simple composition that to preserve privacy, the internal nodes mechanisms should be (ε/k, δ/k)-SDP. Assuming that these mechanisms have noise which is subgaussian with variance at most Vn, ε k over batches of size at most n and privacy parameters ε and δ, we get that at each time t, the error of the output is sub-gaussian with variance at most O Vn, ε k k n 2 2k+1 . However, within the lowest level of internal nodes, the batch size is dlow, so the intra-batch error is at most dlow 1 = O n 1 2k+1 . Hence the total error in each time is sub-gaussian with variance at most O Vn, ε k k n 2 2k+1 , and with expectation at most Concurrent Shuffle Differential Privacy Under Continual Observation O n 1 2k+1 . The full algorithm appears in the Appendix (Algorithm 3) for which we prove the following results. Theorem 3.4. For n N, and ε, δ > 0 and any number of shufflers k N, for the algorithm above, in each time t N the sum estimate error is sub-gaussian with variance at most O Vn, ε k k n 2 2k+1 . Moreover the algorithm is (ε, δ)-CSDP and an (n, α, β) summation algorithm for kn 1 2k+1 p log(1/β) + log n and for any β (0, 1). Analogously to Corollary 3.3, we conclude: Corollary 3.5. 
For any number of shufflers k N, the algorithm above run on a binary input with parameters n N, ε < 1, δ < 1/2 and using M sum bin , in each time t the sum estimate error is sub-gaussian with variance at most O k3 ε2 n 2 2k+1 . Moreover, the algorithm is (ε, δ)-CSDP and an (n, α, β) summation algorithm for ε n 1 2k+1 . The same holds for bounded vector inputs using M sum vec , for each entry separately. 4. Lower Bounds We consider the specific problem of binary PSCO where the inputs bt {0, 1} are binary. In this section we show that all algorithms in the concurrent shuffle model which are (n, α, β) binary summation algorithms have error α = Ω(n1/(2k+1)). This specifically means that in the sequential shuffle model (k = 1), all algorithms have error α = Ω(n1/3), and that for any number k of concurrent shufflers, our algorithm from Section 3.2 has optimal asymptotic dependence on n. Our proof relies on some components of the Ω( n) lower bound of Vadhan (2017) for private summation in LDP. Specifically, we also use the fact that conditioned on any specific transcript t of a DP communication protocol (between parties with independent randomness and private data), the following two facts hold. (1) by privacy, the bit of each user has constant variance, and (2) these bits are independent (Lemma B.1 in the Appendix). As in Vadhan (2017) these facts allow us to use an anti-concentration bound (Lemma B.4 in the Appendix) which shows that the Hoeffding inequality is tight. Specifically, it shows that the sum of m bounded independent random variables, each with variance σ2, is not concentrated within any region of length O(σ m) with high probability. To adapt this proof to our privacy model with k concurrent shufflers we face multiple difficulties. For our proof, we cannot just use random bits as in Vadhan (2017), and we need to identify a hard and subtle family of distributions over user values, that allows us to use induction on k. We denote this family by n,k. Let cε := eε e2ε+1. We define n,k to be the set of all distributions of binary allocations, which start with a sequence of at most repn,k := 1 2n 1 2k+1 c 2k 2k+1 ε 0 s or 1 s, and continue with an alternation between a single Ber(1/2) bit, and repn,k copies of another Ber(1/2) bit, until length n is reached. We prove the theorem by induction on k = 0, 1, . . .. For the induction step, we assume by contradiction the existence of an algorithm M with error o(repn,k). We consider a distribution n,k and the resulting transcript random variable T obtained by running M on a vector D of random user values sampled from . Note that T is determined by the sampled D , and the internal randomness of the algorithm (in all shuffle mechanisms which were executed). Then we split according to whether or not we allocated a mechanism over a large batch of size at least Bign,k := n 2k 1 2k+1 c 2 2k+1 ε at any time. Intuitively, if we did, during the lifetime of this batch, we can use M to get an algorithm for binary PSCO within this batch by computing differences of estimated sums of M. Since M is assumed to be very accurate, the resulting mechanism contradicts the inductive argument for Bign,k users, k 1 shufflers and a distribution in Bign,k,k 1. Assuming we didn t run a mechanism over a large batch (covered by Lemma B.6 in the Appendix), let Bitsn,k be the single (non-repeated) Ber(1/2) bits of D which are evenly spaced at distance Bign,k of one another. 
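To make the structure of these hard distributions concrete, here is a minimal Python sketch of sampling user values from one distribution in Δ_{n,k} as described above: a fixed prefix of at most rep identical bits, followed by alternating groups of one fresh Ber(1/2) bit and rep copies of another fresh Ber(1/2) bit, truncated at length n. The exact value rep = rep_{n,k} (a Θ(n^{1/(2k+1)}) quantity depending on c_ε) and the choice of prefix are parameters of the family, so they are taken as inputs here; the function name and interface are ours, not the paper's.

```python
import random

def sample_hard_instance(n, rep, prefix_bit=0, prefix_len=0):
    """Sample user values from one distribution in the hard family Delta_{n,k} (sketch).
    Structure: a prefix of `prefix_len` <= rep copies of `prefix_bit`, then repeated groups
    of one fresh Ber(1/2) bit (a "single" bit) followed by `rep` copies of another fresh
    Ber(1/2) bit, truncated to total length n. Also returns the positions of the single
    (non-repeated) bits, since the argument above tracks such bits (Bits_{n,k})."""
    assert 0 <= prefix_len <= rep
    values = [prefix_bit] * prefix_len
    single_positions = []
    while len(values) < n:
        single = random.randint(0, 1)      # the isolated Ber(1/2) bit
        single_positions.append(len(values))
        values.append(single)
        repeated = random.randint(0, 1)    # the bit that is repeated `rep` times
        values.extend([repeated] * rep)
    del values[n:]                          # truncate to horizon n
    single_positions = [p for p in single_positions if p < n]
    return values, single_positions
```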
For analysis, we consider an extended mechanism M which adds all the random bits of D which aren t of Bitsn,k to the transcript as well, but returns the exact same value as M, hence has the same error. We show that the output of M is private with respect to the bits Bitsn,k (Lemma B.5 in the Appendix), and this means (Lemma B.3 in the Appendix) that over the distribution D , with high probability, conditioned on the transcript, each of the O(n/Bign,k) bits in Bitsn,k has constant variance. By defining a communication protocol which produces the transcript of M , we conclude that the bits in Bitsn,k are independent conditioned on the transcript (Lemma B.1). Hence, by anticoncentration bounds (Lemma B.4), we get an error lower bound of Ω( q Bitsn,k ) = Ω( p n/Bign,k) = Ω(repn,k) in the sum approximation. To split on the two cases above, we consider a partition of the transcript space, and apply the law of total probability to conclude the lower bound. Theorem 4.1. Let ε and δ such that δ < min(ε,1) 40n(k+1), and let an (ε, δ)-CSDP n-user k-shuffler algorithm M for binary PSCO with adaptive mechanisms (encoders and sizes). Consider a distribution n,k over user binary values. Then, M has additive error α = Ω(repn,k) on some of its Concurrent Shuffle Differential Privacy Under Continual Observation reported sum estimates with constant probability over the randomness of D and M. 5. Contextual Linear Bandits In the linear contextual bandit problem (Auer, 2002; Chu et al., 2011), at each time t = 1, . . . , n (n is called the horizon), a new user with an arbitrary (and possibly adversarial) private context ct C is recommended an action. The server s goal is to recommend the user an action at which maximizes the resulting expected reward. The reward function is characterized by the unknown parameter vector θ Rd and a known mapping ϕ : C A Rd, from context-action pairs to a d-dimensional feature vector. Specifically, the reward is sampled as θ , ϕ(ct, at) + ηt, where ηt is a zero-mean sub-gaussian noise with variance at most a constant σ2. The actions A and contexts C are arbitrary and can vary with time. We measure the utility of an algorithm for horizon n by the cumulative pseudo-regret defined to be h maxa A θ , ϕ(ct, a) θ , ϕ(ct, at) i , which quantifies the expected loss due to not knowing the optimal action at each time, since the parameter vector θ is unknown. At each time t = 1, . . . , n, the interaction between the server and the t th user is according to the following steps: (1) The server sends the user information (effectively an action selection rule) which deterministically maps any context c to action a. (2) The user applies the selection rule to its context ct to select the action at A and gets the resulting vector xt = ϕ(ct, at). (3) The user receives a stochastic reward yt = θ , xt + ηt, where ηt is zero-mean sub-gaussian noise with variance at most a constant σ2. (4) The user sends information to the server (in our case, encoded information of the vector xtyt and matrix xtx T t through the shufflers to the server).7 (5) The server updates the action selection rule. The Linear Upper Confidence Bound (Lin UCB) algorithm (Abbasi-Yadkori et al., 2011) is a well-studied algorithm for the linear contextual bandits problem that has optimal regret. It maintains a confidence ellipsoid in which θ resides with high probability. This ellipsoid at time t is defined by the matrix P s t xsx T s and the vector P s t xsys. 
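For orientation, the following minimal (and non-private) Python sketch shows how these statistics drive LinUCB's optimistic action choice. The regularizer lam, the confidence width beta, and all names are illustrative assumptions rather than the paper's notation; in the private algorithms discussed next, the exact sums Σ_s x_s x_s^T and Σ_s x_s y_s are replaced by the noisy estimates released through the shuffle mechanisms, and the confidence width is enlarged accordingly.

```python
import numpy as np

def linucb_choose(features, V, u, beta):
    """Pick the action maximizing the optimistic (UCB) score.
    features: (num_actions, d) array of phi(c_t, a) for the current context.
    V: d x d matrix lam*I + sum_s x_s x_s^T (or a noisy, private estimate of it).
    u: d-dim vector sum_s x_s y_s           (or a noisy, private estimate of it).
    beta: confidence-ellipsoid radius."""
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ u                        # ridge estimate of theta*
    bonus = np.sqrt(np.einsum("ad,dc,ac->a", features, V_inv, features))  # ||x||_{V^-1}
    return int(np.argmax(features @ theta_hat + beta * bonus))

def linucb_update(V, u, x, y):
    """Accumulate the statistics that define the confidence ellipsoid."""
    return V + np.outer(x, x), u + y * x

# Minimal usage sketch (dimensions only; contexts and rewards are synthetic placeholders):
d, lam, beta = 3, 1.0, 1.0
V, u = lam * np.eye(d), np.zeros(d)
feats = np.random.randn(5, d)                   # 5 candidate feature vectors phi(c_t, a)
a = linucb_choose(feats, V, u, beta)
V, u = linucb_update(V, u, feats[a], y=0.7)     # 0.7 stands in for the observed reward
```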
In Lin UCB, the action decision rule sent to the users selects the action for each user optimisitically: It chooses the action which 7Since at and xt are functions of the private information ct, and yt is private information as well, they all include user-private information. maximizes the inner product of the action with any point of the confidence ellipsoid. Contextual linear bandits are applicable in internet advertisement selection, recommendation systems and many more fields. This motivated a line of work (Shariff & Sheffet, 2018; Zheng et al., 2020; Chowdhury & Zhou, 2022b; Garcelon et al., 2022) studying linear contextual bandit problems under the lens of differential privacy, to guarantee that the users private information cannot be inferred by an adversary during this learning process. Since Lin UCB reduces the problem to accurately maintaining the matrix P s t xsx T s and the vector P s t xsys, it suffices to ensure privacy with respect to the variables xt and yt. Shariff & Sheffet (2018) achieved this by injecting noise into these cumulative sums. Since the server only uses noisy versions of P s t xsx T s and P s t xsys, they enlarged the confidence ellipsoid to ensure it contains θ with high probability. Let ρmax be the largest magnitude of noise of any cumulative sum estimate in the sequence, then the adapted Lin UCB algorithm of Shariff & Sheffet (2018) ensures regret which scales like O( n ρmax). For centralized DP, the error of a summation algorithm for horizon n is ρmax = O(polylog(n)) = O(1), so Shariff & Sheffet (2018) obtained a Lin UCB-based centralized DP algorithm with regret O( n 1) = O( n). For Local Differential Privacy (LDP), the error of a summation algorithm for horizon n is ρmax = O( n), so Zheng et al. (2020) used the methodology of Shariff & Sheffet (2018) to obtain a Lin UCB-based LDP algorithm with regret O( n p n) = O(n3/4). Chowdhury & Zhou (2022b) considered Shuffle Differential Privacy (SDP), grouping the users into sequential batches of constant size B. The error of their summation algorithm for horizon n scales like ρmax = O( p n/B), so by extending the analysis of Shariff & Sheffet (2018) to batches, they obtained an algorithm with regret O(B + n qp n/B) = O(B + n3/4/B1/4). Setting B = n3/5, they got O(n3/5) regret and posed the open question of whether regret O( n) can be achieved under any notion of privacy stronger than the centralized model. We give a positive answer to this open question in our Concurrent Shuffle DP (CSDP) model. Corollary 3.3 gives a summation algorithm for horizon n with error ρmax = O k3/2 ε n 1 2k+1 for k concurrent shufflers. Hence using the framework of Shariff & Sheffet (2018), we get an algorithm for contextual linear bandits with regret which scales ε n 1 2k+1 = O k3/4 ε n k+1 2k+1 . This is formalized in the following theorem. Theorem 5.1. Fix a horizon n N, a number of shufflers k N, and privacy budgets ε < 1 and δ < 1/2. Concurrent Shuffle Differential Privacy Under Continual Observation Then the algorithm as described above (and detailed in Appendix C) is (ε, δ)-CSDP and has regret Reg(n) = O k3/4n k+1 2k+1 ε (σ + d) . In particular, for k = log n shufflers we get the following corollary. Corollary 5.2. Fix a horizon n N, and privacy budgets ε < 1 and δ < 1/2. Then the algorithm as described above (and detailed in Appendix C) using k = log n shufflers is (ε, δ)-CSDP and has regret Reg(n) = O n ε (σ + d) .8 6. 
Concluding Remarks In this paper, we introduced and analyzed the novel concurrent shuffle model of differential privacy under continual observation. We demonstrated the effectiveness of the concurrent shuffle model in the problem of private summation, to get improved and tight algorithms, and in the contextual linear bandit problem, showing improved error bounds and similar privacy guarantees compared to the sequential shuffle model. Specifically, for k = log n shufflers, we use a variation of the tree-based algorithm of Dwork et al. (2010a) to get an algorithm for PSCO with polylogarithmic error, and for contextual linear bandits, we close the gap of Chowdhury & Zhou (2022b) to the centralized model and achieve regret with asymptotic dependence on n of O( n). There are several directions for future work. One possibility is to explore other problems and applications where the concurrent shuffle model can be applied to achieve improved error bounds and strong privacy guarantees. Disclosure of Funding This work is partially supported by Israel Science Foundation (grants 993/17,1595/19,1871/19), Len Blavatnik and the Blavatnik Family Foundation, the European Research Council (ERC) under the European Union s Horizon 2020 research and innovation program (grant agreement 882396), the Yandex Initiative for Machine Learning at Tel Aviv University and a grant from the Tel Aviv University Center for AI and Data Science (TAD). Abbasi-Yadkori, Y., P al, D., and Szepesv ari, C. Improved algorithms for linear stochastic bandits. Neur IPS, 2011. Auer, P. Using confidence bounds for exploitationexploration trade-offs. JMLR, pp. 397 422, 2002. Balle, B., Bell, J., Gasc on, A., and Nissim, K. The privacy 8Actually, taking k = 1 2 is large enough to ensure the same asymptotic regret. blanket of the shuffle model. In CRYPTO, pp. 638 667, 2019. Beimel, A., Nissim, K., and Omri, E. Distributed private data analysis: Simultaneously solving how and what. In CRYPTO, pp. 451 468, 2008. Bittau, A., Erlingsson, U., Maniatis, P., Mironov, I., Raghunathan, A., Lie, D., Rudominer, M., Kode, U., Tinnes, J., and Seefeld, B. Prochlo: Strong privacy for analytics in the crowd. In SOSP, pp. 441 459, 2017. Bolot, J., Fawaz, N., Muthukrishnan, S., Nikolov, A., and Taft, N. Private decayed predicate sums on streams. In ICDT, pp. 284 295, 2013. Cardoso, A. R. and Rogers, R. Differentially private histograms under continual observation: Streaming selection into the unknown. In AIStat, pp. 2397 2419, 2022. Chan, T. H., Shi, E., and Song, D. Optimal lower bound for differentially private multi-party aggregation. In ESA, pp. 277 288, 2012. Chan, T.-H. H., Shi, E., and Song, D. Private and continual release of statistics. TISSEC, pp. 1 24, 2011. Chen, X., Zheng, K., Zhou, Z., Yang, Y., Chen, W., and Wang, L. (locally) differentially private combinatorial semi-bandits. In ICML, pp. 1757 1767, 2020. Cheu, A. Differential privacy in the shuffle model: A survey of separations. ar Xiv preprint ar Xiv:2107.11839, 2021. Cheu, A., Smith, A., Ullman, J., Zeber, D., and Zhilyaev, M. Distributed differential privacy via shuffling. In CRYPTO, pp. 375 403, 2019. Cheu, A., Joseph, M., Mao, J., and Peng, B. Shuffle private stochastic convex optimization. ICLR, 2022. Chowdhury, S. R. and Zhou, X. Differentially private regret minimization in episodic markov decision processes. In AAAI, pp. 6375 6383, 2022a. Chowdhury, S. R. and Zhou, X. Shuffle private linear contextual bandits. ar Xiv preprint ar Xiv:2202.05567, 2022b. 
Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In AISTATS, pp. 208 214, 2011. Denissov, S., Mc Mahan, H. B., Rush, J. K., Smith, A., and Thakurta, A. G. Improved differential privacy for sgd via optimal private linear operators on adaptive streams. Neur IPS, 2022. Dubey, A. No-regret algorithms for private gaussian process bandit optimization. In AIStat, pp. 2062 2070, 2021. Concurrent Shuffle Differential Privacy Under Continual Observation Dwork, C., Mc Sherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In TCC, pp. 265 284, 2006. Dwork, C., Naor, M., Pitassi, T., and Rothblum, G. N. Differential privacy under continual observation. In STOC, pp. 715 724, 2010a. Dwork, C., Rothblum, G. N., and Vadhan, S. Boosting and differential privacy. In FOCS, pp. 51 60. IEEE, 2010b. Epasto, A., Mao, J., Medina, A. M., Mirrokni, V., Vassilvitskii, S., and Zhong, P. Differentially private continual releases of streaming frequency moment estimations. ITCS, 2023. Erlingsson, U., Feldman, V., Mironov, I., Raghunathan, A., Talwar, K., and Thakurta, A. Amplification by shuffling: From local to central differential privacy via anonymity. In SODA, pp. 2468 2479, 2019. Erlingsson, U., Feldman, V., Mironov, I., Raghunathan, A., Song, S., Talwar, K., and Thakurta, A. Encode, shuffle, analyze privacy revisited: Formalizations and empirical evaluation. PPMLP, 2020. Fichtenberger, H., Henzinger, M., and Ost, W. Differentially private algorithms for graphs under continual observation. ar Xiv preprint ar Xiv:2106.14756, 2021. Garcelon, E., Perchet, V., Pike-Burke, C., and Pirotta, M. Local differential privacy for regret minimization in reinforcement learning. Neur IPS, 34:10561 10573, 2021. Garcelon, E., Chaudhuri, K., Perchet, V., and Pirotta, M. Privacy amplification via shuffling for linear contextual bandits. In ALT, pp. 381 407, 2022. Ghazi, B., Pagh, R., and Velingker, A. Scalable and differentially private distributed aggregation in the shuffled model. TPDP, 2019. Ghazi, B., Kamath, P., Kumar, R., Manurangsi, P., and Wu, K. On differentially private counting on trees. ar Xiv preprint ar Xiv:2212.11967, 2022. Han, Y., Zhou, Z., Zhou, Z., Blanchet, J., Glynn, P. W., and Ye, Y. Sequential batch learning in finite-action linear contextual bandits. ar Xiv preprint ar Xiv:2004.06321, 2020. Han, Y., Liang, Z., Wang, Y., and Zhang, J. Generalized linear bandits with local differential privacy. Neur IPS, pp. 26511 26522, 2021. Henzinger, M. and Upadhyay, J. Constant matters: Finegrained complexity of differentially private continual observation using completely bounded norms. ar Xiv preprint ar Xiv:2202.11205, 2022. Henzinger, M., Upadhyay, J., and Upadhyay, S. Almost tight error bounds on differentially private continual counting. In SODA, pp. 5003 5039, 2023. Jain, P., Raskhodnikova, S., Sivakumar, S., and Smith, A. The price of differential privacy under continual observation. ICML, 2022. Kairouz, P., Mc Mahan, B., Song, S., Thakkar, O., Thakurta, A., and Xu, Z. Practical and private (deep) learning without sampling or shuffling. In ICML, pp. 5213 5225, 2021. Matouˇsek, J. Lower bound on the minus-domination number. Discrete Mathematics, 233(1-3):361 370, 2001. Mishra, N. and Thakurta, A. (nearly) optimal differentially private stochastic multi-arm bandits. In AUAI, pp. 592 601, 2015. Ren, W., Zhou, X., Liu, J., and Shroff, N. B. Multi-armed bandits with local differential privacy. ar Xiv preprint ar Xiv:2007.03121, 2020a. 
Ren, Z., Zhou, Z., and Kalagnanam, J. R. Batched learning in generalized linear contextual bandits with general decision sets. IEEE Control Systems Letters, 6:37 42, 2020b. Sajed, T. and Sheffet, O. An optimal private stochasticmab algorithm based on optimal private stopping rule. In ICML, pp. 5579 5588, 2019. Shariff, R. and Sheffet, O. Differentially private contextual linear bandits. Neur IPS, 31, 2018. Smith, A., Thakurta, A., and Upadhyay, J. Is interaction necessary for distributed private learning? In SP, pp. 58 77, 2017. Tenenbaum, J., Kaplan, H., Mansour, Y., and Stemmer, U. Differentially private multi-armed bandits in the shuffle model. Neur IPS, 34, 2021. Tossou, A. C. Y. and Dimitrakakis, C. Achieving privacy in the adversarial multi-armed bandit. In AAAI, 2017. Upadhyay, J. and Upadhyay, S. A framework for private matrix analysis in sliding window model. In ICML, pp. 10465 10475, 2021. Upadhyay, J., Upadhyay, S., and Arora, R. Differentially private analysis on graph streams. In AIStat, pp. 1171 1179, 2021. Vadhan, S. The complexity of differential privacy. In Tutorials on the Foundations of Cryptography, pp. 347 450. Springer, 2017. Concurrent Shuffle Differential Privacy Under Continual Observation Vietri, G., Balle, B., Krishnamurthy, A., and Wu, S. Private reinforcement learning with pac and regret guarantees. In ICML, pp. 9754 9764, 2020. Zheng, K., Cai, T., Huang, W., Li, Z., and Wang, L. Locally differentially private (contextual) bandits learning. Neur IPS, 2020. Zhou, X. and Tan, J. Local differential privacy for bayesian optimization. In AAAI, pp. 11152 11159, 2021. Concurrent Shuffle Differential Privacy Under Continual Observation A. Missing Proofs from Section 3 A.1. Missing Proofs from Section 3.1 Proof of Theorem 3.2. Observe that the major differences between our algorithm and the interpretation of the tree-based algorithm of Dwork et al. (2010a) as presented in the Introduction (see Figure 1) are (1) our noise for each internal node is slightly different due to the shuffle mechanism, (2) since we do not allocate a shuffler for a single user (which would be wasteful), we have no internal node directly above each leaf so we produce the sum estimates using a different set of internal nodes. The privacy is trivial by simple composition, since in each batch we use an (ε/ log n, δ/ log n)-SDP mechanism, and for each user there are at most log n mechanisms in which they participate, matching the internal nodes that are its ancestors.9 To analyze the accuracy, observe that for each time t, we get a sum estimate by adding at most|V t | log n independent noises (one from each sum estimate), each one has zero mean and is sub-gaussian with variance Vn, ε log n , δ log n . Therefore, the sum of noises at each time t is sub-gaussian with variance at most Vn, ε log n , δ log n |V t | = Vn, ε log n , δ log n log n. By Hoeffding s inequality, with probability at least 1 β/n, the magnitude of the sum of noises is at most log(1/β) + log n q Vn, ε log n , δ log n log n = O q Vn, ε log n , δ log n ( p log(1/β) log n + log n) . By a union bound over the horizon n of the algorithm, with probability at least 1 β, in all t = 1, . . . , n simultaneously, this sum of noises has magnitude O q Vn, ε log n , δ log n ( p log(1/β) log n + log n) . For any even time t, since a shuffle mechanism has just ended at time t, the sum estimate is unbiased and the error is precisely the sum of noises bounded above. 
For any odd time t, we approximate its sum using the estimate of time t 1, so we accumulate an additional error|bt| 1, which is dominated by the error above due to the sum of noises. Proof of Corollary 3.3. We first consider the binary summation setting. Tenenbaum et al. (2021) showed that the error of M sum bin on each single batch of size at most n is sub-gaussian with variance at most Vn,ε,δ = ε2 , which for parameters (n, ε log n, δ log n) gives variance Vn, ε log n , δ log n = O log(log n/δ) log2 n ε2 . Applying Theorem 3.2 yields an (n, α, β) summation algorithm for α = O q Vn, ε log n , δ log n ( p log(1/β) log n + log n) = log(log n/δ) log2 n log(1/β) log n + log n) = O log(log n/δ) log(1/β) log1.5 n+log2 n , and in each time t the sum estimate error is sub-gaussian with variance at most O log(log n/δ) log3 n For the bounded vector setting, in the proof of Theorem C.1 of Chowdhury & Zhou (2022b), they use the error bounds from Lemma 3.1 of Cheu et al. (2022), to yield a mechanism Pvec, which we denote by M sum vec , which is (ε, δ)-SDP, and gives an unbiased estimate of the sum with noise which is sub-gaussian with variance O (log(d/δ))2 ε2 within each entry. For parameters (n, ε log n, δ log n), this gives variance Vn, ε log n , δ log n = O (log(d log n/δ))2 log2 n ε2 . Applying Theorem 3.2 for each entry, this yields an (n, α, β) summation algorithm for α = O q Vn, ε log n , δ log n ( p log(1/β) log n + log n) = (log(d log n/δ))2 log2 n log(1/β) log n + log n) = O log(d log n/δ) log(1/β) log1.5 n+log2 n , and in each time t the sum estimate error is sub-gaussian with variance at most O (log(d log n/δ))2 ε2 log3 n for each entry. A.2. Missing Parts from Section 3.2 We first give the missing algorithm from Section 3.2, for PSCO using k log n concurrent shufflers, and then turn to the missing proofs. This algorithm is a modification of Algorithm 2, using different degrees for different levels. 9For simplicity of presentation, we used simple composition instead of advance composition, which would shave another log n from the final counter error α and another log n factor from the sub-gaussian variance in Corollary 3.3. Concurrent Shuffle Differential Privacy Under Continual Observation Algorithm 3 CSTA (Concurrent Shuffle Tree-Based Algorithm for k concurrent shufflers) Tree. Instantiate a tree with leaves as the numbers 1, . . . , n, where each level i has degree d = Θ(n 2 2k+1 ) except the lowest level (directly above the leaves) which has degree dlow = Θ(n 1 2k+1 ). Internal Nodes. Each internal node v in the tree, is associated with the set of leaf nodes Lv that are descendants of v. Processing. For every time period t {1, . . . , n} corresponding to the t th leaf in the tree: 1. For every internal node v for which t is its leftmost descendant, allocate a new shuffle mechanism M sum v for private summation with privacy parameters (ε/k, δ/k) and shuffle size|Lv|, to the shuffler corresponding to the level of v. 2. The leaf node t receives value bt. 3. The leaf node t participates in all the shuffle mechanisms M sum v such that t Lv. 4. Execute all the mechanisms of the full shufflers, i.e., the mechanisms M sum v such that t is the rightmost descendant of v, and mark their corresponding shufflers inactive. For each such v, we analyze the shuffler output to get an estimate Sv of the sum of values of Lv. 5. Output P v V t Sv, where V t is defined as in Definition 3.1 (which holds for general trees). Proof of Theorem 3.4. 
Observe that the difference from Theorem 3.2 is that now we have two sources of errors the inter-batch and intra-batch error. The privacy is trivial by simple composition, since in each batch we use an (ε/k, δ/k)-SDP mechanism, and for each user there are at most k mechanisms in which they participate, matching the internal nodes that are its ancestors.10 To analyze the accuracy, observe that for each time t we get a sum estimate by adding at most |V t | = k(d 1) = O k n 2 2k+1 independent noises (one from each sum estimate), each one has zero mean and is subgaussian with variance Vn, ε k . This is since by the definition of V t , and since each node of height 1, . . . , k of the tree has at most d 1 siblings, each such height i can contribute at most d 1 internal node sums.11 Hence, similarly to the proof of Theorem 3.2, the sum of noises at time t is sub-gaussian with zero mean variance at most Vn, ε k s = O(Vn, ε k k n 2 2k+1 ), and with probability at least 1 β, in all t = 1, . . . , n simulta- neously, the sum approximation error has expectation O p log(1/β) + log n q k (1 + (k 1)(d 1)) = 1 + k n 2 2k+1 p log(1/β) + log n = O q kn 1 2k+1 p log(1/β) + log n . For any time t which is divisible by dlow, since a shuffle mechanism has just ended at time t, the sum estimate is unbiased and the error is precisely the sum of noises bounded above. For any other time t which is not divisible by dlow, we approximate its sum using the same sum estimate from the largest multiple of dlow smaller than t. Hence, we accumulate an additional error of the sum of all the values bi for i ranging from the last multiple of dlow to t. Since each value bi is bounded in [0, 1], this error is at most dlow 1 = O n 1 2k+1 which is dominated by the error above due to the sum of noises. Proof of Corollary 3.5. The proof follows from an identical argument as the proof of Corollary 3.3, where for binary summation we get Vn, ε k = O log(k/δ)k2 ε2 , so applying Theorem 3.4 gives that the error in each time is sub- gaussian with variance at most O log(k/δ)k2 ε2 k n 2 2k+1 , and the algorithm is an (n, α, β) summation algorithm for kn 1 2k+1 p log(1/β) + log n . 10For simplicity of presentation, we used simple composition instead of advance composition, which would shave another k from the final counter error α and another k factor from the sub-gaussian variance in Corollary 3.5. 11Note that an exception to this is the single last time t = n, in which V t is the set of d internal nodes which are children of the root. However, for this case V t is still of size d = O(dk), so the argument holds for it too. Concurrent Shuffle Differential Privacy Under Continual Observation For vector summation we get Vn, ε k = O log2(dk/δ)k2 ε2 , so applying Theorem 3.4 gives that the error in each time is sub-gaussian with variance at most O log2(dk/δ)k2 ε2 k n 2 2k+1 , and the algorithm is an (n, α, β) summation algorithm for log2(dk/δ)k2 kn 1 2k+1 p log(1/β) + log n . B. Missing Parts from Section 4 In this section, we prove a lower bound for PSCO (specifically for binary values), for any number k of shufflers, even for mechanisms that can choose the shuffle mechanisms (encoders and sizes) adaptively. The proof is by induction on the concurrency level k and is based upon two subtle ideas: First, identifying the hard distribution on inputs that would make any mechanism fail. 
Second, a clever partition of the possible transcripts of the algorithm that allows us to show that whatever batch sizes the algorithm chooses it accumulates a large error with high probability. Specifically we show that if the algorithm uses a mechanism with a large batch then while the shuffler running the mechanism is active the concurrency level reduces and the large error follows by induction. Otherwise, all batches are small. This implies that many input bits are independent even given the transcript of the algorithm (see Lemma B.1). Furthermore, these bits also have large variance (see Lemma B.3) due to privacy so an anti-concentration bound (see Lemma B.4) implies that we must accumulate a large error. We now present and prove the intermediate general results for the proof presented above. First, we prove Lemma B.1 which shows that conditioned on the transcript, input bits which were independent continue to be independent. Lemma B.1 (Independence given transcript, based on Lemma 2 of Chan et al. (2012)). Let a set of n parties participating in a communication protocol where the i th party depends on a random variable Xi that he gets as input. The parties communicate publicly, where in each round a single party publishes a value depending on their input data, their internal private randomness, and the previous messages. Then if X1, . . . , Xn are independent, then they are still independent, even conditioned on all the communicated messages of the protocol. Proof of Lemma B.1. We prove the statement by induction on the number of rounds of messages. For the base case, after zero rounds the values are independent since the communication protocol didn t start yet. For the step, consider the j th message mj sent by the party who has input Xi, and suppose X i is the joint input of all other parties and m exp(2ε)} Y . By Concurrent Shuffle Differential Privacy Under Continual Observation approximate-DP, Pr M[M(D ) Bad1] eε Pr M[M(D) Bad1] + δ and therefore Pr M[M(D) Bad1] = P y Bad1 Pr M[M(D) = y] < P y Bad1 exp( 2ε) Pr M[M(D ) = y] = exp( 2ε) Pr M[M(D ) Bad1] exp( 2ε) exp(ε) Pr M[M(D) Bad1] + δ = exp( ε) Pr M[M(D) Bad1] + δ exp( 2ε). Thus, Pr M[M(D) Bad1] < δ exp( 2ε) 1 exp( ε). A similar analysis shows that Pr M[M(D) Bad2] < δ exp(ε) exp(ε) 1. Let Bad = Bad1 Bad2. Since Bad1 and Bad2 are disjoint, we have that Pr M[M(D) Bad] = Pr M[M(D) Bad1] + Pr M[M(D) Bad2] < 2δ exp(ε) 1 exp( ε) = O(δ/ε). From approximate-DP, we therefore get that Pr M[M(D ) Bad] exp(ε) 2δ exp(ε) 1 exp( ε) + δ = O(δ/ε). Let E(D, D ) = Y \ Bad. It follows that Pr M[M(D) E(D, D )] > 1 O(δ/ε), Pr M[M(D ) E(D, D )] > 1 O(δ/ε), and most importantly by the definition of Bad, for any t E(D, D ) it holds that t / Bad1 and t / Bad1, and therefore Pr M[M(D)=t] Pr M[M(D )=t] [e 2ε, e2ε]. We now extend Lemma B.2 to distributions. Lemma B.3 (Extended over distribution). Let M : Xn Y be an (ε, δ)-DP mechanism, and a distribution A of datasets of size n. For i [n], let Di a denote the dataset D with the i th element set to a. Then, for any two distinct elements x1, x2 X, and index i, there is a subset E(i, x1, x2, A) Y such that: 1. Pr M,D A[M(Di x1) E], Pr M,D A[M(Di x2) E] 1 2δ min(ε,1), and 2. t E, Pr M,D A[M(Di x1)=t] Pr M,D A[M(Di x2)=t] [e 2ε, e2ε]. Proof of Lemma B.3. Let i [n] and x1, x2 X. consider the set of outputs Bad1 = {y Y | Pr M,D A[M(Di x1)=y] Pr M,D A[M(Di x2)=y] < exp( 2ε)} Y and Bad2 = {y Y | Pr M,D A[M(Di x1)=y] Pr M,D A[M(Di x2)=y] > exp(2ε)} Y . 
Observe that Pr M,D A[M(Di x1) Bad1] = X y Bad1 Pr M,D A[M(Di x1) = y] y Bad1 exp( 2ε) Pr M,D A[M(Di x2) = y] = exp( 2ε) Pr M,D A[M(Di x2) Bad1] = exp( 2ε) X D Xn Pr M [M(Di x2) Bad1] Pr A [D] exp(ε) Pr M [M(Di x1) Bad1] + δ Pr A [D] = exp( 2ε)δ + exp( ε) X D Xn Pr M [M(Di x1) Bad1] Pr A [D] = exp( 2ε)δ + exp( ε) Pr M,D A[M(Di x1) Bad1], where the first step follows by summing over Bad1, the second step follows by the definition of Bad1, the fourth step follows by the law of total probability over D A, the fifth step follows by approximate-DP, where for any dataset D Xn, the datasets Di x1 and Di x2 are neighbors, and the seventh step follows by the law of total probability over D A. Thus, Pr M,D A[M(Di x1) Bad1] < δ exp( 2ε) 1 exp( ε). A similar analysis shows that Pr M,D A[M(Di x1) Bad2] < δ exp(ε) exp(ε) 1. Let Bad = Bad1 Bad2. Since Bad1 and Bad2 are disjoint, we have that Pr M,D A[M(Di x1) Bad] = Pr M,D A[M(Di x1) Bad1] + Pr M,D A[M(Di x1) Bad2] < δ exp( 2ε) 1 exp( ε) + δ exp(ε) exp(ε) 1 2δ min(ε, 1) Concurrent Shuffle Differential Privacy Under Continual Observation A symmetric defining argument shows that Pr M,D A[M(Di x2) Bad] 2δ min(ε,1). Let E(i, x1, x2, A) = Y \ Bad. Therefore, Pr M,D A[M(Di x1) E(i, x1, x2, A)] > 1 2δ min(ε,1), Pr M,D A[M(Di x2) E(i, x1, x2, A)] > 1 2δ min(ε,1), and most importantly by the definition of Bad, for any t E(i, x1, x2, A) it holds that t / Bad1 and t / Bad2, and therefore Pr M,D A[M(Di x1)=t] Pr M,D A[M(Di x2)=t] [e 2ε, e2ε]. Lemma B.4 specifies the anti-concentration bound which we use. Lemma B.4 (Matouˇsek (2001)). There exist constants a, b, c > 0 s.t. the following holds. Let X be a sum of independent random variables, each attaining values in [0, 1], and assume V ar[X] 2002. Then for all a β 2 b V ar[X] and every interval I R of length|I| c q V ar[X] log( a β ) we have Pr[X / I] β. The last general result we need is Lemma B.5, which shows that an algorithm which outputs the result of an (ε, δ)-DP algorithm together with a constant subset of the input bits I [n], is private with respect to the bits which it does not output. Lemma B.5. Let M : Xn Y be an (ε, δ)-DP algorithm over n users, and let I [n] be a subset of users. Consider the randomized algorithm M : Xn Y X|I|, which given D = (x1, . . . , xn), computes the random output (M(D), DI) where DI := (xi)i I is the concatenation of the values of the users in I. Then M is (ε, δ)-DP with respect to the inputs D I := (xi)i/ I. Proof of Lemma B.5. We identify each dataset D = (x1, . . . , xn) by D = (DI, D I), separating the values of users in I and not in I. Consider a pair of neighbors D = (DI, D I), D = (DI, (D ) I) of n users for M , where D I and (D ) I differ at a single index, and let B Y X|I| be an output subset. It remains to show that Pr[M (D) B] eε Pr[M (D ) B] + δ. Indeed, let BDI := {r Y | (r, DI) B} Y . We have that Pr[M (D) B] = Pr[M (D) BDI {DI}] = Pr[M(D) BDI] eε Pr[M(D ) BDI] + δ = eε Pr[M (D ) BDI {DI}] + δ = eε Pr[M (D ) B] + δ, where the first and fifth steps follow since the second output of M is dataset s bits at indices I, and since the bits at indices I of both D and D are equal DI, the second and fourth steps follow since the bits at indices I of both D and D are equal DI, and the third step follows since M is (ε, δ)-DP and since BDI Y . We now prove the lower bound, which is the main result of this section. 
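Before turning to the proof, the following small simulation (with arbitrary illustrative parameters, not the ones used in the proof) shows the anti-concentration phenomenon that Lemma B.4 formalizes: a sum of independent bits escapes any fixed interval whose length is on the order of its standard deviation with constant probability, even if the interval is centered at the best possible guess.

```python
import random

def escape_probability(m=2000, bias=0.5, trials=10000):
    """Estimate Pr[X outside I] for X = sum of m Bernoulli(bias) bits and an
    interval I of length one standard deviation, centered at E[X]."""
    mean = m * bias
    std = (m * bias * (1 - bias)) ** 0.5
    lo, hi = mean - std / 2, mean + std / 2        # |I| ~ sqrt(Var[X])
    outside = 0
    for _ in range(trials):
        x = sum(random.random() < bias for _ in range(m))
        outside += not (lo <= x <= hi)
    return outside / trials

print(escape_probability())    # roughly 0.62: the sum misses the interval often
```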
This proof also relies on Lemma B.6; however, since this lemma requires the context and definitions of the proof, we state and prove it following the proof of Theorem 4.1 below. We repeat some of the notation from Section 4 for convenience: $c_\varepsilon := \frac{e^{\varepsilon}}{e^{2\varepsilon}+1}$, $\mathrm{rep}_{n,k} := \frac{1}{2}\,n^{1/(2k+1)}\,c_\varepsilon^{2k/(2k+1)}$ and $\mathrm{Big}_{n,k} := n^{(2k-1)/(2k+1)}\,c_\varepsilon^{2/(2k+1)}$. Let $\Delta_{n,k}$ be the set of all distributions of binary allocations which start with a sequence of at most $\mathrm{rep}_{n,k}$ 0's or 1's, and continue with an alternation between a single $\mathrm{Ber}(1/2)$ bit and $\mathrm{rep}_{n,k}$ copies of another $\mathrm{Ber}(1/2)$ bit, until length $n$ is reached. We also introduce new notation.12 Let $\mathrm{Bits}_{n,k} := \{X_i\}_{i=1}^{m}$ be the single (non-repeated) $\mathrm{Ber}(1/2)$ bits of $D \sim \Delta \in \Delta_{n,k}$ which are evenly spaced at distance $\mathrm{Big}_{n,k}$ from one another. Note that since there are at most $\mathrm{Big}_{n,k}$ bits from the beginning of the input to $X_1$, it holds that $m \ge n/\mathrm{Big}_{n,k} - 1$.

12 All the distributions $\Delta \in \Delta_{n,k}$ are in fact the same distribution; the difference is only in the identity of the prefix of at most $\mathrm{rep}_{n,k}$ 0's or 1's, that is, the length of this prefix and its bits.

Proof of Theorem 4.1. We prove the theorem by induction over the number of shufflers $k = 0, 1, 2, \ldots$. Specifically, we show that given $\delta < \frac{\min(\varepsilon,1)}{40n(k+1)}$, a large enough $n$ such that $n c_\varepsilon^2/\mathrm{Big}_{n,k} > 200^2$, an $(\varepsilon,\delta)$-CSDP $n$-user $k$-shuffler algorithm $M$ for binary PSCO with adaptive mechanisms (encoders and sizes), and a distribution $\Delta \in \Delta_{n,k}$ over user binary values, $M$ has additive error $\alpha = \Omega(\mathrm{rep}_{n,k})$ on some of its reported sum estimates with probability $P_{n,k} = \frac{\beta_0}{2} - \frac{2n\delta k}{\min(\varepsilon,1)}$ over the randomness of $D$ and $M$.

For the base case $k = 0$, we have that $\mathrm{rep}_{n,0} = n/2$, and $\Delta_{n,0}$ is the set of distributions of at most $n/2$ constant bits, then a single $\mathrm{Ber}(1/2)$ bit, and then $n/2 - 1$ copies of the same $\mathrm{Ber}(1/2)$ bit. For any $\Delta \in \Delta_{n,0}$ for $k = 0$ shufflers, while there are up to $n/2$ constant bits, there is still a sequence of at least $n - n/2 - 1$ bits which can be all zeros or all ones, i.e., there are equal-probability pairs of sequences in the support whose sums differ by at least $n/2 - 1$. Since the outcome of the algorithm is deterministic (the server is deterministic, and no data is sent to it), and since each of these two sequences occurs with the same probability, on at least one of them the error of the algorithm is at least $n/4 - 1 = \Omega(\mathrm{rep}_{n,0})$, and this happens with probability at least $1/2 \ge P_{n,0}$.

For the step, let $\varepsilon$ and $\delta$ be the privacy parameters, and let $M$ be an $(\varepsilon,\delta)$-CSDP $n$-user $k$-shuffler algorithm for binary PSCO with adaptive mechanisms. Observe that our sources of randomness are the random vector $D$ and the randomness $S = (S_i)_{i=1}^{\#\text{batches}}$ of the encoders and the shuffles performed within the shuffle mechanisms by all $k$ shufflers. Note that given a concrete $D$ and $S$, since we assume that the server algorithm is deterministic, the random variable $T = T(D, S)$ denoting the transcript of the algorithm as sent to the server is deterministically defined.13 Our proof follows by partitioning the set of possible transcripts, and showing that each set of transcripts either occurs with low probability, or that, conditioned on the set, $M$ has a large additive error. The additive error will follow from the differential privacy of $M$ with respect to $\mathrm{Bits}_{n,k}$. To achieve this, we would like to condition the probability space on the values of the bits of $D$ which are not in $\mathrm{Bits}_{n,k}$ (including both the constant bits at the beginning of the sequences of $D$ and the random bits), denoted by $\overline{\mathrm{Bits}}_{n,k}$. For technical reasons, we achieve this conditioning through an analysis of an alternative algorithm $M'$.
Given an input D and randomness S for M, M simulates M using the same randomness S, and outputs the approximation of M for the sum at each time. However, M produces the more detailed transcript T which includes both the transcript T from the simulation of M(D), and Bitsn,k. This transcript T is created similarly to the transcript of M, but after each time t, before including the output of the shufflers of M which were closed at t, it outputs the binary value of each new random bit (or constant bit in the beginning of D) of Bitsn,k when it is first observed. We now define a partition of the space of transcripts of M . Note that from T (D, S) we can infer the number of batches of each shuffler and their sizes, but we cannot infer the dataset D or the randomness S. To define the partition, we define two subsets of transcripts of M . Let E be the subset of transcripts t of M in which for all i = 1, . . . , m it holds that Pr D,S[T =t |Xi=1] Pr D,S[T =t |Xi=0] [e 2ε, e2ε]. Let T small,n,k be the subset of transcripts of M where all batches throughout all shufflers are of size Bign,k. We partition the transcripts of M below by grouping by the prefix of the transcript until right before the first batch of size > Bign,k, and if there was no such batch, it groups based on whether or not the transcript is contained in E . Specifically, the partition is: 1. The (unique) set intersection T small,n,k E . 2. The (unique) set intersection T small,n,k E . 3. For each specific transcript prefix p of a run until it includes its first batch of size > Bign,k in any of its shufflers, consider the set Tp of all p s transcript completions. We now bound the error for each set separately. The set of type (1): We observe that this set is of small probability. Indeed, by applying Lemma B.5 to M with respect to Bitsn,k we conclude that the computation of T is (ε, δ)-DP with respect to the bits Bitsn,k. For each i, we now apply Lemma B.3 (which we can apply, since the bits Xi are independent, and hence sampling from conditioned on an Xi is equivalent to sampling from and overriding Xi) to M , with A taken to be the distribution , Y is the transcript T , i refers to the i th bit of Bitsn,k (recall that this is Xi), and x1 = 0 and x2 = 1 determine the binary value of this bit. We get that there exists a subset E i of transcripts of M of probability Pr D,S[T E i | Xi = 1], Pr D,S[T E i | Xi = 0] 1 2δ min(ε,1) such that t E i, Pr D,S[T =t |Xi=1] Pr D,S[T =t |Xi=0] [e 2ε, e2ε]. By a union bound, it holds that Pr D,S[T E ] 1 2nδ min(ε,1). Therefore, Pr D,S[T T small,n,k E ] Pr D,S[T E ] 2nδ min(ε,1). The set of type (2): For each transcript t T small,n,k E , Lemma B.6 which is formulated below, implies that conditioned on T = t , M has additive error α = Ω(repn,k) at some time, with probability β0/2, over the randomness of D and M . 13The transcript is the shuffled encoded bits throughout all the batches, ordered by the termination time of the batch, and then by an arbitrary but consistent order over the shufflers. Concurrent Shuffle Differential Privacy Under Continual Observation Sets of type (3): Let p be a prefix of a transcript of a run until right before time τ when a large batch B of size|B| > Bign,k starts. Assume without loss of generality that B is processed by shuffler number k, and that|B| = Bign,k (otherwise we just look at a prefix of B of size Bign,k). The rest of the argument is in the subspace of D and S conditioned on p. 
We claim that in this subspace M has error Ω repn,k at some time t with probability Pn,k 1. To show this, we assume by contradiction that M has error O repn,k in all its sum estimates with probability > 1 Pn,k 1 and we will get a contradiction. Specifically, by our assumption, there exists a fixed randomness Sp for shuffle mechanisms that end before time τ, and fixed sizes for shuffle mechanisms that overlap time τ (we do not fix the randomness of the encoders of these mechanisms that overlap time τ), and a fix set of binary values Dp of the users arriving before τ, conditioned on which M has error O repn,k in all prefixes with probability > 1 Pn,k 1. We now define a (k 1)-shuffler algorithm M B for binary PSCO over B, which uses M internally, and we show that since M has low error conditioned on Sp and Dp, the error of M B on B is too low and contradicts the induction hypothesis. Specifically, the algorithm M B runs on|B| users arriving as batch B using the following steps. 1. It simulates the shuffle mechanisms on all batches that end before time τ on the server using Dp and Sp. Then it simulates the shuffle mechanisms on batches overlapping τ using the shufflers 1, . . . , k 1 by injecting to each of them the encoding (using fresh randomness of the encoders) of the appropriate suffix of Dp. The sizes of these k 1 mechanism is determined by Sp. 2. It runs on the actual inputs that arrive in batch B using the randomness of M . At each time j, during batch B, M B sets its sum estimate (of the prefix of B) to be the difference between the estimated sum of M at time j minus the sum of the bits of Dp. Note that during the|B| 1 different times within B, while simulating M , the server does not receive any update from the k th shuffler since it is still running. Therefore M B uses only (k 1) shufflers to get its estimates. Let B be the distribution of the bits in B conditioned on Dp. If the bit arriving at time τ is part of a segment of repn,k copies of the same random bit, then each sequence in B starts with the suffix of this segment starting at time τ. The sequence then continues as in with alternation between a single Ber(1/2) bit, and repn,k copies of another Ber(1/2) bit. Since repn,k = rep Bign,k,k 1 = rep|B|,k 1 we have that B |B|,k 1. Note that when M B runs on the input distribution B, the inputs that it feeds to M during batch B has the same distribution as in conditioned on the prefix Dp. By our assumption towards contradiction, all sum approximates of M during B have error O repn,k with probability > 1 Pn,k 1 over D and S conditioned on Dp and Sp, respectively, so M B has error O repn,k over all the prefixes with probability > 1 Pn,k 1, over the randomness of B and its shuffle randomness. This contradicts the induction hypothesis for n |B| and k k 1, which claims that specifically for M B and B |B|,k 1, we get error Ω(|B| 2k 2 2k 1 ε ) = Ω(n 2k 1 (2k+1)(2k 1) c 2 (2k+1)(2k 1) ε c 2k 2 2k 1 ε ) = Ω(repn,k) at some time with probability P|B|,k 1 Pn,k 1. Hence, our assumption was false, and M has error Ω repn,k at some time t with probability Pn,k 1 over D and S. 
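The exponent bookkeeping in the last step can be checked directly from the definitions of $\mathrm{rep}$ and $\mathrm{Big}$ restated at the beginning of this section; in particular, the identity $\mathrm{rep}_{\mathrm{Big}_{n,k},\,k-1} = \mathrm{rep}_{n,k}$ used above follows since

$$\mathrm{rep}_{\mathrm{Big}_{n,k},\,k-1} = \tfrac{1}{2}\,\mathrm{Big}_{n,k}^{\,1/(2k-1)}\,c_\varepsilon^{(2k-2)/(2k-1)} = \tfrac{1}{2}\left(n^{\frac{2k-1}{2k+1}}\,c_\varepsilon^{\frac{2}{2k+1}}\right)^{\frac{1}{2k-1}} c_\varepsilon^{\frac{2k-2}{2k-1}} = \tfrac{1}{2}\,n^{\frac{1}{2k+1}}\,c_\varepsilon^{\frac{2}{(2k+1)(2k-1)}+\frac{2k-2}{2k-1}} = \tfrac{1}{2}\,n^{\frac{1}{2k+1}}\,c_\varepsilon^{\frac{2k}{2k+1}} = \mathrm{rep}_{n,k},$$

where the exponent of $c_\varepsilon$ simplifies as $\frac{2+(2k-2)(2k+1)}{(2k+1)(2k-1)} = \frac{4k^2-2k}{(2k+1)(2k-1)} = \frac{2k}{2k+1}$.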
To conclude the proof we denote by ERRORM and ERRORM the worst error of M and M respectively over all prefixes, Concurrent Shuffle Differential Privacy Under Continual Observation and aggregate the error bound for the different sets by the law of total probability, to get that Pr D,S[ERRORM = Ω repn,k ] = Pr D,S[ERRORM = Ω repn,k ] = Pr D,S[ERRORM = Ω repn,k | T T small,n,k E ] Pr D,S[T T small,n,k E ] t T small,n,k E Pr D,S[ERRORM = Ω repn,k | T = t ] Pr D,S[T = t ] p Pr D,S[ERRORM = Ω repn,k | T Tp] Pr D,S[T Tp] 0 + β0/2 Pr D,S[T T small,n,k E ] + Pn,k 1 Pr D,S[T p Tp] Pn,k 1 Pr D,S[T (T small,n,k E ) ( p Tp)] = Pn,k 1 Pr D,S[T / T small,n,k E ] Pn,k 1 1 2nδ min(ε, 1) = β0/2 1 2nδ(k 1) 1 2nδ min(ε, 1) > β0/2 1 2nδk min(ε, 1) where the first step follows by the definition of M , the second step follows by the law of total probability, the third step follows by Lemma B.6 and by the analysis of transcripts of type 2 (resulting with the constant β0/2) above, the fourth step follows since Pn ,k β0/2 for any n , k , and the sixth step follows by the analysis of set 1, and the seventh step follows by the definition of Pn,k 1. We now provide the missing lemma used in the proof of Theorem 4.1, which shows using the notation of the proof of Theorem 4.1, that for each transcript t T small,n,k E , conditioned on T = t , M has additive error α = Ω(repn,k). Lemma B.6. Let ε and δ such that δ < min(ε,1) 40n and nc2 ε/Bign,k > 2002, and an (ε, δ)-CSDP n-user k-shuffler algorithm M for PSCO with adaptive mechanisms, and let M and E as defined in the proof of Theorem 4.1. Let n,k be a distribution over user binary values. Then, conditioned on a fixed transcript T = t = (t, Bitsn,k) T small,n,k E , M has additive error α = Ω(repn,k) at some time with probability β0/2, over the randomness of and M . Proof of Lemma B.6. We claim that: 1. The conditional random variables {Xi | T = t }m i=1 are independent. (Recall that Bitsn,k := {Xi}m i=1 are the non-repeated Ber(1/2) bits of D n,k which are evenly spaced at distance Bign,k of one another. ) 2. Each bit Xi satisfies Pr D,S[Xi = 1 | T = t ] h 1 e2ε+1, e2ε e2ε+1 i . We prove Item (a) through an application of Lemma B.1. Recall that a random D is composed of a constant (non-random) prefix of bits, then alternating single Ber(1/2) bits and a sequence of equal Ber(1/2) bits. The communicating parties can be divided to those which are associated with input bit subsets or a specific batch of users (defined by the subset of users, the mechanism and the shuffler they used). For input bits, we associate a communicating party with each (a) single Ber(1/2) bit in the input, (b) sequence of of copies of a Ber(1/2) bit, and (c) a constant bit in the prefix of at most repn,k bits. For batches of users, we associate a communicating party of type (d) to each batch of users which contain no bit among Bitsn,k.14 The communication protocol is divided to 2n rounds of communication, two rounds per each time 1, . . . , n. Throughout the times t = 1, . . . , n, the first communication rounds are used to communicate bits of the input. Each party which is 14The outputs of the shuffle mechanisms over batches containing some bit of Bitsn,k are associated with this single Bitsn,k input bit from above. Note that we can make this association since the batches are of size at most Bign,k since t T small,n,k, so the batch cannot contain two or more consecutive Xj and Xj+1 s. 
Concurrent Shuffle Differential Privacy Under Continual Observation associated with a set of input bits communicates them in the first round of a single step, as follows. A party associated with a single bit (types (a),(c)), communicates this bit at the first round of time t of the bit, and a party associated with a sequence of copies of a bit (type (b)), communicates the copied bit value at the first round of time t which contains the last copied bit of the sequence. Throughout the times t = 1, . . . , n, the second communication rounds are used to communicate shuffler outputs. Specifically, in each time t which contains a bit of Bitsn,k, the party associated with the bit t (this is a subset of parties of type (a) as above) communicates the output of all (at most k) shufflers that closed in time t in the second communication round of t. Also, for each batch of type (d), its associated party communicates the output of the shuffler run on the batch at the second round of the shuffler s close time t. We show that this is a valid communication protocol, that is, each party communicates a message as a function of their internal randomness, their private input and the history. For parties which communicate in the first rounds, the communicated message is simply their bit which is their private input. For the second rounds of communication, each participating party associated with a Bitsn,k bit, e.g., Xi, communicates the output of a shuffle mechanism, which is a function of the shuffle input Xi (attributed to the input of party i), the rest of the bits of users in the shuffle batch (which were communicated previously or in the first round of the current time), the executed shuffle mechanism (which depends on the information of previous rounds of communication) and the shuffle randomness (attributed to the internal randomness of party i). A similar argument is given for parties of type (d) which are not associated with any Bitsn,k bit. We observe that the bits {Xi}i were initially independent by the definition of , and that the total communication of the parties precisely produces the transcript t , and therefore Item (a) follows by Lemma B.1. To prove item (b), for every i = 1, . . . , m we observe that Pr D,S[Xi = 1 | T = t ] Pr D,S[Xi = 0 | T = t ] = Pr D,S[T = t | Xi = 1] Pr D,S[Xi = 1 |]/ Pr D,S[T = t |] Pr D,S[T = t | Xi = 0] Pr D,S[Xi = 0 |]/ Pr D,S[T = t |] = Pr D,S[T = t | Xi = 1] 1/2 Pr D,S[T = t | Xi = 0] 1/2 = Pr D,S[T = t | Xi = 1] Pr D,S[T = t | Xi = 0] [e 2ε, e2ε], where the first step follows by Bayes Rule, the second step follows since for j {0, 1}, Pr D[Xi = j] = 0.5 and since Xi is independent of S, and the last step follows by since t E and by the definition of E . This concludes the proof of item (b). Combining Claims 1 and 2 above, we have that {Xi | T = t }m i=1 are independent, and each bit Xi satisfies Pr D,S[Xi = 1 | T = t ] h 1 e2ε+1, e2ε e2ε+1 i . Thus, Pm i=1 Xi is the sum of m independent Bernoulli random variables with bias in h 1 e2ε+1, e2ε e2ε+1 i (so the variance is at least c2 ε) and therefore Pm i=1 Xi has variance at least mc2 ε > 2002, where the last step follows by the theorem s assumptions on n and ε. 
We apply Lemma B.4 to the random variable Pm i=1 Xi | T = t , with the constant β = β0 = min( a 4, 1/4) (2 bmc2 ε, a) where a is a constant guaranteed from Lemma B.4, and the interval I(t ) = M (t ) sum(Bitsn,k) c/2 q mc2ε log( a where M (t ) is the (deterministic) last sum estimate that M returns for the transcript t (which is precisely the last estimate that M returns for t), and sum(Bitsn,k) is the sum of bits of D which are not of type Bitsn,k.15 We conclude that with constant probability β0 over S and D, M incurs error at least Ω( m cε) = Ω( p n/Bign,k cε) = Ω(repn,k), by the definition of Bign,k and repn,k. C. Concurrent Shuffle Lin UCB In this section, we present with detail the Lin UCB adaptation for the Concurrent Shuffle model with k shufflers. Algorithm 4 below simulates Lin UCB with errors, as in the setting of Shariff & Sheffet (2018), where the estimates of the cumulative matrix and vector sums are given by a CSDP algorithm for PSCO guaranteed from Corollary 3.5. We make a standard assumption of boundedness, easily satisfied by normalization as follows. 15The reason we use this interval is that the true sum is true Sum = sum(Bitsn,k)+Pm i=1 Xi, and to show that with high probability we mis-approximate it with error ϕ with M (e), it suffices to show that M (e) sum(Bitsn,k) Pm i=1 Xi mis-approximates 0 with error ϕ. Concurrent Shuffle Differential Privacy Under Continual Observation Assumption C.1 (normalization). By normalization, we assume that the rewards are bounded for all t, i.e., yt [0, 1]. Also, θ 2 1 and supc,a ϕ(c, a) 2 1. Algorithm 4 Concurrent Shuffle Private Lin UCB Parameters: horizon n, number of shufflers k, privacy parameters ε and δ, regularization λ > 0, confidence radii {βi}n i=1, feature map ϕ : C χ Rd. Initialize: Batch statistics V0 = λId, u0 = 0, parameter estimate ˆθ0 = 0, and a Concurrent Shuffle DP (CSDP) bounded vector and matrix summation algorithm Asum vec guaranteed by Corollary 3.5 for n users, k shufflers, and privacy parameters ε and δ. We wlog assume that Asum vec handles both vectors and matrix sums, since they use the same tree structure (same users in each mechanism by the definition of the tree-based structure in Corollary 3.5), and the two shuffle mechanisms can be composed using simple composition. For each time t = 1, . . . , n: 1. The server activates the new shuffle mechanisms for the t th time, according to Asum vec . 2. The server communicates ˆθm 1 and V 1 m 1 to the t th user. 3. The t th user chooses the action at arg maxa χ D ϕ(ct, a), ˆθm 1 E + βt 1 ϕ(ct, a) V 1 m 1. 4. The t th user observes the reward yt. 5. For each shuffle mechanism which requires t s participation for vectors and matrices sums, it participates using its private vector value ϕ(ct, at)yt and matrix value ϕ(ct, at)ϕ(ct, at)T respectively. 6. The data of each shuffle mechanism that has just completed is shuffled and sent to the server. 7. The server analyzes the data to obtain a cumulative sum vector estimate ut and a cumulative matrix estimate V t , and computes the parameter estimates Vt = V t + λId and ˆθt = V 1 t ut. To prove the regret of Algorithm 4, we refer to the framework of Shariff & Sheffet (2018), who gave a bound on the total regret of an algorithm based on the magnitude of the errors of the estimates ut and V 1 t throughout the algorithm. 
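For concreteness, here is a minimal Python sketch of the server-side loop of Algorithm 4. It illustrates the control flow only: the callable `noisy_prefix_sums` is a hypothetical stand-in for the CSDP summation algorithm $A^{\mathrm{sum}}_{\mathrm{vec}}$ of Corollary 3.5 (in the actual protocol the estimates refresh only when a shuffle mechanism closes), and the remaining callables (`actions`, `get_reward`, `beta`) are assumptions of the sketch.

```python
import numpy as np

def private_linucb(n, d, actions, get_reward, noisy_prefix_sums, beta, lam=1.0):
    """Sketch of the Concurrent Shuffle Private LinUCB loop (Algorithm 4).

    actions(t)                  -> list of feature vectors phi(c_t, a) in R^d
    get_reward(t, phi)          -> observed reward y_t in [0, 1]
    noisy_prefix_sums(t, g, G)  -> noisy estimates (u_t, V_t) of the running sums
                                   of g = phi*y and G = phi phi^T (hypothetical
                                   stand-in for the CSDP summation algorithm)
    beta(t)                     -> confidence radius beta_t
    """
    V = lam * np.eye(d)                     # regularized Gram-matrix estimate
    theta_hat = np.zeros(d)
    for t in range(1, n + 1):
        V_inv = np.linalg.inv(V)
        # user t picks the optimistic action given (theta_hat, V_inv) from the server
        phi_t = max(actions(t),
                    key=lambda phi: phi @ theta_hat
                    + beta(t - 1) * np.sqrt(phi @ V_inv @ phi))
        y_t = get_reward(t, phi_t)
        # in the real protocol the user feeds phi_t*y_t and phi_t phi_t^T into the
        # open shuffle mechanisms; here the oracle stands in for that pipeline
        u_t, V_t = noisy_prefix_sums(t, phi_t * y_t, np.outer(phi_t, phi_t))
        V = V_t + lam * np.eye(d)
        theta_hat = np.linalg.solve(V, u_t)
    return theta_hat
```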
Specifically, they used the following notations, where for each time t, they denote by Ht := Vt Pt i=1 ϕ(ct, at)ϕ(ct, at)T the noise injected into the Gram-matrix, and denote by ht := ut Pt i=1 ϕ(ct, at)yt the noise injected into the feature-reward vector. Shariff & Sheffet (2018) showed the following. Theorem C.2 (Shariff & Sheffet (2018)). If we have that 1. (Regularity) For any α (0, 1], Hm is PSD, and there are ρmax ρmin > 0 and γ depending on α, such that with probability at least 1 α, for all t = 1, . . . , n, Ht ρmax, H 1 t 1/ρmin, ht H 1 t γ. 2. (Boundedness) θ , x B, θ S, x L and yt = θ , xt + ηt where ηt is sub-gaussian with variance σ2. Then with probability at least 1 α, the regret satisfies: 2 log(2/α) + d log ρmin + n L2 + (S ρmax + γ) d log 1 + n L2 using βt := σ r 2 log(2/α) + d log ρmax ρmin + t L2 dρmin + S ρmax + γ. We are now prepared to prove the main result of this section. Concurrent Shuffle Differential Privacy Under Continual Observation Theorem C.3 (Restatement of Theorem 5.1). Fix a horizon n N, number of shufflers k and privacy budgets ε < 1 and δ < 1/2. Then the algorithm as described above instantiated with the tree-Based Concurrent Shuffle algorithm (guaranteed by Corollary 3.5 for summing vectors with bounded l2 norms) with internal parameter βt := σ q 2 log(2/α) + d log 3 + 2t λ/2, where α = 1 n (0, 1], λ = 2σ d and σ = O k3/2 ε n 1 2k+1 is (ε, δ)-CSDP and has regret O k3/4n k+1 2k+1 ε (σ + d) , where σ2 is the bound on the sub-gaussian variance of each ηt when computing yt = θ , xt + ηt. Proof. Our proof is similar to the proof of Lemma A.4 of Chowdhury & Zhou (2022b). However, we use different constants and internal variable values, so include the proof below. We receive the bound on the regret by applying Theorem C.2. To do so, we find the variables which ensure regularity and boundedness. For regularity, note that for all t = 1, . . . , n, ht is a random vector whose entries are independent, and Ht is a random symmetric matrix whose entries on and above the diagonal are independent. We apply Corollary 3.5 to get that every entry of every matrix and vector at each time t has error at most σ with probability 1 n2 (d2+d). By a union bound over n and the entries of the matrix and vectors, and a simple bound on vector and matrix l2 norms, we get that with probability at least 1 α/2, for all t = 1, . . . , n simultaneously, ht , Ht σ d = λ/2. Hence, we have for all t = 1, . . . , n , it holds that Ht = Ht + λId 3λ/2, i.e., ρmax = 3λ/2, and ρmin = λ/2. Finally, to determine γ, we note that ht H 1 t 1 ρmin ht For boundedness, by the normalization Assumption C.1 we get boundedness for B = S = L = 1, and we use the same the definition of σ as from Section 5. We now apply Theorem C.2 with the regularity and boundedness results above for the α and observe that the theorem s statement contains the correct matching sequence βt. We get that with probability at least 1 α, the regret satisfies: 2 log(2n) + d log 3 + 2n d log 1 + 2n = O n (σ + d) log(n) + = O n σ log(n) + d σ log(n) = O σ log(n) + d ε n 1 2k+1 log(n) k3/4n k+1 2k+1 ε (σ + d) log(n) k3/4n k+1 2k+1 ε (σ + d) where the fourth step follows by the definition of σ from the theorem s statement. Since the worst possible regret is n, splitting to the event above with probability at least 1 α = 1 1 n or its complement for α = 1 n, the law of total expectation, the regret is at most k3/4n k+1 2k+1 ε (σ + d) k3/4n k+1 2k+1 ε (σ + d) Proof of Corollary 5.2. 
The proof follows by a simple application of Theorem 5.1 for k = log n.
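As a quick check of this step, note that for $k = \log_2 n$ the polynomial factor in the regret bound of Theorem 5.1 collapses to $O(\sqrt{n})$:

$$n^{\frac{k+1}{2k+1}} = n^{\frac{1}{2}}\cdot n^{\frac{1}{2(2k+1)}} = \sqrt{n}\cdot 2^{\frac{\log_2 n}{2(2\log_2 n + 1)}} \le \sqrt{n}\cdot 2^{1/4} = O(\sqrt{n}),$$

and the remaining $\mathrm{poly}(k)$ factors are $\mathrm{polylog}(n)$, which gives the regret stated in Corollary 5.2.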