Counting Distinct Elements Under Person-Level Differential Privacy

Alexander Knop (Google) alexanderknop@google.com
Thomas Steinke (Google DeepMind) steinke@google.com

We study the problem of counting the number of distinct elements in a dataset subject to the constraint of differential privacy. We consider the challenging setting of person-level DP (a.k.a. user-level DP), where each person may contribute an unbounded number of items and hence the sensitivity is unbounded. Our approach is to compute a bounded-sensitivity version of this query, which reduces to solving a max-flow problem. The sensitivity bound is optimized to balance the noise we must add to privatize the answer against the error of the approximation of the bounded-sensitivity query to the true number of unique elements.

1 Introduction

An elementary data analysis task is to count the number of distinct elements occurring in a dataset. The dataset may contain private data, and even simple statistics can be combined to leak sensitive information about people [DN03]. Our goal is to release (an approximation to) this count in a way that ensures the privacy of the people who contributed their data. As a motivating example, consider a collection of internet browsing histories, in which case the goal is to compute the total number of websites that have been visited by at least one person.

Differential privacy (DP) [DMNS06] is a formal privacy standard. The simplest method for ensuring DP is to add noise (from either a Laplace or Gaussian distribution) to the true answer, where the scale of the noise corresponds to the sensitivity of the true answer, i.e., how much one person's data can change the true value. If each person contributes a single element to the dataset, then the sensitivity of the number of unique elements is one.
However, a person may contribute multiple elements to the dataset, and our goal is to ensure privacy for all of these contributions simultaneously. That is, we seek to provide person-level DP (a.k.a. user-level DP²). This is the problem we study: We have a dataset D = (u_1, u_2, ..., u_n) of person records. Each person i ∈ [n] contributes a finite dataset u_i ∈ Ω*, where Ω is some (possibly infinite) universe of potential elements (e.g., all finite-length binary strings) and Ω* := ∪_{ℓ∈ℕ} Ω^ℓ denotes all subsets of Ω of finite size. Informally, our goal is to compute the number of unique elements

DC(D) := |∪_{i∈[n]} u_i|    (1)

in a way that preserves differential privacy. A priori, the sensitivity of this quantity is infinite, as a single person can contribute an unbounded number of unique elements. In particular, it is not possible to output a meaningful upper bound on the number of distinct elements subject to differential privacy. This is because a single person could increase the number of distinct elements arbitrarily, and differential privacy requires us to hide this contribution. It follows that we cannot output a differentially private unbiased estimate of the number of distinct elements with finite variance. However, it is possible to output a lower bound. Thus our formal goal is to compute a high-confidence lower bound on the number of distinct elements that is as large as possible and which is computed in a differentially private manner.

1.1 Our Contributions

Given a dataset D = (u_1, ..., u_n) ∈ (Ω*)^n and an integer ℓ ≥ 1, we define

DC(D; ℓ) := max { |∪_{i∈[n]} v_i| : ∀i ∈ [n], v_i ⊆ u_i, |v_i| ≤ ℓ }.    (2)

That is, DC(D; ℓ) is the number of distinct elements if we restrict each person's contribution to ℓ elements; we take the maximum over all possible such restrictions. It is immediate that DC(D; ℓ) ≤ DC(D) for all ℓ ≥ 1.

* Alphabetical author order.
² We prefer the term person over user, as the latter only makes sense in some contexts and could be confusing in others.
37th Conference on Neural Information Processing Systems (NeurIPS 2023).
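To make definition (2) concrete, the following brute-force illustration (ours, not one of the paper's algorithms; the function name and the tiny example dataset are invented for exposition) enumerates all restrictions v_i ⊆ u_i with |v_i| ≤ ℓ and takes the best union:

```python
from itertools import combinations, product

def distinct_count_bruteforce(people, ell):
    """DC(D; ell): max distinct elements when each person keeps <= ell items.

    Exponential-time enumeration, only usable on tiny inputs; it serves to
    illustrate the definition before the efficient matching-based algorithm.
    Keeping exactly min(ell, |u_i|) items is without loss of generality,
    since adding items never shrinks the union.
    """
    choices = []
    for u in people:
        items = sorted(set(u))
        k = min(ell, len(items))
        choices.append([set(c) for c in combinations(items, k)])
    best = 0
    for picks in product(*choices):
        best = max(best, len(set().union(*picks)))
    return best

D = [{"a", "b", "c"}, {"a"}, {"b", "d"}]
print(distinct_count_bruteforce(D, 1))  # 3 (e.g. pick c, a, d)
print(distinct_count_bruteforce(D, 2))  # 4
```

With ℓ = 1 the three people can cover at most three elements, and a choice such as c, a, d attains that; with ℓ = 2 all four elements are covered.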
Thus we obtain a lower bound on the true number of unique elements. The advantage of DC(D; ℓ) is that its sensitivity is bounded by ℓ (see Lemma A.1 for a precise statement) and, hence, we can estimate it in a differentially private manner. Specifically,

M_{ℓ,ε}(D) := DC(D; ℓ) + Lap(ℓ/ε)    (3)

defines an ε-DP algorithm M_{ℓ,ε} : (Ω*)^n → ℝ, where Lap(b) denotes Laplace noise scaled to have mean 0 and variance 2b². This forms the basis of our algorithm. Two challenges remain: setting the sensitivity parameter ℓ and computing DC(D; ℓ) efficiently.

To obtain a high-confidence lower bound on the true distinct count, we must compensate for the Laplace noise, which may inflate the reported value. We can obtain such a lower bound from M_{ℓ,ε}(D) using the cumulative distribution function (CDF) of the Laplace distribution: for all b > 0 and β ∈ (0, 1/2],

P[Lap(b) ≥ b·log(1/(2β))] = β,    (4)

so M_{ℓ,ε}(D) − (ℓ/ε)·log(1/(2β)) is a lower bound on DC(D; ℓ) ≤ DC(D) with confidence 1 − β.

Choosing the sensitivity parameter ℓ. Any choice of ℓ ≥ 1 gives us a lower bound: DC(D; ℓ) ≤ DC(D). Since lim_{ℓ→∞} DC(D; ℓ) = DC(D) for every D, this lower bound can be arbitrarily tight. However, the larger ℓ is, the larger the sensitivity of DC(D; ℓ) is; that is, the noise we add scales linearly with ℓ. Thus there is a bias-variance tradeoff in the choice of ℓ. To make this precise, suppose we want a lower bound on DC(D) with confidence 1 − β ∈ [1/2, 1), as in Equation (4). To obtain the tightest possible lower bound with confidence 1 − β, we want ℓ to maximize the expectation

q(D; ℓ) := DC(D; ℓ) − (ℓ/ε)·log(1/(2β)).    (5)

We can use the exponential mechanism [MT07] to privately select an ℓ that approximately maximizes q(D; ℓ). However, directly applying the exponential mechanism is problematic because each score has a different sensitivity: the sensitivity of q(·; ℓ) is ℓ. Instead, we apply the Generalized Exponential Mechanism (GEM) of Raskhodnikova and Smith [RS15] (see Algorithm 3). Note that we assume some a priori maximum value of ℓ is supplied to the algorithm; this is ℓ_max.
Our main algorithm attains the following guarantees.

Theorem 1.1 (Theoretical Guarantees of Our Algorithm). Let ε > 0, β ∈ (0, 1/2), and ℓ_max ∈ ℕ. Define M : (Ω*)* → ℝ to be M(D) = DPDISTINCTCOUNT(D; ℓ_max, ε, β) from Algorithm 1. Then M satisfies all of the following properties.

Privacy: M is ε-differentially private.

Lower bound: For all D ∈ (Ω*)^n,

P_{(ℓ̂,ν̂)←M(D)} [ν̂ ≤ DC(D)] ≥ 1 − β.    (6)

Upper bound: For all D ∈ (Ω*)^n,

P_{(ℓ̂,ν̂)←M(D)} [ν̂ ≥ max_{ℓ∈[ℓ_max]} ( DC(D; ℓ) − ((10ℓ + 18ℓ_A)/ε)·log(ℓ_max/β) )] ≥ 1 − β,    (7)

where ℓ_A = argmax_{ℓ∈[ℓ_max]} DC(D; ℓ) − (ℓ/ε)·log(1/(2β)).

Computational efficiency: M(D) has running time O(|D|^{1.5} · ℓ_max²), where |D| := Σ_{i∈[n]} |u_i|.

The upper bound guarantee (7) is somewhat difficult to interpret. However, if the number of items per person is bounded by ℓ*, then we can offer a clean guarantee: If D = (u_1, ..., u_n) ∈ (Ω*)^n satisfies max_{i∈[n]} |u_i| ≤ ℓ* ≤ ℓ_max, then combining the upper and lower bounds of Theorem 1.1 gives

P_{(ℓ̂,ν̂)←M(D)} [DC(D) ≥ ν̂ ≥ DC(D) − (28ℓ*/ε)·log(ℓ_max/β)] ≥ 1 − 2β.    (8)

Note that ℓ* is not assumed to be known to the algorithm, but the accuracy guarantee is able to adapt. We only assume ℓ* ≤ ℓ_max, where ℓ_max is the maximal sensitivity considered by the algorithm. In addition to proving the above theoretical guarantees, we perform an experimental evaluation of our algorithm.

Algorithm 1 Distinct Count Algorithm
1: procedure SENSITIVEDISTINCTCOUNT(D = (u_1, ..., u_n) ∈ (Ω*)^n; ℓ ∈ ℕ)    ▷ computes DC(D; ℓ)
2:   Let U_ℓ = ∪_{i∈[n]} {i} × [min{ℓ, |u_i|}] ⊆ [n] × [ℓ].
3:   Let V = ∪_{i∈[n]} u_i ⊆ Ω.
4:   Define E_ℓ ⊆ U_ℓ × V by ((i, j), v) ∈ E_ℓ ⟺ v ∈ u_i.
5:   Let G_ℓ be a bipartite graph with vertices partitioned into U_ℓ and V and edges E_ℓ.
6:   m_ℓ ← MAXIMUMMATCHINGSIZE(G_ℓ).    ▷ [HK73; Kar73]
7:   return m_ℓ ∈ ℕ
8: end procedure
9: procedure DPDISTINCTCOUNT(D = (u_1, ..., u_n) ∈ (Ω*)^n; ℓ_max ∈ ℕ, ε > 0, β ∈ (0, 1/2))
10:   for ℓ ∈ [ℓ_max] do
11:     Define q_ℓ(D) := SENSITIVEDISTINCTCOUNT(D; ℓ) − (2ℓ/ε)·log(1/(2β)).
12:   end for
13:   ℓ̂ ← GEM(D; {q_ℓ}_{ℓ∈[ℓ_max]}, {ℓ}_{ℓ∈[ℓ_max]}, ε/2, β).    ▷ Algorithm 3
14:   ν̂ ← q_ℓ̂(D) + Lap(2ℓ̂/ε).
15:   return (ℓ̂, ν̂) ∈ [ℓ_max] × ℝ.
16: end procedure

Efficient computation.
The main computational task for our algorithm is to compute DC(D; ℓ). By definition (2), this is an optimization problem: for each person i ∈ [n], we must select a subset v_i ⊆ u_i of that person's data of size at most ℓ so as to maximize the size of the union ∪_{i∈[n]} v_i.

We can view the dataset D = (u_1, ..., u_n) ∈ (Ω*)^n as a bipartite graph. On one side we have the n people, and on the other side we have the elements of the data universe Ω.³ There is an edge between i ∈ [n] and x ∈ Ω if and only if x ∈ u_i. We can reduce computing DC(D; ℓ) to a max-flow problem: each edge in the bipartite graph has capacity one; we add a source vertex s connected to each person i ∈ [n] by an edge with capacity ℓ; finally, we add a sink t connected to each x ∈ Ω by an edge with capacity 1. The max flow through this graph is precisely DC(D; ℓ).

Alternatively, we can reduce computing DC(D; ℓ) to bipartite maximum matching. For ℓ = 1, DC(D; 1) is exactly the maximum cardinality of a matching in the bipartite graph described above. For ℓ ≥ 2, we simply create ℓ copies of each person vertex i ∈ [n], and then DC(D; ℓ) is the maximum cardinality of a matching in this new bipartite graph.⁴ Using this reduction, standard algorithms for bipartite maximum matching [HK73; Kar73] allow us to compute DC(D; ℓ) with O(|D|^{1.5} · ℓ) operations. We must repeat this computation for each ℓ ∈ [ℓ_max].

Algorithm 2 Linear-Time Approximate Distinct Count Algorithm
1: procedure DPAPPROXDISTINCTCOUNT(D = (u_1, ..., u_n) ∈ (Ω*)^n; ℓ_max ∈ ℕ, ε > 0, β ∈ (0, 1/2))
2:   S ← ∅.
3:   for ℓ ∈ [ℓ_max] do
4:     for i ∈ [n] with u_i \ S ≠ ∅ do
5:       Choose the lexicographically first v ∈ u_i \ S. Match (i, ℓ) to v.
6:       Update S ← S ∪ {v}.
7:     end for
8:     Define q_ℓ(D) := |S| − (2ℓ/ε)·log(1/(2β)).    ▷ This loop computes {q_ℓ(D)}_{ℓ∈[ℓ_max]}.
9:   end for
10:   ℓ̂ ← GEM(D; {q_ℓ}_{ℓ∈[ℓ_max]}, {ℓ}_{ℓ∈[ℓ_max]}, ε/2, β).    ▷ Algorithm 3
11:   ν̂ ← q_ℓ̂(D) + Lap(2ℓ̂/ε).
12:   return (ℓ̂, ν̂) ∈ [ℓ_max] × ℝ.
13: end procedure

Linear-time algorithm. Our algorithm above is polynomial-time.
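Returning to the matching-based computation of DC(D; ℓ): the reduction (min{ℓ, |u_i|} "slot" vertices per person, then a maximum bipartite matching) can be sketched as follows. This is our illustration, and for brevity it uses simple augmenting paths (Kuhn's algorithm) rather than the Hopcroft-Karp algorithm the paper cites, so its worst-case running time is weaker than the O(|D|^{1.5}·ℓ) bound above:

```python
def distinct_count_matching(people, ell):
    """DC(D; ell) via bipartite maximum matching with augmenting paths.

    Each person contributes min(ell, |u_i|) slot vertices, each adjacent to
    all of that person's items; DC(D; ell) equals the maximum matching size.
    """
    slots = []  # adjacency: slot index -> list of items it may take
    for u in people:
        items = sorted(set(u))
        for _ in range(min(ell, len(items))):
            slots.append(items)
    match = {}  # item -> slot currently matched to it

    def try_augment(s, visited):
        for v in slots[s]:
            if v in visited:
                continue
            visited.add(v)
            # Take v if it is free, or if its current slot can re-match elsewhere.
            if v not in match or try_augment(match[v], visited):
                match[v] = s
                return True
        return False

    size = 0
    for s in range(len(slots)):
        if try_augment(s, set()):
            size += 1
    return size
```

On the toy dataset D = [{a,b,c}, {a}, {b,d}] this returns 3 for ℓ = 1 and 4 for ℓ = 2, matching the brute-force definition.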
However, for many applications the dataset size |D| is enormous. Thus we also propose a linear-time variant of our algorithm, though we must trade accuracy for efficiency. Two key ideas differentiate our linear-time algorithm (Algorithm 2) from our first algorithm (Algorithm 1): First, we compute a maximal bipartite matching instead of a maximum bipartite matching.⁵ This can be done using a linear-time greedy algorithm and gives a 2-approximation to the maximum matching. (Experimentally, we find that the approximation is better than a factor of 2.) Second, rather than repeating the computation from scratch for each ℓ ∈ [ℓ_max], we incrementally update our maximal matching while increasing ℓ. The main challenge here is ensuring that the approximation to DC(D; ℓ) has low sensitivity, i.e., we must ensure that our approximation algorithm doesn't inflate the sensitivity. Note that DC(D; ℓ) having low sensitivity does not automatically ensure that the approximation to it has low sensitivity.

Theorem 1.2 (Theoretical Guarantees of Our Linear-Time Algorithm). Let ε > 0, β ∈ (0, 1/2), and ℓ_max ∈ ℕ. Define M̂ : (Ω*)* → ℝ to be M̂(D) =

³The data universe Ω may be infinite, but we can restrict the computation to the finite set ∪_{i∈[n]} u_i. Thus there are at most n + DC(D) ≤ n + |D| item vertices in the graph.
⁴We need only create min{ℓ, |u_i|} copies of the person i ∈ [n]. Thus the number of person vertices is at most min{nℓ, |D|}.
⁵To clarify the confusing terminology: A matching is a subset of edges such that no two edges have a vertex in common. A maximum matching is a matching of the largest possible size. A maximal matching is a matching such that no edge could be added without violating the matching property. A maximum matching is also a maximal matching, but the reverse is not true.
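The greedy maximal-matching idea behind Algorithm 2 can be sketched as follows (our simplified illustration: the paper's version maintains per-person state so the total work stays linear in |D|, whereas this sketch rescans each person's sorted item list in every round):

```python
def greedy_distinct_counts(people, ell_max):
    """Greedy maximal-matching approximations of DC(D; ell) for all ell.

    Round ell lets each person cover one not-yet-covered element (the
    lexicographically first, as in Algorithm 2). Returns a list whose
    (ell-1)-th entry approximates DC(D; ell); the maximal matching is a
    2-approximation of the maximum matching in the worst case.
    """
    sorted_people = [sorted(set(u)) for u in people]
    covered = set()
    counts = []
    for _ in range(ell_max):
        for items in sorted_people:
            for v in items:  # first uncovered item, if any
                if v not in covered:
                    covered.add(v)
                    break
        counts.append(len(covered))
    return counts
```

For D = [{a,b,c}, {a}, {b,d}] and ℓ_max = 3 this yields [2, 4, 4]: in round one the first person covers a and the third covers b; round two adds c and d; round three adds nothing new.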
Data Set | True Vocabulary Size | Estimated: 10th Percentile | Estimated: Median | Estimated: 90th Percentile
Amazon Fashion | 1450 | 1220.6 | 1319.1 | 1394.2
Amazon Industrial and Scientific | 36665 | 35970.5 | 36198.9 | 36326.7
Reddit | 102835 | 102379.7 | 102512.6 | 102643.9
IMDB | 98726 | 98555.6 | 98670.4 | 98726.8

Table 1: True and estimated (using DPDistinctCount with ε = 1, β = 0.05, and ℓ_max = 100) vocabulary sizes per data set.

DPAPPROXDISTINCTCOUNT(D; ℓ_max, ε, β) from Algorithm 2. Then M̂ satisfies all of the following properties.

Privacy: M̂ is ε-differentially private.

Lower bound: For all D ∈ (Ω*)^n,

P_{(ℓ̂,ν̂)←M̂(D)} [ν̂ ≤ DC(D)] ≥ 1 − β.    (9)

Upper bound: If D = (u_1, ..., u_n) ∈ (Ω*)^n satisfies max_{i∈[n]} |u_i| ≤ ℓ* ≤ ℓ_max, then

P_{(ℓ̂,ν̂)←M̂(D)} [ν̂ ≥ (1/2)·DC(D) − O((ℓ*/ε)·log(ℓ_max/β))] ≥ 1 − β.    (10)

Computational efficiency: M̂(D) has running time O(|D| + ℓ_max·log ℓ_max), where |D| := Σ_{i∈[n]} |u_i|.

The factor 1/2 in the upper bound guarantee (10) is the main loss compared to Theorem 1.1. (The win is the O(|D|) runtime.) This is a worst-case bound, and our experimental results show that for realistic data the performance gap is not so bad. The proofs of Theorems 1.1 and 1.2 are in Appendix A.

2 Related Work

Counting the number of distinct elements in a collection is one of the most fundamental database computations; it is supported as the COUNT(DISTINCT ...) operation in SQL. Hence, unsurprisingly, the problem of computing the number of unique elements in a differentially private way has been extensively investigated. In the case where we assume each person contributes only one element (a.k.a. event-level privacy or item-level privacy), the number of distinct elements has sensitivity 1 and, hence, we can simply release it using Laplace (or Gaussian) noise addition. However, it may not be possible to compute the number of distinct elements exactly due to space, communication, or trust constraints (e.g., in the local model of DP [KLNRS11]).
Most efforts have been focused on creating differentially private algorithms for counting distinct elements under space constraints (and assuming each person contributes a single element). To save space, we wish to compute a small summary of the dataset (called a sketch) that allows us to estimate the number of distinct elements and which can be updated as more elements are added. Smith, Song, and Thakurta [SST20] proved that a variant of the Flajolet-Martin sketch is private, and Pagh and Stausholm [PS20] analyzed a sketch over the binary finite field. Dickens, Thaler, and Ting [DTT22] proved a general privacy result for order-invariant cardinality estimators. Hehir, Ting, and Cormode [HTC23] provided a mergeable private sketch (i.e., two sketches can be combined to obtain a sketch of the union of the two datasets). In contrast, Desfontaines, Lochbihler, and Basin [DLB19] proved an impossibility result for mergeable sketches, which shows that privacy or accuracy must degrade as we merge sketches.

Figure 1: Performance of different algorithms (true value, matching, greedy, and sampling) estimating the distinct count assuming that each person can contribute at most ℓ elements (i.e., estimating DC(D; ℓ)), plotted against the person contribution bound for the Amazon Fashion, Amazon Industrial, Reddit, and IMDB data sets. (These algorithms have bounded sensitivity, but we do not add noise for privacy yet.)

Counting unique elements has also been considered in the pan-private streaming setting [DNPRY10] (the aforementioned algorithms also work in the pan-private setting) and in the continual release streaming setting [GKNM23]. (In the continual release setting the approximate count is continually updated, while in the pan-private setting the approximate count is only revealed once, but at an unknown point in time.)
Kreuter, Wright, Skvortsov, Mirisola, and Wang [KWSMW20] give private algorithms for counting distinct elements in the setting of secure multiparty computation. In the local and shuffle models, the only known results are communication complexity bounds [CGKM21].

A closely related problem is that of identifying as many elements as possible (rather than just counting them); this is known as partition selection, set union, or key selection [SDH23; DVGM22; KKMN09; CWG22; RCR22; GGKSSY20; ZDKSTMAS23]. Note that, by design, DP prevents us from identifying elements that appear only once in the dataset, or only a few times. Thus we can only output items that appear frequently.

The most closely related work to ours is that of Dong, Fang, Yi, Tao, and Machanavajjhala [DFYTM22] and Fang, Dong, and Yi [FDY22]. These papers present two different algorithms for privately approximating the distinct count (and other statistics). We discuss these below and present an experimental comparison in Table 2. We also remark that both papers prove instance optimality guarantees for their algorithms.

Figure 2: Performance of different algorithms (matching, greedy, sampling, and shifted inverse, compared to the true value) estimating the distinct count in a differentially private way for different values of ε; for all of them β = 0.05 and ℓ_max = 100. The values between the 10th and 90th percentile of each algorithm's estimates are shaded in the corresponding colors. For the shifted inverse algorithm, the first two plots contain the results for β = 0.05 and D equal to the true number of distinct elements in the dataset. The latter two data sets lack results for the shifted inverse algorithm due to computational constraints.
Algorithm | PS.AQ | L.EP | O.OD | L.RD
R2T [DFYTM22] | 0.0658 | 0.1759 | 0.0061 | 0.150
(Approx)ShiftedInverse [FDY22] | 0.0553 | 0.0584 | 0.005 | 0.0061
DPApproxDistinctCount | 0.0140 | 0.0110 | 0.0008 | 0.0037
DPDistinctCount | 0.0100 | 0.0096 | 0.0008 | 0.0001

Table 2: Average relative absolute error of the algorithms described in this paper and in [DFYTM22; FDY22] on the TPC-H dataset. Each algorithm was executed 100 times; we removed the 20 largest and 20 smallest values and computed the average error of the remaining 60 values.

Most similar to our algorithm is the Race-to-the-Top (R2T) algorithm [DFYTM22]. R2T is a generic framework; the original paper did not specifically consider counting distinct elements, but the approach can easily be applied to DC(D; ℓ). While we use the generalized exponential mechanism [RS15] to select the sensitivity ℓ, R2T computes multiple lower bounds with different sensitivities ℓ and then outputs the maximum of the noisy values. This approach incurs the cost of composition across the multiple evaluations. To manage this cost, R2T only evaluates ℓ = 2, 4, 8, ..., 2^⌈log ℓ_max⌉. Compared to our guarantee (8), with an error of O((ℓ*/ε)·log(ℓ_max/β)), R2T has a slightly worse theoretical error guarantee of O((ℓ*/ε)·log(ℓ_max)·log(log(ℓ_max)/β)) [DFYTM22, Theorem 5.1].

The shifted inverse mechanism [FDY22] takes a different approach to the problem. Rather than relying on adding Laplace noise (as we do), it applies the exponential mechanism with an ingenious loss function (see [Ste23] for additional discussion). When applied to counting distinct elements, the shifted inverse mechanism gives an accuracy guarantee comparable to ours (8). The downside of the shifted inverse mechanism is that computing the loss function is, in general, NP-hard. Fang, Dong, and Yi [FDY22] propose polynomial-time variants for several specific tasks, including counting distinct elements. However, the algorithm is still relatively slow.
3 Technical Background on Differential Privacy

For detailed background on differential privacy, see the survey by Vadhan [Vad17] or the book by Dwork and Roth [DR14]. We briefly define pure DP and some basic mechanisms and results.

Algorithm 3 Generalized Exponential Mechanism [RS15]
1: procedure GEM(D ∈ X*; q_i : X* → ℝ for i ∈ [m], Δ_i > 0 for i ∈ [m], ε > 0, β > 0)
2:   Require: q_i has sensitivity sup_{x,x′∈X* neighboring} |q_i(x) − q_i(x′)| ≤ Δ_i for all i ∈ [m].
3:   Let t = (2/ε)·log(m/β).
4:   for i ∈ [m] do
5:     s_i ← min_{j∈[m]} ((q_i(D) − t·Δ_i) − (q_j(D) − t·Δ_j)) / (Δ_i + Δ_j).
6:   end for
7:   Sample î ∈ [m] from the exponential mechanism using the normalized scores s_i; i.e., ∀i ∈ [m], P[î = i] = exp((ε/2)·s_i) / Σ_{k∈[m]} exp((ε/2)·s_k).
8:   return î ∈ [m].
9: end procedure

Definition 3.1 (Differential Privacy (DP) [DMNS06]). A randomized algorithm M : X* → Y satisfies ε-DP if, for all inputs D, D′ ∈ X* differing only by the addition or removal of an element and for all measurable S ⊆ Y, we have P[M(D) ∈ S] ≤ e^ε · P[M(D′) ∈ S].

We refer to pairs of inputs that differ only by the addition or removal of one person's data as neighboring. Note that it is common to also consider replacement of one person's data; for simplicity, we do not do this. We remark that there are also variants of DP such as approximate DP [DKMMN06] and concentrated DP [DR16; BS16], which quantitatively relax the definition, but these are not relevant in our application.

Data Set | People | Records | Words per Person: Min | Median | Max | Vocabulary Size
Amazon Fashion | 404 | 8533 | 1 | 14.0 | 139 | 1450
Amazon Industrial and Scientific | 11041 | 1446031 | 0 | 86 | 2059 | 36665
Reddit | 223388 | 7117494 | 0 | 18.0 | 1724 | 102835
IMDB | 50000 | 6688844 | 5 | 110.0 | 925 | 98726

Table 3: Data set details.

A key property of DP is that it composes and is invariant under postprocessing.

Lemma 3.2 (Composition & Postprocessing). Let M_1 : X* → Y be ε_1-DP. Let M_2 : X* × Y → Z be such that, for all y ∈ Y, the restriction M_2(·, y) : X* → Z is ε_2-DP. Define M_12 : X* → Z by M_12(D) = M_2(D, M_1(D)). Then M_12 is (ε_1 + ε_2)-DP.

A basic DP tool is the Laplace mechanism [DMNS06].
Note that we could also use the discrete Laplace mechanism [GRS09; CKS20].

Lemma 3.3 (Laplace Mechanism). Let q : X* → ℝ. We say q has sensitivity Δ if |q(D) − q(D′)| ≤ Δ for all neighboring D, D′ ∈ X*. Define M : X* → ℝ by M(D) = q(D) + Lap(Δ/ε), where Lap(b) denotes Laplace noise with mean 0 and variance 2b², i.e., P_{ξ←Lap(b)}[ξ > t] = P_{ξ←Lap(b)}[ξ < −t] = (1/2)·e^{−t/b} for all t > 0. Then M is ε-DP.

Another fundamental tool for DP is the exponential mechanism [MT07]. It selects the approximately best option from among a set of options, where each option i has a quality function q_i with sensitivity Δ. The following result generalizes the exponential mechanism by allowing each of the quality functions to have a different sensitivity.

Theorem 3.4 (Generalized Exponential Mechanism [RS15, Theorem 1.4]). For each i ∈ [m], let q_i : X* → ℝ be a query with sensitivity Δ_i. Let ε, β > 0. The generalized exponential mechanism (GEM(·; {q_i}_{i∈[m]}, {Δ_i}_{i∈[m]}, ε, β) in Algorithm 3) is ε-DP and has the following utility guarantee: for all D ∈ X*,

P_{î←GEM(D;{q_i},{Δ_i},ε,β)} [ q_î(D) ≥ max_{j∈[m]} ( q_j(D) − (4Δ_j/ε)·log(m/β) ) ] ≥ 1 − β.

4 Experimental Results

We empirically validate the performance of our algorithms using data sets of various sizes from different text domains. We focus on the problem of computing vocabulary size with person-level DP. Section 4.1 describes the data sets and Section 4.2 discusses the algorithms we compare.

4.1 Datasets

We used four publicly available datasets to assess the accuracy of our algorithms compared to baselines. Two small datasets were used: Amazon Fashion 5-core [NLM19] (reviews of fashion products on Amazon) and Amazon Industrial and Scientific 5-core [NLM19] (reviews of industrial and scientific products on Amazon). Two large data sets were also used: Reddit [She20] (a data set of posts collected from r/AskReddit) and IMDb [N20; MDPHNP11] (a set of movie reviews scraped from IMDb). See details of the datasets in Table 3.
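Stepping back to the generalized exponential mechanism (Algorithm 3, Theorem 3.4): its structure, normalizing each score by its sensitivity against a threshold and then running the standard exponential mechanism on the resulting sensitivity-1 scores, can be sketched as follows. This is our hedged illustration: the threshold t is left as an explicit parameter, and the exact setting of t should be taken from [RS15] rather than from this sketch.

```python
import math
import random

def gem(scores, sens, eps, t, rng):
    """Sketch of the generalized exponential mechanism (after [RS15]).

    scores[i] = q_i(D); sens[i] = sensitivity of q_i. The normalized score
    s_i = min_j ((q_i - t*sens_i) - (q_j - t*sens_j)) / (sens_i + sens_j)
    has sensitivity 1, so the exponential mechanism with weights
    exp(eps * s_i / 2) can be applied to it.
    """
    m = len(scores)
    s = [
        min(
            ((scores[i] - t * sens[i]) - (scores[j] - t * sens[j]))
            / (sens[i] + sens[j])
            for j in range(m)
        )
        for i in range(m)
    ]
    # Sample index i with probability proportional to exp(eps * s_i / 2),
    # subtracting max(s) first for numerical stability.
    mx = max(s)
    weights = [math.exp(eps * (si - mx) / 2.0) for si in s]
    r = rng.random() * sum(weights)
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return i
    return m - 1
```

Note how a candidate with large sensitivity pays a penalty of t·Δ_i before the comparison, which is exactly what lets GEM compare the q_ℓ scores of Algorithm 1, whose sensitivities grow with ℓ.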
4.2 Comparisons

Computing the number of distinct elements using a differentially private mechanism involves two steps: selecting a contribution bound (ℓ in our algorithms) and counting the number of distinct elements in a way that restricts each person to contribute only the given number of elements.

Selection: We examine four algorithms for determining the contribution limit:
1. Choosing the true maximum person contribution (due to computational restrictions, this was only computed for the Amazon Fashion data set).
2. Choosing the 90th percentile of person contributions.
3. Choosing the person contribution that exactly maximizes the utility function q_ℓ(D) = DC(D; ℓ) − (ℓ/ε)·log(1/(2β)), where ε = 1 and β = 0.001.
4. Choosing the person contribution that approximately maximizes the utility function using the generalized exponential mechanism with ε = 1.
Note that only the last option is differentially private, but we consider the other comparison points nonetheless.

Counting: We also consider three algorithms for estimating the number of distinct elements for a given sensitivity bound ℓ:
1. For each person, we uniformly sample ℓ elements without replacement and count the number of distinct elements in the union of the samples.
2. The linear-time greedy algorithm (Algorithm 2) with ε = 1 and β = 0.001.
3. The matching-based algorithm (Algorithm 1) with ε = 1 and β = 0.001.
All of these can be converted into DP algorithms by adding Laplace noise to the result. On all our datasets, the true maximum person contribution and the 90th percentile of person contributions produce bounds that are much larger than necessary to obtain the true distinct count; hence, we only consider DP versions of the estimation algorithms for these selection algorithms.

4.3 Results

Figure 1 shows the dependency of the result on the contribution bound for each of the algorithms for computing the number of distinct elements with a fixed person contribution.
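For reference, the sampling baseline (Counting, item 1) can be sketched as follows (our illustration; items are sorted only so that the per-person sets are in a deterministic order before sampling):

```python
import random

def sampled_distinct_count(people, ell, rng):
    """Sampling baseline: each person keeps ell items chosen uniformly
    without replacement; return the number of distinct elements in the
    union. Unlike the matching and greedy approaches, this can waste a
    person's budget on items that other people already cover."""
    union = set()
    for u in people:
        items = sorted(set(u))
        k = min(ell, len(items))
        union.update(rng.sample(items, k))
    return len(union)
```

This has sensitivity at most ℓ, like DC(D; ℓ), but it is only a randomized lower-bound estimate of DC(D; ℓ): on D = [{a,b,c}, {a}, {b,d}] with ℓ = 1 it may return 2 where the matching algorithm is guaranteed to find 3.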
It is clear that the matching and greedy algorithms vastly outperform the sampling approach that is currently used in practice. Tables 4 to 7 show the performance of the algorithms for selecting optimal person contribution bounds on the different data sets. For all bound selection algorithms and all data sets, the sampling approach to estimating the distinct count performs much worse than the greedy and matching-based approaches. The greedy approach performs worse than the matching-based approach, but the difference is about 10% for Amazon Fashion and is almost negligible for the other data sets, since they are much larger. As for the matching-based algorithm, it performs as follows on all the data sets:
1. The algorithm that uses the bound equal to the maximal person contribution overestimates the actually necessary bound; therefore, we only consider the DP algorithms for count estimation. It is easy to see that while the median of the estimate is close to the actual distinct count, the amount of noise is somewhat large.
2. The algorithm that uses the bound equal to the 99th percentile of person contributions also overestimates the necessary bound and behaves similarly to the one we just described (though the spread of the noise is a bit smaller).
3. Two algorithms optimize the utility function: one non-private and one private. The non-private algorithm with non-private estimation gives an answer that is very close to the true number of distinct elements. The private algorithm with non-private estimation gives an answer that is worse, but not by much. Finally, the private algorithm with private estimation gives answers very similar to the results of the non-private estimation.

Acknowledgments and Disclosure of Funding

We would like to thank Badih Ghazi, Andreas Terzis, and four anonymous reviewers for their constructive feedback and valuable suggestions. We thank Markus Hasenöhrl for helpful discussions, which helped us identify the problem.
In addition, we are grateful to Ke Yi, Wei Dong, and Juanru Fang for bringing their related work [DFYTM22; FDY22] to our attention.

References

[BS16] M. Bun and T. Steinke. "Concentrated differential privacy: Simplifications, extensions, and lower bounds". In: Theory of Cryptography: 14th International Conference, TCC 2016-B, Beijing, China, October 31-November 3, 2016, Proceedings, Part I. Springer, 2016, pp. 635-658. URL: https://arxiv.org/abs/1605.02065 (cit. on p. 9).

[CGKM21] L. Chen, B. Ghazi, R. Kumar, and P. Manurangsi. "On Distributed Differential Privacy and Counting Distinct Elements". In: 12th Innovations in Theoretical Computer Science Conference, ITCS 2021, January 6-8, 2021, Virtual Conference. Ed. by J. R. Lee. Vol. 185. LIPIcs. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021, 56:1-56:18. URL: https://doi.org/10.4230/LIPIcs.ITCS.2021.56 (cit. on p. 6).

[CKS20] C. L. Canonne, G. Kamath, and T. Steinke. "The discrete Gaussian for differential privacy". In: Advances in Neural Information Processing Systems 33 (2020), pp. 15676-15688. URL: https://arxiv.org/abs/2004.00010 (cit. on p. 9).

[CWG22] R. S. Carvalho, K. Wang, and L. S. Gondara. "Incorporating Item Frequency for Differentially Private Set Union". In: Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelfth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022, Virtual Event, February 22 - March 1, 2022. AAAI Press, 2022, pp. 9504-9511. URL: https://ojs.aaai.org/index.php/AAAI/article/view/21183 (cit. on p. 6).

[DDKPWWXZ23] Z. Ding, J. Durrell, D. Kifer, P. Protivash, G. Wang, Y. Wang, Y. Xiao, and D. Zhang. "A Floating-Point Secure Implementation of the Report Noisy Max with Gap Mechanism". In: arXiv preprint arXiv:2308.08057 (2023). URL: https://arxiv.org/abs/2308.08057 (cit. on p. 21).

[DFYTM22] W. Dong, J. Fang, K. Yi, Y. Tao, and A. Machanavajjhala.
"R2T: Instance Optimal Truncation for Differentially Private Query Evaluation with Foreign Keys". In: Proceedings of the 2022 International Conference on Management of Data (SIGMOD '22). Philadelphia, PA, USA: Association for Computing Machinery, 2022, pp. 759-772. ISBN: 9781450392495. URL: https://cse.hkust.edu.hk/~yike/R2T.pdf (cit. on pp. 6, 8, 11).

[DKMMN06] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. "Our data, ourselves: Privacy via distributed noise generation". In: Advances in Cryptology - EUROCRYPT 2006: 24th Annual International Conference on the Theory and Applications of Cryptographic Techniques, St. Petersburg, Russia, May 28-June 1, 2006. Proceedings 25. Springer, 2006, pp. 486-503 (cit. on p. 9).

[DKSWXZ+21] Z. Ding, D. Kifer, T. Steinke, Y. Wang, Y. Xiao, D. Zhang, et al. "The permute-and-flip mechanism is identical to report-noisy-max with exponential noise". In: arXiv preprint arXiv:2105.07260 (2021). URL: https://arxiv.org/abs/2105.07260 (cit. on p. 21).

[DLB19] D. Desfontaines, A. Lochbihler, and D. A. Basin. "Cardinality Estimators do not Preserve Privacy". In: Proc. Priv. Enhancing Technol. 2019.2 (2019), pp. 26-46. URL: https://doi.org/10.2478/popets-2019-0018 (cit. on p. 5).

[DMNS06] C. Dwork, F. McSherry, K. Nissim, and A. Smith. "Calibrating noise to sensitivity in private data analysis". In: Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3. Springer, 2006, pp. 265-284 (cit. on pp. 1, 8, 9).

[DN03] I. Dinur and K. Nissim. "Revealing information while preserving privacy". In: Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. 2003, pp. 202-210 (cit. on p. 1).

[DNPRY10] C. Dwork, M. Naor, T. Pitassi, G. N. Rothblum, and S. Yekhanin. "Pan-Private Streaming Algorithms". In: Innovations in Computer Science - ICS 2010, Tsinghua University, Beijing, China, January 5-7, 2010. Proceedings. Ed. by A. C. Yao.
Tsinghua University Press, 2010, pp. 66-80. URL: http://conference.iiis.tsinghua.edu.cn/ICS2010/content/papers/6.html (cit. on p. 6).

[DR14] C. Dwork and A. Roth. "The algorithmic foundations of differential privacy". In: Foundations and Trends in Theoretical Computer Science 9.3-4 (2014), pp. 211-407. URL: https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf (cit. on p. 8).

[DR16] C. Dwork and G. N. Rothblum. "Concentrated differential privacy". In: arXiv preprint arXiv:1603.01887 (2016). URL: https://arxiv.org/abs/1603.01887 (cit. on p. 9).

[DTT22] C. Dickens, J. Thaler, and D. Ting. "Order-Invariant Cardinality Estimators Are Differentially Private". In: NeurIPS. 2022. URL: https://arxiv.org/abs/2203.15400 (cit. on p. 5).

[DVGM22] D. Desfontaines, J. Voss, B. Gipson, and C. Mandayam. "Differentially private partition selection". In: Proc. Priv. Enhancing Technol. 2022.1 (2022), pp. 339-352. URL: https://doi.org/10.2478/popets-2022-0017 (cit. on p. 6).

[FDY22] J. Fang, W. Dong, and K. Yi. "Shifted Inverse: A General Mechanism for Monotonic Functions under User Differential Privacy". In: Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security (CCS '22). Los Angeles, CA, USA: Association for Computing Machinery, 2022, pp. 1009-1022. ISBN: 9781450394505. URL: https://cse.hkust.edu.hk/~yike/ShiftedInverse.pdf (cit. on pp. 6, 8, 11).

[GGKSSY20] S. Gopi, P. Gulhane, J. Kulkarni, J. H. Shen, M. Shokouhi, and S. Yekhanin. "Differentially private set union". In: International Conference on Machine Learning. PMLR, 2020, pp. 3627-3636. URL: https://arxiv.org/abs/2002.09745 (cit. on p. 6).

[GKNM23] B. Ghazi, R. Kumar, J. Nelson, and P. Manurangsi. "Private Counting of Distinct and k-Occurring Items in Time Windows". In: 14th Innovations in Theoretical Computer Science Conference, ITCS 2023, January 10-13, 2023, MIT, Cambridge, Massachusetts, USA. Ed. by Y. T. Kalai. Vol. 251. LIPIcs.
Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023, 55:1–55:24. URL: https://doi.org/10.4230/LIPIcs.ITCS.2023.55 (cit. on p. 6).
[God] W. Goddard. A brief undergraduate text for Algorithms. Chapter G4: Approximation Algorithms. URL: https://people.computing.clemson.edu/~goddard/texts/algor/G4.pdf (cit. on p. 19).
[GRS09] A. Ghosh, T. Roughgarden, and M. Sundararajan. "Universally utility-maximizing privacy mechanisms". In: Proceedings of the forty-first annual ACM symposium on Theory of computing. 2009, pp. 351–360. URL: https://arxiv.org/abs/0811.2841 (cit. on p. 9).
[HK73] J. E. Hopcroft and R. M. Karp. "An n^{5/2} Algorithm for Maximum Matchings in Bipartite Graphs". In: SIAM J. Comput. 2.4 (1973), pp. 225–231. URL: https://doi.org/10.1137/0202019 (cit. on pp. 3, 4).
[HTC23] J. Hehir, D. Ting, and G. Cormode. "Sketch-Flip-Merge: Mergeable Sketches for Private Distinct Counting". In: arXiv preprint arXiv:2302.02056 (2023). URL: https://arxiv.org/abs/2302.02056 (cit. on p. 5).
[Ilv20] C. Ilvento. "Implementing the exponential mechanism with base-2 differential privacy". In: Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security. 2020, pp. 717–742. URL: https://arxiv.org/abs/1912.04222 (cit. on p. 21).
[Kar73] A. Karzanov. "An exact estimate of an algorithm for finding a maximum flow, applied to the problem on representatives". In: Issues of Cybernetics. Proc. of the Seminar on Combinatorial Mathematics (1973), pp. 66–70 (cit. on pp. 3, 4).
[KKMN09] A. Korolova, K. Kenthapadi, N. Mishra, and A. Ntoulas. "Releasing search queries and clicks privately". In: Proceedings of the 18th International Conference on World Wide Web, WWW 2009, Madrid, Spain, April 20–24, 2009. Ed. by J. Quemada, G. León, Y. S. Maarek, and W. Nejdl. ACM, 2009, pp. 171–180. URL: https://doi.org/10.1145/1526709.1526733 (cit. on p. 6).
[KLNRS11] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith. "What can we learn privately?"
In: SIAM Journal on Computing 40.3 (2011), pp. 793–826. URL: https://arxiv.org/abs/0803.0924 (cit. on p. 5).
[KWSMW20] B. Kreuter, C. W. Wright, E. S. Skvortsov, R. Mirisola, and Y. Wang. "Privacy-preserving secure cardinality and frequency estimation". In: (2020). URL: https://research.google/pubs/pub49177/ (cit. on p. 6).
[MDPHNP11] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. "Learning Word Vectors for Sentiment Analysis". In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA: Association for Computational Linguistics, June 2011, pp. 142–150. URL: http://www.aclweb.org/anthology/P11-1015 (cit. on p. 9).
[MS20] R. McKenna and D. R. Sheldon. "Permute-and-Flip: A new mechanism for differentially private selection". In: Advances in Neural Information Processing Systems 33 (2020), pp. 193–203. URL: https://arxiv.org/abs/2010.12603 (cit. on p. 21).
[MT07] F. McSherry and K. Talwar. "Mechanism design via differential privacy". In: 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS '07). IEEE. 2007, pp. 94–103 (cit. on pp. 2, 9).
[N20] L. N. IMDB Dataset of 50K Movie Reviews. English. 2020. URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of50k-movie-reviews (cit. on p. 9).
[NLM19] J. Ni, J. Li, and J. J. McAuley. "Justifying Recommendations using Distantly Labeled Reviews and Fine-Grained Aspects". In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019. Ed. by K. Inui, J. Jiang, V. Ng, and X. Wan. Association for Computational Linguistics, 2019, pp. 188–197. URL: https://doi.org/10.18653/v1/D19-1018 (cit. on p. 9).
[PS20] R. Pagh and N. M. Stausholm. "Efficient Differentially Private F_0 Linear Sketching". In: arXiv preprint arXiv:2001.11932 (2020).
URL: https://arxiv.org/abs/2001.11932 (cit. on p. 5).
[RCR22] A. Rivera Cardoso and R. Rogers. "Differentially Private Histograms under Continual Observation: Streaming Selection into the Unknown". In: Proceedings of The 25th International Conference on Artificial Intelligence and Statistics. Ed. by G. Camps-Valls, F. J. R. Ruiz, and I. Valera. Vol. 151. Proceedings of Machine Learning Research. PMLR, 2022, pp. 2397–2419. URL: https://arxiv.org/abs/2103.16787 (cit. on p. 6).
[RS15] S. Raskhodnikova and A. D. Smith. "Efficient Lipschitz Extensions for High-Dimensional Graph Statistics and Node Private Degree Distributions". In: CoRR abs/1504.07912 (2015). arXiv: 1504.07912. URL: http://arxiv.org/abs/1504.07912 (cit. on pp. 2, 8, 9, 20).
[SDH23] M. Swanberg, D. Desfontaines, and S. Haney. "DP-SIPS: A simpler, more scalable mechanism for differentially private partition selection". In: CoRR abs/2301.01998 (2023). arXiv: 2301.01998. URL: https://doi.org/10.48550/arXiv.2301.01998 (cit. on p. 6).
[She20] J. H. Shen. Ask Reddit. English. 2020. URL: https://github.com/heyyjudes/differentially-private-set-union/tree/ea7b39285dace35cc9e9029692802759f3e1c8e8/data (cit. on p. 9).
[SST20] A. D. Smith, S. Song, and A. Thakurta. "The Flajolet-Martin Sketch Itself Preserves Differential Privacy: Private Counting with Minimal Space". In: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual. Ed. by H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin. 2020. URL: https://proceedings.neurips.cc/paper/2020/hash/e3019767b1b23f82883c9850356b71d6-Abstract.html (cit. on p. 5).
[Ste23] T. Steinke. Beyond Local Sensitivity via Down Sensitivity. DifferentialPrivacy.org. https://differentialprivacy.org/downsensitivity/. Sept. 2023 (cit. on p. 8).
[Vad17] S. Vadhan. "The complexity of differential privacy".
In: Tutorials on the Foundations of Cryptography: Dedicated to Oded Goldreich (2017), pp. 347–450. URL: https://privacytools.seas.harvard.edu/files/privacytools/files/manuscript_2016.pdf (cit. on p. 8).
[ZDKSTMAS23] B. Zhang, V. Doroshenko, P. Kairouz, T. Steinke, A. Thakurta, Z. Ma, H. Apte, and J. Spacek. "Differentially Private Stream Processing at Scale". In: arXiv preprint arXiv:2303.18086 (2023). URL: https://arxiv.org/abs/2303.18086 (cit. on p. 6).

Selection | Counting | Person Contribution Bound (10th PC / Median / 90th PC) | Distinct Count (10th PC / Median / 90th PC)
Max Contrib | DP Sampling | 139 | 1196.8 / 1407.5 / 1649.1
Max Contrib | DP Greedy | 139 | 1174.2 / 1439.2 / 1646.5
Max Contrib | DP Matching | 139 | 1222.2 / 1460.9 / 1631.0
90th PC Contrib | DP Sampling | 48 | 1225.4 / 1296.2 / 1377.9
90th PC Contrib | DP Greedy | 48 | 1367.0 / 1432.6 / 1516.3
90th PC Contrib | DP Matching | 48 | 1365.3 / 1444.7 / 1524.8
Max Utility | Sampling | 41 | 1247.0 / 1259.0 / 1270.0
Max Utility | Greedy | 20 | 1376
Max Utility | Matching | 17 | 1428
DP Max Utility | Sampling | 8.9 / 16.0 / 28.0 | 661.6 / 892.5 / 1124.5
DP Max Utility | Greedy | 8.0 / 11.0 / 17.0 | 1148.0 / 1241.0 / 1348.0
DP Max Utility | Matching | 7.0 / 9.0 / 14.0 | 1252.0 / 1317.0 / 1400.0
DP Max Utility | DP Sampling | 9.0 / 16.0 / 27.1 | 702.4 / 899.1 / 1145.1
DP Max Utility | DP Greedy | 8.0 / 10.0 / 19.0 | 1128.5 / 1224.4 / 1370.8
DP Max Utility | DP Matching | 6.9 / 9.0 / 13.1 | 1220.6 / 1319.1 / 1394.2
Table 4: Amazon Fashion: the comparison is for ℓmax = 100. A single number denotes a deterministic value.
Selection | Counting | Person Contribution Bound (10th PC / Median / 90th PC) | Distinct Count (10th PC / Median / 90th PC)
90th PC Contrib | DP Sampling | 297 | 32458.1 / 32943.8 / 33452.6
90th PC Contrib | DP Greedy | 297 | 36270.3 / 36669.5 / 37019.0
90th PC Contrib | DP Matching | 297 | 36236.2 / 36651.7 / 37102.7
Max Utility | Sampling | 99 | 24967.0 / 25039.0 / 25121.2
Max Utility | Greedy | 79 | 36246
Max Utility | Matching | 42 | 36364
DP Max Utility | Sampling | 85.9 / 96.0 / 99.0 | 23852.8 / 24739.0 / 25049.8
DP Max Utility | Greedy | 34.0 / 49.0 / 66.1 | 35393.0 / 35839.0 / 36116.9
DP Max Utility | Matching | 22.9 / 30.5 / 43.2 | 36026.8 / 36243.5 / 36371.2
DP Max Utility | DP Sampling | 87.0 / 95.0 / 99.0 | 23997.6 / 24701.1 / 25067.7
DP Max Utility | DP Greedy | 32.9 / 47.5 / 68.0 | 35336.6 / 35776.2 / 36136.6
DP Max Utility | DP Matching | 22.0 / 28.0 / 38.0 | 35970.5 / 36198.9 / 36326.7
Table 5: Amazon Industrial and Scientific: the comparison is for ℓmax = 100. A single number denotes a deterministic value.

Lemma A.1 (Sensitivity of DC(D; ℓ)). As in Equation (2), for D ∈ (Ω*)^n and ℓ ∈ ℕ, define

DC(D; ℓ) := max{ |⋃_{i∈[n]} v_i| : ∀i ∈ [n], v_i ⊆ u_i ∧ |v_i| ≤ ℓ }.

Let D, D' ∈ (Ω*)* be neighboring. That is, D and D' differ only by the addition or removal of one entry. Then |DC(D; ℓ) − DC(D'; ℓ)| ≤ ℓ for all ℓ ∈ ℕ.

Proof. Without loss of generality, D = (u_1, …, u_n) ∈ (Ω*)^n and D' = (u_1, …, u_{n−1}) ∈ (Ω*)^{n−1}. I.e., D' is D with person n removed. Let ℓ ∈ ℕ.
Selection | Counting | Person Contribution Bound (10th PC / Median / 90th PC) | Distinct Count (10th PC / Median / 90th PC)
90th PC Contrib | DP Sampling | 75 | 92480.7 / 92654.8 / 92812.1
90th PC Contrib | DP Greedy | 75 | 102544.8 / 102665.7 / 102817.7
90th PC Contrib | DP Matching | 75 | 102651.1 / 102784.1 / 102907.8
Max Utility | Sampling | 99 | 95606.9 / 95692.0 / 95750.3
Max Utility | Greedy | 52 | 102543
Max Utility | Matching | 32 | 102685
DP Max Utility | Sampling | 89.0 / 96.0 / 99.0 | 94549.9 / 95394.5 / 95656.5
DP Max Utility | Greedy | 26.0 / 33.0 / 50.0 | 102015.0 / 102253.0 / 102527.0
DP Max Utility | Matching | 14.0 / 18.5 / 30.0 | 102357.0 / 102501.5 / 102671.0
DP Max Utility | DP Sampling | 88.8 / 96.0 / 99.0 | 94665.2 / 95375.5 / 95693.5
DP Max Utility | DP Greedy | 27.0 / 34.0 / 53.0 | 102053.2 / 102289.6 / 102531.2
DP Max Utility | DP Matching | 14.9 / 18.5 / 28.0 | 102379.7 / 102512.6 / 102643.9
Table 6: Reddit: the comparison is for ℓmax = 100. A single number denotes a deterministic value.

Selection | Counting | Person Contribution Bound (10th PC / Median / 90th PC) | Distinct Count (10th PC / Median / 90th PC)
90th PC Contrib | DP Sampling | 238 | 95264.5 / 95593.5 / 95966.1
90th PC Contrib | DP Greedy | 238 | 98411.0 / 98734.0 / 99120.0
90th PC Contrib | DP Matching | 238 | 98354.2 / 98729.4 / 99164.2
Max Utility | Sampling | 29 | 49907.8 / 50036.5 / 50195.3
Max Utility | Greedy | 29 | 98459
Max Utility | Matching | 19 | 98712
DP Max Utility | Sampling | 29.0 / 29.0 / 29.0 | 49899.6 / 50070.5 / 50220.9
DP Max Utility | Greedy | 22.0 / 25.0 / 29.0 | 98244.0 / 98364.0 / 98459.0
DP Max Utility | Matching | 13.0 / 16.0 / 21.0 | 98586.0 / 98674.0 / 98721.0
DP Max Utility | DP Sampling | 29.0 / 29.0 / 29.0 | 49924.2 / 50053.7 / 50211.9
DP Max Utility | DP Greedy | 20.0 / 26.0 / 29.0 | 98126.7 / 98369.6 / 98451.8
DP Max Utility | DP Matching | 12.0 / 16.0 / 21.0 | 98555.6 / 98670.4 / 98726.8
Table 7: IMDB: the comparison is for ℓmax = 30. A single number denotes a deterministic value.

Let v_1 ⊆ u_1, …, v_n ⊆ u_n satisfy ∀i ∈ [n] |v_i| ≤ ℓ and |⋃_{i∈[n]} v_i| = DC(D; ℓ), and let v'_1 ⊆ u_1, …, v'_{n−1} ⊆ u_{n−1} satisfy ∀i ∈ [n−1] |v'_i| ≤ ℓ and |⋃_{i∈[n−1]} v'_i| = DC(D'; ℓ). We can convert the witness v'_1, …, v'_{n−1} for D' into a witness for D simply by adding the empty set. Define v'_n = ∅, which satisfies v'_n ⊆ u_n and |v'_n| ≤ ℓ. Then

DC(D; ℓ) = max{ |⋃_{i∈[n]} v_i| : ∀i ∈ [n], v_i ⊆ u_i ∧ |v_i| ≤ ℓ } ≥ |⋃_{i∈[n]} v'_i| = DC(D'; ℓ).
Similarly, we can convert the witness v_1, …, v_n for D into a witness for D' simply by discarding v_n:

DC(D'; ℓ) = max{ |⋃_{i∈[n−1]} v'_i| : ∀i ∈ [n−1], v'_i ⊆ u_i ∧ |v'_i| ≤ ℓ } ≥ |⋃_{i∈[n−1]} v_i| ≥ |⋃_{i∈[n]} v_i| − |v_n| ≥ DC(D; ℓ) − ℓ.

Thus |DC(D; ℓ) − DC(D'; ℓ)| ≤ ℓ, as required.

Proof of Theorem 1.1. Privacy: First note that q_ℓ(D) = DC(D; ℓ) − (2ℓ/ε)·log(1/2β) has sensitivity ℓ. Algorithm 1 accesses the dataset via q_ℓ(D) in two ways: first it runs the generalized exponential mechanism to select ℓ̂, and second it computes ν̂ ← q_ℓ̂(D) + Lap(2ℓ̂/ε). Since the generalized exponential mechanism is ε/2-DP and adding Laplace noise is also ε/2-DP, the overall algorithm is ε-DP by composition.

Lower bound: Since ν̂ ← q_ℓ̂(D) + Lap(2ℓ̂/ε), we have

P_ν̂[ν̂ ≤ q_ℓ̂(D) + (2ℓ̂/ε)·log(1/2β)] = P_ν̂[ν̂ ≥ q_ℓ̂(D) − (2ℓ̂/ε)·log(1/2β)] = 1 − β.   (11)

Since DC(D; ℓ̂) ≤ DC(D) and q_ℓ(D) = DC(D; ℓ) − (2ℓ/ε)·log(1/2β), Equation (11) gives

P_ν̂[ν̂ ≤ DC(D)] ≥ P_ν̂[ν̂ ≤ DC(D; ℓ̂)] = 1 − β.

This is the guarantee of Equation (6); ν̂ is a lower bound on DC(D) with probability ≥ 1 − β.

Upper bound: The accuracy guarantee of the generalized exponential mechanism (Theorem 3.4) is

P[ q_ℓ̂(D) ≥ max_{ℓ∈[ℓmax]} ( q_ℓ(D) − (4ℓ/(ε/2))·log(ℓmax/β) ) ] ≥ 1 − β.   (12)

Combining Equations (11) and (12) with a union bound yields

P_{(ℓ̂,ν̂)←M(D)}[ ν̂ ≥ max_{ℓ∈[ℓmax]} ( DC(D; ℓ) − ((2ℓ + 2ℓ̂)/ε)·log(1/2β) − (8ℓ/ε)·log(ℓmax/β) ) ] ≥ 1 − 2β.   (13)

To interpret Equation (13) we need a high-probability upper bound on ℓ̂. Let A > 0 be determined later and define ℓ*_A := argmax_{ℓ∈[ℓmax]} ( DC(D; ℓ) − A·ℓ/ε ), so that

DC(D; ℓ) ≤ DC(D; ℓ*_A) + (ℓ − ℓ*_A)·A/ε   for all ℓ ∈ [ℓmax].   (14)

Assume the event in Equation (12) happens.
We have

DC(D; ℓ̂) − (2ℓ̂/ε)·log(1/2β) ≤ DC(D; ℓ*_A) + (ℓ̂ − ℓ*_A)·A/ε − (2ℓ̂/ε)·log(1/2β)   (by Equation (14))

and

DC(D; ℓ̂) − (2ℓ̂/ε)·log(1/2β) = q_ℓ̂(D)
  ≥ max_{ℓ∈[ℓmax]} ( q_ℓ(D) − (4ℓ/(ε/2))·log(ℓmax/β) )   (by assumption)
  ≥ DC(D; ℓ*_A) − (2ℓ*_A/ε)·log(1/2β) − (8ℓ*_A/ε)·log(ℓmax/β).

Combining inequalities yields

DC(D; ℓ*_A) − (2ℓ*_A/ε)·log(1/2β) − (8ℓ*_A/ε)·log(ℓmax/β) ≤ DC(D; ℓ*_A) + (ℓ̂ − ℓ*_A)·A/ε − (2ℓ̂/ε)·log(1/2β),   (15)

which simplifies to

ℓ̂ ≤ ℓ*_A · ( 1 + (8·log(ℓmax/β)) / (2·log(1/2β) − A) )   for 0 < A < 2·log(1/2β).   (16)

Now we set A = log(1/2β) to obtain

ℓ̂ ≤ ℓ*_A · ( 1 + (8·log(ℓmax/β)) / log(1/2β) ).   (17)

Substituting Equation (17) into Equation (13) gives

P_{(ℓ̂,ν̂)←M(D)}[ ν̂ ≥ max_{ℓ∈[ℓmax]} ( DC(D; ℓ) − ((2ℓ + 2ℓ*_A)/ε)·log(1/2β) − ((8ℓ + 16ℓ*_A)/ε)·log(ℓmax/β) ) ] ≥ 1 − 2β.   (18)

We simplify Equation (18) using log(1/2β) ≤ log(ℓmax/β) to obtain Equation (7).

Computational efficiency: Finally, the runtime of DPDISTINCTCOUNT(D) is dominated by ℓmax calls to the SENSITIVEDISTINCTCOUNT(D) subroutine, which computes the maximum size of a matching in a bipartite graph with |E| = Σ_{i∈[n]} |u_i|·min{ℓ, |u_i|} ≤ |D|·ℓmax edges and |V| + |U| = DC(D) + Σ_{i∈[n]} min{ℓ, |u_i|} ≤ 2|D| vertices. The Hopcroft-Karp-Karzanov algorithm runs in O(|E|·√(|V| + |U|)) ≤ O(|D|^{1.5}·ℓmax) time.

Proof of Theorem 1.2. Privacy: For convenience we analyze Algorithm 4, which produces the same result as Algorithm 2. (The difference is that Algorithm 4 is written to be efficient, while Algorithm 2 is written in a redundant manner to make the subroutine we are analyzing clear.) The privacy of the algorithm follows by composing the privacy guarantees of the generalized exponential mechanism and Laplace noise addition. The only missing part is to prove that SENSITIVEAPPROXDISTINCTCOUNT(·, ℓ) has sensitivity ℓ. Since

SENSITIVEAPPROXDISTINCTCOUNT((u_1, …, u_n), ℓ) = SENSITIVEAPPROXDISTINCTCOUNT((u_1, …, u_n, …, u_1, …, u_n), 1),

where the dataset on the right-hand side repeats (u_1, …, u_n) ℓ times, it is enough to prove that SENSITIVEAPPROXDISTINCTCOUNT(·, 1) has sensitivity 1. Assume D = (u_1, …, u_n) and D' = (u_1, …, u_{j−1}, u_{j+1}, …, u_n).
Algorithm 4 Approximate Distinct Count Algorithm
1: procedure SENSITIVEAPPROXDISTINCTCOUNT(D = (u_1, …, u_n) ∈ (Ω*)^n; ℓ ∈ ℕ)
2:   S ← ∅.
3:   for ℓ' ∈ [ℓ] do
4:     for i ∈ [n] with u_i \ S ≠ ∅ do
5:       Choose the lexicographically first v ∈ u_i \ S. Match (i, ℓ') to v.
6:       Update S ← S ∪ {v}.
7:     end for
8:   end for
9:   return |S|
10: end procedure
11: procedure DPAPPROXDISTINCTCOUNT(D = (u_1, …, u_n) ∈ (Ω*)^n; ℓmax ∈ ℕ, ε > 0, β ∈ (0, 1/2))
12:   for ℓ ∈ [ℓmax] do
13:     Define q_ℓ(D) := SENSITIVEAPPROXDISTINCTCOUNT(D; ℓ) − (2ℓ/ε)·log(1/2β).
14:   end for
15:   ℓ̂ ← GEM(D; {q_ℓ}_{ℓ∈[ℓmax]}, {ℓ}_{ℓ∈[ℓmax]}, ε/2, β).   ▷ Algorithm 3
16:   ν̂ ← q_ℓ̂(D) + Lap(2ℓ̂/ε).
17:   return (ℓ̂, ν̂) ∈ [ℓmax] × ℝ.
18: end procedure

Let S_1, …, S_n and v_1, …, v_n be the states of S and v (v_i = ∅ if u_i \ S_{i−1} = ∅), respectively, when we run SENSITIVEAPPROXDISTINCTCOUNT(D, 1). Similarly, let S'_1, …, S'_{j−1}, S'_{j+1}, …, S'_n and v'_1, …, v'_{j−1}, v'_{j+1}, …, v'_n be the states of S and v (v'_i = ∅ if u_i \ S'_{i−1} = ∅),6 respectively, when we run SENSITIVEAPPROXDISTINCTCOUNT(D', 1). Our goal is to show that ||S_n| − |S'_n|| ≤ 1. Define S'_j = S'_{j−1} and v'_j = ∅. Clearly S'_i = S_i and v'_i = v_i for all i < j. For i ≥ j, we claim that S'_i ⊆ S_i and |S_i| ≤ |S'_i| + 1; more precisely, S_i = S'_i ∪ {v*_i} for some v*_i (possibly v*_i = ∅, whence S_i = S'_i). This is true for i = j, since S'_j = S'_{j−1} = S_{j−1} and S_j = S_{j−1} ∪ {v_j}, so we may take v*_j = v_j. We prove the claim for i > j by induction. I.e., assume S_{i−1} = S'_{i−1} ∪ {v*_{i−1}} for some v*_{i−1} (possibly v*_{i−1} = ∅, whence S_{i−1} = S'_{i−1}). Now v_i is the lexicographically first element of u_i \ S_{i−1} and v'_i is the lexicographically first element of u_i \ S'_{i−1} (or ∅ if these sets are empty). By the induction assumption, u_i \ S_{i−1} ⊆ u_i \ S'_{i−1} ⊆ (u_i \ S_{i−1}) ∪ {v*_{i−1}}. Thus either v'_i = v_i or v'_i = v*_{i−1}. If v'_i = v_i, then S'_i = S'_{i−1} ∪ {v'_i} ⊆ S_{i−1} ∪ {v_i} = S_i and S'_i ∪ {v*_{i−1}} = S'_{i−1} ∪ {v'_i, v*_{i−1}} ⊇ S_{i−1} ∪ {v_i} = S_i, so the claim holds with v*_i = v*_{i−1}. If v'_i = v*_{i−1}, then S'_i = S'_{i−1} ∪ {v*_{i−1}} = S_{i−1} ⊆ S_i and S'_i ∪ {v_i} = S'_{i−1} ∪ {v_i, v*_{i−1}} ⊇ S_{i−1} ∪ {v_i} = S_i, so the claim holds with v*_i = v_i. The sensitivity bound implies that q_ℓ(D) = |S| − (2ℓ/ε)·log(1/2β) has sensitivity ℓ.
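For concreteness, the greedy subroutine on lines 1–10 of Algorithm 4 can be sketched in a few lines of Python. This is our own illustrative code: the function name, snake-cased from the pseudocode, and the list-of-lists data layout are assumptions, and the DP wrapper (GEM selection and Laplace noise) is omitted.

```python
def sensitive_approx_distinct_count(dataset, bound):
    """Greedy bounded-contribution distinct count (Algorithm 4, lines 1-10).

    dataset: list of per-person element lists; bound: contribution bound ℓ.
    Each of the `bound` rounds lets every person contribute at most one
    element not already covered, so each person changes the result by at
    most `bound` -- this is the sensitivity bound used in the proof.
    """
    covered = set()                 # the set S
    for _ in range(bound):          # rounds ℓ' = 1, ..., ℓ
        for items in dataset:       # each person i
            # lexicographically first uncovered element of u_i, if any
            remaining = sorted(set(items) - covered)
            if remaining:
                covered.add(remaining[0])
    return len(covered)

# Person 2 alone holds five elements, but the bound caps their influence.
data = [["a", "b"], ["b", "c", "d", "e", "f"], ["a"]]
print(sensitive_approx_distinct_count(data, 1))  # -> 2
print(sensitive_approx_distinct_count(data, 2))  # -> 3
```

Re-sorting each person's remaining elements every round keeps the sketch short; the linked-list implementation described in the computational-efficiency paragraph avoids this and achieves O(|D|) for the main loop.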
Since the generalized exponential mechanism is ε/2-DP and adding Laplace noise is also ε/2-DP, the overall algorithm is ε-DP by composition.

Lower bound: Let us denote by D̂C(D; ℓ) = SENSITIVEAPPROXDISTINCTCOUNT(D, ℓ) the value of |S| we obtain on Line 8 in Algorithm 2. Note that D̂C(D; ℓ) is the size of a maximal matching in G_ℓ, where G_ℓ is the bipartite graph corresponding to the input D with ℓ copies of each person (see Algorithm 1 for a formal description of the graph). Since a maximal matching is a 2-approximation to a maximum matching [God], we have

(1/2)·DC(D; ℓ) ≤ D̂C(D; ℓ) ≤ DC(D; ℓ) ≤ DC(D).   (19)

Since ν̂ ← q_ℓ̂(D) + Lap(2ℓ̂/ε), we have

P_ν̂[ν̂ ≤ q_ℓ̂(D) + (2ℓ̂/ε)·log(1/2β)] = P_ν̂[ν̂ ≥ q_ℓ̂(D) − (2ℓ̂/ε)·log(1/2β)] = 1 − β.   (20)

6 For notational convenience, we define {∅} = ∅.

Substituting q_ℓ(D) = D̂C(D; ℓ) − (2ℓ/ε)·log(1/2β) into Equations (19) and (20) gives Equation (9):

P_ν̂[ν̂ ≤ DC(D)] ≥ P_ν̂[ν̂ ≤ D̂C(D; ℓ̂)] = 1 − β.

Upper bound: The accuracy guarantee of the generalized exponential mechanism (Theorem 3.4) is

P[ q_ℓ̂(D) ≥ max_{ℓ∈[ℓmax]} ( q_ℓ(D) − (4ℓ/(ε/2))·log(ℓmax/β) ) ] ≥ 1 − β.   (21)

Combining Equations (20) and (21) and the definition of q_ℓ(D) with a union bound yields

P_{(ℓ̂,ν̂)←M(D)}[ ν̂ ≥ max_{ℓ∈[ℓmax]} ( D̂C(D; ℓ) − ((2ℓ + 2ℓ̂)/ε)·log(1/2β) − (8ℓ/ε)·log(ℓmax/β) ) ] ≥ 1 − 2β.

Now we assume max_{i∈[n]} |u_i| ≤ ℓ* ≤ ℓmax for some ℓ*. This implies DC(D) = DC(D; ℓ*). We have

max_{ℓ∈[ℓmax]} ( D̂C(D; ℓ) − ((2ℓ + 2ℓ̂)/ε)·log(1/2β) − (8ℓ/ε)·log(ℓmax/β) )
  ≥ D̂C(D; ℓ*) − ((2ℓ* + 2ℓ̂)/ε)·log(1/2β) − (8ℓ*/ε)·log(ℓmax/β)
  ≥ (1/2)·DC(D) − ((2ℓ* + 2ℓ̂)/ε)·log(1/2β) − (8ℓ*/ε)·log(ℓmax/β).

As in the proof of Theorem 1.1, if the event in Equation (21) happens, we can show that

ℓ̂ ≤ ℓ*_A · ( 1 + (8·log(ℓmax/β)) / log(1/2β) ),   where   ℓ*_A := argmax_{ℓ∈[ℓmax]} ( D̂C(D; ℓ) − A·ℓ/ε )   with A = log(1/2β).

Note that ℓ*_A ≤ ℓ*, since D̂C(D; ℓ) ≤ D̂C(D; ℓ*) for all ℓ ∈ [ℓmax]. (That is simply to say that the size of the maximal matching cannot be increased by making more copies of a vertex once there is one copy for each neighbor.) Combining bounds yields

P_{(ℓ̂,ν̂)←M(D)}[ ν̂ ≥ (1/2)·DC(D) − ((2ℓ* + 2ℓ*_A)/ε)·log(1/2β) − ((8ℓ* + 16ℓ*_A)/ε)·log(ℓmax/β) ] ≥ 1 − 2β.

Since log(1/2β) ≤ log(ℓmax/β), this implies Equation (10).

Computational efficiency: It only remains to verify that Algorithm 2 can be implemented in O(|D| + ℓmax·log ℓmax) time.
We can implement S using a hash table to ensure that we can add an element or query membership of an element in constant time. (We can easily maintain a counter for the size of S.) We assume D is presented as a linked list of linked lists representing each u_i, and furthermore that the linked lists u_i are sorted in lexicographic order. The outer loop proceeds through the linked list for D = (u_1, …, u_n). For each u_i, we simply pop elements from the linked list and check if they are in S until either we find v ∈ u_i \ S (and add v to S) or u_i becomes empty (in which case we remove it from the linked list for D). Since each iteration decrements |D|, the runtime of the main loop is O(|D|). Running the generalized exponential mechanism (Algorithm 3) takes O(ℓmax·log ℓmax) time.

B Implementing the Generalized Exponential Mechanism

We conclude with some remarks about implementing the generalized exponential mechanism of Raskhodnikova and Smith [RS15] given in Algorithm 3. There are two parts to this: first we must compute the normalized scores, and then we must run the standard exponential mechanism. Implementing the standard exponential mechanism is well-studied [Ilv20] and can be performed in linear time (with some caveats about randomness and precision). We remark that, instead of the exponential mechanism, we can use report-noisy-max or permute-and-flip [MS20; DKSWXZ+21; DDKPWWXZ23]. These variants may be easier to implement and may provide better utility too. The normalized scores are given by

∀i ∈ [m]   s_i = min_{j∈[m]} ((q_i(D) − t·Δ_i) − (q_j(D) − t·Δ_j)) / (Δ_i + Δ_j),   (29)

where q_i(D) is the value of the i-th query on the dataset, Δ_i > 0 is the sensitivity of q_i, and t is a constant. Naïvely it would take Θ(m²) time to compute all the normalized scores. However, a more complex algorithm can compute the scores in O(m log m) time.
Observe that, for each i ∈ [m], we have

s_i = min_{j∈[m]} ((q_i(D) − t·Δ_i) − (q_j(D) − t·Δ_j)) / (Δ_i + Δ_j)
  ⟺ min_{j∈[m]} ( (q_i(D) − t·Δ_i) − (q_j(D) − t·Δ_j) − s_i·(Δ_i + Δ_j) ) = 0
  ⟺ q_i(D) − (s_i + t)·Δ_i = max_{j∈[m]} ( q_j(D) + (s_i − t)·Δ_j ) =: f(s_i − t).

That is, we can compute s_i by solving the equation q_i(D) − (s_i + t)·Δ_i = f(s_i − t). Since f(x) := max_{j∈[m]} ( q_j(D) + x·Δ_j ) is the maximum of increasing linear functions, f is a convex, increasing, piecewise-linear function with at most m pieces. We can represent f by a sorted list of the points where the linear pieces connect, along with the linear function on each of the pieces. We can compute this representation of f as a pre-processing step in O(m log m) time; we sort the lines y(x) = q_j(D) + x·Δ_j by their slope Δ_j and then compute the intersection points between consecutive lines (we delete lines that never realize the max). Given the above representation of f and the values q_i(D), Δ_i, t, we can compute s_i in O(log m) time. We must solve q_i(D) − (s_i + t)·Δ_i = f(s_i − t) for s_i. We can perform binary search on the pieces of f to identify j such that f(s_i − t) = q_j(D) + (s_i − t)·Δ_j. Once we have this, we can directly compute s_i = ((q_i(D) − t·Δ_i) − (q_j(D) − t·Δ_j)) / (Δ_i + Δ_j). The binary search takes O(log m) time and we must compute m scores. Thus the overall runtime (including the pre-processing) is O(m log m).
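The O(m log m) procedure above can be written out as follows. This is our own illustrative Python sketch (the names `upper_envelope`, `normalized_scores`, `q`, `delta`, `t` are ours); it uses floating-point arithmetic, whereas a careful implementation of the exponential mechanism may require exact arithmetic [Ilv20].

```python
def upper_envelope(lines):
    """Upper envelope of lines y = m*x + c, given as (m, c) pairs with m > 0.

    Returns (hull, breaks): hull[k] realizes the maximum between
    breaks[k-1] and breaks[k] (breaks has len(hull) - 1 entries).
    """
    lines = sorted(lines)
    dedup = []                      # for equal slopes keep the top intercept
    for m, c in lines:
        if dedup and dedup[-1][0] == m:
            dedup[-1] = (m, max(dedup[-1][1], c))
        else:
            dedup.append((m, c))
    hull = []
    for m, c in dedup:
        # pop lines that never realize the max once (m, c) is added
        while len(hull) >= 2:
            m0, c0 = hull[-2]
            m1, c1 = hull[-1]
            if (c - c0) * (m1 - m0) >= (c1 - c0) * (m - m0):
                hull.pop()
            else:
                break
        hull.append((m, c))
    breaks = [(hull[k][1] - hull[k + 1][1]) / (hull[k + 1][0] - hull[k][0])
              for k in range(len(hull) - 1)]
    return hull, breaks

def normalized_scores(q, delta, t):
    """All scores s_i in O(m log m): solve q_i - (s_i + t)*delta_i = f(s_i - t)."""
    hull, breaks = upper_envelope(list(zip(delta, q)))  # f(x) = max_j q_j + x*delta_j
    scores = []
    for qi, di in zip(q, delta):
        # binary search for the piece of f crossed by the decreasing
        # line x -> qi - (x + 2t)*di (substituting x = s_i - t)
        lo, hi = 0, len(hull) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            x = breaks[mid]
            if hull[mid + 1][1] + x * hull[mid + 1][0] <= qi - (x + 2 * t) * di:
                lo = mid + 1        # crossing lies to the right of this break
            else:
                hi = mid
        dj, qj = hull[lo]
        scores.append(((qi - t * di) - (qj - t * dj)) / (di + dj))
    return scores
```

The output agrees with the naive Θ(m²) evaluation of Equation (29): a line deleted from the envelope never attains the max in f, so it cannot determine the crossing point, and on the crossing piece the linear equation yields exactly the closed form for s_i.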