Counting Distinct Elements Under Person-Level Differential Privacy

Alexander Knop (Google) alexanderknop@google.com
Thomas Steinke (Google DeepMind) steinke@google.com

We study the problem of counting the number of distinct elements in a dataset subject to the constraint of differential privacy. We consider the challenging setting of person-level DP (a.k.a. user-level DP), where each person may contribute an unbounded number of items and hence the sensitivity is unbounded. Our approach is to compute a bounded-sensitivity version of this query, which reduces to solving a max-flow problem. The sensitivity bound is optimized to balance the noise we must add to privatize the answer against the error of the approximation of the bounded-sensitivity query to the true number of unique elements.

1 Introduction

An elementary data analysis task is to count the number of distinct elements occurring in a dataset. The dataset may contain private data, and even simple statistics can be combined to leak sensitive information about people [DN03]. Our goal is to release (an approximation to) this count in a way that ensures the privacy of the people who contributed their data. As a motivating example, consider a collection of internet browsing histories, in which case the goal is to compute the total number of websites that have been visited by at least one person.

Differential privacy (DP) [DMNS06] is a formal privacy standard. The simplest method for ensuring DP is to add noise (from either a Laplace or Gaussian distribution) to the true answer, where the scale of the noise corresponds to the sensitivity of the true answer, i.e., how much one person's data can change the true value. If each person contributes a single element to the dataset, then the sensitivity of the number of unique elements is one.
However, a person may contribute multiple elements to the dataset, and our goal is to ensure privacy for all of these contributions simultaneously. That is, we seek to provide person-level DP (a.k.a. user-level DP²). This is the problem we study: We have a dataset D = (u_1, u_2, ..., u_n) of person records. Each person i ∈ [n] contributes a finite dataset u_i ∈ Ω*, where Ω is some (possibly infinite) universe of potential elements (e.g., all finite-length binary strings) and Ω* := ∪_{ℓ∈ℕ} Ω^ℓ denotes all subsets of Ω of finite size. Informally, our goal is to compute the number of unique elements

DC(D) := |∪_{i∈[n]} u_i|    (1)

in a way that preserves differential privacy. A priori, the sensitivity of this quantity is infinite, as a single person can contribute an unbounded number of unique elements. In particular, it is not possible to output a meaningful upper bound on the number of distinct elements subject to differential privacy. This is because a single person could increase the number of distinct elements arbitrarily, and differential privacy requires us to hide this contribution. It follows that we cannot output a differentially private unbiased estimate of the number of distinct elements with finite variance. However, it is possible to output a lower bound. Thus our formal goal is to compute a high-confidence lower bound on the number of distinct elements that is as large as possible and which is computed in a differentially private manner.

1.1 Our Contributions

Given a dataset D = (u_1, ..., u_n) ∈ (Ω*)^n and an integer ℓ ≥ 1, we define

DC(D; ℓ) := max { |∪_{i∈[n]} v_i| : ∀i ∈ [n], v_i ⊆ u_i, |v_i| ≤ ℓ }.    (2)

That is, DC(D; ℓ) is the number of distinct elements if we restrict each person's contribution to ℓ elements; we take the maximum over all possible such restrictions. It is immediate that DC(D; ℓ) ≤ DC(D) for all ℓ ≥ 1.

* Alphabetical author order.
² We prefer the term person over user, as the latter only makes sense in some contexts and could be confusing in others.
37th Conference on Neural Information Processing Systems (NeurIPS 2023).
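To make definition (2) concrete, the following brute-force illustration (ours, not one of the paper's algorithms; the function name and the tiny example dataset are invented for exposition) enumerates all restrictions v_i ⊆ u_i with |v_i| ≤ ℓ and takes the best union:

```python
from itertools import combinations, product

def distinct_count_bruteforce(people, ell):
    """DC(D; ell): max distinct elements when each person keeps <= ell items.

    Exponential-time enumeration, only usable on tiny inputs; it serves to
    illustrate the definition before the efficient matching-based algorithm.
    Keeping exactly min(ell, |u_i|) items is without loss of generality,
    since adding items never shrinks the union.
    """
    choices = []
    for u in people:
        items = sorted(set(u))
        k = min(ell, len(items))
        choices.append([set(c) for c in combinations(items, k)])
    best = 0
    for picks in product(*choices):
        best = max(best, len(set().union(*picks)))
    return best

D = [{"a", "b", "c"}, {"a"}, {"b", "d"}]
print(distinct_count_bruteforce(D, 1))  # 3 (e.g. pick c, a, d)
print(distinct_count_bruteforce(D, 2))  # 4
```

With ℓ = 1 the three people can cover at most three elements, and a choice such as c, a, d attains that; with ℓ = 2 all four elements are covered.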
Thus we obtain a lower bound on the true number of unique elements. The advantage of DC(D; ℓ) is that its sensitivity is bounded by ℓ (see Lemma A.1 for a precise statement) and, hence, we can estimate it in a differentially private manner. Specifically,

M_{ℓ,ε}(D) := DC(D; ℓ) + Lap(ℓ/ε)    (3)

defines an ε-DP algorithm M_{ℓ,ε} : (Ω*)^n → ℝ, where Lap(b) denotes Laplace noise scaled to have mean 0 and variance 2b². This forms the basis of our algorithm. Two challenges remain: setting the sensitivity parameter ℓ and computing DC(D; ℓ) efficiently.

To obtain a high-confidence lower bound on the true distinct count, we must compensate for the Laplace noise, which may inflate the reported value. We can obtain such a lower bound from M_{ℓ,ε}(D) using the cumulative distribution function (CDF) of the Laplace distribution: for all b > 0 and β ∈ (0, 1/2],

P[Lap(b) ≥ b·log(1/(2β))] = β,    (4)

so M_{ℓ,ε}(D) − (ℓ/ε)·log(1/(2β)) is a lower bound on DC(D; ℓ) ≤ DC(D) with confidence 1 − β.

Choosing the sensitivity parameter ℓ. Any choice of ℓ ≥ 1 gives us a lower bound: DC(D; ℓ) ≤ DC(D). Since lim_{ℓ→∞} DC(D; ℓ) = DC(D) for every D, this lower bound can be arbitrarily tight. However, the larger ℓ is, the larger the sensitivity of DC(D; ℓ) is; that is, the noise we add scales linearly with ℓ. Thus there is a bias-variance tradeoff in the choice of ℓ. To make this precise, suppose we want a lower bound on DC(D) with confidence 1 − β ∈ [1/2, 1), as in Equation (4). To obtain the tightest possible lower bound with confidence 1 − β, we want ℓ to maximize the expectation

q(D; ℓ) := DC(D; ℓ) − (ℓ/ε)·log(1/(2β)).    (5)

We can use the exponential mechanism [MT07] to privately select an ℓ that approximately maximizes q(D; ℓ). However, directly applying the exponential mechanism is problematic because each score has a different sensitivity: the sensitivity of q(·; ℓ) is ℓ. Instead, we apply the Generalized Exponential Mechanism (GEM) of Raskhodnikova and Smith [RS15] (see Algorithm 3). Note that we assume some a priori maximum value of ℓ is supplied to the algorithm; this is ℓ_max.
Our main algorithm attains the following guarantees.

Theorem 1.1 (Theoretical Guarantees of Our Algorithm). Let ε > 0, β ∈ (0, 1/2), and ℓ_max ∈ ℕ. Define M : (Ω*)* → ℝ to be M(D) = DPDISTINCTCOUNT(D; ℓ_max, ε, β) from Algorithm 1. Then M satisfies all of the following properties.

Privacy: M is ε-differentially private.

Lower bound: For all D ∈ (Ω*)^n,

P_{(ℓ̂,ν̂)←M(D)} [ν̂ ≤ DC(D)] ≥ 1 − β.    (6)

Upper bound: For all D ∈ (Ω*)^n,

P_{(ℓ̂,ν̂)←M(D)} [ν̂ ≥ max_{ℓ∈[ℓ_max]} ( DC(D; ℓ) − ((10ℓ + 18ℓ_A)/ε)·log(ℓ_max/β) )] ≥ 1 − β,    (7)

where ℓ_A = argmax_{ℓ∈[ℓ_max]} DC(D; ℓ) − (ℓ/ε)·log(1/(2β)).

Computational efficiency: M(D) has running time O(|D|^{1.5} · ℓ_max²), where |D| := Σ_{i∈[n]} |u_i|.

The upper bound guarantee (7) is somewhat difficult to interpret. However, if the number of items per person is bounded by ℓ*, then we can offer a clean guarantee: If D = (u_1, ..., u_n) ∈ (Ω*)^n satisfies max_{i∈[n]} |u_i| ≤ ℓ* ≤ ℓ_max, then combining the upper and lower bounds of Theorem 1.1 gives

P_{(ℓ̂,ν̂)←M(D)} [DC(D) ≥ ν̂ ≥ DC(D) − (28ℓ*/ε)·log(ℓ_max/β)] ≥ 1 − 2β.    (8)

Note that ℓ* is not assumed to be known to the algorithm, but the accuracy guarantee is able to adapt. We only assume ℓ* ≤ ℓ_max, where ℓ_max is the maximal sensitivity considered by the algorithm. In addition to proving the above theoretical guarantees, we perform an experimental evaluation of our algorithm.

Algorithm 1 Distinct Count Algorithm
1: procedure SENSITIVEDISTINCTCOUNT(D = (u_1, ..., u_n) ∈ (Ω*)^n; ℓ ∈ ℕ)    ▷ computes DC(D; ℓ)
2:   Let U_ℓ = ∪_{i∈[n]} {i} × [min{ℓ, |u_i|}] ⊆ [n] × [ℓ].
3:   Let V = ∪_{i∈[n]} u_i ⊆ Ω.
4:   Define E_ℓ ⊆ U_ℓ × V by ((i, j), v) ∈ E_ℓ ⟺ v ∈ u_i.
5:   Let G_ℓ be a bipartite graph with vertices partitioned into U_ℓ and V and edges E_ℓ.
6:   m_ℓ ← MAXIMUMMATCHINGSIZE(G_ℓ).    ▷ [HK73; Kar73]
7:   return m_ℓ ∈ ℕ
8: end procedure
9: procedure DPDISTINCTCOUNT(D = (u_1, ..., u_n) ∈ (Ω*)^n; ℓ_max ∈ ℕ, ε > 0, β ∈ (0, 1/2))
10:   for ℓ ∈ [ℓ_max] do
11:     Define q_ℓ(D) := SENSITIVEDISTINCTCOUNT(D; ℓ) − (2ℓ/ε)·log(1/(2β)).
12:   end for
13:   ℓ̂ ← GEM(D; {q_ℓ}_{ℓ∈[ℓ_max]}, {ℓ}_{ℓ∈[ℓ_max]}, ε/2, β).    ▷ Algorithm 3
14:   ν̂ ← q_ℓ̂(D) + Lap(2ℓ̂/ε).
15:   return (ℓ̂, ν̂) ∈ [ℓ_max] × ℝ.
16: end procedure

Efficient computation.
The main computational task for our algorithm is to compute DC(D; ℓ). By definition (2), this is an optimization problem: for each person i ∈ [n], we must select a subset v_i ⊆ u_i of that person's data of size at most ℓ so as to maximize the size of the union ∪_{i∈[n]} v_i.

We can view the dataset D = (u_1, ..., u_n) ∈ (Ω*)^n as a bipartite graph. On one side we have the n people, and on the other side we have the elements of the data universe Ω.³ There is an edge between i ∈ [n] and x ∈ Ω if and only if x ∈ u_i. We can reduce computing DC(D; ℓ) to a max-flow problem: each edge in the bipartite graph has capacity one; we add a source vertex s connected to each person i ∈ [n] by an edge with capacity ℓ; finally, we add a sink t connected to each x ∈ Ω by an edge with capacity 1. The max flow through this graph is precisely DC(D; ℓ).

Alternatively, we can reduce computing DC(D; ℓ) to bipartite maximum matching. For ℓ = 1, DC(D; 1) is exactly the maximum cardinality of a matching in the bipartite graph described above. For ℓ ≥ 2, we simply create ℓ copies of each person vertex i ∈ [n], and then DC(D; ℓ) is the maximum cardinality of a matching in this new bipartite graph.⁴ Using this reduction, standard algorithms for bipartite maximum matching [HK73; Kar73] allow us to compute DC(D; ℓ) with O(|D|^{1.5} · ℓ) operations. We must repeat this computation for each ℓ ∈ [ℓ_max].

Algorithm 2 Linear-Time Approximate Distinct Count Algorithm
1: procedure DPAPPROXDISTINCTCOUNT(D = (u_1, ..., u_n) ∈ (Ω*)^n; ℓ_max ∈ ℕ, ε > 0, β ∈ (0, 1/2))
2:   S ← ∅.
3:   for ℓ ∈ [ℓ_max] do
4:     for i ∈ [n] with u_i \ S ≠ ∅ do
5:       Choose the lexicographically first v ∈ u_i \ S. Match (i, ℓ) to v.
6:       Update S ← S ∪ {v}.
7:     end for
8:     Define q_ℓ(D) := |S| − (2ℓ/ε)·log(1/(2β)).    ▷ This loop computes {q_ℓ(D)}_{ℓ∈[ℓ_max]}.
9:   end for
10:   ℓ̂ ← GEM(D; {q_ℓ}_{ℓ∈[ℓ_max]}, {ℓ}_{ℓ∈[ℓ_max]}, ε/2, β).    ▷ Algorithm 3
11:   ν̂ ← q_ℓ̂(D) + Lap(2ℓ̂/ε).
12:   return (ℓ̂, ν̂) ∈ [ℓ_max] × ℝ.
13: end procedure

Linear-time algorithm. Our algorithm above is polynomial-time.
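Returning to the matching-based computation of DC(D; ℓ): the reduction (min{ℓ, |u_i|} "slot" vertices per person, then a maximum bipartite matching) can be sketched as follows. This is our illustration, and for brevity it uses simple augmenting paths (Kuhn's algorithm) rather than the Hopcroft-Karp algorithm the paper cites, so its worst-case running time is weaker than the O(|D|^{1.5}·ℓ) bound above:

```python
def distinct_count_matching(people, ell):
    """DC(D; ell) via bipartite maximum matching with augmenting paths.

    Each person contributes min(ell, |u_i|) slot vertices, each adjacent to
    all of that person's items; DC(D; ell) equals the maximum matching size.
    """
    slots = []  # adjacency: slot index -> list of items it may take
    for u in people:
        items = sorted(set(u))
        for _ in range(min(ell, len(items))):
            slots.append(items)
    match = {}  # item -> slot currently matched to it

    def try_augment(s, visited):
        for v in slots[s]:
            if v in visited:
                continue
            visited.add(v)
            # Take v if it is free, or if its current slot can re-match elsewhere.
            if v not in match or try_augment(match[v], visited):
                match[v] = s
                return True
        return False

    size = 0
    for s in range(len(slots)):
        if try_augment(s, set()):
            size += 1
    return size
```

On the toy dataset D = [{a,b,c}, {a}, {b,d}] this returns 3 for ℓ = 1 and 4 for ℓ = 2, matching the brute-force definition.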
However, for many applications the dataset size |D| is enormous. Thus we also propose a linear-time variant of our algorithm, though we must trade accuracy for efficiency. Two key ideas differentiate our linear-time algorithm (Algorithm 2) from our first algorithm (Algorithm 1): First, we compute a maximal bipartite matching instead of a maximum bipartite matching.⁵ This can be done using a linear-time greedy algorithm and gives a 2-approximation to the maximum matching. (Experimentally, we find that the approximation is better than a factor of 2.) Second, rather than repeating the computation from scratch for each ℓ ∈ [ℓ_max], we incrementally update our maximal matching while increasing ℓ. The main challenge here is ensuring that the approximation to DC(D; ℓ) has low sensitivity, i.e., we must ensure that our approximation algorithm doesn't inflate the sensitivity. Note that DC(D; ℓ) having low sensitivity does not automatically ensure that the approximation to it has low sensitivity.

Theorem 1.2 (Theoretical Guarantees of Our Linear-Time Algorithm). Let ε > 0, β ∈ (0, 1/2), and ℓ_max ∈ ℕ. Define M̂ : (Ω*)* → ℝ to be M̂(D) =

³The data universe Ω may be infinite, but we can restrict the computation to the finite set ∪_{i∈[n]} u_i. Thus there are at most n + DC(D) ≤ n + |D| item vertices in the graph.
⁴We need only create min{ℓ, |u_i|} copies of the person i ∈ [n]. Thus the number of person vertices is at most min{nℓ, |D|}.
⁵To clarify the confusing terminology: A matching is a subset of edges such that no two edges have a vertex in common. A maximum matching is a matching of the largest possible size. A maximal matching is a matching such that no edge could be added without violating the matching property. A maximum matching is also a maximal matching, but the reverse is not true.
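The greedy maximal-matching idea behind Algorithm 2 can be sketched as follows (our simplified illustration: the paper's version maintains per-person state so the total work stays linear in |D|, whereas this sketch rescans each person's sorted item list in every round):

```python
def greedy_distinct_counts(people, ell_max):
    """Greedy maximal-matching approximations of DC(D; ell) for all ell.

    Round ell lets each person cover one not-yet-covered element (the
    lexicographically first, as in Algorithm 2). Returns a list whose
    (ell-1)-th entry approximates DC(D; ell); the maximal matching is a
    2-approximation of the maximum matching in the worst case.
    """
    sorted_people = [sorted(set(u)) for u in people]
    covered = set()
    counts = []
    for _ in range(ell_max):
        for items in sorted_people:
            for v in items:  # first uncovered item, if any
                if v not in covered:
                    covered.add(v)
                    break
        counts.append(len(covered))
    return counts
```

For D = [{a,b,c}, {a}, {b,d}] and ℓ_max = 3 this yields [2, 4, 4]: in round one the first person covers a and the third covers b; round two adds c and d; round three adds nothing new.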
Data Set | True Vocabulary Size | Estimated: 10th Percentile | Estimated: Median | Estimated: 90th Percentile
Amazon Fashion | 1450 | 1220.6 | 1319.1 | 1394.2
Amazon Industrial and Scientific | 36665 | 35970.5 | 36198.9 | 36326.7
Reddit | 102835 | 102379.7 | 102512.6 | 102643.9
IMDB | 98726 | 98555.6 | 98670.4 | 98726.8

Table 1: True and estimated (using DPDistinctCount with ε = 1, β = 0.05, and ℓ_max = 100) vocabulary sizes per data set.

DPAPPROXDISTINCTCOUNT(D; ℓ_max, ε, β) from Algorithm 2. Then M̂ satisfies all of the following properties.

Privacy: M̂ is ε-differentially private.

Lower bound: For all D ∈ (Ω*)^n,

P_{(ℓ̂,ν̂)←M̂(D)} [ν̂ ≤ DC(D)] ≥ 1 − β.    (9)

Upper bound: If D = (u_1, ..., u_n) ∈ (Ω*)^n satisfies max_{i∈[n]} |u_i| ≤ ℓ* ≤ ℓ_max, then

P_{(ℓ̂,ν̂)←M̂(D)} [ν̂ ≥ (1/2)·DC(D) − O((ℓ*/ε)·log(ℓ_max/β))] ≥ 1 − β.    (10)

Computational efficiency: M̂(D) has running time O(|D| + ℓ_max·log ℓ_max), where |D| := Σ_{i∈[n]} |u_i|.

The factor 1/2 in the upper bound guarantee (10) is the main loss compared to Theorem 1.1. (The win is the O(|D|) runtime.) This is a worst-case bound, and our experimental results show that for realistic data the performance gap is not so bad. The proofs of Theorems 1.1 and 1.2 are in Appendix A.

2 Related Work

Counting the number of distinct elements in a collection is one of the most fundamental database computations; it is supported as the COUNT(DISTINCT ...) operation in SQL. Hence, unsurprisingly, the problem of computing the number of unique elements in a differentially private way has been extensively investigated. In the case where we assume each person contributes only one element (a.k.a. event-level privacy or item-level privacy), the number of distinct elements has sensitivity 1 and, hence, we can simply release it using Laplace (or Gaussian) noise addition. However, it may not be possible to compute the number of distinct elements exactly due to space, communication, or trust constraints (e.g., in the local model of DP [KLNRS11]).
Most efforts have been focused on creating differentially private algorithms for counting distinct elements under space constraints (and assuming each person contributes a single element). To save space, we wish to compute a small summary of the dataset (called a sketch) that allows us to estimate the number of distinct elements and which can be updated as more elements are added. Smith, Song, and Thakurta [SST20] proved that a variant of the Flajolet-Martin sketch is private, and Pagh and Stausholm [PS20] analyzed a sketch over the binary finite field. Dickens, Thaler, and Ting [DTT22] proved a general privacy result for order-invariant cardinality estimators. Hehir, Ting, and Cormode [HTC23] provided a mergeable private sketch (i.e., two sketches can be combined to obtain a sketch of the union of the two datasets). In contrast, Desfontaines, Lochbihler, and Basin [DLB19] proved an impossibility result for mergeable sketches, which shows that privacy or accuracy must degrade as we merge sketches.

Figure 1: Performance of different algorithms (true value, matching, greedy, and sampling) estimating the distinct count assuming that each person can contribute at most ℓ elements (i.e., estimating DC(D; ℓ)), plotted against the person contribution bound for the Amazon Fashion, Amazon Industrial, Reddit, and IMDB data sets. (These algorithms have bounded sensitivity, but we do not add noise for privacy yet.)

Counting unique elements has also been considered in the pan-private streaming setting [DNPRY10] (the aforementioned algorithms also work in the pan-private setting) and in the continual release streaming setting [GKNM23]. (In the continual release setting the approximate count is continually updated, while in the pan-private setting the approximate count is only revealed once, but at an unknown point in time.)
Kreuter, Wright, Skvortsov, Mirisola, and Wang [KWSMW20] give private algorithms for counting distinct elements in the setting of secure multiparty computation. In the local and shuffle models, the only known results are communication complexity bounds [CGKM21].

A closely related problem is that of identifying as many elements as possible (rather than just counting them); this is known as partition selection, set union, or key selection [SDH23; DVGM22; KKMN09; CWG22; RCR22; GGKSSY20; ZDKSTMAS23]. Note that, by design, DP prevents us from identifying elements that appear only once in the dataset, or only a few times. Thus we can only output items that appear frequently.

The most closely related work to ours is that of Dong, Fang, Yi, Tao, and Machanavajjhala [DFYTM22] and Fang, Dong, and Yi [FDY22]. These papers present two different algorithms for privately approximating the distinct count (and other statistics). We discuss these below and present an experimental comparison in Table 2. We also remark that both papers prove instance optimality guarantees for their algorithms.

Figure 2: Performance of different algorithms (matching, greedy, sampling, and shifted inverse, compared to the true value) estimating the distinct count in a differentially private way for different values of ε; for all of them β = 0.05 and ℓ_max = 100. The values between the 10th and 90th percentile of each algorithm's estimates are shaded in the corresponding colors. For the shifted inverse algorithm, the first two plots contain the results for β = 0.05 and D equal to the true number of distinct elements in the dataset. The latter two data sets lack results for the shifted inverse algorithm due to computational constraints.
Algorithm | PS.AQ | L.EP | O.OD | L.RD
R2T [DFYTM22] | 0.0658 | 0.1759 | 0.0061 | 0.150
(Approx)ShiftedInverse [FDY22] | 0.0553 | 0.0584 | 0.005 | 0.0061
DPApproxDistinctCount | 0.0140 | 0.0110 | 0.0008 | 0.0037
DPDistinctCount | 0.0100 | 0.0096 | 0.0008 | 0.0001

Table 2: Average relative absolute error of the algorithms described in this paper and in [DFYTM22; FDY22] on the TPC-H dataset. Each algorithm was executed 100 times; we removed the 20 largest and 20 smallest values and computed the average error of the remaining 60 values.

Most similar to our algorithm is the Race-to-the-Top (R2T) algorithm [DFYTM22]. R2T is a generic framework; the original paper did not specifically consider counting distinct elements, but the approach can easily be applied to DC(D; ℓ). While we use the generalized exponential mechanism [RS15] to select the sensitivity ℓ, R2T computes multiple lower bounds with different sensitivities ℓ and then outputs the maximum of the noisy values. This approach incurs the cost of composition across the multiple evaluations. To manage this cost, R2T only evaluates ℓ = 2, 4, 8, ..., 2^⌈log ℓ_max⌉. Compared to our guarantee (8), with an error of O((ℓ*/ε)·log(ℓ_max/β)), R2T has a slightly worse theoretical error guarantee of O((ℓ*/ε)·log(ℓ_max)·log(log(ℓ_max)/β)) [DFYTM22, Theorem 5.1].

The shifted inverse mechanism [FDY22] takes a different approach to the problem. Rather than relying on adding Laplace noise (as we do), it applies the exponential mechanism with an ingenious loss function (see [Ste23] for additional discussion). When applied to counting distinct elements, the shifted inverse mechanism gives an accuracy guarantee comparable to ours (8). The downside of the shifted inverse mechanism is that computing the loss function is, in general, NP-hard. Fang, Dong, and Yi [FDY22] propose polynomial-time variants for several specific tasks, including counting distinct elements. However, the algorithm is still relatively slow.
3 Technical Background on Differential Privacy

For detailed background on differential privacy, see the survey by Vadhan [Vad17] or the book by Dwork and Roth [DR14]. We briefly define pure DP and some basic mechanisms and results.

Algorithm 3 Generalized Exponential Mechanism [RS15]
1: procedure GEM(D ∈ X*; q_i : X* → ℝ for i ∈ [m], Δ_i > 0 for i ∈ [m], ε > 0, β > 0)
2:   Require: q_i has sensitivity sup_{x,x′∈X* neighboring} |q_i(x) − q_i(x′)| ≤ Δ_i for all i ∈ [m].
3:   Let t = (2/ε)·log(m/β).
4:   for i ∈ [m] do
5:     s_i ← min_{j∈[m]} ((q_i(D) − t·Δ_i) − (q_j(D) − t·Δ_j)) / (Δ_i + Δ_j).
6:   end for
7:   Sample î ∈ [m] from the exponential mechanism using the normalized scores s_i; i.e., ∀i ∈ [m], P[î = i] = exp((ε/2)·s_i) / Σ_{k∈[m]} exp((ε/2)·s_k).
8:   return î ∈ [m].
9: end procedure

Definition 3.1 (Differential Privacy (DP) [DMNS06]). A randomized algorithm M : X* → Y satisfies ε-DP if, for all inputs D, D′ ∈ X* differing only by the addition or removal of an element and for all measurable S ⊆ Y, we have P[M(D) ∈ S] ≤ e^ε · P[M(D′) ∈ S].

We refer to pairs of inputs that differ only by the addition or removal of one person's data as neighboring. Note that it is common to also consider replacement of one person's data; for simplicity, we do not do this. We remark that there are also variants of DP such as approximate DP [DKMMN06] and concentrated DP [DR16; BS16], which quantitatively relax the definition, but these are not relevant in our application.

Data Set | People | Records | Words per Person: Min | Median | Max | Vocabulary Size
Amazon Fashion | 404 | 8533 | 1 | 14.0 | 139 | 1450
Amazon Industrial and Scientific | 11041 | 1446031 | 0 | 86 | 2059 | 36665
Reddit | 223388 | 7117494 | 0 | 18.0 | 1724 | 102835
IMDB | 50000 | 6688844 | 5 | 110.0 | 925 | 98726

Table 3: Data set details.

A key property of DP is that it composes and is invariant under postprocessing.

Lemma 3.2 (Composition & Postprocessing). Let M_1 : X* → Y be ε_1-DP. Let M_2 : X* × Y → Z be such that, for all y ∈ Y, the restriction M_2(·, y) : X* → Z is ε_2-DP. Define M_12 : X* → Z by M_12(D) = M_2(D, M_1(D)). Then M_12 is (ε_1 + ε_2)-DP.

A basic DP tool is the Laplace mechanism [DMNS06].
Note that we could also use the discrete Laplace mechanism [GRS09; CKS20].

Lemma 3.3 (Laplace Mechanism). Let q : X* → ℝ. We say q has sensitivity Δ if |q(D) − q(D′)| ≤ Δ for all neighboring D, D′ ∈ X*. Define M : X* → ℝ by M(D) = q(D) + Lap(Δ/ε), where Lap(b) denotes Laplace noise with mean 0 and variance 2b², i.e., P_{ξ←Lap(b)}[ξ > t] = P_{ξ←Lap(b)}[ξ < −t] = (1/2)·e^{−t/b} for all t > 0. Then M is ε-DP.

Another fundamental tool for DP is the exponential mechanism [MT07]. It selects the approximately best option from among a set of options, where each option i has a quality function q_i with sensitivity Δ. The following result generalizes the exponential mechanism by allowing each of the quality functions to have a different sensitivity.

Theorem 3.4 (Generalized Exponential Mechanism [RS15, Theorem 1.4]). For each i ∈ [m], let q_i : X* → ℝ be a query with sensitivity Δ_i. Let ε, β > 0. The generalized exponential mechanism (GEM(·; {q_i}_{i∈[m]}, {Δ_i}_{i∈[m]}, ε, β) in Algorithm 3) is ε-DP and has the following utility guarantee: for all D ∈ X*,

P_{î←GEM(D;{q_i},{Δ_i},ε,β)} [ q_î(D) ≥ max_{j∈[m]} ( q_j(D) − (4Δ_j/ε)·log(m/β) ) ] ≥ 1 − β.

4 Experimental Results

We empirically validate the performance of our algorithms using data sets of various sizes from different text domains. We focus on the problem of computing vocabulary size with person-level DP. Section 4.1 describes the data sets and Section 4.2 discusses the algorithms we compare.

4.1 Datasets

We used four publicly available datasets to assess the accuracy of our algorithms compared to baselines. Two small datasets were used: Amazon Fashion 5-core [NLM19] (reviews of fashion products on Amazon) and Amazon Industrial and Scientific 5-core [NLM19] (reviews of industrial and scientific products on Amazon). Two large data sets were also used: Reddit [She20] (a data set of posts collected from r/AskReddit) and IMDb [N20; MDPHNP11] (a set of movie reviews scraped from IMDb). See details of the datasets in Table 3.
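Stepping back to the generalized exponential mechanism (Algorithm 3, Theorem 3.4): its structure, normalizing each score by its sensitivity against a threshold and then running the standard exponential mechanism on the resulting sensitivity-1 scores, can be sketched as follows. This is our hedged illustration: the threshold t is left as an explicit parameter, and the exact setting of t should be taken from [RS15] rather than from this sketch.

```python
import math
import random

def gem(scores, sens, eps, t, rng):
    """Sketch of the generalized exponential mechanism (after [RS15]).

    scores[i] = q_i(D); sens[i] = sensitivity of q_i. The normalized score
    s_i = min_j ((q_i - t*sens_i) - (q_j - t*sens_j)) / (sens_i + sens_j)
    has sensitivity 1, so the exponential mechanism with weights
    exp(eps * s_i / 2) can be applied to it.
    """
    m = len(scores)
    s = [
        min(
            ((scores[i] - t * sens[i]) - (scores[j] - t * sens[j]))
            / (sens[i] + sens[j])
            for j in range(m)
        )
        for i in range(m)
    ]
    # Sample index i with probability proportional to exp(eps * s_i / 2),
    # subtracting max(s) first for numerical stability.
    mx = max(s)
    weights = [math.exp(eps * (si - mx) / 2.0) for si in s]
    r = rng.random() * sum(weights)
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return i
    return m - 1
```

Note how a candidate with large sensitivity pays a penalty of t·Δ_i before the comparison, which is exactly what lets GEM compare the q_ℓ scores of Algorithm 1, whose sensitivities grow with ℓ.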
4.2 Comparisons

Computing the number of distinct elements using a differentially private mechanism involves two steps: selecting a contribution bound (ℓ in our algorithms) and counting the number of distinct elements in a way that restricts each person to contribute only the given number of elements.

Selection: We examine four algorithms for determining the contribution limit:
1. Choosing the true maximum person contribution (due to computational restrictions, this was only computed for the Amazon Fashion data set).
2. Choosing the 90th percentile of person contributions.
3. Choosing the person contribution that exactly maximizes the utility function q_ℓ(D) = DC(D; ℓ) − (ℓ/ε)·log(1/(2β)), where ε = 1 and β = 0.001.
4. Choosing the person contribution that approximately maximizes the utility function using the generalized exponential mechanism with ε = 1.
Note that only the last option is differentially private, but we consider the other comparison points nonetheless.

Counting: We also consider three algorithms for estimating the number of distinct elements for a given sensitivity bound ℓ:
1. For each person, we uniformly sample ℓ elements without replacement and count the number of distinct elements in the union of the samples.
2. The linear-time greedy algorithm (Algorithm 2) with ε = 1 and β = 0.001.
3. The matching-based algorithm (Algorithm 1) with ε = 1 and β = 0.001.
All of these can be converted into DP algorithms by adding Laplace noise to the result. On all our datasets, the true maximum person contribution and the 90th percentile of person contributions produce bounds that are much larger than necessary to obtain the true distinct count; hence, we only consider DP versions of the estimation algorithms for these selection algorithms.

4.3 Results

Figure 1 shows the dependency of the result on the contribution bound for each of the algorithms for computing the number of distinct elements with a fixed person contribution.
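For reference, the sampling baseline (Counting, item 1) can be sketched as follows (our illustration; items are sorted only so that the per-person sets are in a deterministic order before sampling):

```python
import random

def sampled_distinct_count(people, ell, rng):
    """Sampling baseline: each person keeps ell items chosen uniformly
    without replacement; return the number of distinct elements in the
    union. Unlike the matching and greedy approaches, this can waste a
    person's budget on items that other people already cover."""
    union = set()
    for u in people:
        items = sorted(set(u))
        k = min(ell, len(items))
        union.update(rng.sample(items, k))
    return len(union)
```

This has sensitivity at most ℓ, like DC(D; ℓ), but it is only a randomized lower-bound estimate of DC(D; ℓ): on D = [{a,b,c}, {a}, {b,d}] with ℓ = 1 it may return 2 where the matching algorithm is guaranteed to find 3.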
It is clear that the matching and greedy algorithms vastly outperform the sampling approach that is currently used in practice. Tables 4 to 7 show the performance of the algorithms for selecting optimal person contribution bounds on the different data sets. For all bound selection algorithms and all data sets, the sampling approach to estimating the distinct count performs much worse than the greedy and matching-based approaches. The greedy approach performs worse than the matching-based approach, but the difference is about 10% for Amazon Fashion and is almost negligible for the other data sets, since they are much larger. As for the matching-based algorithm, it performs as follows on all the data sets:
1. The algorithm that uses the bound equal to the maximal person contribution overestimates the actually necessary bound; therefore, we only consider the DP algorithms for count estimation. It is easy to see that while the median of the estimate is close to the actual distinct count, the amount of noise is somewhat large.
2. The algorithm that uses the bound equal to the 99th percentile of person contributions also overestimates the necessary bound and behaves similarly to the one we just described (though the spread of the noise is a bit smaller).
3. Two algorithms optimize the utility function: one non-private and one private. The non-private algorithm with non-private estimation gives an answer that is very close to the true number of distinct elements. The private algorithm with non-private estimation gives an answer that is worse, but not by much. Finally, the private algorithm with private estimation gives answers very similar to the results of the non-private estimation.

Acknowledgments and Disclosure of Funding

We would like to thank Badih Ghazi, Andreas Terzis, and four anonymous reviewers for their constructive feedback and valuable suggestions. We thank Markus Hasenöhrl for helpful discussions, which helped us identify the problem.
In addition, we are grateful to Ke Yi, Wei Dong, and Juanru Fang for bringing their related work [DFYTM22; FDY22] to our attention.

References

[BS16] M. Bun and T. Steinke. "Concentrated differential privacy: Simplifications, extensions, and lower bounds". In: Theory of Cryptography: 14th International Conference, TCC 2016-B, Beijing, China, October 31-November 3, 2016, Proceedings, Part I. Springer, 2016, pp. 635-658. URL: https://arxiv.org/abs/1605.02065 (cit. on p. 9).

[CGKM21] L. Chen, B. Ghazi, R. Kumar, and P. Manurangsi. "On Distributed Differential Privacy and Counting Distinct Elements". In: 12th Innovations in Theoretical Computer Science Conference, ITCS 2021, January 6-8, 2021, Virtual Conference. Ed. by J. R. Lee. Vol. 185. LIPIcs. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021, 56:1-56:18. URL: https://doi.org/10.4230/LIPIcs.ITCS.2021.56 (cit. on p. 6).

[CKS20] C. L. Canonne, G. Kamath, and T. Steinke. "The discrete Gaussian for differential privacy". In: Advances in Neural Information Processing Systems 33 (2020), pp. 15676-15688. URL: https://arxiv.org/abs/2004.00010 (cit. on p. 9).

[CWG22] R. S. Carvalho, K. Wang, and L. S. Gondara. "Incorporating Item Frequency for Differentially Private Set Union". In: Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelfth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022, Virtual Event, February 22 - March 1, 2022. AAAI Press, 2022, pp. 9504-9511. URL: https://ojs.aaai.org/index.php/AAAI/article/view/21183 (cit. on p. 6).

[DDKPWWXZ23] Z. Ding, J. Durrell, D. Kifer, P. Protivash, G. Wang, Y. Wang, Y. Xiao, and D. Zhang. "A Floating-Point Secure Implementation of the Report Noisy Max with Gap Mechanism". In: arXiv preprint arXiv:2308.08057 (2023). URL: https://arxiv.org/abs/2308.08057 (cit. on p. 21).

[DFYTM22] W. Dong, J. Fang, K. Yi, Y. Tao, and A. Machanavajjhala.
"R2T: Instance Optimal Truncation for Differentially Private Query Evaluation with Foreign Keys". In: Proceedings of the 2022 International Conference on Management of Data (SIGMOD '22). Philadelphia, PA, USA: Association for Computing Machinery, 2022, pp. 759-772. ISBN: 9781450392495. URL: https://cse.hkust.edu.hk/~yike/R2T.pdf (cit. on pp. 6, 8, 11).

[DKMMN06] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. "Our data, ourselves: Privacy via distributed noise generation". In: Advances in Cryptology - EUROCRYPT 2006: 24th Annual International Conference on the Theory and Applications of Cryptographic Techniques, St. Petersburg, Russia, May 28-June 1, 2006. Proceedings 25. Springer, 2006, pp. 486-503 (cit. on p. 9).

[DKSWXZ+21] Z. Ding, D. Kifer, T. Steinke, Y. Wang, Y. Xiao, D. Zhang, et al. "The permute-and-flip mechanism is identical to report-noisy-max with exponential noise". In: arXiv preprint arXiv:2105.07260 (2021). URL: https://arxiv.org/abs/2105.07260 (cit. on p. 21).

[DLB19] D. Desfontaines, A. Lochbihler, and D. A. Basin. "Cardinality Estimators do not Preserve Privacy". In: Proc. Priv. Enhancing Technol. 2019.2 (2019), pp. 26-46. URL: https://doi.org/10.2478/popets-2019-0018 (cit. on p. 5).

[DMNS06] C. Dwork, F. McSherry, K. Nissim, and A. Smith. "Calibrating noise to sensitivity in private data analysis". In: Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3. Springer, 2006, pp. 265-284 (cit. on pp. 1, 8, 9).

[DN03] I. Dinur and K. Nissim. "Revealing information while preserving privacy". In: Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. 2003, pp. 202-210 (cit. on p. 1).

[DNPRY10] C. Dwork, M. Naor, T. Pitassi, G. N. Rothblum, and S. Yekhanin. "Pan-Private Streaming Algorithms". In: Innovations in Computer Science - ICS 2010, Tsinghua University, Beijing, China, January 5-7, 2010. Proceedings. Ed. by A. C. Yao.
Tsinghua University Press, 2010, pp. 66-80. URL: http://conference.iiis.tsinghua.edu.cn/ICS2010/content/papers/6.html (cit. on p. 6).

[DR14] C. Dwork and A. Roth. "The algorithmic foundations of differential privacy". In: Foundations and Trends in Theoretical Computer Science 9.3-4 (2014), pp. 211-407. URL: https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf (cit. on p. 8).

[DR16] C. Dwork and G. N. Rothblum. "Concentrated differential privacy". In: arXiv preprint arXiv:1603.01887 (2016). URL: https://arxiv.org/abs/1603.01887 (cit. on p. 9).

[DTT22] C. Dickens, J. Thaler, and D. Ting. "Order-Invariant Cardinality Estimators Are Differentially Private". In: NeurIPS. 2022. URL: https://arxiv.org/abs/2203.15400 (cit. on p. 5).

[DVGM22] D. Desfontaines, J. Voss, B. Gipson, and C. Mandayam. "Differentially private partition selection". In: Proc. Priv. Enhancing Technol. 2022.1 (2022), pp. 339-352. URL: https://doi.org/10.2478/popets-2022-0017 (cit. on p. 6).

[FDY22] J. Fang, W. Dong, and K. Yi. "Shifted Inverse: A General Mechanism for Monotonic Functions under User Differential Privacy". In: Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security (CCS '22). Los Angeles, CA, USA: Association for Computing Machinery, 2022, pp. 1009-1022. ISBN: 9781450394505. URL: https://cse.hkust.edu.hk/~yike/ShiftedInverse.pdf (cit. on pp. 6, 8, 11).

[GGKSSY20] S. Gopi, P. Gulhane, J. Kulkarni, J. H. Shen, M. Shokouhi, and S. Yekhanin. "Differentially private set union". In: International Conference on Machine Learning. PMLR, 2020, pp. 3627-3636. URL: https://arxiv.org/abs/2002.09745 (cit. on p. 6).

[GKNM23] B. Ghazi, R. Kumar, J. Nelson, and P. Manurangsi. "Private Counting of Distinct and k-Occurring Items in Time Windows". In: 14th Innovations in Theoretical Computer Science Conference, ITCS 2023, January 10-13, 2023, MIT, Cambridge, Massachusetts, USA. Ed. by Y. T. Kalai. Vol. 251. LIPIcs.
Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023, 55:1–55:24. URL: https://doi.org/10.4230/LIPIcs.ITCS.2023.55 (cit. on p. 6).
[God] W. Goddard. A brief undergraduate text for Algorithms. Chapter G4: Approximation Algorithms. URL: https://people.computing.clemson.edu/~goddard/texts/algor/G4.pdf (cit. on p. 19).
[GRS09] A. Ghosh, T. Roughgarden, and M. Sundararajan. "Universally utility-maximizing privacy mechanisms". In: Proceedings of the forty-first annual ACM symposium on Theory of computing. 2009, pp. 351–360. URL: https://arxiv.org/abs/0811.2841 (cit. on p. 9).
[HK73] J. E. Hopcroft and R. M. Karp. "An n^{5/2} Algorithm for Maximum Matchings in Bipartite Graphs". In: SIAM J. Comput. 2.4 (1973), pp. 225–231. URL: https://doi.org/10.1137/0202019 (cit. on pp. 3, 4).
[HTC23] J. Hehir, D. Ting, and G. Cormode. "Sketch-Flip-Merge: Mergeable Sketches for Private Distinct Counting". In: arXiv preprint arXiv:2302.02056 (2023). URL: https://arxiv.org/abs/2302.02056 (cit. on p. 5).
[Ilv20] C. Ilvento. "Implementing the exponential mechanism with base-2 differential privacy". In: Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security. 2020, pp. 717–742. URL: https://arxiv.org/abs/1912.04222 (cit. on p. 21).
[Kar73] A. Karzanov. "An exact estimate of an algorithm for finding a maximum flow, applied to the problem on representatives". In: Issues of Cybernetics. Proc. of the Seminar on Combinatorial Mathematics (1973), pp. 66–70 (cit. on pp. 3, 4).
[KKMN09] A. Korolova, K. Kenthapadi, N. Mishra, and A. Ntoulas. "Releasing search queries and clicks privately". In: Proceedings of the 18th International Conference on World Wide Web, WWW 2009, Madrid, Spain, April 20–24, 2009. Ed. by J. Quemada, G. León, Y. S. Maarek, and W. Nejdl. ACM, 2009, pp. 171–180. URL: https://doi.org/10.1145/1526709.1526733 (cit. on p. 6).
[KLNRS11] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith. "What can we learn privately?"
In: SIAM Journal on Computing 40.3 (2011), pp. 793–826. URL: https://arxiv.org/abs/0803.0924 (cit. on p. 5).
[KWSMW20] B. Kreuter, C. W. Wright, E. S. Skvortsov, R. Mirisola, and Y. Wang. "Privacy-preserving secure cardinality and frequency estimation". In: (2020). URL: https://research.google/pubs/pub49177/ (cit. on p. 6).
[MDPHNP11] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. "Learning Word Vectors for Sentiment Analysis". In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA: Association for Computational Linguistics, June 2011, pp. 142–150. URL: http://www.aclweb.org/anthology/P11-1015 (cit. on p. 9).
[MS20] R. McKenna and D. R. Sheldon. "Permute-and-Flip: A new mechanism for differentially private selection". In: Advances in Neural Information Processing Systems 33 (2020), pp. 193–203. URL: https://arxiv.org/abs/2010.12603 (cit. on p. 21).
[MT07] F. McSherry and K. Talwar. "Mechanism design via differential privacy". In: 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS '07). IEEE. 2007, pp. 94–103 (cit. on pp. 2, 9).
[N20] L. N. IMDB Dataset of 50K Movie Reviews. English. 2020. URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of50k-movie-reviews (cit. on p. 9).
[NLM19] J. Ni, J. Li, and J. J. McAuley. "Justifying Recommendations using Distantly Labeled Reviews and Fine-Grained Aspects". In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019. Ed. by K. Inui, J. Jiang, V. Ng, and X. Wan. Association for Computational Linguistics, 2019, pp. 188–197. URL: https://doi.org/10.18653/v1/D19-1018 (cit. on p. 9).
[PS20] R. Pagh and N. M. Stausholm. "Efficient Differentially Private F_0 Linear Sketching". In: arXiv preprint arXiv:2001.11932 (2020).
URL: https://arxiv.org/abs/2001.11932 (cit. on p. 5).
[RCR22] A. Rivera Cardoso and R. Rogers. "Differentially Private Histograms under Continual Observation: Streaming Selection into the Unknown". In: Proceedings of The 25th International Conference on Artificial Intelligence and Statistics. Ed. by G. Camps-Valls, F. J. R. Ruiz, and I. Valera. Vol. 151. Proceedings of Machine Learning Research. PMLR, 2022, pp. 2397–2419. URL: https://arxiv.org/abs/2103.16787 (cit. on p. 6).
[RS15] S. Raskhodnikova and A. D. Smith. "Efficient Lipschitz Extensions for High-Dimensional Graph Statistics and Node Private Degree Distributions". In: CoRR abs/1504.07912 (2015). arXiv: 1504.07912. URL: http://arxiv.org/abs/1504.07912 (cit. on pp. 2, 8, 9, 20).
[SDH23] M. Swanberg, D. Desfontaines, and S. Haney. "DP-SIPS: A simpler, more scalable mechanism for differentially private partition selection". In: CoRR abs/2301.01998 (2023). arXiv: 2301.01998. URL: https://doi.org/10.48550/arXiv.2301.01998 (cit. on p. 6).
[She20] J. H. Shen. Ask Reddit. English. 2020. URL: https://github.com/heyyjudes/differentially-private-set-union/tree/ea7b39285dace35cc9e9029692802759f3e1c8e8/data (cit. on p. 9).
[SST20] A. D. Smith, S. Song, and A. Thakurta. "The Flajolet-Martin Sketch Itself Preserves Differential Privacy: Private Counting with Minimal Space". In: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual. Ed. by H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin. 2020. URL: https://proceedings.neurips.cc/paper/2020/hash/e3019767b1b23f82883c9850356b71d6-Abstract.html (cit. on p. 5).
[Ste23] T. Steinke. Beyond Local Sensitivity via Down Sensitivity. DifferentialPrivacy.org. https://differentialprivacy.org/downsensitivity/. Sept. 2023 (cit. on p. 8).
[Vad17] S. Vadhan. "The complexity of differential privacy".
In: Tutorials on the Foundations of Cryptography: Dedicated to Oded Goldreich (2017), pp. 347–450. URL: https://privacytools.seas.harvard.edu/files/privacytools/files/manuscript_2016.pdf (cit. on p. 8).
[ZDKSTMAS23] B. Zhang, V. Doroshenko, P. Kairouz, T. Steinke, A. Thakurta, Z. Ma, H. Apte, and J. Spacek. "Differentially Private Stream Processing at Scale". In: arXiv preprint arXiv:2303.18086 (2023). URL: https://arxiv.org/abs/2303.18086 (cit. on p. 6).

Selection | Counting | Person Contribution Bound (10th PC / Median / 90th PC) | Distinct Count (10th PC / Median / 90th PC)
Max Contrib | DP Sampling | 139 | 1196.8 / 1407.5 / 1649.1
Max Contrib | DP Greedy | 139 | 1174.2 / 1439.2 / 1646.5
Max Contrib | DP Matching | 139 | 1222.2 / 1460.9 / 1631.0
90th PC Contrib | DP Sampling | 48 | 1225.4 / 1296.2 / 1377.9
90th PC Contrib | DP Greedy | 48 | 1367.0 / 1432.6 / 1516.3
90th PC Contrib | DP Matching | 48 | 1365.3 / 1444.7 / 1524.8
Max Utility | Sampling | 41 | 1247.0 / 1259.0 / 1270.0
Max Utility | Greedy | 20 | 1376
Max Utility | Matching | 17 | 1428
DP Max Utility | Sampling | 8.9 / 16.0 / 28.0 | 661.6 / 892.5 / 1124.5
DP Max Utility | Greedy | 8.0 / 11.0 / 17.0 | 1148.0 / 1241.0 / 1348.0
DP Max Utility | Matching | 7.0 / 9.0 / 14.0 | 1252.0 / 1317.0 / 1400.0
DP Max Utility | DP Sampling | 9.0 / 16.0 / 27.1 | 702.4 / 899.1 / 1145.1
DP Max Utility | DP Greedy | 8.0 / 10.0 / 19.0 | 1128.5 / 1224.4 / 1370.8
DP Max Utility | DP Matching | 6.9 / 9.0 / 13.1 | 1220.6 / 1319.1 / 1394.2
Table 4: Amazon Fashion: the comparison is for ℓmax = 100. A single number denotes a deterministic value.
Selection | Counting | Person Contribution Bound (10th PC / Median / 90th PC) | Distinct Count (10th PC / Median / 90th PC)
90th PC Contrib | DP Sampling | 297 | 32458.1 / 32943.8 / 33452.6
90th PC Contrib | DP Greedy | 297 | 36270.3 / 36669.5 / 37019.0
90th PC Contrib | DP Matching | 297 | 36236.2 / 36651.7 / 37102.7
Max Utility | Sampling | 99 | 24967.0 / 25039.0 / 25121.2
Max Utility | Greedy | 79 | 36246
Max Utility | Matching | 42 | 36364
DP Max Utility | Sampling | 85.9 / 96.0 / 99.0 | 23852.8 / 24739.0 / 25049.8
DP Max Utility | Greedy | 34.0 / 49.0 / 66.1 | 35393.0 / 35839.0 / 36116.9
DP Max Utility | Matching | 22.9 / 30.5 / 43.2 | 36026.8 / 36243.5 / 36371.2
DP Max Utility | DP Sampling | 87.0 / 95.0 / 99.0 | 23997.6 / 24701.1 / 25067.7
DP Max Utility | DP Greedy | 32.9 / 47.5 / 68.0 | 35336.6 / 35776.2 / 36136.6
DP Max Utility | DP Matching | 22.0 / 28.0 / 38.0 | 35970.5 / 36198.9 / 36326.7
Table 5: Amazon Industrial and Scientific: the comparison is for ℓmax = 100. A single number denotes a deterministic value.

Lemma A.1 (Sensitivity of DC(D; ℓ)). As in Equation (2), for D ∈ (Ω*)^n and ℓ ∈ ℕ, define

DC(D; ℓ) := max{ |⋃_{i∈[n]} v_i| : ∀i ∈ [n], v_i ⊆ u_i ∧ |v_i| ≤ ℓ }.

Let D, D' ∈ (Ω*)* be neighboring. That is, D and D' differ only by the addition or removal of one entry. Then |DC(D; ℓ) − DC(D'; ℓ)| ≤ ℓ for all ℓ ∈ ℕ.

Proof. Without loss of generality, D = (u_1, …, u_n) ∈ (Ω*)^n and D' = (u_1, …, u_{n−1}) ∈ (Ω*)^{n−1}. I.e., D' is D with person n removed. Let ℓ ∈ ℕ.
Selection | Counting | Person Contribution Bound (10th PC / Median / 90th PC) | Distinct Count (10th PC / Median / 90th PC)
90th PC Contrib | DP Sampling | 75 | 92480.7 / 92654.8 / 92812.1
90th PC Contrib | DP Greedy | 75 | 102544.8 / 102665.7 / 102817.7
90th PC Contrib | DP Matching | 75 | 102651.1 / 102784.1 / 102907.8
Max Utility | Sampling | 99 | 95606.9 / 95692.0 / 95750.3
Max Utility | Greedy | 52 | 102543
Max Utility | Matching | 32 | 102685
DP Max Utility | Sampling | 89.0 / 96.0 / 99.0 | 94549.9 / 95394.5 / 95656.5
DP Max Utility | Greedy | 26.0 / 33.0 / 50.0 | 102015.0 / 102253.0 / 102527.0
DP Max Utility | Matching | 14.0 / 18.5 / 30.0 | 102357.0 / 102501.5 / 102671.0
DP Max Utility | DP Sampling | 88.8 / 96.0 / 99.0 | 94665.2 / 95375.5 / 95693.5
DP Max Utility | DP Greedy | 27.0 / 34.0 / 53.0 | 102053.2 / 102289.6 / 102531.2
DP Max Utility | DP Matching | 14.9 / 18.5 / 28.0 | 102379.7 / 102512.6 / 102643.9
Table 6: Reddit: the comparison is for ℓmax = 100. A single number denotes a deterministic value.

Selection | Counting | Person Contribution Bound (10th PC / Median / 90th PC) | Distinct Count (10th PC / Median / 90th PC)
90th PC Contrib | DP Sampling | 238 | 95264.5 / 95593.5 / 95966.1
90th PC Contrib | DP Greedy | 238 | 98411.0 / 98734.0 / 99120.0
90th PC Contrib | DP Matching | 238 | 98354.2 / 98729.4 / 99164.2
Max Utility | Sampling | 29 | 49907.8 / 50036.5 / 50195.3
Max Utility | Greedy | 29 | 98459
Max Utility | Matching | 19 | 98712
DP Max Utility | Sampling | 29.0 / 29.0 / 29.0 | 49899.6 / 50070.5 / 50220.9
DP Max Utility | Greedy | 22.0 / 25.0 / 29.0 | 98244.0 / 98364.0 / 98459.0
DP Max Utility | Matching | 13.0 / 16.0 / 21.0 | 98586.0 / 98674.0 / 98721.0
DP Max Utility | DP Sampling | 29.0 / 29.0 / 29.0 | 49924.2 / 50053.7 / 50211.9
DP Max Utility | DP Greedy | 20.0 / 26.0 / 29.0 | 98126.7 / 98369.6 / 98451.8
DP Max Utility | DP Matching | 12.0 / 16.0 / 21.0 | 98555.6 / 98670.4 / 98726.8
Table 7: IMDB: the comparison is for ℓmax = 30. A single number denotes a deterministic value.

Let v_1 ⊆ u_1, …, v_n ⊆ u_n satisfy ∀i ∈ [n] |v_i| ≤ ℓ and |⋃_{i∈[n]} v_i| = DC(D; ℓ), and let v'_1 ⊆ u_1, …, v'_{n−1} ⊆ u_{n−1} satisfy ∀i ∈ [n−1] |v'_i| ≤ ℓ and |⋃_{i∈[n−1]} v'_i| = DC(D'; ℓ). We can convert the witness v'_1, …, v'_{n−1} for D' into a witness for D simply by adding the empty set. Define v'_n = ∅, which satisfies v'_n ⊆ u_n and |v'_n| ≤ ℓ. Then

DC(D; ℓ) = max{ |⋃_{i∈[n]} v_i| : ∀i ∈ [n], v_i ⊆ u_i ∧ |v_i| ≤ ℓ } ≥ |⋃_{i∈[n]} v'_i| = DC(D'; ℓ).
Similarly, we can convert the witness v_1, …, v_n for D into a witness for D' simply by discarding v_n:

DC(D'; ℓ) = max{ |⋃_{i∈[n−1]} v'_i| : ∀i ∈ [n−1], v'_i ⊆ u_i ∧ |v'_i| ≤ ℓ } ≥ |⋃_{i∈[n−1]} v_i| ≥ |⋃_{i∈[n]} v_i| − |v_n| ≥ DC(D; ℓ) − ℓ.

Thus |DC(D; ℓ) − DC(D'; ℓ)| ≤ ℓ, as required.

Proof of Theorem 1.1. Privacy: First note that q_ℓ(D) = DC(D; ℓ) − (2ℓ/ε)·log(1/2β) has sensitivity ℓ. Algorithm 1 accesses the dataset via q_ℓ(D) in two ways: first it runs the generalized exponential mechanism to select ℓ̂, and second it computes ν̂ ← q_ℓ̂(D) + Lap(2ℓ̂/ε). Since the generalized exponential mechanism is ε/2-DP and adding Laplace noise is also ε/2-DP, the overall algorithm is ε-DP by composition.

Lower bound: Since ν̂ ← q_ℓ̂(D) + Lap(2ℓ̂/ε), we have

P_ν̂[ν̂ ≤ q_ℓ̂(D) + (2ℓ̂/ε)·log(1/2β)] = P_ν̂[ν̂ ≥ q_ℓ̂(D) − (2ℓ̂/ε)·log(1/2β)] = 1 − β.   (11)

Since DC(D; ℓ̂) ≤ DC(D) and q_ℓ(D) = DC(D; ℓ) − (2ℓ/ε)·log(1/2β), Equation (11) gives

P_ν̂[ν̂ ≤ DC(D)] ≥ P_ν̂[ν̂ ≤ DC(D; ℓ̂)] = 1 − β.

This is the guarantee of Equation (6); ν̂ is a lower bound on DC(D) with probability ≥ 1 − β.

Upper bound: The accuracy guarantee of the generalized exponential mechanism (Theorem 3.4) is

P[ q_ℓ̂(D) ≥ max_{ℓ∈[ℓmax]} ( q_ℓ(D) − (4ℓ/(ε/2))·log(ℓmax/β) ) ] ≥ 1 − β.   (12)

Combining Equations (11) and (12) with a union bound yields

P_{(ℓ̂,ν̂)←M(D)}[ ν̂ ≥ max_{ℓ∈[ℓmax]} ( DC(D; ℓ) − ((2ℓ + 2ℓ̂)/ε)·log(1/2β) − (8ℓ/ε)·log(ℓmax/β) ) ] ≥ 1 − 2β.   (13)

To interpret Equation (13) we need a high-probability upper bound on ℓ̂. Let A > 0 be determined later and define ℓ*_A := argmax_{ℓ∈[ℓmax]} ( DC(D; ℓ) − A·ℓ/ε ), so that

DC(D; ℓ) ≤ DC(D; ℓ*_A) + (ℓ − ℓ*_A)·A/ε   for all ℓ ∈ [ℓmax].   (14)

Assume the event in Equation (12) happens.
We have

DC(D; ℓ̂) − (2ℓ̂/ε)·log(1/2β) ≤ DC(D; ℓ*_A) + (ℓ̂ − ℓ*_A)·A/ε − (2ℓ̂/ε)·log(1/2β)   (by Equation (14))

and

DC(D; ℓ̂) − (2ℓ̂/ε)·log(1/2β) = q_ℓ̂(D)
  ≥ max_{ℓ∈[ℓmax]} ( q_ℓ(D) − (4ℓ/(ε/2))·log(ℓmax/β) )   (by assumption)
  ≥ DC(D; ℓ*_A) − (2ℓ*_A/ε)·log(1/2β) − (8ℓ*_A/ε)·log(ℓmax/β).

Combining inequalities yields

DC(D; ℓ*_A) − (2ℓ*_A/ε)·log(1/2β) − (8ℓ*_A/ε)·log(ℓmax/β) ≤ DC(D; ℓ*_A) + (ℓ̂ − ℓ*_A)·A/ε − (2ℓ̂/ε)·log(1/2β),   (15)

which simplifies to

ℓ̂ ≤ ℓ*_A · ( 1 + (8·log(ℓmax/β)) / (2·log(1/2β) − A) )   for 0 < A < 2·log(1/2β).   (16)

Now we set A = log(1/2β) to obtain

ℓ̂ ≤ ℓ*_A · ( 1 + (8·log(ℓmax/β)) / log(1/2β) ).   (17)

Substituting Equation (17) into Equation (13) gives

P_{(ℓ̂,ν̂)←M(D)}[ ν̂ ≥ max_{ℓ∈[ℓmax]} ( DC(D; ℓ) − ((2ℓ + 2ℓ*_A)/ε)·log(1/2β) − ((8ℓ + 16ℓ*_A)/ε)·log(ℓmax/β) ) ] ≥ 1 − 2β.   (18)

We simplify Equation (18) using log(1/2β) ≤ log(ℓmax/β) to obtain Equation (7).

Computational efficiency: Finally, the runtime of DPDISTINCTCOUNT(D) is dominated by ℓmax calls to the SENSITIVEDISTINCTCOUNT(D) subroutine, which computes the maximum size of a matching in a bipartite graph with |E| = Σ_{i∈[n]} |u_i|·min{ℓ, |u_i|} ≤ |D|·ℓmax edges and |V| + |U| = DC(D) + Σ_{i∈[n]} min{ℓ, |u_i|} ≤ 2|D| vertices. The Hopcroft-Karp-Karzanov algorithm runs in O(|E|·√(|V| + |U|)) ≤ O(|D|^{1.5}·ℓmax) time.

Proof of Theorem 1.2. Privacy: For convenience we analyze Algorithm 4, which produces the same result as Algorithm 2. (The difference is that Algorithm 4 is written to be efficient, while Algorithm 2 is written in a redundant manner to make the subroutine we are analyzing clear.) The privacy of the algorithm follows by composing the privacy guarantees of the generalized exponential mechanism and Laplace noise addition. The only missing part is to prove that SENSITIVEAPPROXDISTINCTCOUNT(·, ℓ) has sensitivity ℓ. Since

SENSITIVEAPPROXDISTINCTCOUNT((u_1, …, u_n), ℓ) = SENSITIVEAPPROXDISTINCTCOUNT((u_1, …, u_n, …, u_1, …, u_n), 1),

where the dataset on the right-hand side repeats (u_1, …, u_n) ℓ times, it is enough to prove that SENSITIVEAPPROXDISTINCTCOUNT(·, 1) has sensitivity 1. Assume D = (u_1, …, u_n) and D' = (u_1, …, u_{j−1}, u_{j+1}, …, u_n).
Algorithm 4 Approximate Distinct Count Algorithm
1: procedure SENSITIVEAPPROXDISTINCTCOUNT(D = (u_1, …, u_n) ∈ (Ω*)^n; ℓ ∈ ℕ)
2:   S ← ∅.
3:   for ℓ' ∈ [ℓ] do
4:     for i ∈ [n] with u_i \ S ≠ ∅ do
5:       Choose the lexicographically first v ∈ u_i \ S. Match (i, ℓ') to v.
6:       Update S ← S ∪ {v}.
7:     end for
8:   end for
9:   return |S|
10: end procedure
11: procedure DPAPPROXDISTINCTCOUNT(D = (u_1, …, u_n) ∈ (Ω*)^n; ℓmax ∈ ℕ, ε > 0, β ∈ (0, 1/2))
12:   for ℓ ∈ [ℓmax] do
13:     Define q_ℓ(D) := SENSITIVEAPPROXDISTINCTCOUNT(D; ℓ) − (2ℓ/ε)·log(1/2β).
14:   end for
15:   ℓ̂ ← GEM(D; {q_ℓ}_{ℓ∈[ℓmax]}, {ℓ}_{ℓ∈[ℓmax]}, ε/2, β).   ▷ Algorithm 3
16:   ν̂ ← q_ℓ̂(D) + Lap(2ℓ̂/ε).
17:   return (ℓ̂, ν̂) ∈ [ℓmax] × ℝ.
18: end procedure

Let S_1, …, S_n and v_1, …, v_n be the states of S and v (v_i = ∅ if u_i \ S_{i−1} = ∅), respectively, when we run SENSITIVEAPPROXDISTINCTCOUNT(D, 1). Similarly, let S'_1, …, S'_{j−1}, S'_{j+1}, …, S'_n and v'_1, …, v'_{j−1}, v'_{j+1}, …, v'_n be the states of S and v (v'_i = ∅ if u_i \ S'_{i−1} = ∅),6 respectively, when we run SENSITIVEAPPROXDISTINCTCOUNT(D', 1). Our goal is to show that ||S_n| − |S'_n|| ≤ 1. Define S'_j = S'_{j−1} and v'_j = ∅. Clearly S'_i = S_i and v'_i = v_i for all i < j. For i ≥ j, we claim that S'_i ⊆ S_i and |S_i| ≤ |S'_i| + 1; more precisely, S_i = S'_i ∪ {v*_i} for some v*_i (possibly v*_i = ∅, whence S_i = S'_i). This is true for i = j, since S'_j = S'_{j−1} = S_{j−1} and S_j = S_{j−1} ∪ {v_j}, so we may take v*_j = v_j. We prove the claim for i > j by induction. I.e., assume S_{i−1} = S'_{i−1} ∪ {v*_{i−1}} for some v*_{i−1} (possibly v*_{i−1} = ∅, whence S_{i−1} = S'_{i−1}). Now v_i is the lexicographically first element of u_i \ S_{i−1} and v'_i is the lexicographically first element of u_i \ S'_{i−1} (or ∅ if these sets are empty). By the induction assumption, u_i \ S_{i−1} ⊆ u_i \ S'_{i−1} ⊆ (u_i \ S_{i−1}) ∪ {v*_{i−1}}. Thus either v'_i = v_i or v'_i = v*_{i−1}. If v'_i = v_i, then S'_i = S'_{i−1} ∪ {v'_i} ⊆ S_{i−1} ∪ {v_i} = S_i and S'_i ∪ {v*_{i−1}} = S'_{i−1} ∪ {v'_i, v*_{i−1}} ⊇ S_{i−1} ∪ {v_i} = S_i, so the claim holds with v*_i = v*_{i−1}. If v'_i = v*_{i−1}, then S'_i = S'_{i−1} ∪ {v*_{i−1}} = S_{i−1} ⊆ S_i and S'_i ∪ {v_i} = S'_{i−1} ∪ {v_i, v*_{i−1}} ⊇ S_{i−1} ∪ {v_i} = S_i, so the claim holds with v*_i = v_i. The sensitivity bound implies that q_ℓ(D) = |S| − (2ℓ/ε)·log(1/2β) has sensitivity ℓ.
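For concreteness, the greedy subroutine on lines 1–10 of Algorithm 4 can be sketched in a few lines of Python. This is our own illustrative code: the function name, snake-cased from the pseudocode, and the list-of-lists data layout are assumptions, and the DP wrapper (GEM selection and Laplace noise) is omitted.

```python
def sensitive_approx_distinct_count(dataset, bound):
    """Greedy bounded-contribution distinct count (Algorithm 4, lines 1-10).

    dataset: list of per-person element lists; bound: contribution bound ℓ.
    Each of the `bound` rounds lets every person contribute at most one
    element not already covered, so each person changes the result by at
    most `bound` -- this is the sensitivity bound used in the proof.
    """
    covered = set()                 # the set S
    for _ in range(bound):          # rounds ℓ' = 1, ..., ℓ
        for items in dataset:       # each person i
            # lexicographically first uncovered element of u_i, if any
            remaining = sorted(set(items) - covered)
            if remaining:
                covered.add(remaining[0])
    return len(covered)

# Person 2 alone holds five elements, but the bound caps their influence.
data = [["a", "b"], ["b", "c", "d", "e", "f"], ["a"]]
print(sensitive_approx_distinct_count(data, 1))  # -> 2
print(sensitive_approx_distinct_count(data, 2))  # -> 3
```

Re-sorting each person's remaining elements every round keeps the sketch short; the linked-list implementation described in the computational-efficiency paragraph avoids this and achieves O(|D|) for the main loop.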
Since the generalized exponential mechanism is ε/2-DP and adding Laplace noise is also ε/2-DP, the overall algorithm is ε-DP by composition.

Lower bound: Let us denote by D̂C(D; ℓ) = SENSITIVEAPPROXDISTINCTCOUNT(D, ℓ) the value of |S| we obtain on Line 8 in Algorithm 2. Note that D̂C(D; ℓ) is the size of a maximal matching in G_ℓ, where G_ℓ is the bipartite graph corresponding to the input D with ℓ copies of each person (see Algorithm 1 for a formal description of the graph). Since a maximal matching is a 2-approximation to a maximum matching [God], we have

(1/2)·DC(D; ℓ) ≤ D̂C(D; ℓ) ≤ DC(D; ℓ) ≤ DC(D).   (19)

Since ν̂ ← q_ℓ̂(D) + Lap(2ℓ̂/ε), we have

P_ν̂[ν̂ ≤ q_ℓ̂(D) + (2ℓ̂/ε)·log(1/2β)] = P_ν̂[ν̂ ≥ q_ℓ̂(D) − (2ℓ̂/ε)·log(1/2β)] = 1 − β.   (20)

6 For notational convenience, we define {∅} = ∅.

Substituting q_ℓ(D) = D̂C(D; ℓ) − (2ℓ/ε)·log(1/2β) into Equations (19) and (20) gives Equation (9):

P_ν̂[ν̂ ≤ DC(D)] ≥ P_ν̂[ν̂ ≤ D̂C(D; ℓ̂)] = 1 − β.

Upper bound: The accuracy guarantee of the generalized exponential mechanism (Theorem 3.4) is

P[ q_ℓ̂(D) ≥ max_{ℓ∈[ℓmax]} ( q_ℓ(D) − (4ℓ/(ε/2))·log(ℓmax/β) ) ] ≥ 1 − β.   (21)

Combining Equations (20) and (21) and the definition of q_ℓ(D) with a union bound yields

P_{(ℓ̂,ν̂)←M(D)}[ ν̂ ≥ max_{ℓ∈[ℓmax]} ( D̂C(D; ℓ) − ((2ℓ + 2ℓ̂)/ε)·log(1/2β) − (8ℓ/ε)·log(ℓmax/β) ) ] ≥ 1 − 2β.

Now we assume max_{i∈[n]} |u_i| ≤ ℓ* ≤ ℓmax for some ℓ*. This implies DC(D) = DC(D; ℓ*). We have

max_{ℓ∈[ℓmax]} ( D̂C(D; ℓ) − ((2ℓ + 2ℓ̂)/ε)·log(1/2β) − (8ℓ/ε)·log(ℓmax/β) )
  ≥ D̂C(D; ℓ*) − ((2ℓ* + 2ℓ̂)/ε)·log(1/2β) − (8ℓ*/ε)·log(ℓmax/β)
  ≥ (1/2)·DC(D) − ((2ℓ* + 2ℓ̂)/ε)·log(1/2β) − (8ℓ*/ε)·log(ℓmax/β).

As in the proof of Theorem 1.1, if the event in Equation (21) happens, we can show that

ℓ̂ ≤ ℓ*_A · ( 1 + (8·log(ℓmax/β)) / log(1/2β) ),   where   ℓ*_A := argmax_{ℓ∈[ℓmax]} ( D̂C(D; ℓ) − A·ℓ/ε )   with A = log(1/2β).

Note that ℓ*_A ≤ ℓ*, since D̂C(D; ℓ) ≤ D̂C(D; ℓ*) for all ℓ ∈ [ℓmax]. (That is simply to say that the size of the maximal matching cannot be increased by making more copies of a vertex once there is one copy for each neighbor.) Combining bounds yields

P_{(ℓ̂,ν̂)←M(D)}[ ν̂ ≥ (1/2)·DC(D) − ((2ℓ* + 2ℓ*_A)/ε)·log(1/2β) − ((8ℓ* + 16ℓ*_A)/ε)·log(ℓmax/β) ] ≥ 1 − 2β.

Since log(1/2β) ≤ log(ℓmax/β), this implies Equation (10).

Computational efficiency: It only remains to verify that Algorithm 2 can be implemented in O(|D| + ℓmax·log ℓmax) time.
We can implement S using a hash table to ensure that we can add an element or query membership of an element in constant time. (We can easily maintain a counter for the size of S.) We assume D is presented as a linked list of linked lists representing each u_i, and furthermore that the linked lists u_i are sorted in lexicographic order. The outer loop proceeds through the linked list for D = (u_1, …, u_n). For each u_i, we simply pop elements from the linked list and check if they are in S until either we find v ∈ u_i \ S (and add v to S) or u_i becomes empty (in which case we remove it from the linked list for D). Since each iteration decrements |D|, the runtime of the main loop is O(|D|). Running the generalized exponential mechanism (Algorithm 3) takes O(ℓmax·log ℓmax) time.

B Implementing the Generalized Exponential Mechanism

We conclude with some remarks about implementing the generalized exponential mechanism of Raskhodnikova and Smith [RS15] given in Algorithm 3. There are two parts to this: first we must compute the normalized scores, and then we must run the standard exponential mechanism. Implementing the standard exponential mechanism is well-studied [Ilv20] and can be performed in linear time (with some caveats about randomness and precision). We remark that, instead of the exponential mechanism, we can use report-noisy-max or permute-and-flip [MS20; DKSWXZ+21; DDKPWWXZ23]. These variants may be easier to implement and may provide better utility too. The normalized scores are given by

∀i ∈ [m]   s_i = min_{j∈[m]} ((q_i(D) − t·Δ_i) − (q_j(D) − t·Δ_j)) / (Δ_i + Δ_j),   (29)

where q_i(D) is the value of the i-th query on the dataset, Δ_i > 0 is the sensitivity of q_i, and t is a constant. Naïvely it would take Θ(m²) time to compute all the normalized scores. However, a more complex algorithm can compute the scores in O(m log m) time.
Observe that, for each i ∈ [m], we have

s_i = min_{j∈[m]} ((q_i(D) − t·Δ_i) − (q_j(D) − t·Δ_j)) / (Δ_i + Δ_j)
  ⟺ min_{j∈[m]} ( (q_i(D) − t·Δ_i) − (q_j(D) − t·Δ_j) − s_i·(Δ_i + Δ_j) ) = 0
  ⟺ q_i(D) − (s_i + t)·Δ_i = max_{j∈[m]} ( q_j(D) + (s_i − t)·Δ_j ) =: f(s_i − t).

That is, we can compute s_i by solving the equation q_i(D) − (s_i + t)·Δ_i = f(s_i − t). Since f(x) := max_{j∈[m]} ( q_j(D) + x·Δ_j ) is the maximum of increasing linear functions, f is a convex, increasing, piecewise-linear function with at most m pieces. We can represent f by a sorted list of the points where the linear pieces connect, along with the linear function on each of the pieces. We can compute this representation of f as a pre-processing step in O(m log m) time; we sort the lines y(x) = q_j(D) + x·Δ_j by their slope Δ_j and then compute the intersection points between consecutive lines (we delete lines that never realize the max). Given the above representation of f and the values q_i(D), Δ_i, t, we can compute s_i in O(log m) time. We must solve q_i(D) − (s_i + t)·Δ_i = f(s_i − t) for s_i. We can perform binary search on the pieces of f to identify j such that f(s_i − t) = q_j(D) + (s_i − t)·Δ_j. Once we have this, we can directly compute s_i = ((q_i(D) − t·Δ_i) − (q_j(D) − t·Δ_j)) / (Δ_i + Δ_j). The binary search takes O(log m) time and we must compute m scores. Thus the overall runtime (including the pre-processing) is O(m log m).
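The O(m log m) procedure above can be written out as follows. This is our own illustrative Python sketch (the names `upper_envelope`, `normalized_scores`, `q`, `delta`, `t` are ours); it uses floating-point arithmetic, whereas a careful implementation of the exponential mechanism may require exact arithmetic [Ilv20].

```python
def upper_envelope(lines):
    """Upper envelope of lines y = m*x + c, given as (m, c) pairs with m > 0.

    Returns (hull, breaks): hull[k] realizes the maximum between
    breaks[k-1] and breaks[k] (breaks has len(hull) - 1 entries).
    """
    lines = sorted(lines)
    dedup = []                      # for equal slopes keep the top intercept
    for m, c in lines:
        if dedup and dedup[-1][0] == m:
            dedup[-1] = (m, max(dedup[-1][1], c))
        else:
            dedup.append((m, c))
    hull = []
    for m, c in dedup:
        # pop lines that never realize the max once (m, c) is added
        while len(hull) >= 2:
            m0, c0 = hull[-2]
            m1, c1 = hull[-1]
            if (c - c0) * (m1 - m0) >= (c1 - c0) * (m - m0):
                hull.pop()
            else:
                break
        hull.append((m, c))
    breaks = [(hull[k][1] - hull[k + 1][1]) / (hull[k + 1][0] - hull[k][0])
              for k in range(len(hull) - 1)]
    return hull, breaks

def normalized_scores(q, delta, t):
    """All scores s_i in O(m log m): solve q_i - (s_i + t)*delta_i = f(s_i - t)."""
    hull, breaks = upper_envelope(list(zip(delta, q)))  # f(x) = max_j q_j + x*delta_j
    scores = []
    for qi, di in zip(q, delta):
        # binary search for the piece of f crossed by the decreasing
        # line x -> qi - (x + 2t)*di (substituting x = s_i - t)
        lo, hi = 0, len(hull) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            x = breaks[mid]
            if hull[mid + 1][1] + x * hull[mid + 1][0] <= qi - (x + 2 * t) * di:
                lo = mid + 1        # crossing lies to the right of this break
            else:
                hi = mid
        dj, qj = hull[lo]
        scores.append(((qi - t * di) - (qj - t * dj)) / (di + dj))
    return scores
```

The output agrees with the naive Θ(m²) evaluation of Equation (29): a line deleted from the envelope never attains the max in f, so it cannot determine the crossing point, and on the crossing piece the linear equation yields exactly the closed form for s_i.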