Unmasking Vulnerabilities: Cardinality Sketches under Adaptive Inputs

Sara Ahmadian*¹, Edith Cohen*¹²

*Equal contribution. ¹Google Research, United States. ²Department of Computer Science, Tel Aviv University, Israel. Correspondence to: Sara Ahmadian, Edith Cohen.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Cardinality sketches are popular data structures that enhance the efficiency of working with large data sets. The sketches are randomized representations of sets that are only of logarithmic size but can support set merges and approximate cardinality (i.e., distinct count) queries. When queries are not adaptive, that is, they do not depend on preceding query responses, the design provides strong guarantees of correctly answering a number of queries exponential in the sketch size k. In this work, we investigate the performance of cardinality sketches in adaptive settings and unveil inherent vulnerabilities. We design an attack against the standard estimators that constructs an adversarial input by post-processing responses to a set of simple non-adaptive queries of size linear in the sketch size k. Empirically, our attack used only 4k queries with the widely used HyperLogLog (HLL++) sketch (Flajolet et al., 2007b; Heule et al., 2013). The simplicity of the attack technique suggests that it can be effective even with post-processed natural workloads. Finally and importantly, we demonstrate that the vulnerability is inherent: any estimator applied to the known sketch structures can be attacked using a number of queries that is quadratic in k, matching a generic upper bound.

1. Introduction

Composable sketches for cardinality estimation are data structures that are commonly used in practice (Apache Software Foundation, Accessed: 2024; Google Cloud, Accessed: 2024) and have been studied extensively (Flajolet & Martin, 1985; Flajolet et al., 2007a; Cohen, 1997; Bar-Yossef et al., 2002; Kane et al., 2010; Nelson et al., 2014; Cohen, 2015; Pettie & Wang, 2021). The sketch of a set is a compact representation that supports merge (set union) operations, adding elements, and retrieval of approximate cardinality. The sketch size is only logarithmic or double logarithmic in the cardinality of queries, which allows for a significant efficiency boost over linear size data structures such as Bloom filters.

Formally, for a universe 𝒰 of keys and randomness ρ, a sketch map U ↦ S_ρ(U) maps each set of keys U ∈ 2^𝒰 to its sketch S_ρ(U). Sketch maps are designed so that for each U we can recover, with high probability over ρ, an estimate of |U| by applying an estimator M to S_ρ(U). A common guarantee is a bound on the Normalized Root Mean Squared Error (NRMSE), so that for an accuracy parameter ϵ,

$$\sqrt{\mathbb{E}\left[\left(\frac{M(S_\rho(U))}{|U|} - 1\right)^{2}\right]} \;\le\; \epsilon . \tag{1}$$

The maps S_ρ are designed to be composable: For a set U and key u, the sketch S_ρ(U ∪ {u}) can be computed from S_ρ(U) and u. For two sets U, V, the sketch S_ρ(U ∪ V) can be computed from their respective sketches S_ρ(U), S_ρ(V). Composability is a crucial property that makes the sketch representations useful for streaming, distributed, and parallel applications. Importantly, the use of the same internal randomness ρ across all queries is necessary for composability, and therefore in typical use cases ρ is fixed across a system.
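To make the merge and add operations concrete, here is a minimal Python sketch of a composable k-mins map (our own illustrative code, not from the paper; the SHA-256-based hashing and the class name are arbitrary choices). It demonstrates the property used throughout: the sketch of A ∪ B is computable from the sketches of A and B alone.

```python
import hashlib
import math

class KMinsSketch:
    """Minimal k-mins cardinality sketch: for each of k hash functions,
    keep the minimum hash value over all keys added so far."""

    def __init__(self, k, seed=0, mins=None):
        self.k, self.seed = k, seed
        self.mins = list(mins) if mins is not None else [math.inf] * k

    def _priority(self, i, x):
        # Deterministic Exp(1)-distributed priority of key x under hash i.
        d = hashlib.sha256(f"{self.seed}:{i}:{x}".encode()).digest()
        u = (int.from_bytes(d[:8], "big") + 1) / 2.0**64
        return -math.log(u)

    def add(self, x):
        for i in range(self.k):
            p = self._priority(i, x)
            if p < self.mins[i]:
                self.mins[i] = p

    def merge(self, other):
        # Composability: coordinate-wise minimum gives the sketch of the union.
        return KMinsSketch(self.k, self.seed,
                           [min(a, b) for a, b in zip(self.mins, other.mins)])

    def estimate(self):
        # Each register is Exp(|U|); the sum T is a sufficient statistic
        # and (k - 1) / T is an unbiased cardinality estimate.
        return (self.k - 1) / sum(self.mins)

sa, sb = KMinsSketch(64), KMinsSketch(64)
for x in range(0, 20_000):
    sa.add(x)
for x in range(10_000, 40_000):
    sb.add(x)
print(round(sa.merge(sb).estimate()))  # close to |A ∪ B| = 40,000
```

Note that both sketches must be built with the same seed (the shared randomness ρ); this is exactly why ρ is fixed system-wide, and why it can leak under adaptive queries.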
The basic technique in the design of cardinality (distinct count) sketches is to randomly prioritize the keys in the universe 𝒰 through random hash functions (specified by ρ). The sketch of a set U keeps the lowest priorities of keys in U. This provides information on the cardinality |U|, since a larger cardinality corresponds to the presence of lower-priority keys in U. This technique was introduced by Flajolet & Martin (1985) for counting distinct elements in streaming and as composable sketches of sets by Cohen (1997). The core idea of sampling keys based on a random order emerged in reservoir sampling (Knuth, 1998) and in weighted sampling (Rosén, 1997). Cardinality sketches are also Locality Sensitive Hashing (LSH) maps (Indyk & Motwani, 1998) with respect to set differences.

The specific designs of cardinality sketches vary and include Min Hash sketches (randomly map keys to priorities) and domain sampling (randomly map keys to sampling rates). With these methods, the sketch size dependence on the maximum query size |U| ≤ n is log n or log log n. The sketch size (number of registers) needed for the NRMSE guarantee of Equation (1) is k = O(ϵ⁻²). The sketch size needed for the following (ϵ, δ) guarantee (confidence 1 − δ of a relative error of ϵ)

$$\Pr\left[\;\left|\frac{M(S_\rho(U))}{|U|} - 1\right| > \epsilon\;\right] \;\le\; \delta \tag{2}$$

is k = O(ϵ⁻² log(1/δ)). This guarantee means that for any U, almost any sampled ρ works well.

We model the use of the sketching map S_ρ as an interaction between a source, which issues queries Uᵢ ⊆ 𝒰, and a query response (QR) algorithm, which receives the sketch S_ρ(Uᵢ) (but not Uᵢ), applies an estimator M, and returns the estimate M(S_ρ(Uᵢ)) of the cardinality |Uᵢ|. When randomized data structures or algorithms are invoked interactively, it is important to distinguish between non-adaptive queries, which do not depend on ρ, and adaptive queries. In non-adaptive settings we can treat queries as fixed in advance. In this case, we can apply a union bound with the guarantee of Equation (2) and obtain that the probability that the responses for all r queries are within a relative error of ϵ is at least 1 − rδ. Therefore, the sketch size needed to provide an (ϵ, δ)-guarantee on this ℓ∞ error is k = O(ϵ⁻² log(r/δ)). In particular, the query response algorithm can be correct on a number of non-adaptive queries that is exponential in the sketch size before a set is encountered for which the estimate is off.

Many settings, however, such as control loops, optimization processes, or malicious behavior, give rise to adaptive inputs. This can happen inadvertently when a platform such as Apache DataSketches (Apache Software Foundation, Accessed: 2024) or SQL (Google Cloud, Accessed: 2024) is used. In such cases, information on the randomness ρ may leak from query responses, and the union bound argument does not hold. An important question that arises is thus to understand the actual vulnerability of specific algorithms in such settings. Are they practically robust? How efficiently can they be attacked? What can be the consequences of finding such an adversarial input?

Randomized data structures designed for non-adaptive queries can be applied in generic ways with adaptive queries. However, the guarantees provided by the resulting (robust) algorithms tend to be significantly weaker than those of their non-robust counterparts. The straightforward approach is to maintain multiple copies of the sketch maps (with independent randomness) and discard a copy after it is used once to respond to a query.
This results in a linear dependence of the number of supported queries r on the size of the data structure. Hassidim et al. (2020) proposed the robustness wrapper method, which allows for r² adaptive queries using O(r) sketch maps. The method uses differential privacy to protect the randomness; the analysis uses generalization (Dwork et al., 2015b; Bassily et al., 2021) and advanced composition (Dwork et al., 2006). The quadratic relation is known to be tight in the worst case for adaptive statistical queries, with an attack (Hardt & Ullman, 2014; Steinke & Ullman, 2015) designed using Fingerprinting Codes (Boneh & Shaw, 1998). But these attacks do not preclude a tailored design with a better utility guarantee for cardinality sketches, and they also do not apply with natural workloads.

Contributions and Overview

We consider the known sublinear composable sketch structures for cardinality estimation, which we review in Section 3. Our primary contribution is designing attacks that construct a set U that is adversarial for the randomness ρ. We make this precise in the sequel, but for now, an adversarial set U is one that results in cardinality estimates that are far off.

We first consider query response algorithms that use the standard cardinality estimators. These estimators optimally use the information in the sketch and report a value that is a function of a sufficient statistic of the cardinality. In Section 4 we present an attack on these estimators. The product of the attack is an adversarial set, one for which the sketch S_ρ(U) is grossly out of distribution. The attack uses a number of queries O(k) that is linear in the sketch size and, importantly, issues all queries in a single batch. The only adaptive component is the post-processing of the query responses. This single-batch attack suggests that it is possible to construct an adversarial input by simply observing and post-processing a normal, non-adaptive workload of the system. The linear size of the attack matches the straightforward upper bound obtained by using disjoint components of the sketch for different queries.

We conduct an empirical evaluation of our proposed attack on the HyperLogLog (HLL) sketch (Durand & Flajolet, 2003; Flajolet et al., 2007a) with the HLL++ estimator (Heule et al., 2013). This is the most widely utilized sketch for cardinality estimation in practice. The results reported in Section 5 show that even with a single-batch attack using 4k queries, we can consistently construct adversarial inputs on which the estimator substantially overestimates or underestimates the cardinality by 40%.

In Section 6 and Section 7, we present an attack that applies broadly against any correct query response algorithm. By that, we establish the inherent vulnerability of the sketch structures themselves. Our attack uses O(k²) adaptive queries. We show that multiple batches are necessary against strategic query response algorithms. This quadratic attack size matches the generic quadratic upper bound construction of Hassidim et al. (2020). The product of our attack is a small mask set M that can poison larger sets U, in the sense that S(M ∪ U) = S(M), making any estimator ineffective. The attack applies even when the query response is for the more specialized soft threshold problem: determine whether the cardinality is below or above a range of the form [A, 2A]. Moreover, it applies even when the response is tailored to the attack algorithm and its internal state, including the distribution from which the query sets are selected at each step.
Note that this strengthening of the query response algorithm and simplification of its task only make the query response algorithm harder to attack.

Our attacks have the following structure: We fix a ground set N of keys and issue queries that are subsets Uᵢ ⊆ N. We maintain scores for the keys in N, which are adjusted for the keys in Uᵢ based on the response. The design has the property that scores are correlated with the priorities of keys, and that the score is higher when the cardinality is underestimated. The adversarial set is then identified as a prefix or a suffix of the keys ordered by their score.

The vulnerabilities we exposed may have practical significance in multiple scenarios. In a non-malicious setting, an adaptive algorithm or an optimization process that is applied in sketch space can select keys that tend to be in overestimated (or underestimated) sets, essentially emulating an attack and inadvertently selecting a biased set on which the estimate is off. In malicious settings, the construction of an adversarial input set U can be an end goal. For example, a system that collects statistics on network traffic can be tricked into reporting that traffic is much larger or much smaller than it actually is. A malicious player can poison the dataset by injecting a small adversarial set M into the data U, for example, by issuing respective search queries to a system that sketches sets of search queries. The sketch S_ρ(M ∪ U) then masks S_ρ(U), making it impossible to recover an estimate of the true cardinality of U. Finally, cardinality sketches have weighted extensions (max-distinct statistics) and are building blocks of sketches designed for a large class of concave sublinear frequency statistics, which includes cap statistics and frequency moments with p ≤ 1 (Cohen, 2018; Cohen & Geri, 2019; Jayaram & Woodruff, 2023); these vulnerabilities therefore apply to the extensions as well.

2. Related Work

There are prolific lines of research on the effect of adaptive inputs that span multiple areas, including dynamic graph algorithms (Shiloach & Even, 1981; Ahn et al., 2012; Gawrychowski et al., 2020; Gutenberg & Wulff-Nilsen, 2020; Wajc, 2020; Beimel et al., 2021), sketching and streaming algorithms (Mironov et al., 2008; Hardt & Woodruff, 2013; Ben-Eliezer et al., 2021b; Hassidim et al., 2020; Woodruff & Zhou, 2021; Attias et al., 2021; Ben-Eliezer et al., 2021a; Cohen et al., 2022a;b), adaptive data analysis (Freedman, 1983; Ioannidis, 2005; Lukacs et al., 2009; Hardt & Ullman, 2014; Dwork et al., 2015a), and machine learning (Szegedy et al., 2013; Goodfellow et al., 2014; Athalye et al., 2018; Papernot et al., 2017).

Reviriego & Ting (2020) and Paterson & Raynal (2021) proposed attacks on the HLL sketch with the standard estimators. The proposed attacks were in a streaming setting and utilized many dependent queries in order to detect keys whose insertion results in updates to the cardinality estimate. Our attacks are more general: we construct single-batch attacks with standard estimators and also construct attacks on these cardinality sketches that apply with any estimator.

The question of the robustness of cardinality sketches to adaptive inputs is related to, but different from, the well-studied question of whether they are differentially private. Cardinality sketches were shown to be not privacy preserving when the sketch randomness or content are public (Desfontaines et al., 2019).
Other works (Smith et al., 2020; Pagh & Stausholm, 2021; Knop & Steinke, 2023) showed that cardinality sketches are privacy preserving, but this is under the assumption that the randomness is used only once. Our contribution here is designing attacks and quantifying their efficiency. The common ground with privacy is the high sensitivity of the sketch maps to the removal or insertion of a low-priority key.

Several works constructed attacks on linear sketches, including the Johnson-Lindenstrauss Transform (Cherapanamjeri & Nelson, 2020), the AMS sketch (Ben-Eliezer et al., 2021b; Cohen et al., 2022b), and CountSketch (Cohen et al., 2022a;b). The latter works showed that the standard estimators for CountSketch and the AMS sketch can be compromised with a linear number of queries, and that the sketches with arbitrary estimators can be compromised with a quadratic number of queries. The method was to combine low-bias inputs with disjoint supports and have the bias amplified, since the bias increases linearly in the number of inputs r whereas the ℓ₂ norm increases proportionally to √r. This approach does not work with cardinality sketches, which required a different attack structure: combining disjoint sets on which the estimate is slightly biased up will not amplify the bias. The common ground, perhaps surprisingly, is that these fundamental and popular sketches are all vulnerable to adaptive inputs in a similar manner: estimators that optimally use the sketch require linear size attacks, and arbitrary correct estimators require quadratic size attacks.

3. Preliminaries

An attack is an interaction designed to construct a set that is adversarial to the randomness ρ. An adversarial set can be identified by trying out a large number of inputs. We measure the efficiency of the attack by its size (number of issued queries) and concurrency (number of batches of concurrent queries). A set U is adversarial for the randomness ρ if a sufficient statistic for the cardinality that is computed from S_ρ(U) is very skewed with respect to its distribution under sampling of ρ. That is, it has proportionally too few or too many low-priority keys.

Definition 3.1 (Sufficient Statistic). A statistic T : 𝒮 → ℝ on the sketch domain 𝒮 is sufficient for the cardinality |U| if it includes all the information the sketch provides on the cardinality |U|. That is, for each t, the conditional distribution of the random variable S_ρ(U) given T(S_ρ(U)) = t does not depend on |U|.

3.1. Composable Cardinality Sketches

The underlying technique in all small-space cardinality sketches is to use random hashmaps h that assign priorities to keys in 𝒰.¹ The sketch of a set is specified by the priorities of a small set of keys with the lowest priorities. This information is related to the cardinality, as smaller lowest priorities in a subset U are indicative of a larger cardinality. Therefore cardinality estimates can be recovered from the sketch. We describe several common designs. Min Hash sketches (see the surveys (Cohen, 2008; 2023)) are suitable for insertions only (set unions and insertions of new elements) and are also suitable for sketch-based sampling. Domain sampling has priorities that are discretized sampling rates, and has the advantage that the sketch can be represented as random linear maps (specified by ρ) of the data vector and therefore supports deletions (negative entries in sketched vectors) (Ganguly, 2007).

¹Chakraborty et al. (2022) proposed a method for streaming cardinality estimates that does not require hashmaps, but the sketch is not composable.
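As a concrete instance of Definition 3.1, consider the k-mins sketch described next, whose sketch distribution is Exp[|U|]^k. The following standard factorization calculation (included here only for intuition; it is not taken from the paper) shows that the sum of the registers is a sufficient statistic:

```latex
% k-mins sketch: S_\rho(U) = (S_1,\dots,S_k) with S_i \sim \mathrm{Exp}[|U|] i.i.d.
% Joint density, factored through the statistic T(s) = \sum_{i=1}^k s_i:
f_{|U|}(s_1,\dots,s_k) \;=\; \prod_{i=1}^{k} |U|\, e^{-|U| s_i}
  \;=\; |U|^{k}\, e^{-|U|\, T(s)} .
% The density depends on s only through T(s), so by the Fisher--Neyman
% factorization theorem T is sufficient for |U|: conditioned on T(S) = t,
% the sketch is uniform on the simplex {s >= 0 : sum_i s_i = t},
% independent of |U|.
```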
Min Hash sketches. Types of Min Hash sketches:

k-mins (Flajolet & Martin, 1985; Cohen, 1997). k random hash functions h₁, …, h_k map each key x ∈ 𝒰 to i.i.d. samples from the domain of the hash function. The sketch S_ρ(U) of a set U is the list (min_{x∈U} hᵢ(x))_{i∈[k]} of minimum values of each hash function over the keys in U. When the hash values are drawn from Exp[1], the sketch distribution for a subset U is Exp[|U|]^k, that is, k i.i.d. exponentially distributed random variables with parameter |U|. The sum T(S) := ‖S‖₁ is a sufficient statistic for estimating the parameter |U|. An unbiased cardinality estimator is (k − 1)/T(S).

Bottom-k (Rosén, 1997; Cohen, 1997; Bar-Yossef et al., 2002). One random hash function h maps each x ∈ 𝒰 to i.i.d. samples from a distribution. The sketch {h(x) | x ∈ U}_{(1:k)} stores the k smallest hash values of keys x ∈ U. The k-th smallest value T(S) := {h(x) | x ∈ U}_{(k)} is a sufficient statistic for estimating |U|. When the distribution is U[0, 1], the unbiased cardinality estimate is (k − 1)/T(S).

k-partition (Flajolet et al., 2007a). One hash function P : 𝒰 → [k] randomly partitions keys into k parts, and one hash function h : 𝒰 → ℝ maps keys to i.i.d. Exp[1] samples. The sketch includes the minimum in each part, (min_{x∈U | P(x)=i} h(x))_{i∈[k]}.

Note that the choice of (continuous) distribution does not affect the information content of the sketch. Variations of these sketches store rounded/truncated numbers (HLL (Flajolet et al., 2007a) stores a maximum negated exponent). When studying vulnerabilities of query response algorithms, the result is stronger when the full-precision representation is available to them. The cardinality estimates obtained with these sketches have NRMSE error (1) of O(1/√k).

Definition 3.2 (bias of the sketch). We say that the sketch S_ρ(U) of a set U is biased up by a factor of 1/α when T(S_ρ(U)) ≤ αk/|U|, and we say that it is biased down by a factor of α when T(S_ρ(U)) ≥ (1/α)·k/|U|. For our purposes, α ≤ 1/2 would place the sketch at the δ = e^{−Ω(k)} tail of the distribution under sampling of ρ, and we say that U is adversarial for ρ.

Domain sampling. These cardinality sketches can be expressed as discretized bottom-k sketches, and therefore vulnerabilities of bottom-k sketches also apply to domain sampling sketches. The input is viewed as a vector of dimension |𝒰|, where the set U corresponds to its nonzero entries. The cardinality |U| is thus the sparsity (number of nonzero entries). The sketch map S_ρ is a dimensionality reduction via a random linear map (specified by ρ). We sample the domain 𝒰 = [n] with different rates p = 2^{−j}. For each rate, we collect a count c_j (capped at k) of the number of sampled keys from the set. This can be done by storing the first k distinct keys we see, or (approximately) by random hashing into a domain of size k and considering how many cells were hit. A continuous version, known as liquid legions (Kreuter et al., 2020), is equivalent to a bottom-k sketch: each key is assigned a random i.i.d. priority (the lowest sampling rate at which it is counted with domain sampling) and we seek the sampling rate at which we have k keys.
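A bottom-k sketch is equally short to write down. The following minimal Python sketch (our own illustrative code; the helper names are arbitrary) shows the sketch, its merge, and the unbiased estimate (k − 1)/T(S) for U[0, 1] hash values:

```python
import hashlib

def hash01(x, seed=0):
    """Deterministic 'uniform' hash of a key into (0, 1]."""
    d = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return (int.from_bytes(d[:8], "big") + 1) / 2.0**64

def bottom_k(keys, k, seed=0):
    """Bottom-k sketch: the k smallest hash values over the set."""
    return sorted(hash01(x, seed) for x in set(keys))[:k]

def merge(s1, s2, k):
    """Composability: the k smallest values of the union of two sketches."""
    return sorted(set(s1) | set(s2))[:k]

def estimate(sketch):
    """Unbiased estimate (k - 1) / T(S), where T(S) is the k-th smallest value."""
    return (len(sketch) - 1) / sketch[-1]

k = 256
s = merge(bottom_k(range(0, 60_000), k), bottom_k(range(40_000, 100_000), k), k)
print(round(estimate(s)))  # close to 100,000; NRMSE about 1/sqrt(k)
```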
Specifying keys for the sketch. Note that with all these sketch maps, the sketch of a set U is specified by a small subset U₀ ⊆ U of the lowest-priority keys in U. With k-mins and k-partition sketches, these are the keys arg min_{x∈U} hᵢ(x) for i ∈ [k]. With bottom-k sketches, these are the keys with the k smallest values in {h(x)}_{x∈U}. With domain sampling, these are the keys with the highest sampling rates. Note that |U₀| = O(k) ≪ |U| but S_ρ(U₀) = S_ρ(U).

4. Attack on the standard estimators

The standard cardinality estimators optimally use the content of the sketch. They can equivalently be viewed as reporting a sufficient statistic. We design a single-batch attack, described in Algorithm 1. The algorithm fixes a ground set N of keys. For r queries, it samples a random subset U ⊆ N where each u ∈ N is included independently with probability 1/2. It receives from the estimator the value of the sufficient statistic T(S_ρ(U)) := 1/M(S_ρ(U)) (we use the inverse of the cardinality estimate). For each key x ∈ N it computes a score A[x] that is the average value of T(S_ρ(U)) over all subsets where x ∈ U.

Algorithm 1: Attack standard estimators
  Input: ρ, n, r, T
  Fix a set N of n keys  // selected randomly, independently of ρ
  foreach key x ∈ N do  // initialize
      t[x] ← 0; c[x] ← 0
  foreach i = 1, …, r do
      U ← include each x ∈ N independently with prob 1/2
      foreach key x ∈ U do  // score keys
          t[x] ← t[x] + 1
          c[x] ← c[x] + T(S_ρ(U))
  return the keys in N ordered by average score A[x] = c[x]/t[x]

We show that for α > 0, an attack of size r = O(k/α²) produces an adversarial set whose sketch is biased up by a factor of 1/α (see Definition 3.2).

Theorem 4.1 (Utility of Algorithm 1). Consider Algorithm 1 with k-mins or bottom-k sketches and T(S) being the inverse of the cardinality estimate as specified in Section 3.1. For α > 0, set the parameters n = Ω((k/α)·log(kr)) and r = O(k/α²). Then with probability at least 0.99, the sketch S_ρ(U_α), where U_α ⊆ N is the set of the αn keys with lowest A[u] scores, is biased up by a factor of Ω(1/α): E[M(U_α)] = Θ(n).

Our analysis extends to the case when the estimator reports T(S_ρ(U)) with relative error O(1/√k). That is, as long as the estimates are sufficiently accurate (within the order of the accuracy guarantees of a size-k sketch), O(k) attack queries suffice.

Analysis Highlights. The proof of Theorem 4.1 is presented in Appendix A. The high-level idea is as follows. We establish that scores are correlated with the priorities of keys: the keys with the lowest priorities have, in expectation, lower scores. Therefore a prefix of the order will contain disproportionately more of them and overestimate the cardinality, and a suffix will contain disproportionately fewer of them and underestimate the cardinality. We consider, for each key x ∈ N, the distribution of T(S_ρ(U)) conditioned on x ∈ U. We bound from above the variance of these distributions and bound from below the gap in the means between the keys that have the lowest priority in N and the bulk of the other keys in N. We then apply Chebyshev's inequality to bound the number of rounds needed so that enough of the low-priority keys have lower average scores A[u] than most other keys. A nuance we overcame in the analysis was handling the dependence between the sketches of the different queries, which are selected from the same ground set.
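The following self-contained simulation (our own illustrative code; the parameter values are arbitrary and much smaller than those required by Theorem 4.1) runs Algorithm 1 against a k-mins sketch whose estimator reports the sufficient statistic, then evaluates the standard estimate on the prefix of keys with the lowest average scores. The prefix is typically overestimated several-fold:

```python
import numpy as np

def attack_standard_estimator(k=64, n=2000, r=4096, alpha=0.05, seed=1):
    rng = np.random.default_rng(seed)
    h = rng.exponential(1.0, size=(k, n))   # rho: k Exp(1) hashmaps on N

    def T(members):                          # sufficient statistic of S(U)
        return h[:, members].min(axis=1).sum()

    t = np.zeros(n)                          # times each key was queried
    c = np.zeros(n)                          # accumulated statistic
    for _ in range(r):                       # one non-adaptive batch
        U = rng.random(n) < 0.5
        TU = T(U)
        t += U
        c += U * TU

    order = np.argsort(c / t)                # ascending average score A[x]
    U_adv = order[: int(alpha * n)]          # prefix: suspected low-priority keys
    members = np.zeros(n, dtype=bool)
    members[U_adv] = True
    est = (k - 1) / T(members)               # standard estimate on the prefix
    print(f"true cardinality {U_adv.size}, estimate {est:.0f}")

attack_standard_estimator()
```

The only adaptive step is the final sort of the accumulated scores, matching the single-batch structure of the attack.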
5. Experimental Evaluation

In this section, we empirically demonstrate the efficacy of our proposed attack (Algorithm 1) against the HyperLogLog (HLL) sketch (Durand & Flajolet, 2003; Flajolet et al., 2007a) with the HLL++ estimator (Heule et al., 2013). This is the most widely utilized sketch for cardinality estimation in practice.

Given an accuracy parameter ϵ, the HLL sketch stores k = 1.04·ϵ⁻² values that are the negated exponents of a k-partition Min Hash sketch (described in Section 3.1). The HLL++ estimator is a hybrid that was introduced in order to improve accuracy on low-cardinality queries. When the sketch representation is sparse (fewer parts are populated), which is the case when the cardinality is lower than the sketch size, HLL++ uses the sketch as a hash table and estimates the cardinality based on the number of populated parts. This yields essentially precise values. When all parts are populated, HLL++ uses an estimator based on the Min Hash property. We set the size of our ground set n ≫ k to be in this relevant regime.

We conduct two primary experiments: (i) for a fixed sketch size, we analyze the efficacy of the attack with a varying number of queries; (ii) for different sketch sizes, we evaluate the effectiveness of the attack with the number of queries linearly proportional to the sketch size. In the following, we first provide a detailed explanation of the ingredients of our experimental setup, and subsequently present the results of each experiment.

Experiment setup. To generate the data, we ensure that a ground set of size at least 10·k is produced for a given sketch size k. The size of the ground set must be at least linearly larger than the sketch size to prevent the sketch from memorizing the entire dataset. Given the desired size of the ground set, we generate random strings over the English alphabet of a fixed length, where the length is chosen so that we can generate the desired number of distinct strings. We utilize an open-source implementation of the HLL++ algorithm from GitHub. In this implementation, the sketch is configured by the error rate ϵ ∈ (0, 1), and the sketch size k for error rate ϵ is 1.04/ϵ² (consistent with Flajolet et al. (2007a)).

5.1. Efficacy with a varying number of queries

In this experiment we examine the impact of varying the number of queries. The attack is executed with the same ground set for eight distinct query counts, where each count is a power of 4. At the conclusion, the algorithm generates scores and returns the keys sorted in ascending order of their scores. Keys with low scores correspond to low-priority keys, which tend to be present in queries whose estimate is biased up. By including these keys in the adversarial set, we can trick the estimator into thinking that it is seeing a sketch of a large set. Similarly, we can construct adversarial input sets by including keys with high scores and trick the estimator into thinking it is seeing a sketch of a small set.

We present two sets of plots corresponding to how the estimator overestimates or underestimates as keys are incrementally added to the adversarial input in increasing or decreasing order of their scores. We consider two different error rates: ϵ = 0.1 with corresponding sketch size k = 104, and ϵ = 0.05 with corresponding sketch size k = 416. We use the same ground set comprising 5000 keys for both sets of experiments. It is worth noting that the plot with one query, which oscillates around the line y = x (denoted by a dashed line), is close to a non-adversarial setting, and we can see that the estimates are within the specified error ϵ. Figure 1 reports cardinality estimates when keys are added incrementally in increasing order of their scores.
We can see that as we increase the number of queries, the gap between the estimated value and the y = x line (the actual value) widens. This gap indicates the overestimation error. Our algorithm constructs a more effective adversarial input with a larger number of queries, although the gain in effectiveness becomes marginal at some point. For example, for k = 104, we already see a substantial estimation error with 4096 queries. Figure 2 reports results when keys are added incrementally in decreasing order of their scores. The gap here corresponds to an underestimation error. We can see that the attacks are more effective with more queries.

[Figure 1. Attack on the HLL++ sketch and estimator, for a varying number of queries. Cardinality estimates for the prefix of keys with lowest average score after r = 4^i queries.]

[Figure 2. Attack on the HLL++ sketch and estimator, for a varying number of queries. Cardinality estimates for the prefix of keys with largest average score after r = 4^i queries.]

5.2. Efficacy of the attack with varying sketch sizes

In this section, our focus is on examining HLL++ with different sketch sizes; namely, we consider six different error rates corresponding to sketch sizes k = 2^i for i ranging from 6 to 11. For each sketch size k, we generate a ground set of size n = 10 · 10^⌈log₁₀(k)⌉ to ensure that the ground set is larger than the sketch size, so that the Min Hash component of the HLL++ estimator is used. In Figure 3, we report the ratio of the estimated size to the actual size for all subsets constructed as a prefix of the order of keys sorted by increasing average score A[x], with the number of queries fixed to 4k. In this cardinality regime, HLL++ is nearly unbiased, and we expect a ratio close to 1 when the queries are not adversarial. However, by running attacks with a sufficient number of queries (linear in the sketch size), we are able to identify low-priority keys and trick the estimator into reporting an estimate much higher than the actual size.

[Figure 3. Attack on HLL++ for varying sketch sizes, using a number of queries equal to 4 times the sketch size.]

6. Attack Setup Against Strategic Estimators

We design attacks that apply generally against any query response (QR) algorithm. The attacks are effective even when the specifics of the attack and the full internal state of the attack algorithm are shared with the QR algorithm, including the per-step distribution from which the attacker selects each query. Moreover, we can even assume that the QR algorithm is provided with an enhanced sketch that includes the identities of the low-priority keys that determined the sketch, and that after the QR algorithm responds to a query, the full query set U is shared with it. The only requirement on the QR algorithm is that it selects correct response maps (with respect to the query distribution). Note that such a powerful QR algorithm precludes attacks that use queries of fixed cardinality, since the QR algorithm can simply return that cardinality value without even considering the actual input sketch.² Moreover, the task of the QR algorithm is the following problem, which is more specialized than cardinality estimation:

Problem 6.1 (Soft Threshold A). Return 0 when |U| ≤ A and 1 when |U| ≥ 2A.

²Our attack in Algorithm 1 is also precluded, since a fixed response of n/2 satisfies the requirements (the cardinality is Binom(n, p), and for n ≫ k, all queries have size close to n/2).
Remark 6.2. Soft Threshold can be solved via a cardinality estimate with a multiplicative error of at most √2, by reporting 1 when the estimate is larger than √2·A and 0 otherwise. When estimates are computed from cardinality sketches with randomness ρ that does not depend on the queries, a sketch of size k = Θ(log(1/δ)) is necessary and sufficient for providing correct responses with probability 1 − δ.

Attack Framework. We describe the attack framework here; the attacks themselves are specified in Sections 7 and 8. We model the interaction as a process between three parties: the Attacker, the QR algorithm, and the System. The Attacker fixes a ground set N from which it samples query sets. The product of the attack is a subset M ⊆ N which we refer to as a mask. The aim is for the mask to have size much smaller than the query subset sizes, and the property that for a uniformly sampled U ⊆ N (|U| ≫ |M|), the information in the sketch S_ρ(M ∪ U) is insufficient for estimating |U|. The attack proceeds in steps, each following Algorithm 2:

Algorithm 2: Attack Interaction Step
  1. Attacker specifies a distribution D with support in 2^N and sends it to QR. It then selects a query U ∼ D and sends it to System.
  2. QR selects a correct map, with δ = O(1/√k) (as in Definition 6.3), from sketches to probabilities S ↦ π(S) ∈ [0, 1]. The selection may depend on the prior interaction transcript and on D. If there is no correct map, QR reports failure and halts.
  3. System computes the sketch S_ρ(U) and sends it to QR.
  4. QR sends Z ∼ Bern[π(S_ρ(U))] to Attacker.
  5. Attacker shares U and its internal state with QR.

Definition 6.3 (Correct Map). We say that the map π is correct for A, δ, and query distribution D if, for any cardinality value, over the query distribution conditioned on that value, it returns a correct response to the soft threshold problem with threshold A with probability at least 1 − δ. That is,

$$\text{for } c < A:\quad \mathbb{E}_{U\sim D \,\mid\, |U|=c}\!\left[\pi(S_\rho(U))\right] \le \delta; \qquad \text{for } c > 2A:\quad \mathbb{E}_{U\sim D \,\mid\, |U|=c}\!\left[\pi(S_\rho(U))\right] \ge 1-\delta .$$

Remark 6.4 (Many correct maps). There can be multiple correct maps, and QR may choose any one of them at any step. Since the output when |U| ∈ [A, 2A] is not specified, the probability E_{U∼D}[π(S_ρ(U))] of reporting Z = 1 may vary by up to Pr_{U∼D}[|U| ∈ [A, 2A]] + δ between correct maps.

Recall that our attack on the standard estimators (Algorithm 1) issued a single batch of queries (all drawn from the same pre-specified distribution D₀). We show that multiple batches are necessary to attack general QR algorithms:

Lemma 6.5 (Multiple batches are necessary). Any attack of polynomial size in k on a soft threshold estimator must use multiple batches.

Proof. When there is a single batch of r queries, we can apply the standard estimator while accessing only a component of the sketch of size k′ = O(log(r/δ)) and obtain correct responses on all queries. This component is the same k′ hash functions with k-mins sketches, only k′ parts in a k-partition sketch, or the bottom-k′ values in a bottom-k sketch. Since the query responses only leak information on this component, the attacker can only compromise that component in a single-batch attack. Therefore, a number of queries exponential in k is needed in order to construct an adversarial input in a single batch. ∎
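For intuition about Definition 6.3, here is one concrete correct map (our own construction, following Remark 6.2; the constants are chosen for illustration): a deterministic π that thresholds the standard bottom-k estimate.

```python
import math

def make_correct_map(A, k):
    """A deterministic pi (a special case of a map into [0, 1]): threshold
    the standard bottom-k estimate at sqrt(2) * A, following Remark 6.2.
    With U[0,1] hash values, the estimate (k - 1) / h_(k) has relative
    error O(1/sqrt(k)), so for k = Omega(log(1/delta)) the two conditions
    of Definition 6.3 hold for any query distribution."""
    def pi(sketch):            # sketch: sorted list of the k smallest hashes
        estimate = (k - 1) / sketch[-1]
        return 1.0 if estimate > math.sqrt(2) * A else 0.0
    return pi
```

A strategic QR algorithm is free to switch between such maps from step to step, which is exactly what the attacks below must overcome.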
7. Single-batch attack on symmetric QR

Algorithm 3 specifies a single-batch attack. We establish that the attack succeeds when we set the size r = O(k²) and QR is constrained to be symmetric (see Definition 7.2). In essence, symmetry means that the QR algorithm does not make a significant distinction between components of the sketch. Symmetry excludes strategies that distinguish between components of the sketch as in the proof of Lemma 6.5, but it still allows for flexibility, including randomly selecting components of the sketch. In Section 8 we extend this attack to an adaptive attack that works against any QR algorithm.

The attacker initializes the scores C[x] ← 0 of all keys x ∈ N in the ground set. Each query is formed by sampling a rate q (as described in Algorithm 5) and selecting a random subset U ⊆ N so that each key in N is included independently with probability q. The attacker receives Z and increments by Z the score C[x] of every key x ∈ U. The final product is the set M of keys whose scores are higher by Ω(r/k) than the median score.

Algorithm 3: Single Batch Attacker
  Input: ρ, n, r
  Select a set N of n keys  // randomly from 𝒰
  A ← n/16
  foreach key x ∈ N do C[x] ← 0  // initialize
  for i = 1, …, r do
      Sample q as specified in Algorithm 5  // using A, n
      U ← include each u ∈ N independently with prob q
      send U  // to System
      receive Z  // from the symmetric query response
      foreach key x ∈ U do C[x] ← C[x] + Z  // score
  C̄ ← median{C[x] : x ∈ N}  // compute the median score
  return M ← {x ∈ N | C[x] > C̄ + Ω(r/k)}  // mask

Theorem 7.1 (Utility of Algorithm 3 with symmetric maps). For α > 0, set n = Ω((k/α)·log(kr)) and r = Ω(k²/α²). Then

Pr[(S_ρ(M) = S_ρ(N)) ∧ (|M| < αn)] ≥ 0.99.

7.1. Proof Overview

See Appendix B for details. We work with the rank-domain representations of the sketches with respect to the ground set N. This representation simplifies our analysis, as it depends only on the rank order of keys by their hash values and thereby factors out the hash values themselves. The rank-domain sketches S^R(U) have the form (Y₁, …, Y_k), where the Yᵢ are positive integers in [n]. The sketch distribution over the sampling of U for fixed q is that of k independent Geom[q] random variables Yᵢ. The sum T = Σ_{i=1}^k Yᵢ is a sufficient statistic for q from the sketch.

Definition 7.2 (symmetric map). A map π is symmetric if (i) it uses the rank-domain sketch as an unordered set, and (ii) it is monotone, in the sense that if a sketch S₁ dominates S₂ coordinate-wise then π(S₁) ≤ π(S₂).

We denote by N₀′ the set of k lowest-rank (lowest-priority) keys: the bottom-k keys with bottom-k sketches, or the minimum-hash key of each of the k hashmaps with k-mins and k-partition sketches. Note that S_ρ(N) = S_ρ(N₀′). We denote by N⁻ ⊆ N a set of keys that are transparent, that is, very unlikely to influence the sketch if included in the attack subsets of Algorithm 3. We show that a key in N₀′ obtains in expectation a higher score than a key in N⁻, and use that to establish the utility claim:

Lemma 7.3 (separation with symmetric maps). Let π be correct and symmetric (Definition 7.2). Then for any m ∈ N₀′ and u ∈ N⁻,

$$\mathbb{E}_U\!\left[\pi(S_\rho(U))\,\mathbf{1}(m\in U)\right] - \mathbb{E}_U\!\left[\pi(S_\rho(U))\,\mathbf{1}(u\in U)\right] = \Omega(1/k).$$

Proof. See Appendix B.5. ∎

The gap only holds on average for general correct maps (Lemma B.6), but it holds per key when specialized to symmetric maps.

Proof of Theorem 7.1. The distribution of the score C[x] of all transparent keys u ∈ N⁻ is identical: a sum of r independent Bernoulli (Poisson trial) random variables, Σ_{i=1}^r Zᵢ with Zᵢ ∼ Bern[qᵢ]. From Lemma 7.3, E[C[m]] − E[C[u]] = Ω(r/k) for m ∈ N₀′ and u ∈ N⁻. For x ∈ N₀′, the per-step gap random variables may be dependent across steps, but the lower bound on the expected gap from Lemma 7.3 holds even conditioned on the transcript. Additionally, the per-step gap is bounded in [−1, 1]. We can apply Chernoff bounds (Chernoff, 1952) to bound the probability that a sum deviates by more than λ from its expectation:

$$\Pr\left[\,\left|C[x] - \mathbb{E}[C[x]]\right| \ge \lambda\,\right] \le 2e^{-2\lambda^2/r}. \tag{3}$$

Setting λ = c·r/k separates a key in N₀′ from a key in N⁻ with probability 1 − 2e^{−2c²r/k²}. Choosing r = O(k² log |N|), the order separates, with high probability, all the keys in N₀′ from all the keys in N⁻. Note that there are only O(k) non-transparent keys N₀ := N \ N⁻. Therefore, with high probability, N₀′ ⊆ M ⊆ N₀ and (we can fix constants so that) |M| = O(k) ≤ αn. Since N₀′ ⊆ M, we have S_ρ(M) = S_ρ(N). ∎

8. Adaptive Attack on General QR

An attack on general QR algorithms is given in Algorithm 4. The attacker maintains an initially empty set M ⊆ N of keys, which we refer to as the mask. The query sets have the form M ∪ U, where U ∼ D₀ is sampled and scored as in Algorithm 3. A key is added to M when its score separates from the median score.

Algorithm 4: Adaptive Attacker
  Input: ρ, n, r
  Select a set N of n keys  // randomly from 𝒰
  A ← n/16; M ← ∅
  foreach key x ∈ N do C[x] ← 0  // initialize
  for i = 1, …, r do
      Sample U ∼ D₀ as in Algorithm 3
      send M ∪ U to System
      receive Z from QR; if failure then exit
      foreach key x ∈ U do  // score keys
          C[x] ← C[x] + Z
          if C[x] ≥ median(C[N \ M]) + √(i·log(200nr)/2) then  // test if the score is high
              M ← M ∪ {x}
      send M, C, U to QR  // share internal state
  return M  // mask

We establish the following:

Theorem 8.1 (Utility of Algorithm 4). For α > 0, set n = Ω((k/α)·log(kr)) and r = Ω(k²/α²). Then with probability at least 0.99, |M| < αn and there is no correct map for the query distribution of M ∪ U, where U ∼ D₀ is as in Algorithm 3.

We overview the proof; details are deferred to Appendix B. The condition for adding a key to M is such that, with probability at least 0.99, only N₀ keys are placed in M, so |M| ≤ αn (Claim B.9). If the QR algorithm fails, there is no correct map for the distribution of M ∪ U.³ It remains to consider the case where the attack is not halted. Since the mask M is shared with QR for free, QR only needs to estimate |U| (or q). But the sketch of M ∪ U partially masks the sketch of U. The set of non-transparent keys N₀′ ⊆ N₀ decreases as M increases. Additionally, the effective sketch size k′ ≤ k decreases (that is, QR only obtains k′ i.i.d. Geom[q] random variables). Recall that when k′ < log(k)/2, there is no correct map.

With general correct maps, we can only establish a weaker, average version of the score gap over the N₀′ keys. This allows some N₀′ keys to remain indistinguishable by score from transparent keys. But what works in the Attacker's favour is that, in this case, the scores of other N₀ keys must increase faster. Let p(π, M, x) be the probability that key x is scored with map π and mask M. This probability is the same for all transparent keys x ∈ N⁻, and we denote it by p⁻(π, M). We establish (see Lemma B.8) that for a correct map π for the distribution of M ∪ U it holds that

$$\sum_{x \in N_0'} \left( p(\pi, M, x) - p^-(\pi, M) \right) = \Omega(1) .$$

Therefore, in r = O(k²) steps, the combined score advantage of the N₀ keys is (well concentrated around) Ω(r). But crucially, no single key can accumulate too much advantage: once C[x] − C̄ = Ω(k) (where C̄ is the median score), key x is placed in the mask M, exits N₀′, and stops being scored. Therefore, if QR does not fail, Ω(r/k) > |N₀| keys are eventually placed in M, which must include all the N₀ keys.

³A situation with no correct map can be identified by the Attacker by tracking the error rate of QR, even if failure is not declared by QR.
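The following schematic simulation (our own code; the QR stub, the stand-in for the Algorithm 5 sampler, which is not included in this excerpt, and all constants are assumptions for illustration) traces the control flow of Algorithm 4: score on Z, compare against the median plus the deviation term, and move separated keys into the mask M.

```python
import numpy as np

def adaptive_attack(k=32, n=4096, seed=7):
    rng = np.random.default_rng(seed)
    h = rng.random(n)                       # rho: one U[0,1] hash (bottom-k)
    A, r = n / 16, 4 * k * k

    def qr_response(members):
        """Stub QR: the fixed correct map of Remark 6.2 applied to the
        bottom-k sketch of the queried set (a real QR may switch maps)."""
        vals = np.sort(h[members])
        if vals.size < k:
            return 0
        return int((k - 1) / vals[k - 1] > np.sqrt(2) * A)

    C = np.zeros(n)
    in_mask = np.zeros(n, dtype=bool)
    for i in range(1, r + 1):
        # Stand-in for the Algorithm 5 sampler: q log-uniform so that
        # the query cardinality straddles the threshold range [A, 2A].
        q = np.exp(rng.uniform(np.log(A / (2 * n)), np.log(4 * A / n)))
        U = rng.random(n) < q
        Z = qr_response(U | in_mask)        # query M ∪ U
        C[U] += Z                           # score the sampled keys
        thresh = np.median(C[~in_mask]) + np.sqrt(i * np.log(200 * n * r) / 2)
        in_mask |= U & (C >= thresh)        # separated keys join the mask
    return np.flatnonzero(in_mask)

M = adaptive_attack()
print(f"|M| = {M.size} masked keys (low-priority keys of the ground set)")
```

Once M contains the lowest-priority keys of N, the sketch of M ∪ U no longer depends on |U|, which is the masking effect the theorem formalizes.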
9. Conclusion

We demonstrated the inherent vulnerability of the known composable cardinality sketches to adaptive inputs. We designed attacks that use a number of queries that asymptotically matches the upper bounds: a linear number of queries with the standard estimators, and a quadratic number of queries with any estimator applied to the sketch. Empirically, our attacks are simple and effective with small constants. An interesting direction for further study is to show that this vulnerability applies with any composable sketch structure. On the positive side, we suspect that restricting the maximum number of queries that any one key can participate in to sublinear (with standard estimators) or subquadratic (with general estimators) would enhance robustness.

Acknowledgements

The authors are grateful to Jelani Nelson and Uri Stemmer for discussions. Edith Cohen is partially supported by the Israel Science Foundation (grant no. 1156/23).

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Ahn, K. J., Guha, S., and McGregor, A. Analyzing graph structure via linear measurements. In Proceedings of the 2012 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 459-467, 2012. doi: 10.1137/1.9781611973099.40.

Apache Software Foundation. DataSketches, Accessed: 2024. URL https://datasketches.apache.org. Apache Software Foundation Documentation.

Athalye, A., Engstrom, L., Ilyas, A., and Kwok, K. Synthesizing robust adversarial examples. In International Conference on Machine Learning, pp. 284-293. PMLR, 2018.

Attias, I., Cohen, E., Shechner, M., and Stemmer, U. A framework for adversarial streaming via differential privacy and difference estimators. CoRR, abs/2107.14527, 2021.

Bar-Yossef, Z., Jayram, T. S., Kumar, R., Sivakumar, D., and Trevisan, L. Counting distinct elements in a data stream. In RANDOM. ACM, 2002.

Bassily, R., Nissim, K., Smith, A. D., Steinke, T., Stemmer, U., and Ullman, J. R. Algorithmic stability for adaptive data analysis. SIAM J. Comput., 50(3), 2021. doi: 10.1137/16M1103646.

Beimel, A., Kaplan, H., Mansour, Y., Nissim, K., Saranurak, T., and Stemmer, U. Dynamic algorithms against an adaptive adversary: Generic constructions and lower bounds. CoRR, abs/2111.03980, 2021.

Ben-Eliezer, O., Eden, T., and Onak, K. Adversarially robust streaming via dense-sparse trade-offs. CoRR, abs/2109.03785, 2021a.

Ben-Eliezer, O., Jayaram, R., Woodruff, D. P., and Yogev, E. A framework for adversarially robust streaming algorithms. SIGMOD Rec., 50(1):6-13, 2021b.

Boneh, D. and Shaw, J. Collusion-secure fingerprinting for digital data. IEEE Trans. Inf. Theory, 44(5):1897-1905, 1998. doi: 10.1109/18.705568.

Chakraborty, S., Vinodchandran, N. V., and Meel, K. S. Distinct elements in streams: An algorithm for the (text)book. In ESA, LIPIcs. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022. doi: 10.4230/LIPIcs.ESA.2022.34.

Cherapanamjeri, Y. and Nelson, J. On adaptive distance estimation.
In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, virtual, 2020.

Chernoff, H. A measure of the asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Math. Statistics, 23:493-509, 1952.

Cohen, E. Size-estimation framework with applications to transitive closure and reachability. Journal of Computer and System Sciences, 55:441-453, 1997.

Cohen, E. Min-Hash Sketches, pp. 1-7. Springer US, Boston, MA, 2008. ISBN 978-3-642-27848-8. doi: 10.1007/978-3-642-27848-8_573-1.

Cohen, E. All-distances sketches, revisited: HIP estimators for massive graphs analysis. TKDE, 2015. URL http://arxiv.org/abs/1306.3284.

Cohen, E. Stream sampling framework and application for frequency cap statistics. ACM Trans. Algorithms, 14(4):52:1-52:40, 2018. ISSN 1549-6325. doi: 10.1145/3234338.

Cohen, E. Sampling big ideas in query optimization. In Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2023), pp. 361-371. ACM, 2023. doi: 10.1145/3584372.3589935.

Cohen, E. and Geri, O. Sampling sketches for concave sublinear functions of frequencies. In NeurIPS, 2019.

Cohen, E., Lyu, X., Nelson, J., Sarlós, T., Shechner, M., and Stemmer, U. On the robustness of CountSketch to adaptive inputs. In ICML, volume 162 of Proceedings of Machine Learning Research, pp. 4112-4140. PMLR, 2022a.

Cohen, E., Nelson, J., Sarlós, T., and Stemmer, U. Tricking the hashing trick: A tight lower bound on the robustness of CountSketch to adaptive inputs. arXiv:2207.00956, 2022b. doi: 10.48550/ARXIV.2207.00956.

Desfontaines, D., Lochbihler, A., and Basin, D. A. Cardinality estimators do not preserve privacy. In Privacy Enhancing Technologies Symposium, volume 19, 2019. doi: 10.2478/popets-2019-0018.

Durand, M. and Flajolet, P. Loglog counting of large cardinalities (extended abstract). In ESA, 2003.

Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.

Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., and Roth, A. L. Preserving statistical validity in adaptive data analysis. In STOC, pp. 117-126. ACM, 2015a.

Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., and Roth, A. L. Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC '15, pp. 117-126, New York, NY, USA, 2015b. ACM. ISBN 9781450335362. doi: 10.1145/2746539.2746580.

Flajolet, P. and Martin, G. N. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31:182-209, 1985.

Flajolet, P., Fusy, E., Gandouet, O., and Meunier, F. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. Discrete Mathematics & Theoretical Computer Science, (Proceedings), 2007a.

Flajolet, P., Fusy, E., Gandouet, O., and Meunier, F. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm. In Analysis of Algorithms (AofA). DMTCS, 2007b.

Freedman, D. A.
A note on screening regression equations. The American Statistician, 37(2):152-155, 1983. doi: 10.1080/00031305.1983.10482729.

Ganguly, S. Counting distinct items over update streams. Theoretical Computer Science, 378(3):211-222, 2007. ISSN 0304-3975. doi: 10.1016/j.tcs.2007.02.031.

Gawrychowski, P., Mozes, S., and Weimann, O. Minimum cut in O(m log² n) time. In ICALP, volume 168 of LIPIcs, pp. 57:1-57:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Google Cloud. BigQuery Documentation: Approximate Aggregate Functions, Accessed: 2024. URL https://cloud.google.com/bigquery/docs/reference/standard-sql/approximate_aggregate_functions. Google Cloud Documentation.

Gutenberg, M. P. and Wulff-Nilsen, C. Decremental SSSP in weighted digraphs: Faster and against an adaptive adversary. In Proceedings of the Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '20, pp. 2542-2561, USA, 2020. Society for Industrial and Applied Mathematics.

Hardt, M. and Ullman, J. Preventing false discovery in interactive data analysis is hard. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science (FOCS), pp. 454-463. IEEE Computer Society, 2014. doi: 10.1109/FOCS.2014.55.

Hardt, M. and Woodruff, D. P. How robust are linear sketches to adaptive inputs? In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, STOC '13, pp. 121-130, New York, NY, USA, 2013. ACM. doi: 10.1145/2488608.2488624.

Hassidim, A., Kaplan, H., Mansour, Y., Matias, Y., and Stemmer, U. Adversarially robust streaming algorithms via differential privacy. In Annual Conference on Advances in Neural Information Processing Systems (NeurIPS), 2020.

Heule, S., Nunkesser, M., and Hall, A. HyperLogLog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. In EDBT, 2013.

Indyk, P. and Motwani, R. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th Annual ACM Symposium on Theory of Computing, pp. 604-613. ACM, 1998.

Ioannidis, J. P. A. Why most published research findings are false. PLoS Med, 2(8), 2005.

Janson, S. Tail bounds for sums of geometric and exponential variables, 2017. URL https://arxiv.org/abs/1709.08157.

Jayaram, R. and Woodruff, D. P. Towards optimal moment estimation in streaming and distributed models. ACM Trans. Algorithms, 19(3), June 2023. ISSN 1549-6325. doi: 10.1145/3596494.

Kane, D. M., Nelson, J., and Woodruff, D. P. An optimal algorithm for the distinct elements problem. In PODS, 2010.

Knop, A. and Steinke, T. Counting distinct elements under person-level differential privacy. CoRR, abs/2308.12947, 2023. doi: 10.48550/ARXIV.2308.12947.

Knuth, D. E. The Art of Computer Programming, Vol 2, Seminumerical Algorithms. Addison-Wesley, 2nd edition, 1998.
Kreuter, B., Wright, C. W., Skvortsov, E. S., Mirisola, R., and Wang, Y. Privacy-preserving secure cardinality and frequency estimation. Technical report, Google, LLC, 2020.

Lukacs, P. M., Burnham, K. P., and Anderson, D. R. Model selection bias and Freedman's paradox. Annals of the Institute of Statistical Mathematics, 62(1):117, 2009.

Mironov, I., Naor, M., and Segev, G. Sketching in adversarial environments. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, STOC '08, pp. 651-660, New York, NY, USA, 2008. ACM. doi: 10.1145/1374376.1374471.

Nelson, J., Nguyễn, H. L., and Woodruff, D. P. On deterministic sketching and streaming for sparse recovery and norm estimation. Lin. Alg. Appl., 441:152-167, January 2014. Preliminary version in RANDOM 2012.

Pagh, R. and Stausholm, N. M. Efficient differentially private F₀ linear sketching. In 24th International Conference on Database Theory (ICDT 2021), volume 186 of LIPIcs, pp. 18:1-18:19, Dagstuhl, Germany, 2021. Schloss Dagstuhl - Leibniz-Zentrum für Informatik. doi: 10.4230/LIPIcs.ICDT.2021.18.

Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., and Swami, A. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506-519, 2017.

Paterson, K. G. and Raynal, M. HyperLogLog: Exponentially bad in adversarial settings. Cryptology ePrint Archive, Paper 2021/1139, 2021. URL https://eprint.iacr.org/2021/1139.

Pettie, S. and Wang, D. Information theoretic limits of cardinality estimation: Fisher meets Shannon. In STOC '21: 53rd Annual ACM SIGACT Symposium on Theory of Computing, pp. 556-569. ACM, 2021. doi: 10.1145/3406325.3451032.

Reviriego, P. and Ting, D. Security of HyperLogLog (HLL) cardinality estimation: Vulnerabilities and protection. IEEE Commun. Lett., 24(5):976-980, 2020. doi: 10.1109/LCOMM.2020.2972895.

Rosén, B. Asymptotic theory for order sampling. J. Statistical Planning and Inference, 62(2):135-158, 1997.

Shiloach, Y. and Even, S. An on-line edge-deletion problem. J. ACM, 28(1):1-4, January 1981. doi: 10.1145/322234.322235.

Smith, A., Song, S., and Guha Thakurta, A. The Flajolet-Martin sketch itself preserves differential privacy: Private counting with minimal space. In Advances in Neural Information Processing Systems, volume 33, pp. 19561-19572. Curran Associates, Inc., 2020.

Steinke, T. and Ullman, J. Interactive fingerprinting codes and the hardness of preventing false discovery. In Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pp. 1588-1628, Paris, France, 2015. PMLR.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Wajc, D. Rounding dynamic matchings against an adaptive adversary. In STOC. Association for Computing Machinery, New York, NY, USA, 2020. doi: 10.1145/3357713.3384258.

Woodruff, D. P. and Zhou, S. Tight bounds for adversarially robust streams and sliding windows via difference estimators. In Proceedings of the 62nd IEEE Annual Symposium on Foundations of Computer Science (FOCS), 2021.

A. Analysis of the Attack on the Standard Estimators

This section includes the proof of Theorem 4.1. We first consider k-mins sketches and T(S) = ‖S‖₁. The modifications needed for bottom-k sketches are in Section A.4.

A.1. Preliminaries

The following order statistics properties are useful for analyzing Min Hash sketches. Let Xᵢ ∼ Exp[1] for i ∈ [n] be i.i.d. random variables. Then the minimum value and the differences between consecutive order statistics (smallest values) are independent random variables with distributions

$$\Delta_1 := \min_{i\in[n]} X_i \sim \mathrm{Exp}[n], \qquad \Delta_{i+1} := \{X_j\}_{(i+1)} - \{X_j\}_{(i)} \sim \mathrm{Exp}[n-i] \quad (i \ge 1).$$

Lemma A.1 (Chebyshev's Inequality). $\Pr\left[\,|Z - \mathbb{E}[Z]| \ge c\,\sigma\,\right] \le 1/c^2$.

We set some notation. For a fixed ground set N and randomness ρ, for each hash function i ∈ [k], let m_j^i ∈ N be the key with the j-th rank in the i-th hashmap; that is, hᵢ(m_j^i) is the j-th smallest value in {hᵢ(u)}_{u∈N}. Let

L := log₂(rk) + 10,
N₀^i := {m_j^i}_{j ≤ L} for i ∈ [k],
N₀ := ∪_{i∈[k]} N₀^i, N⁻ := N \ N₀

be a rank threshold L; for each i ∈ [k], the set N₀^i of keys with rank at most L in the i-th hashmap; the set N₀ that is the union of these keys across hashmaps; and the set N⁻ of the remaining keys of N.

We show that a choice of n = O(kL/α) ensures that certain properties that simplify our analysis hold. Our analysis applies under the event that these properties are satisfied:

Lemma A.2 (Good draws). For n = Ω((k/α)·log(rk)), the following hold with probability at least 0.99:

p1 (property of ρ and N): the keys m_j^i for i ∈ [k] and j ≤ L are distinct.
p2: in a run of Algorithm 1, in all r steps and for all i ∈ [k], U includes a key from N₀^i.
p3: n ≥ 3kL/α.

Proof. p3 is immediate. For p1, if we set n ≥ (kL)²/p then the claim follows with probability 1 − p by a birthday-paradox argument. For p2, the probability that a random U does not include one of the L smallest-value keys of a particular hash function is 2^{−L}. The probability that this happens for any of the k hash functions in any of the r rounds is at most rk·2^{−L}. Substituting L = log₂(rk) + 10, we get the claim. ∎

For fixed N and ρ, consider the random variable Z := T(S_ρ(U)) over the sampling of U, and the contribution Zᵢ of hash function i ∈ [k] to Z:

Zᵢ := min_{x∈U} hᵢ(x).

For a key u ∈ N, we consider the random variables Zᵢ | u ∈ U and Z | u ∈ U that are conditioned on u ∈ U. From property p1 of Lemma A.2, the Zᵢ are independent, and they remain independent when conditioned on u ∈ U.

A.2. Proof outline

We will need the following two lemmas (the proofs are deferred to Section A.3). Intuitively, we expect E_U[Z] to be lower when conditioned on m_1^i ∈ U; we bound this gap from below. For fixed ρ and N, let

G(u, v) := E_{U|u∈U}[Z] − E_{U|v∈U}[Z].

Lemma A.3 (Expectations gap bound). For each i ∈ [k] and δ > 0, with probability at least 1 − δ over ρ and N,

min_{u∈N⁻} G(u, m_1^i) ≥ δ/(3n).

We bound from above the maximum over u ∈ N of Var_{U|u∈U}[Z]:
A.2. Proof outline

We will need the following two lemmas (the proofs are deferred to Section A.3). Intuitively, we expect $E_U[Z]$ to be lower when conditioned on $m^i_1 \in U$. We bound this gap from below. For fixed $\rho$ and $N$, let $G(u, v) := E_{U \mid u \in U}[Z] - E_{U \mid v \in U}[Z]$.

Lemma A.3 (Expectations gap bound). For each $i \in [k]$ and $\delta > 0$,

$\Pr\Big[\min_{u \in N'} G(u, m^i_1) \leq \frac{\delta}{3n}\Big] \leq \delta$.

We bound from above the maximum over $u \in N$ of $\mathrm{Var}_{U \mid u \in U}[Z]$:

Lemma A.4 (Variance bound). For $\delta > 0$ there is a constant $c$ such that, with probability at least $1 - \delta$ over $\rho, N$,

$\max_{u \in N} \mathrm{Var}_{U \mid u \in U}[Z] \leq c\Big(1 + \frac{1}{\sqrt{k\delta}}\Big)\frac{k}{n^2}$.

We use the following to bound the number $r$ of attack queries needed so that the sorted order by average score separates the minimum hash keys from the bulk of the keys in $N'$:

Lemma A.5 (Separation). Let $\alpha > 0$. Assume that

- $\min_{u \in N'} G(u, m^i_1) \geq M > 0$,
- $\max_{u \in N} \mathrm{Var}_{U \mid u \in U}[Z] \leq V^2$, and
- during Algorithm 1, the keys $u \in N'$ and $m^i_1$ are selected in $U$ in at least $r' \geq \frac{2V^2}{\alpha M^2}$ rounds each.

Then $\Pr[A[u] < A[m^i_1]] \leq \alpha$.

Proof. Consider the random variable $Y = A[u] - A[m^i_1]$. From our assumptions, $E[Y] \geq M$ and $\mathrm{Var}[Y] \leq 2V^2/r'$. Using Chebyshev's inequality,

$\Pr[Y < 0] \leq \Pr\big[|Y - E[Y]| \geq E[Y]\big] \leq \Pr\big[|Y - E[Y]| \geq M\big] \leq \frac{2V^2}{r' M^2} \leq \alpha$.

We are now ready to conclude the utility proof of Algorithm 1:

Lemma A.6 (Utility of Algorithm 1). For $\alpha, \delta > 0$, consider Algorithm 1 with $r = O(\frac{k}{\delta^2 \alpha})$. Then for each $i \in [k]$, with probability $1 - \delta$, for any $u \in N'$ we have $\Pr[A[u] < A[m^i_1]] \leq \alpha$.

Proof. Consider a key $m^i_1$ and a key $u \in N'$. We bound the probability that $A[u] < A[m^i_1]$. From Lemma A.4, with probability $1 - 1/k$ we have $V^2 = O(k/n^2)$. From Lemma A.3, for each $i$, with probability $1 - \delta$, we have $M \geq \delta/(3n)$. If we choose $r = 4r'$ in Algorithm 1, then with high probability each key $x \in N$ is selected to $U$ at least $r'$ times. The claim follows from Lemma A.5 by setting $r' = O(\frac{k}{\delta^2\alpha})$.

We are now ready to conclude the proof of Theorem 4.1. Recall that a subset $U$ of keys of size $M' = 2\alpha n$ has $T(S(U))$, over random $\rho$, concentrated around $k/M'$ with standard error $\sqrt{k}/M'$. We now consider the prefix $U$ of $M' = 2\alpha n$ keys in the sorted order by average scores. This selection has $E[T(S(U))] \leq 3\alpha \cdot \frac{k}{2\alpha n}$. To see this, note that a $(1-\delta)$ fraction of the MinHash values are the same as in the sketch of $N$ and therefore have expected value $1/n$. The remaining $\delta$ fraction have expected value $1/(2\alpha n)$. Therefore

$E[T_\rho(S(U))] = k\Big(\frac{1-\delta}{n} + \frac{\delta}{2\alpha n}\Big) = \frac{k}{2\alpha n}\big(\delta + (1-\delta)2\alpha\big)$.

Therefore $T$ is smaller than its nominal value by a factor of $\delta + 2\alpha$, which for small $\alpha$ and $\delta < \alpha$ is a large multiplicative error in the resulting cardinality estimate.

A.3. Proofs of Lemma A.3 and Lemma A.4

For fixed $\rho$ (and $N$), for $i \in [k]$ and $j \in [n-1]$, denote by

$\Delta^i_1 := \min_{x \in N} h_i(x) = h_i(m^i_1)$
$\Delta^i_j := \{h_i(x) \mid x \in N\}_{(j)} - \{h_i(x) \mid x \in N\}_{(j-1)} = h_i(m^i_j) - h_i(m^i_{j-1})$ for $j > 1$

the gap between the $(j-1)$st and $j$th smallest values in $\{h_i(x)\}_{x \in N}$.

Lemma A.7 (Properties of $\Delta^i_j$). The random variables $\Delta^i_j$, $i \in [k]$, $j \in [L]$, over the sampling of $\rho$, are independent with distributions $\Delta^i_j \sim \mathsf{Exp}[n-j+1]$. This holds also when conditioning on properties p1 and p2 of Lemma A.2.

Proof. It follows from properties of the exponential distribution that over the sampling of $\rho$, the $\Delta^i_j \sim \mathsf{Exp}[n-j+1]$ are independent random variables for all $i, j$. Note that p1 and p2 are independent of the actual values of $h_i(m^i_j)$ and depend only on the rank order.

We can now express the distribution of the random variable $Z_i$ in terms of the $\Delta^i_j$: for $j \geq 1$, the probability that $Z_i = \sum_{\ell=1}^{j} \Delta^i_\ell$ is $2^{-j}/(1 - 2^{-L})$. This corresponds to the event that $U$ does not include the keys $m^i_\ell$ for $\ell < j$ and includes the key $m^i_j$. The normalizing factor $(1 - 2^{-L})$ arises from property p2 in Lemma A.2. In the sequel we omit this normalizing factor for brevity:

$Z_i = \Delta^i_1$ with probability $2^{-1}/(1 - 2^{-L})$
$Z_i = \Delta^i_1 + \Delta^i_2$ with probability $2^{-2}/(1 - 2^{-L})$
$Z_i = \Delta^i_1 + \Delta^i_2 + \Delta^i_3$ with probability $2^{-3}/(1 - 2^{-L})$
$\cdots$
$Z_i = \Delta^i_1 + \cdots + \Delta^i_j$ with probability $2^{-j}/(1 - 2^{-L})$
$\cdots$

We now consider $Z_i$ conditioned on the event $m^i_j \in U$. Clearly for $j = 1$ (conditioning on the event $m^i_1 \in U$) we have $Z_i = \Delta^i_1$. For $j > 1$, we have $Z_i = \sum_{\ell=1}^{h} \Delta^i_\ell$ with probability $2^{-h}$ for $h < j$, and $Z_i = \sum_{\ell=1}^{j} \Delta^i_\ell$ with probability $2^{-(j-1)}$.
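To make the separation mechanism of Lemmas A.5 and A.6 concrete, the following schematic simulation (a sketch of ours, not the paper's Algorithm 1; all parameter values are illustrative) accumulates average scores $A[u]$ over random queries of the sum-of-minima statistic and checks that the keys holding the minimum hashes sink to the prefix of the sorted order:

    import numpy as np

    rng = np.random.default_rng(1)
    n, k, r = 500, 64, 6000
    h = rng.exponential(1.0, size=(k, n))   # h[i, u]: hash of key u under map i

    score_sum = np.zeros(n)
    count = np.zeros(n)
    for _ in range(r):
        U = rng.random(n) < 0.5             # non-adaptive random query
        z = h[:, U].min(axis=1).sum()       # observed statistic: sum of the k minima
        score_sum[U] += z                   # credit the response to every key in U
        count[U] += 1

    avg = score_sum / count
    suspects = set(np.argsort(avg)[:k])     # prefix of the sort by average score
    true_mins = set(h.argmin(axis=1))       # the keys m^i_1 holding the k minima
    print(f"recovered {len(suspects & true_mins)} of {len(true_mins)} min-hash keys")

With these illustrative parameters the prefix recovers most of the minimum-hash keys, matching the qualitative prediction of the separation lemma.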
We bound the expected value of $Z_i$ conditioned on the presence of a key $u \in U$.

Lemma A.8. (i) (Anti-concentration) The random variable, over the sampling of $\rho, N$,

$G := \min_{u \in N \setminus \{m^i_1\}} \Big( E_{U \mid u \in U}[Z_i] - E_{U \mid m^i_1 \in U}[Z_i] \Big)$

is such that for all $c > 0$, $\Pr_{\rho,N}\big[G \leq \frac{c}{n-1}\big] \leq 1 - e^{-2c} \leq 2c$.

(ii) (Concentration) For $u \in N'$, the random variable $G' := E_{U \mid u \in U}[Z_i] - E_{U \mid m^i_j \in U}[Z_i]$ is such that for $c \geq 1$, $\Pr\big[G' \geq c \cdot \frac{3}{n 2^j}\big] \leq c e^{-c}$.

Proof. Per Lemma A.2 we are assuming the event (which happens with probability $1 - \delta$) that $U$ includes a key from $\{m^i_j\}_{j \in [L]}$ for all $i \in [k]$. Therefore, for a key $u \in N'$ it holds that $E_{U \mid u \in U}[Z_i] = E_U[Z_i]$.

Recall that $Z_i \mid m^i_1 \in U = \Delta^i_1$. Otherwise (when not conditioned on $m^i_1 \in U$, or when conditioned on the presence of some $u \neq m^i_1$), $Z_i = \Delta^i_1$ with probability $1/2$ (when the random $U$ includes $m^i_1$) and $Z_i \geq \Delta^i_1 + \Delta^i_2$ otherwise (when $U$ does not include $m^i_1$). Thus

$E_U[Z_i] - E_{U \mid m^i_1 \in U}[Z_i] \geq \Delta^i_2/2$
$E_{U \mid m^i_j \in U}[Z_i] - E_{U \mid m^i_1 \in U}[Z_i] \geq \Delta^i_2/2$ for $1 < j \leq L$.

Therefore, for every $u \neq m^i_1$, $G \geq \Delta^i_2/2$. Since $\Delta^i_2 \sim \mathsf{Exp}[n-1]$, we have for all $t > 0$ that $\Pr_\rho[G \leq t] \leq 1 - e^{-2t(n-1)}$. This establishes claim (i).

For claim (ii), note that $E_{U \mid u \in U}[Z_i] = E[Z_i] \geq E_{U \mid m^i_j \in U}[Z_i]$ and

$G' := E_{U \mid u \in U}[Z_i] - E_{U \mid m^i_j \in U}[Z_i] = \sum_{\ell > j} \frac{1}{2^{\ell-1}} \Delta^i_\ell$.

This is a geometrically decreasing weighted sum of independent exponential random variables, and recall that we assumed $L \leq n/3$. Therefore it is stochastically smaller than the respective geometrically decreasing weighted sum of independent $\mathsf{Exp}[2n/3]$ random variables. It follows that

$\mu := E[G'] \leq \frac{3}{2n} \sum_{\ell > j} \frac{1}{2^{\ell-1}} = \frac{3}{n 2^j}$.

We apply an upper bound on the tail (Janson, 2017a) that shows that this sum concentrates almost as well as a single exponential random variable, $\Pr[G' \geq t\mu] \leq t e^{-t}$, and obtain claim (ii).

Lemma A.3 is a corollary of the first claim of Lemma A.8.

We now express a bound on the variance of $Z_i$, also when conditioned on the presence of any key $u \in U$, for fixed $\rho, N$.

Lemma A.9. For fixed $N, \rho$, and any $u \in N$,

$\mathrm{Var}_{U \mid u \in U}[Z_i],\ \mathrm{Var}[Z_i] = \Theta\Big( \sum_j (3/2)^{-j} (\Delta^i_j)^2 \Big)$.

Proof. Using $(\sum_{\ell=1}^{j} \Delta^i_\ell)^2 \leq j \sum_{\ell=1}^{j} (\Delta^i_\ell)^2$ and exchanging the order of summation (noting $\sum_{j \geq \ell} j 2^{-j} = \Theta(\ell 2^{-\ell}) = O((3/2)^{-\ell})$),

$\mathrm{Var}_{U \mid u \in U}[Z_i],\ \mathrm{Var}[Z_i] \leq E[Z_i^2] \leq \sum_{j \geq 1} 2^{-j} \, j \sum_{\ell=1}^{j} (\Delta^i_\ell)^2 = \Theta\Big( \sum_j (3/2)^{-j} (\Delta^i_j)^2 \Big)$.

Proof of Lemma A.4. Since the $Z_i$, also when conditioned on $u \in U$, are independent (modulo our simplifying assumption in Lemma A.2), it follows from Lemma A.9 that

$\mathrm{Var}_{U \mid u \in U}[Z] = \Theta\Big( \sum_{i \in [k]} \sum_{j \geq 1} (3/2)^{-j} (\Delta^i_j)^2 \Big)$, and hence $\max_{u \in N} \mathrm{Var}_{U \mid u \in U}[Z] = \Theta\Big( \sum_{i \in [k]} \sum_{j \geq 1} (3/2)^{-j} (\Delta^i_j)^2 \Big)$.

The right-hand side $Y := \max_{u \in N} \mathrm{Var}_{U \mid u \in U}[Z]$ is a random variable over $\rho, N$ that is a weighted sum of the squares of independent exponential random variables $\Delta^i_j$. The PDF of a squared exponential random variable $\mathsf{Exp}[w]^2$ is $\frac{w}{2\sqrt{t}} e^{-w\sqrt{t}}$; the mean is $2/w^2$ and the second moment is $E[t^2] = 24/w^4$. Applying this, we obtain that $E[Y] = \Theta(k/n^2)$ and $\mathrm{Var}[Y] = O(k/n^4)$. From Chebyshev's inequality, $\Pr[Y \geq E[Y] + c\sqrt{k}/n^2] = O(1/c^2)$, and we obtain for any $\delta > 0$ and a fixed constant $c$ that, with probability at least $1-\delta$,

$\max_{u \in N} \mathrm{Var}_{U \mid u \in U}[Z] \leq c\Big(1 + \frac{1}{\sqrt{k\delta}}\Big)\frac{k}{n^2}$.

A.4. Attack on the Bottom-k standard estimator

The argument is similar to that for $k$-mins sketches; we highlight the differences. Recall that a bottom-$k$ sketch uses a single hash function $h$, with the sketch storing the $k$ smallest values $S(U) := \{h(x) \mid x \in U\}_{(1:k)}$. We use as statistic the $k$th order statistic (the $k$th smallest value) $T(S) := \{h(x) \mid x \in U\}_{(k)}$.

For fixed $\rho$ and $N$, let $m_j \in N$ ($j \in [n]$) be the key with the $j$th smallest hash value, $h(m_j) = \{h(x) \mid x \in N\}_{(j)}$. Define $\Delta_1 := h(m_1)$ and, for $j > 1$, $\Delta_j := h(m_j) - h(m_{j-1})$.

Let $R$ be the random variable that is the rank in $N$ of the key with the $k$th smallest hash value in $U$. The distribution of $R$ is the sum of $k$ i.i.d. geometric random variables $\mathsf{Geom}[q = 1/2]$. We have $E[R] = k/q$ and the concentration bound (Janson, 2017a) that for any $c \geq 1$, $\Pr[R > c \, E[R]] \leq c e^{-c}$.
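The following snippet (ours; parameters are illustrative) simulates $R$ as a sum of $k$ i.i.d. $\mathsf{Geom}[q]$ variables and reports empirical tails for comparison with the stated concentration bound:

    import numpy as np

    rng = np.random.default_rng(3)
    k, q, trials = 64, 0.5, 100_000

    R = rng.geometric(q, size=(trials, k)).sum(axis=1)
    print("E[R] empirical:", R.mean(), "  k/q:", k / q)
    for c in (1.5, 2.0, 3.0):
        # empirical tail vs the bound Pr[R > c E[R]] <= c e^{-c}
        print(f"Pr[R > {c} E[R]] = {(R > c * k / q).mean():.2e}",
              f" bound c*e^-c = {c * np.exp(-c):.2e}")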
Let $L = 10 + 2\log_2 r$, and let $N_0 = \{m_j\}_{j \leq Lk/q}$ be the keys with the $L(k/q)$ smallest hash values. Let $N' = N \setminus N_0$ be the remaining keys. We show that the attack separates, with failure probability at most $\alpha$, a key with one of the bottom-$k$ ranks from a key in $N'$. Assume $n > 3|N_0|$.

Assume that we declare failure when a set $U$ selected by the algorithm does not contain $k$ keys from $N_0$. The probability of such a selection is at most $L e^{-L} < 0.01/r$ per step, and at most $0.01$ over all $r$ steps.

For fixed $\rho$ and $N$, consider the random variable $Z := T(S(U))$. The following parallels Lemma A.3 and Lemma A.9:

Lemma A.10. Fixing $\rho, N$:

(i) Let $G(u, v) := E_{U \mid u \in U}[Z] - E_{U \mid v \in U}[Z]$. For each $i \in [k]$ and $\delta > 0$,

$\Pr\Big[ \min_{u \in N'} G(u, m_i) \leq \frac{\delta}{3n} \Big] \leq \delta$.

(ii) For $\delta > 0$ and some constant $c$, with probability at least $1 - \delta$,

$\max_{u \in N} \mathrm{Var}_{U \mid u \in U}[Z] \leq c\Big(1 + \frac{1}{\sqrt{k\delta}}\Big)\frac{k}{n^2}$.

Proof. (i) Note that $Z = h(m_R)$. Conditioned on $m_i \in U$ for $i \in [k]$, the $k$th smallest hash in $U$ can only move to a lower rank, and the gap is a weighted average of the gaps $\Delta_j$ for $j \in [R, |N_0|]$. These are independent $\mathsf{Exp}[n-j]$ random variables with $j \leq n/3$. The expected value is $\Theta(1/n)$, and the tail bounds are at least as tight as for a single $\mathsf{Exp}[n/3]$ random variable.

(ii) We use the concentration bound on $R$ to express the variance, for fixed $\rho, N$, as a weighted sum, with total weight $\Theta(k)$ and individual weights $O(1)$, of independent squared exponential random variables. The argument is then as in the proof of Lemma A.9.

Using the same analysis, a subset $U \subset N$ of size $\alpha n$ has $T(S(U))$ that in expectation has the $k/\alpha$ smallest rank in $N$, with standard error $\sqrt{k}/\alpha$ on the rank and normalized standard error $1/\sqrt{k}$. The subset $U$ selected as a prefix of the order generated by the attack includes $(1-\delta)k$ of the bottom-$k$ keys of $N$ and $\delta k$ other keys. This means that in expectation $T(S(U))$ has rank about $k\delta/\alpha$ in $N$. That is, the resulting estimate is a factor of $1/\delta$ off.

B. Analysis of Attack on General Query Response Algorithms

We include details for Sections 7 and 8.

B.1. Rank-domain representation of sketches

We use the rank-domain representation $S^R_\rho(U)$ of the input sketch $S_\rho(U)$. This representation is defined for subsets of a fixed ground set $N$. Instead of hash values, it includes the ranks in $N$ of the keys that are represented in the sketch $S_\rho(U)$, with respect to the relevant hashmaps.

Definition B.1 (Rank-domain representation). For a fixed ground set $N$ and a subset $U \subset N$, the rank-domain representation $S^R_\rho(U)$ of a respective MinHash sketch has the form $(Y_1, \ldots, Y_k)$, where $Y_i \in \mathbb{N}$:

- $k$-mins sketch: For $i \in [k]$ and $j \geq 1$, let $m^i_j$ be the key $x \in N$ with the $j$th smallest $h_i(x)$. For $i \in [k]$, let $Y_i := \min\{j \mid m^i_j \in U\}$. That is, $Y_i$ is the smallest $j$ such that $m^i_j \in U$.

- $k$-partition sketch: For $i \in [k]$ and $j \geq 1$, let $m^i_j$ be the key $x \in N$ that is in part $i$ with the $j$th smallest $h(x)$. For $i \in [k]$, let $Y_i := \min\{j \mid m^i_j \in U\}$ be the smallest rank in the $i$th part.

- Bottom-$k$ sketch: For $j \geq 1$, let $m_j$ be the key $x \in N$ with the $j$th smallest $h(x)$ value. Let the bottom-$k$ keys in $U$ be $m_{i_1}, m_{i_2}, \ldots, m_{i_k}$ where $i_1 < i_2 < \cdots < i_k$. We then define $Y_1 := i_1$ and $Y_j := i_j - i_{j-1}$ for $1 < j \leq k$.

Note that when the ground set $N$ is available to the query response algorithm, the rank-domain and MinHash representations are equivalent (we can compute one from the other). The following properties of the rank domain facilitate a simpler analysis: (i) it depends only on the order induced by the hashmaps and not on actual hash values, and thus allows us to factor out the dependence on $\rho$; (ii) it subsumes the information on $q$ (and on $|U|$) available from $S_\rho(U)$; and (iii) it has a unified form and facilitates a unified treatment across the MinHash sketch types.
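The bottom-$k$ case of Definition B.1 can be transcribed directly; a minimal sketch (ours; function and variable names are illustrative):

    import numpy as np

    def bottom_k_rank_domain(h: np.ndarray, in_U: np.ndarray, k: int) -> np.ndarray:
        """h[u]: hash of key u over the ground set N; in_U: boolean membership in U."""
        ranks = np.empty_like(h, dtype=int)
        ranks[np.argsort(h)] = np.arange(1, len(h) + 1)   # rank of each key in N
        bottom = np.sort(ranks[in_U])[:k]    # ranks i_1 < ... < i_k of bottom-k of U
        return np.diff(bottom, prepend=0)    # Y_1 = i_1, Y_j = i_j - i_{j-1}

    rng = np.random.default_rng(4)
    n, k, q = 10_000, 8, 0.2
    Y = bottom_k_rank_domain(rng.random(n), rng.random(n) < q, k)
    print(Y)    # approximately k i.i.d. Geom(q) samples (Lemma B.2 below)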
The subsets $U \in D_0$ generated by our attack algorithm are selected by choosing a rate $q$ and then sampling $U$ by including each $x \in N$ independently with probability $q$. We consider the distribution, which we denote by $S^R[q]$, of the rank-domain sketch under this sampling of $U$ with rate $q$. We show that for a sufficiently large $|N| = n$, the rank-domain representation is as follows:

Lemma B.2 (Distribution of rank-domain sketches). For $\delta > 0$, $q \in (0, 1)$, and an upper bound $r$ on the attack size, let $L = \log_2(rk/\delta)/q + 10$ and assume $n > 3kL/\delta$. Then for all three MinHash sketch types, the distribution $S^R[q]$ is within total variation distance $\delta$ from $(Y_1, \ldots, Y_k)$ that are $k$ independent geometric random variables with parameter $q$: $Y_i \sim \mathsf{Geom}[q]$.

Proof. As in Lemma A.2, applying the birthday paradox with $n > 3kL/\delta$, with probability at least $1 - \delta$: for $k$-mins sketches, the keys $m^i_j$ for $i \in [k]$ and $j \in [L]$ are distinct; for $k$-partition sketches, there are at least $L$ keys assigned to each part, so the keys $m^i_j$ for $i \in [k]$ and $j \in [L]$ are well specified.

A sketch from $S^R[q]$ can be equivalently sampled using the following process:

- $k$-mins and $k$-partition sketch: For each $i \in [k]$, process the keys $m^i_j$ by increasing $j \geq 1$ until $\mathsf{Bern}[q]$ succeeds, and then set $Y_i = j$.
- Bottom-$k$ sketch: Process the keys $m_j$ by increasing $j \geq 1$ until $\mathsf{Bern}[q]$ succeeds $k$ times.

We next establish that with our choice of $n$, with probability at least $1 - \delta$, in all $r$ samplings of $U$ the sketch $S^R(U)$ is determined by the $Lk$ smallest-rank keys. Therefore there are sufficiently many keys for the sketch to agree with the sampling of $k$ i.i.d. $\mathsf{Geom}[q]$ random variables. For $k$-mins and $k$-partition sketches, the probability that for a single hashmap $i \in [k]$ none of the $L$ smallest ranks is included is at most $(1-q)^L$. Taking a union bound over the $k$ maps and $r$ samplings, and using $\log(1/(1-q)) \geq q$, gives the claim. With bottom-$k$ sketches, the requirement is that in all $r$ selections the $k$th smallest rank in $U$ is $O(kL)$.

Remark B.3. Estimating $q$ from a sketch drawn from $S^R[q]$ is a standard parameter estimation problem. A sufficient statistic for estimating $q$ is $T(S^R) := \sum_{i=1}^{k} Y_i$. Note the following properties:

- The distribution $S^R[q]$ does not provide additional information on the cardinality $|U|$ beyond an estimate of $q$.
- The distribution of $S^R[q]$ conditioned on $T(S) = \tau$ is the same for all $q$ (this follows from the definition of a sufficient statistic).
- The statistic $T$ has expected value $k/q$, variance $k(1-q)/q^2$, and single-exponential concentration (Janson, 2017b).

B.1.1. Continuous rank-domain representation

We now cast the distribution of $S^R[q]$ using a continuous representation $S^C$. This is simply a tool that we use in the analysis. We can sample a sketch from $S^R[q]$ as follows:

- Set the rate $q' := -\ln(1-q)$.
- Sample a sketch $S^C(U) = (Y'_1, \ldots, Y'_k)$ where the $Y'_i \sim \mathsf{Exp}[q']$ are i.i.d.
- Compute $S^R(U)$ from $S^C(U)$ using $Y_i \leftarrow \lfloor Y'_i \rfloor + 1$ for $i \in [k]$.

The correctness of this transformation follows from the relation between the geometric $\mathsf{Geom}[q]$ and exponential $\mathsf{Exp}[q']$ distributions:

$\Pr[Y_i = t] = \Pr[t-1 \leq Y'_i < t] = e^{-q'(t-1)} - e^{-q' t} = (1 - e^{-q'}) e^{-q'(t-1)} = q (1-q)^{t-1}$.

Note that we can always recover $S^R$ from $S^C$, but we need to know $q$ in order to compute $S^C$ from $S^R$:

$Y'_i \sim \mathsf{Exp}[-\ln(1-q)] \mid Y'_i \in [Y_i - 1, Y_i)$.

Therefore, being provided with the continuous representation only makes the query response algorithm more informed and potentially more powerful. Also note that $|q' - q| = O(q^2)$. A sufficient statistic for estimating $q$ from $S^C[q']$ is $T := \sum_{i=1}^{k} Y'_i$. In the sequel we work with $S^C$ and omit the prime from $q$ and $T$.
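A quick check (ours) of this continuous-to-discrete coupling: with $q' = -\ln(1-q)$ and $Y' \sim \mathsf{Exp}[q']$, the variable $\lfloor Y' \rfloor + 1$ matches the $\mathsf{Geom}[q]$ probabilities $q(1-q)^{t-1}$:

    import numpy as np

    rng = np.random.default_rng(5)
    q, trials = 0.3, 1_000_000
    qprime = -np.log1p(-q)                 # q' = -ln(1 - q)

    Y = np.floor(rng.exponential(1.0 / qprime, size=trials)).astype(int) + 1
    for t in range(1, 6):
        print(f"Pr[Y = {t}] = {(Y == t).mean():.4f}",
              f" vs q(1-q)^(t-1) = {q * (1 - q) ** (t - 1):.4f}")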
Now note that the distribution of $T := \|S^C[q]\|_1$ for a given $k$ and $q$ is the sum of $k$ i.i.d. $\mathsf{Exp}[q]$ random variables. This is the Erlang distribution, which has density function, for $x \in [0, \infty)$:

$f_T(k, q; x) = \frac{q^k}{(k-1)!} x^{k-1} e^{-qx}$   (4)

The distribution has mean $E[T] = k/q$, variance $\mathrm{Var}[T] = k/q^2$, and exponential tail bounds (Janson, 2017a):

For $c > 1$: $\Pr[T \geq c\,k/q] \leq \frac{1}{c} e^{-k(c - 1 - \ln c)}$   (5)
For $c < 1$: $\Pr[T \leq c\,k/q] \leq e^{-k(c - 1 - \ln c)}$   (6)

Consider the random variable $Z = (T - E[T])/\sqrt{\mathrm{Var}[T]}$, the number of standard deviations of $T$ from its mean. We have

$T = \frac{k + \sqrt{k}\,Z}{q}$ and $Z = \frac{qT - k}{\sqrt{k}}$.

The domain of $Z$ is $[-\sqrt{k}, \infty)$ and the density function of $Z$ is

$f_Z(k; z) = \frac{\sqrt{k}}{(k-1)!} (k + z\sqrt{k})^{k-1} e^{-(k + z\sqrt{k})}$.

This density satisfies

$\int_{-\sqrt{k}}^{\infty} f_Z(k; x)\, x\, dx = 0$   (9)
$\int_{-\sqrt{k}}^{\infty} f_Z(k; x)\, x^2\, dx = 1$   (10)
$\int_{-\sqrt{k}/4}^{\sqrt{k}/4} f_Z(k; x)\, x^2\, dx = \Theta(1)$   (11)
for $c \in (0, 1]$: $\int_{-c}^{0} f_Z(k; x)\, dx,\ \int_{0}^{c} f_Z(k; x)\, dx = \Omega(c)$   (12)
for $c \geq 0$: $\Pr[Z \geq c\sqrt{k}] \leq \frac{1}{c+1} e^{-k(c - \ln(c+1))}$   (13)
for $c \in (0, 1)$: $\Pr[Z \leq -c\sqrt{k}] \leq e^{-k(-c - \ln(1-c))}$   (14)

Note that $T$ is available to the query response algorithm, but $q$, and thus the value of $Z$, are not.
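A numerical sanity check (ours) of these Erlang facts: the statistic $T$ has mean $k/q$ and variance $k/q^2$, and its empirical tails fall below the exponential bounds (5) and (6):

    import numpy as np

    rng = np.random.default_rng(6)
    k, q, trials = 32, 0.1, 1_000_000

    T = rng.exponential(1.0 / q, size=(trials, k)).sum(axis=1)
    print("mean:", T.mean(), "vs", k / q, "  var:", T.var(), "vs", k / q**2)

    for c in (1.3, 0.7):
        tail = (T >= c * k / q).mean() if c > 1 else (T <= c * k / q).mean()
        bound = np.exp(-k * (c - 1 - np.log(c)))
        print(f"c={c}: empirical tail {tail:.2e}  vs  e^(-k(c-1-ln c)) = {bound:.2e}")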
B.2. Correct maps

A map $S \mapsto \pi(S) \in [0, 1]$ maps sketches to the probability of returning $1$. We require that the maps selected by the QR algorithm are correct as in Definition 6.3 with $\delta = O(1/\sqrt{k})$.

For a map $\pi$ and statistic value $\tau$, we denote by $\bar{\pi}(\tau)$ the mean value of $\pi(S)$ over sketches with statistic value $T(S) = \tau$. This is well defined since, for the query distribution in our attacks $D_0 \mid q = q'$, even when conditioned on a fixed rate $q'$, the distribution of the sketch conditioned on $T(S) = \tau$ does not depend on $q$ (see Remark B.3).

We now specify conditions on $\bar{\pi}(\tau)$ that must be satisfied by a correct $\pi$. A correct map may return an incorrect output, when conditioned on cardinality, with probability $\delta$. This means that there can be correct maps with large error on certain values of $\tau$ (since each cardinality induces a distribution over $\tau$). We therefore cannot make a sharp claim on $\bar{\pi}(\tau)$ that must hold for every $\tau$ in an applicable range. Instead, we make an average claim: for any interval of $\tau$ values that is wide enough to include an $\Omega(1)$ fraction of the mass for some cardinality value $c \in [A, 2A]$, the average error of the mapping must be $O(\delta)$.

Claim B.4. For any $\xi > 0$ there is $c_0 > 0$ such that for any correct map $\pi$ for $A$ with $\delta \leq c_0/\sqrt{k}$, and any $\tau_b > (1 + 0.1/\sqrt{k})\tau_a$, it holds that

$\frac{1}{\tau_b - \tau_a} \int_{\tau_a}^{\tau_b} \bar{\pi}(x)\,dx < \xi$ if $\tau_a > kn/A$
$\frac{1}{\tau_b - \tau_a} \int_{\tau_a}^{\tau_b} \bar{\pi}(x)\,dx > 1 - \xi$ if $\tau_b < kn/(2A)$

Proof. For a cardinality value $c$, the distribution of the statistic $T$ conditioned on $c$ is $f_T(k, c/n; x)$ (see (4)). With cardinality value $c$, it holds that $\Pr[T < kn/c] \geq 1/e$ and $\Pr[T > kn/c] \geq 1/e$. Moreover, the probability mass of the interval $\frac{kn}{c}[1 - 0.1/\sqrt{k},\, 1 + 0.1/\sqrt{k}]$ is $\Theta(1)$. It follows from the correctness requirement for cardinality value $c = kn/\tau$ that there exists $c_1 > 0$ such that

$\int_{\tau}^{\tau(1 + 0.1/\sqrt{k})} \bar{\pi}(x)\,dx < c_1 \delta \cdot 0.1\tau/\sqrt{k}$ if $\tau > kn/A$
$\int_{\tau(1 - 0.1/\sqrt{k})}^{\tau} \bar{\pi}(x)\,dx > (1 - c_1 \delta)\, 0.1\tau/\sqrt{k}$ if $\tau < kn/(2A)$

Therefore, covering $[\tau_a, \tau_b]$ by such intervals, for $\tau_a > kn/A$ we obtain $\int_{\tau_a}^{\tau_b} \bar{\pi}(x)\,dx \leq 10 c_1 \delta \sqrt{k}\,(\tau_b - \tau_a)$, and for $\tau_b < kn/(2A)$ we obtain $\int_{\tau_a}^{\tau_b} \bar{\pi}(x)\,dx \geq (\tau_b - \tau_a)(1 - 10 c_1 \delta \sqrt{k})$. Choosing $c_0 \leq \xi/(10 c_1)$ establishes the claim.

For fixed $q$, the cardinality $|U|$ of the selected $U$ has distribution $\mathsf{Binom}[n, q]$. The $n$ chosen for the attack is large enough so that for all our $r$ queries $\big| |U| - qn \big|/(qn) \ll 1/\sqrt{k}$. That is, the variation in $|U|$ for fixed $q$ is small compared with the error of the sketch, and $|U| \approx qn$.

B.3. Relating Z and sampling probability of low rank keys

The sketch is determined by the $k$ keys of lowest rank in $U$. We can view the sampling of $U$, up to the point that the sketch is determined, as a process (as in the proof of Lemma B.2) that examines keys from the ground set $N$ in a certain order until the $k$ keys that determine the sketch are selected. The process selects each examined key with probability $q$. For a bottom-$k$ sketch, keys are examined in order of increasing rank until $k$ are selected. With $k$-mins and $k$-partition sketches, keys in each part (or hashmap) are examined sequentially by increasing rank until there is a selection for the part. For all sketch types, the statistic value $T$ corresponds to the number of keys from $N$ that are examined until $k$ are selected. This applies also with the continuous representation $S^C$.

We denote by $N_0$ the set of keys that are examined with probability at least $\delta_c \geq 1/(rk)$ when the rate is at least $q_a$. We refer to these keys as low-rank keys. It holds that $|N_0| \leq k \ln(1/\delta_c)/q_a$. For $\delta_c = \Theta(1/(rk))$ we have $|N_0| = O(k \log(rk))$. The remaining keys $N' := N \setminus N_0$ are unlikely to impact the sketch content, and we refer to them as transparent.

With rate $q$, the probability that a particular key is included in $U$ is $q$. We now consider a rate $q$ and the inclusion probability conditioned on the normalized deviation $Z$ from the mean. Transparent keys have inclusion probability $q$. The low-rank keys $N_0$, however, have an average inclusion probability that depends on $Z$. Qualitatively, we expect that when $Z < 0$ the inclusion probability is larger than $q$ and increases with the magnitude $|Z|$; when $Z > 0$ the inclusion probability is lower than $q$ and decreases with the magnitude. This is quantified in the following claim:

Claim B.5. Fix a rate $q$ and $\Delta$. Consider the distribution of $U$ conditioned on $Z = \Delta$. The average probability over $N_0$ keys to be in $U$ is

$q - \frac{\Delta \sqrt{k}}{|N_0|}$.

Proof. Equivalently, consider the distribution conditioned on $T = \frac{1}{q}(k + \Delta\sqrt{k})$. The sampling process selects $k$ keys after examining $T$ keys. The average effective sampling rate for the examined keys is $q_e = k/T$. There are $T$ keys out of $N_0$ that are processed with effective rate $q_e$, and the remaining keys in $N_0$ have effective rate $q$. Averaging the effective rate over the $T$ processed keys and the remaining $|N_0| - T$ keys, we obtain

$\frac{T q_e + (|N_0| - T) q}{|N_0|} = \frac{k + (|N_0| - T) q}{|N_0|} = q + \frac{k - qT}{|N_0|} = q - \frac{\Delta\sqrt{k}}{|N_0|}$.

B.4. Scoring probability gap

For a map $\pi$, let $p'(\pi)$ be the score probability, over the distribution of $q$ and $Z$, of a key in $N'$. Let $p_0(\pi)$ be the average over $N_0$ of the score probabilities of the keys in $N_0$. Let $f_\lambda(x)$ be the density function of the selected rate, described by Algorithm 5. Note that the selected rate is in an interval $[q_a, q_b]$.

Algorithm 5: Sample rate
Input: $A$, $n$
$\omega \leftarrow \frac{n}{2A}$; $\omega_a \leftarrow \frac{1}{2}\omega$  // range of inverse rates
$D \sim U[0, \omega/4]$
$\omega'_a \leftarrow \omega_a + D$; $\omega'_b \leftarrow \omega'_a + \frac{7}{4}\omega$  // range of sampled inverse rate
return $q \leftarrow 1/U[\omega'_a, \omega'_b]$
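Algorithm 5 transcribes directly to code; the following sketch (ours, following the pseudocode as reconstructed above) shows that the implied cardinalities $q \cdot n$ land on the order of $A$:

    import random

    def sample_rate(A: float, n: float) -> float:
        omega = n / (2 * A)                  # scale of inverse rates
        omega_a = omega / 2                  # left end of the window before the shift
        D = random.uniform(0, omega / 4)     # random shift of the window
        lo = omega_a + D
        hi = lo + (7 / 4) * omega
        return 1.0 / random.uniform(lo, hi)  # q = 1 / U[lo, hi]

    random.seed(7)
    A, n = 1000, 1_000_000
    qs = [sample_rate(A, n) for _ in range(5)]
    print([round(q * n) for q in qs])        # implied cardinalities, on the order of A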
For each transparent key, the score probability is

$p'(\pi) = \int_{q_a}^{q_b} \int \bar{\pi}\Big(\frac{k + z\sqrt{k}}{q}\Big) f_Z(k; z)\, dz\ q\, f_\lambda(q)\, dq$   (17)

On average over the low-rank keys $N_0$, using Claim B.5, it is

$p_0(\pi) = \int_{q_a}^{q_b} \int \Big(q - \frac{z\sqrt{k}}{|N_0|}\Big) \bar{\pi}\Big(\frac{k + z\sqrt{k}}{q}\Big) f_Z(k; z)\, dz\ f_\lambda(q)\, dq$   (18)

For a correct map $\pi$ (as in Definition 6.3), we express the gap between $p'(\pi)$ and $p_0(\pi)$. Note that we bound the gap without assuming much about the actual values, as they can vary greatly across different correct maps $\pi$.

Lemma B.6 (Score probability gap). Consider a step of the algorithm and a correct map $\pi$ (see Remark 6.4). Then

$p_0(\pi) - p'(\pi) = \Omega\Big(\frac{1}{|N_0|}\Big) = \Omega\Big(\frac{1}{k \log(kr)}\Big)$.

In the remaining part of this section we present the proof of Lemma B.6. We will need the following claim, which relates $\Delta$ and the scoring probability.

Claim B.7. For $|\Delta| < \sqrt{k}/4$:

$\int_{q_a}^{q_b} \bar{\pi}\Big(\frac{k + \Delta\sqrt{k}}{q}\Big) f_\lambda(q)\, dq = \int_{q_a}^{q_b} \bar{\pi}\Big(\frac{k}{q}\Big) f_\lambda(q)\, dq + \Theta\Big(\frac{\Delta}{\sqrt{k}}\Big)$

Proof. Using the distribution specified in Algorithm 5, for any $g$:

$\int_{q_a}^{q_b} g(q) f_\lambda(q)\,dq = \frac{4}{\omega} \int_0^{\omega/4} \frac{4}{7\omega} \int_{\omega_a+D}^{\omega_a+D+\frac{7}{4}\omega} g(1/x)\,dx\,dD$   (19)

We use $\omega'_a = \omega_a + D$ and $\omega'_b = \omega_a + D + \frac{7}{4}\omega$. Changing the variable $x$ to $y = x(1 + \Delta/\sqrt{k})$,

$\int_{\omega'_a}^{\omega'_b} \bar{\pi}\big((k+\Delta\sqrt{k})x\big)\,dx = \frac{1}{1+\Delta/\sqrt{k}} \int_{\omega'_a(1+\Delta/\sqrt{k})}^{\omega'_b(1+\Delta/\sqrt{k})} \bar{\pi}(ky)\,dy$.

Therefore (note that $\Delta$ can be negative; in that case, to streamline expressions, we interpret the asymptotic notation $O(c\Delta)$ as $O(c|\Delta|)$),

$\int_{\omega'_a}^{\omega'_b} \bar{\pi}\big((k+\Delta\sqrt{k})x\big)\,dx = \int_{\omega'_a}^{\omega'_b} \bar{\pi}(kx)\,dx - \Theta(1)\int_{\omega'_a}^{\omega'_a(1+\Delta/\sqrt{k})} \bar{\pi}(kx)\,dx + \Theta(1)\int_{\omega'_b}^{\omega'_b(1+\Delta/\sqrt{k})} \bar{\pi}(kx)\,dx + O\Big(\frac{\Delta}{\sqrt{k}}\,\omega\Big)$   (20)

where the last term follows using $\int_{\omega'_a}^{\omega'_b} \bar{\pi}(kx)\,dx \in [0, \omega'_b - \omega'_a] = [0, \frac{7}{4}\omega]$. Therefore,

$\int_{q_a}^{q_b} \bar{\pi}\Big(\frac{k+\Delta\sqrt{k}}{q}\Big) f_\lambda(q)\,dq - \int_{q_a}^{q_b} \bar{\pi}\Big(\frac{k}{q}\Big) f_\lambda(q)\,dq = \frac{16}{7\omega^2} \int_0^{\omega/4} \Big[\int_{\omega'_a}^{\omega'_b} \bar{\pi}\big((k+\Delta\sqrt{k})x\big)\,dx - \int_{\omega'_a}^{\omega'_b} \bar{\pi}(kx)\,dx\Big]\,dD$   (using (19))
$= -\Theta\Big(\frac{1}{\omega^2}\Big)\int_0^{\omega/4}\int_{\omega'_a}^{\omega'_a(1+\Delta/\sqrt{k})} \bar{\pi}(kx)\,dx\,dD + \Theta\Big(\frac{1}{\omega^2}\Big)\int_0^{\omega/4}\int_{\omega'_b}^{\omega'_b(1+\Delta/\sqrt{k})} \bar{\pi}(kx)\,dx\,dD$   (applying (20))   (21)

We now bound the terms separately (the argument for negative $\Delta$ is similar). Exchanging the order of integration between $D$ and $x$ yields $x$-intervals of width $\Omega(\omega)$, to which Claim B.4 applies. For the first term, since $k\omega'_a < kn/(2A)$,

$\int_0^{\omega/4}\int_{\omega'_a}^{\omega'_a(1+\Delta/\sqrt{k})} \bar{\pi}(kx)\,dx\,dD \geq \Theta\Big(\frac{\Delta}{\sqrt{k}}\,\omega^2\Big)(1-\xi)$   (using Claim B.4)   (22)

and, trivially bounding $\bar{\pi} \leq 1$,

$\int_0^{\omega/4}\int_{\omega'_a}^{\omega'_a(1+\Delta/\sqrt{k})} \bar{\pi}(kx)\,dx\,dD = O\Big(\frac{\Delta}{\sqrt{k}}\,\omega^2\Big)$   (23)

Combining (22) and (23) we obtain

$\int_0^{\omega/4}\int_{\omega'_a}^{\omega'_a(1+\Delta/\sqrt{k})} \bar{\pi}(kx)\,dx\,dD = \Theta\Big(\frac{\Delta}{\sqrt{k}}\,\omega^2\Big)$   (24)

We next bound the last term: since $k\omega'_b > kn/A$,

$\int_0^{\omega/4}\int_{\omega'_b}^{\omega'_b(1+\Delta/\sqrt{k})} \bar{\pi}(kx)\,dx\,dD = O\Big(\frac{\Delta}{\sqrt{k}}\,\omega^2\,\xi\Big)$   (applying Claim B.4)   (25)

We substitute (24) and (25) in (21) and conclude the proof by choosing a small enough constant $\xi$.

Proof of Lemma B.6. We express the difference between the average score probability of a key in $N_0$ (18) and the score probability of a key in $N'$ (17):

$p_0(\pi) - p'(\pi) = -\frac{\sqrt{k}}{|N_0|}\int_{q_a}^{q_b}\int \bar{\pi}\Big(\frac{k+z\sqrt{k}}{q}\Big) f_Z(k;z)\,z\,dz\,f_\lambda(q)\,dq = -\frac{\sqrt{k}}{|N_0|}\int \Big[\int_{q_a}^{q_b} \bar{\pi}\Big(\frac{k+z\sqrt{k}}{q}\Big) f_\lambda(q)\,dq\Big] f_Z(k;z)\,z\,dz$   (26)

We separately consider $z$ in the range $I_{in} := [-\sqrt{k}/4, \sqrt{k}/4]$ and outside this range, in $I_{out} := [-\sqrt{k}, -\sqrt{k}/4) \cup (\sqrt{k}/4, \infty)$. For $z$ outside the range, we use that $\int_{q_a}^{q_b} \bar{\pi}\big(\frac{k+z\sqrt{k}}{q}\big) f_\lambda(q)\,dq \in [0,1]$ together with the tail bounds (13) and (14) on $f_Z$ and get

$\Big|\int_{I_{out}} \Big[\int_{q_a}^{q_b} \bar{\pi}\Big(\frac{k+z\sqrt{k}}{q}\Big) f_\lambda(q)\,dq\Big] f_Z(k;z)\,z\,dz\Big| = e^{-\Omega(k)}$   (27)

For $z$ inside the range we apply Claim B.7:

$\int_{I_{in}} \Big[\int_{q_a}^{q_b} \bar{\pi}\Big(\frac{k+z\sqrt{k}}{q}\Big) f_\lambda(q)\,dq\Big] f_Z(k;z)\,z\,dz = \int_{q_a}^{q_b} \bar{\pi}\Big(\frac{k}{q}\Big) f_\lambda(q)\,dq \int_{I_{in}} f_Z(k;z)\,z\,dz + \Theta\Big(\frac{1}{\sqrt{k}}\Big)\int_{I_{in}} f_Z(k;z)\,z^2\,dz = \Theta\Big(\frac{1}{\sqrt{k}}\Big)$   (28)

using (9) and (11): by (9), the integral of $f_Z(k;z)\,z$ over the full range is $0$, and the contribution of $I_{out}$ to it is exponentially small, while (11) bounds the second term. The statement of the lemma follows by combining (27) and (28) in (26).

B.5. The case of symmetric estimators

The proof of Lemma 7.3 (gap for symmetric estimators) follows as a corollary of the proof of Lemma B.6.

Proof of Lemma 7.3. For symmetric maps (Definition 7.2), keys in $N_0$ that have lower rank can only have higher scoring probabilities. That is, when $j < j'$, the score probability of $m^i_j$ is no lower than that of $m^i_{j'}$; with bottom-$k$ sketches, the score probability of $m_j$ is no lower than that of $m_{j'}$. In particular, the lowest-rank keys have the highest average scores among the keys in their components. Additionally, there is symmetry between components. Therefore, the average score of each of the $k$ lowest-rank keys $m$ is no lower than the average over all $N_0$ keys:

$E_U[\pi(S_\rho(U))\, \mathbb{1}(m \in U)] \geq p_0(\pi)$.

Therefore, using Lemma B.6 and that $E_U[\pi(S_\rho(U))\, \mathbb{1}(u \in U)] = p'(\pi)$ for $u \in N'$:

$E_U[\pi(S_\rho(U))\, \mathbb{1}(m \in U)] - E_U[\pi(S_\rho(U))\, \mathbb{1}(u \in U)] \geq p_0(\pi) - p'(\pi) = \Omega\Big(\frac{1}{k \log(kr)}\Big)$.
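The gap of Lemma B.6 can also be observed numerically. The following simulation (ours) works in the rank-domain abstraction, with a hypothetical threshold map standing in for a correct map and a stand-in value for $|N_0|$; the estimated $p_0 - p'$ comes out positive and of order $1/|N_0|$:

    import numpy as np

    rng = np.random.default_rng(8)
    k, A, n, trials = 32, 1000.0, 200_000.0, 400_000
    omega = n / (2 * A)
    m0 = 16_000                        # stand-in for |N_0| (chosen larger than T)

    gap = 0.0
    for _ in range(trials):
        lo = omega / 2 + rng.uniform(0, omega / 4)
        q = 1.0 / rng.uniform(lo, lo + 1.75 * omega)   # rate from Algorithm 5
        T = rng.geometric(q, size=k).sum()             # keys examined until k selected
        if T <= m0:   # typical case: all k selections occur within N_0
            c = k + rng.binomial(m0 - T, q)            # unexamined keys keep rate q
        else:         # rare: the kth selection falls beyond N_0
            c = rng.hypergeometric(k - 1, T - k, m0)
        out = T < 1.25 * k * omega                     # hypothetical threshold map pi
        gap += out * (c / m0 - q)     # score difference: low-rank vs transparent key
    print("estimated p0 - p':", gap / trials, "  1/|N_0| =", 1 / m0)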
B.6. Analysis details of the Adaptive Algorithm

We consider the information available to the query response algorithm. The mask $M$ is shared with the query response algorithm, and hence it only needs to estimate the cardinality of the (much larger) set $U$. The mask keys $M$ hide information in $S_\rho(U)$ and make additional keys transparent. For $k$-partition and $k$-mins sketches, the keys $m^i_h$ with $h > \min\{j \mid m^i_j \in M\}$ are transparent. For bottom-$k$ sketches, only $k - k'$ of the bottom ranks in $U$ are visible when $k'$ keys from $M$ have lower rank.

We describe the sampling of $S(U \cup M)$ (for a given mask $M$) as an equivalent process that examines keys in order, selecting each examined key with probability $q$, until the sketch is determined. This process generalizes the process we described for the case without a mask in the proof of Lemma B.2:

Bottom-$k$ sketch: Set a counter $c \leftarrow k$ and sample $t \sim \mathsf{Geom}[q]$. Process the keys $m_j$ by increasing $j$:
1. If $m_j \in M$, decrease $c$ and output $m_j$. If $c = 0$, halt.
2. If $m_j \notin M$, decrease $t$.
   (a) If $t = 0$, output $m_j$, decrease $c$, and sample a new $t \sim \mathsf{Geom}[q]$. If $c = 0$, halt.

$k$-mins and $k$-partition sketches: For $i \in [k]$, let $h_i \leftarrow \min\{\ell \mid m^i_\ell \in M\}$. Sample $t \sim \mathsf{Geom}[q]$. Process $i \in [k]$ in order:
1. If $h_i$ is defined and $h_i \leq t$, then set $t \leftarrow t - h_i + 1$ and continue with the next $i$.
2. If $h_i$ is undefined or $h_i > t$, then output $m^i_t$ and sample a new $t \sim \mathsf{Geom}[q]$. Continue with the next $i$.

The QR algorithm has the results of this process, which yields $k - k'$ i.i.d. $\mathsf{Geom}[q]$ random variables. As keys are added to the mask $M$, the information we can glean on $q$ from the sketch, which corresponds to the number $k - k'$ of $\mathsf{Geom}[q]$ samples we obtain, decreases. As the mask gets augmented, additional keys in $N_0$ become transparent, in the sense that they have probability smaller than $\delta/r$ of impacting the sketch if included in $U$. With $k$-mins and $k$-partition sketches, keys $m^i_j$ with $j > h_i$ become transparent. With bottom-$k$ sketches, keys $m_j$ become transparent once the rank $j$ is large enough that, counting the mask keys $M \cap \{m_\ell\}_{\ell < j}$ of lower rank, the process is determined before $m_j$ is examined with probability at least $1 - \delta/r$.
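The masked bottom-$k$ process above transcribes as follows (a sketch of ours, in the rank-domain abstraction with a fixed rate $q$; the mask ranks are illustrative):

    import numpy as np

    rng = np.random.default_rng(9)

    def masked_bottom_k(k: int, q: float, mask_ranks: set) -> list:
        """Process ranks j = 1, 2, ... until k keys are output.
        Returns (rank, from_mask) pairs for the keys that determine the sketch."""
        out, c, t = [], k, rng.geometric(q)
        j = 0
        while c > 0:
            j += 1
            if j in mask_ranks:            # mask key: always present in U union M
                out.append((j, True)); c -= 1
            else:
                t -= 1                     # Bern(q) trials via the geometric clock
                if t == 0:
                    out.append((j, False)); c -= 1
                    t = rng.geometric(q)
        return out

    sketch = masked_bottom_k(k=8, q=0.05, mask_ranks={1, 2, 3, 10})
    kprime = sum(m for _, m in sketch)     # mask keys among the bottom-k
    print(sketch)
    print(f"visible Geom(q) samples: k - k' = {8 - kprime}")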