# Learning-augmented count-min sketches via Bayesian nonparametrics

Journal of Machine Learning Research 24 (2024) 1-60. Submitted 1/21; Revised 12/22; Published 1/23.

Emanuele Dolera (emanuele.dolera@unipv.it), University of Pavia, Italy
Stefano Favaro (stefano.favaro@unito.it), University of Torino and Collegio Carlo Alberto, Italy
Stefano Peluchetti (speluchetti@cogent.co.jp), Cogent Labs, Tokyo, Japan

Editor: Erik Sudderth

©2024 Emanuele Dolera, Stefano Favaro and Stefano Peluchetti. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-0096.html.

The count-min sketch (CMS) is a time- and memory-efficient randomized data structure that provides estimates of token frequencies in a data stream of tokens, i.e. point queries, based on randomly hashed data. A learning-augmented version of the CMS, referred to as the CMS-DP, was proposed by Cai, Mitzenmacher and Adams (NeurIPS 2018); it relies on Bayesian nonparametric (BNP) modeling of the data stream of tokens via a Dirichlet process (DP) prior, with estimates of a point query obtained as mean functionals of the posterior distribution of the point query, given the hashed data. While the CMS-DP has proved to improve on some aspects of the CMS, it has the major drawback of arising from a constructive proof that builds upon arguments that are tailored to the DP prior, namely arguments that are not usable under other nonparametric priors. In this paper, we present a Bayesian proof of the CMS-DP that has the main advantage of building upon arguments that are usable under the popular Pitman-Yor process (PYP) prior, which generalizes the DP prior by allowing for a more flexible tail behaviour, ranging from geometric tails to heavy power-law tails. This result leads to a novel learning-augmented CMS for power-law data streams, referred to as the CMS-PYP, which relies on BNP modeling of the data stream of tokens via a PYP prior. Under this more general framework, we apply the arguments of the Bayesian proof of the CMS-DP, suitably adapted to the PYP prior, in order to compute the posterior distribution of a point query, given the hashed data. Applications to synthetic data and real textual data show that the CMS-PYP outperforms the CMS and the CMS-DP in estimating low-frequency tokens, which are known to be of critical interest in textual data, and it is competitive with respect to a variation of the CMS designed to deal with the estimation of low-frequency tokens. An extension of our BNP approach to more general queries, such as range queries, is also discussed.

Keywords: Bayesian nonparametrics; count-min sketch; Dirichlet process prior; likelihood-free estimation; Pitman-Yor process prior; point query; power-law data stream; random hashing.

## 1. Introduction

When processing large data streams, it is critical to represent data in compact structures that allow information to be extracted efficiently. Sketches form a broad class of compact randomized data structures that can be easily updated and queried to perform time- and memory-efficient estimation on large data streams of tokens.
They have found many applications in machine learning (Aggarwal and Yu, 2010), security analysis (Dwork et al., 2010), natural language processing (Goyal et al., 2009), computational biology (Zhang et al., 2014; Leo Elworth et al., 2020), social networks (Song et al., 2009) and games (Harrison, 2010). We refer to the monographs of Cormode et al. (2012) and Cormode and Yi (2020), and references therein, for a comprehensive and up-to-date review on sketches.

A notable problem involving sketches is the estimation, or recovery, of the frequency of a token in the stream, typically referred to as a point query. The count-min sketch (CMS) of Cormode and Muthukrishnan (2005) is arguably the most popular approach to estimating point queries, and it relies on random hashing to obtain a sketched representation of the data. The CMS achieves the goal of using a compact data structure to save time and memory, while having provable guarantees on the frequencies estimated through hashed data. Nevertheless, there are several aspects of the CMS that may be improved. First, the CMS provides only point estimates, although the random hashing procedure may induce substantial uncertainty in the estimation, especially for low-frequency tokens. Second, the CMS relies on the assumption of a finite universe of tokens, although it is common for large data streams to assume an unbounded number of distinct tokens. Third, there often exists a priori knowledge on the data, and it may be desirable to incorporate such knowledge into the CMS estimates.

Learning-augmented CMSs aim at improving the CMS through the use of statistical models that better exploit the data (Aamand et al., 2019; Hsu et al., 2019). In such a context, Cai et al. (2018) first considered a Bayesian nonparametric (BNP) approach that assumes tokens in the stream to be modeled as random samples from an unknown distribution, which is endowed with a Dirichlet process (DP) prior (Ferguson, 1973). The resulting learning-augmented CMS, which is referred to as the CMS-DP, estimates a point query through mean functionals of the posterior distribution of the point query, given the hashed data. The posterior distribution is at the core of the BNP approach, and it is derived through an intriguing constructive proof that exploits a restriction property and a finite-dimensional projective property of the DP. The approach of Cai et al. (2018) allows for an unknown (unbounded) number of distinct tokens in the universe and, most importantly, it allows one to: i) make use of the DP prior for incorporating a priori knowledge on the data into the CMS estimates; ii) make use of the posterior distribution for assessing the uncertainty of CMS estimates. Recently, Dolera et al. (2021) showed that the constructive proof of Cai et al. (2018) admits a non-trivial extension to the normalized inverse-Gaussian process (NIGP) prior, which features a distinguishing power-law tail behaviour (Lijoi et al., 2005), in contrast with the geometric tail behaviour of the DP prior. This has led to the introduction of the CMS-NIGP, a learning-augmented CMS for power-law data streams, though with the critical limitation that it cannot be tuned to the power-law degree of the data.

### 1.1 Our contributions

In this paper, we further investigate the BNP approach to develop learning-augmented CMSs.
The peculiar interplay between the predictive distribution and the restriction property of the DP is the cornerstone of the constructive proof of Cai et al. (2018). While providing an intuitive derivation of the CMS-DP, such a proof builds upon some heuristic arguments that are tailored to the DP prior. This is a critical limitation of the approach of Cai et al. (2018), especially with respect to the flexibility of incorporating a priori knowledge on the data into the CMS estimates. Here, we present a Bayesian proof of the CMS-DP: that is, we present a rigorous computation of the (regular) conditional distribution of a point query, given the hashed data, and we show that such a distribution coincides with the posterior distribution derived in Cai et al. (2018). Besides strengthening the BNP approach of Cai et al. (2018), our proof improves its flexibility by avoiding the use of peculiar properties of the DP, paving the way to go beyond the use of the DP prior. In this respect, nonparametric priors with power-law tail behaviour are of special interest, as power-law distributions occur in many situations of scientific interest, and they have significant consequences for the understanding of natural and social phenomena (Clauset et al., 2009). As well as being well known in natural language or textual data (Zipf, 1949; Cancho and Solé, 2020; Harald, 2001), power-law phenomena have emerged for data arising from humans' electronic activities, e.g. website visits, emails, interactions in social networks, password innovation, tags in annotation systems and editing of webpages (Huberman and Adamic, 1999; Barabási, 2005; Muchnik et al., 2013; Tria et al., 2014; Rybski, 2016; Monechi et al., 2017).

Our Bayesian proof of the CMS-DP leads us to extend the BNP approach of Cai et al. (2018) to the Pitman-Yor process (PYP) prior, which is arguably the most popular nonparametric prior with power-law tail behaviour (Pitman and Yor, 1997). The PYP is indexed by a discount parameter that controls the tail behaviour of the prior, ranging from geometric tails to heavy power-law tails, and including the tail behaviour of the NIGP prior. The PYP has neither a restriction property nor a finite-dimensional projective property analogous to those of the DP, and hence: i) we apply the arguments developed for the Bayesian proof of the CMS-DP, suitably adapted to the PYP, to compute the posterior distribution of a point query, given the hashed data; ii) we introduce a likelihood-free approach, which relies on the minimum Wasserstein distance method (Bernton et al., 2019), to estimate the prior's parameters. This leads to a learning-augmented CMS for power-law data, which is referred to as the CMS-PYP. Besides generalizing the CMS-DP to power-law data streams, the CMS-PYP remarkably improves on the CMS-NIGP, as it can be tuned to the power-law degree of the data through the discount parameter. Applications to synthetic and real data show that the CMS-PYP outperforms both the CMS and the CMS-DP in the estimation of low-frequency tokens, which are of interest in textual data (Pitel and Fouquier, 2015); the CMS-PYP also outperforms the CMS-NIGP for data with heavier power-law tails, whereas it is competitive with the CMS-NIGP for data with lighter power-law tails. In general, we show that the CMS-PYP is competitive with respect to the count-mean-min (CMM) of Goyal et al.
(2012), which is a variation of the CMS designed for the estimation of low-frequency tokens, and also with respect to the bootstrap-debiased-count-min (BDCM) of Ting (2018).

### 1.2 Organization of the paper

The paper is structured as follows. In Section 2 we briefly review the CMS-DP and its constructive proof, and then we present our Bayesian proof of the CMS-DP. In Section 3 we develop the CMS-PYP through the computation of the posterior distribution of a point query, given the hashed data, and the estimation of the prior's parameters. Section 4 contains a numerical illustration of the CMS-PYP, both on synthetic data and real data. In Section 5 we discuss our work, as well as its extension to the problem of estimating more general queries, and present some directions for future research. Proofs of our results, except for the Bayesian proof of the CMS-DP, and additional technical results are deferred to appendices.

## 2. A Bayesian derivation of the CMS-DP

For any $m \geq 1$ let $x_{1:m} = (x_1, \ldots, x_m)$ be a stream of $\mathbb{V}$-valued tokens, with $\mathbb{V}$ being a space of types (symbols). Assuming $x_{1:m}$ to be available only through summaries obtained by its random hashing, the goal is to estimate, or recover, the frequency of a new token $x_{m+1}$ in $x_{1:m}$, i.e. $f_{x_{m+1}} = \sum_{i=1}^{m} \mathbf{1}_{\{x_i\}}(x_{m+1})$. The CMS (Cormode and Muthukrishnan, 2005) is the most popular approach to estimate the point query $f_{x_{m+1}}$. For positive integers $J$ and $N$, write $[J] = \{1, \ldots, J\}$ and $[N] = \{1, \ldots, N\}$, and let $h_1, \ldots, h_N$, with $h_n : \mathbb{V} \to [J]$, be random hash functions that are i.i.d. according to a pairwise independent hash family $\mathcal{H}$. That is, $h \sim \mathcal{H}$ is such that for all $v_1, v_2 \in \mathbb{V}$, with $v_1 \neq v_2$, the probability that $v_1$ and $v_2$ hash to any $j_1$ and $j_2$, respectively, is $\Pr[h(v_1) = j_1, h(v_2) = j_2] = J^{-2}$. Pairwise independence is typically known as strong universality, and it implies uniformity, i.e. $\Pr[h(v) = j] = J^{-1}$ for any $j \in [J]$ (Cormode and Yi, 2020, Chapter 3). Hashing $x_{1:m}$ through $h_1, \ldots, h_N$ creates $N$ vectors of $J$ buckets, say $\{(C_{n,1}, \ldots, C_{n,J})\}_{n \in [N]}$, as follows: $C_{n,j}$ is initialized at zero, and whenever a new token $x_i$ with $h_n(x_i) = j$ is observed we set $C_{n,j} \leftarrow 1 + C_{n,j}$, for every $n \in [N]$. The CMS estimates $f_{x_{m+1}}$ by

$$\hat{f}^{(\mathrm{CMS})}_{x_{m+1}} = \min\{C_{1,h_1(x_{m+1})}, \ldots, C_{N,h_N(x_{m+1})}\}. \qquad (1)$$

We refer to Cormode and Yi (2020, Chapter 3) for a detailed account of the CMS and generalizations thereof dealing with general small summaries for big data. In this section, we consider the CMS-DP (Cai et al., 2018), which is a learning-augmented version of the CMS that relies on BNP modeling of the stream $x_{1:m}$ through a DP prior. We briefly review the CMS-DP and its constructive proof, and then we present our Bayesian proof of the CMS-DP.

### 2.1 The CMS-DP and its constructive proof

#### 2.1.1 The DP prior

A simple and intuitive definition of the DP follows from its stick-breaking construction (Ferguson, 1973; Sethuraman, 1994). For $\theta > 0$ let: i) $(B_i)_{i \geq 1}$ be random variables i.i.d. as a Beta distribution with parameter $(1, \theta)$; ii) $(V_i)_{i \geq 1}$ be random variables independent of $(B_i)_{i \geq 1}$, and i.i.d. from a non-atomic distribution $\nu$ on $\mathbb{V}$. Then, define $P_1 = B_1$ and $P_j = B_j \prod_{1 \leq i \leq j-1}(1 - B_i)$ for $j \geq 2$, in such a way that $\sum_{i \geq 1} P_i = 1$ almost surely. The (discrete) random probability measure $P = \sum_{j \geq 1} P_j \delta_{V_j}$ is a DP on $\mathbb{V}$ with (base) distribution $\nu$ and mass parameter $\theta$. The law of $P$ thus provides a prior distribution on the space of discrete distributions on $\mathbb{V}$. For short, $P \sim \mathrm{DP}(\theta; \nu)$.
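As a concrete illustration of the stick-breaking construction just recalled, the following minimal Python sketch draws the weights $P_1, P_2, \ldots$ of a DP up to a truncation level. It is only a didactic sketch under our own naming; the truncation level is an illustrative choice and is not part of the CMS-DP methodology.

```python
import numpy as np

def dp_stick_breaking_weights(theta, num_atoms, rng=None):
    """Truncated stick-breaking weights of a DP(theta; nu).

    Returns the first num_atoms weights P_1, ..., P_K; the atom
    locations V_j ~ nu can be attached separately.
    """
    rng = np.random.default_rng() if rng is None else rng
    betas = rng.beta(1.0, theta, size=num_atoms)    # B_i ~ Beta(1, theta)
    tails = np.cumprod(1.0 - betas)                 # prod_{i <= j} (1 - B_i)
    weights = betas * np.concatenate(([1.0], tails[:-1]))
    return weights

weights = dp_stick_breaking_weights(theta=10.0, num_atoms=1000)
print(weights.sum())  # close to 1 for a large truncation level
```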
See Ghosal and van der Vaart (2017) and references therein for a comprehensive account of the DP, including its definition in terms of the normalization of a Gamma completely random measure. For our work, it is useful to recall the restriction property and the finite-dimensional projective property of the DP (Ferguson, 1973; Regazzini, 2001). The restriction property is stated as follows: if $A \subset \mathbb{V}$ and $P_A$ is the random probability measure on $A$ induced by $P \sim \mathrm{DP}(\theta; \nu)$ on $\mathbb{V}$, i.e. the renormalized restriction of $P$ to $A$, then $P_A \sim \mathrm{DP}(\theta\nu(A); \nu_A/\nu(A))$, where $\nu_A$ is the restriction of the measure $\nu$ to $A$. The finite-dimensional projective property is stated as follows: if $\{B_1, \ldots, B_k\}$ is a measurable $k$-partition of $\mathbb{V}$, for $k \geq 1$, then $P \sim \mathrm{DP}(\theta; \nu)$ is such that $(P(B_1), \ldots, P(B_k))$ is distributed as a Dirichlet distribution with parameter $(\theta\nu(B_1), \ldots, \theta\nu(B_k))$.

Because of the discreteness of $P \sim \mathrm{DP}(\theta; \nu)$, a random sample $X_{1:m} = (X_1, \ldots, X_m)$ from $P$ induces a random partition of the set $\{1, \ldots, m\}$ into $1 \leq K_m \leq m$ partition subsets, labelled by distinct types $\mathbf{v} = \{v_1, \ldots, v_{K_m}\}$, with corresponding frequencies $(N_{1,m}, \ldots, N_{K_m,m})$ such that $1 \leq N_{i,m} \leq m$ and $\sum_{1 \leq i \leq K_m} N_{i,m} = m$. For $1 \leq l \leq m$ let $M_{l,m}$ be the number of distinct types with frequency $l$, i.e. $M_{l,m} = \sum_{1 \leq i \leq K_m} \mathbf{1}_{\{N_{i,m}\}}(l)$, such that $\sum_{1 \leq l \leq m} M_{l,m} = K_m$ and $\sum_{1 \leq l \leq m} l M_{l,m} = m$. The distribution of $\mathbf{M}_m = (M_{1,m}, \ldots, M_{m,m})$ is defined on $\mathcal{M}_{m,k} = \{(m_1, \ldots, m_m) : m_l \geq 0,\ \sum_{1 \leq l \leq m} m_l = k,\ \sum_{1 \leq l \leq m} l m_l = m\}$, and it is such that

$$\Pr[\mathbf{M}_m = \mathbf{m}] = \frac{m!}{(\theta)_{(m)}}\, \theta^k \prod_{i=1}^{m} \frac{1}{i^{m_i}\, m_i!}\, \mathbf{1}_{\mathcal{M}_{m,k}}(\mathbf{m}), \qquad (2)$$

where $(a)_{(n)}$ denotes the rising factorial of $a$ of order $n$, i.e. $(a)_{(n)} = \prod_{0 \leq i \leq n-1} (a + i)$. See Pitman (2006, Chapter 3), and references therein, for details on the sampling formula (2). Let $\mathbf{v}_l = \{v_i \in \mathbf{v} : N_{i,m} = l\}$, i.e. the labels of types with frequency $l$, and let $\mathbf{v}_0 = \mathbb{V} \setminus \mathbf{v}$, i.e. the labels of types not belonging to $\mathbf{v}$. The predictive distribution induced by $P \sim \mathrm{DP}(\theta; \nu)$ is

$$\Pr[X_{m+1} \in \mathbf{v}_l \mid X_{1:m}] = \Pr[X_{m+1} \in \mathbf{v}_l \mid \mathbf{M}_m = \mathbf{m}] = \begin{cases} \dfrac{\theta}{\theta+m} & \text{if } l = 0,\\[4pt] \dfrac{l\, m_l}{\theta+m} & \text{if } l \geq 1,\end{cases} \qquad (3)$$

for $m \geq 1$. The DP prior is characterized as the sole (discrete) nonparametric prior for which: i) the conditional probability that $X_{m+1}$ belongs to $\mathbf{v}_0$, given $X_{1:m}$, depends on $X_{1:m}$ only through $m$; ii) the conditional probability that $X_{m+1}$ belongs to $\mathbf{v}_l$, given $X_{1:m}$, depends on $X_{1:m}$ only through $m$ and $M_{l,m}$. Such a characterization is typically referred to as the sufficientness postulate of the DP (Regazzini, 1978; Zabell, 1997; Bacallado et al., 2017).

#### 2.1.2 The CMS-DP

The CMS-DP of Cai et al. (2018) assumes that the stream $x_{1:m}$ is modeled as a random sample $X_{1:m}$ from an unknown discrete distribution $P$, which is endowed with a DP prior. That is,

$$X_{1:m} \mid P \overset{\text{iid}}{\sim} P, \qquad P \sim \mathrm{DP}(\theta; \nu), \qquad (4)$$

for $m \geq 1$. Let $h_1, \ldots, h_N$ be a collection of random hash functions that are i.i.d. from the strong universal family $\mathcal{H}$, and assume that $h_1, \ldots, h_N$ are independent of $X_{1:m}$ for any $m \geq 1$; in particular, by de Finetti's representation theorem, it holds that $h_1, \ldots, h_N$ are independent of $P \sim \mathrm{DP}(\theta; \nu)$. Under the CMS-DP the $X_i$'s are hashed through $h_1, \ldots, h_N$, thus creating $\{(C_{n,1}, \ldots, C_{n,J})\}_{n \in [N]}$, and estimates of the point query $f_{X_{m+1}}$, with $X_{m+1}$ being of an arbitrary type $v \in \mathbb{V}$, are obtained as suitable mean functionals of the posterior distribution of $f_{X_{m+1}}$ given $\{C_{n,h_n(X_{m+1})}\}_{n \in [N]}$. Cai et al.
(2018) provided an intriguing constructive derivation of such a posterior distribution, which relies on two main arguments: A1) because of the strong universality of $\mathcal{H}$ and the independence between $h_n$ and $X_{1:m}$, the restriction property of the DP, in combination with the sufficientness postulate of the DP, implies that the tokens $X_i$'s hashed into the $j$-th bucket $C_{n,j}$ constitute a random sample from a DP with scaled mass parameter $\theta/J$, for any $j \in [J]$ and $n \in [N]$; A2) because of the strong universality of $\mathcal{H}$, the finite-dimensional projective property of the DP implies that the vector of hashed frequencies $C_n = (C_{n,1}, \ldots, C_{n,J})$ is distributed as a Dirichlet-Multinomial distribution with parameter $(\theta/J, \ldots, \theta/J)$, for any $n \in [N]$.

For a single hash function $h_n$, the main result of Cai et al. (2018) may be summarized as follows. A random sample $X_{1:m}$ from $P \sim \mathrm{DP}(\theta; \nu)$ induces a random partition of $\{1, \ldots, m\}$ into subsets labelled by $v \in \mathbb{V}$, and (3) is the posterior distribution, given $X_{1:m}$, over which subset $X_{m+1}$ joins. The frequency of that subset is the point query $f_{X_{m+1}}$ we seek to estimate, i.e.

$$\Pr[f_{X_{m+1}} = l \mid X_{1:m}] = \Pr[X_{m+1} \in \mathbf{v}_l \mid X_{1:m}] \qquad (5)$$

for $l = 0, 1, \ldots, m$. However, we are assuming that the sampling information $X_{1:m}$ is available only through $\{C_{n,h_n(X_{m+1})}\}_{n \in [N]}$, and hence the posterior distribution (5) is not of interest in itself. Instead, what is of interest is the distribution of $f_{X_{m+1}}$ obtained from (5) by marginalizing out $X_{1:m}$. By combining (5) with (2) (Cai et al., 2018, Section 3), it holds that

$$p_{f_{X_{m+1}}}(l; m, \theta) := \Pr[f_{X_{m+1}} = l] = \theta\, \frac{(m-l+1)_{(l)}}{(\theta+m-l)_{(l+1)}}. \qquad (6)$$

For any $n \in [N]$, strong universality of $\mathcal{H}$ and independence between $h_n$ and $X_{1:m}$ imply that $h_n$ induces a (fixed) $J$-partition of $\mathbb{V}$, say $\{B_{h_n,1}, \ldots, B_{h_n,J}\}$, and the $\nu$-measure of each $B_{h_n,j}$ is $1/J$. Therefore, according to A1), $h_n$ turns $P \sim \mathrm{DP}(\theta; \nu)$ into $J$ bucket-specific DPs, say $P_j \sim \mathrm{DP}(\theta/J; J\nu_{B_{h_n,j}})$ for $j = 1, \ldots, J$, such that $P_j$ governs the distribution of the sole $X_i$'s hashed into $B_{h_n,j}$. For any $l = 0, 1, \ldots, c_n$, Cai et al. (2018) thus set

$$\Pr[f_{X_{m+1}} = l \mid C_{n,h_n(X_{m+1})} = c_n] = p_{f_{X_{m+1}}}\!\left(l; c_n, \frac{\theta}{J}\right). \qquad (7)$$

This is a heuristic assignment, namely the left-hand side of (7) is not obtained through a rigorous computation of the (regular) conditional distribution of $f_{X_{m+1}}$ given $C_{n,h_n(X_{m+1})}$. We refer to such a derivation as the constructive proof of the posterior distribution of $f_{X_{m+1}}$ given $C_{n,h_n(X_{m+1})}$.

For the collection of hash functions $h_1, \ldots, h_N$, the posterior distribution of $f_{X_{m+1}}$, given $\{C_{n,h_n(X_{m+1})}\}_{n \in [N]}$, follows from Equation (7) by means of the assumption that the $h_n$'s are i.i.d. according to the strong universal family $\mathcal{H}$. In particular, by a direct application of Bayes' theorem, Cai et al. (2018, Section 3) showed that for $l = 0, 1, \ldots, \min\{c_1, \ldots, c_N\}$ it holds that

$$\Pr[f_{X_{m+1}} = l \mid \{C_{n,h_n(X_{m+1})}\}_{n \in [N]} = \{c_n\}_{n \in [N]}] = \frac{\prod_{n \in [N]} p_{f_{X_{m+1}}}\!\left(l; c_n, \frac{\theta}{J}\right)}{\left(p_{f_{X_{m+1}}}(l; m, \theta)\right)^{N-1}}. \qquad (8)$$

CMS-DP estimates of the point query $f_{X_{m+1}}$, with respect to a suitable choice of a loss function, are obtained as functionals of the posterior distribution (8), e.g. posterior mode, mean and median. We refer to Cai et al. (2018) for a detailed discussion of BNP estimators of $f_{X_{m+1}}$ and their interplay with the CMS. For a concrete application of (8), it remains to estimate the unknown prior's parameter $\theta > 0$, and this follows from A2). In particular, $C_n$ is distributed as a Dirichlet-Multinomial distribution with parameter $(\theta/J, \ldots, \theta/J)$,
and the distribution of $\{C_n\}_{n \in [N]}$ follows by the assumption that the $h_n$'s are i.i.d. from $\mathcal{H}$, that is

$$\Pr[\{C_n\}_{n \in [N]} = \{c_n\}_{n \in [N]}] = \prod_{n \in [N]} \frac{m!\, \Gamma(\theta)}{\Gamma(\theta + m)} \prod_{j \in [J]} \frac{\Gamma(c_{n,j} + \theta/J)}{\Gamma(\theta/J)\, c_{n,j}!}. \qquad (9)$$

Equation (9) provides the (marginal) likelihood function of $\{c_n\}_{n \in [N]}$. The explicit form of such a function allows for an easy implementation of a Bayesian estimation of the prior's parameter $\theta$. Cai et al. (2018) adopted an empirical Bayes approach, which consists in estimating $\theta$ by maximizing, with respect to $\theta$, the likelihood function (9). A fully Bayes, or hierarchical Bayes, approach can also be applied by setting a suitable prior distribution on $\theta$.

### 2.2 A Bayesian proof of the CMS-DP

In Cai et al. (2018), the interplay between the predictive distribution and the restriction property of the DP is the cornerstone of the derivation of the posterior distribution (7). In particular, the constructive proof of the CMS-DP imposes two strong constraints with respect to the choice of the prior distribution: C1) the predictive distribution induced by the prior must have a simple analytical expression, i.e. the marginalization with respect to the sampling information $X_{1:m}$ must be doable explicitly, and it must satisfy a sufficientness postulate analogous to that of the DP prior; C2) the prior distribution must have a restriction property analogous to that of the DP prior, which allows us to make use of the distribution of $f_{X_{m+1}}$ to assign the posterior distribution of $f_{X_{m+1}}$ given $C_{n,h_n(X_{m+1})}$. Nonparametric priors obtained by normalizing (homogeneous) completely random measures (James, 2002; Prünster, 2002; Pitman, 2003; Regazzini et al., 2003; James et al., 2009) form a broad class of priors that generalize the DP prior and satisfy C2); this follows from the Poisson process representation of completely random measures, for which the Poisson coloring theorem holds true (Kingman, 1993, Chapter 5). However, the DP is the sole normalized (homogeneous) completely random measure that satisfies C1) (Regazzini, 1978). Beyond normalized completely random measures, the PYP prior is a popular generalization of the DP prior that satisfies C1). However, the PYP does not satisfy C2); this is because the PYP is not a normalized completely random measure. To the best of our knowledge, the DP prior is the sole (discrete) nonparametric prior that satisfies both C1) and C2), and hence it is the sole prior for which the constructive proof of Cai et al. (2018) works. The constructive proof thus provides a limitation for the BNP approach of Cai et al. (2018), as it implies a lack of flexibility in the choice of the prior distribution for BNP modeling of $x_{1:m}$.

Here, we present a rigorous derivation of the posterior distribution of $f_{X_{m+1}}$ given $C_{n,h_n(X_{m+1})}$, which is referred to as the Bayesian proof of the CMS-DP. For any $n \in [N]$, we consider the problem of computing the (regular) conditional distribution of $f_{X_{m+1}}$ given $C_{n,h_n(X_{m+1})}$, i.e.

$$\Pr[f_{X_{m+1}} = l \mid C_{n,h_n(X_{m+1})} = c_n] = \frac{\Pr\left[f_{X_{m+1}} = l,\ \sum_{i=1}^{m} \mathbf{1}_{\{h_n(X_i)\}}(h_n(X_{m+1})) = c_n\right]}{\Pr\left[\sum_{i=1}^{m} \mathbf{1}_{\{h_n(X_i)\}}(h_n(X_{m+1})) = c_n\right]}, \qquad (10)$$

for $l = 0, 1, \ldots, c_n$. In the next theorem, we show that the (regular) conditional distribution (10) coincides with the posterior distribution (7) obtained by means of the constructive proof. That is, the Bayesian proof and the constructive proof lead to the same posterior distribution.
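Before turning to the proof, and under our reading of Equations (6)-(8), the following Python sketch evaluates the CMS-DP posterior for a queried token; it works in log-space for numerical stability, and the final normalization is a numerical safeguard rather than part of the formulas. All function names are ours.

```python
from math import lgamma, log, exp

def log_rising(a, n):
    # log of the rising factorial (a)_(n) = a (a+1) ... (a+n-1)
    return lgamma(a + n) - lgamma(a)

def log_pf(l, m, theta):
    # log of Eq. (6): Pr[f = l] = theta * (m-l+1)_(l) / (theta+m-l)_(l+1)
    return log(theta) + log_rising(m - l + 1, l) - log_rising(theta + m - l, l + 1)

def cms_dp_posterior(counts, m, theta, J):
    # Eq. (8): combine the N single-hash posteriors (7) via Bayes' theorem,
    # over the support l = 0, ..., min_n c_n.
    N = len(counts)
    logs = [sum(log_pf(l, c, theta / J) for c in counts)
            - (N - 1) * log_pf(l, m, theta)
            for l in range(min(counts) + 1)]
    mx = max(logs)
    w = [exp(x - mx) for x in logs]
    z = sum(w)
    return [x / z for x in w]

post = cms_dp_posterior(counts=[12, 9], m=10_000, theta=100.0, J=128)
f_dp = sum(l * p for l, p in enumerate(post))  # posterior mean estimate of the point query
```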
As a critical feature, our Bayesian proof stands out for not relying on the peculiar restriction property of the DP; instead, by exploiting the strong universality of $\mathcal{H}$, the Bayesian proof relies on evaluating the numerator and the denominator of (10) through standard combinatorial arguments and well-known distributional properties of a random sample $X_{1:m}$ from the DP (Pitman, 2003, 2006; Sangalli, 2006), i.e. marginal properties. It emerges that the Bayesian proof has two main advantages with respect to the constructive proof: i) it provides a rigorous proof of the CMS-DP, which avoids any heuristic assignment of the posterior distribution, thus strengthening the BNP approach of Cai et al. (2018); ii) it avoids the use of the peculiar restriction property of the DP, thus paving the way to the use of more general classes of prior distributions than the sole DP prior.

**Theorem 1** *For $m \geq 1$, let $x_{1:m}$ be a stream of tokens that are modeled as a random sample $X_{1:m}$ from $P \sim \mathrm{DP}(\theta; \nu)$, and let $X_{m+1}$ be an additional random sample from $P$. Moreover, let $h_n$ be a random hash function distributed as the strong universal family $\mathcal{H}$, and let $h_n$ be independent of $X_{1:m}$ for any $m \geq 1$, that is $h_n$ is independent of $P$. Then, for $l = 0, 1, \ldots, c_n$,*

$$\Pr[f_{X_{m+1}} = l \mid C_{n,h_n(X_{m+1})} = c_n] = \frac{\theta}{J}\, \frac{(c_n - l + 1)_{(l)}}{\left(\frac{\theta}{J} + c_n - l\right)_{(l+1)}}. \qquad (11)$$

**Proof** The proof consists of three steps: i) evaluate the numerator of (10); ii) evaluate the denominator of (10); iii) evaluate (10) by means of what is obtained in step i) and step ii). First, we observe that the independence between $h_n$ and $X_{1:m}$ allows us to invoke the freezing lemma (Baldi, 2017, Lemma 4.1), according to which we can treat $h_n$ as if it were fixed, i.e. non-random. To simplify the notation, we remove the subscript $n$ from $h_n$ and $c_n$.

We start with the denominator of (10). Uniformity of $h$ implies that $h$ induces a (fixed) $J$-partition $\{B_1, \ldots, B_J\}$ of $\mathbb{V}$ such that $B_j = \{v \in \mathbb{V} : h(v) = j\}$ and $\nu(B_j) = J^{-1}$ for $j = 1, \ldots, J$. Then, the finite-dimensional projective property of the DP implies that $P(B_j)$ is distributed as a Beta distribution with parameter $(\theta/J, \theta(1 - 1/J))$ for $j = 1, \ldots, J$. Hence, we write

$$\Pr\left[\sum_{i=1}^{m} \mathbf{1}_{\{h(X_i)\}}(h(X_{m+1})) = c\right] = J\binom{m}{c}\, \mathrm{E}[(P(B_j))^{c+1}(1 - P(B_j))^{m-c}]$$
$$= J\binom{m}{c} \int_0^1 p^{c+1}(1-p)^{m-c}\, \frac{\Gamma(\theta)}{\Gamma\!\left(\frac{\theta}{J}\right)\Gamma\!\left(\theta - \frac{\theta}{J}\right)}\, p^{\frac{\theta}{J}-1}(1-p)^{\theta - \frac{\theta}{J} - 1}\, \mathrm{d}p = J\binom{m}{c}\, \frac{\Gamma(\theta)}{\Gamma\!\left(\frac{\theta}{J}\right)\Gamma\!\left(\theta - \frac{\theta}{J}\right)}\, \frac{\Gamma\!\left(\frac{\theta}{J} + c + 1\right)\Gamma\!\left(\theta - \frac{\theta}{J} + m - c\right)}{\Gamma(\theta + m + 1)}. \qquad (12)$$

This completes the study of the denominator of (10). Now, we consider the numerator of (10). Let us define the event
$$B(m, l) = \{X_1 = \cdots = X_l = X_{m+1},\ \{X_{l+1}, \ldots, X_m\} \cap \{X_{m+1}\} = \emptyset\}.$$
Then, by the exchangeability of $(X_1, \ldots, X_{m+1})$,
$$\Pr\left[f_{X_{m+1}} = l,\ \sum_{i=1}^{m} \mathbf{1}_{\{h(X_i)\}}(h(X_{m+1})) = c\right] = \binom{m}{l}\Pr\left[B(m,l),\ \sum_{i=l+1}^{m} \mathbf{1}_{\{h(X_i)\}}(h(X_{m+1})) = c - l\right].$$
That is, the distribution of $(f_{X_{m+1}}, C_j)$ is completely determined by the distribution of the random variables $(X_1, \ldots, X_{m+1})$. Let $\Pi(s, k)$ denote the set of all possible partitions of the set $\{1, \ldots, s\}$ into $k$ disjoint subsets $\pi_1, \ldots, \pi_k$, such that $n_i$ is the cardinality of $\pi_i$. In particular, from Sangalli (2006, Equation 3.5), for any measurable $A_1, \ldots, A_{m+1}$ we have that
$$\Pr[X_1 \in A_1, \ldots, X_{m+1} \in A_{m+1}] = \sum_{k=1}^{m+1} \frac{\theta^k}{(\theta)_{(m+1)}} \sum_{(\pi_1,\ldots,\pi_k) \in \Pi(m+1,k)} \prod_{i=1}^{k} (n_i - 1)!\, \nu\Big(\bigcap_{r \in \pi_i} A_r\Big)$$
for $m \geq 1$. Let $\mathscr{V}$ be the Borel $\sigma$-algebra of $\mathbb{V}$, and let $\nu_{\pi_1,\ldots,\pi_k}$ be the probability measure on $(\mathbb{V}^{m+1}, \mathscr{V}^{m+1})$ defined as
$$\nu_{\pi_1,\ldots,\pi_k}(A_1 \times \cdots \times A_{m+1}) = \prod_{1 \leq i \leq k} \nu\Big(\bigcap_{r \in \pi_i} A_r\Big),$$
which attaches to $B(m, l)$ a value that is either 0 or 1. In particular, $\nu_{\pi_1,\ldots,\pi_k}(B(m, l)) = 1$ if and only if one of the $\pi_i$'s is equal to the set $\{1, \ldots, l, m+1\}$. Hence, based on the measure $\nu_{\pi_1,\ldots,\pi_k}$, isolating the block $\{1, \ldots, l, m+1\}$, whose contribution to the partition sum is the factor $\theta\, l!\, (\theta)_{(m-l)}/(\theta)_{(m+1)}$, and recognizing in the remaining sum over $(\pi_1, \ldots, \pi_r) \in \Pi(m-l, r)$ the distribution of a random sample $(X_1, \ldots, X_{m-l})$ from $P \sim \mathrm{DP}(\theta; \nu)$, which is again given by Sangalli (2006, Equation 3.5), we write

$$\Pr\left[B(m,l),\ \sum_{i=l+1}^{m} \mathbf{1}_{\{h(X_i)\}}(h(X_{m+1})) = c - l\right] = \theta\, \frac{(\theta)_{(m-l)}}{(\theta)_{(m+1)}}\, l!\, \binom{m-l}{c-l}\, \mathrm{E}[(P(B_j))^{c-l}(1 - P(B_j))^{m-c}]$$
$$= \theta\, \frac{(\theta)_{(m-l)}}{(\theta)_{(m+1)}}\, l!\, \binom{m-l}{c-l} \int_0^1 p^{c-l}(1-p)^{m-c}\, \frac{\Gamma(\theta)}{\Gamma\!\left(\frac{\theta}{J}\right)\Gamma\!\left(\theta - \frac{\theta}{J}\right)}\, p^{\frac{\theta}{J}-1}(1-p)^{\theta-\frac{\theta}{J}-1}\, \mathrm{d}p$$
$$= \theta\, \frac{(\theta)_{(m-l)}}{(\theta)_{(m+1)}}\, l!\, \binom{m-l}{c-l}\, \frac{\Gamma(\theta)}{\Gamma\!\left(\frac{\theta}{J}\right)\Gamma\!\left(\theta - \frac{\theta}{J}\right)}\, \frac{\Gamma\!\left(\frac{\theta}{J} + c - l\right)\Gamma\!\left(\theta - \frac{\theta}{J} + m - c\right)}{\Gamma(\theta + m - l)}, \qquad (13)$$

where the second identity follows from an application of Sangalli (2006, Proposition 3.1) under the DP prior, using the fact that $P(B_j)$ is distributed as a Beta distribution with parameter $(\theta/J, \theta(1-1/J))$ for $j = 1, \ldots, J$; see also the formulae displayed at page 469 of Sangalli (2006). From (13) we write

$$\Pr\left[f_{X_{m+1}} = l,\ \sum_{i=1}^{m} \mathbf{1}_{\{h(X_i)\}}(h(X_{m+1})) = c\right] = \theta\, \frac{m!}{(c-l)!\,(m-c)!}\, \frac{\Gamma(\theta)}{\Gamma\!\left(\frac{\theta}{J}\right)\Gamma\!\left(\theta - \frac{\theta}{J}\right)}\, \frac{\Gamma\!\left(\frac{\theta}{J} + c - l\right)\Gamma\!\left(\theta - \frac{\theta}{J} + m - c\right)}{\Gamma(\theta + m + 1)}. \qquad (14)$$

This completes the study of the numerator of (10). Then, by combining (14) and (12), for $l = 0, 1, \ldots, m$,

$$\Pr[f_{X_{m+1}} = l \mid C_{h(X_{m+1})} = c] = \frac{\theta\, \frac{m!}{(c-l)!\,(m-c)!}\, \frac{\Gamma\left(\frac{\theta}{J} + c - l\right)\Gamma\left(\theta - \frac{\theta}{J} + m - c\right)}{\Gamma(\theta+m+1)}}{J\binom{m}{c}\, \frac{\Gamma\left(\frac{\theta}{J} + c + 1\right)\Gamma\left(\theta - \frac{\theta}{J} + m - c\right)}{\Gamma(\theta+m+1)}} = \frac{\theta}{J}\, \frac{(c-l+1)_{(l)}}{\left(\frac{\theta}{J} + c - l\right)_{(l+1)}}, \qquad (15)$$

which follows directly from the (regular) conditional distribution (10), whose denominator and numerator are replaced by Equation (12) and Equation (14), respectively. The proof is completed. ∎

The next proposition is an interesting complement to Theorem 1: i) it provides a novel (constructive) representation of the posterior distribution (11) in terms of a mixture of Binomial distributions; ii) it characterizes the large $c_n$ asymptotic behaviour of the posterior distribution (11). Hereafter, we denote by $\overset{w}{\longrightarrow}$ convergence in distribution, or weak convergence.

**Proposition 2** *Let $B_{a,b}$ be a Beta random variable with parameter $(a, b)$, and denote by $f_{B_{a,b}}$ its density function. Under the setting of Theorem 1, if $F_{X_{m+1}}$ is a random variable distributed as (11), then, as $c_n \to +\infty$,*

$$\frac{F_{X_{m+1}}}{c_n} \overset{w}{\longrightarrow} B_{1, \theta/J}. \qquad (16)$$

*Moreover, if $\mathrm{Binomial}(n, p)$ denotes the Binomial distribution with parameter $(n, p)$, then for $l = 0, 1, \ldots, c_n$,*

$$\Pr[f_{X_{m+1}} = l \mid C_{n,h_n(X_{m+1})} = c_n] = \int_0^1 \mathrm{Binomial}(l; c_n, p)\, f_{B_{1, \theta/J}}(p)\, \mathrm{d}p. \qquad (17)$$

See Appendix A for the proof of Proposition 2. Let $F_{X_{m+1}}$ be a random variable whose distribution coincides with the posterior distribution (11). Then, Equation (17) shows that the distribution of $F_{X_{m+1}}$ is a mixture of Binomial distributions, with the mixing distribution on the success probability being a Beta distribution with parameter $(1, \theta/J)$. That is, the posterior distribution (11) admits a straightforward representation in terms of a Beta-Binomial distribution with parameter $(c_n, 1, \theta/J)$ (Johnson et al., 2005, Chapter 6). Moreover, Equation (16) shows that the mixing distribution is precisely the limiting distribution of the proportion $c_n^{-1} F_{X_{m+1}}$ as $c_n \to +\infty$. Then, according to Proposition 2, we can write $F_{X_{m+1}} = \sum_{1 \leq i \leq c_n} Z_i$, where, by means of de Finetti's representation theorem, $(Z_i)_{i \geq 1}$ is an exchangeable sequence of Bernoulli random variables with de Finetti measure being the Beta distribution with parameter $(1, \theta/J)$.
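A quick numerical check of the Beta-Binomial representation (17) against the closed form (11) can be done in a few lines of Python; scipy.stats.betabinom implements exactly the Binomial mixture with a Beta mixing distribution. This is our own sanity check, not part of the paper.

```python
import numpy as np
from scipy.stats import betabinom
from scipy.special import gammaln

def posterior_closed_form(c, theta, J):
    # Eq. (11): Pr[f = l | C = c] = (theta/J) (c-l+1)_(l) / (theta/J + c-l)_(l+1)
    t = theta / J
    ls = np.arange(c + 1)
    log_p = (np.log(t)
             + gammaln(c + 1) - gammaln(c - ls + 1)          # log (c-l+1)_(l)
             - (gammaln(t + c + 1) - gammaln(t + c - ls)))   # log (t+c-l)_(l+1)
    return np.exp(log_p)

c, theta, J = 25, 100.0, 128
closed = posterior_closed_form(c, theta, J)
mixture = betabinom(c, 1.0, theta / J).pmf(np.arange(c + 1))  # Eq. (17)
print(np.abs(closed - mixture).max())  # approx. 0, up to floating-point error
```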
Concerning the large $m$ behaviour, Berry-Esseen estimates for the vicinity of the (rescaled) law of $F_{X_{m+1}}$ to the aforesaid de Finetti measure are contained in Dolera and Favaro (2020b), while other similar estimates can be found in Dolera (2013). Then, for a collection of hash functions $h_1, \ldots, h_N$, one may combine Proposition 2 with Equation (8) to obtain an alternative representation, in terms of a product of Beta-Binomial distributions, of the posterior distribution of $f_{X_{m+1}}$, given $\{C_{n,h_n(X_{m+1})}\}_{n \in [N]}$. Although Proposition 2 does not have a direct impact on the implementation of the CMS-DP, in the sense of improving its computation, we believe it is of interest in shedding light on distributional properties of the posterior distribution of $f_{X_{m+1}}$, given $C_{n,h_n(X_{m+1})}$.

## 3. A learning-augmented CMS under power-law streams

Our Bayesian proof relies on fewer assumptions on the prior than the constructive proof of Cai et al. (2018), and it paves the way to extending the BNP approach of Cai et al. (2018) to the PYP prior. The PYP prior is arguably the most popular generalization of the DP prior, with a flexible tail behaviour that ranges from geometric tails to heavy power-law tails (Pitman, 2006, Chapter 3). In this section, we apply the PYP prior to develop a novel learning-augmented CMS under power-law streams of tokens. In particular, we assume the stream $x_{1:m}$ to be modeled as a random sample from an unknown discrete distribution $P$, which is endowed with a PYP prior. Within the class of (discrete) nonparametric priors with power-law tail behaviour, the PYP prior stands out for its mathematical tractability, flexibility and interpretability, and therefore it is the natural candidate for applications (De Blasi et al., 2015; Bacallado et al., 2017). We recall that the PYP has neither a restriction property nor a sufficientness postulate analogous to those featured by the DP, and therefore the constructive proof of Cai et al. (2018) cannot be applied to obtain the posterior distribution of a point query. Moreover, we recall that the PYP does not feature a finite-dimensional projective property analogous to that of the DP, and therefore the prior's parameters cannot be estimated through an empirical Bayes procedure, as discussed in Cai et al. (2018), or through a hierarchical (fully) Bayes procedure. In this section, we adapt the Bayesian proof of Section 2 in order to compute the posterior distribution of the point query $f_{X_{m+1}}$, given $\{C_{n,h_n(X_{m+1})}\}_{n \in [N]}$, under a PYP prior. Then, we exploit the predictive distribution of the PYP prior to implement a likelihood-free approach to estimate the PYP prior's parameters. Our work leads to a generalization of the CMS-DP, which is referred to as the CMS-PYP, providing a novel learning-augmented CMS under power-law streams.

### 3.1 The CMS-PYP

#### 3.1.1 The PYP prior

A simple and intuitive definition of the PYP follows from its stick-breaking construction (Perman et al., 1992; Pitman, 1995; Pitman and Yor, 1997). For $\alpha \in [0, 1)$ and $\theta > -\alpha$ let: i) $(B_i)_{i \geq 1}$ be independent random variables such that $B_i$ is distributed as a Beta distribution with parameter $(1 - \alpha, \theta + i\alpha)$; ii) $(V_i)_{i \geq 1}$ be random variables, independent of $(B_i)_{i \geq 1}$, and i.i.d. from a non-atomic distribution $\nu$ on $\mathbb{V}$. Then, define $P_1 = B_1$ and $P_j = B_j \prod_{1 \leq i \leq j-1}(1 - B_i)$ for $j \geq 2$, in such a way that $\sum_{i \geq 1} P_i = 1$ almost surely.
The (discrete) random probability measure $P = \sum_{j \geq 1} P_j \delta_{V_j}$ is a PYP on $\mathbb{V}$ with (base) distribution $\nu$, discount parameter $\alpha$ and mass parameter $\theta$. For short, we write $P \sim \mathrm{PYP}(\alpha, \theta; \nu)$. We refer to Perman et al. (1992) and Pitman and Yor (1997) for an alternative definition of the PYP through a suitable transformation of the $\alpha$-stable completely random measure (Kingman, 1993). See also Pitman (2006, Chapter 4) and references therein. The DP arises as a special case of the PYP by setting $\alpha = 0$.

For the purposes of the present paper, it is useful to recall the power-law tail behaviour featured by the PYP prior. In particular, let $P \sim \mathrm{PYP}(\alpha, \theta; \nu)$ with $\alpha \in (0, 1)$, and let $(P_{(j)})_{j \geq 1}$ be the decreasingly ordered random probabilities $P_j$'s of $P$ (Pitman, 2006, Chapter 3). Then, as $j \to +\infty$ the $P_{(j)}$'s follow a power-law distribution of exponent $c = \alpha^{-1}$ (Pitman and Yor, 1997). That is, $\alpha \in (0, 1)$ controls the power-law tail behaviour of the PYP through the small probabilities $P_{(j)}$'s: the larger $\alpha$, the heavier the tail of $P$. See also Gnedin et al. (2007, Section 10) for a detailed account of the tail behaviour of the PYP prior.

As for the DP, the discreteness of $P \sim \mathrm{PYP}(\alpha, \theta; \nu)$ implies that a random sample $X_{1:m} = (X_1, \ldots, X_m)$ from $P$ induces a random partition of the set $\{1, \ldots, m\}$ into $1 \leq K_m \leq m$ partition subsets, labelled by distinct types $\mathbf{v} = \{v_1, \ldots, v_{K_m}\}$, with corresponding frequencies $(N_{1,m}, \ldots, N_{K_m,m})$ such that $1 \leq N_{i,m} \leq m$ and $\sum_{1 \leq i \leq K_m} N_{i,m} = m$. For $1 \leq l \leq m$ let $M_{l,m}$ be the number of distinct types with frequency $l$, i.e. $M_{l,m} = \sum_{1 \leq i \leq K_m} \mathbf{1}_{\{N_{i,m}\}}(l)$, such that $\sum_{1 \leq l \leq m} M_{l,m} = K_m$ and $\sum_{1 \leq l \leq m} l M_{l,m} = m$. The distribution of $\mathbf{M}_m$ is

$$\Pr[\mathbf{M}_m = \mathbf{m}] = \frac{\left(\frac{\theta}{\alpha}\right)_{(k)}}{(\theta)_{(m)}}\, m! \prod_{i=1}^{m} \left(\frac{\alpha (1-\alpha)_{(i-1)}}{i!}\right)^{m_i} \frac{1}{m_i!}\, \mathbf{1}_{\mathcal{M}_{m,k}}(\mathbf{m}), \qquad (18)$$

and

$$\Pr[K_m = k] = \frac{\left(\frac{\theta}{\alpha}\right)_{(k)}}{(\theta)_{(m)}}\, \mathscr{C}(m, k; \alpha) \qquad (19)$$

for $k = 1, \ldots, m$, where $\mathscr{C}(m, k; \alpha) = (k!)^{-1} \sum_{0 \leq i \leq k} \binom{k}{i} (-1)^i (-i\alpha)_{(m)}$ denotes the generalized factorial coefficient (Charalambides, 2005), with the proviso that $\mathscr{C}(0, 0; \alpha) = 1$ and $\mathscr{C}(m, 0; \alpha) = 0$. See Pitman (2006, Chapter 3) for details on (18) and (19). Now, let $\mathbf{v}_l = \{v_i \in \mathbf{v} : N_{i,m} = l\}$, i.e. the labels of types with frequency $l$, and let $\mathbf{v}_0 = \mathbb{V} \setminus \mathbf{v}$, i.e. the labels of types not belonging to $\mathbf{v}$. The predictive distribution induced by $P \sim \mathrm{PYP}(\alpha, \theta; \nu)$ is

$$\Pr[X_{m+1} \in \mathbf{v}_l \mid X_{1:m}] = \Pr[X_{m+1} \in \mathbf{v}_l \mid \mathbf{M}_m = \mathbf{m}] = \begin{cases} \dfrac{\theta + k\alpha}{\theta+m} & \text{if } l = 0,\\[4pt] \dfrac{(l - \alpha)\, m_l}{\theta+m} & \text{if } l \geq 1,\end{cases} \qquad (20)$$

for $m \geq 1$, where $k = \sum_{1 \leq l \leq m} m_l$. In particular, the PYP prior is characterized as the sole (discrete) nonparametric prior for which: i) the conditional probability that $X_{m+1}$ belongs to $\mathbf{v}_0$, given $X_{1:m}$, depends on $X_{1:m}$ only through $m$ and $K_m$; ii) the conditional probability that $X_{m+1}$ belongs to $\mathbf{v}_l$, given $X_{1:m}$, depends on $X_{1:m}$ only through $m$ and $M_{l,m}$ (Bacallado et al., 2017, Proposition 1).

At the sampling level, the power-law tail behaviour of $P \sim \mathrm{PYP}(\alpha, \theta; \nu)$ emerges from the analysis of the large $m$ asymptotic behaviour of $K_m$ and $M_{r,m}/K_m$ (Pitman, 2006, Chapter 3). Let $X_{1:m}$ be a random sample from $P$. Pitman (2006, Theorem 3.8) shows that, as $m \to +\infty$,

$$\frac{K_m}{m^{\alpha}} \overset{a.s.}{\longrightarrow} S^{-\alpha}_{\alpha,\theta}, \qquad (21)$$

where $S_{\alpha,\theta}$ is a polynomially tilted $\alpha$-stable random variable, that is the distribution of $S_{\alpha,\theta}$ has density function $f_{S_{\alpha,\theta}}(x) \propto x^{-\theta} g_{\alpha}(x) \mathbf{1}_{\mathbb{R}_+}(x)$, for $g_{\alpha}$ being the positive $\alpha$-stable density function. According to (21), it holds $K_m \approx m^{\alpha} S^{-\alpha}_{\alpha,\theta}$ for large $m$, or equivalently $K_m \approx [(\theta+m)^{\alpha} - \theta^{\alpha}]\, S^{-\alpha}_{\alpha,\theta}$ for large $m$ (Favaro et al., 2009). It follows from (21) that, as $m \to +\infty$,

$$\frac{M_{l,m}}{K_m} \overset{a.s.}{\longrightarrow} \frac{\alpha (1-\alpha)_{(l-1)}}{l!}. \qquad (22)$$

Equation (21) shows that the number $K_m$ of distinct types in $X_{1:m}$, for large $m$, grows as $m^{\alpha}$. This is precisely the growth of the number of distinct types in $m \geq 1$ random samples from a power-law distribution of exponent $c = \alpha^{-1}$.
Moreover, Equation (22) shows that $p_{\alpha,l} = \alpha(1-\alpha)_{(l-1)}/l!$ is the large $m$ asymptotic proportion of the number of distinct types with frequency $l$ in $X_{1:m}$. Then, it holds $p_{\alpha,l} \sim c_{\alpha}\, l^{-\alpha-1}$ for large $l$, for a constant $c_{\alpha}$. This is precisely the distribution of the number of distinct types with frequency $l$ in $m \geq 1$ random samples from a power-law distribution of exponent $c = \alpha^{-1}$. See Figure 1 for an illustration of the large $m$ behaviour of $K_m$ and $M_{l,m}$ under the PYP prior, for some choices of the parameter $(\alpha, \theta)$.

[Figure 1: Behaviour in $m$ of the statistics $K_m$ and $M_{r,m}/K_m$, for $1 \leq m \leq 10^4$, under $P \sim \mathrm{PYP}(\alpha, \theta; \nu)$: $\alpha = 0$, $\alpha = 0.25$, $\alpha = 0.5$ and $\alpha = 0.75$.]

#### 3.1.2 The CMS-PYP

To introduce the CMS-PYP, we assume that the stream $x_{1:m}$ is modeled as a random sample $X_{1:m}$ from an unknown discrete distribution $P$, which is endowed with a PYP prior. That is,

$$X_{1:m} \mid P \overset{\text{iid}}{\sim} P, \qquad P \sim \mathrm{PYP}(\alpha, \theta; \nu), \qquad (23)$$

for $m \geq 1$. Let $h_1, \ldots, h_N$ be a collection of random hash functions that are i.i.d. from the strong universal family $\mathcal{H}$, and assume that $h_1, \ldots, h_N$ are independent of $X_{1:m}$ for any $m \geq 1$; in particular, by de Finetti's representation theorem, $h_1, \ldots, h_N$ are independent of $P \sim \mathrm{PYP}(\alpha, \theta; \nu)$. Under the CMS-PYP the $X_i$'s are hashed through $h_1, \ldots, h_N$, thus creating $\{(C_{n,1}, \ldots, C_{n,J})\}_{n \in [N]}$, and estimates of the point query $f_{X_{m+1}}$, with $X_{m+1}$ being of an arbitrary type $v \in \mathbb{V}$, are obtained as functionals of the posterior distribution of $f_{X_{m+1}}$ given $\{C_{n,h_n(X_{m+1})}\}_{n \in [N]}$. As for the derivation of the CMS-DP in Section 2, the assumption of independence between the $h_n$'s and $X_{1:m}$ plays a critical role in obtaining the posterior distribution of $f_{X_{m+1}}$ given $\{C_{n,h_n(X_{m+1})}\}_{n \in [N]}$; that is, it allows us to treat the $h_n$'s as if they were fixed, i.e. non-random hash functions. For a single hash function $h_n$, in the next theorem we provide a rigorous derivation of the posterior distribution of $f_{X_{m+1}}$, given $C_{n,h_n(X_{m+1})}$.

**Theorem 3** *For $m \geq 1$, let $x_{1:m}$ be a stream of tokens that are modeled as a random sample $X_{1:m}$ from $P \sim \mathrm{PYP}(\alpha, \theta; \nu)$, and let $X_{m+1}$ be an additional random sample from $P$. Moreover, let $h_n$ be a random hash function distributed as the strong universal family $\mathcal{H}$, and let $h_n$ be independent of $X_{1:m}$ for any $m \geq 1$, that is $h_n$ is independent of $P$. Then, for $l = 0, 1, \ldots, c_n$,*

$$p_{f_{X_{m+1}}}(l; m, c_n, \alpha, \theta) := \Pr[f_{X_{m+1}} = l \mid C_{n,h_n(X_{m+1})} = c_n] \qquad (24)$$
$$= \frac{\theta}{J} \binom{c_n}{l} (1-\alpha)_{(l)}\; \frac{\sum_{i=0}^{c_n-l} \sum_{j=0}^{m-c_n} \left(\frac{\theta+\alpha}{\alpha}\right)_{(i+j)} \frac{(J-1)^j}{J^{i+j}}\, \mathscr{C}(c_n-l, i; \alpha)\, \mathscr{C}(m-c_n, j; \alpha)}{\sum_{i=0}^{c_n+1} \sum_{j=0}^{m-c_n} \left(\frac{\theta}{\alpha}\right)_{(i+j)} \frac{(J-1)^j}{J^{i+j}}\, \mathscr{C}(c_n+1, i; \alpha)\, \mathscr{C}(m-c_n, j; \alpha)}.$$

See Appendix B for the proof of Theorem 3. Theorem 3 provides an extension of Theorem 1 to the more general BNP model (23); in particular, Theorem 1 can be recovered from Theorem 3 by setting $\alpha = 0$. See Appendix C for details. The proof of Theorem 3 relies on two properties of the distribution (18) of the random partition $\mathbf{M}_m$ induced by $P \sim \mathrm{PYP}(\alpha, \theta; \nu)$: P1) the distribution of $\mathbf{M}_m$ is of Gibbs type (Pitman, 2003; Gnedin and Pitman, 2006), i.e.

$$\Pr[\mathbf{M}_m = \mathbf{m}] = V_{m,k}\, m! \prod_{i=1}^{m} \left(\frac{(1-\alpha)_{(i-1)}}{i!}\right)^{m_i} \frac{1}{m_i!}\, \mathbf{1}_{\mathcal{M}_{m,k}}(\mathbf{m}), \qquad (25)$$

where $(V_{m,k})_{m \geq 1,\, 1 \leq k \leq m}$ forms a collection of nonnegative (deterministic) weights that satisfy the triangular recursion $V_{m,k} = V_{m+1,k}(m - \alpha k) + V_{m+1,k+1}$, with the proviso $V_{1,1} := 1$; P2) for some constant $c > -\alpha$, the $V_{m,k}$'s are such that for any $m \geq 1$ and any $k \leq m$ it holds that

$$\frac{V_{m+1,k+1}}{V_{m,k}} = \frac{c + k\alpha}{c + m}. \qquad (26)$$
In particular, within the proof of Theorem 3, P2) is critical to exploit the quasi-conjugacy property of the PYP prior (Lijoi et al., 2008). While P1) holds true for a broad class of nonparametric priors, which are referred to as Gibbs-type priors (Pitman, 2003; Gnedin and Pitman, 2006; Pitman, 2006; De Blasi et al., 2015; Bacallado et al., 2017), P2) holds true only for the PYP prior. That is, the PYP prior is characterized as the sole Gibbs-type prior for which P2) holds true (Zabell, 1997; Bacallado et al., 2017). Therefore, we expect that an extension of Theorem 3 to Gibbs-type priors may require a non-trivial effort to overcome the lack of P2), though the general strategy underlying the Bayesian proof remains valid.

For $\alpha \in [0, 1)$, an alternative expression for (24) may be given in terms of the distribution (19) of the number $K_m$ of distinct types in a random sample from the PYP. If $c_n > 0$, then for $l = 0, 1, \ldots, c_n$,

$$p_{f_{X_{m+1}}}(l; m, c_n, \alpha, \theta) = \frac{\theta}{J} \binom{c_n}{l} (1-\alpha)_{(l)}\, \frac{(\theta)_{(c_n-l)}}{(\theta)_{(c_n+1)}}\; \frac{\mathrm{E}\!\left[\dfrac{\left(\frac{\theta+\alpha}{\alpha}\right)_{(K_{c_n-l}+K_{m-c_n})}}{\left(\frac{\theta}{\alpha}\right)_{(K_{c_n-l})} \left(\frac{\theta}{\alpha}\right)_{(K_{m-c_n})}}\, \dfrac{1}{J^{K_{c_n-l}}} \left(1 - \dfrac{1}{J}\right)^{K_{m-c_n}}\right]}{\mathrm{E}\!\left[\dfrac{\left(\frac{\theta}{\alpha}\right)_{(K_{c_n+1}+K_{m-c_n})}}{\left(\frac{\theta}{\alpha}\right)_{(K_{c_n+1})} \left(\frac{\theta}{\alpha}\right)_{(K_{m-c_n})}}\, \dfrac{1}{J^{K_{c_n+1}}} \left(1 - \dfrac{1}{J}\right)^{K_{m-c_n}}\right]}, \qquad (27)$$

with the proviso that $K_0 = 0$, where $K_{c_n-l}$ and $K_{m-c_n}$ in the numerator of (27) are independent random variables for any $l = 0, 1, \ldots, c_n - 1$, and $K_{c_n+1}$ and $K_{m-c_n}$ in the denominator of (27) are independent random variables. See Appendix D for the proof of Equation (27). Equation (27) gives a probabilistic representation of the posterior distribution (24), whose critical terms are the expected value of a suitable functional of $(K_{c_n-l}, K_{m-c_n})$, i.e. the numerator of (27), and the expected value of a suitable functional of $(K_{c_n+1}, K_{m-c_n})$, i.e. the denominator of (27). See Appendix E for a further alternative expression of (24), which is in terms of exponentially tilted $\alpha$-stable random variables (Zolotarev, 1986).

Figure 2 shows the shape behaviour of the posterior distribution (27) for different values of $(\alpha, \theta)$, keeping $m$, $J$ and $c_n$ fixed. For $\alpha = 0$, i.e. under the DP prior, Cai et al. (2018) showed that the posterior distribution of $f_{X_{m+1}}$, given $C_{n,h_n(X_{m+1})}$, is monotonically decreasing or increasing. Under the PYP, the additional parameter $\alpha \in (0, 1)$ allows for a more flexible shape behaviour.

**Remark 4** *Equation (27) is useful for the numerical evaluation of the posterior distribution (24), since it avoids numerical issues that arise in evaluating the generalized factorial coefficients. In particular, (27) allows for a Monte Carlo (MC) evaluation of (24), which requires sampling the random variable $K_m$, for suitable choices of $m$. Sampling $K_m$ is straightforward, and it exploits the predictive probabilities of the PYP. In particular, it follows from (20) that, if $\mathrm{Bernoulli}(p)$ denotes the Bernoulli distribution with parameter $p \in (0, 1)$, then sampling $K_m$ reduces to sampling $m - 1$ Bernoulli random variables. See Algorithm 1 in Section 4.*

Under the PYP prior, Theorem 3 shows that the posterior distribution of $f_{X_{m+1}}$, given $C_{n,h_n(X_{m+1})}$, depends on the sampling information through $c_n$ and $m$. This is a critical difference with respect to the DP prior, for which Theorem 1 shows that the posterior distribution of $f_{X_{m+1}}$, given $C_{n,h_n(X_{m+1})}$, depends on the sampling information only through $c_n$. Therefore, under the PYP prior, one may consider different large $m$ asymptotic behaviours for the posterior distribution (24). Here, we start by considering a local limit theorem for (24) as $m \to +\infty$, while $c_n$ stays fixed. Under the setting of Theorem 3, for any $l = 0, 1, \ldots, c_n$, it holds that
$$p(l; c_n, \alpha, \theta) := \lim_{m \to +\infty} p_{f_{X_{m+1}}}(l; m, c_n, \alpha, \theta) = \binom{c_n}{l}\, \frac{(1-\alpha)_{(l)}\, (\theta + 2\alpha)_{(c_n-l)}}{(\theta + \alpha + 1)_{(c_n)}} \qquad (28)$$

and

$$p(l; c_n, \alpha, \theta) = \int_0^1 \mathrm{Binomial}(l; c_n, p)\, f_{B_{1-\alpha, \theta+2\alpha}}(p)\, \mathrm{d}p, \qquad (29)$$

where $f_{B_{a,b}}$ is the density function of the distribution of a Beta random variable $B_{a,b}$. See Appendix F for the proof of Equation (28) and Equation (29).

The next proposition is an interesting complement to Theorem 3, providing the large $m$ asymptotic behaviour of the posterior distribution (24). In particular, we consider $m \to +\infty$ and $c_n \to +\infty$ under the assumption that $c_n = \lambda m$ for some choice of $\lambda \in (0, 1)$. Such a large $m$ asymptotic behaviour is in line with the large $c_n$ asymptotic behaviour presented in Proposition 2 under the DP prior.

**Proposition 5** *For $\alpha \in (0, 1)$ and $c \geq 0$ let $S_{\alpha,c}$ be a polynomially tilted $\alpha$-stable random variable, i.e. the distribution of $S_{\alpha,c}$ has density function $f_{S_{\alpha,c}}(x) \propto x^{-c} g_{\alpha}(x) \mathbf{1}_{\mathbb{R}_+}(x)$, for $g_{\alpha}$ being the positive $\alpha$-stable density function; moreover, set $Z_{\alpha,\theta+\alpha} = (J-1)^{1/\alpha} S_{\alpha,0}/S_{\alpha,\theta+\alpha}$ and $W_{\alpha,\theta} = (J-1)^{1/\alpha} S_{\alpha,0}/S_{\alpha,\theta}$, with $S_{\alpha,0}$ being independent of $S_{\alpha,\theta+\alpha}$ and of $S_{\alpha,\theta}$, and denote by $f_{Z_{\alpha,\theta+\alpha}}$ and $f_{W_{\alpha,\theta}}$ the density functions of the distributions of $Z_{\alpha,\theta+\alpha}$ and $W_{\alpha,\theta}$, respectively. Under the setting of Theorem 3, let $F_{X_{m+1}}$ be a random variable with distribution (24). As $m \to +\infty$ and under the large $m$ asymptotic regime $c_n = \lambda m$, for some choice of $\lambda \in (0, 1)$,*

$$\frac{F_{X_{m+1}}}{c_n} \overset{w}{\longrightarrow} B^{(\lambda)}_{1-\alpha, \theta+\alpha}, \qquad (30)$$

*where $B^{(\lambda)}_{1-\alpha, \theta+\alpha}$ is a random variable whose distribution has density function of the following form:*

$$f_{B^{(\lambda)}_{1-\alpha,\theta+\alpha}}(x) = \frac{\Gamma(\theta+1)}{\Gamma(\theta+\alpha)\Gamma(1-\alpha)}\; \frac{f_{Z_{\alpha,\theta+\alpha}}\!\left(\frac{\lambda^{-1}-1}{1-x}\right)}{f_{W_{\alpha,\theta}}(\lambda^{-1}-1)}\; x^{(1-\alpha)-1} (1-x)^{(\theta+\alpha)-1}\, \mathbf{1}_{(0,1)}(x).$$

See Appendix G for the proof of Proposition 5.

[Figure 2: Posterior distribution of $f_{X_{m+1}}$ given $C_{n,h_n(X_{m+1})} = c_n$ under $P \sim \mathrm{PYP}(\alpha, \theta; \nu)$: $m = 1000$, $J = 50$ and $c_n = 20$.]

As in the context of the DP prior discussed in Section 2, Proposition 5 shows that the posterior distribution of $f_{X_{m+1}}$ given $C_{n,h_n(X_{m+1})}$ admits a representation in terms of a mixture of Binomial distributions. In particular, Proposition 5 may be viewed as the natural counterpart of Proposition 2, though the resulting mixing distribution is not as simple as the Beta distribution of Proposition 2.

For the collection of hash functions $h_1, \ldots, h_N$, the posterior distribution of $f_{X_{m+1}}$, given $\{C_{n,h_n(X_{m+1})}\}_{n \in [N]}$, follows from Theorem 3 by means of the assumption that the $h_n$'s are i.i.d. according to the strong universal family $\mathcal{H}$. In particular, by a direct application of Bayes' theorem, straightforward calculations show that for $l = 0, 1, \ldots, \min\{c_1, \ldots, c_N\}$ it holds that

$$\Pr[f_{X_{m+1}} = l \mid \{C_{n,h_n(X_{m+1})}\}_{n \in [N]} = \{c_n\}_{n \in [N]}] = \frac{\prod_{n \in [N]} p_{f_{X_{m+1}}}(l; m, c_n, \alpha, \theta)}{\left(p_{f_{X_{m+1}}}(l; m, \alpha, \theta)\right)^{N-1}}, \qquad (31)$$

where

$$p_{f_{X_{m+1}}}(l; m, \alpha, \theta) := \Pr[f_{X_{m+1}} = l] = \binom{m}{l}\, \frac{(1-\alpha)_{(l)}\, (\theta+\alpha)_{(m-l)}}{(\theta+1)_{(m)}} = \int_0^1 \mathrm{Binomial}(l; m, p)\, f_{B_{1-\alpha, \theta+\alpha}}(p)\, \mathrm{d}p$$

for $l = 0, 1, \ldots, m$, where $f_{B_{1-\alpha,\theta+\alpha}}$ denotes the density function of the distribution of a Beta random variable with parameter $(1-\alpha, \theta+\alpha)$. See Appendix H for the proof of Equation (31). CMS-PYP estimates of the point query $f_{X_{m+1}}$, with respect to a suitable choice of a loss function, are obtained as functionals of the posterior distribution (31), e.g. posterior mode, mean and median. In general, the numerical evaluation of the posterior distribution (31), as well as the evaluation of its alternative expression in terms of (27), requires care in order to achieve numerical stability and efficiency, as discussed in the last part of this section. To apply (31), it remains to estimate the prior's parameter $(\alpha, \theta)$.
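Before turning to the estimation of $(\alpha, \theta)$, the following Python sketch illustrates how (31) can be evaluated in the large-$m$ regime, replacing each exact single-hash term (24) with its limit (28); this approximation is adopted later only when $m$ is large relative to the $c_n$'s, and all function names here are ours.

```python
import numpy as np
from scipy.special import gammaln

def log_rising(a, n):
    # log of the rising factorial (a)_(n), with (a)_(0) = 1
    return gammaln(a + n) - gammaln(a)

def log_binom(n, k):
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

def log_p_limit(ls, c, alpha, theta):
    # Eq. (28): large-m limit of the single-hash CMS-PYP posterior
    return (log_binom(c, ls) + log_rising(1 - alpha, ls)
            + log_rising(theta + 2 * alpha, c - ls)
            - log_rising(theta + alpha + 1, c))

def log_p_marginal(ls, m, alpha, theta):
    # marginal of f_{X_{m+1}} under the PYP, cf. the display following Eq. (31)
    return (log_binom(m, ls) + log_rising(1 - alpha, ls)
            + log_rising(theta + alpha, m - ls)
            - log_rising(theta + 1, m))

def cms_pyp_posterior(counts, m, alpha, theta):
    # Eq. (31), with single-hash terms replaced by their limits (28); computed
    # in log-space and normalized at the end as a numerical safeguard.
    ls = np.arange(min(counts) + 1)
    lp = sum(log_p_limit(ls, c, alpha, theta) for c in counts)
    lp -= (len(counts) - 1) * log_p_marginal(ls, m, alpha, theta)
    p = np.exp(lp - lp.max())
    return p / p.sum()

post = cms_pyp_posterior(counts=[12, 9], m=10_000_000, alpha=0.6, theta=10.0)
f_pyp = float(np.arange(post.size) @ post)  # posterior mean estimate
```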
For ease of exposition, we denote by $C$ the $N \times J$ matrix with entries $C_{n,j}$ for $n \in [N]$ and $j \in [J]$. Assuming that the matrix $C$ has been computed from $m$ tokens, the sum of the entries of each row of $C$ is equal to the sample size $m$. Since the PYP does not have a finite-dimensional projective property analogous to that of the DP, under the BNP model (23) the distribution of $C$ is not available in closed form. Hence, the prior's parameter $(\alpha, \theta)$ cannot be estimated following the empirical Bayes approach adopted by Cai et al. (2018) in the context of the DP prior. Instead, here we estimate $(\alpha, \theta)$ by relying on the minimum Wasserstein distance method (Bernton et al., 2019). This method estimates $(\alpha, \theta)$ by selecting the value of $(\alpha, \theta)$ that minimizes the expected Wasserstein distance between a summary statistic of the data and the corresponding summary statistic of synthetic data generated under the BNP model (23). In our context, a natural choice for the summary statistic is the matrix $C$. By construction, the rows of $C$ are i.i.d.; moreover, since $\mathcal{H}$ is assumed to be a perfectly random hash family, each column of $C$ is exchangeable. Then, we can define the reference summary statistic as a vector of length $NJ$ containing the (unordered) entries of the matrix $C$.

For any fixed $m' \geq 1$ and any fixed prior's parameter $(\alpha, \theta)$, let $\tilde{X}_{1:m'} = (\tilde{X}_1, \ldots, \tilde{X}_{m'})$ be a random sample from $P \sim \mathrm{PYP}(\alpha, \theta; \nu)$, i.e. $\tilde{X}_{1:m'}$ is modeled as (23). For a moderate sample size $m'$, generating random variates from $\tilde{X}_{1:m'}$ is straightforward by means of the predictive distribution (20) of the PYP. These random variates, by a direct transformation through the hash functions $h_1, \ldots, h_N$ drawn at random from $\mathcal{H}$, lead to random variates from the hashed frequencies, and hence to random variates from the reference summary statistic, denoted by $\tilde{C}(\alpha, \theta, m')$. In practice, $m'$ is such that $m' \ll m$, since the computational cost of sampling from (20) scales super-linearly in $m'$. To account for this mismatch we scale the entries of $\tilde{C}(\alpha, \theta, m')$ by $m/m'$, so that each row of $\tilde{C}(\alpha, \theta, m')\, m/m'$ sums to $m$. Now, we are interested in finding $(\hat{\alpha}, \hat{\theta})$ such that

$$(\hat{\alpha}, \hat{\theta}) = \underset{(\alpha, \theta)}{\arg\min}\; \mathrm{E}\left[W_1\!\left(C,\ \tilde{C}(\alpha, \theta, m')\, \frac{m}{m'}\right)\right], \qquad (32)$$

where $W_1$ is the Wasserstein distance of order 1, and the expectation is taken with respect to $\tilde{C}$. To fully specify the optimization problem we choose $\rho(x, y) = |x - y|$ as the distance underlying $W_1$ (Bernton et al., 2019). We use an MC approximation of the expectation in (32), i.e.

$$\frac{1}{R} \sum_{r=1}^{R} W_1\!\left(C,\ \tilde{C}_r(\alpha, \theta, m')\, \frac{m}{m'}\right) \qquad (33)$$

for $R \geq 1$, where $(\tilde{C}_1(\alpha, \theta, m'), \ldots, \tilde{C}_R(\alpha, \theta, m'))$ are i.i.d. according to $\tilde{C}(\alpha, \theta, m')$. We refer to Bernton et al. (2019) for a theoretical and empirical analysis of the minimum Wasserstein distance method. To improve the MC approximation displayed in (33), whose variance might be detrimental for the minimization problem in (32), we fix the same random numbers underlying the routines used for generating random variates from the predictive distribution (20) of the PYP over all values of $(\alpha, \theta)$. Moreover, the optimization is carried out via noise-robust Bayesian optimization (Letham et al., 2019). We report experimental results in Section 4.

### 3.2 Computational aspects of the CMS-PYP

For the CMS-PYP estimator of $f_{x_{m+1}}$ we consider the posterior mean $\hat{f}^{(\mathrm{PYP})}$, that is the expected value of the posterior distribution (31).
The evaluation of $\hat{f}^{(\mathrm{PYP})}$ follows from two steps: i) the estimation of the prior's parameter $(\alpha, \theta)$ by means of the minimum Wasserstein distance method; ii) the evaluation, with respect to the estimated prior's parameter, of the posterior distribution (31). Step i) has been described above. Step ii) can be implemented either via the exact representation in (27) or via its limiting behaviour in (28), which is accurate provided that the total number of observed tokens $m$ is large relative to the considered $c_n$. This is often the case, especially for real-world large datasets, where applying the CMS in any of its variants is most warranted. In our numerical experiments we consider datasets whose total observed tokens range from 2 million to almost 1 billion.

The evaluation of (27) requires the computation of multiple expectations, one for each $l = 0, \ldots, c_n$, which we approximate via MC integration. For each MC estimator to be valid, it is necessary to sample each $K_{c_n-l}$ independently from $K_{m-c_n}$ in each expectation term. However, the MC estimators themselves, one for each $l = 0, \ldots, c_n$, can be correlated. One sample for all the MC estimators can thus be obtained as follows: i) Algorithm 1 is used to sample the vector $[K_{c_n-l} \mid l = 0, \ldots, c_n]$ in one pass with $O(c_n)$ cost; ii) $K_{m-c_n}$ is sampled from the distribution of $[(\theta + m - c_n)^{\alpha} - \theta^{\alpha}]\, S^{-\alpha}_{\alpha,\theta}$, where $S_{\alpha,\theta}$ is a polynomially tilted $\alpha$-stable random variable. Sampling from $S_{\alpha,\theta}$ can be achieved efficiently by using rejection sampling, as described in Devroye (2009). The convergence of $K_{m-c_n}$ to its limiting distribution is fast in $m - c_n$, as illustrated in Figure 3.

Algorithm 1: sampling $K_{c-l}$ for $l = 0, \ldots, c$
```
K[0] ← 0; K[1] ← 1; i ← 1
while i < c do
    Ber ← random sample from Bernoulli((θ + α K[i]) / (θ + i))
    K[i+1] ← K[i] + Ber
    i ← i + 1
end while
return K[c], K[c-1], ..., K[1]
```

To ensure numerical stability with both (27) and (28) we work in log-space, i.e. we compute the (natural) logarithm of each multiplicative term of (31), and exponentiate back only as the final computation. Similarly, to avoid underflow/overflow issues, we apply the log-sum-exp trick to the sums arising from the MC estimators. The denominator of (27) does not need to be evaluated, as it suffices to compute $p_{f_{X_{m+1}}}(l; m, c_n, \alpha, \theta)$ up to a constant of proportionality and then normalize the masses to sum up to 1. In doing this, the MC variance is additionally reduced.

[Figure 3: Exact sampling of $K_m$ via Algorithm 1 and (asymptotic) approximate sampling of $K_m$ via $K_m \approx [(\theta+m)^{\alpha} - \theta^{\alpha}]\, S^{-\alpha}_{\alpha,\theta}$ for different values of $m$, with $\alpha = 0.5$ and $\theta = 10$; densities estimated by kernel density estimation.]

## 4. Experiments

We present numerical experiments for the CMS-PYP. We apply the CMS-PYP to synthetic and real textual data, and we compare its performance with respect to some Bayesian and non-Bayesian approaches. With regards to the Bayesian approaches, we consider the CMS-DP of Cai et al. (2018), here denoted by $\hat{f}^{(\mathrm{DP})}$, and the CMS-NIGP of Dolera et al. (2021), here denoted by $\hat{f}^{(\mathrm{NIGP})}$. In particular, $\hat{f}^{(\mathrm{DP})}$ is the mean of the posterior distribution of $f_{X_{m+1}}$, given $\{C_{n,h_n(X_{m+1})}\}_{n \in [N]}$, under the DP prior, whereas $\hat{f}^{(\mathrm{NIGP})}$ is the mean of the posterior distribution of $f_{X_{m+1}}$, given $\{C_{n,h_n(X_{m+1})}\}_{n \in [N]}$, under the NIGP prior. With regards to non-Bayesian approaches, we consider the CMS of Cormode and Muthukrishnan (2005), the CMM of Goyal et al. (2012) and the BDCM of Ting (2018). The CMS, here denoted by $\hat{f}^{(\mathrm{CMS})}$, has been specified in (1).
Both the CMM, here denoted by $\hat{f}^{(\mathrm{CMM})}$, and the BDCM, here denoted by $\hat{f}^{(\mathrm{BDCM})}$, rely on the same summary statistics used in the CMS, i.e. the hashed frequencies $\{C_n\}_{n \in [N]}$. This facilitates the implementation of a fair comparison among estimators, since the storage requirement and the sketch update complexity are unchanged. The CMM estimator subtracts the value of the estimated noise from each of the $N$ counters, and returns the minimum between the median of the $N$ residues and the CMS estimator; for a token $x$, the noise corresponding to each counter is given by $(m - C_{n,h_n(x)})/(J-1)$. Similarly to the CMM, the BDCM estimator aims at de-biasing the CMS estimator. Specifically, we consider Algorithm 1 of Ting (2018): an (almost) unbiased estimator is obtained by removing the mean of the CMS estimators computed over the vector counters $\{C_{n,j}\}_{n \in [N]}$ for $j = 1, \ldots, J$, and the BDCM estimator is obtained by restricting this unbiased estimator to be non-negative. A code sketch of these three estimators is given at the end of Section 4.2. In all the experiments, random hash functions are specified as $h_n(x) = ((a_n x + b_n) \bmod L_P) \bmod J$, where $x$ is the non-negative integer index corresponding to the token of interest, $L_P$ denotes a large prime number (here chosen to be $2^{32} - 1$), and $a_n$ and $b_n$ are two random integers that are i.i.d. from the Uniform distribution on $[1, L_P]$. The required hash functions are generated once for each numerical experiment and kept fixed while comparing different estimators.

### 4.1 Estimation of the prior's parameter $(\alpha, \theta)$

We present an empirical study of the likelihood-free estimation approach detailed in Section 3; a code sketch of the MC objective (33) used here is given at the end of this subsection. We start with a scenario where the data generating process (PYP-DGP) is (23). In particular, we generate 10 synthetic datasets of $m = 300000$ tokens each, for different prior's parameters $(\alpha, \theta)$. See Table 1. For each dataset, the estimation of the prior's parameter $(\alpha, \theta)$ is performed by means of (32) and (33) with $m' = 100000$ and $R = 25$. The optimization procedure is based on Letham et al. (2019), as implemented by the AX library; see https://ax.dev/ for details. The stochastic objective function (33) is evaluated a total of 50 times for each dataset. The results in Table 1 support our inferential procedure for $(\alpha, \theta)$. It is also apparent that, for the datasets under consideration, $\alpha$ is more easily identified than $\theta$.

We also consider synthetic datasets generated from Zipf's distributions with (exponent) parameter $c > 1$, i.e. a Zipf data generating process with parameter $c$ (Zc-DGP). In particular, we recall that the parameter $c$ controls the tail behaviour of the Zipf distribution: the smaller $c$, the heavier the tail of the distribution, i.e. the smaller $c$, the larger the fraction of types with low-frequency tokens. We generate 7 synthetic datasets of $m = 500000$ tokens each, for different parameters $c$. See Table 2. For each dataset, the estimation of the prior's parameter $(\alpha, \theta)$ is performed by means of (32) and (33) with $m' = 100000$ and $R = 25$. The optimization procedure is still based on the work of Letham et al. (2019), and the stochastic objective function (33) is evaluated a total of 50 times for each dataset. The results in Table 2 show that the PYP prior is able to adapt to different power-law tail behaviours. In particular, we observe that the larger $c$, the smaller $\hat{\alpha}$, which is in agreement with the interpretation of $\alpha$ as the parameter controlling the tail behaviour of the PYP prior.
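To make the estimation procedure just described concrete, the following self-contained Python sketch implements the MC objective (33). The PYP sample is simulated through the predictive distribution (20), the hash functions follow the specification given at the beginning of this section, and the fixed seed implements the common-random-numbers device of Section 3.1. All names are ours; this is a sketch under those assumptions, not the paper's reference implementation.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def sample_pyp_labels(alpha, theta, m, rng):
    # Sequential sampling from the PYP predictive (Eq. (20)): token i+1 starts a
    # new type with prob. (theta + alpha*K_i)/(theta + i), otherwise it repeats
    # an existing type j with prob. (n_j - alpha)/(theta + i).
    labels = np.empty(m, dtype=np.int64)
    w = []  # w[j] stores n_j - alpha for type j
    for i in range(m):
        if i == 0 or rng.random() < (theta + alpha * len(w)) / (theta + i):
            w.append(1.0 - alpha)
            labels[i] = len(w) - 1
        else:
            probs = np.asarray(w)
            j = rng.choice(len(w), p=probs / probs.sum())
            w[j] += 1.0
            labels[i] = j
    return labels

def synthetic_sketch(alpha, theta, m_prime, a, b, J, rng, prime=2**32 - 1):
    # Hash a synthetic PYP sample with h_n(x) = ((a_n x + b_n) mod prime) mod J,
    # using the integer type labels as token indices; a, b are length-N arrays.
    labels = sample_pyp_labels(alpha, theta, m_prime, rng)
    C = np.zeros((len(a), J), dtype=np.int64)
    for n in range(len(a)):
        np.add.at(C[n], ((a[n] * labels + b[n]) % prime) % J, 1)
    return C

def mc_objective(alpha, theta, C_obs, m, m_prime, a, b, R, seed=0):
    # Eq. (33): average W1 distance between the (unordered) entries of the
    # observed sketch and of R rescaled synthetic sketches.
    rng = np.random.default_rng(seed)  # fixed seed = common random numbers
    ref = C_obs.ravel().astype(float)
    J = C_obs.shape[1]
    dists = [wasserstein_distance(
                 ref,
                 synthetic_sketch(alpha, theta, m_prime, a, b, J, rng).ravel()
                 * m / m_prime)
             for _ in range(R)]
    return float(np.mean(dists))
```

In an actual run, `mc_objective` would be handed to the noise-robust Bayesian optimization routine (e.g. through the AX library mentioned above) as the stochastic objective to be minimized over $(\alpha, \theta)$.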
PYP-DGP (α, θ)     Estimates (ˆα, ˆθ)
0.00   25.00        0.02   36.31
0.10   25.00        0.11   21.86
0.20   25.00        0.18   16.78
0.30   25.00        0.26   22.83
0.40   25.00        0.41   17.32
0.50   25.00        0.56   10.69
0.60   25.00        0.56   13.42
0.70   25.00        0.63   24.89
0.80   25.00        0.77   10.21
0.90   25.00        0.88   11.26

Table 1: Prior's parameter (α, θ) estimates, under the PYP-DGP.

Zc-DGP (c)     Estimates (ˆα, ˆθ)
1.05            0.92   25.37
1.18            0.80    5.56
1.33            0.71    1.53
1.54            0.67    0.61
1.82            0.38    0.49
2.22            0.17    0.11
2.86            0.01    0.23

Table 2: Prior's parameter (α, θ) estimates, under the Zc-DGP.

4.2 Applications to synthetic and real data

With regards to synthetic data, we consider datasets generated from Zipf's distributions with exponent c = 1.3, 1.6, 1.9, 2.2. Each dataset consists of m = 500000 tokens. We make use of a 2-universal hash family, and then assume the following pairs of hashing parameters: i) J = 320 and N = 2; ii) J = 160 and N = 4. Bayesian and non-Bayesian approaches are compared in terms of the MAE (mean absolute error) between true frequencies and their estimates. Table 3 and Table 5 report the MAE for the Bayesian approaches, with respect to the case J = 320 and N = 2 and the case J = 160 and N = 4, respectively. From Table 3 and Table 5, it is clear that ˆf (PYP) has a remarkably better performance than ˆf (DP) in the estimation of low-frequency tokens. In particular, for both Table 3 and Table 5, if we consider the bins of low frequencies (0, 256], the MAE of ˆf (PYP) is always smaller than the MAE of ˆf (DP), i.e. ˆf (PYP) outperforms ˆf (DP). This behaviour becomes more and more evident as the parameter c decreases, that is, the heavier the tail of the distribution, the more the estimator ˆf (PYP) outperforms the estimator ˆf (DP). For any fixed exponent c, the gap between the MAEs of ˆf (PYP) and ˆf (DP) reduces as the frequency v increases, and this reduction is much more evident as c becomes large. For any exponent c we expect a frequency threshold, say v*(c), such that ˆf (PYP) underestimates f_{x_{m+1}} for v > v*(c). From Table 3 and Table 5, for any two exponents c1 and c2 such that c1 < c2 it will be v*(c1) > v*(c2). From Table 3, i.e. J = 320 and N = 2, it emerges that ˆf (PYP) has a remarkably better performance than ˆf (NIGP) for data with heavier power-law tails (c = 1.3, 1.6), whereas from Table 5, i.e. J = 160 and N = 4, ˆf (PYP) is competitive with ˆf (NIGP) for data with heavier power-law tails (c = 1.3, 1.6). In general, the better performance of ˆf (PYP) with respect to ˆf (NIGP) is not surprising, as ˆf (NIGP) has the critical limitation that it cannot be tuned to the power-law degree of the data. Table 4 and Table 6 report the MAE for the non-Bayesian approaches, for the case J = 320 and N = 2 and the case J = 160 and N = 4, respectively. From Table 4 and Table 6, it is clear that ˆf (PYP) outperforms ˆf (CMS) in the estimation of low-frequency tokens for both choices of hashing parameters. Moreover, ˆf (PYP) outperforms both ˆf (CMM) and ˆf (BDCM) in the estimation of low-frequency tokens. The better performance of ˆf (PYP) with respect to ˆf (CMM) is a remarkable result, as the CMM is known to stand out in the estimation of low-frequency tokens (Goyal et al., 2012, Figure 1). In general, a good performance in the estimation of low-frequency tokens is a desirable feature in natural language or textual data, where the power-law behaviour of the data stream of tokens is common.
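Before turning to the textual datasets, we note that the synthetic pipeline above is straightforward to reproduce in outline. The following sketch (ours, for illustration) generates Zipf(c) datasets and measures the share of tokens falling in the low-frequency bin (0, 256] used in Tables 3-6, making the effect of the exponent c on the tail visible.

```python
import numpy as np

def zipf_dataset(m, c, seed=0):
    # Draw m tokens from a Zipf(c) distribution, c > 1: the smaller c, the
    # heavier the tail, i.e. the larger the share of low-frequency tokens.
    rng = np.random.default_rng(seed)
    return rng.zipf(c, size=m)

for c in (1.3, 1.6, 1.9, 2.2):
    x = zipf_dataset(500_000, c)
    _, freqs = np.unique(x, return_counts=True)
    low = freqs[freqs <= 256].sum() / len(x)
    print(f"c = {c}: fraction of tokens in low-frequency types = {low:.3f}")
```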
In such data, the highest-frequency events are often of low interest: frequent words are often grammatical, highly polysemous or without any interesting semantics, while low-frequency words are more relevant. We conclude by presenting an application of the CMS-PYP to some textual datasets, for which the distribution of words is typically a power-law distribution (Clauset et al., 2009). In particular, we consider 4 textual datasets of increasing corpora size: the 20 Newsgroups dataset¹, the Enron dataset², the WikiText-103 dataset³ and the 1 Billion Word Language Model Benchmark (1BWLMB) dataset⁴.

1. http://qwone.com/~jason/20Newsgroups/
2. https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/
3. https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/
4. https://www.statmt.org/lm-benchmark/

The 20 Newsgroups dataset consists of m = 2765300 tokens with 53975 distinct tokens, whereas the Enron dataset consists of m = 6412175 tokens with 28102 distinct tokens. Following the experiments in Cai et al. (2018), we make use of a 2-universal hash family, with the following hashing parameters: i) J = 12000 and N = 2; ii) J = 8000 and N = 4. By means of the goodness-of-fit test proposed in Clauset et al. (2009), we found that the 20 Newsgroups and Enron datasets are fitted by power-law distributions with exponents ν = 2.3 and ν = 2.1, respectively. The CMS-PYP estimators ˆf (PYP) for the 20 Newsgroups and Enron datasets are obtained through the implementation of (27). Table 8 and Table 9 report the MAEs of ˆf (CMS), ˆf (CMM) and ˆf (BDCM), whereas Table 7 reports the MAEs of the estimators ˆf (DP) and ˆf (PYP). The results of Table 7, Table 8 and Table 9 confirm the behaviour observed on the Zipf synthetic data. That is, ˆf (PYP) outperforms ˆf (DP) for low-frequency tokens. Moreover, a comparison with respect to ˆf (CMM) reveals that ˆf (PYP) is competitive with ˆf (CMM) in the estimation of low-frequency tokens. Finally, we consider the WikiText-103 and 1BWLMB datasets. The former consists of m = 82810656 tokens with 606753 distinct tokens, whereas the latter consists of m = 658195953 tokens with 1256524 distinct tokens. The goodness-of-fit test of Clauset et al. (2009) results in power-law distributions with exponents ν = 2.15 and ν = 1.5, respectively. Taking into account the increased corpora sizes, we consider the following hashing parameters: i) J = 50000 and N = 2; ii) J = 35000 and N = 4 for WikiText-103; i) J = 140000 and N = 2; ii) J = 100000 and N = 4 for 1BWLMB. The CMS-PYP estimators ˆf (PYP) are obtained through the implementation of (28). Table 10 reports the MAEs of the estimators ˆf (DP) and ˆf (PYP) applied to the WikiText-103 dataset and to the 1BWLMB dataset. The CMS-PYP estimator offers a competitive performance with respect to both the DP and the CMM estimators. The use of (28) reduces the computational cost significantly, in which case the time required to compute the CMS-PYP estimators is similar to the time required for the DP estimators.
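The non-Bayesian side of these comparisons can be reproduced in outline with a few lines of code. The sketch below (ours, for illustration) implements the 2-universal hash family, the CMS and CMM estimators as described earlier in this section, and the binned MAE used in Tables 3-10; the BDCM is omitted since it requires the debiasing machinery of Ting (2018).

```python
import numpy as np

LP = 2**31 - 1  # large prime used by the hash family

def make_hashes(N, J, rng):
    # h_n(x) = ((a_n * x + b_n) mod LP) mod J, a_n, b_n i.i.d. in [1, LP]
    a = rng.integers(1, LP, size=N, endpoint=True)
    b = rng.integers(1, LP, size=N, endpoint=True)
    return lambda x: ((a * x + b) % LP) % J

def build_sketch(stream, h, N, J):
    C = np.zeros((N, J), dtype=np.int64)
    rows = np.arange(N)
    for x in stream:
        C[rows, h(x)] += 1
    return C

def f_cms(C, h, x):
    return C[np.arange(len(C)), h(x)].min()

def f_cmm(C, h, x, m):
    # subtract the estimated noise (m - C_{n,h_n(x)})/(J - 1) from each
    # counter, then take min(median of residues, CMS estimate)
    N, J = C.shape
    counters = C[np.arange(N), h(x)]
    residues = counters - (m - counters) / (J - 1)
    return min(np.median(residues), counters.min())

def binned_mae(true_f, est_f, edges=(1, 2, 4, 8, 16, 32, 64, 128, 256)):
    # MAE grouped into the bins (0,1], (1,2], ..., (128,256] of the tables
    true_f, est_f = np.asarray(true_f, float), np.asarray(est_f, float)
    out, lo = {}, 0
    for hi in edges:
        mask = (true_f > lo) & (true_f <= hi)
        if mask.any():
            out[f"({lo},{hi}]"] = float(np.abs(true_f[mask] - est_f[mask]).mean())
        lo = hi
    return out
```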
                        Z1.3                          Z1.6                          Z1.9                          Z2.2
Bins of x_{m+1}   ˆf(DP)   ˆf(NIGP)  ˆf(PYP)   ˆf(DP)   ˆf(NIGP)  ˆf(PYP)   ˆf(DP)    ˆf(NIGP)  ˆf(PYP)    ˆf(DP)  ˆf(NIGP)  ˆf(PYP)
(0,1]           1,057.61   231.31     1.12    626.85   134.75     3.36    306.70     65.71    115.15     51.38   12.91     3.80
(1,2]           1,194.67   287.43     2.08    512.43   119.22     2.29    153.57     37.03     31.16    288.27   61.87    93.99
(2,4]           1,105.16   262.18     3.63    472.59    95.78     1.85  2,406.00    353.73  1,237.41    133.31   26.90    17.57
(4,8]           1,272.02   302.89     7.40    783.88   175.10     8.89    457.57     83.30    136.16    117.76   21.58     8.26
(8,16]          1,231.63   257.08    11.83    716.52   136.66    10.00    377.99     66.44     90.41    411.21   77.39   127.69
(16,32]         1,252.18   248.41    22.58    829.17   190.05    14.81    286.98     41.99     65.47    501.00   90.29   178.07
(32,64]         1,309.14   284.12    39.23    780.70   139.52    36.47    413.95     67.30    181.84    216.84   48.00    92.07
(64,128]        1,716.76   312.59   104.03    946.20   125.07    79.94  1,869.23    353.10  1,678.82     63.05   65.91    85.70
(128,256]       1,102.96    97.90   168.34  1,720.49   273.50   342.18    199.87    110.32     98.20     45.98  130.94   136.25

Table 3: Synthetic data: MAE for ˆf (DP), ˆf (NIGP) and ˆf (PYP), case J = 320, N = 2.

                        Z1.3                          Z1.6                          Z1.9                          Z2.2
Bins of x_{m+1}   ˆf(CMS)  ˆf(CMM)  ˆf(BDCM)  ˆf(CMS)  ˆf(CMM)  ˆf(BDCM)  ˆf(CMS)   ˆf(CMM)   ˆf(BDCM)  ˆf(CMS)  ˆf(CMM)  ˆf(BDCM)
(0,1]           1,061.3   161.72    99.02    629.40    62.19    57.08    308.11     81.10     77.37     51.65    1.04    27.80
(1,2]           1,197.9   169.74    35.67    514.31   102.42    45.38    154.20      2.00     55.14    289.50    2.04    21.16
(2,4]           1,108.3   116.37    62.74    474.82    52.10    26.18  2,419.51  2,215.85  1,156.54    134.05    3.40    24.63
(4,8]           1,275.9   378.04   145.04    786.73   214.46    40.30    460.13    258.90     68.67    118.40    6.44    62.20
(8,16]          1,236.1   230.32    90.54    719.84   232.24    66.08    380.05    139.50     39.40    413.13  129.03    51.54
(16,32]         1,256.8   221.98   172.14    831.70    79.73    62.98    288.59     23.90    263.34    503.60  364.30    32.60
(32,64]         1,312.8   235.87   197.59    783.90   184.99    73.98    415.58     54.82    116.04    217.81   82.92    28.73
(64,128]        1,721.7   766.29   119.64    950.31   304.36    56.60  1,875.50  1,762.20  1,120.00     64.01   64.01    25.79
(128,256]       1,107.7   334.57   121.17  1,727.19 1,488.38   109.26    202.09    163.61    239.54     46.80   46.80    25.39

Table 4: Synthetic data: MAE for ˆf (CMS), ˆf (CMM) and ˆf (BDCM), case J = 320, N = 2.

                        Z1.3                          Z1.6                          Z1.9                          Z2.2
Bins of x_{m+1}   ˆf(DP)   ˆf(NIGP)  ˆf(PYP)   ˆf(DP)   ˆf(NIGP)  ˆf(PYP)   ˆf(DP)    ˆf(NIGP)  ˆf(PYP)    ˆf(DP)  ˆf(NIGP)  ˆf(PYP)
(0,1]           2,206.09     0.9      0.77  1,254.85     0.25     1.07    420.76      0.18      0.98    153.20    0.32    28.78
(1,2]           2,333.06     0.5      1.07  1,326.71     0.70     2.13    549.12      0.82      1.93    180.71    1.24    21.60
(2,4]           2,266.35     1.3      1.70  1,267.97     2.47     3.53    482.45      2.53      3.55    182.18    2.66    14.92
(4,8]           2,229.22     4.6      4.54  1,371.27     4.67     6.11    538.91      5.28      6.28    250.32    5.96    40.18
(8,16]          2,207.42    10.5      7.06  1,159.29    10.68    11.68    487.69     10.86     10.64    245.09   10.28    95.33
(16,32]         2,279.80    20.7     11.60  1,211.41    19.21    23.88    529.77     22.08     19.04    293.68   21.57    56.37
(32,64]         2,301.99    42.6     28.56  1,280.17    43.14    43.61    632.45     42.64     40.84    118.26   44.49    29.04
(64,128]        2,241.57    92.2     71.58  1,112.41    94.43    93.50    419.42     95.19     81.83    177.61   95.10    58.47
(128,256]       2,235.40   170.0    114.75  1,133.85   173.87   148.71    522.21    185.83    226.96    128.09  180.41    77.92

Table 5: Synthetic data: MAE for ˆf (DP), ˆf (NIGP) and ˆf (PYP), case J = 160, N = 4.
                        Z1.3                          Z1.6                          Z1.9                          Z2.2
Bins of x_{m+1}   ˆf(CMS)  ˆf(CMM)  ˆf(BDCM)  ˆf(CMS)  ˆf(CMM)  ˆf(BDCM)  ˆf(CMS)   ˆf(CMM)   ˆf(BDCM)  ˆf(CMS)  ˆf(CMM)  ˆf(BDCM)
(0,1]           2,212.1   590.48   126.11   1,262.0   146.11    36.60    424.80    130.90     53.37    154.70   47.10    17.66
(1,2]           2,339.8   359.57   158.41   1,332.7    63.21    72.87    552.00     65.00    189.88    182.70    2.01    18.76
(2,4]           2,270.9    69.42    54.81   1,277.8   301.89   176.24    487.30    163.55     42.74    184.70   97.15    25.86
(4,8]           2,234.6   339.95    92.71   1,375.7   579.94    98.10    545.20    243.08     26.14    252.50   62.70    21.57
(8,16]          2,213.3   313.37    62.11   1,165.7   152.53    59.33    493.20    196.20    102.04    247.30   29.70    22.56
(16,32]         2,283.0    23.30   111.41   1,217.2    22.94    84.80    535.50    154.30     31.80    295.90  190.92    23.36
(32,64]         2,305.7   133.09   172.81   1,284.6   209.13    63.20    637.80    150.05     37.70    120.60   71.86    24.01
(64,128]        2,244.5   102.43    57.11   1,120.2   118.42    73.93    425.10    198.60     36.43    180.30  113.75    22.57
(128,256]       2,237.4   294.43   118.11   1,141.3   573.12    48.07    525.90    267.15     41.26    129.70  129.70    22.73

Table 6: Synthetic data: MAE for ˆf (CMS), ˆf (CMM) and ˆf (BDCM), case J = 160, N = 4.

                      J = 12000 and N = 2                   J = 8000 and N = 4
                  20 Newsgroups        Enron          20 Newsgroups        Enron
Bins of x_{m+1}   ˆf(DP)  ˆf(PYP)   ˆf(DP)  ˆf(PYP)   ˆf(DP)  ˆf(PYP)   ˆf(DP)  ˆf(PYP)
(0,1]             46.39    1.22     12.20    0.90     53.39    0.99     70.98    1.18
(1,2]             16.60    1.85     13.80    1.86     30.49    2.10     47.38    2.05
(2,4]             38.40    3.24     61.49    3.60     32.49    3.66     52.49    4.14
(4,8]             59.39    5.04     88.39    7.68     38.69    6.59     53.08    6.13
(8,16]            54.29   10.90     23.40   12.85     25.29   13.17     56.98   11.55
(16,32]           17.80   20.89     55.09   23.97     24.99   22.69     89.98   19.29
(32,64]           40.79   43.93    128.48   48.94     39.69   46.42    108.37   47.61
(64,128]          25.99   77.72    131.08   78.51     22.09   91.15     55.67   70.81
(128,256]         13.59  170.82     50.68  165.28     25.79  191.35     80.76  172.07

Table 7: 20 Newsgroups and Enron real data: MAE for ˆf (PYP) and ˆf (DP).

                     20 Newsgroups                      Enron
Bins of x_{m+1}   ˆf(CMS)  ˆf(CMM)  ˆf(BDCM)   ˆf(CMS)  ˆf(CMM)  ˆf(BDCM)
(0,1]              46.4     5.41    18.04       12.2     0.90    29.20
(1,2]              16.6     2.16    56.00       13.8     2.00    28.30
(2,4]              38.4     7.91    23.14       61.5     9.90    26.00
(4,8]              59.4    35.70    31.42       88.4    17.32    24.30
(8,16]             54.3    45.40    22.72       23.4     9.52   118.70
(16,32]            17.8    20.99    27.72       55.1    21.00    21.60
(32,64]            40.8    58.86    48.24      128.5   134.47    24.40
(64,128]           26.0    91.59    23.20      131.1   110.27    27.60
(128,256]          13.6   186.92    28.05       50.7   140.43   110.40

Table 8: Real data: MAE for ˆf (CMS), ˆf (CMM) and ˆf (BDCM), case J = 12000 and N = 2.
                     20 Newsgroups                      Enron
Bins of x_{m+1}   ˆf(CMS)  ˆf(CMM)  ˆf(BDCM)   ˆf(CMS)  ˆf(CMM)  ˆf(BDCM)
(0,1]              53.4     4.50    25.90       71.0    51.00    31.56
(1,2]              30.5     2.00    20.30       47.4    27.20    21.26
(2,4]              32.5     4.80    19.00       52.5     3.90    22.42
(4,8]              38.7     6.23    22.40       53.1    10.50    34.88
(8,16]             25.3    13.50    24.50       57.0    22.20    14.74
(16,32]            25.0    21.60    22.20       90.0    20.60    27.94
(32,64]            39.7    39.22    24.00      108.4    61.38    49.38
(64,128]           22.1    86.32    19.50       55.7    66.50    21.12
(128,256]          25.8   183.96    26.30       80.8    90.20    34.10

Table 9: Real data: MAE for ˆf (CMS), ˆf (CMM) and ˆf (BDCM), case J = 8000 and N = 4.

                             WikiText-103                                  1BWLMB
                  J = 50000, N = 2   J = 35000, N = 4      J = 140000, N = 2   J = 100000, N = 4
Bins of x_{m+1}   ˆf(DP)   ˆf(PYP)   ˆf(DP)   ˆf(PYP)       ˆf(DP)   ˆf(PYP)    ˆf(DP)   ˆf(PYP)
(0,1]              97.30    43.15    119.59    40.40        702.70    41.16     156.50    34.99
(1,2]              61.30    34.07    145.39    31.30        104.10    35.18     138.10    35.89
(2,4]             157.70    34.29     91.79    31.50         50.10    37.55      65.00    35.50
(4,8]             192.59    35.45    120.49    34.00        552.59    33.92      49.40    37.50
(8,16]            191.59    41.38    111.09    32.20        176.40    34.75      43.70    32.70
(16,32]           195.19    33.10    127.09    46.50        143.40    38.09      97.40    35.09
(32,64]           248.29    34.29    102.09    44.30        600.30    37.31     168.90    40.39
(64,128]          632.19    37.71    208.29    42.40        143.40    44.31      89.90    41.50
(128,256]         107.69    42.09    140.29    59.90        485.29    48.99      58.60    55.49

Table 10: WikiText-103 and 1BWLMB real data: MAE for ˆf (PYP) and ˆf (DP).

5. Discussion

At the core of the CMS-DP of Cai et al. (2018) lies the computation of the posterior distribution of a point query, given the hashed data, with estimates of the point query then being obtained as mean functionals of such a posterior distribution. While the CMS-DP has proved to improve on some aspects of the CMS, it has the major drawback that the posterior distribution of a point query is obtained through a constructive proof that builds upon arguments that are tailored to the DP prior, namely arguments that are not usable for other nonparametric priors. In this paper, we presented a Bayesian proof of the CMS-DP, that is we computed the (regular) conditional distribution of f_{X_{m+1}} given C_{n,h_n(X_{m+1})}, and we showed that such a distribution coincides with the posterior distribution obtained in Cai et al. (2018) through the constructive proof. Besides strengthening the BNP approach of Cai et al. (2018) through rigorous arguments, our proof improves its flexibility by avoiding the use of properties that are peculiar to the DP, thus paving the way to go beyond the use of the DP prior. This first result led to the development of a novel learning-augmented CMS under power-law data streams, referred to as CMS-PYP, which relies on BNP modeling of the data stream of tokens via a PYP prior. Under this more general framework, we applied the arguments of the Bayesian proof of the CMS-DP, suitably adapted to the PYP prior, to compute the posterior distribution of a point query, given the hashed data. Both the CMS-DP and the CMS-PYP have also been investigated with respect to the large sample asymptotic behaviours of their corresponding posterior distributions. Applications to synthetic data and real textual data revealed that the CMS-PYP outperforms the CMS and the CMS-DP in estimating low-frequency tokens, and it is competitive with respect to the CMM and the BDCM.

Our Bayesian proof of the CMS-DP can be extended to deal with more general queries. Of notable interest is the problem of estimating the overall frequency of s ≥ 1 tokens in the stream, also referred to as an s-range query, which generalizes the point query (Cormode and Yi, 2020, Chapter 3). For m ≥ 1 let x_{1:m} be a stream of V-valued tokens, and for positive integers J and N let h_1, . . . , h_N, with h_n : V → [J], be random hash functions that are i.i.d. from a pairwise independent hash family H. Then, assuming x_{1:m} to be available through the hashed data {(C_{n,1}, . . . , C_{n,J})}_{n∈[N]}, the goal is to estimate, or recover, the vector of frequencies (f_{x_{m+1}}, . . . , f_{x_{m+s}}) of s new tokens (x_{m+1}, . . . , x_{m+s}) in x_{1:m}, with f_{x_{m+r}} = Σ_{i=1}^{m} 1_{{x_{m+r}}}(x_i) for r = 1, . . . , s, and hence the s-range query f_s = Σ_{1≤r≤s} f_{x_{m+r}}. The arguments of the constructive proof of Cai et al. (2018) exploit the unidimensional nature of point queries, and therefore they cannot be used for the vector (f_{x_{m+1}}, . . . , f_{x_{m+s}}) nor for f_s. In Appendix I we show how to adapt our Bayesian proof to the problem of computing the posterior distribution of (f_{x_{m+1}}, . . . , f_{x_{m+s}}), given the hashed data, and, as an illustrative example, we present the posterior distribution of (f_{x_{m+1}}, f_{x_{m+2}}). We focus on the DP prior, though the same arguments apply to the PYP prior. Unfortunately, the posterior distribution of (f_{x_{m+1}}, f_{x_{m+2}}) has a rather complicated form, and for a large m the computational burden for its evaluation becomes overwhelming.
We defer to future work the study of the large sample behaviour of this posterior distribution, with the aim of obtaining a simple approximated version of it.

Our work paves the way to some fruitful directions for future research in the context of the BNP approach to learning-augmented CMSs. Investigating large sample asymptotic properties of the CMS-DP and CMS-PYP would be of interest, especially with the aim of obtaining simple approximated versions of the posterior distributions (8) and (31). For a single hash function, i.e. N = 1, Proposition 2 and Proposition 5, as well as Equation (28), provide results in this direction. However, it would be of greater interest to consider corresponding results for an arbitrary N, that is for the posterior distributions (8) and (31). Our conjecture is that, under suitable assumptions, the large m limiting posterior distribution of a rescaled point query reduces to a distribution that involves only the minimum of the hashed frequencies, i.e. min{c_1, . . . , c_N}, thus making a link with the CMS. In this respect, it would be interesting to obtain some form of central limit theorem for the posterior distributions (8) and (31). For α = 0, Cai et al. (2018) showed that the posterior mode may recover the CMS estimate of Cormode and Muthukrishnan (2005), while other CMS-DP estimates may be viewed as CMS estimates with shrinkage; it is natural to ask whether there exists a similar interplay between the CMS-PYP and variations of the CMS for power-law data streams, e.g. the CMM. Other directions of interest consist in using the CMS-DP and CMS-PYP for large-scale streaming algorithms, e.g. for large text or streaming graph applications (Cormode et al., 2012), as well as in accommodating nonlinear update operations, such as the conservative update (Cormode and Yi, 2020, Chapter 3). The problem of extending Theorem 3 to nonparametric priors other than the PYP prior remains open, e.g. to Gibbs-type priors and, in general, to nonparametric priors arising from the normalization of completely random measures (James, 2002; Prünster, 2002; Pitman, 2003; Regazzini et al., 2003). We expect that an extension to Gibbs-type priors may require a non-trivial effort to overcome the lack of the critical property (26), while still exploiting the general strategy underlying the Bayesian proof. Instead, an extension to more general priors than Gibbs-type priors might require a different strategy of proof. An extension of Theorem 3 to general classes of nonparametric priors, which include the PYP prior as a special case, may be useful to discover novel peculiar properties of the PYP prior within such classes, as well as sufficientness postulates with respect to the hashed data. A further interesting problem consists in extending Theorem 3 with respect to the use of the sampling information. Theorem 3 provides the posterior distribution of f_{X_{m+1}} given C_{n,h_n(X_{m+1})}, thus making use of the sole sampling information contained in the bucket C_{n,h_n(X_{m+1})}, i.e. the bucket where X_{m+1} is hashed. While this approach is in line with the CMS of Cormode and Muthukrishnan (2005), from a statistical perspective it would be more reasonable to make use of the sampling information contained in all the buckets (C_{n,1}, . . . , C_{n,J}). In particular, one may consider the computation of the posterior distribution of f_{X_{m+1}} given (C_{n,1}, . . . , C_{n,J}) and h_n(X_{m+1}).
We conjecture that the DP prior is the sole Gibbs-type prior for which the posterior distribution of f_{X_{m+1}} given (C_{n,1}, . . . , C_{n,J}) and h_n(X_{m+1}) coincides with the posterior distribution of f_{X_{m+1}} given C_{n,h_n(X_{m+1})}. This would provide a sufficientness postulate for the DP prior with respect to hashed data. Work on this is ongoing.

Appendix A. Proof of Proposition 2

The proof of Equation (17) is straightforward, and it follows from Equation (11) by means of the definition of the Beta-Binomial distribution (Johnson et al., 2005, Chapter 6). With regards to the proof of Equation (16), for t ∈ R+ and u ∈ N0, let (t)_[u] := ∏_{0≤i≤u−1} (t − i) denote the falling factorial of t of order u, with the proviso that (t)_[0] := 1. In particular, (t)_(u) = (−1)^u (−t)_[u]. Recall that the (u, v)-th Stirling number of the second type, here denoted by S(u, v), is defined as the v-th coefficient in the expansion of t^u into falling factorials, i.e. t^u = Σ_{0≤v≤u} S(u, v)(t)_[v]; moreover, it is assumed that S(0, 0) = 1, S(u, 0) = 0 for u > 0 and S(u, v) = 0 for v > u. Then, for r ≥ 1, expanding l^r into falling factorials and exchanging the order of summation,

Σ_{l=0}^{c_n} l^r (θ/J)/(θ/J + c_n) (c_n − l + 1)_(l)/(θ/J + c_n − l)_(l) = Σ_{k=0}^{r} S(r, k) Γ(k + 1) (Γ(c_n + 1)/Γ(c_n + 1 − k)) (Γ(1 + θ/J)/Γ(k + 1 + θ/J)).

By a direct application of the Stirling formula for the ratio of Gamma functions, Γ(c_n + 1)/Γ(c_n + 1 − k) ∼ c_n^k as c_n → +∞, so that only the term k = r survives the rescaling by c_n^{−r}, and

c_n^{−r} Σ_{k=0}^{r} S(r, k) Γ(k + 1) (Γ(c_n + 1)/Γ(c_n + 1 − k)) (Γ(1 + θ/J)/Γ(k + 1 + θ/J)) → Γ(r + 1) Γ(1 + θ/J)/Γ(θ/J + r + 1) = E[B^{r}_{1,θ/J}]

for any r ≥ 1. This completes the proof of (16), and hence the proof of Proposition 2 is completed.

Appendix B. Proof of Theorem 3

The proof is along lines similar to the Bayesian proof of Section 2. To simplify the notation, we remove the subscript n from h_n and c_n. Then, we are interested in computing the posterior distribution

Pr[f_{X_{m+1}} = l | C_{h(X_{m+1})} = c]   (34)
 = Pr[f_{X_{m+1}} = l | Σ_{i=1}^{m} 1_{{h(X_i)}}(h(X_{m+1})) = c]
 = Pr[f_{X_{m+1}} = l, Σ_{i=1}^{m} 1_{{h(X_i)}}(h(X_{m+1})) = c] / Pr[Σ_{i=1}^{m} 1_{{h(X_i)}}(h(X_{m+1})) = c]

for l = 0, 1, . . . , m. The independence between h_n and X_{1:m} allows us to invoke the freezing lemma (Baldi, 2017, Lemma 4.1), according to which we can treat h_n as if it were fixed, i.e. non-random. We start with the denominator of (34). Uniformity of the hash function h implies that h induces a (fixed) J-partition {B_1, . . . , B_J} of V such that B_j = {v ∈ V : h(v) = j} and ν(B_j) = 1/J for j = 1, . . . , J. Accordingly, we can write the denominator of (34) as

Pr[Σ_{i=1}^{m} 1_{{h(X_i)}}(h(X_{m+1})) = c] = J binom(m, c) E[(P(B_j))^{c+1}(1 − P(B_j))^{m−c}]
 = J binom(m, c) E[(P(B_j))^{c+1} P(B_j^c)^{m−c}]
 = J binom(m, c) Σ_{i=0}^{c+1} Σ_{j=0}^{m−c} ((θ/α)_{(i+j)}/(θ)_{(m+1)}) (1/J)^{i} (1 − 1/J)^{j} C(c + 1, i; α) C(m − c, j; α),   (35)

where the last equality follows from Sangalli (2006, Equation 3.3). This completes the study of the denominator of (34). Now, we consider the numerator of (34). Let us define the event B(m, l) = {X_1 = · · · = X_l = X_{m+1}, {X_{l+1}, . . . , X_m} ∩ {X_{m+1}} = ∅}. In particular, we write

Pr[f_{X_{m+1}} = l, Σ_{i=1}^{m} 1_{{h(X_i)}}(h(X_{m+1})) = c] = binom(m, l) Pr[B(m, l), Σ_{i=l+1}^{m} 1_{{h(X_i)}}(h(X_{m+1})) = c − l].   (36)

That is, the distribution of (f_{X_{m+1}}, C_j) is completely determined by the knowledge of the distribution of (X_1, . . . , X_{m+1}). Let Π(s, k) denote the set of all possible partitions of the set {1, . . . , s} into k disjoint subsets π_1, . . . , π_k, with n_i the cardinality of π_i. In particular, from Sangalli (2006, Equation 3.5), for any measurable A_1, . . . , A_{m+1} we have that
Pr[X_1 ∈ A_1, . . . , X_{m+1} ∈ A_{m+1}] = Σ_{k=1}^{m+1} (∏_{i=0}^{k−1} (θ + iα)/(θ)_{(m+1)}) Σ_{(π_1,...,π_k)∈Π(m+1,k)} ∏_{i=1}^{k} (1 − α)_{(n_i−1)} ν(∩_{r∈π_i} A_r)

for m ≥ 1. Let 𝒱 be the Borel σ-algebra of V, and let ν_{π_1,...,π_k} be the probability measure on (V^{m+1}, 𝒱^{m+1}) defined as

ν_{π_1,...,π_k}(A_1 × · · · × A_{m+1}) = ∏_{1≤i≤k} ν(∩_{r∈π_i} A_r),

which attaches to B(m, l) a value that is either 0 or 1. In particular, ν_{π_1,...,π_k}(B(m, l)) = 1 if and only if one of the π_i's is equal to the set {1, . . . , l, m + 1}. Hence, based on the measure ν_{π_1,...,π_k}, we write

Pr[B(m, l), Σ_{i=l+1}^{m} 1_{{h(X_i)}}(h(X_{m+1})) = c − l]
 = Σ_{k} (∏_{i=0}^{k−1}(θ + iα)/(θ)_{(m+1)}) Σ_{(π_1,...,π_{k−1})∈Π(m−l,k−1)} (1 − α)_{(l)} ∏_{i=1}^{k−1} (1 − α)_{(n_i−1)} ν_{π_1,...,π_k}[Σ_{i=l+1}^{m} 1_{{h(X_i)}}(h(X_{m+1})) = c − l]
 = (θ(θ + α)_{(m−l)}/(θ)_{(m+1)}) (1 − α)_{(l)} Σ_{r} (∏_{i=0}^{r−1}(θ + α + iα)/(θ + α)_{(m−l)}) Σ_{(π_1,...,π_r)∈Π(m−l,r)} ∏_{i=1}^{r} (1 − α)_{(n_i−1)} ν_{π_1,...,π_r}[Σ_{i=1}^{m−l} 1_{{h(X_i)}}(h(X_{m+1})) = c − l],

where the mixture Σ_{r} (∏_{i=0}^{r−1}(θ + α + iα)/(θ + α)_{(m−l)}) Σ_{(π_1,...,π_r)∈Π(m−l,r)} ∏_{i=1}^{r} (1 − α)_{(n_i−1)} ν_{π_1,...,π_r}(·) is the distribution of a random sample (X_1, . . . , X_{m−l}) from P ∼ PYP(α, θ + α; ν). This is a consequence of the quasi-conjugacy property of the PYP prior (Lijoi et al., 2008). Again, the distribution of (X_1, . . . , X_{m−l}) is given in Sangalli (2006, Equation 3.5). In particular, we write

Pr[B(m, l), Σ_{i=l+1}^{m} 1_{{h(X_i)}}(h(X_{m+1})) = c − l]
 = (θ(θ + α)_{(m−l)}/(θ)_{(m+1)}) (1 − α)_{(l)} E[(P(B_j))^{c−l}(1 − P(B_j))^{m−c}]
 = (θ(θ + α)_{(m−l)}/(θ)_{(m+1)}) (1 − α)_{(l)} E[(P(B_j))^{c−l} P(B_j^c)^{m−c}]
 = (θ(θ + α)_{(m−l)}/(θ)_{(m+1)}) (1 − α)_{(l)} Σ_{i=0}^{c−l} Σ_{j=0}^{m−c} (((θ + α)/α)_{(i+j)}/(θ + α)_{(m−l)}) (1/J)^{i} (1 − 1/J)^{j} C(c − l, i; α) C(m − c, j; α),

where the second identity and the last identity follow from an application of Sangalli (2006, Proposition 3.1) and Sangalli (2006, Equation 3.3), respectively, under the PYP prior; see also the formulae displayed at page 469 of Sangalli (2006). Accordingly, from (36) we can write that

Pr[f_{X_{m+1}} = l, Σ_{i=1}^{m} 1_{{h(X_i)}}(h(X_{m+1})) = c] = binom(m, l) (θ(θ + α)_{(m−l)}/(θ)_{(m+1)}) (1 − α)_{(l)} Σ_{i=0}^{c−l} Σ_{j=0}^{m−c} (((θ + α)/α)_{(i+j)}/(θ + α)_{(m−l)}) (1/J)^{i} (1 − 1/J)^{j} C(c − l, i; α) C(m − c, j; α).   (37)

This completes the study of the numerator of (34). By combining (34) with (35) and (37) we obtain

Pr[f_{X_{m+1}} = l | C_{h(X_{m+1})} = c] = (binom(m, l)/(J binom(m, c))) (θ(θ + α)_{(m−l)}/(θ)_{(m+1)}) (1 − α)_{(l)} (Σ_{i=0}^{c−l} Σ_{j=0}^{m−c} (((θ + α)/α)_{(i+j)}/(θ + α)_{(m−l)}) (1/J)^{i}(1 − 1/J)^{j} C(c − l, i; α) C(m − c, j; α)) / (Σ_{i=0}^{c+1} Σ_{j=0}^{m−c} ((θ/α)_{(i+j)}/(θ)_{(m+1)}) (1/J)^{i}(1 − 1/J)^{j} C(c + 1, i; α) C(m − c, j; α))   (38)

for l = 0, 1, . . . , c. By an application of Charalambides (2005, Equation 2.56 and Equation 2.60) it is easy to show that (38) is a proper distribution on {0, 1, . . . , c}. The proof is completed.

Appendix C. Theorem 1 from Theorem 3 with α = 0

We show how Theorem 3 reduces to Theorem 1 by setting α = 0. First, we recall two identities involving the generalized factorial coefficient C(m, k; α) and the signless Stirling number of the first type. See Charalambides (2005, Chapter 2) for details. In particular, it holds

Σ_{k=0}^{m} a^{k} |s(m, k)| = (a)_{(m)}   (39)

for a > 0, and

lim_{α→0} C(m, k; α)/α^{k} = |s(m, k)|.   (40)

Hereafter, we apply the identities (39) and (40) in order to show that Theorem 3 reduces to Theorem 1 by setting α = 0. In this respect, we rewrite the posterior distribution (24) as follows:

Pr[f_{X_{m+1}} = l | C_{n,h_n(X_{m+1})} = c_n] ∝ binom(c_n, l) (1 − α)_{(l)} (Σ_{i=0}^{c_n−l} Σ_{j=0}^{m−c_n} ((θ + α)/α)_{(i+j)} (1/J)^{i}(1 − 1/J)^{j} C(c_n − l, i; α) C(m − c_n, j; α)) / (Σ_{i=0}^{c_n+1} Σ_{j=0}^{m−c_n} (θ/α)_{(i+j)} (1/J)^{i}(1 − 1/J)^{j} C(c_n + 1, i; α) C(m − c_n, j; α)).
Accordingly, regrouping the powers of α,

lim_{α→0} Pr[f_{X_{m+1}} = l | C_{n,h_n(X_{m+1})} = c_n]
 ∝ lim_{α→0} binom(c_n, l)(1 − α)_{(l)} (Σ_{i=0}^{c_n−l} Σ_{j=0}^{m−c_n} ((θ + α)/α)_{(i+j)} α^{i+j} (1/J)^{i}(1 − 1/J)^{j} (C(c_n − l, i; α)/α^{i})(C(m − c_n, j; α)/α^{j})) / (Σ_{i=0}^{c_n+1} Σ_{j=0}^{m−c_n} (θ/α)_{(i+j)} α^{i+j} (1/J)^{i}(1 − 1/J)^{j} (C(c_n + 1, i; α)/α^{i})(C(m − c_n, j; α)/α^{j}))

[by the identity (40)]
 = binom(c_n, l) l! (Σ_{i=0}^{c_n−l} (θ/J)^{i} |s(c_n − l, i)|)(Σ_{j=0}^{m−c_n} (θ(1 − 1/J))^{j} |s(m − c_n, j)|) / ((Σ_{i=0}^{c_n+1} (θ/J)^{i} |s(c_n + 1, i)|)(Σ_{j=0}^{m−c_n} (θ(1 − 1/J))^{j} |s(m − c_n, j)|))

[by the identity (39)]
 = binom(c_n, l) l! ((θ/J)_{(c_n−l)} (θ − θ/J)_{(m−c_n)}) / ((θ/J)_{(c_n+1)} (θ − θ/J)_{(m−c_n)})
 ∝ Γ(c_n + 1)Γ(c_n − l + θ/J) / (Γ(c_n − l + 1)Γ(θ/J + c_n + 1)) (θ/J),

and upon normalization this is precisely (θ/J)/(θ/J + c_n) × (c_n − l + 1)_{(l)}/(θ/J + c_n − l)_{(l)}, which is the expression for the posterior distribution stated in Theorem 1. The proof is completed.

Appendix D. Proof of Equation (27)

Let X_{1:m} be a random sample from P ∼ PYP(α, θ; ν), with α ∈ [0, 1) and θ > −α, and let K_m be the number of distinct types in X_{1:m}. We recall from (19) that for k = 1, . . . , m it holds

Pr[K_m = k] = ((θ/α)_{(k)}/(θ)_{(m)}) C(m, k; α).

Now, assuming c_n > 0 and m − c_n > 0, we rewrite the posterior distribution of Theorem 3 in terms of the distribution of K_m. In particular, dividing and multiplying each term of the double sums by (θ)_{(c_n−l)}(θ)_{(m−c_n)} in the numerator and by (θ)_{(c_n+1)}(θ)_{(m−c_n)} in the denominator, each term can be expressed through the probabilities Pr[K_{c_n−l} = i] Pr[K_{m−c_n} = j] and Pr[K_{c_n+1} = i] Pr[K_{m−c_n} = j], respectively. Since (a)_{(i+j)}/((a)_{(i)}(a)_{(j)}) = Γ(a + i + j)Γ(a)/(Γ(a + i)Γ(a + j)), this yields, for any l = 0, 1, . . . , c_n − 1,

Pr[f_{X_{m+1}} = l | C_{n,h_n(X_{m+1})} = c_n]
 ∝ binom(c_n, l) ((1 − α)_{(l)}/(θ + c_n − l)_{(l+1)}) E[(Γ(θ/α + 1 + K_{c_n−l} + K_{m−c_n})/(Γ(θ/α + 1 + K_{c_n−l})Γ(θ/α + 1 + K_{m−c_n}))) (1/J)^{K_{c_n−l}} (1 − 1/J)^{K_{m−c_n}}] / E[(Γ(θ/α + K_{c_n+1} + K_{m−c_n})/(Γ(θ/α + K_{c_n+1})Γ(θ/α + K_{m−c_n}))) (1/J)^{K_{c_n+1}} (1 − 1/J)^{K_{m−c_n}}],

up to a factor not depending on l, where K_{c_n−l} and K_{m−c_n} in the numerator are independent random variables for any l = 0, 1, . . . , c_n − 1, and K_{c_n+1} and K_{m−c_n} in the denominator are independent random variables. For l = c_n the inner i-sum of the numerator collapses to the term i = 0, and the same manipulation gives

Pr[f_{X_{m+1}} = c_n | C_{n,h_n(X_{m+1})} = c_n] ∝ ((1 − α)_{(c_n)}/(θ)_{(c_n+1)}) E[(1 − 1/J)^{K_{m−c_n}} Γ(θ/α + 1 + K_{m−c_n})/Γ(θ/α + 1 + K_{m−c_n} − K_{m−c_n})Γ(θ/α + K_{m−c_n})^{−1}]^{?},

which, written out, reads

Pr[f_{X_{m+1}} = c_n | C_{n,h_n(X_{m+1})} = c_n] ∝ ((1 − α)_{(c_n)}/(θ)_{(c_n+1)}) E[(1 − 1/J)^{K_{m−c_n}}] / E[(Γ(θ/α + K_{c_n+1} + K_{m−c_n})/(Γ(θ/α + K_{c_n+1})Γ(θ/α + K_{m−c_n}))) (1/J)^{K_{c_n+1}} (1 − 1/J)^{K_{m−c_n}}],

where K_{c_n+1} and K_{m−c_n} are independent random variables. This completes the proof of Equation (27).

Appendix E. An alternative expression for Equation (24)

For any α ∈ (0, 1), an alternative expression for (24) may be given in terms of the distribution of exponentially tilted α-stable random variables (Zolotarev, 1986). In particular, if g_α denotes the density function of a positive α-stable distribution, then for any c > 0 an exponentially tilted α-stable random variable is defined as the random variable T_{α,c} whose distribution has density function f_{T_{α,c}}(x) ∝ exp{−c^{1/α}x} g_α(x) 1_{R+}(x). If c_n > 0, then for l = 0, 1, . . . , c_n,
Pr[f_{X_{m+1}} = l | C_{n,h_n(X_{m+1})} = c_n]
 ∝ (1 − α)_{(l)} (Σ_{i=0}^{c_n−l} Σ_{j=0}^{m−c_n} ((θ + α)/α)_{(i+j)} (1/J)^{i}(1 − 1/J)^{j} C(c_n − l, i; α) C(m − c_n, j; α)) / (Σ_{i=0}^{c_n+1} Σ_{j=0}^{m−c_n} (θ/α)_{(i+j)} (1/J)^{i}(1 − 1/J)^{j} C(c_n + 1, i; α) C(m − c_n, j; α))
 = (1 − α)_{(l)} ((1/Γ((θ + α)/α)) ∫_0^{+∞} x^{(θ+α)/α − 1} e^{−x} Σ_{i=0}^{c_n−l} Σ_{j=0}^{m−c_n} (x/J)^{i} (x(1 − 1/J))^{j} C(c_n − l, i; α) C(m − c_n, j; α) dx) / ((1/Γ(θ/α)) ∫_0^{+∞} x^{θ/α − 1} e^{−x} Σ_{i=0}^{c_n+1} Σ_{j=0}^{m−c_n} (x/J)^{i} (x(1 − 1/J))^{j} C(c_n + 1, i; α) C(m − c_n, j; α) dx),

where the second identity follows from the Gamma integral representation of the rising factorial (a)_{(i+j)}. By means of Favaro et al. (2015, Equation 13), the inner sums can be written in terms of the distribution of T_{α,c}, for suitable choices of c. Collecting the resulting powers of x into Gamma densities then gives

Pr[f_{X_{m+1}} = l | C_{n,h_n(X_{m+1})} = c_n] ∝ (1 − α)_{(l)} (Γ((θ + α + m − l)/α) ∫_0^{+∞} E[T^{c_n−l}_{α, x/J}] E[T^{m−c_n}_{α, x(1−1/J)}] f_{G_{(θ+α+m−l)/α, 1}}(x) dx) / (Γ((θ + m + 1)/α) ∫_0^{+∞} E[T^{c_n+1}_{α, x/J}] E[T^{m−c_n}_{α, x(1−1/J)}] f_{G_{(θ+m+1)/α, 1}}(x) dx),   (41)

where f_{G_{a,b}} is the density function of a Gamma distribution with parameter (a, b). Equation (41) allows for an MC evaluation of (24), which requires to sample from a Gamma distribution and to sample T_{α,c}, for suitable choices of c. See Devroye (2009) and references therein.

Appendix F. Proof of Equation (28) and Equation (29)

Under the setting of Theorem 3, we consider m → +∞ while c_n is fixed. For any l = 0, 1, . . . , c_n, dividing the numerator and the denominator of the posterior distribution of Theorem 3 by Σ_{j=0}^{m−c_n} (1 − 1/J)^{j} C(m − c_n, j; α) gives

Pr[f_{X_{m+1}} = l | C_{n,h_n(X_{m+1})} = c_n] ∝ (1 − α)_{(l)} (Σ_{i=0}^{c_n−l} (1/J)^{i} C(c_n − l, i; α) R^{(θ+α)}_{m}(i)) / (Σ_{i=0}^{c_n+1} (1/J)^{i} C(c_n + 1, i; α) R^{(θ)}_{m}(i)),   (42)

where R^{(a)}_{m}(i) := (Σ_{j=0}^{m−c_n} (a/α)_{(i+j)} (1 − 1/J)^{j} C(m − c_n, j; α)) / (Σ_{j=0}^{m−c_n} (1 − 1/J)^{j} C(m − c_n, j; α)). From Dolera and Favaro (2020a, Lemma 2), the large m asymptotic behaviour of R^{(θ+α)}_{m}(i) and of R^{(θ)}_{m}(i) is available as m → +∞; this gives the limits (43) and (44) for the numerator and the denominator of (42), respectively. Then, by combining (42) with (43) and (44), for any l = 0, 1, . . . , c_n, as m → +∞ we can write

lim_{m→+∞} Pr[f_{X_{m+1}} = l | C_{n,h_n(X_{m+1})} = c_n]
 ∝ (1 − α)_{(l)} (1/(θ + α)_{(c_n−l)}) Σ_{i=0}^{c_n−l} (((θ + α)/α)_{(i)}/(θ + α)_{(c_n−l)}) C(c_n − l, i; α) / ((1/(θ)_{(c_n+1)}) Σ_{i=0}^{c_n+1} ((θ/α)_{(i)}/(θ)_{(c_n+1)}) C(c_n + 1, i; α))

[by Pitman (2006, Equation 3.13)]
 ∝ (1 − α)_{(l)} ((θ + 2α)_{(c_n−l)}/(α(θ + α + 1)_{(c_n−l−1)})) / ((θ + α)_{(c_n+1)}/(α(θ + 1)_{(c_n+1−1)}))
 ∝ (1 − α)_{(l)} (θ + 2α)_{(c_n−l)} / (θ + α + 1)_{(c_n)}.

This completes the proof of Equation (28). Equation (29) follows by a direct calculation from (28).

Appendix G. Proof of Proposition 5

Let B_{a,b} be a Beta random variable with parameter (a, b), and denote by f_{B_{a,b}} the density function of the distribution of B_{a,b}.
We start with some considerations on the distribution of B_{a,b}:

Γ(θ + α + m − l)Γ(1 − α + l)/Γ(θ + m + 1) = ∫_0^1 t^{θ+α+m−l−1}(1 − t)^{l−α} dt;
Γ(θ + α + m − l)Γ(1 − α + l)/Γ(θ + m + 1) = (Γ(θ + α)Γ(1 − α)/Γ(θ + 1)) E[B^{m−l}_{θ+α,1−α}(1 − B_{θ+α,1−α})^{l}].

Moreover, we observe that we can rewrite the numerator and the denominator of (27) as follows:

∫_{(0,+∞)^2} g_α(h) g_α(x) x^{−(θ+α)} (h/((J − 1)^{1/α} x))^{m−c_n} (h/((J − 1)^{1/α} x) + 1)^{−(θ+m−l+α)} dx dh
 = (Γ(2 + θ/α)/Γ(1 + θ + α)) ∫_{(0,+∞)^2} f_{S_{α,0}}(h) f_{S_{α,θ+α}}(x) (h/((J − 1)^{1/α} x))^{m−c_n} (h/((J − 1)^{1/α} x) + 1)^{−(θ+m−l+α)} dx dh
 = (Γ(2 + θ/α)/Γ(1 + θ + α)) ∫_0^{+∞} (x^{m−c_n}/(x + 1)^{θ+m−l+α}) f_{Z_{α,θ+α}}(x) dx

and

∫_{(0,+∞)^2} g_α(h) g_α(x) x^{−θ} (h/((J − 1)^{1/α} x))^{m−c_n} (h/((J − 1)^{1/α} x) + 1)^{−(θ+m+1)} dx dh
 = (Γ(1 + θ/α)/Γ(1 + θ)) ∫_{(0,+∞)^2} f_{S_{α,0}}(h) f_{S_{α,θ}}(x) (h/((J − 1)^{1/α} x))^{m−c_n} (h/((J − 1)^{1/α} x) + 1)^{−(θ+m+1)} dx dh
 = (Γ(1 + θ/α)/Γ(1 + θ)) ∫_0^{+∞} (x^{m−c_n}/(x + 1)^{θ+m+1}) f_{W_{α,θ}}(x) dx,

respectively, where Z_{α,θ+α} and W_{α,θ} denote the ratios S_{α,0}/((J − 1)^{1/α} S_{α,θ+α}) and S_{α,0}/((J − 1)^{1/α} S_{α,θ}). First, we prove that the distribution Pr[f_{X_{m+1}} ∈ · | C_{n,h_n(X_{m+1})} = c_n] admits a representation as a suitable mixture of Binomial distributions. In particular, combining the previous identities with the Beta representation above, we can write

Pr[f_{X_{m+1}} = l | C_{n,h_n(X_{m+1})} = c_n]
 = ((Γ(θ + 1)Γ(m − l + θ + α)Γ(l + 1 − α))/(Γ(θ + α)Γ(1 − α)Γ(m + θ + 1))) (1/D(m, c_n; α, θ, J)) ∫_0^{+∞} x^{m−c_n}(x + 1)^{l−m−θ−α} f_{Z_{α,θ+α}}(x) dx   (45)
 = (1/D(m, c_n; α, θ, J)) ∫_0^1 ∫_0^{+∞} binom(c_n, l) (1 − t)^{l} (t/(x + 1))^{c_n−l} ((tx/(x + 1))^{m−c_n}/(x + 1)^{θ+α}) f_{B_{θ+α,1−α}}(t) f_{Z_{α,θ+α}}(x) dt dx,   (46)

where

D(m, c_n; α, θ, J) = ∫_0^{+∞} (x^{m−c_n}/(x + 1)^{θ+m+1}) f_{W_{α,θ}}(x) dx.

Equation (46) shows that the posterior distribution is a mixture of Binomial distributions, mixed over (t, x) with respect to f_{B_{θ+α,1−α}}(t) f_{Z_{α,θ+α}}(x) dt dx. Now, let F_{X_{m+1}} be a random variable with distribution (46), and compute its moment of order r. From the mixture representation,

E[(F_{X_{m+1}})^r] = E[f^{r}_{X_{m+1}} | C_{n,h_n(X_{m+1})} = c_n] = (1/D(m, c_n; α, θ, J)) ∫_0^1 ∫_0^{+∞} (Σ_{l=0}^{c_n} l^{r} binom(c_n, l)(1 − t)^{l}(t/(x + 1))^{c_n−l}) ((tx/(x + 1))^{m−c_n}/(x + 1)^{θ+α}) f_{B_{θ+α,1−α}}(t) f_{Z_{α,θ+α}}(x) dt dx.

Now, in the previous expression, we consider the summation within brackets. Recall that the (u, v)-th Stirling number of the second type, here denoted by S(u, v), is defined as the v-th coefficient in the expansion of t^u into falling factorials, i.e. t^u = Σ_{0≤v≤u} S(u, v)(t)_[v]; moreover, it is assumed that S(0, 0) = 1, S(u, 0) = 0 for u > 0 and S(u, v) = 0 for v > u. Then, using (l)_[k] binom(c_n, l) = (c_n)_[k] binom(c_n − k, l − k) and the binomial theorem, we write

Σ_{l=0}^{c_n} l^{r} binom(c_n, l)(1 − t)^{l}(t/(x + 1))^{c_n−l}
 = Σ_{k=0}^{r} S(r, k) (c_n!/(c_n − k)!) (1 − t)^{k} Σ_{j=0}^{c_n−k} binom(c_n − k, j)(1 − t)^{j}(t/(x + 1))^{c_n−k−j}
 = Σ_{k=0}^{r} S(r, k) (c_n!/(c_n − k)!) (1 − t)^{k} (1 − t + t/(x + 1))^{c_n−k}
 = c_n^{r}(1 − t)^{r}(1 − t + t/(x + 1))^{c_n−r} + O(c_n^{r−1}),

where O(c_n^{r−1}) in the last identity is intended as c_n → +∞.
Accordingly, we can write the following:

E[f^{r}_{X_{m+1}} | C_{n,h_n(X_{m+1})} = c_n] = (1/D(m, c_n; α, θ, J)) ∫_0^1 ∫_0^{+∞} (1 − t)^{r}(1 − t + t/(x + 1))^{c_n−r} ((tx/(x + 1))^{m−c_n}/(x + 1)^{θ+α}) f_{B_{θ+α,1−α}}(t) f_{Z_{α,θ+α}}(x) dt dx + O(1/c_n).   (47)

Now, the double integral on the right-hand side of (47) can be rewritten by means of the following change of variable: y = (1 + x)/((1 + x)(1 − t) + t) ∈ (1, 1/(1 − t)). In particular, we can write

∫_0^1 ∫_0^{+∞} (1 − t)^{r}(1 − t + t/(x + 1))^{c_n−r} ((tx/(x + 1))^{m−c_n}/(x + 1)^{θ+α}) f_{B_{θ+α,1−α}}(t) f_{Z_{α,θ+α}}(x) dt dx   (48)
 = ∫_0^1 (∫_1^{1/(1−t)} y^{r−c_n}(y − 1)^{m−c_n} (1/(y(1 − t)))^{θ+α} t f_{Z_{α,θ+α}}((y − 1)/(1 − y(1 − t))) (1/[1 − y(1 − t)]^2) dy) f_{B_{θ+α,1−α}}(t) dt.

We develop a large m asymptotic analysis of (48), as well as of D(m, c_n; α, θ, J), under the large m asymptotic regime c_n = λm. We start from the term D(m, c_n; α, θ, J), which we rewrite as

D(m, c_n; α, θ, J) = ∫_0^{+∞} (x^{β}/(x + 1))^{m} ϕ(x) dx,   (49)

where β := 1 − λ and ϕ(x) := f_{W_{α,θ}}(x)/(1 + x)^{θ+1}. The function ψ : x ↦ x^{β}/(x + 1) has a unique maximum point x* := β/(1 − β) = (1 − λ)/λ. Moreover, straightforward computations show that

ψ'(x) = (βx^{β−1} − (1 − β)x^{β})/(1 + x)^{2},

and that ψ''(x*) = −x*^{β−1}/(1 + x*)^{3}. Then, the Laplace method leads to the following large m behaviour:

D(m, c_n; α, θ, J) ≍ m^{−1/2} ψ(x*)^{m} ϕ(x*) (2π ψ(x*)/|ψ''(x*)|)^{1/2}.   (50)

We now consider (48), i.e. the integral within brackets on the right-hand side of (48), which we rewrite as

∫_0^{+∞} (1 + x)^{r} (x^{β}/(x + 1))^{m} ϕ_t(x) dx, with ϕ_t(x) := (1/((x + 1)(1 − t)))^{θ+α} t f_{Z_{α,θ+α}}((x − 1)/((x + 1)(1 − t))) (1/[1 − (x + 1)(1 − t)]^{2}).

To apply the Laplace method, we note that (0, 1) ∋ t ↦ t/(1 − t) ∈ (0, +∞) is a strictly monotonically increasing function. Thus, β < t entails x* := β/(1 − β) < t/(1 − t) and, for such t, it holds

∫_0^{+∞} (1 + x)^{r}(x^{β}/(x + 1))^{m} ϕ_t(x) dx ≍ (1 + x*)^{r} ψ(x*)^{m} × (lower order terms)   (51)

for large m. On the other hand, β > t entails x* > t/(1 − t) and, for such t, a similar large m asymptotic expansion holds. Now, by exploiting the fact that ψ is strictly monotonically increasing for x ∈ (0, t/(1 − t)), we can write

∫_0^{+∞} (1 + x)^{r}(x^{β}/(x + 1))^{m} ϕ_t(x) dx ≍ m^{−(θ+2α)} [t^{β}(1 − t)^{1−β}]^{m} ρ(t)

for large m, where ρ is a suitable function independent of m. Accordingly, we can write that

∫_0^{β} (1 − t)^{r}[t^{β}(1 − t)^{1−β}]^{m} ρ(t) f_{B_{θ+α,1−α}}(t) dt ≤ C(β)/m^{1+θ+2α}   (52)

as m → +∞. Then, starting from Equation (47) and then gathering (51) and (52), we can write

E[f^{r}_{X_{m+1}} | C_{n,h_n(X_{m+1})} = c_n] ≈ (1/D(m, c_n; α, θ, J)) ∫_0^{β} (1 − t)^{r}[t^{β}(1 − t)^{1−β}]^{m} ρ(t) f_{B_{θ+α,1−α}}(t) dt + ∫_{β}^{1} (1 − t)^{r} ϕ_t(x*) f_{B_{θ+α,1−α}}(t) dt   (53)

as m → +∞. According to (50), the first term on the right-hand side of (53) is negligible, and hence

E[f^{r}_{X_{m+1}} | C_{n,h_n(X_{m+1})} = c_n] ≈ ∫_{β}^{1} (1 − t)^{r} ϕ_t(x*) f_{B_{θ+α,1−α}}(t) dt = ∫_0^{1−β} τ^{r} ϕ_{1−τ}(x*) f_{B_{θ+α,1−α}}(1 − τ) dτ

as m → +∞, which, because of the large m asymptotic regime c_n = λm, completes the proof.

Appendix H. Proof of Equation (31)

Because of the independence assumption on H, i.e. the hash functions h_n's are i.i.d. according to the strong universal family H, and by an application of Bayes theorem, we can write

Pr[f_{X_{m+1}} = l | {C_{n,h_n(X_{m+1})}}_{n∈[N]} = {c_n}_{n∈[N]}]
 = (1/Pr[{C_{n,h_n(X_{m+1})}}_{n∈[N]} = {c_n}_{n∈[N]}]) Pr[f_{X_{m+1}} = l] ∏_{n=1}^{N} Pr[C_{n,h_n(X_{m+1})} = c_n | f_{X_{m+1}} = l]
 = (1/Pr[{C_{n,h_n(X_{m+1})}}_{n∈[N]} = {c_n}_{n∈[N]}]) Pr[f_{X_{m+1}} = l] ∏_{n=1}^{N} Pr[C_{n,h_n(X_{m+1})} = c_n, f_{X_{m+1}} = l]/Pr[f_{X_{m+1}} = l]
 = (1/Pr[{C_{n,h_n(X_{m+1})}}_{n∈[N]} = {c_n}_{n∈[N]}]) (Pr[f_{X_{m+1}} = l])^{1−N} ∏_{n=1}^{N} Pr[C_{n,h_n(X_{m+1})} = c_n] Pr[f_{X_{m+1}} = l | C_{n,h_n(X_{m+1})} = c_n]
 ∝ (Pr[f_{X_{m+1}} = l])^{1−N} ∏_{n=1}^{N} Pr[f_{X_{m+1}} = l | C_{n,h_n(X_{m+1})} = c_n]   (54)

for l = 0, 1, . . . , min{c_1, . . . , c_N},
where Pr[f_{X_{m+1}} = l | C_{n,h_n(X_{m+1})} = c_n] is precisely the posterior distribution computed in Theorem 3 with respect to the hash function h_n, whereas

Pr[f_{X_{m+1}} = l] = Σ_{m∈M_{m,k}} Pr[M_m = m] Pr[X_{m+1} is a type of frequency l | X_{1:m}],

where Pr[M_m = m] is given in Equation (18) and the predictive probability is given in Equation (20). That is,

Pr[f_{X_{m+1}} = l] = E[(θ + αK_m)/(θ + m)] if l = 0, and Pr[f_{X_{m+1}} = l] = E[(l − α)M_{l,m}/(θ + m)] if l ≥ 1.

In particular,

Pr[f_{X_{m+1}} = 0] = θ/(θ + m) + (α/(θ + m)) E[K_m] = θ/(θ + m) + (α/(θ + m)) ((θ + α)_{(m)}/(α(θ + 1)_{(m−1)}) − θ/α),

where the last equality follows from Pitman (2006, Equation 3.13). Accordingly, we can write that

Pr[f_{X_{m+1}} = 0] = (θ + α)_{(m)}/(θ + 1)_{(m)}.

Similarly, for l ≥ 1,

Pr[f_{X_{m+1}} = l] = ((l − α)/(θ + m)) E[M_{l,m}] = ((l − α)/(θ + m)) ((1 − α)_{(l−1)}/l!) (m)_{[l]} (θ + α)_{(m−l)}/(θ + 1)_{(m−1)},

where the last equality follows from Favaro et al. (2013, Proposition 1). Accordingly, for l = 1, . . . , m,

Pr[f_{X_{m+1}} = l] = ((1 − α)_{(l)}/l!) (m)_{[l]} (θ + α)_{(m−l)}/(θ + 1)_{(m)}.   (55)

Equation (31) follows by combining the distribution (54) with (55). This completes the proof.

Appendix I. CMS for range queries under DP priors

We assume that the stream x_{1:m} is modeled as a random sample X_{1:m} from an unknown discrete distribution P, which is endowed with a DP prior, i.e. P ∼ DP(θ; ν). Let h_1, . . . , h_N be a collection of random hash functions that are i.i.d. from the strong universal family H, and assume that h_1, . . . , h_N are independent of X_{1:m} for any m ≥ 1; in particular, by de Finetti's representation theorem, h_1, . . . , h_N are independent of P ∼ DP(θ; ν). Under this BNP framework, an s-range query induces the posterior distribution of the frequencies (f_{x_{m+1}}, . . . , f_{x_{m+s}}) given {(C_{n,h_n(v_1)}, . . . , C_{n,h_n(v_s)})}_{n∈[N]}, for arbitrary {x_{m+1}, . . . , x_{m+s}} ⊂ V. This posterior distribution, in turn, induces the posterior distribution of the s-range query f_s given {(C_{n,h_n(v_1)}, . . . , C_{n,h_n(v_s)})}_{n∈[N]}. CMS-DP estimates of f_s are obtained as functionals of the posterior distribution of f_s given {(C_{n,h_n(v_1)}, . . . , C_{n,h_n(v_s)})}_{n∈[N]}. To compute the posterior distribution of (f_{x_{m+1}}, . . . , f_{x_{m+s}}) given {(C_{n,h_n(v_1)}, . . . , C_{n,h_n(v_s)})}_{n∈[N]}, it is natural to consider s additional random samples (X_{m+1}, . . . , X_{m+s}). In particular, for any r = 1, . . . , s let f_{X_{m+r}} be the frequency of X_{m+r} in X_{1:m}, i.e.

f_{X_{m+r}} = Σ_{i=1}^{m} 1_{{X_i}}(X_{m+r}),

and let C_{n,h_n(X_{m+r})} be the hashed frequency of all X_i's, for i = 1, . . . , m, such that h_n(X_i) = h_n(X_{m+r}), i.e.

C_{n,h_n(X_{m+r})} = Σ_{i=1}^{m} 1_{{h_n(X_i)}}(h_n(X_{m+r})).

Now, let X_s = (X_{m+1}, . . . , X_{m+s}) and f_{X_s} = (f_{X_{m+1}}, . . . , f_{X_{m+s}}), and for n ∈ [N] let C_{n,h_n(X_s)} = (C_{n,h_n(X_{m+1})}, . . . , C_{n,h_n(X_{m+s})}). For each h_n we are interested in the posterior distribution

Pr[f_{X_s} = l_s | C_{n,h_n(X_s)} = c_n] = Pr[f_{X_s} = l_s, C_{n,h_n(X_s)} = c_n] / Pr[C_{n,h_n(X_s)} = c_n]   (56)

for l_s ∈ {0, 1, . . . , m}^s. For the collection of hash functions h_1, . . . , h_N, the posterior distribution of f_{X_s} given {C_{n,h_n(X_s)}}_{n∈[N]} follows from the posterior distribution (56) by the assumption that the h_n's are i.i.d. according to the strong universal family H, and Bayes theorem. Hereafter we show that the Bayesian proof of Section 2 can be readily extended to the computation of the posterior distribution (56). We outline this extension for any range s ≥ 1, and then we present an explicit example for s = 2. To simplify the notation, we remove the subscript n from h_n and c_n. Then, we are interested in computing the posterior distribution

Pr[f_{X_s} = l_s | C_{h(X_s)} = c] = Pr[f_{X_s} = l_s, C_{h(X_s)} = c] / Pr[C_{h(X_s)} = c].   (57)
For s = 1 the posterior distribution (57) reduces to (10). The independence between h_n and X_{1:m} allows us to invoke the freezing lemma (Baldi, 2017, Lemma 4.1), according to which we can treat h_n as if it were fixed, i.e. non-random. We analyze the posterior distribution (57) starting from its denominator. In particular, the denominator of (57) can be written as follows:

Pr[C_{h(X_s)} = c] = Σ_{(j_1,...,j_s)∈[J]^s} Pr[C_{h(X_s)} = c, h(X_{m+1}) = j_1, . . . , h(X_{m+s}) = j_s]
 = Σ_{(j_1,...,j_s)∈[J]^s} Pr[Σ_{i=1}^{m} 1_{h(X_i)}(j_1) = c_1, . . . , Σ_{i=1}^{m} 1_{h(X_i)}(j_s) = c_s, h(X_{m+1}) = j_1, . . . , h(X_{m+s}) = j_s].

To evaluate

Pr[Σ_{i=1}^{m} 1_{h(X_i)}(j_1) = c_1, . . . , Σ_{i=1}^{m} 1_{h(X_i)}(j_s) = c_s, h(X_{m+1}) = j_1, . . . , h(X_{m+s}) = j_s]   (58)

we split the sum over [J]^s and organize the summands as follows. First, we introduce a variable k which counts how many distinct objects there are in each vector (j_1, . . . , j_s), so that k ∈ {1, 2, . . . , min{s, J}}. Second, we consider the vector (r_1, . . . , r_k) of frequencies of the k distinct objects. Third, we consider the vector (j'_1, . . . , j'_k) of distinct objects with {j'_1, . . . , j'_k} ⊂ {1, . . . , J}. Then, we evaluate the probability (58) in the distinguishing case that

j_1 = · · · = j_{r_1} =: j'_1,  j_{r_1+1} = · · · = j_{r_1+r_2} =: j'_2,  . . . ,  j_{r_1+···+r_{k−1}+1} = · · · = j_{r_1+···+r_k} =: j'_k,

such that the probability (58) of interest is different from zero if and only if the following holds true:

c_1 = · · · = c_{r_1} =: c'_1,  c_{r_1+1} = · · · = c_{r_1+r_2} =: c'_2,  . . . ,  c_{r_1+···+r_{k−1}+1} = · · · = c_{r_1+···+r_k} =: c'_k.

Now, we set B'_r := {x ∈ V : h(x) = j'_r} for any r ∈ {1, . . . , k}, and we set B'_{k+1} := (∪_{r=1}^{k} B'_r)^c. Thus, {B'_1, . . . , B'_{k+1}} is a finite partition of V. If k = J, then B'_{k+1} = ∅, and in such a case we intend that {B'_1, . . . , B'_{k+1}} is replaced by {B'_1, . . . , B'_k}. Accordingly, we can write the identity

Pr[Σ_{i=1}^{m} 1_{h(X_i)}(j'_1) = c'_1, . . . , Σ_{i=1}^{m} 1_{h(X_i)}(j'_k) = c'_k, h(X_{m+1}) = · · · = h(X_{m+r_1}) = j'_1, . . . , h(X_{m+r_1+···+r_{k−1}+1}) = · · · = h(X_{m+r_1+···+r_k}) = j'_k]
 = binom(m; c'_1, . . . , c'_k) ∫ ∏_{i=1}^{k} p_i^{c'_i+r_i} (1 − p_1 − · · · − p_k)^{m − Σ_{i=1}^{k} c'_i} μ_{B'_1,...,B'_{k+1}}(dp_1 · · · dp_k),

where μ_{B'_1,...,B'_{k+1}} is the distribution of (P(B'_1), . . . , P(B'_k)) which, by the finite-dimensional projective property of the DP, is a Dirichlet distribution with parameter (θ/J, . . . , θ/J) on the simplex Δ_k. If k < J,

Pr[Σ_{i=1}^{m} 1_{h(X_i)}(j'_1) = c'_1, . . . , Σ_{i=1}^{m} 1_{h(X_i)}(j'_k) = c'_k, h(X_{m+1}) = · · · = j'_1, . . . , h(X_{m+r_1+···+r_k}) = j'_k]
 = binom(m; c'_1, . . . , c'_k) (Γ(θ)/([Γ(θ/J)]^{k} Γ((J − k)θ/J))) (∏_{i=1}^{k} Γ(θ/J + c'_i + r_i)) Γ((J − k)θ/J + m − Σ_{i=1}^{k} c'_i) / Γ(θ + m + s),

and if k = J,

 = binom(m; c'_1, . . . , c'_k) (Γ(θ)/(Γ(θ/J))^{J}) (∏_{i=1}^{k} Γ(θ/J + c'_i + r_i)) / Γ(θ + m + s).

Upon denoting by I_k(c'_1, . . . , c'_k; r_1, . . . , r_k) the right-hand side of the integral above, we conclude that

Pr[C_{h(X_s)} = c] = Σ_{k=1}^{min{s,J}} (J)_{[k]} Σ_{(π_1,...,π_k)∈Π(s,k)} Δ(π_1, . . . , π_k; c_1, . . . , c_s) binom(m; c'_1, . . . , c'_k) I_k(c'_1, . . . , c'_k; |π_1|, . . . , |π_k|),   (59)

where: i) Π(s, k) denotes the set of all possible partitions of the set {1, . . . , s} into k disjoint subsets π_1, . . . , π_k, and |π_i| stands for the cardinality of the subset π_i; ii) Δ(π_1, . . . , π_k; c_1, . . . , c_s) is either 0 or 1, with the proviso that it equals 1 if and only if, for all z ∈ {1, . . . , k} for which |π_z| ≥ 2, all the integers c_i with i ∈ π_z are equal; for any i ∈ {1, . . . , k}, c'_i represents the common integer associated to π_i.
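The combinatorial bookkeeping in (59) is easy to mechanize. The following is a small, self-contained sketch (ours, for illustration) that enumerates the partitions Π(s, k) and applies the compatibility indicator, i.e. checks that all c_i within a block coincide.

```python
def set_partitions(s):
    """Yield all partitions of {0, ..., s-1} as lists of blocks."""
    def rec(i, blocks):
        if i == s:
            yield [list(b) for b in blocks]
            return
        for b in blocks:            # put item i into an existing block
            b.append(i)
            yield from rec(i + 1, blocks)
            b.pop()
        blocks.append([i])          # or open a new block with item i
        yield from rec(i + 1, blocks)
        blocks.pop()
    yield from rec(0, [])

def compatible(blocks, c):
    """The 0/1 indicator of (59): within each block all c_i must coincide."""
    return all(len({c[i] for i in b}) == 1 for b in blocks)

# Example with s = 2: the two partitions of {1, 2} recover the two terms of
# (60), namely {{1, 2}} (which requires c_1 = c_2) and {{1}, {2}}.
for blocks in set_partitions(2):
    print(blocks, "compatible with c=(3,3):", compatible(blocks, [3, 3]),
          "| c=(3,5):", compatible(blocks, [3, 5]))
```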
Formula (59) simplifies remarkably for small values of s. For instance, i) for s = 1,

Pr[C_{h(X_{m+1})} = c_1] = J binom(m, c_1) I_1(c_1; 1);

ii) for s = 2,

Pr[C_{h(X_{m+1})} = c_1, C_{h(X_{m+2})} = c_2] = J 1{c_1 = c_2} binom(m, c_1) I_1(c_1; 2) + J(J − 1) binom(m; c_1, c_2) I_2(c_1, c_2; 1, 1).   (60)

We conclude by studying the numerator in (57). This expression is determined by the complete knowledge of the joint distribution of (X_1, . . . , X_{m+s}). As above, we can start by writing

Pr[f_{X_s} = l_s, C_{h(X_s)} = c] = Σ_{k} Σ_{(π_1,...,π_k)∈Π(s,k)} Δ(π_1, . . . , π_k; l_1, . . . , l_s) binom(m; l'_1, . . . , l'_k) Pr[B(m; l'_1, . . . , l'_k; π_1, . . . , π_k), Σ_{i=1}^{m} 1_{h(X_i)}(j_1) = c_1, . . . , Σ_{i=1}^{m} 1_{h(X_i)}(j_s) = c_s],

where the event B(m; l'_1, . . . , l'_k; π_1, . . . , π_k) is characterized by the following relations among the random variables X_{m+r}'s:

X_1 = · · · = X_{l'_1} = X_{m+r} for all r ∈ π_1,
X_{l'_1+1} = · · · = X_{l'_1+l'_2} = X_{m+r} for all r ∈ π_2,
. . .
X_{l'_1+···+l'_{k−1}+1} = · · · = X_{l'_1+···+l'_k} = X_{m+r} for all r ∈ π_k,
X_{m+r_1} ≠ X_{m+r_2} for all r_1 ∈ π_a, r_2 ∈ π_b with a ≠ b,
{X_{l'_1+···+l'_k+1}, . . . , X_m} ∩ {X_{m+1}, . . . , X_{m+s}} = ∅.

The numerator of (57) can be treated as the denominator of (57), namely by exploiting the double partition structure induced by the above relations on the random variables X_i's and h(X_i)'s. We observe that the combination of these two partition structures proves particularly cumbersome to write for general s ≥ 1. For this reason, further manipulations of the posterior distribution (57) are deferred to the proof of the next theorem, where we assume s = 2.

Theorem 6 For m ≥ 1, let x_{1:m} be a stream of tokens that are modeled as a random sample X_{1:m} from P ∼ DP(θ; ν), and let (X_{m+1}, X_{m+2}) be a pair of additional random samples from P. Moreover, let h_n be a random hash function distributed as the strong universal family H, and let h_n be independent of X_{1:m} for any m ≥ 1, that is h_n is independent of P. Then

Pr[f_{X_{m+1}} = l_1, f_{X_{m+2}} = l_2 | C_{n,h_n(X_{m+1})} = c_{n,1}, C_{n,h_n(X_{m+2})} = c_{n,2}] = Num(l_1, l_2, c_{n,1}, c_{n,2}) / Den(c_{n,1}, c_{n,2})

for l_1, l_2 ≥ 0, where

Den(c_{n,1}, c_{n,2}) = J 1{c_{n,1} = c_{n,2} =: c} ((θ/J)_{(c+2)} (θ − θ/J)_{(m−c)})/(c!(m − c)!) + J(J − 1) ((θ/J)_{(c_{n,1}+1)} (θ/J)_{(c_{n,2}+1)} (θ − 2θ/J)_{(m−c_{n,1}−c_{n,2})})/(c_{n,1}! c_{n,2}! (m − c_{n,1} − c_{n,2})!);

Num(l_1, l_2, c_{n,1}, c_{n,2}) = 1{l_1 = l_2 =: l, c_{n,1} = c_{n,2} =: c} θ(l + 1) ((θ/J)_{(c−l)} (θ − θ/J)_{(m−c)})/((c − l)!(m − c)!)
 + 1{c_{n,1} = c_{n,2} =: c} (θ^2/J) ((θ/J)_{(c−l_1−l_2)} (θ − θ/J)_{(m−c)})/((c − l_1 − l_2)!(m − c)!)
 + θ^2 ((θ/J)_{(c_{n,1}−l_1)} (θ/J)_{(c_{n,2}−l_2)} (θ − 2θ/J)_{(m−c_{n,1}−c_{n,2})})/((c_{n,1} − l_1)!(c_{n,2} − l_2)!(m − c_{n,1} − c_{n,2})!).

Proof Following the Bayesian proof for s ≥ 1, we start by expressing the posterior distribution of (f_{X_{m+1}}, f_{X_{m+2}}) given C_{n,h_n(X_{m+1})} and C_{n,h_n(X_{m+2})} as a ratio of two probabilities, and then we deal with the numerator and the denominator. That is, we write

Pr[f_{X_{m+1}} = l_1, f_{X_{m+2}} = l_2 | C_{n,h_n(X_{m+1})} = c_{n,1}, C_{n,h_n(X_{m+2})} = c_{n,2}]   (61)
 = Pr[f_{X_{m+1}} = l_1, f_{X_{m+2}} = l_2, Σ_{i=1}^{m} 1_{h_n(X_i)}(X_{m+1}) = c_{n,1}, Σ_{i=1}^{m} 1_{h_n(X_i)}(X_{m+2}) = c_{n,2}] / Pr[C_{n,h_n(X_{m+1})} = c_{n,1}, C_{n,h_n(X_{m+2})} = c_{n,2}].

Observe that the denominator of the posterior distribution (61) reduces to (60).
Then, by using the finite-dimensional projective property of the DP, we can write the following expressions:

J 1{c_{n,1} = c_{n,2} =: c} binom(m, c) I_1(c; 2)
 = J 1{c_{n,1} = c_{n,2} =: c} binom(m, c) ∫_0^1 p^{c+2}(1 − p)^{m−c} (Γ(θ)/(Γ(θ/J)Γ(θ(1 − 1/J)))) p^{θ/J−1}(1 − p)^{θ(1−1/J)−1} dp
 = J 1{c_{n,1} = c_{n,2} =: c} binom(m, c) (Γ(θ)/(Γ(θ/J)Γ(θ(1 − 1/J)))) Γ(θ/J + c + 2)Γ(θ(1 − 1/J) + m − c)/Γ(θ + m + 2)

and

J(J − 1) binom(m; c_{n,1}, c_{n,2}) I_2(c_{n,1}, c_{n,2}; 1, 1)
 = J(J − 1) binom(m; c_{n,1}, c_{n,2}) ∫_{Δ_2} p_1^{c_{n,1}+1} p_2^{c_{n,2}+1} (1 − p_1 − p_2)^{m−c_{n,1}−c_{n,2}} (Γ(θ)/(Γ(θ/J)Γ(θ/J)Γ(θ(1 − 2/J)))) p_1^{θ/J−1} p_2^{θ/J−1} (1 − p_1 − p_2)^{θ(1−2/J)−1} dp_1 dp_2
 = J(J − 1) binom(m; c_{n,1}, c_{n,2}) (Γ(θ)/((Γ(θ/J))^2 Γ(θ(1 − 2/J)))) Γ(θ/J + c_{n,1} + 1)Γ(θ/J + c_{n,2} + 1)Γ(θ(1 − 2/J) + m − c_{n,1} − c_{n,2})/Γ(θ + m + 2).

Accordingly,

Pr[C_{n,h_n(X_{m+1})} = c_{n,1}, C_{n,h_n(X_{m+2})} = c_{n,2}]   (62)
 = J 1{c_{n,1} = c_{n,2} =: c} binom(m, c) (Γ(θ)/(Γ(θ/J)Γ(θ(1 − 1/J)))) Γ(θ/J + c + 2)Γ(θ(1 − 1/J) + m − c)/Γ(θ + m + 2)
 + J(J − 1) binom(m; c_{n,1}, c_{n,2}) (Γ(θ)/((Γ(θ/J))^2 Γ(θ(1 − 2/J)))) Γ(θ/J + c_{n,1} + 1)Γ(θ/J + c_{n,2} + 1)Γ(θ(1 − 2/J) + m − c_{n,1} − c_{n,2})/Γ(θ + m + 2).

Now, we focus on the numerator of the posterior distribution (61), which is rewritten as follows:

Pr[f_{X_{m+1}} = l_1, f_{X_{m+2}} = l_2, Σ_{i=1}^{m} 1_{h_n(X_i)}(X_{m+1}) = c_{n,1}, Σ_{i=1}^{m} 1_{h_n(X_i)}(X_{m+2}) = c_{n,2}]   (63)
 = Pr[f_{X_{m+1}} = l_1, f_{X_{m+2}} = l_2, Σ_{i=1}^{m} 1_{h_n(X_i)}(X_{m+1}) = c_{n,1}, Σ_{i=1}^{m} 1_{h_n(X_i)}(X_{m+2}) = c_{n,2}, X_{m+1} = X_{m+2}]
 + Pr[f_{X_{m+1}} = l_1, f_{X_{m+2}} = l_2, Σ_{i=1}^{m} 1_{h_n(X_i)}(X_{m+1}) = c_{n,1}, Σ_{i=1}^{m} 1_{h_n(X_i)}(X_{m+2}) = c_{n,2}, X_{m+1} ≠ X_{m+2}].

First, we consider the first term on the right-hand side of (63). In particular, we write

Pr[f_{X_{m+1}} = l_1, f_{X_{m+2}} = l_2, Σ_{i=1}^{m} 1_{h_n(X_i)}(X_{m+1}) = c_{n,1}, Σ_{i=1}^{m} 1_{h_n(X_i)}(X_{m+2}) = c_{n,2}, X_{m+1} = X_{m+2}]
 = 1{l_1 = l_2 =: l, c_{n,1} = c_{n,2} =: c} binom(m, l) Pr[X_1 = · · · = X_l = X_{m+1} = X_{m+2}, {X_{l+1}, . . . , X_m} ∩ {X_{m+1}} = ∅, Σ_{i=1}^{m} 1_{h_n(X_i)}(X_{m+1}) = c]
 = 1{l_1 = l_2 =: l, c_{n,1} = c_{n,2} =: c} binom(m, l) Pr[X_1 = · · · = X_l = X_{m+1} = X_{m+2}, {X_{l+1}, . . . , X_m} ∩ {X_{m+1}} = ∅, Σ_{i=l+1}^{m} 1_{h_n(X_i)}(X_{m+1}) = c − l],

which is determined by the distribution of (X_1, . . . , X_{m+2}). In view of Sangalli (2006, Equation 3.5),

Pr[X_1 ∈ C_1, . . . , X_{m+2} ∈ C_{m+2}] = Σ_{k} (θ^{k}/(θ)_{(m+2)}) Σ_{(π_1,...,π_k)∈Π(m+2,k)} ∏_{i=1}^{k} (|π_i| − 1)! ν(∩_{r∈π_i} C_r).

We set D(m, l) := {X_1 = · · · = X_l = X_{m+1} = X_{m+2}, {X_{l+1}, . . . , X_m} ∩ {X_{m+1}} = ∅}, and we define ν_{π_1,...,π_k} as the probability measure on (V^{m+2}, 𝒱^{m+2}) generated by

ν_{π_1,...,π_k}(C_1 × · · · × C_{m+2}) := ∏_{i=1}^{k} ν(∩_{r∈π_i} C_r).

It is clear that such measures attach to D(m, l) a probability value that is either 0 or 1. In particular, ν_{π_1,...,π_k}(D(m, l)) = 1 if and only if one of the π's (e.g. π_k, these partitions being given up to the order) is exactly equal to the set {1, . . . , l, m + 1, m + 2}. Accordingly, we write

Pr[D(m, l), Σ_{i=l+1}^{m} 1_{h_n(X_i)}(X_{m+1}) = c − l]
 = Σ_{k} (θ^{k}/(θ)_{(m+2)}) Σ_{(π_1,...,π_{k−1})∈Π(m−l,k−1)} (l + 1)! ∏_{i=1}^{k−1} (|π_i| − 1)! ν_{π_1,...,π_k}[Σ_{i=l+1}^{m} 1_{h_n(X_i)}(X_{m+1}) = c − l]
 = (θ(θ)_{(m−l)}/(θ)_{(m+2)}) (l + 1)! Σ_{r} (θ^{r}/(θ)_{(m−l)}) Σ_{(π_1,...,π_r)∈Π(m−l,r)} ∏_{i=1}^{r} (|π_i| − 1)! Σ_{j=1}^{J} ν(B_j) ν_{π_1,...,π_r}[Σ_{i=l+1}^{m} 1_{h_n(X_i)}(j) = c − l]
 = (θ(θ)_{(m−l)}/(J(θ)_{(m+2)})) (l + 1)! Σ_{r} (θ^{r}/(θ)_{(m−l)}) Σ_{(π_1,...,π_r)∈Π(m−l,r)} ∏_{i=1}^{r} (|π_i| − 1)! Σ_{j=1}^{J} ν_{π_1,...,π_r}[Σ_{i=l+1}^{m} 1_{h_n(X_i)}(j) = c − l],

and hence

Pr[f_{X_{m+1}} = l_1, f_{X_{m+2}} = l_2, Σ_{i=1}^{m} 1_{h_n(X_i)}(X_{m+1}) = c_{n,1}, Σ_{i=1}^{m} 1_{h_n(X_i)}(X_{m+2}) = c_{n,2}, X_{m+1} = X_{m+2}]
 = 1{l_1 = l_2 =: l, c_{n,1} = c_{n,2} =: c} (m!/((c − l)!(m − c)!)) θ(l + 1) (1/Γ(θ + m + 2)) (Γ(θ)/(Γ(θ/J)Γ(θ(1 − 1/J)))) Γ(θ/J + c − l)Γ(θ(1 − 1/J) + m − c).   (64)

Now, we consider the second term on the right-hand side of (63). In particular, we write

Pr[f_{X_{m+1}} = l_1, f_{X_{m+2}} = l_2, Σ_{i=1}^{m} 1_{h_n(X_i)}(X_{m+1}) = c_{n,1}, Σ_{i=1}^{m} 1_{h_n(X_i)}(X_{m+2}) = c_{n,2}, X_{m+1} ≠ X_{m+2}]
 = binom(m; l_1, l_2) Pr[X_1 = · · · = X_{l_1} = X_{m+1}, X_{l_1+1} = · · · = X_{l_1+l_2} = X_{m+2}, X_{m+1} ≠ X_{m+2}, {X_{l_1+l_2+1}, . . .
, X_m} ∩ {X_{m+1}, X_{m+2}} = ∅, Σ_{i=1}^{m} 1_{h_n(X_i)}(X_{m+1}) = c_{n,1}, Σ_{i=1}^{m} 1_{h_n(X_i)}(X_{m+2}) = c_{n,2}]
 = binom(m; l_1, l_2) Pr[X_1 = · · · = X_{l_1} = X_{m+1}, X_{l_1+1} = · · · = X_{l_1+l_2} = X_{m+2}, X_{m+1} ≠ X_{m+2}, {X_{l_1+l_2+1}, . . . , X_m} ∩ {X_{m+1}, X_{m+2}} = ∅, l_2 1_{h_n(X_{l_1+1})}(X_{m+1}) + Σ_{i=l_1+l_2+1}^{m} 1_{h_n(X_i)}(X_{m+1}) = c_{n,1} − l_1, l_1 1_{h_n(X_1)}(X_{m+2}) + Σ_{i=l_1+l_2+1}^{m} 1_{h_n(X_i)}(X_{m+2}) = c_{n,2} − l_2].

Setting

E(m, l_1, l_2) := {X_1 = · · · = X_{l_1} = X_{m+1}, X_{l_1+1} = · · · = X_{l_1+l_2} = X_{m+2}, X_{m+1} ≠ X_{m+2}, {X_{l_1+l_2+1}, . . . , X_m} ∩ {X_{m+1}, X_{m+2}} = ∅},

we have that ν_{π_1,...,π_k}(E(m, l_1, l_2)) = 1 if and only if two of the π's (e.g. π_{k−1} and π_k, these partitions being given up to the order) are exactly equal to the sets {1, . . . , l_1, m + 1} and {l_1 + 1, . . . , l_1 + l_2, m + 2}, respectively. Therefore, from above, we write

Pr[E(m, l_1, l_2), l_2 1_{h_n(X_{l_1+1})}(X_{m+1}) + Σ_{i=l_1+l_2+1}^{m} 1_{h_n(X_i)}(X_{m+1}) = c_{n,1} − l_1, l_1 1_{h_n(X_1)}(X_{m+2}) + Σ_{i=l_1+l_2+1}^{m} 1_{h_n(X_i)}(X_{m+2}) = c_{n,2} − l_2]
 = Σ_{k} (θ^{k}/(θ)_{(m+2)}) Σ_{(π_1,...,π_{k−2})∈Π(m−l_1−l_2,k−2)} l_1! l_2! ∏_{i=1}^{k−2} (|π_i| − 1)! ν_{π_1,...,π_k}[· · ·]
 = (θ^2(θ)_{(m−l_1−l_2)}/(θ)_{(m+2)}) l_1! l_2! (1/(θ)_{(m−l_1−l_2)}) Σ_{r} θ^{r} Σ_{(π_1,...,π_r)∈Π(m−l_1−l_2,r)} ∏_{i=1}^{r} (|π_i| − 1)! Σ_{(j_1,j_2)∈[J]^2} ν(B_{j_1})ν(B_{j_2}) ν_{π_1,...,π_r}[Σ_{i=l_1+l_2+1}^{m} 1_{h_n(X_i)}(j_1) = c_{n,1} − l_1 − l_2 1{j_1 = j_2}, Σ_{i=l_1+l_2+1}^{m} 1_{h_n(X_i)}(j_2) = c_{n,2} − l_2 − l_1 1{j_1 = j_2}].

We observe that the expression within the brackets in the last term, as a sum over [J]^2, can be split into the sum of two terms, according to whether j_1 = j_2 or not. For the diagonal terms,

(1/(θ)_{(m−l_1−l_2)}) Σ_{r} θ^{r} Σ_{(π_1,...,π_r)} ∏_{i=1}^{r} (|π_i| − 1)! Σ_{j_1=j_2∈[J]} ν(B_{j_1})ν(B_{j_2}) ν_{π_1,...,π_r}[Σ_{i=l_1+l_2+1}^{m} 1_{h_n(X_i)}(j_1) = c_{n,1} − l_1 − l_2, Σ_{i=l_1+l_2+1}^{m} 1_{h_n(X_i)}(j_2) = c_{n,2} − l_2 − l_1]
 = (1/J) 1{c_{n,1} = c_{n,2} =: c} binom(m − l_1 − l_2, c − l_1 − l_2) (Γ(θ)/(Γ(θ/J)Γ(θ(1 − 1/J)))) Γ(θ/J + c − l_1 − l_2)Γ(θ(1 − 1/J) + m − c)/Γ(θ + m − l_1 − l_2).

On the other hand, assuming J ≥ 3, for the off-diagonal terms,

(1/(θ)_{(m−l_1−l_2)}) Σ_{r} θ^{r} Σ_{(π_1,...,π_r)} ∏_{i=1}^{r} (|π_i| − 1)! Σ_{(j_1,j_2)∈[J]^2, j_1≠j_2} ν(B_{j_1})ν(B_{j_2}) ν_{π_1,...,π_r}[Σ_{i=l_1+l_2+1}^{m} 1_{h_n(X_i)}(j_1) = c_{n,1} − l_1, Σ_{i=l_1+l_2+1}^{m} 1_{h_n(X_i)}(j_2) = c_{n,2} − l_2]
 = binom(m − l_1 − l_2; c_{n,1} − l_1, c_{n,2} − l_2) (Γ(θ)/([Γ(θ/J)]^2 Γ(θ(1 − 2/J)))) Γ(θ/J + c_{n,1} − l_1)Γ(θ/J + c_{n,2} − l_2)Γ(θ(1 − 2/J) + m − c_{n,1} − c_{n,2})/Γ(θ + m − l_1 − l_2).

Hence,

Pr[f_{X_{m+1}} = l_1, f_{X_{m+2}} = l_2, Σ_{i=1}^{m} 1_{h_n(X_i)}(X_{m+1}) = c_{n,1}, Σ_{i=1}^{m} 1_{h_n(X_i)}(X_{m+2}) = c_{n,2}, X_{m+1} ≠ X_{m+2}]
 = (θ^2(θ)_{(m−l_1−l_2)}/(θ)_{(m+2)}) l_1! l_2! [ (1/J) 1{c_{n,1} = c_{n,2} =: c} binom(m − l_1 − l_2, c − l_1 − l_2) (Γ(θ)/(Γ(θ/J)Γ(θ(1 − 1/J)))) Γ(θ/J + c − l_1 − l_2)Γ(θ(1 − 1/J) + m − c)/Γ(θ + m − l_1 − l_2)
 + binom(m − l_1 − l_2; c_{n,1} − l_1, c_{n,2} − l_2) (Γ(θ)/([Γ(θ/J)]^2 Γ(θ(1 − 2/J)))) Γ(θ/J + c_{n,1} − l_1)Γ(θ/J + c_{n,2} − l_2)Γ(θ(1 − 2/J) + m − c_{n,1} − c_{n,2})/Γ(θ + m − l_1 − l_2) ].   (65)

Then, by combining the probability (64) and the probability (65) we write the following expression:

Pr[f_{X_{m+1}} = l_1, f_{X_{m+2}} = l_2, Σ_{i=1}^{m} 1_{h_n(X_i)}(X_{m+1}) = c_{n,1}, Σ_{i=1}^{m} 1_{h_n(X_i)}(X_{m+2}) = c_{n,2}]
 = (m!/Γ(θ + m + 2)) { 1{l_1 = l_2 =: l, c_{n,1} = c_{n,2} =: c} (θ(l + 1)/((c − l)!(m − c)!)) β_1(θ, J) Γ(θ/J + c − l)Γ(θ(1 − 1/J) + m − c)
 + 1{c_{n,1} = c_{n,2} =: c} (θ^2/(J(c − l_1 − l_2)!(m − c)!)) β_1(θ, J) Γ(θ/J + c − l_1 − l_2)Γ(θ(1 − 1/J) + m − c)
 + (θ^2/((c_{n,1} − l_1)!(c_{n,2} − l_2)!(m − c_{n,1} − c_{n,2})!)) β_2(θ, J) Γ(θ/J + c_{n,1} − l_1)Γ(θ/J + c_{n,2} − l_2)Γ(θ(1 − 2/J) + m − c_{n,1} − c_{n,2}) },   (66)

where

β_1(θ, J) := Γ(θ)/(Γ(θ/J)Γ(θ(1 − 1/J))) and β_2(θ, J) := Γ(θ)/([Γ(θ/J)]^2 Γ(θ(1 − 2/J))).

The proof is completed by combining the posterior distribution (61) with the probabilities (62) and (66).

Theorem 6 extends Theorem 1 to the more general problem of estimating 2-range queries. In particular, for the collection of hash functions h_1, . . . , h_N, the posterior distribution of (f_{X_{m+1}}, f_{X_{m+2}}) given {(C_{n,h_n(X_{m+1})}, C_{n,h_n(X_{m+2})})}_{n∈[N]} follows from Theorem 6 by the assumption that the h_n's are i.i.d. according to the strong universal family H, and Bayes theorem. CMS-DP estimates of the 2-range query f_2 = f_{x_{m+1}} + f_{x_{m+2}} are then obtained as functionals of the posterior distribution of f_2, e.g. the posterior mode, the posterior mean and the posterior median.
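Once the joint posterior masses of (f_{X_{m+1}}, f_{X_{m+2}}) are available on a grid, e.g. via Theorem 6, the posterior of the 2-range query f_2 and its functionals are a simple convolution along the diagonals. The following is a minimal sketch (ours, for illustration) of this last step.

```python
import numpy as np

def range_query_posterior(joint):
    # joint[l1, l2] holds a (possibly unnormalized) posterior mass for
    # (f_{X_{m+1}} = l1, f_{X_{m+2}} = l2), e.g. Num(l1, l2, c1, c2) from
    # Theorem 6 evaluated on a grid. Returns the posterior pmf of the
    # 2-range query f_2 = f_{X_{m+1}} + f_{X_{m+2}} with mean, mode, median.
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                 # normalize the masses
    L1, L2 = joint.shape
    pmf = np.zeros(L1 + L2 - 1)
    for l1 in range(L1):                        # push mass onto diagonals
        pmf[l1:l1 + L2] += joint[l1, :]
    support = np.arange(len(pmf))
    mean = float(support @ pmf)
    mode = int(support[np.argmax(pmf)])
    median = int(support[np.searchsorted(np.cumsum(pmf), 0.5)])
    return pmf, mean, mode, median
```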
To conclude, it remains to estimate the prior's parameter $\theta>0$ based on the hashed frequencies; this is achieved by following the empirical Bayes procedure described in Section 2.

Acknowledgements

The authors are grateful to the Action Editor (Professor Erik Sudderth) and four anonymous Referees for their comments and corrections, which led to a remarkable improvement of the paper. Stefano Favaro wishes to thank Graham Cormode, Matteo Sesia and Luca Trevisan for stimulating discussions on sketches and generalizations thereof. Emanuele Dolera and Stefano Favaro received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement No 817257. Emanuele Dolera and Stefano Favaro gratefully acknowledge the financial support of the Italian Ministry of Education, University and Research (MIUR), Dipartimenti di Eccellenza grant 2018-2022. Stefano Favaro is also affiliated with IMATI-CNR "Enrico Magenes" (Milan, Italy).

References

Aamand, A., Indyk, P. and Vakilian, A. (2019). Frequency estimation algorithms under Zipfian distribution. Preprint arXiv:1908.05198.

Aggarwal, C. and Yu, P. (2010). On classification of high-cardinality data streams. In Proceedings of the 2010 SIAM International Conference on Data Mining.

Bacallado, S., Battiston, M., Favaro, S. and Trippa, L. (2017). Sufficientness postulates for Gibbs-type priors and hierarchical generalizations. Statistical Science 32, 487-500.

Baldi, P. (2017). Stochastic calculus. Springer.

Barabási, A.L. (2005). The origin of bursts and heavy tails in human dynamics. Nature 435, 227.

Bernton, E., Jacob, P.E., Gerber, M. and Robert, C.P. (2019). On parameter estimation with the Wasserstein distance. Information and Inference 8, 657-676.

Cai, D., Mitzenmacher, M. and Adams, R.P. (2018). A Bayesian nonparametric view on count-min sketch. In Advances in Neural Information Processing Systems.

Cancho, R.F. and Solé, R.V. (2003). Least effort and the origins of scaling in human language. Proceedings of the National Academy of Sciences of the USA 100, 788-791.

Charalambides, C. (2005). Combinatorial methods in discrete distributions. Wiley.

Clauset, A., Shalizi, C.R. and Newman, M.E.J. (2009). Power-law distributions in empirical data. SIAM Review 51, 661-703.

Cormode, G., Garofalakis, M. and Haas, P.J. (2012). Synopses for massive data: samples, histograms, wavelets, sketches. Foundations and Trends in Databases.

Cormode, G. and Muthukrishnan, S. (2005). An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55, 58-75.

Cormode, G. and Yi, K. (2020). Small summaries for big data. Cambridge University Press.

De Blasi, P., Favaro, S., Lijoi, A., Mena, R.H., Prünster, I. and Ruggiero, M. (2015). Are Gibbs-type priors the most natural generalization of the Dirichlet process? IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 212-229.

Devroye, L. (2009). Random variate generation for exponentially and polynomially tilted stable distributions. ACM Transactions on Modeling and Computer Simulation 19, 4.

Dolera, E. (2013). Estimates of the approximation of weighted sums of conditionally independent random variables by the normal law. Journal of Inequalities and Applications 2013, 320.

Dolera, E. and Favaro, S. (2020). A Berry-Esseen theorem for Pitman's α-diversity. The Annals of Applied Probability 30, 847-869.

Dolera, E. and Favaro, S. (2020). Rates of convergence in de Finetti's representation theorem, and Hausdorff moment problem. Bernoulli 26, 1294-1322.
Dolera, E., Favaro, S. and Peluchetti, S. (2021). A Bayesian nonparametric approach to count-min sketch under power-law data streams. In International Conference on Artificial Intelligence and Statistics.

Dwork, C., Naor, M., Pitassi, T., Rothblum, G. and Yekhanin, S. (2010). Pan-private streaming algorithms. In Proceedings of the Symposium on Innovations in Computer Science.

Favaro, S., Lijoi, A., Mena, R.H. and Prünster, I. (2009). Bayesian nonparametric inference for species variety with a two-parameter Poisson-Dirichlet process prior. Journal of the Royal Statistical Society Series B 71, 992-1008.

Favaro, S., Lijoi, A. and Prünster, I. (2013). Conditional formulae for Gibbs-type exchangeable random partitions. The Annals of Applied Probability 23, 1721-1754.

Favaro, S., Nipoti, B. and Teh, Y.W. (2015). Random variate generation for Laguerre-type exponentially tilted alpha-stable distributions. Electronic Journal of Statistics 9, 1230-1242.

Ferguson, T.S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics 1, 209-230.

Ghosal, S. and van der Vaart, A. (2017). Fundamentals of nonparametric Bayesian inference. Cambridge University Press.

Gnedin, A., Hansen, B. and Pitman, J. (2007). Notes on the occupancy problem with infinitely many boxes: general asymptotics and power laws. Probability Surveys 4, 146-171.

Gnedin, A. and Pitman, J. (2006). Exchangeable Gibbs partitions and Stirling triangles. Journal of Mathematical Sciences 138, 5674-5685.

Goyal, A., Daumé, H. and Cormode, G. (2012). Sketch algorithms for estimating point queries in NLP. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Goyal, A., Daumé, H. and Venkatasubramanian, S. (2009). Streaming for large scale NLP: language modeling. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics.

Harald, B.R. (2001). Word frequency distributions. Springer.

Harrison, B.A. (2010). Move prediction in the game of Go. Ph.D. thesis, Harvard University.

Hsu, C., Indyk, P., Katabi, D. and Vakilian, A. (2019). Learning-based frequency estimation algorithms. In Proceedings of the International Conference on Learning Representations.

Huberman, B.A. and Adamic, L.A. (1999). Internet: growth dynamics of the World Wide Web. Nature 401, 131.

James, L.F. (2002). Poisson process partition calculus with applications to exchangeable models and Bayesian nonparametrics. Preprint arXiv:math/0205093.

James, L.F., Lijoi, A. and Prünster, I. (2009). Posterior analysis for normalized random measures with independent increments. Scandinavian Journal of Statistics 36, 76-97.

Johnson, N.L., Kemp, A.W. and Kotz, S. (2005). Univariate discrete distributions. Wiley Series in Probability and Statistics.

Kingman, J.F.C. (1993). Poisson processes. Oxford University Press.

Leo Elworth, L.A., Wang, Q., Kota, P.K., Barberan, C.J., Coleman, B., Balaji, A., Gupta, G., Baraniuk, R.G., Shrivastava, A. and Treangen, T.J. (2020). To petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics. Nucleic Acids Research 48, 5217-5234.

Letham, B., Karrer, B., Ottoni, G. and Bakshy, E. (2019). Constrained Bayesian optimization with noisy experiments. Bayesian Analysis 14, 495-519.
Lijoi, A., Mena, R.H. and Prünster, I. (2005). Hierarchical mixture modeling with normalized inverse-Gaussian priors. Journal of the American Statistical Association 100, 1278-1291.

Lijoi, A., Prünster, I. and Walker, S.G. (2008). Bayesian nonparametric estimators derived from conditional Gibbs structures. The Annals of Applied Probability 18, 1519-1547.

Monechi, B., Ruiz-Serrano, A., Tria, F. and Loreto, V. (2017). Waves of novelties in the expansion into the adjacent possible. PLoS ONE 12.

Muchnik, L., Pei, S., Parra, L.C., Reis, S.D.S., Andrade, J.S., Havlin, S. and Makse, H.A. (2013). Origins of power-law degree distribution in the heterogeneity of human activity in social networks. Nature Scientific Reports 3, 1783.

Perman, M., Pitman, J. and Yor, M. (1992). Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields 92, 21-39.

Pitel, G. and Fouquier, G. (2015). Count-min-log sketch: approximately counting with approximate counters. In Proceedings of the International Symposium on Web Algorithms.

Pitman, J. (1995). Exchangeable and partially exchangeable random partitions. Probability Theory and Related Fields 102, 145-158.

Pitman, J. (2003). Poisson-Kingman partitions. In Science and Statistics: A Festschrift for Terry Speed, Goldstein, D.R. Ed. Institute of Mathematical Statistics.

Pitman, J. (2006). Combinatorial stochastic processes. Lecture Notes in Mathematics, Springer-Verlag.

Pitman, J. and Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability 25, 855-900.

Prünster, I. (2002). Random probability measures derived from increasing additive processes and their application to Bayesian statistics. Ph.D. thesis, University of Pavia.

Regazzini, E. (1978). Intorno ad alcune questioni relative alla definizione del premio secondo la teoria della credibilità. Giornale dell'Istituto Italiano degli Attuari 41, 77-89.

Regazzini, E. (2001). Foundations of Bayesian statistics and some theory of Bayesian nonparametric methods. Lecture Notes, Stanford University.

Regazzini, E., Lijoi, A. and Prünster, I. (2003). Distributional results for means of normalized random measures with independent increments. The Annals of Statistics 31, 560-585.

Rybski, D., Buldyrev, S.V., Havlin, S., Liljeros, F. and Makse, H.A. (2016). Scaling laws of human interaction activity. Proceedings of the National Academy of Sciences of the USA 106, 12640.

Sangalli, L.M. (2006). Some developments of the normalized random measures with independent increments. Sankhya A 68, 461-487.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica 4, 639-650.

Song, H.H., Cho, T.W., Dave, V., Zhang, Y. and Qiu, L. (2009). Scalable proximity estimation and link prediction in online social networks. In Proceedings of the ACM SIGCOMM Conference on Internet Measurement.

Ting, D. (2018). Count-min: optimal estimation and tight error bounds using empirical error distributions. In International Conference on Knowledge Discovery and Data Mining.

Tria, F., Loreto, V., Servedio, V.D.P. and Strogatz, S.H. (2014). The dynamics of correlated novelties. Nature Scientific Reports 4, 5890.

Zabell, S.L. (1997). The continuum of inductive methods revisited. In The Cosmos of Science: Essays in Exploration, Earman, J. and Norton, J.D. Eds. University of Pittsburgh Press.
Zhang, Q., Pell, J., Canino-Koning, R., Howe, A.C. and Brown, C.T. (2014). These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PLoS ONE 9.

Zipf, G.K. (1949). Human behaviour and the principle of least effort: an introduction to human ecology. Addison-Wesley.

Zolotarev, V.M. (1986). One-dimensional stable distributions. American Mathematical Society.