Published as a conference paper at ICLR 2021

LEARNING-BASED SUPPORT ESTIMATION IN SUBLINEAR TIME

Talya Eden*, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld & Sandeep Silwal
Computer Science and Artificial Intelligence Lab
Massachusetts Institute of Technology
Cambridge, MA 02139, USA
{teden,indyk,shyamsn,ronitt,silwal}@mit.edu

Tal Wagner
Microsoft Research
Redmond, WA 98052, USA
talw@mit.edu

ABSTRACT

We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to ±εn from a sample of size O(log²(1/ε) · n/log n), where n is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimation of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to log(1/ε) · n^{1−Θ(1/log(1/ε))}. We evaluate the proposed algorithms on a collection of data sets, using the neural-network based estimators from Hsu et al. (ICLR'19) as predictors. Our experiments demonstrate substantial (up to 3x) improvements in the estimation accuracy compared to the state of the art algorithm.

1 INTRODUCTION

Estimating the support size of a distribution from random samples is a fundamental problem with applications in many domains. In biology, it is used to estimate the number of distinct species from experiments (Fisher et al., 1943); in genomics, to estimate the number of distinct protein encoding regions (Zou et al., 2016); in computer systems, to approximate the number of distinct blocks on a disk drive (Harnik et al., 2016), etc. The problem also has applications in linguistics, query optimization in databases, and other fields.

Because of its wide applicability, the problem has received plenty of attention in multiple fields¹, including statistics and theoretical computer science, starting with the seminal works of Good and Turing (Good, 1953) and Fisher et al. (1943). A more recent line of research pursued over the last decade (Raskhodnikova et al., 2009; Valiant & Valiant, 2011; 2013; Wu & Yang, 2019) focused on the following formulation of the problem: given access to independent samples from a distribution P over a discrete domain {0, . . . , n−1} whose minimum non-zero mass² is at least 1/n, estimate the support size of P up to ±εn. The state of the art estimator, due to Valiant & Valiant (2011); Wu & Yang (2019), solves this problem using only O(n/log n) samples (for a constant ε). Both papers also show that this bound is tight. A more straightforward linear-time algorithm exists, which reports the number of distinct elements seen in a sample of size N = O(n log ε⁻¹) (which is O(n) for constant ε), without accounting for the unseen items.

*Authors listed in alphabetical order.
¹A partial bibliography from 2007 contains over 900 references. It is available at https://courses.cit.cornell.edu/jab18/bibliography.html.
This algorithm succeeds because each element i with non-zero mass (and thus mass at least 1/n) appears in the sample with probability at least 1 − (1 − 1/n)^N > 1 − ε, so in expectation, at most εn elements with non-zero mass will not appear in the sample. Thus, in general, the number of samples required by the best possible algorithm (i.e., n/log n) is only logarithmically smaller than the complexity of the straightforward linear-time algorithm.

A natural approach to improve over this bound is to leverage the fact that in many applications, the input distribution is not entirely unknown. Indeed, one can often obtain rough approximations of the element frequencies by analyzing different but related distributions. For example, in genomics, frequency estimates can be obtained from the frequencies of genome regions of different species; in linguistics they can be inferred from the statistical properties of the language (e.g., long words are rare), or from a corpus of writings of a different but related author, etc. More generally, such estimates can be learned using modern machine learning techniques, given the true element frequencies in related data sets. The question then becomes whether one can utilize such predictors in support size estimation procedures in order to improve the estimation accuracy.

Our results. In this paper we initiate the study of such learning-based methods for support size estimation. Our contributions are both theoretical and empirical. On the theory side, we show that given a good enough predictor of the distribution P, one can solve the problem using much fewer than n/log n samples. Specifically, suppose that in the input distribution P the probability of element i is p_i, and that we have access to a predictor Π(i) such that Π(i) ≤ p_i ≤ b·Π(i) for some constant approximation factor b ≥ 1.³ Then we give an algorithm that estimates the support size up to ±εn using only log(1/ε)·n^{1−Θ(1/log(1/ε))} samples, assuming the approximation factor b is a constant (see Theorem 1 for a more detailed bound). This improves over the bound of Wu & Yang (2019) for any fixed values of the accuracy parameter ε and predictor quality factor b. Furthermore, we show that this bound is almost tight.

Our algorithm is presented in Algorithm 1. On a high level, it partitions the range of probability values into geometrically increasing intervals. We then use the predictor to assign the elements observed in the sample to these intervals, and produce a Wu-Yang-like estimate within each interval. Specifically, our estimator is based on Chebyshev polynomials (as in Valiant & Valiant (2011); Wu & Yang (2019)), but the finer partitioning into intervals allows us to use polynomials with different, carefully chosen parameters. This leads to significantly improved sample complexity if the predictor is sufficiently accurate.

On the empirical side, we evaluate the proposed algorithms on a collection of real and synthetic data sets. For the real data sets (network traffic data and AOL query log data) we use neural-network based predictors from Hsu et al. (2019). Although those predictors do not always approximate the true distribution probabilities up to a small factor, our experiments nevertheless demonstrate that the new algorithm offers substantial improvements (up to 3x reduction in relative error) in the estimation accuracy compared to the state of the art algorithm of Wu & Yang (2019).

²This constraint is naturally satisfied, e.g., if the distribution P is an empirical distribution over a data set of n items. In fact, in this case all probabilities are multiples of 1/n, so the support size is equal to the number of distinct elements in the data set.
³Our results hold without change if we modify the assumption to r·Π(i) ≤ p_i ≤ r·b·Π(i) for any r > 0. We use r = 1 for simplicity.
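For concreteness, the following is a minimal sketch of the straightforward plug-in baseline discussed at the beginning of this section: draw N = O(n log(1/ε)) samples with replacement and report the number of distinct values seen. The function name and the toy data set are our own illustration, not code from the paper; the learning-based algorithm itself is sketched in Section 2.2.

```python
import math
import random

def naive_support_estimate(data_set, eps=0.1, rng=random):
    """Plug-in baseline: draw N = O(n log(1/eps)) samples with replacement from
    the data set and report the number of distinct elements seen."""
    n = len(data_set)
    N = int(math.ceil(n * math.log(1.0 / eps)))
    return len(set(rng.choices(data_set, k=N)))

# Toy data set with n = 10,000 and 1,000 distinct elements.
data = [i // 10 for i in range(10_000)]
print(naive_support_estimate(data, eps=0.05), "vs. true support", len(set(data)))
```

As the analysis above indicates, this baseline needs a sample that is nearly as large as the data set itself, which is what the predictor-based approach below aims to avoid.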
1.1 RELATED WORK

Estimating support size. As described in the introduction, the problem has been studied extensively in statistics and theoretical computer science. The best known algorithm, due to Wu & Yang (2019), uses O(log²(1/ε)·n/log n) samples. Because of the inherent limitations of the model that uses only random samples, Canonne & Rubinfeld (2014) considered an augmented model where an algorithm has access to the exact probability of any sampled item. The authors show that this augmentation is very powerful, reducing the sampling complexity to only O(1/ε²). More recently, Onak & Sun (2018) proved that Canonne and Rubinfeld's algorithm works as long as the probabilities accessed are accurate up to a (1 ± ε/3)-multiplicative factor. However, this algorithm strongly relies on the probabilities being extremely accurate, and predicted probabilities that are off by even a small constant factor can cause the support size estimate to become massively incorrect. As a result, their algorithm is not robust to mispredicted probabilities, as our experiments show.

A different line of research studied streaming algorithms for estimating the number of distinct elements. Such algorithms have access to the whole data set, but must read it in a single pass using limited memory. The best known algorithms for this problem compute a (1 + ε)-approximate estimate of the number of distinct elements using O(1/ε² + log n) bits of storage (Kane et al., 2010). See the discussion in that paper for a history of the problem and further references.

Learning-based algorithms. Over the last few years, there has been a growing interest in using machine learning techniques to improve the performance of classical algorithms. This methodology has found applications in similarity search (Wang et al., 2016; Sablayrolles et al., 2019; Dong et al., 2020), graph optimization (Khalil et al., 2017; Balcan et al., 2018), data structures (Kraska et al., 2018; Mitzenmacher, 2018), online algorithms (Lykouris & Vassilvitskii, 2018; Purohit et al., 2018), compressed sensing (Mousavi et al., 2015; Baldassarre et al., 2016; Bora et al., 2017) and streaming algorithms (Hsu et al., 2019; Jiang et al., 2019). The last two papers are closest to our work, as they solve various computational problems over data streams, including distinct elements estimation in Jiang et al. (2019), using frequency predictors. Furthermore, in our experiments we use the neural-network-based predictors developed in Hsu et al. (2019). However, our algorithm operates in a fundamentally different model, using a sublinear (in n) number of samples of the input, as opposed to accessing the full input via a linear scan. Thus, our algorithms run in sublinear time, in contrast to streaming algorithms that use sublinear space.

Distribution property testing. This work can be seen more broadly in the context of testing properties of distributions over large discrete domains. Such questions are studied at the crossroads of social networks, statistics, information theory, database algorithms, and machine learning algorithms.
Examples of specific properties that have been extensively considered include testing whether the distribution is uniform, Gaussian, high entropy, independent or monotone increasing (see e.g. Rubinfeld (2012); Canonne (2015); Goldreich (2017) for surveys on the topic).

2 LEARNING-BASED ALGORITHM

2.1 PRELIMINARIES

Problem setting and notation. The support estimation problem is formally defined as follows. We are given sample access to an unknown distribution P over a discrete domain of size n. For simplicity, we identify the domain with [n] = {1, . . . , n}. Let p_i denote the probability of element i. Let S(P) = {i : p_i > 0} be the support of P. Our goal is to estimate the support size S = |S(P)| using as few samples as possible. In particular, given ε > 0, the goal is to output an estimate Ŝ that satisfies Ŝ ∈ [S − εn, S + εn].

We assume that the minimal non-zero mass of any element is at least 1/n, namely, that p_i ≥ 1/n for every i ∈ S(P). This is a standard promise in the support estimation problem (see, e.g., Raskhodnikova et al. (2009); Valiant & Valiant (2011); Wu & Yang (2019)), and as mentioned earlier, it naturally holds in the context of counting distinct elements, where p_i is defined as the count of element i in the data set divided by n. Furthermore, a lower bound on the minimum non-zero probability is a necessary assumption, without which no estimation algorithm is possible, even if the number of samples is allowed to be an arbitrarily large function of n, i.e., not just for sublinear algorithms. The reason is that there could be arbitrarily many elements with exceedingly small probabilities that would never be observed. See for example the discussion in the supplementary Section 5 of Orlitsky et al. (2016).

In the learning-based setting, we furthermore assume we have a predictor Π that can provide information about p_i. In our analysis, we will assume that Π(i) is a constant factor approximation of each p_i. In order to bound the running time of our algorithms, we assume we are given access to a ready-made predictor and (as in Canonne & Rubinfeld (2014)) that evaluating Π(i) takes unit time. In our experiments, we use neural network based predictors from Hsu et al. (2019). In general, predictors need to be trained (or otherwise produced) in advance. This happens in a preprocessing stage, before the input distribution is given to the algorithm, and this stage is not accounted for in the sublinear running time. We also note that training the predictor needs to be done only once for all future inputs (not once per input).

The Wu & Yang (2019) estimator. In the classical setting (without access to a predictor), Wu & Yang (2019) gave a sample-optimal algorithm based on Chebyshev polynomials. We now describe it briefly, as it forms the basis for our learning-based algorithm. Suppose we draw N samples, and let N_i be the number of times element i is observed. The output estimate of Wu & Yang (2019) is of the form

Ŝ_WY = Σ_{i∈[n]} (1 + f(N_i)),

where f(N_i) is a correction term intended to compensate for the fact that some elements in the support do not appear in the sample at all. If p_i = 0, then necessarily N_i = 0 (as i cannot appear in the sample). Thus, choosing f(0) = −1 ensures that unsupported elements contribute nothing to Ŝ_WY. On the other hand, if p_i > (log n)/N, then by standard concentration we have N_i > Ω(log n) with high probability; thus choosing f(N_i) = 0 for all N_i > L = Ω(log n) ensures that high-mass elements are only counted once in Ŝ_WY.
It remains to take care of elements i with p_i ∈ [1/n, (log n)/N]. By a standard Poissonization trick, the expected additive error |S − E[Ŝ_WY]| can be bounded by Σ_{i∈[n]} |P_L(p_i)|, where P_L is the degree-L polynomial P_L(x) = −Σ_{k=0}^{L} (N^k/k!)·f(k)·x^k. To make the error as small as possible, we would like to choose f(1), . . . , f(L) so as to minimize |P_L(p_i)| on the interval p_i ∈ [1/n, (log n)/N], under the constraint P_L(0) = 1 (which is equivalent to f(0) = −1). This is a well-known extremal problem, and its solution is given by Chebyshev polynomials, whose coefficients have a known explicit formula. Indeed, Wu & Yang (2019) show that choosing f(1), . . . , f(L) such that the values (E[N]^k/k!)·f(k) are (up to sign) the coefficients of an appropriately shifted and scaled Chebyshev polynomial leads to an optimal sample complexity of O(log²(1/ε)·n/log n).

2.2 OUR ALGORITHM

Our main result is a sample-optimal algorithm for estimating the support size of an unknown distribution, where we are given access to samples as well as approximations to the probabilities of the elements we sample. Our algorithm is presented in Algorithm 1. It partitions the interval [1/n, 1] into geometrically increasing intervals {[b^j/n, b^{j+1}/n] : j = 0, 1, . . .}, where b is a fixed constant that we refer to as the base parameter (in our proofs this parameter upper bounds the approximation factor of the predictor, which is why we use the same letter to denote both; its setting in practice is studied in detail in the next section). The predictor assigns the elements observed in the sample to intervals, and the algorithm computes a Chebyshev polynomial estimate within each interval. Since the approximation quality of Chebyshev polynomials is governed by the ratio between the interval endpoints (as well as the polynomial degree), this assignment can be leveraged to get more accurate estimates, improving the overall sample complexity. Our main theoretical result is the following.

Theorem 1. Let b > 1 be a fixed constant. Suppose we have a predictor that, given i ∈ [n] sampled from the input distribution, outputs Π(i) such that Π(i) ≤ p_i ≤ b·Π(i). Then, for any ε > n^{−1/2}, if we set L = O(log(1/ε)) and draw N ∼ Poisson(C·b²·L·n^{1−1/L}) samples for a sufficiently large constant C, Algorithm 1 reports an estimate Ŝ that satisfies Ŝ ∈ [S − ε·√(nS), S + ε·√(nS)] with probability at least 9/10. In other words, using O(log(1/ε)·n^{1−Θ(1/log(1/ε))}) samples, we can approximate the support size S up to an additive error of ε·√(nS), with probability at least 9/10.

Algorithm 1: Learning-Based Support Estimation
Input: number of samples N, domain size n, base b, polynomial degree L, predictor Π
Output: estimate Ŝ of the support size
1: Partition [1/n, 1] into intervals I_j = [b^j/n, b^{j+1}/n) for j = 0, . . . , ⌈log_b n⌉
2: Let a_0, a_1, . . . , a_L be the coefficients of the Chebyshev polynomial P_L(x) from Equation (1) (note that a_k = 0 for all k > L)
3: Draw N random samples
4: Let N_i be the number of times element i appears in the samples
5: for every interval I_j do
6:   if b^j/n ≤ 0.5·(log n)/N then
7:     Ŝ_j = Σ_{i∈[n]: Π(i)∈I_j} (1 − a_{N_i}·(n/b^j)^{N_i}·N_i!/N^{N_i})
8:   else
9:     Ŝ_j = #{i ∈ [n] : N_i ≥ 1, Π(i) ∈ I_j}
10: return Ŝ = Σ_{j=0}^{⌈log_b n⌉} Ŝ_j

Note that the sample complexity of Wu & Yang (2019) (which is optimal without access to a predictor) is nearly linear in n, while Theorem 1 gives a bound that is polynomially smaller than n for every fixed ε > 0. Also, note that √(nS) ≤ n, so our estimate Ŝ is also within εn of the support size S.
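To make the interval-based estimator concrete, here is a minimal Python sketch of Algorithm 1 as reconstructed above, together with the construction of the Chebyshev coefficients a_0, . . . , a_L from Equation (1) in the next subsection. All names (`chebyshev_coefficients`, `learned_support_estimate`), the cap on the interval index, and the guard on tiny predictions are our own choices for a self-contained sketch, not the authors' reference implementation.

```python
import math
from collections import Counter

import numpy as np
from numpy.polynomial import chebyshev as cheb, polynomial as poly


def chebyshev_coefficients(L, b):
    """Power-basis coefficients a_0, ..., a_L of P_L from Equation (1):
    P_L(0) = 1 and |P_L(x)| <= eps on [1, b^2], obtained by shifting and
    scaling the degree-L Chebyshev polynomial Q_L."""
    q = cheb.cheb2poly([0.0] * L + [1.0])                  # Q_L in the power basis
    # Compose with the affine map t(x) = (2x - (b^2 + 1)) / (b^2 - 1) via Horner.
    lin = np.array([-(b ** 2 + 1) / (b ** 2 - 1), 2.0 / (b ** 2 - 1)])
    comp = np.array([q[-1]])
    for c in q[-2::-1]:
        comp = poly.polymul(comp, lin)
        comp[0] += c
    return comp / comp[0]                                   # normalize so that P_L(0) = 1


def learned_support_estimate(samples, predictor, n, b=2.0, L=None):
    """Sketch of Algorithm 1. `samples` is a list of observed elements and
    `predictor(i)` returns the predicted probability Pi(i)."""
    N = len(samples)
    if L is None:
        L = max(1, int(0.45 * math.log(n)))                 # degree setting used in Section 3.1
    a = chebyshev_coefficients(L, b)
    counts = Counter(samples)
    threshold = 0.5 * math.log(n) / N                       # compare b^j/n against 0.5 log(n)/N
    num_intervals = int(math.ceil(math.log(n, b))) + 1
    S_hat = 0.0
    for i, Ni in counts.items():                            # unseen elements contribute 0 in both branches
        pred = max(predictor(i), 1.0 / n)                   # guard against tiny or zero predictions
        j = min(max(int(math.floor(math.log(pred * n, b))), 0), num_intervals - 1)
        left = b ** j / n
        if left > threshold or Ni > L:                      # counting branch, or a_k = 0 for k > L
            S_hat += 1.0
        else:
            S_hat += 1.0 - a[Ni] * (n / b ** j) ** Ni * math.factorial(Ni) / N ** Ni
    return S_hat
```

For intervals above the 0.5·(log n)/N threshold the sketch simply counts distinct observed elements, mirroring the counting branch of Algorithm 1, and elements never observed contribute zero in either branch, which is why iterating only over sampled elements suffices.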
We also prove a corresponding lower bound, showing that the above theorem is essentially tight.

Theorem 2. Suppose we have access to a predictor that returns Π(i) such that Π(i) ≤ p_i ≤ 2·Π(i) for all sampled i. Then any algorithm that estimates the support size up to an εn additive error with probability at least 9/10 needs Ω(n^{1−Θ(1/log(1/ε))}) samples.

We prove Theorems 1 and 2 in Appendix A. We note that while our upper bound proof follows a similar approach to Wu & Yang (2019), our lower bound follows a combinatorial approach differing from their linear programming arguments.

The Chebyshev polynomial. For completeness, we explicitly write the polynomial coefficients used by Algorithm 1. The standard Chebyshev polynomial of degree L on the interval [−1, 1] is given by⁴ Q_L(x) = cos(L·arccos(x)). For Algorithm 1, we want a polynomial

P_L(x) = Σ_{k=0}^{L} a_k·x^k   satisfying   P_L(0) = 1 and P_L(x) ∈ [−ε, ε] for all 1 ≤ x ≤ b².   (1)

This is achieved by shifting and scaling Q_L; namely, this polynomial can be written as

P_L(x) = Q_L( (2x − (b²+1)) / (b²−1) ) / Q_L( −(b²+1)/(b²−1) ),

where ε equals Q_L((b²+1)/(b²−1))⁻¹, which decays as e^{−Θ(L)} if b is a constant. Thus it suffices to choose L = O(log(1/ε)) in Theorem 1.

⁴The fact that this is indeed a polynomial is proven in standard textbooks, e.g., Timan et al. (1963).

2.3 ASSUMPTIONS ON THE PREDICTOR'S ACCURACY

In Theorems 1 and 2, we assume that the predictor Π is always correct within a constant factor for each element i. As we will discuss in the experimental section, the predictor that we use does not always satisfy this property. Hence, perhaps other models of the accuracy of Π are better suited, such as a promised upper bound on TV(Π, P), i.e., the total variation distance between Π and P. However, it turns out that if we are promised that TV(Π, P) ≤ ε, then the algorithm of Canonne & Rubinfeld (2014) can in fact approximate the support size of P up to error O(ε)·n using only O(ε⁻²) samples. Conversely, if we are only promised that TV(Π, P) ≤ γ for some γ ≥ Cε, where C is a sufficiently large constant, then the algorithm of Wu & Yang (2019), which requires Θ(n/log n) samples to approximate the support size of P, is optimal. We formally state and prove both of these results in Appendix A.3. Our experimental results demonstrate that our algorithm can perform better than both the algorithms of Canonne & Rubinfeld (2014) and Wu & Yang (2019). This suggests that our assumption (a pointwise approximation guarantee, but with an arbitrary constant multiplicative approximation factor), even if not fully accurate, is better suited for our application than a total variation distance bound.

3 EXPERIMENTS

In this section we evaluate our algorithm empirically on real and synthetic data.

Datasets. We use two real and one synthetic datasets:
AOL: 21 million search queries from 650 thousand users over 90 days. The queries are keywords for the AOL search engine. Each day is treated as a separate input distribution. The goal is to estimate the number of distinct keywords.
IP: Packets collected at a backbone link of a Tier-1 ISP between Chicago and Seattle in 2016, over 60 minutes.⁵ Each packet is annotated with the sender IP address, and the goal is to estimate the number of distinct addresses. Each minute is treated as a separate distribution.
Zipfian: Synthetic dataset of samples drawn from a finite Zipfian distribution over {1, . . . , 10⁵} with the probability of each element i proportional to i^{−0.5}.

The AOL and IP datasets were used in Hsu et al. (2019), who trained a recurrent neural network (RNN) to predict the frequency of a given element for each of those datasets. We use their trained RNNs as predictors. For the Zipfian dataset we use the empirical counts of an independent sample as predictions.
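As an illustration of the Zipfian setup just described, the following hedged sketch builds the empirical-count predictor from an independent sample; elements never seen in that sample fall back to the minimum mass 1/n, as described in Section 3.2. The helper name `make_empirical_predictor` is ours, and the commented usage line assumes the `learned_support_estimate` sketch from Section 2.2.

```python
import random
from collections import Counter

def make_empirical_predictor(independent_sample, n):
    """Predicted probability = empirical frequency in a fixed independent sample;
    elements that never appear there fall back to the minimum mass 1/n."""
    counts = Counter(independent_sample)
    total = len(independent_sample)
    return lambda i: max(counts[i] / total, 1.0 / n)

# Toy setup mirroring the Zipfian dataset: support {1, ..., 10**5}, p_i proportional to i**-0.5.
support = list(range(1, 10 ** 5 + 1))
weights = [i ** -0.5 for i in support]
n = 2 * 10 ** 5                                   # data set size, as in Table 1
predictor = make_empirical_predictor(random.choices(support, weights=weights, k=n // 10), n)
# samples = random.choices(support, weights=weights, k=n // 20)
# estimate = learned_support_estimate(samples, predictor, n)   # sketch from Section 2.2
```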
A more detailed account of each predictor is given later in this section. The properties of the datasets are summarized in Table 1.

Baselines. We compare Algorithm 1 with two existing baselines:
The algorithm of Wu & Yang (2019), which is the state of the art for algorithms without predictor access. We abbreviate its name as WY.⁶
The algorithm of Canonne & Rubinfeld (2014), which is the state of the art for algorithms with access to a perfect predictor. We abbreviate its name as CR.

Error measurement. We measure accuracy in terms of the relative error |1 − Ŝ/S|, where S is the true support size and Ŝ is the estimate returned by the algorithm. We report median errors over 50 independent executions of each experiment, ± one standard deviation.

⁵From CAIDA internet traces 2016, https://www.caida.org/data/monitors/passive-equinix-chicago.xml.
⁶Apart from their tight analysis, Wu & Yang (2019) also report experimental results showing their algorithm is empirically superior to previous baselines.

Table 1: Datasets used in our experiments. The listed values of n (total size) and support size (distinct elements) for AOL/IP are per day/minute (respectively), approximated across all days/minutes.

| Name | Type | # Distributions | Predictor | n | Support size |
| --- | --- | --- | --- | --- | --- |
| AOL | Keywords | 90 (days) | RNN | 4·10⁵ | 2·10⁵ |
| IP | IP addresses | 60 (minutes) | RNN | 3·10⁷ | 10⁶ |
| Zipfian | Synthetic | 1 | Empirical | 2·10⁵ | 10⁵ |

Summary of results. Our experiments show that, on one hand, our algorithm can indeed leverage the predictor to get significantly improved accuracy compared to WY. On the other hand, our algorithm is robust to different predictors: while the CR algorithm performs extremely well on one dataset (AOL), it performs poorly on the other two (IP and Zipfian), whereas our algorithm is able to leverage the predictors in those cases too and obtain significant improvement over both baselines.

3.1 BASE PARAMETER SELECTION

Algorithm 1 uses two parameters that need to be set: the polynomial degree L, and the base parameter b. The performance of the algorithm is not very sensitive to L, and for simplicity we use the same setting as Wu & Yang (2019), L = ⌊0.45·log n⌋. The setting of b requires more care.

Recall that b is a constant used as the ratio between the maximum and minimum endpoint of each interval I_j in Algorithm 1. There are two reasons why b cannot be chosen too small. One is that the algorithm is designed to accommodate a predictor that provides a b-approximation of the true probabilities, so larger b makes the algorithm more robust to imperfect predictors. The other reason is that small b means using many small intervals, and thus a smaller number of samples assigned to each interval. This leads to higher noise in the Chebyshev polynomial estimators invoked within each interval (even if the assignment of elements to intervals is correct), and empirically impedes performance. On the other hand, if we set b to be too large (resulting in one large interval that covers almost the whole range of p_i's), we are essentially not using information from the predictor, and thus do not expect to improve over WY.

To resolve this issue, we introduce a sanity check in each interval I_j, whose goal is to rule out bases that are too small. The sanity check passes if Ŝ_j ∈ [0, 1/ℓ_j], where ℓ_j is the left endpoint of I_j (i.e., I_j = [ℓ_j, r_j]), and Ŝ_j is as defined in Algorithm 1. The reasoning is as follows. On one hand, the Chebyshev polynomial estimator which we use to compute Ŝ_j can in fact return negative numbers, leading to failure modes with Ŝ_j < 0. On the other hand, since all element probabilities in I_j are lower bounded by ℓ_j, the interval can contain at most 1/ℓ_j elements. Therefore, any estimate Ŝ_j of the number of elements in I_j which is outside [0, 1/ℓ_j] is obviously incorrect.

In our implementation, we start by running Algorithm 1 with b = 2. If any interval fails the sanity check, we increment b and repeat the algorithm (with the same set of samples). The final base we use is twice the minimal one such that all intervals pass the sanity check, where the final doubling is to ensure we are indeed past the point where all checks succeed. A sketch of this selection loop is given below.
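The following is a minimal sketch of the sanity check and base-selection loop just described. It assumes a routine `run_algorithm(b)` that reruns Algorithm 1 with base b on the same samples and returns the per-interval estimates {j: Ŝ_j}; that interface, and the cap `b_max`, are our own assumptions rather than details from the paper.

```python
import math

def passes_sanity_checks(interval_estimates, n, b):
    """Return True iff every per-interval estimate satisfies 0 <= S_hat_j <= 1/l_j,
    where l_j = b**j / n is the left endpoint of interval I_j."""
    for j, s_j in interval_estimates.items():
        left = b ** j / n
        if not (0.0 <= s_j <= 1.0 / left):
            return False
    return True

def select_base(run_algorithm, n, b_start=2.0, b_max=1024.0):
    """Base-selection loop described above: increment b until all intervals pass,
    then double once more. `run_algorithm(b)` is assumed to return {j: S_hat_j}."""
    b = b_start
    while b < b_max and not passes_sanity_checks(run_algorithm(b), n, b):
        b += 1.0   # increment and rerun on the same set of samples
    return 2.0 * b  # final doubling: stay safely past the point where all checks pass
```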
The effect of this base selection procedure on each of our datasets is depicted in Figures 1, 3, and 5.⁷ For a fixed sample size, we plot the performance of our algorithm (dotted blue) as the base increases, compared to WY (solid orange, independent of the base), as well as the fraction of intervals that failed the sanity check (dashed green). The plots show that for very small bases, the sanity check fails on some intervals, and the error is large. When the base is sufficiently large, all intervals pass the sanity check, and we see a sudden plunge in the error. Then, as the base continues to grow, our algorithm continues to perform well, but gradually degrades and converges to WY due to having a single dominating interval. This affirms the reasoning above and justifies our base selection procedure.

⁷In those plots, for visualization, in order to compute the error whenever a sanity check fails in I_j, we replace Ŝ_j with a naïve estimate, defined as the number of distinct sample elements assigned to I_j. Note that our implementation never actually uses those naïve estimates, since it only uses bases that pass all sanity checks.

[Figure 1: Error by base, AOL, sample size 10% of n]
[Figure 2: Error per sample size, AOL]
[Figure 3: Error by base, IP, sample size 1% of n]
[Figure 4: Error per sample size, IP]
[Figure 5: Error by base, Zipfian, sample size 5% of n]
[Figure 6: Error per sample size, Zipfian]

3.2 RESULTS

AOL data. As the predictor, we use the RNN trained by Hsu et al. (2019) to predict the frequency of a given keyword. Their predictor is trained on the first 5 days, and the 6th day is used for validation. The results for day #10 are shown in Figure 2. (The results across all days are similar; see more below.) They show that our algorithm performs significantly better than WY. Nonetheless, CR (which relies on access to a perfect predictor) achieves better performance at a much smaller sample size. This is apparently due to highly specific traits of the predictor; as we will see presently, the performance of CR is considerably degraded on the other datasets.

IP data. Here too we use the trained RNN of Hsu et al. (2019). It is trained on the first 7 minutes, and the 8th minute is used for validation.
However, unlike for AOL, Hsu et al. (2019) trained the RNN to predict the log of the frequency of the given IP address, rather than the frequency itself, due to training stability considerations. To use it as a predictor, we exponentiate the RNN output. This inevitably leads to less accurate predictions.

[Figure 7: Error across the different AOL days; (a) sample size 5% of n, (b) sample size 10% of n]
[Figure 8: Error across the different IP minutes; (a) sample size 1% of n, (b) sample size 2.5% of n]

The results for minute 59 are shown in Figure 4. (As in the AOL data, the results across all minutes are similar; see more below.) Again we see a significant advantage to our algorithm over WY for small sample sizes. Here, unlike the AOL dataset, CR does not produce good results.

Zipfian data. To form a predictor for this synthetic distribution, we drew a random sample of size 10% of n, and used the empirical count of each element in this fixed sample as the prediction for its frequency. If the predictor is queried for an element that did not appear in the sample, its predicted probability is reported as the minimum, 1/n. We use this fixed predictor in all repetitions of the experiment (which were run on fresh independent samples). The results are reported in Figure 6. As with the IP data, our algorithm significantly improves over WY for small sample sizes, and both algorithms outperform CR by a large margin.

AOL and IP results over time. Finally, we present accuracy results over the days/minutes of the AOL/IP datasets (respectively). The purpose is to demonstrate that the performance of our algorithm remains consistent over time, even when the data has moved away from the initial training period and the predictor may become "stale". The results for AOL are shown in Figures 7a and 7b, yielding a median 2.2-fold and 3.0-fold improvement over WY (for sample sizes 5% of n and 10% of n, respectively). The results for IP are shown in Figures 8a and 8b, yielding a median 1.7-fold and 3.0-fold improvement over WY (for sample sizes 1% of n and 2.5% of n, respectively). As before, CR performs better than either algorithm on AOL, but fails by a large margin on IP.

ACKNOWLEDGMENTS

This research was supported in part by the NSF TRIPODS program (awards CCF-1740751 and DMS-2022448); NSF awards CCF-1535851, CCF-2006664 and IIS-1741137; Fintech@CSAIL; Simons Investigator Award; MIT-IBM Watson collaboration; Eric and Wendy Schmidt Fund for Strategic Innovation, Ben-Gurion University of the Negev; MIT Akamai Fellowship; and NSF Graduate Research Fellowship Program. The authors would like to thank Justin Chen for helpful comments on a draft of this paper, as well as the anonymous reviewers.

REFERENCES

Maria-Florina Balcan, Travis Dick, Tuomas Sandholm, and Ellen Vitercik. Learning to branch. In International Conference on Machine Learning, pp. 353–362, 2018.

Luca Baldassarre, Yen-Huan Li, Jonathan Scarlett, Baran Gözcü, Ilija Bogunovic, and Volkan Cevher. Learning-based compressive subsampling. IEEE Journal of Selected Topics in Signal Processing, 10(4):809–822, 2016.

Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G. Dimakis. Compressed sensing using generative models. In International Conference on Machine Learning, pp. 537–546, 2017.
Clément Canonne and Ronitt Rubinfeld. Testing probability distributions underlying aggregated data. In International Colloquium on Automata, Languages, and Programming, pp. 283–295. Springer, 2014.

Clément L. Canonne. A survey on distribution testing: Your data is big. But is it blue? Electronic Colloquium on Computational Complexity, 22:63, 2015.

Yihe Dong, Piotr Indyk, Ilya P. Razenshteyn, and Tal Wagner. Learning space partitions for nearest neighbor search. In International Conference on Learning Representations, 2020.

Ronald A. Fisher, A. Steven Corbet, and Carrington B. Williams. The relation between the number of species and the number of individuals in a random sample of an animal population. The Journal of Animal Ecology, pp. 42–58, 1943.

Oded Goldreich. Introduction to Property Testing. Cambridge University Press, 2017.

Irving J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4):237–264, 1953.

Danny Harnik, Ety Khaitzin, and Dmitry Sotnikov. Estimating unseen deduplication - from theory to practice. In 14th USENIX Conference on File and Storage Technologies (FAST 16), pp. 277–290, 2016.

Chen-Yu Hsu, Piotr Indyk, Dina Katabi, and Ali Vakilian. Learning-based frequency estimation algorithms. In International Conference on Learning Representations, 2019.

Tanqiu Jiang, Yi Li, Honghao Lin, Yisong Ruan, and David P. Woodruff. Learning-augmented data stream algorithms. In International Conference on Learning Representations, 2019.

Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 41–52, 2010.

Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pp. 6348–6358, 2017.

Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, pp. 489–504, 2018.

Thodoris Lykouris and Sergei Vassilvitskii. Competitive caching with machine learned advice. In International Conference on Machine Learning, pp. 3302–3311, 2018.

V. A. Markov. On functions of least deviation from zero in a given interval. St. Petersburg, 1892.

Michael Mitzenmacher. A model for learned Bloom filters and optimizing by sandwiching. In Advances in Neural Information Processing Systems, pp. 464–473, 2018.

Ali Mousavi, Ankit B. Patel, and Richard G. Baraniuk. A deep learning approach to structured signal recovery. In Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on, pp. 1336–1343. IEEE, 2015.

Krzysztof Onak and Xiaorui Sun. Probability revealing samples. In International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pp. 2018–2026. PMLR, 2018.

Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences, 113(47):13283–13288, 2016. ISSN 0027-8424. doi: 10.1073/pnas.1607774113. URL https://www.pnas.org/content/113/47/13283.

Manish Purohit, Zoya Svitkina, and Ravi Kumar. Improving online algorithms via ML predictions. In Advances in Neural Information Processing Systems, pp. 9661–9670, 2018.

Sofya Raskhodnikova, Dana Ron, Amir Shpilka, and Adam Smith. Strong lower bounds for approximating distribution support size and the distinct elements problem. SIAM Journal on Computing, 39(3):813–842, 2009.
Ronitt Rubinfeld. Taming big probability distributions. XRDS, 19(1):24–28, 2012.

Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, and Hervé Jégou. Spreading vectors for similarity search. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SkGuG2R5tm.

A. F. Timan, M. Stark, I. N. Sneddon, and S. Ulam. Theory of Approximation of Functions of a Real Variable. 1963.

Gregory Valiant and Paul Valiant. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pp. 685–694, 2011.

Paul Valiant and Gregory Valiant. Estimating the unseen: improved estimators for entropy and other properties. In Advances in Neural Information Processing Systems, pp. 2157–2165, 2013.

Jun Wang, Wei Liu, Sanjiv Kumar, and Shih-Fu Chang. Learning to hash for indexing big data - a survey. Proceedings of the IEEE, 104(1):34–57, 2016.

Yihong Wu and Pengkun Yang. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. The Annals of Statistics, 47(2):857–883, 2019.

James Zou, Gregory Valiant, Paul Valiant, Konrad Karczewski, Siu On Chan, Kaitlin Samocha, Monkol Lek, Shamil Sunyaev, Mark Daly, and Daniel G. MacArthur. Quantifying unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects. Nature Communications, 7(1):1–5, 2016.

A OMITTED PROOFS

A.1 ANALYSIS OF THE UPPER BOUND

In this subsection, we prove Theorem 1, which provides the upper bound for the sample complexity of Algorithm 1.

Proof of Theorem 1. Consider some i ∈ [n]. The predictor Π gives us an estimate Π(i) ∈ [p_i/b, p_i] of the unknown true probability p_i. Let j_i be the integer for which Π(i) ∈ [b^{j_i}/n, b^{j_i+1}/n). We treat i as assigned by the predictor to the interval I_{j_i}. Observe that as a consequence, we have

b^{j_i}/n ≤ p_i ≤ b^{j_i+2}/n.   (2)

Suppose i is observed k times in the sample. Define

Ŝ(i) = 1 − a_k·(n/b^{j_i})^k·k!/N^k   if b^{j_i}/n ≤ 0.5·(log n)/N;
Ŝ(i) = 1   if b^{j_i}/n > 0.5·(log n)/N and k ≥ 1;
Ŝ(i) = 0   if b^{j_i}/n > 0.5·(log n)/N and k = 0.

The a_k's are the Chebyshev polynomial coefficients as per Algorithm 1. Note that as we drew a total of Pois(N) samples, the number of times each element i appears in the sample is distributed as N_i ∼ Pois(N·p_i), and the values N_i are independent across different i's (this is the standard Poissonization trick). Thus the Ŝ(i)'s are also independent. Moreover, note that the value Ŝ_j defined in Algorithm 1 satisfies Ŝ_j = Σ_{i: j_i=j} Ŝ(i). Finally, by Equation (2) we have p_i·n/b^{j_i} ∈ [1, b²]. Thus, if b^{j_i}/n ≤ 0.5·(log n)/N, we compute the expectation of Ŝ(i) as (letting j = j_i to ease notation)

E[Ŝ(i)] = 1 − Σ_{k=0}^{L} P(N_i = k)·a_k·(n/b^j)^k·k!/N^k = 1 − Σ_{k=0}^{L} e^{−Np_i}·((Np_i)^k/k!)·a_k·(n/b^j)^k·k!/N^k = 1 − e^{−Np_i}·P_L(p_i·n/b^j) ∈ [1−ε, 1+ε],

since p_i·n/b^j ∈ [1, b²] (so |P_L(p_i·n/b^j)| ≤ ε) and since e^{−Np_i} ≤ 1. However, if b^{j_i}/n > 0.5·(log n)/N, then E[Ŝ(i)] equals the probability that i is sampled, which is 1 − (1 − p_i)^N. But since p_i ≥ b^{j_i}/n > 0.5·(log n)/N, we have that

1 ≥ E[Ŝ(i)] > 1 − (1 − 0.5·(log n)/N)^N > 1 − e^{−(0.5·(log n)/N)·N} = 1 − n^{−1/2}.

We note that if i is never seen (which is possible even if p_i ≠ 0), we do not know j_i. However, in this case, Ŝ(i) = 0 regardless of the value of j_i. Therefore, we get an estimator Ŝ(i) such that Ŝ(i) = 0 if p_i = 0 and E[Ŝ(i)] ∈ [1−ε, 1+ε] otherwise, assuming ε > n^{−1/2}.
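As a sanity check on the per-element estimator as reconstructed above, the following short script computes E[Ŝ(i)] exactly (only terms with k ≤ L contribute, since a_k = 0 for k > L) and verifies that it stays within [1 − ε, 1 + ε] whenever p_i·n/b^{j_i} ∈ [1, b²]. The Chebyshev construction mirrors the sketch from Section 2.2; the parameter values are arbitrary and the whole script is our own illustration.

```python
import math

import numpy as np
from numpy.polynomial import chebyshev as cheb, polynomial as poly

def chebyshev_coefficients(L, b):
    # Same construction as in the Algorithm 1 sketch in Section 2.2.
    q = cheb.cheb2poly([0.0] * L + [1.0])
    lin = np.array([-(b ** 2 + 1) / (b ** 2 - 1), 2.0 / (b ** 2 - 1)])
    comp = np.array([q[-1]])
    for c in q[-2::-1]:
        comp = poly.polymul(comp, lin)
        comp[0] += c
    return comp / comp[0]

def expected_S_i(p, n, N, b, j, a):
    """Exact E[S_hat(i)] for N_i ~ Poisson(N*p); only k <= L contributes, since a_k = 0 for k > L."""
    val = 1.0
    for k in range(len(a)):
        pois_k = math.exp(-N * p) * (N * p) ** k / math.factorial(k)
        val -= pois_k * a[k] * (n / b ** j) ** k * math.factorial(k) / N ** k
    return val

n, b, L, j = 10 ** 5, 2.0, 6, 3
N = int(b ** 2 * L * n ** (1 - 1.0 / L))
a = chebyshev_coefficients(L, b)
eps = 1.0 / abs(cheb.chebval((b ** 2 + 1) / (b ** 2 - 1), [0.0] * L + [1.0]))
for p in np.linspace(b ** j / n, b ** (j + 2) / n, 7):     # p * n / b^j spans [1, b^2]
    assert abs(expected_S_i(p, n, N, b, j, a) - 1.0) <= eps + 1e-9
```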
Next we analyze the variance of Ŝ(i). For i with p_i = 0, Ŝ(i) = 0 always, so Var(Ŝ(i)) = 0. Otherwise, if b^{j_i}/n > 0.5·(log n)/N, then Ŝ(i) is always between 0 and 1, so Var(Ŝ(i)) ≤ 1. Finally, if b^{j_i}/n ≤ 0.5·(log n)/N, then Var(Ŝ(i)) ≤ E[(Ŝ(i) − 1)²] and we can write (again with j = j_i)

E[(Ŝ(i) − 1)²] = Σ_{k=0}^{L} P(N_i = k)·(a_k·(n/b^j)^k·k!/N^k)² = Σ_{k=0}^{L} e^{−Np_i}·((Np_i)^k/k!)·a_k²·(n/b^j)^{2k}·(k!)²/N^{2k} ≤ (max_{0≤k≤L} a_k²)·Σ_{k=0}^{L} (b²·k·n/N)^k,   (3)

where for the last inequality we noted that k! ≤ k^k, p_i·n/b^j ≤ b², and b^j ≥ 1. Now, it is a well-known consequence of the Markov brothers' inequality that all the coefficients of the standard degree-L Chebyshev polynomial Q_L(x) are bounded by e^{O(L)} (Markov, 1892). Since P_L(x) = Σ_{k=0}^{L} a_k·x^k is, up to sign, just ε·Q_L((2x − (b²+1))/(b²−1)), we have that for any fixed b = O(1), the coefficients a_k are all bounded by e^{O(L)} as well. Therefore, for N = C·b²·L·n^{1−1/L} for some constant C, we have that b²·k·n/N ≤ n^{1/L}/C, so we can bound Equation (3) by e^{C'·L}·n/C^L for some other constant C'. In summary, if p_i ≠ 0, we have that E[Ŝ(i)] ∈ [1−ε, 1+ε] and Var(Ŝ(i)) ≤ ε²·n if we choose C to be a sufficiently large constant, since ε = e^{−Θ(L)} for a constant 1 < b ≤ O(1). This is true even for b^{j_i}/n > 0.5·(log n)/N, since ε²·n > 1.

However, we know that Ŝ = Σ_{j=0}^{⌈log_b n⌉} Ŝ_j = Σ_{i=1}^{n} Ŝ(i), since by the definition of Ŝ_j, Ŝ_j = Σ_{i: j_i=j} Ŝ(i). Therefore, since the estimators Ŝ(i) are independent, we have that E[Ŝ] = Σ_{i=1}^{n} E[Ŝ(i)] ∈ ((1−ε)·S, (1+ε)·S) and Var(Ŝ) = Σ_{i: p_i≠0} Var(Ŝ(i)) ≤ ε²·n·S, where we use the independence of the Ŝ(i)'s. Therefore, since S ≤ n, with probability at least 0.9, |Ŝ − S| = O(ε)·√(nS) by Chebyshev's inequality.

Remark. We note that our analysis actually works (and becomes somewhat simpler) even if the algorithm is modified to define Ŝ_j = Σ_{i∈[n]: Π(i)∈I_j} (1 − a_{N_i}·(n/b^j)^{N_i}·N_i!/N^{N_i}) for all intervals, as opposed to Ŝ_j = #{i ∈ [n] : N_i ≥ 1, Π(i) ∈ I_j} for the intervals I_j with b^j/n > 0.5·(log n)/N. However, because we can decrease the bias as well as the variance for these larger intervals, we modify the algorithm accordingly. While this does not affect the theoretical guarantees, it yields an improvement in practice. This threshold was also used in Wu & Yang (2019) (the choice of leading constant 0.5 in the threshold term 0.5·(log n)/N is arbitrary, and for simplicity we have adopted the value they used).

A.2 ANALYSIS OF THE LOWER BOUND

In this subsection, we prove Theorem 2, which shows that the sample complexity of Algorithm 1 is essentially tight.

Proof of Theorem 2. In order to prove the lower bound, we shall define two distributions P and Q whose first k moments are matching, but whose support sizes differ by at least εn elements. Furthermore, both distributions will only have probability values in {0} ∪ {k/n, . . . , 2k/n}, so that a 2-factor approximation predictor does not provide any useful information to an algorithm trying to distinguish the two distributions.

We start with an overview of the definition of the two distributions and then explain how to formalize the arguments, following the Poissonization and rounding techniques detailed in Raskhodnikova et al. (2009). Let

ε = 1 / (k·2^{k−1}·binom(2k, k))

for some integer k ≥ 1. Note that ε = e^{−Θ(k)}. In order to define the distributions, we define k + 1 real numbers a_0, . . . , a_k as

a_i = (−1)^i·binom(k, i) / (2^{k−1}·(k+i))   for every i ∈ {0, . . . , k}.

Suppose for now that n is a multiple of the least common multiple of 2^{k−1}·k, . . . , 2^{k−1}·2k, so that a_i·n is an integer for every i.
We define P and Q as follows:
The distribution P: for every a_i such that a_i > 0, the support of P contains a_i·n elements j that have probability p_j = (k+i)/n.
The distribution Q: for every a_i such that a_i < 0, the support of Q contains −a_i·n elements j that have probability q_j = (k+i)/n.

First we prove that P and Q are valid distributions and that all of their non-zero probabilities are greater than 1/n.

Claim A.1. P and Q as defined above are distributions. Furthermore, their probability values are either 0 or greater than 1/n.

Proof. The second part of the claim follows directly from the definition of P and Q. We continue to prove the first part, that P and Q are indeed distributions. It holds for P that

Σ_{i: a_i>0} (n·a_i)·(k+i)/n = Σ_{i: a_i>0} binom(k, i)/2^{k−1} = 1.

Similarly, for Q,

Σ_{i: a_i<0} (n·(−a_i))·(k+i)/n = Σ_{i: a_i<0} binom(k, i)/2^{k−1} = 1.

(In both cases we used that the binomial coefficients over even indices, as well as over odd indices, sum to 2^{k−1}.)

We continue to prove that the first k moments of P and Q are matching, and that their support sizes differ by εn.

Claim A.2. Let a_0, . . . , a_k be defined as above. Then
(i) for any r ∈ {1, . . . , k}, it holds that Σ_{i=0}^{k} a_i·(k+i)^r = 0;
(ii) Σ_{i=0}^{k} a_i = ε.

Proof. By plugging in the a_i's as defined above,

Σ_{i=0}^{k} a_i·(k+i)^r = Σ_{i=0}^{k} ((−1)^i·binom(k, i)/(2^{k−1}·(k+i)))·(k+i)^r = (1/2^{k−1})·Σ_{i=0}^{k} (−1)^i·binom(k, i)·(k+i)^{r−1}.

Hence, letting r' = r − 1, it suffices to prove that Σ_{i=0}^{k} (−1)^i·binom(k, i)·(k+i)^{r'} = 0 for all 0 ≤ r' ≤ k−1. For any fixed k, note that since (k+i)^{r'} is a degree-r' polynomial in i, it can be written as a linear combination of the binom(i, s) for 0 ≤ s ≤ r', with coefficients b_0, . . . , b_{r'}. Therefore, we would like to prove that Σ_{i=0}^{k} (−1)^i·binom(k, i)·Σ_{s=0}^{r'} b_s·binom(i, s) = 0. Fix some s in {0, . . . , r'}: it suffices to show that for any integer k, Σ_{i=0}^{k} (−1)^i·binom(k, i)·binom(i, s) = 0 for all 0 ≤ s ≤ k−1. Since binom(k, i)·binom(i, s) = binom(k; k−i, i−s, s) = binom(k, s)·binom(k−s, i−s), by setting k' = k − s and i' = i − s, we get

Σ_{i=0}^{k} (−1)^i·binom(k, i)·binom(i, s) = binom(k, s)·Σ_{i'=0}^{k'} (−1)^{i'+s}·binom(k', i') = binom(k, s)·(−1)^s·(1−1)^{k'} = 0,

where the last equality is since k' = k − s ≥ 1. This concludes the proof of the first item.

We continue to prove the second item in the claim:

Σ_{i=0}^{k} a_i = ε.   (4)

Recall that a_i = (−1)^i·binom(k, i)/(2^{k−1}·(k+i)) and that ε = 1/(k·2^{k−1}·binom(2k, k)), so plugging these into Equation (4) and multiplying both sides by 2^{k−1}·binom(2k, k), this is equivalent to proving

Σ_{i=0}^{k} (−1)^i·binom(k, i)·binom(2k, k) / (k+i) = 1/k.   (5)

Since binom(k, i)·binom(2k, k) = binom(2k; k, i, k−i) = binom(2k, k+i)·binom(k+i, k), the left-hand side of Equation (5) equals

Σ_{i=0}^{k} (−1)^i·binom(2k, k+i)·binom(k+i, k)/(k+i) = (1/k)·Σ_{i=0}^{k} (−1)^i·binom(2k, k+i)·binom(k+i−1, k−1).

Multiplying by k, it suffices to show that

Σ_{i=0}^{k} (−1)^i·binom(2k, k+i)·binom(k+i−1, k−1) = 1.   (6)

To do this, let j = k + i. Then, note that binom(k+i−1, k−1) = binom(j−1, k−1) = (j−1)···(j−k+1)/(k−1)!, which is a degree-(k−1) polynomial in j. For 1 ≤ j ≤ k−1 the polynomial equals 0, while for j = 0 it equals (−1)^{k−1}. Therefore, the left-hand side of Equation (6) equals

Σ_{j=k}^{2k} (−1)^{j−k}·binom(2k, j)·(j−1)···(j−k+1)/(k−1)! = 1 + (−1)^k·Σ_{j=0}^{2k} (−1)^j·binom(2k, j)·(j−1)···(j−k+1)/(k−1)!.

As proven in the first part of this proof, Σ_{j=0}^{2k} (−1)^j·binom(2k, j)·P(j) = 0 for any polynomial P(j) of degree at most 2k−1. Therefore, the summation on the right-hand side is 0, and Equation (6) simplifies to 1, as required.

The following corollary follows directly from the definition of P and Q and the previous claim.

Corollary A.1. The following two items hold for P and Q as defined above.
(i) E[P^r] = E[Q^r] for all r ∈ {1, . . . , k}.
(ii) The support sizes of P and Q differ by εn.

The above corollary states that P and Q as defined above indeed have matching moments for r = 1 to k, and that they differ by εn in their support size. This concludes the high-level view of the construction of the distributions P and Q. In order to finalize the proof we rely on the standard Poissonization and rounding techniques.
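The following short script checks the construction as reconstructed above with exact arithmetic: for several values of k it verifies that P and Q have total mass 1, that the first k moments match, and that Σ a_i = ε, i.e., that the support sizes differ by εn. It tests our reading of the formulas for a_i and ε, and is not code from the paper.

```python
from fractions import Fraction
from math import comb

def lower_bound_construction(k):
    a = [Fraction((-1) ** i * comb(k, i), 2 ** (k - 1) * (k + i)) for i in range(k + 1)]
    eps = Fraction(1, k * 2 ** (k - 1) * comb(2 * k, k))
    return a, eps

for k in range(1, 8):
    a, eps = lower_bound_construction(k)
    # P places a_i * n elements at mass (k+i)/n when a_i > 0; Q uses -a_i * n when a_i < 0.
    assert sum(ai * (k + i) for i, ai in enumerate(a) if ai > 0) == 1     # P has total mass 1
    assert sum(-ai * (k + i) for i, ai in enumerate(a) if ai < 0) == 1    # Q has total mass 1
    for r in range(1, k + 1):                                             # first k moments match
        assert sum(ai * (k + i) ** r for i, ai in enumerate(a)) == 0
    assert sum(a) == eps                                                  # support sizes differ by eps * n
print("construction checks passed")
```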
First, by Theorem 5.3 in Raskhodnikova et al. (2009), any s-sample algorithm can be simulated by an O(s)-Poisson algorithm. Hence, we can assume that the algorithm takes Poi(s) samples, rather than an arbitrary number s. Second, we can remove the assumption that the values a_i·n (and similarly −a_i·n) are integral for all i, by rounding each value down when it is not integral and adding elements of probability 1/n to P so that its total mass is 1 (and analogously for Q). By Claim 5.5 in Raskhodnikova et al. (2009), this process increases the number of values in P and Q by at most O(k²). Hence, the distributions are now well defined, and we can rely on the following theorem from Raskhodnikova et al. (2009).

Theorem 3 (Corollary 5.7 in Raskhodnikova et al. (2009), restated). Let P and Q be random variables over positive integers b_1 < . . . < b_{k−1} that have matching moments 1 through k − 1. Then any Poisson-s algorithm A that succeeds in distinguishing P and Q with high probability requires s = Ω(n^{1−1/k}).

Therefore, plugging in the value of ε, we get an s = Ω(n^{1−Θ(1/log(1/ε))}) lower bound, as claimed.

A.3 BOUNDS WHEN THE PREDICTOR IS ACCURATE UP TO LOW TOTAL VARIATION DISTANCE

In this section, we consider estimating the support size when we are promised a bound on the total variation distance, or ℓ1 distance, between the prediction oracle and the true distribution. First, we show that if the predictor is accurate up to extremely low total variation distance, then the algorithm of Canonne & Rubinfeld (2014) is optimal.

Theorem 4. Let Π represent our predictor, where for each sample we are given a random sample i ∼ P along with Π(i). If we define p(i) as the probability of element i under P, then if ‖P − Π‖_1 = Σ_i |p(i) − Π(i)| ≤ ε/2, then using O(ε⁻²) samples, the algorithm of Canonne and Rubinfeld can approximate the support size up to an εn additive error.

Proof. The algorithm of Canonne & Rubinfeld (2014) works as follows. For each sample (i, Π(i)), it computes 1/Π(i), and averages this quantity over O(ε⁻²) samples. From now on, we also make the assumption that Π(i) ≥ 1/n for all i. This is because we know p(i) ≥ 1/n whenever i is sampled, so the predictor Π'(i) = max(Π(i), 1/n) is a strict improvement over Π for any i that is sampled.

If we draw i ∼ P, then E[1/p(i)] = Σ_{i: p(i)≠0} p(i)·(1/p(i)) = S, where S is the support size. Also,

E[1/Π(i) − 1/p(i)] = Σ_{i: p(i)≠0} p(i)·(1/Π(i) − 1/p(i)) = Σ_{i: p(i)≠0} (p(i)/Π(i) − 1),

which, since Π(i) ≥ 1/n, is bounded in absolute value by Σ_i |p(i)/Π(i) − 1| ≤ n·Σ_i |p(i) − Π(i)| ≤ εn/2. So E_{i∼P}[1/Π(i)] ∈ [S − εn/2, S + εn/2]. Since we have made sure that Π(i) ≥ 1/n whenever i is sampled, Var(1/Π(i)) ≤ n². Therefore, by a basic application of Chebyshev's inequality, an average of O(ε⁻²) samples of 1/Π(i) is sufficient to estimate S up to an εn additive error.

In our experiments, the Canonne and Rubinfeld algorithm turns out to be far less robust to the error of our predictor than our main algorithm, even when given an equal number of samples. This suggests that an ε-total-variation-distance bound is a less suitable assumption in our context. Conversely, if the predictor is only promised to be accurate up to a not-as-low total variation distance, we show that no algorithm can do significantly better than Wu & Yang (2019).
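For reference, here is a minimal sketch of the Canonne-Rubinfeld-style estimator analyzed in the proof of Theorem 4: average 1/max(Π(i), 1/n) over the samples. The function name and the toy usage are our own illustration.

```python
import random

def cr_support_estimate(samples_with_predictions, n):
    """Average of 1 / max(Pi(i), 1/n) over the samples, as in the proof of Theorem 4."""
    vals = [1.0 / max(pred, 1.0 / n) for _, pred in samples_with_predictions]
    return sum(vals) / len(vals)

# Toy usage with a perfect predictor: uniform distribution on 500 out of n = 1,000 elements.
support, n = list(range(500)), 1000
samples = [(i, 1.0 / len(support)) for i in random.choices(support, k=4000)]
print(cr_support_estimate(samples, n))   # close to the true support size, 500
```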
Theorem 5. Let Π represent our predictor, and recall that p(i) denotes the probability of element i under P. Suppose the only promise we have on Π is that ‖P − Π‖_1 = Σ_i |p(i) − Π(i)| ≤ γ for some γ ≥ Cε, where C is some sufficiently large fixed constant and ε ≥ 1/n. Then, there does not exist an algorithm that can learn the support size of P up to an εn additive error using o(n/log n) samples.

Proof. Consider some unknown distribution D over [n] that we want to learn the support size of, with only samples and no frequency predictor. Moreover, assume that for all i ∈ [n], D(i) ≥ 1/n whenever D(i) > 0, where D(i) is the probability of i under D. Now, let m = n/γ, and consider the following distribution P on [m]. For n < i ≤ m, let p(i) = 1/m = γ/n, and for 1 ≤ i ≤ n, let p(i) = γ·D(i). Let Π be the uniform distribution on [n+1 : m], i.e., Π(i) = 0 if i ≤ n and Π(i) = 1/(m−n) if i ≥ n+1. It is clear that P and Π are both distributions over [m] (i.e., the probabilities sum to 1), that the minimum nonzero probability of P is at least 1/m, and that TV(P, Π) ≤ γ. Finally, the support size of P is precisely m − n more than the support size of D.

Assume the theorem we are trying to prove is false. Then, there is an algorithm that, given k = o(m/log m) random samples (i, Π(i)) where each i ∼ P, can determine the support size of P up to an ε·m additive error. Then, since the support size of D is exactly m − n less than the support size of P, one can learn the support size of D up to an ε·m = εn/γ ≤ n/C additive error using k samples (i, Π(i)) with i ∼ P.

Now, using only Θ(γ·k) samples from D, it is easy to simulate k samples of the form (i, Π(i)), where i ∼ P. This is because to generate a sample (i, Π(i)), we first decide whether i > n or i ≤ n (the latter occurring with probability γ). In the former case, we draw i uniformly at random from [n+1 : m] and let Π(i) = 1/(m−n). In the latter case, we know that P conditioned on i ≤ n has the same distribution as D, so we just draw i ∼ D and set Π(i) = 0, since Π is supported only on [n+1 : m]. With very high probability (by a simple application of the Chernoff bound), we will not use more than Θ(γ·k) samples from D.

Overall, this means that using only Θ(γ·k) = o(γ·m/log m) = o(n/log m) = o(n/log n) samples from D (since γ ≥ ε ≥ 1/n, so that m ≤ n² and log m = O(log n)), we have learned the support size of D up to an n/C additive error. This algorithm works for an arbitrary D, and the predictor Π does not actually depend on D. Therefore, for sufficiently large C, this violates the lower bound of Wu & Yang (2019), who show that one cannot learn the support size of D up to an n/C additive error using o(n/log n) samples.
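Finally, a minimal sketch of the sampling reduction used in the proof above: a sample (i, Π(i)) from P is simulated using a fresh sample from D only with probability γ, which is what keeps the total number of D-samples at Θ(γ·k). Names and parameter values are illustrative.

```python
import random

def simulate_sample_from_P(draw_from_D, n, m, gamma, rng=random):
    """One sample (i, Pi(i)) from the distribution P built in the proof of Theorem 5,
    consuming a fresh sample from D only with probability gamma."""
    if rng.random() < gamma:
        return draw_from_D(), 0.0                     # i <= n: P conditioned here equals D
    return rng.randint(n + 1, m), 1.0 / (m - n)       # i > n: uniform part, where Pi is supported

# Toy usage: D uniform over {1, ..., 200}, gamma = 0.1, m = n / gamma.
n, gamma = 1000, 0.1
m = int(n / gamma)
print([simulate_sample_from_P(lambda: random.randint(1, 200), n, m, gamma) for _ in range(3)])
```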