# Optimal Hypothesis Selection in (Almost) Linear Time

Maryam Aliakbarpour, Department of Computer Science, Rice University, Houston, TX 77005, maryama@rice.edu
Mark Bun, Department of Computer Science, Boston University, Boston, MA 02215, mbun@bu.edu
Adam Smith, Department of Computer Science, Boston University, Boston, MA 02215, ads22@bu.edu

Abstract

Hypothesis selection, also known as density estimation, is a fundamental problem in statistics and learning theory. Given a sample set from an unknown distribution P and a finite class of candidate distributions (hypotheses) H := {H_1, H_2, . . . , H_n}, the goal is to design an algorithm that selects a distribution Ĥ from H that best describes P. The accuracy of the algorithm is measured by the distance between Ĥ and P, compared to the distance between the closest distribution in H and P (denoted by OPT). Specifically, we aim for ∥Ĥ − P∥_TV to be at most α·OPT + ϵ for some small ϵ and α. While the value of ϵ can be reduced with an increasing number of samples, α is an inherent characteristic of the algorithm. Achieving α < 3 is impossible, even with only two candidate hypotheses, unless the number of samples is proportional to the domain size of P [Bousquet, Kane, Moran 19]. Finding a computationally efficient algorithm that achieves the optimal α has been a primary focus of research since the early work of [Devroye, Lugosi 01]. Before our work, the algorithms achieving α < 5 required time Ω(n²). We present the first algorithm that operates in almost linear time (Õ(n/ϵ³)) and achieves α = 3. This result improves upon a long line of hypothesis selection research. Previously known algorithms had either worse time complexity, a larger α factor, or additional assumptions about the problem setting. Additionally, we provide another (almost) linear-time algorithm with better dependency on the additive accuracy parameter ϵ, albeit with a slightly worse accuracy parameter of α = 4.
1 Introduction

Hypothesis selection, also known as density estimation, is a fundamental problem in statistics and learning theory. This problem involves identifying a density function that accurately represents the distribution of a given dataset. Suppose we are given a dataset of samples drawn from an unknown distribution P and a finite class of known distributions representing different hypotheses: H := {H_1, H_2, . . . , H_n}. The goal is to select a distribution in H that is close to P in total variation (TV) distance. Typically, learning a distribution over a domain X with ϵ error in TV distance requires Ω(|X|/ϵ²) samples, which presents a substantial lower bound on the sample complexity for distributions over large domains.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Surprisingly, the findings of Yatracos, Devroye, and Lugosi revealed that for hypothesis selection, this sample complexity can be independent of the domain size and only logarithmic in the number of hypotheses [Yat85, DL96, DL97]. With just s := Θ(log n/ϵ²) samples from P, it is possible to learn P within error α·OPT + ϵ, where OPT denotes the distance of the nearest distribution in H to P. Specifically, Devroye and Lugosi introduced two algorithms for this problem: the Scheffé tournament, which operates in O(n²·s) time¹ and achieves α = 9; and the minimum distance estimate, which runs in O(n³·s) time and achieves α = 3 [DL01, Chapter 6]. In these results, although ϵ can be decreased as the number of samples increases, α remains an inherent parameter of the algorithm. Significant effort has been directed towards finding computationally efficient algorithms for this problem while maintaining sample efficiency. The trade-off between the accuracy parameter α and computational efficiency has been a focal point. Mahalanabis and Stefankovic [MS08] enhanced the minimum distance estimate, improving the time complexity to O(n²·s).
They also introduced a nearly linear-time algorithm that achieves α = 3, but it requires time exponential in n for preprocessing the class H. Other nearly linear-time algorithms were developed achieving α = 9 in [AJOS14, AFJ+18, AAC+23]² and α = 5 in [ABS23]. Furthermore, linear-time algorithms have been presented under the assumption that the algorithm receives the value of OPT (or an upper bound on it) as input [DK14, ABS23]. However, achieving α < 3 is not possible unless the number of samples becomes poly(|X|), as indicated in [BKM19]. Despite the long-standing history of this problem, the following question remained open:

Is there an algorithm for hypothesis selection with the optimal number of samples s = O(log(n)/ϵ²) and optimal accuracy parameter α = 3 that runs in Õ(n·s) time?

We present the first almost linear-time algorithm that uses the optimal number of samples and achieves the optimal accuracy parameter α = 3. Our algorithm runs in Õ(n·s/ϵ) time (see Theorem 2). Additionally, we introduce another algorithm with improved dependency on ϵ, running in Õ(n·s) time while obtaining a slightly higher accuracy parameter, α = 4 (see Theorem 3). Our results represent a significant step forward, as they are the first in decades to achieve time complexity linear in n for any α < 5. See Table 1.

Applications of hypothesis selection: The primary application of hypothesis selection is to identify the best distribution from a set of known, effectively manageable models that represent potential underlying data distributions. For example, this set might include Poisson, gamma, and binomial distributions with various parameters, used to model the number of arrivals per time unit. This makes hypothesis selection applicable to tasks like interpretable distribution selection and strategy optimization, where the objective is to choose the most suitable model from the available options.
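For concreteness, such a hypothesis class can be assembled from a parametric family. The sketch below builds a toy class of truncated Poisson candidates for count data; the rates and the truncation point are hypothetical choices, purely for illustration.

```python
import math

def poisson_pmf(lam, support):
    # Truncated Poisson PMF over a finite support: a toy stand-in for a
    # candidate hypothesis (the truncation leaves a tiny mass deficit).
    return {k: math.exp(-lam) * lam ** k / math.factorial(k) for k in support}

# A small hypothesis class H for count data: Poisson candidates with a few
# hypothetical rates. A selection algorithm would pick the member closest
# in TV distance to the unknown sample-generating distribution.
support = range(30)
H = [poisson_pmf(lam, support) for lam in (1.0, 2.0, 5.0)]
```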
Another key strength of hypothesis selection is its agnostic nature, allowing it to adapt even when the true distribution lies outside the considered class. This robustness makes it effective in noisy data settings, with applications in denoising and anomaly detection. From a theoretical standpoint, hypothesis selection is fundamental in learning structured distributions, particularly when combined with the cover method, as seen in learning mixtures of Gaussians [DK14, SOAJ14]. For additional references, see Section 1.3.

Importance of improving α by a constant factor: In most learning algorithms, the error guarantee decreases polynomially as the number of samples increases, so constant factors may not be crucial. However, this is not the case in hypothesis selection. The output hypothesis is guaranteed to be (α·OPT + ϵ)-close to P in total variation distance. While increasing the number of samples can reduce ϵ to negligible levels, it does not improve the term α·OPT. Hence, α is an inherent property of the algorithm and directly impacts the best achievable error guarantee. Therefore, even a constant improvement in α is significant.

One might argue that, alternatively, OPT could be reduced by carefully curating the class H. However, beyond the practical challenges of finding a better H, it is unclear whether this can be achieved without significantly increasing the size of the hypothesis class H. For example, in the cover method, when aiming to learn a distribution within a class C, we set H to be a γ-net that serves as a cover for C, ensuring that OPT < γ.

¹This time bound assumes constant-time comparisons of the H_i's density functions. See Section 1.1.
²Theorem 4.1 in [AAC+23] states the result for α = 27. However, according to personal correspondence with the authors, it is possible to modify their algorithm in conjunction with the minimum distance estimate and improve α to 9.
While using a finer γ-net can reduce OPT, it may also drastically increase the size of H, thereby increasing the algorithm's running time, since the size of the net can grow super-polynomially with respect to γ. For instance, in the case of mixtures of k Gaussians, the size of the net depends on γ as roughly O(γ^(−3k)) (see [SOAJ14]). Thus, reducing γ by a factor of three could exponentially increase the size of H in terms of k, thereby increasing both the running time and sample complexity, ultimately resulting in an inefficient algorithm.

Table 1: Summary of past results in hypothesis selection. All algorithms use s = Θ(log n/ϵ²) samples.

| Result | α | Time complexity | Additional requirement |
|---|---|---|---|
| Min distance estimate [DL01] | 3 | O(n³·s) | |
| Scheffé tournament [DL01] | 9 | O(n²·s) | |
| Min distance estimate [MS08] | 3 | O(n²·s) | |
| [AJOS14, AFJ+18] | 9 | Õ(n·s) | |
| [ABS23] | 5 | Õ(n·s) | |
| [AAC+23] | 9 | Õ(n·s/log n) | |
| [MS08] | 3 | Õ(n·s) | Exponential-time preprocessing |
| [DK14, ABS23] | 3 | Õ(n·s) | Assumes knowledge of OPT |
| Lower bound [BKM19] | | | Achieving α < 3 requires poly(|X|) samples |
| This work: Algorithm 1 | 3 | Õ(n·s/ϵ) | |
| This work: Algorithm 4 | 4 | Õ(n·s) | |

1.1 Problem Setup

Suppose we have an unknown distribution P over a domain X and a set of n known distributions H := {H_1, H_2, . . . , H_n}. Let OPT denote the distance between P and the closest distribution to it in H³:

OPT(H, P) := min_{H ∈ H} ∥H − P∥_TV.

We use the standard access model for this problem as introduced in [DL01]. The algorithm accesses the distributions through the following types of queries:

1. The algorithm can request a sample from the unknown distribution P.
2. The algorithm can compare the PDFs of two known distributions: for every domain element x ∈ X and every pair of indices i and j, it can ask whether H_j(x) < H_i(x). This is equivalent to asking whether x is in the Scheffé set of H_i and H_j (defined in Equation (2)).
3. The algorithm can query the probability mass of the Scheffé sets according to all the known distributions.

Remark 1.
The last requirement of our model can be relaxed. Only estimates of the probability masses of the Scheffé sets are needed for our algorithms. Thus, one can alternatively assume sample access to the H_i's and estimate these values via samples.

Definition 1.1 (Proper learner for hypothesis selection). Suppose algorithm A is given parameters ϵ, δ ∈ (0, 1), α ∈ R_{>0} and has access to an unknown distribution P and a class of n known distributions H (as described above). We say A is an (α, ϵ, δ)-proper learner for the hypothesis selection problem if for every P and H, A outputs Ĥ ∈ H for which, with probability at least 1 − δ,

∥Ĥ − P∥_TV ≤ α·OPT(H, P) + ϵ. (1)

³We omit the arguments (H, P) when they are clear from the context.

We refer to α as the accuracy parameter, ϵ as the error (or proximity) parameter, and δ as the confidence parameter of the algorithm.

1.2 Main theorems

Below, we provide informal versions of our theorems.

Theorem 2. For every ϵ, δ ∈ (0, 1), Algorithm 1 is an (α = 3, ϵ, δ)-proper learner for hypothesis selection that uses s = O(log(n/δ)/ϵ²) samples and time O(n·s/(δ³·ϵ)) = Õ(n/(δ³·ϵ³)).

Theorem 3. For every ϵ, δ ∈ (0, 1), Algorithm 4 is an (α = 4, ϵ, δ)-proper learner for hypothesis selection that uses s = O(log(n/δ)/ϵ²) samples and time O(n·s·log(1/δ)) = Õ(n·log²(1/δ)/ϵ²).

For formal statements, see Theorem 5 (Appendix A) and Corollary B.1 (Appendix B). Our results are the first to achieve time complexity linear in n for any α < 5. To achieve this, we introduce novel algorithmic techniques that will hopefully be broadly useful; see Section 3 for an overview. Both algorithms use the optimal number of samples. The first algorithm obtains the optimal accuracy parameter α = 3 with a time complexity of O((n/(δ³·ϵ))·s), which is off from linear by an O(1/(δ³·ϵ)) factor. Our second algorithm achieves the optimal time complexity up to logarithmic factors while achieving a slightly higher α = 4.
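To make the sample bound in the theorems concrete: the s = O(log(n/δ)/ϵ²) count comes from estimating every semi-distance to within ϵ via a Hoeffding bound plus a union bound over the at most n² Scheffé sets (as detailed in Section 2). A minimal calculator follows; the explicit constants are our own illustrative choice, not the paper's.

```python
import math

def num_samples(n, delta, eps):
    # Hoeffding: P(|empirical - true| > eps) <= 2*exp(-2*s*eps^2) per set.
    # Union-bounding over n^2 Scheffé sets and solving for s gives
    # s = ceil(ln(2*n^2/delta) / (2*eps^2)).  Constants are illustrative.
    return math.ceil(math.log(2 * n * n / delta) / (2 * eps ** 2))
```

For example, `num_samples(100, 0.1, 0.1)` is 611: the dependence on the number of hypotheses is only logarithmic, as the introduction emphasizes.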
Our results leave a fascinating open question: can one combine the best aspects of both algorithms, maintaining α = 3, achieving Õ(n·s) time (or even lower), and sample complexity s = O(log(n/δ)/ϵ²)?

Remark 4. Readers may be surprised by the polynomial dependence on 1/δ in Theorem 2. In many settings, the success probability of a learning algorithm can be amplified from a constant, say 2/3, to at least 1 − δ at a cost of at most log(1/δ) in running time and sample complexity. However, in hypothesis selection, there is no (known) general technique for boosting the confidence parameter while keeping α the same. The issue is that choosing the best output from several runs of a given algorithm requires executing a second hypothesis selection algorithm, which introduces another factor of α in the approximation, leading to a total factor of at least 9. As a result, these kinds of two-phase algorithms are not sufficient in the low-α regime. Some previous results, such as [ABS23], also suffer from a polynomial dependency on δ.

1.3 Other related work

Hypothesis selection has been studied in various settings, including the improper setting. In [BKM19], the authors consider the improper version of the problem, where the output hypothesis Ĥ may not necessarily be in H. They presented an improper learner with accuracy guarantee α = 2. The sample complexity of their algorithm was improved by [BBK+21], who gave an algorithm with nearly optimal sample complexity and the same accuracy parameter α = 2. It is worth noting that our algorithms are proper learners and solve this problem with a slightly better sample complexity. In addition, like other proper learners, our algorithms select their output from a predefined set, which can facilitate choosing a distribution with a specific structural property (e.g., a mixture of Gaussians). In certain applications, this selection ensures consistency with the problem's underlying assumptions, which enhances interpretability and robustness.
The problem of hypothesis selection in sublinear time was studied for distributions on discrete domains [AAC+23]. Among other results, the authors developed a data structure that, upon receiving samples from an unknown distribution P, returns a hypothesis Ĥ in o(n) time. While their algorithm runs in sublinear time, their sample complexity depends on the domain size of the distribution, and their setting allows preprocessing of the class H in polynomial time. Another interesting variation of hypothesis selection is designing differentially private learners for the problem, which has received attention over the past few years [BKSW19, CKM+19, GKK+20].

An important application of hypothesis selection arises when there is a structural assumption on the underlying distributions. One approach for learning these classes is to selectively choose a cover for the class. One can then use learners for the standard hypothesis selection problem (which we study in this paper) with the cover as the class H. Examples of such structural assumptions include Poisson binomial distributions [DDS12], mixtures of Gaussians [KMV12, DK14, SOAJ14, DKS17, KSS18, ABM18, ABH+20], distributions with piecewise polynomial PDFs [ADLS17], and histograms [Pea95, CDSS14, CDKL22]. See Diakonikolas [Dia16] for a survey of results.

2 Preliminaries

Notation: We use the following notation throughout this article. We use [n] to denote the set {1, 2, . . . , n}. For a distribution Q over X, Q(x) denotes the PDF of Q at the domain element x ∈ X. For any measurable subset S ⊆ X, Q(S) denotes the probability mass of S according to Q. We use ∥Q_1 − Q_2∥_TV := sup_{S ⊆ X} |Q_1(S) − Q_2(S)| to denote the total variation distance between Q_1 and Q_2. We say Q_1 is ϵ-close to Q_2 if ∥Q_1 − Q_2∥_TV is at most ϵ. Conversely, we say Q_1 is ϵ-far from Q_2 if ∥Q_1 − Q_2∥_TV is greater than ϵ. We use the standard O, Ω, Θ notation for asymptotic functions. Additionally, we use Õ, Ω̃, and Θ̃ to hide polylog factors.
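For discrete distributions, the supremum in the TV definition is attained by the set of elements where one distribution exceeds the other, which gives the familiar half-L1 formula. A minimal sketch (the dict-based PMF representation is our own choice):

```python
def tv_distance(q1, q2):
    # ||Q1 - Q2||_TV = sup_S |Q1(S) - Q2(S)| = (1/2) * sum_x |Q1(x) - Q2(x)|
    # for discrete distributions given as {element: probability} dicts.
    support = set(q1) | set(q2)
    return 0.5 * sum(abs(q1.get(x, 0.0) - q2.get(x, 0.0)) for x in support)
```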
Scheffé sets: For every pair of hypotheses H_i and H_j in H, we define the Scheffé set of H_i and H_j as follows:

S_{i,j} := {x ∈ X | H_i(x) < H_j(x)} if i ≤ j, and S_{i,j} := S_{j,i} if i > j. (2)

It is known that the Scheffé set maximizes the probability discrepancy between H_i and H_j, thus fully characterizing the total variation distance between the two distributions:

∥H_i − H_j∥_TV = sup_{S ⊆ X} |H_i(S) − H_j(S)| = |H_i(S_{i,j}) − H_j(S_{i,j})|. (3)

The optimal hypothesis: Recall that we assume the algorithm is given samples drawn from an unknown distribution P. Let H_{i*} denote the closest hypothesis in H to P. That is, H_{i*} is the hypothesis for which ∥H_{i*} − P∥_TV = OPT. When there is more than one hypothesis with this property, we pick one arbitrarily as H_{i*}.

Semi-distances: For every pair i, j ∈ [n], we define w_j(H_i) to be |H_i(S_{i,j}) − P(S_{i,j})|. In words, w_j(H_i) is the distance of H_i to P observed on the Scheffé set of H_i and H_j. For every pair i, j ∈ [n], we use ŵ_j(H_i) to denote an estimate of w_j(H_i) based on the observed sample. For the sake of consistency, we define w_i(H_i) to be zero. In addition, we define the score function W(H_i) := max_{j ∈ [n]} w_j(H_i). Similarly, Ŵ(H_i) is defined to be max_{j ∈ [n]} ŵ_j(H_i).

Refined access model: Similar to previous work [DL01, MS08], we use estimates of the semi-distances. One can easily estimate these quantities, denoted by ŵ_j(H_i), via the access model described earlier, by comparing H_i(S_{i,j}) with the empirical fraction of the samples that fall in S_{i,j}. Throughout this paper, we assume that there are two universal parameters δ′ = Θ(δ) and ϵ′ = Θ(ϵ) for which, with probability 1 − δ′, every ŵ_j(H_i) is within ϵ′ of w_j(H_i):

∀ i, j ∈ [n] : |ŵ_j(H_i) − w_j(H_i)| ≤ ϵ′.

A simple application of the Hoeffding bound and the union bound shows that one can compute all of the estimates via s = O(log(n/δ′)/ϵ′²) samples, and each estimate can be computed in O(s) time. Our algorithms access the distributions in H ∪ {P} only via querying the ŵ_j(H_i)'s.
This fact brings the sample complexity of our algorithms to s = O(log(n/δ′)/ϵ′²) samples. The time complexity of our algorithms is determined by the number of queries they make to the ŵ_j(H_i)'s, multiplied by the time we spend on each query, O(s). Moreover, in the proofs of our theorems, we assume without loss of generality that all the ŵ_j(H_i)'s are accurate. Conditioning on this accuracy does not decrease the probability of correctness of any of our algorithms by more than δ′, due to Fact C.1.

3 Overview of our techniques

In this section, we provide a high-level discussion of our algorithms. The important notation and definitions used here are provided in Section 2. For the high-level discussions in this section, we assume we have access to the exact values of the semi-distances w_j(H_i). In the formal proofs presented in subsequent sections, we substitute each w_j(H_i) with an estimated value ŵ_j(H_i). If the error of these estimates is below ϵ′ = Θ(ϵ), it can be shown that the overall impact of this substitution on our final distance guarantee (Equation (1)) is at most Θ(ϵ). See the refined access model in Section 2 for further details.

3.1 Background: Semi-distances and the minimum distance estimate

To solve the hypothesis selection problem, we seek a certificate that ensures we output a hypothesis Ĥ such that ∥Ĥ − P∥_TV is at most 3·OPT. Similar to previous work, our algorithms operate based on the probability masses of the Scheffé sets (Equation (2)) of pairs of hypotheses in H. The semi-distance w_j(H_i), defined as |H_i(S_{i,j}) − P(S_{i,j})|, captures the distance between H_i and P as measured on this particular set S_{i,j}. One way to internalize the semi-distances is to view w_j(H_i) as a distance between H_i and P measured from the perspective of H_j. By definition, w_j(H_i) is always a lower bound for ∥H_i − P∥_TV:

w_j(H_i) := |H_i(S_{i,j}) − P(S_{i,j})| ≤ sup_{S ⊆ X} |H_i(S) − P(S)| = ∥H_i − P∥_TV.
However, it is possible for w_j(H_i) to be much lower, making it difficult for an algorithm to estimate the TV distance between H_i and P using semi-distances alone. Nevertheless, for each hypothesis H_i, one specific semi-distance, w_{i*}(H_i), associated with the optimal hypothesis H_{i*}, determines its quality. As the following derivation shows, if w_{i*}(H_i) ≤ OPT, then H_i is 3·OPT-close to P; the total variation distance between H_i and P is bounded via w_{i*}(H_i) and OPT using the triangle inequality:

∥H_i − P∥_TV ≤ ∥H_i − H_{i*}∥_TV + ∥H_{i*} − P∥_TV
            = |H_i(S_{i,i*}) − H_{i*}(S_{i,i*})| + OPT   (by Eq. (3))
            ≤ w_{i*}(H_i) + w_i(H_{i*}) + OPT
            ≤ w_{i*}(H_i) + 2·OPT.

This observation implies that if we can assert that w_{i*}(H_i) is bounded by OPT, we can output H_i as our final solution and be done. The challenge is that we know neither i* nor OPT. This issue is addressed by the minimum distance estimate presented in [DL01, MS08] using the score function W(H_i) defined earlier: W(H_i) := max_{j ∈ [n]} w_j(H_i). The minimum distance estimate outputs a hypothesis Ĥ that minimizes W(H_i). We can assert that, for the output Ĥ of this algorithm, w_{i*}(Ĥ) is at most OPT. This approach simultaneously bypasses the issues of not knowing i* and OPT. Note that W(H_i) serves as a proxy for the quality of H_i, since W(H_i) is an upper bound for w_{i*}(H_i); using W(H_i), we address the issue of not knowing i*. On the other hand, although OPT is not known, we have a good lower bound for it, namely W(H_{i*}). Putting these observations together, we obtain:

∥Ĥ − P∥_TV ≤ w_{i*}(Ĥ) + 2·OPT ≤ W(Ĥ) + 2·OPT ≤ W(H_{i*}) + 2·OPT ≤ 3·OPT.

These inequalities are derived using the triangle inequality and the fact that Ĥ was chosen to be arg min_{i ∈ [n]} W(H_i). See Figure 1 for an illustration of the above inequality. The primary hurdle with the minimum distance estimate is that it is costly to compute. Computing each W(H_i) takes O(n·s) time, bringing the total time complexity of the algorithm to O(n²·s).
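As a reference point, the quadratic-time baseline just described can be sketched as follows for discrete distributions. The dict-based representation and helper names are our own; the algorithms in this paper only ever touch the precomputed ŵ_j(H_i) values.

```python
def scheffe_set(H, i, j, domain):
    # Canonical Scheffé set S_{i,j} = {x : H_a(x) < H_b(x)} with a = min(i,j)
    # and b = max(i,j), matching Equation (2)'s symmetric convention.
    a, b = min(i, j), max(i, j)
    return {x for x in domain if H[a].get(x, 0.0) < H[b].get(x, 0.0)}

def minimum_distance_estimate(H, domain, samples):
    # Output argmin_i W(H_i), where W(H_i) = max_{j != i} w_hat_j(H_i) and
    # w_hat_j(H_i) compares H_i's mass on S_{i,j} with the empirical mass
    # of the sample on S_{i,j}.  Time O(n^2 * s), as stated in the text.
    n, s = len(H), len(samples)
    def w_hat(i, j):
        S = scheffe_set(H, i, j, domain)
        empirical = sum(1 for x in samples if x in S) / s
        return abs(sum(H[i].get(x, 0.0) for x in S) - empirical)
    return min(range(n), key=lambda i: max(w_hat(i, j) for j in range(n) if j != i))
```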
One might naturally conjecture that sampling could help to compute an estimate of W(H_i): instead of using W(H_i) := max_{j ∈ [n]} w_j(H_i), we could use W̃(H_i) := max_{j ∈ R} w_j(H_i), where R is a set of random indices in [n]. The issue with this approach is that there is no guarantee of i* being selected in R, which can make W̃(H_i) too low while H_i is far from P. Hence sampling, if used in a trivial manner, is not beneficial.

Figure 1: Bounding ∥H_i − P∥_TV via W(H_i) and OPT. The figure depicts the probability of the Scheffé set according to P, H_i, and H_{i*}; the TV distance is the absolute difference of the probabilities on the Scheffé set.

3.2 The algorithm with α = 3

In this section, we present an overview of Algorithm 1, which attains α = 3. The details of this algorithm and its related theorems are provided in Section A. At a high level, similar to the minimum distance estimate, we work towards finding a hypothesis that minimizes W(H_i). To increase efficiency, we work with estimates of the W(H_i)'s, denoted by W̃(H_i). The general structure of our algorithm is as follows. Initially, all W̃(H_i) are set to zero. At every step, we come up with several pairs of hypotheses H_i and H_j and update our estimates by setting W̃(H_j) to max(W̃(H_j), w_i(H_j)). Our approach ensures that at every step, W̃(H_i) is equal to max_{j ∈ R} w_j(H_i) for a small, carefully chosen set R. Eventually, we output a hypothesis with roughly the smallest W̃(H_i).

Focusing on small W̃: Our first idea is to focus on updating W̃ for hypotheses whose W̃(H_i) values are around the current minimum of W̃. The rationale for this comes from a simple fact: W̃(H_i) always underestimates the value of W(H_i). Hence, if we observe that W̃(H_i) is substantially larger than the current minimum, then W(H_i) is also substantially larger than the current minimum. This implies that H_i is not a suitable candidate for the minimum at this stage of the algorithm, and it can be ignored for now.
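The estimate-maintenance loop just described can be sketched as follows. The schedule of pairs is the heart of the algorithm and is abstracted here as an input stream; that simplification, and the function names, are ours.

```python
def maintain_and_select(n, pair_schedule, w):
    # W~ starts at zero and never decreases: each proposed pair (i, j)
    # applies W~(H_j) <- max(W~(H_j), w_i(H_j)), so at any point W~(H_j)
    # equals the max of w_i(H_j) over the set R of i's proposed for j so
    # far.  Finally, a hypothesis with the smallest W~ is returned.
    W_tilde = [0.0] * n
    for i, j in pair_schedule:
        W_tilde[j] = max(W_tilde[j], w(i, j))
    return min(range(n), key=lambda j: W_tilde[j])
```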
Bucketing hypotheses based on W̃: To implement this idea, we partition the hypotheses into k = Θ(1/ϵ′) buckets B := {B_1, B_2, . . . , B_k} based on W̃(H_i). Bucket ℓ contains all the hypotheses H_i such that W̃(H_i) ∈ [(ℓ−1)·ϵ′, ℓ·ϵ′). At every step, we focus on the smallest nonempty bucket B_ℓ (the smallest ℓ for which |B_ℓ| ≠ 0). B_ℓ contains all the hypotheses whose W̃(H_j) is around the minimum of W̃. We pick pairs of hypotheses, H_i ∈ H and H_j ∈ B_ℓ, and update W̃(H_j). Note that our updates may increase W̃(H_j), and we may remove H_j from B_ℓ and put it into a larger bucket (a bucket with a larger index ℓ). Also, observe that we never move a hypothesis into a smaller bucket, since W̃(H_j) never decreases. We continue our updates until we reach one of the following outcomes:

- B_ℓ becomes empty. That is, our updates made all W̃(H_i) fall out of the range of these buckets, [0, ℓ·ϵ′). Thus, every W̃(H_i) (and consequently every W(H_i)) is at least ℓ·ϵ′. Every time we empty out a bucket, we have increased our threshold for the minimum W(H_i) by ϵ′. Hence, we are getting closer to the bucket containing the true minimum, which we hope to reach in O(1/ϵ′) steps.
- B_ℓ is not empty, but we can confidently confirm that most of the hypotheses in B_ℓ are an acceptable output for the algorithm. Although we cannot ensure that H_{i*} is indeed in B_ℓ, we can find an acceptable final answer by selecting a random hypothesis in B_ℓ.

Which pairs to update? Next, we outline our update scheme, which implements the above ideas in linear time. To enhance time efficiency, we aim to optimize the updating process to ensure both quality and quantity in the chosen updates. Quality, in this context, relates to the extent of change in W̃(H_j) following an update: we consider H_i to have made a substantial update to H_j if W̃(H_j) + ϵ′ < ŵ_i(H_j). Such updates cause a significant shift, increasing the value of W̃(H_j) by more than ϵ′ and subsequently moving H_j to a bucket with a higher index ℓ. We refer to this event as H_i removing H_j from its bucket.
The quantity, on the other hand, relates to the number of hypotheses H_j that a given H_i can remove from B_ℓ. An ideal H_i eliminates a substantial portion of the hypotheses in B_ℓ (say, a constant fraction). We call such an H_i a prompting hypothesis. Now, if for O(log |B_ℓ|) rounds we find a prompting hypothesis and update all the W̃(H_j) for H_j ∈ B_ℓ, we will empty out the bucket B_ℓ, and we can move forward.

Finding a prompting hypothesis: To find a prompting hypothesis quickly, we iterate over all H_i in H, sample a few H_j's, and check whether H_i substantially updates W̃(H_j). If H_i substantially improves a large fraction of the sampled hypotheses, we declare that H_i is a prompting hypothesis. See Section A.1.1 for further details. In addition, we provide a more advanced version of this procedure in Section A.1.2 that allows us to shave off an O(log n) factor.

Getting stuck? Here is your way out: What happens when B_ℓ is not empty and we cannot find a prompting hypothesis? We show that if we do not find a prompting hypothesis, then a random hypothesis in B_ℓ is an acceptable answer. The surprising part about this statement is that it holds regardless of the size of the bucket B_ℓ, due to the hypothesis-sampling procedure we use to find a prompting hypothesis. In the search for a prompting hypothesis, we iterate over all H_i and sample roughly O(log(n)/δ) many hypotheses H_j in B_ℓ, checking whether H_i can remove them from the bucket. Note that if H_i was not found to be prompting, it implies that H_i cannot substantially update most of the hypotheses in B_ℓ. Thus, with high probability, for a 1 − δ fraction of the hypotheses H_j in B_ℓ, we have w_i(H_j) ≤ ℓ·ϵ′. The clever trick here is an observation about H_{i*}. Earlier, we discussed that we are looking for a hypothesis H_i with w_{i*}(H_i) ≤ OPT. We claim that a random hypothesis in the last bucket B_ℓ (almost) has this property. On one hand, given that H_{i*} was not found to be a prompting hypothesis, for a 1 − δ fraction of the hypotheses H_i in B_ℓ, w_{i*}(H_i) must be at most ℓ·ϵ′.
On the other hand, the fact that all the previous buckets, B_1, . . . , B_{ℓ−1}, are empty indicates that H_{i*} has shown a semi-distance of at least (ℓ−1)·ϵ′. Hence, OPT, which is at least as large as all of H_{i*}'s semi-distances, is at least (ℓ−1)·ϵ′. Therefore, w_{i*}(H_i) ≤ ϵ′ + OPT. It follows that a 1 − δ fraction of the hypotheses in B_ℓ are (3·OPT + Θ(ϵ))-close to P. For a formal argument, see Lemma A.1.

Now, assume the search for a prompting hypothesis fails. Recall that during the search, we checked every single hypothesis in H; at some point, we must have stumbled upon H_{i*} and not declared it a prompting hypothesis. In this case, either our sampling for substantial updates failed (which happens with small probability), or we can infer that there are not too many far hypotheses in B_ℓ. Either way, we can output a random hypothesis from B_ℓ as the final answer without increasing the error probability by too much.

Dependency on δ: It is worth noting that this last step results in a polynomial dependency on δ (as opposed to the more desirable dependency of log(1/δ)). This is mainly because, to ensure that a random hypothesis in B_ℓ is not too far with probability 1 − δ in a single round, we have to try O(log(n)/δ) hypotheses in B_ℓ and check whether H_i prompts them. Hence, relying on this structural property of H_{i*} makes a polynomial dependency on δ inevitable. Improving the dependency on δ for this algorithm would require new algorithmic ideas.

3.3 The algorithm with α = 4

We introduce another (almost) linear-time algorithm for the hypothesis selection problem whose time complexity has improved dependency on ϵ. However, this algorithm has a slightly worse accuracy parameter than our previous one: α = 4. This algorithm and the related theorems are provided in Section B.
Algorithm with a guessed upper bound on OPT: As mentioned earlier, unlike previous work, our algorithm does not receive any prior information about the value of OPT. It might be speculated that there exists an easy reduction between our problem and another version of hypothesis selection in which an auxiliary parameter σ is provided to the algorithm such that OPT ≤ σ. A learner without knowledge of σ can guess σ, run one of the existing algorithms that works with knowledge of an upper bound on OPT (e.g., [ABS23]), and check whether it finds a good hypothesis. With this procedure in mind, we can run a binary search to find the smallest σ for which we find a hypothesis.⁴ However, a challenge with this approach is that the algorithm might yield a hypothesis even when σ < OPT, making it difficult for us to refute the hypothesis that is found: even if the output Ĥ satisfies W(Ĥ) > σ, this does not necessarily invalidate our guessed value σ.

To overcome this challenge, we design an algorithm, namely A, with enhanced accuracy guarantees for the output hypothesis. We provide A with the value σ, which we treat as a "guess" for which we hope that OPT ≤ σ. We require the algorithm to either refute our guess and declare OPT > σ, or output a hypothesis Ĥ with a reasonable distance to P, irrespective of the relationship between OPT and σ. More precisely, the distance of the output hypothesis should be bounded by:

∥Ĥ − P∥_TV ≤ α·max(OPT, σ) + ϵ.

With these revised guarantees, it is permissible to run a binary search over different values of σ in {ϵ, 2ϵ, . . . , ⌊1/ϵ⌋·ϵ} and output the hypothesis that A returns on the smallest σ. For the rest of this discussion, we focus on designing A for a provided parameter σ.

Seeds and finding an acceptable hypothesis via seeds: In this algorithm, we introduce a new concept called a seed for hypotheses.
A seed provides a structural property that enables us to identify an acceptable hypothesis, regardless of the relationship between H_{i*} and the seed. More formally, for a hypothesis H_i, we define M_{a,b}(H_i) as the set of hypotheses H_j such that ŵ_j(H_i) is between a and b:

M_{a,b}(H_i) := {H_j ∈ H | a < ŵ_j(H_i) ≤ b}.

We may omit the subscript of M when it is clear from context. We say a hypothesis H_i is a seed if almost all of its ŵ_j(H_i) are low and its Ŵ(H_i) is not too large. The quality of a seed is determined by three parameters: a, b, and m.

Definition 3.1 (Seed). Given parameters a, b ∈ R_{≥0} and a non-negative integer m, we say a hypothesis H_i is an (a, b, m)-seed if the following hold: 1) Ŵ(H_i) ≤ b, and 2) |M_{a,b}(H_i)| ≤ m.

A seed with a small set M and a small b exhibits an interesting property. Suppose we have a seed H_i with a constant-sized set M. In this case, in O(n) time, we can compute W(H_j) for every hypothesis H_j in M. Now, let H_ℓ ∈ M be one of the hypotheses that minimize W(H_j): H_ℓ := arg min_{H_j ∈ M} W(H_j). We claim that either H_i or H_ℓ is an acceptable output. When H_{i*} is in H \ M, we can infer that ŵ_{i*}(H_i) is at most a (and hence w_{i*}(H_i) is at most a + ϵ′). From our earlier discussion, if a is sufficiently small (compared to σ), we can show that H_i is close to P. In cases where H_{i*} is in M, although we do not have strong guarantees for H_i itself, we already have a very good answer: H_ℓ is essentially a minimum distance estimate for the set M, which includes H_{i*}. Thus, we can conclude two things: first, H_ℓ is an acceptable output because it is 3·OPT-close; second, W(H_ℓ) is a lower bound for OPT.

Now, as described, we have acceptable outputs for the two cases of the problem: for H_{i*} ∈ H \ M, H_i is an acceptable answer; for H_{i*} ∈ M, H_ℓ is an acceptable answer. The only problem is that we do not know which case we are in. Our strategy is to pick a hypothesis that is still reasonably close to P even if we are in the other case. Let us fix a threshold value T.
We compare W(Hℓ) with Tσ: 1) If W(Hℓ) ≤ Tσ, as discussed earlier, Hℓ is (Tσ + 2·OPT)-close to P. This bound holds regardless of where Hi∗ is. 2) If W(Hℓ) > Tσ, we cannot say Hℓ is a good choice for us. However, we can conclude that Hi is close to P even when Hi∗ happens to be in M. In this case, we know that OPT ≥ W(Hℓ) > Tσ. We take advantage of this knowledge and obtain the following bound:

∥Hi − P∥TV ≤ b·σ + 2·OPT ≤ (b/T + 2)·OPT .

Thus, regardless of where Hi∗ is, Hi is (max(a, b/T) + 2)·max(σ, OPT)-close. Now, the algorithm is fairly straightforward: if W(Hℓ) is small, output Hℓ; otherwise, output Hi. We summarize these four cases in Table 2. Depending on a and b, we set T to minimize the maximum distance we endure in these cases. It is worth noting that in this argument, we did not rely on any prior knowledge concerning the relationship between σ and OPT.

4This idea works to some extent, but it causes α to be increased by 2, which is not useful in our setting. See [ABS23] for more details.

Table 2: Four cases when we process a good seed.
                   Hi∗ ∈ M                              Hi∗ ∈ H \ M
W(Hℓ) > Tσ:        ∥Hi − P∥TV ≤ (b/T + 2)·OPT           ∥Hi − P∥TV ≤ a·σ + 2·OPT
W(Hℓ) ≤ Tσ:        ∥Hℓ − P∥TV ≤ Tσ + 2·OPT              ∥Hℓ − P∥TV ≤ Tσ + 2·OPT

How to find the first seed? Here, we provide an overview of our approach for identifying an initial seed. For a detailed explanation of the algorithm and its performance, refer to Section B.1. To start, fix the two parameters a = σ and b = 3σ. To identify a seed hypothesis, we iterate over the hypotheses H1, . . . , Hn, checking whether each hypothesis Hi is a strong seed (i.e., a seed with a small |M|) by sampling several Hj's and verifying that ŵj(Hi) is at most a. Roughly speaking, for an integer m ∈ [n], if we sample approximately O(n/m) hypotheses Hj and find no ŵj(Hi) > a, we can, with high probability, confirm that the size of M does not exceed m. This approach requires O(n²/m) time. There are, however, a few caveats with this method.
First, a seed identified via the above approach may have a large Ŵ(Hi) > b. In this case, we can infer that Hi is likely far from P. Given the time invested in identifying Hi, we aim to leverage this information. Broadly, observing that ŵj(Hi) ≤ a implies that many hypotheses are close to Hi (assuming that the ŵi(Hj) values are also small). Knowing that Hi is far from P, we can mark all hypotheses close to Hi as also far, allowing us to proceed with the search. While this may seem counterproductive, marking a significant number of hypotheses as far constitutes progress for our algorithm. Specifically, if at some point all hypotheses are marked, we can declare that σ < OPT. The second caveat is more challenging. Ideally, we seek a seed with a constant m, but finding such a seed would require O(n²/m) = O(n²) time. Consequently, for a linear-time algorithm, we can only afford to find seeds with m = Θ(n). In other words, the quality of the seed we can initially identify is much lower than the quality required for producing a solution. This leads to our next idea: boosting a seed, an approach that incrementally improves the seed's quality in roughly O(log n) steps.

Boosting a seed: In this process, we use an initial seed to iteratively find a stronger seed with a reduced value of |M|. For a formal argument, see Section B.2. Assume a rate parameter η ∈ (0, 1). As discussed earlier, by spending O(n/η) time, we can identify a seed with m ≤ η·n. Initially, m = |M| might be Θ(n). Hence, rather than calculating W(Hj) exactly for all Hj ∈ M, we compute an approximate maximum semi-distance W̃(Hj) by sampling t hypotheses. Specifically, for each Hj ∈ M, we set W̃(Hj) := max_{Hk} ŵk(Hj), where the Hk's are sampled hypotheses. Using these approximate values W̃, we process the seed as follows. Let Hℓ denote the hypothesis minimizing W̃(Hj), with the following possible outcomes: 1.
High W̃(Hℓ): If W̃(Hℓ) is high, all W̃(Hj) (and thus all W(Hj)) values are likely large, making Hi an acceptable final solution. 2. Low W̃(Hℓ): While this does not guarantee a low W(Hℓ), it suggests that most ŵj(Hℓ) values are small. Based on the value of W(Hℓ), we consider two cases: 2.1 W̃(Hℓ) ≪ W(Hℓ): Hℓ is likely far from P but has many close hypotheses, allowing us to mark these nearby hypotheses as far. 2.2 Moderate W(Hℓ): In this case, Hℓ can serve as our new seed, as it yields a smaller set M(Hℓ) than Hi. Note that when we select Hℓ as our new seed, |M(Hℓ)| is roughly O(n/t). However, we compute the W̃(Hj)'s for only |M(Hi)| ≤ η·n hypotheses. Hence, with an increased sample size t = O(1/η²), this approach yields a smaller |M(Hℓ)| ≤ η²·n. Each step of the process requires O(|M|·t) = O(η·n/η²) = O(n/η) time, paralleling the initial search. By repeating, we progressively reduce m = |M|; for a constant η, m decreases by a fixed factor in each step until we reach a constant-m seed, as desired.

Acknowledgments

M.A. was supported by NSF awards CNS-2120667, CNS-2120603, CCF-1934846, and BU's Hariri Institute for Computing. This work was initiated while M.A. was affiliated with Boston University and Northeastern University and was done in part while M.A. was a research fellow at the Simons Institute for the Theory of Computing. M.B. was supported in part by NSF award CNS-2046425 and a Sloan Research Fellowship. A.S. was supported in part by NSF awards CCF-1763786 and CNS-2120667 as well as Faculty Awards from Google and Apple.

[AAC+23] Anders Aamand, Alexandr Andoni, Justin Y. Chen, Piotr Indyk, Shyam Narayanan, and Sandeep Silwal. Data structures for density estimation. In Proceedings of the 40th International Conference on Machine Learning. PMLR, 2023.

[ABH+20] Hassan Ashtiani, Shai Ben-David, Nicholas J. A. Harvey, Christopher Liaw, Abbas Mehrabian, and Yaniv Plan. Near-optimal sample complexity bounds for robust learning of Gaussian mixtures via compression schemes. J. ACM, 67(6):32:1–32:42, 2020.
[ABM18] Hassan Ashtiani, Shai Ben-David, and Abbas Mehrabian. Sample-efficient learning of mixtures. In Sheila A. McIlraith and Kilian Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 2679–2686. AAAI Press, 2018.

[ABS23] Maryam Aliakbarpour, Mark Bun, and Adam Smith. Hypothesis selection with memory constraints. In NeurIPS 2023, 2023.

[ADLS17] Jayadev Acharya, Ilias Diakonikolas, Jerry Li, and Ludwig Schmidt. Sample-optimal density estimation in nearly-linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 1278–1289. SIAM, 2017.

[AFJ+18] Jayadev Acharya, Moein Falahatgar, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. Maximum selection and sorting with adversarial comparators. The Journal of Machine Learning Research, 19:59:1–59:31, 2018.

[AJOS14] Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. Sorting with adversarial comparators and application to density estimation. In 2014 IEEE International Symposium on Information Theory, pages 1682–1686, 2014.

[BBK+21] Olivier Bousquet, Mark Braverman, Gillat Kol, Klim Efremenko, and Shay Moran. Statistically near-optimal hypothesis selection. In 62nd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2021, Denver, CO, USA, February 7-10, 2022, pages 909–919. IEEE, 2021.

[BKM19] Olivier Bousquet, Daniel Kane, and Shay Moran. The optimal approximation factor in density estimation. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 318–341. PMLR, 25–28 Jun 2019.

[BKSW19] Mark Bun, Gautam Kamath, Thomas Steinke, and Zhiwei Steven Wu. Private hypothesis selection. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B.
Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 156–167, 2019.

[CDKL22] Clément L. Canonne, Ilias Diakonikolas, Daniel Kane, and Sihan Liu. Nearly-tight bounds for testing histogram distributions. In Advances in Neural Information Processing Systems 35 (NeurIPS), 2022. To appear.

[CDSS14] Siu-on Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun. Near-optimal density estimation in near-linear time using variable-width histograms. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 1844–1852, 2014.

[CKM+19] Clément L. Canonne, Gautam Kamath, Audra McMillan, Adam D. Smith, and Jonathan R. Ullman. The structure of optimal private tests for simple hypotheses. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 310–321. ACM, 2019.

[DDS12] Constantinos Daskalakis, Ilias Diakonikolas, and Rocco A. Servedio. Learning Poisson binomial distributions. In Howard J. Karloff and Toniann Pitassi, editors, Proceedings of the 44th Symposium on Theory of Computing Conference, STOC 2012, New York, NY, USA, May 19-22, 2012, pages 709–728. ACM, 2012.

[Dia16] Ilias Diakonikolas. Learning structured distributions. In CRC Handbook of Big Data, pages 267–283. 2016.

[DK14] Constantinos Daskalakis and Gautam Kamath. Faster and sample near-optimal algorithms for proper learning mixtures of Gaussians. In Conference on Learning Theory, pages 1183–1213. PMLR, 2014.

[DKS17] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures.
In Chris Umans, editor, 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, pages 73–84. IEEE Computer Society, 2017.

[DL96] Luc Devroye and Gábor Lugosi. A universally acceptable smoothing factor for kernel density estimates. The Annals of Statistics, 24(6):2499–2512, 1996.

[DL97] Luc Devroye and Gábor Lugosi. Nonasymptotic universal smoothing factors, kernel complexity and Yatracos classes. The Annals of Statistics, 25(6):2626–2637, 1997.

[DL01] Luc Devroye and Gábor Lugosi. Combinatorial methods in density estimation. Springer, 2001.

[GKK+20] Sivakanth Gopi, Gautam Kamath, Janardhan Kulkarni, Aleksandar Nikolov, Zhiwei Steven Wu, and Huanyu Zhang. Locally private hypothesis selection. In Jacob D. Abernethy and Shivani Agarwal, editors, Conference on Learning Theory, COLT 2020, 9-12 July 2020, Virtual Event [Graz, Austria], volume 125 of Proceedings of Machine Learning Research, pages 1785–1816. PMLR, 2020.

[KMV12] Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Disentangling Gaussians. Commun. ACM, 55(2):113–120, 2012.

[KSS18] Pravesh K. Kothari, Jacob Steinhardt, and David Steurer. Robust moment estimation and improved clustering via sum of squares. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 1035–1046. ACM, 2018.

[MS08] Satyaki Mahalanabis and Daniel Stefankovic. Density estimation in linear time. In 21st Annual Conference on Learning Theory - COLT, pages 503–512, 2008.

[Pea95] Karl Pearson. Mathematical contributions to the theory of evolution. II. Skew variation in homogeneous material [abstract]. Proceedings of the Royal Society of London, 57:257–260, 1895.

[SOAJ14] Ananda Theertha Suresh, Alon Orlitsky, Jayadev Acharya, and Ashkan Jafarpour. Near-optimal-sample estimators for spherical Gaussian mixtures. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q.
Weinberger, editors, Advances in Neural Information Processing Systems, volume 27, 2014.

[Yat85] Yannis G. Yatracos. Rates of convergence of minimum distance estimators and Kolmogorov's entropy. The Annals of Statistics, 13(2):768–774, 1985.

Algorithm 1 An algorithm for hypothesis selection with α = 3.
1: procedure SELECT-HYPOTHESIS(H, ϵ, δ, and query access to the ŵj(Hi)'s)
2:   W(Hi) ← 0 for every i ∈ [n]
3:   k ← ⌈1/ϵ⌉ + 1
4:   B1 ← H
5:   B2, . . . , Bk ← ∅
6:   ϵ′ ← ϵ/3
7:   δ′ ← δ/3
8:   γ ← δ′
9:   δfn ← δ′/k
10:  δfp ← γ·δ′/(4·k·n·log(n))
11:  for ℓ = 1, . . . , k do   ▷ Iterating over buckets
12:    for i = 1, . . . , n do   ▷ Iterating over hypotheses to find a prompting one
13:      if |Bℓ| = 0 then
14:        Break the for loop and move on to the next bucket.
15:      if IS-PROMPTING(B, ℓ, Hi, γ, ϵ′, δ′) then   ▷ Checks if Hi is prompting.
16:        Cℓi ← 0   ▷ A counter for the number of hypotheses that Hi removes from Bℓ
17:        for Hj ∈ Bℓ do   ▷ Update all hypotheses in Bℓ via the prompting hypothesis
18:          if ŵi(Hj) > W(Hj) + ϵ′ then
19:            W(Hj) ← max(W(Hj), ŵi(Hj))
20:            Move Hj to B⌈W(Hj)/ϵ′⌉
21:            Cℓi ← Cℓi + 1
22:        if Cℓi/(Cℓi + |Bℓ|) < γ/2 then   ▷ Indicates Hi was not prompting.
23:          Output ⊥ and halt.
24:        i ← 0   ▷ Restarts the for loop.
25:  if |Bℓ| ≠ 0 then
26:    Ĥ ← a random hypothesis in Bℓ
27:    Output Ĥ and halt.

A Almost linear time algorithm with α = 3

In this section, we focus on our first result, the algorithm with α = 3. The pseudocode of our approach is presented in Algorithm 1, and we prove its performance in Theorem 5. An overview of our approach is described in Section 3.2. At a high level, our algorithm begins with all W(Hi) set to zero and all hypotheses placed in B1. We then iterate over the buckets. In round ℓ, we check whether there is a prompting hypothesis Hi ∈ H that can move a significant fraction of hypotheses out of Bℓ. Specifically, we look for an Hi that moves a γ fraction of the hypotheses in Bℓ by updating them, where γ = Θ(δ). If we find such a hypothesis Hi, we update all the hypotheses in Bℓ and move as many as possible out of Bℓ.
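As a concrete illustration of this bucket-based loop, here is a much-simplified, runnable sketch. It replaces the sampled IS-PROMPTING test with an exhaustive scan and fixes γ, so it runs in O(n²) rather than near-linear time; `w_hat(i, j)` is a stand-in for query access to the estimates ŵi(Hj).

```python
import math

def select_hypothesis(n, w_hat, eps):
    """Simplified sketch of the bucket loop of Algorithm 1 (not the paper's
    near-linear implementation).

    w_hat(i, j): query access to the estimated semi-distance ŵi(Hj), in [0, 1].
    Returns the index of a hypothesis from the first bucket that cannot be
    emptied, or None if every bucket is emptied (which the analysis rules out).
    """
    gamma = 0.1                      # stands in for gamma = Theta(delta)
    k = int(1 / eps) + 1
    W = [0.0] * n                    # current scores W(H_i)
    buckets = [set(range(n))] + [set() for _ in range(k - 1)]

    def bucket_of(x):                # bucket index determined by the score
        return min(int(math.floor(x / eps)), k - 1)

    for ell in range(k):
        progressed = True
        while buckets[ell] and progressed:
            progressed = False
            for i in range(n):       # look for a prompting hypothesis H_i
                movable = [j for j in buckets[ell] if w_hat(i, j) > W[j] + eps]
                if len(movable) >= gamma * len(buckets[ell]):
                    for j in movable:           # update scores and move out of B_ell
                        W[j] = max(W[j], w_hat(i, j))
                        buckets[ell].discard(j)
                        buckets[bucket_of(W[j])].add(j)
                    progressed = True
                    break            # restart the scan over i
        if buckets[ell]:             # bucket could not be emptied
            return next(iter(buckets[ell]))     # any remaining hypothesis
    return None
```

Since every successful update raises W(Hj) by more than ϵ, each move sends a hypothesis to a strictly later bucket, which is what guarantees termination here.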
During this process, we count the number of hypotheses that Hi removes from Bℓ, tracked by Cℓi in the algorithm. If the actual fraction of hypotheses removed from Bℓ is less than γ/2, we infer that IS-PROMPTING falsely declared Hi a prompting hypothesis (a false positive), and the algorithm returns ⊥. Otherwise, we restart the search by setting i = 1 and iterate over all Hi to find the next prompting hypothesis. We continue this process until no prompting hypothesis is found or the bucket is emptied. If we completely empty a bucket, we move on to the next one. If not, we select a random hypothesis in Bℓ as the final output and halt.

Theorem 5. Suppose H is a class of n known hypotheses and P is an unknown hypothesis. Assume the algorithm has access to estimates ŵj(Hi) with error at most ϵ′. Let ϵ and δ be two parameters in (0, 1). If 3ϵ′ ≤ ϵ, then Algorithm 1 is an (α = 3, ϵ, δ)-proper learner for P in H. The running time of our algorithm is O((n/(δ³ϵ))·Tq), where Tq is the time to obtain each ŵj(Hi).

Proof. Our proof consists of three parts: 1. If the algorithm returns Ĥ, then with high probability Ĥ is not too far from P. 2. While the algorithm may not always produce a hypothesis, its overall failure probability is low. 3. The running time of the algorithm meets the bounds stated in the theorem.

The output Ĥ is close with high probability. Suppose we output Ĥ ∈ Bℓ at step ℓ. Our goal is to show that Ĥ is (3·OPT + ϵ)-close to P with probability 1 − δ′. We define two events Eℓ and Ēℓ for each ℓ ∈ [k] based on the fraction of far hypotheses in Bℓ in Line 25:5

Eℓ := the event that a non-empty Bℓ in Line 25 contains at least γ·|Bℓ| hypotheses that are (3·OPT + ϵ)-far from P.
Ēℓ := the event that a non-empty Bℓ in Line 25 contains fewer than γ·|Bℓ| hypotheses that are (3·OPT + ϵ)-far from P.

Conditioning on the event Ēℓ, fewer than a γ fraction of the hypotheses in Bℓ are far.
Since Ĥ is a randomly selected hypothesis in Bℓ, the probability of Ĥ being far is less than γ:

Pr[∥Ĥ − P∥TV > 3·OPT + ϵ | Ēℓ] < γ . (4)

Next, we focus on the case where Eℓ holds: at least a γ fraction of the hypotheses in Bℓ are (3·OPT + ϵ)-far from P. In the following lemma, we show that Hi∗ is a prompting hypothesis for such a Bℓ. The proof of this lemma is available in Section A.2.

Lemma A.1. Suppose Bℓ ⊆ H is the smallest non-empty bucket for which at least a γ fraction of its hypotheses are (3·OPT + ϵ)-far from P. Then Hi∗ is a prompting hypothesis that can remove a γ fraction of the hypotheses in Bℓ.

If we output a hypothesis in Bℓ, clearly the prompting hypotheses failed to empty out the bucket. Therefore, the last time the algorithm iterated over all i ∈ [n], the IS-PROMPTING procedure returned false for every single Hi, and no restart happened. However, during those iterations, at some point i reached i∗, implying that the IS-PROMPTING procedure did not catch that Hi∗ is a prompting hypothesis. This event happens with probability at most δfn:

Pr[Eℓ] ≤ Pr[IS-PROMPTING returns false for Hi∗ and Hi∗ is a prompting hypothesis]
≤ Pr[IS-PROMPTING returns false for Hi∗ | Hi∗ is a prompting hypothesis] ≤ δfn . (5)

Given Equation (4) and Equation (5), we bound the probability of outputting a far Ĥ as follows:

Pr[Outputting Ĥ and ∥Ĥ − P∥TV > 3·OPT + ϵ]
≤ Σ_{ℓ=1}^{k} Pr[∥Ĥ − P∥TV > 3·OPT + ϵ | Ēℓ]·Pr[Ēℓ] + Σ_{ℓ=1}^{k} Pr[Eℓ]
≤ γ + k·δfn = 2·δ′ .

The last line is due to the fact that we set δfn = δ′/k and γ = δ′.

5It is worth noting that we use the Ēℓ notation somewhat non-rigorously here: it does not denote the complementary event of Eℓ, but rather the complementary event of Eℓ conditioned on outputting Ĥ from the bucket Bℓ.

Bounding the overall error probability: We have shown that if we output Ĥ, it is an acceptable answer with high probability.
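As a quick numeric sanity check of this accounting, the parameter settings of Algorithm 1 (δ′ = δ/3, γ = δ′, δfn = δ′/k) can be plugged in with illustrative values of δ and ϵ:

```python
# Numeric check of the failure-probability accounting in the proof of
# Theorem 5; delta and eps below are illustrative values, not from the paper.
delta = 0.3
eps = 0.05
delta_p = delta / 3                      # delta' = delta/3
k = int(1 / eps) + 1                     # number of buckets
gamma = delta_p                          # gamma = delta'
delta_fn = delta_p / k                   # delta_fn = delta'/k

# probability of outputting a far hypothesis: k false-negative terms plus the
# chance of drawing a far hypothesis from a mostly-good bucket
far_output_prob = k * delta_fn + gamma
assert abs(far_output_prob - 2 * delta_p) < 1e-12
assert far_output_prob <= 2 * delta / 3 + 1e-12
```

The point of the check is that the bound γ + k·δfn collapses to 2δ′ for any choice of δ and ϵ, leaving δ′ of slack for the ⊥ case in the overall budget of 3δ′ = δ.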
Now, we need to show that the overall error probability of the algorithm is not too high either. Our first claim is that the algorithm always terminates, i.e., we do not restart the for loop in Line 24 infinitely many times. Note that a restart happens only after we have checked, in Line 22, that at least a γ/2 fraction of the hypotheses were removed from Bℓ. Therefore, the total number of restarts for any given bucket is at most log_{1/(1−γ/2)}(|Bℓ|) + 1 ≤ 4·log(n)/γ.

In addition, we claim the algorithm does not end without outputting Ĥ or ⊥. Every Hi belongs to some bucket at every step of the algorithm: since ŵj(Hi) ∈ [0, 1], we always have 0 ≤ W(Hi) ≤ 1 < k·ϵ′. Thus, it is not possible to have |Bℓ| = 0 for all ℓ ∈ [k] in Line 25. For a bucket that cannot be emptied out, the algorithm outputs either ⊥ or Ĥ.

Next, we show that the probability of outputting ⊥ is at most δ′. Note that the algorithm outputs ⊥ when Hi was not able to remove at least a γ/2 fraction of the hypotheses in Bℓ. As established earlier, a restart happens at most 4·log(n)/γ times for every bucket. Therefore, the IS-PROMPTING procedure is invoked at most 4·k·n·log(n)/γ times overall. Each invocation has probability at most δfp of erroneously declaring a hypothesis prompting. The check in Line 22 detects when this error has happened, in which case the algorithm outputs ⊥. Hence, we have:

Pr[Outputting ⊥] ≤ (4·k·n·log(n)/γ)·δfp ≤ δ′ . (6)

The final inequality follows from setting δfp = γ·δ′/(4·k·n·log(n)). Thus, we can bound the overall error probability as follows:

Pr[Wrong answer] = Pr[Outputting ⊥] + Pr[Outputting Ĥ s.t. ∥Ĥ − P∥TV > 3·OPT + ϵ] ≤ 3·δ′ = δ .

The last equality holds since δ′ = δ/3.

Running time: The initialization takes O(n + k) time. The algorithm runs over k buckets. For each bucket, we may restart the for loop O(log(n)/γ) times. Within each for loop before restarting, we invoke the IS-PROMPTING procedure O(n) times where, with the exception of one of them, all return false.
Let T_false^(i) denote the time that the procedure spends on Hi when it returns false, and let Ttrue be the corresponding quantity when the procedure returns true (which happens at most once before a restart). If we find a prompting hypothesis, we query the ŵi(Hj)'s O(n) times and spend O(n·Tq) on this step. Hence, the total time complexity of one pass is:

Total time complexity = O( Σ_i T_false^(i) + Ttrue + n·Tq ) .

Advanced version of IS-PROMPTING: We describe the guarantees of the advanced version of the IS-PROMPTING procedure in Lemma A.3. In that lemma, we show that the running time in the false case is proportional to the number of hypotheses removed from Bℓ during the procedure; we denote this quantity, in the r-th search for a prompting hypothesis Hi in bucket Bℓ, by n_r^(ℓ,i). Since the total number of hypotheses that can be removed is at most O(n), we have:

Σ_i T_false^(i) = Σ_i O( (log(1/δfn) + log log(1/δfp))·Tq/γ² + n_r^(ℓ,i)·Tq/γ )
= O( n·(log(1/δfn) + log log(1/δfp))·Tq/γ² + n·Tq/γ )
= O( n·(log(1/δfn) + log log(1/δfp))·Tq/γ² ) .

Using the setting of our parameters, 1/δfp = 4·k·n·log(n)/(γ·δ′) = O(n·log(n)/(δ²·ϵ)) and 1/δfn = k/δ′ = O(1/(δ·ϵ)). Hence:

Σ_i T_false^(i) = O( n·(log(1/(δ·ϵ)) + log log(n/(δ·ϵ)))·Tq/γ² ) .

Moreover, for the case that a prompting hypothesis is found:

Ttrue = O( log(1/δfp)·(log(1/δfn) + log log(1/δfp))·Tq/γ² ) .

Hence, the total time complexity is:

Total time complexity = O( Σ_i T_false^(i) + Ttrue + n·Tq ) = O( n·(log(1/(δ·ϵ)) + log log(n/(δ·ϵ)))·Tq/γ² ) .

Simple version of IS-PROMPTING: In this version, we only have one confidence parameter, min(δfn, δfp). Hence, using Lemma A.2, we have:

T_false^(i) = Ttrue = O( log(1/min(δfn, δfp))·Tq/γ² ) .

Total time complexity = O( Σ_i T_false^(i) + Ttrue + n·Tq ) = O( n·log(n/(δ·ϵ))·Tq/γ² ) .

Thus, the proof is complete.

A.1 Identifying a prompting hypothesis

In this section, we provide two algorithms for identifying a prompting hypothesis.
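Both procedures build on the same primitive: sampling hypotheses from the bucket to estimate the fraction that Hi can substantially update. A minimal, runnable sketch of such a test is below; `can_update(i, j)` is a hypothetical predicate standing in for the check ŵi(Hj) > W(Hj) + ϵ′.

```python
import math
import random

def is_prompting_sampled(bucket, i, gamma, delta, can_update, rng=random):
    """Sampling-test sketch: estimate the fraction of `bucket` that
    hypothesis i can substantially update, and accept when the empirical
    fraction is at least 3*gamma/4. By the Hoeffding bound,
    t = O(log(1/delta)/gamma^2) samples suffice to separate a true fraction
    >= gamma from one < gamma/2, each with probability >= 1 - delta."""
    t = math.ceil(8 * math.log(2 / delta) / gamma ** 2)
    hits = sum(can_update(i, rng.choice(bucket)) for _ in range(t))
    return hits / t >= 3 * gamma / 4
```

The advanced version of the paper refines this primitive by running it in rounds with asymmetric confidence parameters; the sketch above corresponds to the symmetric, single-shot variant.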
The simple algorithm, presented in Section A.1.1, involves sampling the hypotheses in a bucket to estimate the fraction of them that Hi can move, using a standard application of the Hoeffding bound. The advanced algorithm, presented in Section A.1.2, exploits the fact that our analysis of Algorithm 1 requires different confidence parameters for false negative and false positive errors. We also show that the time complexity of this algorithm, when it returns false, is proportional to the number of hypotheses that can be removed from the bucket during the procedure, which later lets us use an amortized argument for the overall time spent on false returns.

A.1.1 Simple version of IS-PROMPTING

Algorithm 2 An algorithm to identify a prompting hypothesis (simple version).
1: procedure IS-PROMPTING-SIMPLE(B, ℓ, Hi, γ, ϵ′, δ, and query access to the ŵj(Hi)'s)
2:   t ← 8·log(2/δ)/γ²
3:   ur ← 0   ▷ ur counts the # of sampled hypotheses that Hi updates
4:   repeat t times:
5:     Hj ← a random hypothesis in Bℓ
6:     if ŵi(Hj) > W(Hj) + ϵ′ then
7:       ur ← ur + 1
8:   end repeat
9:   if ur/t < 3·γ/4 then
10:    return false
11:  else
12:    return true

Lemma A.2. Suppose we are given two parameters δ, ϵ′ ∈ (0, 1) and three positive integers n, k, and ℓ ∈ [k]. Assume we are given a class of hypotheses H, its partition into k buckets B = {B1, B2, . . . , Bk}, and a hypothesis Hi ∈ H. The IS-PROMPTING-SIMPLE procedure in Algorithm 2 receives H, B, ℓ, Hi, γ, δ, and ϵ′ as its input and returns true or false with the following guarantees:
• If Hi substantially updates at least a γ fraction of the hypotheses in Bℓ, then IS-PROMPTING-SIMPLE returns true with probability at least 1 − δ.
• If Hi substantially updates less than a γ/2 fraction of the hypotheses in Bℓ, then IS-PROMPTING-SIMPLE returns false with probability at least 1 − δ.
We also have the following guarantees for the algorithm:
• Number of queries to the ŵi(Hj)'s: O(log(1/δ)/γ²),
• Running time: O(log(1/δ)·Tq/γ²),
where Tq is the time to obtain each ŵj(Hi).

Proof.
This is a simple application of the Hoeffding bound: the empirical fraction ur/t deviates from its expectation by more than γ/4 with probability at most 2·exp(−2·t·(γ/4)²) ≤ δ.

A.1.2 Advanced version of IS-PROMPTING

We present an overview of the algorithm for the advanced version of IS-PROMPTING. Suppose we are given a parameter γ, and our goal is to design an algorithm that does the following with high probability: it returns true when Hi can substantially update at least a γ fraction of the hypotheses in Bℓ, and it returns false when Hi can substantially update less than a γ/2 fraction of the hypotheses in Bℓ. For this new procedure, we take advantage of an asymmetry between returning false and returning true, which arises mainly from the analysis of Algorithm 1. Roughly speaking, for every bucket ℓ, to find a single prompting hypothesis we check O(n) (potentially non-prompting) hypotheses. On the other hand, we only need log_{1/γ}(|Bℓ|) prompting hypotheses to empty out the whole bucket. Hence, we expect the procedure to return false far more often than true.

Another difference is the cost we pay for mistakenly declaring true or false. When we declare a hypothesis prompting, we iterate over all hypotheses in Bℓ and check whether we can remove them from the bucket. If the hypothesis was not actually prompting, we have paid a substantial cost (in time) without making progress. Thus, we cannot afford too many hypotheses that the algorithm declares prompting while they are not (false positives). However, the probability of mistakenly missing a prompting hypothesis (a false negative) does not have to be this small. Below is a more precise definition of these errors:

δfp := Pr[Returning true | Hi substantially updates less than a γ/2 fraction]
δfn := Pr[Returning false | Hi substantially updates at least a γ fraction]

Given the analysis of our algorithm, we can show that the probability of a false positive must be around O(1/|Bℓ|) = O(1/n).
On the other hand, our analysis is robust as long as the probability of a false negative is roughly O(δ), where δ is the overall confidence of our algorithm. (The reason for this choice of parameter may not be obvious from this high-level discussion; it is a direct artifact of our analysis.) This gap is particularly large when δ is a small constant, as is common in the literature.

The IS-PROMPTING procedure (Algorithm 3) is designed to have different time complexities depending on whether its output is true or false. The procedure uses more than O(log n) time, more precisely O(log n·log(1/δ)/γ²), if it returns true. The main advantage, however, is that it uses much less time when the output is false: instead of paying O(log n), we only spend O(log(1/δ)) time plus an amortized cost of O(1/γ) over the O(|Bℓ|) calls to the procedure.

The algorithm runs in roughly O(log n) rounds. In each round, it draws a few random hypotheses from Bℓ (roughly O(log(1/δ)/γ²) of them) and checks whether the fraction of hypotheses Hi can substantially update is close to γ. At any round, if we see that the fraction is not close to γ, we return false. The hope is that, for a non-prompting hypothesis, we either stop very quickly, or the hypothesis passes many rounds without us noticing that it is not prompting, in which case we must have seen an inflated number of substantial updates in these rounds. Before returning false, we perform those substantial updates among the sampled hypotheses that we have observed, removing those movable hypotheses from Bℓ. We can capitalize on this fact in our cost analysis: in the false case, if the procedure takes roughly O(t/γ) time, it must have removed t hypotheses from the bucket. Thus, one can show that the amortized cost of these elongated rounds is only O(1/γ).6

Lemma A.3. Suppose we are given three parameters δfp, δfn, ϵ′ ∈ (0, 1), and three positive integers n, k, and ℓ ∈ [k].
Assume we are given a class of hypotheses H, its partition into k buckets B = {B1, B2, . . . , Bk}, and a hypothesis Hi ∈ H. The IS-PROMPTING procedure in Algorithm 3 receives H, B, ℓ, Hi, δfp, δfn, and ϵ′ as its input and returns true or false with the following guarantees:
• If Hi substantially updates at least a γ fraction of the hypotheses in Bℓ, then IS-PROMPTING returns true with probability at least 1 − δfn.
• If Hi substantially updates less than a γ/2 fraction of the hypotheses in Bℓ, then IS-PROMPTING returns false with probability at least 1 − δfp.

6The parameters in this high-level discussion lack precision. For a rigorous analysis, refer to our proof of Lemma A.3.

Algorithm 3 An algorithm to identify a prompting hypothesis.
1: procedure IS-PROMPTING(B, ℓ, Hi, γ, ϵ′, δfn, δfp, and query access to the ŵj(Hi)'s)
2:   R ← ⌈log_{3/2}(1/δfp)⌉
3:   S0 ← ∅   ▷ S is a set of hypotheses that Hi can update substantially.
4:   for r = 1, . . . , R do
5:     if |Sr−1| ≥ γ·|Bℓ|/2 then
6:       return true
7:     Sr ← Sr−1
8:     ur ← 0   ▷ ur indicates the # of hypotheses Hi updates
9:     repeat t := ⌈48·log(2R/δfn)/γ²⌉ times:
10:      G(r)j ← a random hypothesis in Bℓ
11:      if ŵi(G(r)j) > W(G(r)j) + ϵ′ then
12:        ur ← ur + 1
13:        Sr ← Sr ∪ {G(r)j}
14:    end repeat
15:    if |Sr| − |Sr−1| < γ·t/8 then
16:      for Hj ∈ Sr do
17:        W(Hj) ← max(W(Hj), ŵi(Hj))
18:        Move Hj to B⌈W(Hj)/ϵ′⌉
19:      return false
20:    if ur/t < 3·γ/4 then
21:      for Hj ∈ Sr do
22:        W(Hj) ← max(W(Hj), ŵi(Hj))
23:        Move Hj to B⌈W(Hj)/ϵ′⌉
24:      return false
25: return true

If the algorithm returns true, we have the following guarantees:
• Number of queries to the ŵi(Hj)'s: O( log(1/δfp)·(log(1/δfn) + log log(1/δfp))/γ² ),
• Running time: O( log(1/δfp)·(log(1/δfn) + log log(1/δfp))·Tq/γ² ).

Suppose the algorithm returns false, and let n_r^(ℓ,i) denote the number of hypotheses that the algorithm removes from Bℓ.
Then we have the following guarantees:
• Number of queries to the ŵi(Hj)'s: O( (log(1/δfn) + log log(1/δfp))/γ² + n_r^(ℓ,i)/γ ),
• Running time: O( (log(1/δfn) + log log(1/δfp))·Tq/γ² + n_r^(ℓ,i)·Tq/γ ).
In the above bounds, Tq denotes the time complexity of obtaining each ŵj(Hi).

Proof. The probabilities in this proof are taken over the random choices of the G(r)j's. Note that S contains the hypotheses that Hi can substantially update.

Probability of a false positive: First, we show that the probability of outputting true, while Hi substantially updates less than a γ/2 fraction of the hypotheses in Bℓ, is at most δfp. The algorithm returns true in two places. First, at the beginning of each round, we check whether Sr−1 contains at least a γ/2 fraction of the hypotheses. This answer is always correct with probability one, since the algorithm has direct evidence that Hi can update at least γ/2 of the hypotheses in Bℓ. Thus, this case does not affect our false positive rate δfp. Second, the algorithm returns true after all rounds end. This happens only when ur/t is at least 3γ/4 in every round in Line 20. It is not hard to see that ur is a binomial random variable with t trials, whose success probability is the fraction of the hypotheses in Bℓ that Hi can update substantially. Thus, in the case where Hi substantially updates less than a γ/2 fraction of the hypotheses, by Markov's inequality, we have:

Pr[Returning true | Hi substantially updates less than a γ/2 fraction] ≤ Pr[∀r ∈ [R] : ur/t ≥ 3γ/4] ≤ (2/3)^R ≤ δfp .

Probability of a false negative: Suppose that Hi substantially updates at least a γ fraction of the hypotheses in Bℓ; that is, Hi can update each G(r)j with probability at least γ. Fix a round r. We bound the probability of returning false in Line 19 and in Line 24. In Line 19, we return false when |Sr| − |Sr−1| < γ·t/8, that is, when the number of new hypotheses added to Sr is too small. Consider the random hypotheses we draw at round r.
Our claim is that most of these random hypotheses are not in Sr−1. The probability that a given G(r)j is selected from Sr−1 is |Sr−1|/|Bℓ|, which is less than γ/2; otherwise, the algorithm would have returned true earlier in Line 5. With this information in mind, we can use the Hoeffding bound and get:

Pr[ #{G(r)j ∈ Sr−1} ≥ 5·γ·t/8 ] ≤ exp(−2·t·(γ/8)²) ≤ δfn/(2R) .

In Line 24, we return false if ur/t is less than 3γ/4. By the Chernoff bound, we have:

Pr[ ur/t < 3·γ/4 ] ≤ exp(−2·t·(γ/4)²) ≤ δfn/(2R) .

Taking the union bound over all rounds, we obtain:

Pr[Returning false | Hi substantially updates at least a γ fraction] ≤ δfn .

Running time: If the algorithm returns true, it makes O(R·t) queries to the ŵi(G(r)j)'s and runs in O(R·t·Tq) time. If the algorithm returns false in round r, it similarly makes O(r·t) queries and runs in O(r·t·Tq) time. Note that, with the exception of the last round, in every round r′ ∈ [r − 1] at least γ·t/8 new hypotheses are added to Sr′. Therefore, the size of Sr is at least γ·t·(r − 1)/8. We remove all of these hypotheses before we output false, which causes the size of Bℓ to drop by n_r^(ℓ,i) := |Sr|. In this case, the algorithm makes O(t + n_r^(ℓ,i)/γ) queries to the ŵi(G(r)j)'s and runs in O((t + n_r^(ℓ,i)/γ)·Tq) time. The proof is complete by setting R := ⌈log_{3/2}(1/δfp)⌉ and t := ⌈48·log(2R/δfn)/γ²⌉.

A.2 Proof of Lemma A.1

Lemma A.1. Suppose Bℓ ⊆ H is the smallest non-empty bucket for which at least a γ fraction of its hypotheses are (3·OPT + ϵ)-far from P. Then Hi∗ is a prompting hypothesis that can remove a γ fraction of the hypotheses in Bℓ.

Proof. We show that Hi∗ can substantially update W(Hf) for any far hypothesis Hf in Bℓ. Let Hf denote a (3·OPT + ϵ)-far hypothesis.
First, observe that ŵi∗(Hf) must be large compared to OPT:

ŵi∗(Hf) ≥ wi∗(Hf) − ϵ′ = |Hf(S_{f,i∗}) − P(S_{f,i∗})| − ϵ′
  ≥ |Hf(S_{f,i∗}) − Hi∗(S_{f,i∗})| − |Hi∗(S_{f,i∗}) − P(S_{f,i∗})| − ϵ′
  ≥ ∥Hf − Hi∗∥TV − ∥Hi∗ − P∥TV − ϵ′    (by Fact C.2)
  ≥ ∥Hf − P∥TV − 2·∥Hi∗ − P∥TV − ϵ′
  > 3·OPT + ϵ − 2·OPT − ϵ′ ≥ OPT + 2ϵ′ .

In addition, OPT cannot be much smaller than ℓ·ϵ′. The main reason is that all the buckets before Bℓ are empty. Hence, Hi∗ belongs to a bucket Bℓ′ where ℓ′ is at least ℓ, and by our definition of the bucketing, Ŵ(Hi∗) must be at least ℓ·ϵ′. Hence we have:

(ℓ − 1)·ϵ′ ≤ Ŵ(Hi∗) − ϵ′ ≤ W(Hi∗) ≤ OPT .

Note that Hf belongs to Bℓ, so Wf is less than ℓ·ϵ′. Putting all these observations together, we achieve:

Wf + ϵ′ < ℓ·ϵ′ + ϵ′ ≤ OPT + 2ϵ′ < ŵi∗(Hf) .

Therefore, if we update Wf via ŵi∗(Hf), then Hi∗ significantly improves Wf. This fact implies that Hi∗ can remove all the bad hypotheses from Bℓ, which concludes the proof of the lemma.

B Linear-time algorithm with α = 4

In this section, we present our second algorithm, whose sample complexity has a better dependency on the accuracy parameter ϵ. An overview of our approach is presented in Section 3.3. The pseudocode of our approach is presented in Algorithm 4, and we prove the performance of this algorithm in Theorem 6. Apart from improving the accuracy parameters, we also provide modifications to our algorithm that achieve a low number of rounds of adaptivity (roughly speaking, the number of times the algorithm needs to look at the sample set). More formally, in our access model, we define one round of adaptivity as follows: the algorithm selects a set of t = O(n²) pairs of indices {(iℓ, jℓ)}_{ℓ∈[t]} and queries all the ŵ_{jℓ}(H_{iℓ})'s at once. The number of adaptivity rounds is the total number of times the algorithm needs to repeat this process in order to produce its final output. This is particularly interesting when considering a federated learning setting in which the rounds of interaction are important. The modified algorithm runs in O(n^{1+λ}) time and O(1/λ) rounds of adaptivity.
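The batched access model above can be sketched in a few lines of Python. This is our own illustration, not code from the paper: the function name `adaptivity_round` and the oracle interface are hypothetical. The point is that one round of adaptivity commits to all index pairs before any estimate is revealed:

```python
import random

def adaptivity_round(pairs, estimate_w):
    """One round of adaptivity: commit to a batch of index pairs (i, j) up
    front, then receive every estimate w-hat_{j}(H_i) at once."""
    return {(i, j): estimate_w(i, j) for (i, j) in pairs}

# Toy oracle: a fixed table of estimates, standing in for the Scheffe-set
# statistics computed from the sample set.
n = 4
table = {(i, j): random.random() for i in range(n) for j in range(n)}
answers = adaptivity_round(list(table), lambda i, j: table[(i, j)])
assert answers == table  # all n^2 queries answered in a single round
```

An algorithm with r rounds of adaptivity calls such a batch r times, where each batch may depend on the answers to the previous ones.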
Theorem 6. Suppose Algorithm 4 receives these parameters as input: σ ∈ [0, 1], η ∈ (0, 1/4), and ϵ, δ ∈ (0, 1). Also, assume the algorithm has access to a class of n hypotheses H and a set of unmarked hypotheses Q ⊆ H, where initially Q = H. Suppose the algorithm can query the ŵj(Hi)'s with error at most ϵ′ := ϵ/6. The algorithm has one of two possible outcomes: either it declares that σ < OPT, or it outputs a hypothesis Ĥ such that, with probability at least 1 − δ, we have:

∥Ĥ − P∥TV ≤ 4·max(σ, OPT) + ϵ .

The algorithm runs in O(n·Tq/η) time and O(log²_{1/η}(n)) rounds of adaptivity. Here, Tq is the time required to obtain each ŵj(Hi).

Algorithm 4 finds a hypothesis that is 4·max(σ, OPT) + ϵ close to P.
1: procedure SELECT-HYPOTHESIS(H, Q, σ, ϵ, δ, η, and query access to the ŵj(Hi)'s)
2:   if |Q| = 0 then
3:     return "σ < OPT".
4:   if |Q| = 1 then
5:     Hi ← the only unmarked hypothesis left.
6:     if Ŵ(Hi) > σ + ϵ′ then
7:       return "σ < OPT".
8:     else
9:       return Hi.
10:  ϵ′ ← ϵ/6
11:  δ′ ← δ/(100·log_{1/(2η)}(n)·log_{1/η}(n))
12:  m ← η·n
13:  Run FIND-SEED(H, Q, σ, ϵ′, δ′, η)
14:  if FIND-SEED declares "σ < OPT" then
15:    return "σ < OPT".
16:  else if FIND-SEED returns "start over" then
17:    return SELECT-HYPOTHESIS(H, Q, σ, ϵ, δ, η)
18:  Hi ← the seed that FIND-SEED found.
19:  while m > 1/η do
20:    Run BOOST-SEED(H, Q, Hi, σ, κ = 2, κ′ = 2, ϵ′, η)
21:    if BOOST-SEED finds the final answer then
22:      return the Hi found by BOOST-SEED.
23:    else if BOOST-SEED finds a new seed then
24:      Hi ← the seed that BOOST-SEED returns
25:      m ← η·m
26:    else if BOOST-SEED returns "start over" then
27:      return SELECT-HYPOTHESIS(H, Q, σ, ϵ, δ, η)
28:  Hℓ ← argmin_{Hj ∈ M} Ŵ(Hj)    ▷ M denotes M_{2σ+ϵ′, 4σ+5ϵ′}(Hi).
29:  if Ŵ(Hℓ) ≤ 2σ + ϵ′ then
30:    Output Hℓ.
31:  else
32:    Output Hi.

Proof. For the sake of argument, assume each of the subroutines in the algorithm works as guaranteed with probability one (as opposed to working as expected with probability 1 − δ′). Later on, we discuss the overall confidence of the algorithm to remove this assumption.
Accuracy of seeds: The procedure for finding the initial seed provides us with a (σ + ϵ′, 3σ + 3ϵ′, m)-seed with high probability. Using Fact C.4, this seed is also a (2σ + ϵ′, 4σ + 5ϵ′, m)-seed. In boosting seeds, we start with a (2σ + ϵ′, 4σ + 5ϵ′, m)-seed and obtain a (2σ + ϵ′, 4σ + 5ϵ′, m′)-seed where m′ = η·m. This statement is justified by Theorem 9, setting κ = κ′ = 2 and using the same values a = 2σ + ϵ′ and b = 4σ + 5ϵ′ for our seeds. It is worth noting that the only parameter that changes while we boost is m.

Accuracy of the output: Consider the case where we produce output when |Q| is zero. That implies that our algorithm marked all the hypotheses. Throughout the course of the algorithm, we never mark a hypothesis Hi unless we have found evidence that ∥Hi − P∥TV > σ. This is established either by observing ŵj(Hi) > σ + ϵ′, which implies that ∥Hi − P∥TV must be greater than σ, or by showing that a nearby hypothesis to Hi is far from P and applying the triangle inequality (see our argument for Case 3.2 in the proof of Theorem 9 for an example). Thus, if all hypotheses are marked, it must be the case that ∥Hi − P∥TV > σ for every Hi. Hence, when the size of Q is zero, we can truly assert that σ < OPT.

For the case |Q| = 1, the algorithm focuses on Hi, the only unmarked hypothesis left. If Ŵ(Hi) is greater than σ + ϵ′, then ∥Hi − P∥TV > σ as well, so ∥Hj − P∥TV is greater than σ for every Hj and σ < OPT. Otherwise, using Fact C.3, Hi is a valid answer:

∥Hi − P∥TV ≤ W(Hi) + 2·OPT ≤ Ŵ(Hi) + ϵ′ + 2·OPT ≤ 3·max(σ, OPT) + 2ϵ′ .

In addition, the accuracy of the Hi returned in Line 22 is guaranteed by Theorem 9, setting κ = κ′ = 2. Next, let's focus on the hypotheses we output after the if condition in Line 29. When Ŵ(Hℓ) is small, we have:

∥Hℓ − P∥TV ≤ W(Hℓ) + 2·OPT ≤ Ŵ(Hℓ) + ϵ′ + 2·OPT ≤ 2σ + 2ϵ′ + 2·OPT ≤ 4·max(σ, OPT) + 2ϵ′ .

When Ŵ(Hℓ) is large, the analysis is very similar to Case 1 in the proof of Theorem 9.
Again, by setting κ = κ′ = 2, and using the facts that Ŵ(Hℓ) − ϵ′ ≥ κ′·σ and Ŵ(Hi) ≤ (κ + 2)·σ + 5ϵ′, we obtain:

∥Hi − P∥TV ≤ 4·max(σ, OPT) + 6ϵ′ .

Number of recursions: Given Theorem 8 and Theorem 9, we only see "start over" when a (1 − 2η) fraction of the unmarked hypotheses in Q have been marked. Since we have n hypotheses, we do not start over more than O(log_{1/η}(n)) times.

Running time: In the while loop, m decreases by a factor of η every time we find a new seed. Hence, the total number of iterations is bounded by O(log_{1/η}(n)). Using Theorem 8 and Theorem 9, each invocation of the procedures for finding a seed and boosting a seed takes O(n·log(n/δ′)·Tq/η) time. Thus, the total time complexity is:

O( (n/η)·(log(n/δ) + log log_{1/η}(n))·log²_{1/η}(n)·Tq ) = Õ( n·log(1/δ)·Tq/η ) .

Overall confidence parameter: The total number of subroutines we call here is at most O(log²_{1/η}(n)). Thus, by setting δ′ to O(δ/log²_{1/η}(n)) and using the union bound, we can show the overall error probability is bounded by δ.

Rounds of adaptivity: Using Theorem 8 and Theorem 9, each invocation of the procedures for finding a seed and boosting a seed takes O(1) rounds of adaptivity. Hence, the overall number of rounds of adaptivity is O(log²_{1/η}(n)).

Remark 7. One could argue that we have only demonstrated that the algorithm operates in O(n·Tq/η) time with probability 1 − δ. However, this fact does not inherently guarantee that the algorithm consistently maintains the desired time complexity. Fortunately, a straightforward solution exists to address this concern. Our algorithm may fail for two primary reasons: either too few hypotheses were marked, or the identified seed had a larger m value than anticipated. In either scenario, the algorithm can verify the occurrence of this undesirable but improbable event. In these situations, we can output ⊥ to indicate the algorithm's failure to produce a valid answer. With this adjustment, the time complexity remains low, as desired.

Corollary B.1.
Suppose H is a class of n known hypotheses and P is an unknown distribution. Let ϵ and δ be two parameters in (0, 1). Assume the algorithm has access to estimates ŵj(Hi) that have error at most ϵ′. If 3ϵ′ ≤ ϵ, then for every η ∈ (1/n, 1/4) there exists an (α = 4, ϵ, δ)-proper learner for P in H. The running time of our algorithm is O((n/η)·log(1/δ)·log(1/ϵ)·Tq), and it can be implemented in O(log²_{1/η}(n)·log(1/ϵ)) rounds of adaptivity.

Proof. This is a direct corollary of Theorem 6 combined with a binary search over the values σ ∈ {ϵ′′, 2ϵ′′, . . . , 1}, where ϵ′′ := ϵ/100. Note that for every σ ≥ OPT, the algorithm has to produce a hypothesis, since with high probability it cannot declare "σ < OPT". Outputting the hypothesis associated with the smallest such σ gives us the desired guarantee.

B.1 Finding an initial seed

In this section, we describe an algorithm that receives the parameter σ as input and, with high probability, either finds a (σ + ϵ′, 3σ + 3ϵ′, η·n)-seed in roughly O(n·Tq/η) time or declares "σ < OPT".

Algorithm 5 finds a (σ + ϵ′, 3σ + 3ϵ′, m)-seed.
1: procedure FIND-SEED(H, Q, σ, ϵ′, δ, η, and query access to the ŵj(Hi)'s)
2:   m ← η·n
3:   for i = 1, . . . , n do
4:     repeat t := 8·n·log(2n/δ)/m times
5:       Hj ← a uniformly random sample drawn from H
6:       if ŵj(Hi) > σ + ϵ′ then
7:         Mark Hi in Q.
8:       Hj ← a uniformly random sample drawn from Q
9:       if ŵj(Hi) > σ + ϵ′ then
10:        Mark Hi in Q.
11:
12:    if |Q| ≤ t then
13:      for every Hj ∈ Q do
14:        if ŵj(Hi) > σ + ϵ′ then
15:          Mark Hi in Q.
16:    if Hi is unmarked then
17:      Compute Ŵ(Hi).
18:      if Ŵ(Hi) ≤ 3σ + 3ϵ′ then
19:        return Hi as a seed.
20:      else
21:        for Hj ∈ H do
22:          if ŵi(Hj) > σ + ϵ′ or ŵj(Hi) ≤ σ + ϵ′ then
23:            Mark Hj in Q.
24:        return "Start over."
25:   return "σ < OPT".

Theorem 8. Suppose that we are given a class of n hypotheses H, a rate parameter η ∈ (0, 1/4), two parameters σ, δ ∈ (0, 1), and access to ŵj(Hi) with error at most ϵ′ for every i, j ∈ [n]. Algorithm 5 queries O(n·log(n/δ)/η) many ŵi(Hj)'s and runs in O((n·log(n/δ)/η)·Tq) time.
Also, it can be implemented in O(1) rounds of adaptivity. This algorithm returns a hypothesis Hℓ, declares "σ < OPT", or returns "start over", and the following guarantees hold with probability 1 − δ:

- If σ ≥ OPT, then the algorithm does not declare that σ is less than OPT.
- If the algorithm returns a hypothesis Hℓ, then Hℓ is a (σ + ϵ′, 3σ + 3ϵ′, m)-seed, where m := η·n.
- If the algorithm returns "start over", then it marks a (1 − 2η) fraction of the hypotheses in Q.

Proof. Our first claim is that if σ is at least OPT, then we do not declare otherwise. This is due to the fact that the discrepancy between Hi∗ and P on any subset of the domain, including the Scheffé sets, does not exceed OPT. Hence, ŵj(Hi∗) is at most OPT + ϵ′ ≤ σ + ϵ′, and we will not mark Hi∗ in Line 7, Line 10, or Line 15. That is, for at least one hypothesis, namely Hi∗, the if statement in Line 16 holds, and we return a seed or "start over". (We never reach Line 25.)

Next, we show that if the algorithm outputs Hi, then it is a (σ + ϵ′, 3σ + 3ϵ′, m)-seed with high probability. First, observe that Ŵ(Hi) must be at most 3σ + 3ϵ′ due to the if condition in Line 18, satisfying one of the two conditions we need for our desired seed in Definition 3.1. Now, we show that the other required condition for Hi to be a good seed holds as well: |M_{σ+ϵ′, 3σ+3ϵ′}(Hi)| ≤ m. This condition requires that the number of Hj's in H for which ŵj(Hi) lies in (σ + ϵ′, 3σ + 3ϵ′] is bounded by m. We have already established that ŵj(Hi) ≤ 3σ + 3ϵ′ for every j ∈ [n], since Ŵ(Hi) is at most 3σ + 3ϵ′. Hence, we only need to show that there are at most m hypotheses Hj such that ŵj(Hi) is larger than σ + ϵ′. Suppose, towards a contradiction, that there are more than m hypotheses Hj in H with ŵj(Hi) > σ + ϵ′. Then, if we sample t := 8·n·log(2n/δ)/m hypotheses, we observe at least one Hj for which ŵj(Hi) is larger than σ + ϵ′, with probability at least 1 − δ/(2n).
However, we know that no such hypothesis was observed, because we did not mark Hi earlier. By the union bound, with probability at least 1 − δ/2, no such collection of hypotheses exists for any unmarked Hi. Hence, with probability at least 1 − δ/2, Hi is a (σ + ϵ′, 3σ + 3ϵ′, m)-seed, as promised in the statement of the theorem.

Next, assume we output "start over" after finding an unmarked hypothesis Hi. In this case, we know that Hi is not marked and Ŵ(Hi) > 3σ + 3ϵ′. Note that we have sampled t edges of Hi, and we never saw ŵj(Hi) > σ + ϵ′. Using a very similar argument to the one above, one can show that, with probability 1 − δ/2 (for every Hi), there cannot be more than η·|Q| hypotheses Hj in Q for which ŵj(Hi) > σ + ϵ′. That means we will mark at least |Q| − η·|Q| hypotheses in Q. If |Q| > t ≥ 1/η, this implies that a (1 − 2η) fraction of the hypotheses in Q are removed. If |Q| ≤ t, then we know that every hypothesis Hj in Q has ŵj(Hi) ≤ σ + ϵ′, and all of them will be marked.

Furthermore, we show that we did not wrongfully mark a hypothesis that is σ-close to P. Observe that the condition in Line 22 holds in two cases:

Case 1: ŵi(Hj) > σ + ϵ′. It is straightforward to show that Hj is not σ-close to P, since we have:

∥Hj − P∥TV ≥ wi(Hj) ≥ ŵi(Hj) − ϵ′ > σ .

Case 2: ŵi(Hj) ≤ σ + ϵ′ and ŵj(Hi) ≤ σ + ϵ′. Even though ŵi(Hj) is small in this case, we can indirectly deduce that Hj is not σ-close to P. By the definitions of ŵj(Hi) and ŵi(Hj), we have:

∥Hi − Hj∥TV = |Hi(S_{i,j}) − Hj(S_{i,j})| ≤ |Hi(S_{i,j}) − P(S_{i,j})| + |P(S_{i,j}) − Hj(S_{i,j})| = wi(Hj) + wj(Hi) ≤ ŵi(Hj) + ŵj(Hi) + 2ϵ′ ≤ 2σ + 2ϵ′ .

On the other hand, by the triangle inequality, we have:

∥P − Hj∥TV ≥ ∥P − Hi∥TV − ∥Hi − Hj∥TV ≥ W(Hi) − ∥Hj − Hi∥TV ≥ Ŵ(Hi) − ϵ′ − ∥Hi − Hj∥TV > 3σ + 3ϵ′ − ϵ′ − (2σ + 2ϵ′) = σ .

Therefore, in neither of the two cases above do we mark a σ-close distribution to P. By the union bound over all the steps, our guarantees hold with probability 1 − δ.
In addition, it is not hard to see that the running time of the algorithm is O((n·t + n)·Tq) = O(n·log(n/δ)·Tq/η).

Rounds of adaptivity: It is straightforward to show that this algorithm can be implemented with a constant number of rounds of adaptivity. Let Q0 denote the initial set of unmarked hypotheses at the algorithm's outset. Since the hypotheses are marked in sequence, in round i the set of unmarked hypotheses is Qi := Q0 ∩ {Hi, Hi+1, . . . , Hn}. As a result, we can agree on the sets of random Hj's drawn from Q and from H in advance and request the corresponding ŵj(Hi)'s within a single round. If the size of |Qi| is at most t, then we include all ŵj(Hi) for every i ≤ n and Hj ∈ Q. Next, upon discovering an unmarked Hi, we initiate another round of adaptivity to request ŵj(Hi) and ŵi(Hj) for all j ∈ [n]. This enables the implementation of the rest of the algorithm.

B.2 Boosting a seed

In this section, we provide an algorithm that receives a (κσ + ϵ′, (κ + 2)σ + 5ϵ′, m)-seed and aims to find a (κ′σ + ϵ′, (κ′ + 2)σ + 5ϵ′, m′)-seed with a smaller m′. That means the size of M for this new seed is smaller, which brings us closer to having enough time to fully investigate the set M for a future seed. Our algorithm runs in O(n·Tq/η) time, where η = m′/m is our rate parameter. This running time is adaptive to our budget: if we are aiming for an O(n)-time algorithm, we find a seed such that m′ is smaller than m by only a constant factor (constant η); alternatively, if we have more time, we can decrease m at a faster rate (smaller η). While our main goal is to find a better seed, our algorithm may not always be able to do so; however, it makes progress towards the end goal of the algorithm one way or the other. There are three possible outcomes for our algorithm:

1. The algorithm finds an accurate enough Ĥ as the final output of the algorithm.
2. The algorithm finds a better seed, as we aimed for.
3. The algorithm marks a (1 − 2η) fraction of the unmarked hypotheses; after that, we start the procedure over.
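As a quick sanity check on how many start-overs can occur, note that each one leaves at most a 2η fraction of Q unmarked. The following toy calculation (our own illustration, not code from the paper) tracks a pessimistic unmarked count through repeated restarts:

```python
import math

def count_restarts(n, eta):
    """Count restarts when each 'start over' marks a (1 - 2*eta) fraction
    of the unmarked set, i.e. at most a 2*eta fraction survives."""
    q, restarts = n, 0
    while q >= 1:
        q = math.floor(2 * eta * q)  # pessimistic: exactly 2*eta survive
        restarts += 1
    return restarts

n, eta = 10**6, 0.1
# O(log_{1/(2*eta)} n) restarts, as claimed in the analysis
bound = math.ceil(math.log(n) / math.log(1 / (2 * eta))) + 1
assert count_restarts(n, eta) <= bound
```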
Although we may have regressed in the last case above, this does not happen too often: since we have only n unmarked hypotheses to begin with, and we mark a large fraction of them each time, we do not start over more than O(log_{1/2η}(n)) times over the course of the algorithm. Our approach is presented in Algorithm 6, and we prove its performance in Theorem 9.

Algorithm 6 aims to find a better seed.
1: procedure BOOST-SEED(H, Q, Hi, σ, κ, κ′, ϵ′, δ, η, and query access to the ŵj(Hi)'s)
2:   m′ ← η·m
3:   t ← 8·n·log(2n/δ)/m′
4:   for Hj ∈ M do    ▷ We denote M_{κσ+ϵ′, (κ+2)σ+5ϵ′}(Hi) by M.
5:     W(Hj) ← COMPUTEW(H, Q, Hj, t)
6:   Hℓ ← argmin_{Hj ∈ M} W(Hj)
7:   if W(Hℓ) > κ′σ + ϵ′ then
8:     Return Hi as the final answer.
9:   else
10:    Compute Ŵ(Hℓ).
11:    if Ŵ(Hℓ) ≤ (κ′ + 2)σ + 5ϵ′ then
12:      Return Hℓ as a seed.
13:    else
14:      for Hj ∈ H do
15:        if ŵℓ(Hj) > σ + ϵ′ or ŵj(Hℓ) ≤ κ′σ + ϵ′ then
16:          Mark Hj in Q.
17:      return "Start over."

Theorem 9. Suppose Algorithm 6 receives these parameters as input: σ ∈ R≥0; κ, κ′ ≥ 1; η ∈ (0, 1/4); ϵ′, δ ∈ (0, 1); and two non-negative integers n and m. Also, assume the algorithm has access to a class of n hypotheses H and a set of unmarked hypotheses Q ⊆ H. It also receives a hypothesis Hi ∈ H. The algorithm has one of the three possible outcomes listed below. For any setting of such input parameters, if Hi is a (κσ + ϵ′, (κ + 2)σ + 5ϵ′, m)-seed, then the following guarantees hold for the outcome of Algorithm 6 with probability at least 1 − δ:

1. If the algorithm outputs a hypothesis Ĥ as the final answer, then for every ϵ ≥ 6ϵ′ we have:

∥Ĥ − P∥TV ≤ max( κ + 2, (κ + 2)/κ′ + 2 )·max(σ, OPT) + ϵ .

2. If the algorithm outputs a hypothesis Hℓ as a new seed, then Hℓ is a (κ′σ + ϵ′, (κ′ + 2)σ + 5ϵ′, m′)-seed, where m′ := η·m.

Algorithm 7 computes W(Hi).
1: procedure COMPUTEW(H, Q, Hi, t, and query access to the ŵj(Hi)'s)
2:   W ← 0
3:   repeat t times
4:     Hj ← a uniformly random sample drawn from H
5:     W ← max(W, ŵj(Hi))
6:     Hj ← a uniformly random sample drawn from Q
7:     W ← max(W, ŵj(Hi))
8:
9:   if |Q| ≤ t then
10:    for all Hj ∈ Q do
11:      W ← max(W, ŵj(Hi))
12:  Return W.

3. If the algorithm requires us to start over, then we have marked at least a (1 − 2η) fraction of the unmarked hypotheses.

The algorithm runs in O(n·log(n/δ)·Tq/η) time and can be implemented in O(1) rounds of adaptivity.

Proof. Let M denote the set M_{κσ+ϵ′, (κ+2)σ+5ϵ′}(Hi). We consider the three possible outcomes of the algorithm, and prove that the algorithm satisfies the desired guarantee in each of the three cases.

Case 1: W(Hℓ) > κ′σ + ϵ′. In this case, the algorithm will return Hi as the final answer. We show that the total variation distance between Hi and P is as desired. There are two possibilities, depending on whether Hi∗ is in M or not.

Case 1.1: Hi∗ ∈ M. Using the fact that Hℓ has the minimum W-value among all the hypotheses in M, we can bound OPT as follows:

OPT ≥ Ŵ(Hi∗) − ϵ′ ≥ W(Hi∗) − ϵ′ ≥ W(Hℓ) − ϵ′ > κ′σ .

We now prove that Hi is not too far from P. Note that since Hi is a (κσ + ϵ′, (κ + 2)σ + 5ϵ′, m)-seed, we are guaranteed that Ŵ(Hi) is bounded by (κ + 2)σ + 5ϵ′. Using the above bound on OPT and Fact C.3, we get:

∥Hi − P∥TV ≤ W(Hi) + 2·OPT ≤ Ŵ(Hi) + ϵ′ + 2·OPT ≤ (κ + 2)σ + 6ϵ′ + 2·OPT ≤ ((κ + 2)/κ′ + 2)·OPT + 6ϵ′ ≤ ((κ + 2)/κ′ + 2)·max(σ, OPT) + ϵ .

Case 1.2: Hi∗ ∉ M. Since Hi is a (κσ + ϵ′, (κ + 2)σ + 5ϵ′, m)-seed, ŵi∗(Hi) is bounded by Ŵ(Hi) ≤ (κ + 2)σ + 5ϵ′. Given that Hi∗ is not in M, it must be the case that ŵi∗(Hi) is bounded by κσ + ϵ′. In this case, we have the following bound on the total variation distance between Hi and P via Fact C.3:

∥Hi − P∥TV ≤ wi∗(Hi) + 2·OPT ≤ ŵi∗(Hi) + ϵ′ + 2·OPT ≤ κσ + 2ϵ′ + 2·OPT ≤ (κ + 2)·max(σ, OPT) + ϵ .

Case 2: W(Hℓ) ≤ κ′σ + ϵ′ and Ŵ(Hℓ) ≤ (κ′ + 2)σ + 5ϵ′.
In this case, we output Hℓ as a new seed. Given the assumptions of this case, we show that Hℓ is a (κ′σ + ϵ′, (κ′ + 2)σ + 5ϵ′, m′)-seed. First, Ŵ(Hℓ) is at most (κ′ + 2)σ + 5ϵ′, implying that the first desired property of a (κ′σ + ϵ′, (κ′ + 2)σ + 5ϵ′, m′)-seed holds for Hℓ. Next, we show the second desired property for Hℓ: the size of M_{κ′σ+ϵ′, (κ′+2)σ+5ϵ′}(Hℓ) is bounded by m′. At a high level, we expect the size of this set to be small; otherwise, we would have observed one of these large ŵj(Hℓ)'s among the t hypotheses we sampled to compute W(Hℓ). Suppose Hℓ has more than m′ edges with ŵj(Hℓ) > κ′σ + ϵ′. Then, when we sample t := 8·n·log(2n/δ)/m′ hypotheses, we observe at least one Hj for which ŵj(Hℓ) is larger than κ′σ + ϵ′, with probability at least 1 − δ/(2n). However, we know that no such hypothesis was observed, because W(Hℓ), which is the maximum of the sampled ŵj(Hℓ)'s, is bounded from above by κ′σ + ϵ′. By the union bound, with probability at least 1 − δ/2, no such Hℓ exists. Hence, with probability at least 1 − δ/2, Hℓ is a (κ′σ + ϵ′, (κ′ + 2)σ + 5ϵ′, m′)-seed, as promised in the statement of the theorem.

Case 3: W(Hℓ) ≤ κ′σ + ϵ′ and Ŵ(Hℓ) > (κ′ + 2)σ + 5ϵ′. In this case, the algorithm marks every unmarked hypothesis Hj for which ŵℓ(Hj) > σ + ϵ′ or ŵj(Hℓ) ≤ κ′σ + ϵ′, and we start over. First, we show that we did not wrongfully mark a hypothesis that is σ-close to P. Observe that the condition in Line 15 holds in two cases:

Case 3.1: ŵℓ(Hj) > σ + ϵ′. It is straightforward to show that Hj is not σ-close to P, since we have:

∥Hj − P∥TV ≥ wℓ(Hj) ≥ ŵℓ(Hj) − ϵ′ > σ .

Case 3.2: ŵℓ(Hj) ≤ σ + ϵ′ and ŵj(Hℓ) ≤ κ′σ + ϵ′. Even though ŵℓ(Hj) is small in this case, we can indirectly deduce that Hj is not σ-close to P.
By the definitions of ŵj(Hℓ) and ŵℓ(Hj), we have:

∥Hℓ − Hj∥TV = |Hℓ(S_{ℓ,j}) − Hj(S_{ℓ,j})| ≤ |Hℓ(S_{ℓ,j}) − P(S_{ℓ,j})| + |P(S_{ℓ,j}) − Hj(S_{ℓ,j})| = wℓ(Hj) + wj(Hℓ) ≤ ŵℓ(Hj) + ŵj(Hℓ) + 2ϵ′ ≤ (κ′ + 1)σ + 4ϵ′ .

On the other hand, by the triangle inequality, we have:

∥P − Hj∥TV ≥ ∥P − Hℓ∥TV − ∥Hℓ − Hj∥TV ≥ W(Hℓ) − ∥Hj − Hℓ∥TV ≥ Ŵ(Hℓ) − ϵ′ − ∥Hℓ − Hj∥TV > (κ′ + 2)σ + 4ϵ′ − ((κ′ + 1)σ + 4ϵ′) = σ .

Therefore, in neither of the two cases above do we mark a σ-close distribution to P.

Second, we claim that we mark at least a (1 − 2η) fraction of the unmarked hypotheses in Q. Recall that when we computed W(Hℓ), we also sampled t hypotheses from the set of unmarked hypotheses, and we did not observe any edge ŵj(Hℓ) larger than κ′σ + ϵ′. By a very similar argument to the one above, with probability 1 − δ/(2n), there are at most η·|Q|·m/n ≤ η·|Q| hypotheses Hj in Q such that ŵj(Hℓ) > κ′σ + ϵ′; otherwise, we would have seen one of these edges, and W(Hℓ) would have been larger. By the union bound, this fact holds for every Hℓ. Hence, the if condition in Line 15 holds for at least |Q| − η·|Q| of the unmarked hypotheses in Q, and the algorithm marks them. If |Q| ≥ 1/η, it is easy to show that |Q| − η·|Q| is at least (1 − 2η)·|Q|; thus, we mark at least a (1 − 2η) fraction of the hypotheses in Q, as claimed. If instead |Q| ≤ 1/η, it is easy to show that |Q| < t. Hence, the algorithm involves all the Hj's in Q when computing W(Hℓ). Therefore, every single ŵj(Hℓ) is at most κ′σ + ϵ′, and we mark all the hypotheses in Q.

By the union bound over all the steps, our guarantees hold with probability 1 − δ. In addition, it is not hard to see that the running time of the algorithm is O((m·t + n)·Tq) = O(n·log(n/δ)·Tq/η).

Rounds of adaptivity: Our algorithm can be implemented in a constant number of rounds of adaptivity. To compute the W's, we can preselect the random Hj's. If Q contains fewer than t + 1 hypotheses, we include every ŵj(Hi) for each Hj ∈ Q and i ∈ [n].
Upon discovering Hℓ, another round of adaptivity allows us to query every ŵj(Hℓ) and ŵℓ(Hj) to calculate Ŵ(Hℓ) and mark the hypotheses in Q.

C Preliminary facts and lemmas

Fact C.1. For every distribution P and every event E with P(E) > 0, we have:

∥P − P|E∥TV = 1 − P(E) ,

where P|E denotes the probability distribution P conditioned on the event E.

Fact C.2. For every pair of distributions Hi and Hj, we have:

∥Hi − Hj∥TV = |Hi(Sij) − Hj(Sij)| .

The following fact is adapted from [DL01, MS08, ABS23].

Fact C.3. Suppose Hi∗ is the closest hypothesis to P in H (i.e., ∥Hi∗ − P∥TV = OPT). For every pair of hypotheses Hi and Hj, the following hold:

1. ∥Hi − P∥TV ≤ wj(Hi) + 2·∥Hj − P∥TV ,
2. ∥Hi − P∥TV ≤ wi∗(Hi) + 2·OPT ,
3. ∥Hi − P∥TV ≤ W(Hi) + 2·OPT .

Proof. The proof is based on the triangle inequality and the definitions of the wj(Hi)'s and OPT. For every Hj in H, we have:

∥Hi − P∥TV ≤ ∥Hi − Hj∥TV + ∥Hj − P∥TV = |Hi(Si,j) − Hj(Si,j)| + ∥Hj − P∥TV ≤ |Hi(Si,j) − P(Si,j)| + |P(Si,j) − Hj(Si,j)| + ∥Hj − P∥TV = wj(Hi) + wi(Hj) + ∥Hj − P∥TV ≤ wj(Hi) + 2·∥Hj − P∥TV .

Hence, Item 1 is proved. Now, if we set Hj to be Hi∗, then ∥Hj − P∥TV is equal to OPT, implying Item 2. Item 3 follows from Item 2 and the fact that wi∗(Hi) is upper bounded by W(Hi).

Fact C.4. Suppose we are given six parameters a, b, a′, b′ ∈ R≥0 and m, m′ ∈ Z≥0. If a ≤ a′, b ≤ b′, and m ≤ m′, then every (a, b, m)-seed is also an (a′, b′, m′)-seed.

Fact C.5. Suppose we have a set of n hypotheses H and a predicate R : H → {0, 1} that maps the hypotheses in H to zero or one. Assume H contains more than m hypotheses with R(H) = 1, for an arbitrary integer parameter 0 ≤ m < n. If we draw s ≥ 8·n·log(δ⁻¹)/m hypotheses from H uniformly at random, we will observe at least one hypothesis with R(H) = 1 with probability at least 1 − δ.

Proof. Let p denote the fraction of hypotheses in H with R(H) = 1. Observe that p > m/n, since there are at least m + 1 such hypotheses in H.
Let r denote the number of sampled hypotheses with R(H) = 1. Then, using the Chernoff bound, we have:

Pr[r = 0] ≤ Pr[r < s·p/2] ≤ exp(−s·p/8) ≤ δ .

D Data structure for marked and unmarked hypotheses

Here, we describe a simple data structure that allows us to efficiently keep track of marked and unmarked hypotheses. Our data structure consists of two integers, n and t, and two arrays of size n: index[] and list[]. Here, n represents the number of hypotheses that the data structure supports. The list array is an arbitrary ordering of the integers from 1 to n. Throughout the operations of this data structure, we preserve the following invariant: the value of index[i] indicates where to find element i in the list; in other words, list[index[i]] is always i. The integer t serves as a threshold quantity: the first t entries in the list correspond to unmarked hypotheses, while the remaining entries represent marked hypotheses. Initially, list contains the integers from 1 to n in ascending order, and index[i] is set to i for every hypothesis Hi. This data structure supports the following operations:

Initialization ds(size): To initialize the data structure, we set n and t to size. Next, we set list to the integers from 1 to n in ascending order. To ensure consistency with the list, each index[i] is set to i.

mark(i): To mark hypothesis Hi, we swap the element at position index[i] with the element at position t in the list array (updating index accordingly), and then decrement t by 1.

is_marked(i): Given our definition, if index[i] is less than or equal to t, then Hi is unmarked; otherwise, it is marked.

is_all_marked(): If t is equal to zero, then all the hypotheses are marked.

random_unmarked(): To select a random unmarked hypothesis, we generate a random integer r between 1 and t, and output the hypothesis at list[r].

random_marked(): To select a random marked hypothesis, we follow the same process as for an unmarked hypothesis, except that r is generated between t + 1 and n.
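The operations above translate directly into code. The following Python sketch is our own transcription (the class name and the choice to keep positions 1-based are ours); it maintains the invariant list[index[i]] = i and the threshold t separating unmarked from marked entries:

```python
import random

class MarkedSet:
    """Positions 1..t of `lst` hold the unmarked hypotheses, positions
    t+1..n the marked ones (1-based, matching the text). Invariant:
    lst[index[i]] == i for every hypothesis i."""
    def __init__(self, size):
        self.n = size
        self.t = size
        self.lst = list(range(size + 1))    # entry 0 is unused padding
        self.index = list(range(size + 1))

    def mark(self, i):
        if self.is_marked(i):
            return
        p, q = self.index[i], self.t        # swap i with element at position t
        j = self.lst[q]
        self.lst[p], self.lst[q] = j, i
        self.index[i], self.index[j] = q, p
        self.t -= 1                          # shrink the unmarked prefix

    def is_marked(self, i):
        return self.index[i] > self.t

    def is_all_marked(self):
        return self.t == 0

    def random_unmarked(self):
        return self.lst[random.randint(1, self.t)]

    def random_marked(self):
        return self.lst[random.randint(self.t + 1, self.n)]
```

Each mark is two array writes and two index updates, so beyond the O(n) initialization every operation is constant time.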
It is easy to see that the initialization takes O(n) time, and the rest of the operations take only O(1) time.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: Our results are a theoretical contribution to the field of learning theory. The paper, including the appendix, contains all the algorithms, theorem statements, and their proofs.
Guidelines:
- The answer NA means that the abstract and introduction do not include the claims made in the paper.
- The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
- The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
- It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: Please see the discussion in Section 1.2 regarding potential improvements and future directions.
Guidelines:
- The answer NA means that the paper has no limitation, while the answer No means that the paper has limitations but they are not discussed in the paper.
- The authors are encouraged to create a separate "Limitations" section in their paper.
- The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
- The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
- The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
- The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
- If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: We have clearly stated our model and our assumptions. All the proofs are included in the appendix.
Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
- Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [NA]
Justification: We do not have any experimental results.
Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example:
  4.1 If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
  4.2 If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
  4.3 If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
  4.4 We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [NA]

Justification: We do not have any experimental results.

Guidelines:
- The answer NA means that the paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so "No" is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [NA]

Justification: We do not have any experimental results.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [NA]

Justification: We do not have any experimental results.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [NA]

Justification: We do not have any experimental results.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?

Answer: [Yes]

Justification: Our results are a theoretical formulation of a fundamental problem in statistics and learning theory. They can be viewed as a computationally efficient version of existing results. Given the theoretical nature of our findings, we do not anticipate any harm to society.

Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA]

Justification: Given the theoretical nature of our findings, we do not anticipate broader societal impact for this paper.

Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: We do not have any experimental results.

Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model, or by implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks.
- The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited, and are the license and terms of use explicitly mentioned and properly respected?

Answer: [NA]

Justification: We did not use any existing assets, other than the publicly available articles that we have cited.

Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented, and is the documentation provided alongside the assets?

Answer: [NA]

Justification: The contributions of this paper are theoretical. As is standard in the field, we will make our results publicly accessible and do not anticipate generating revenue from them.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: We do not have any crowdsourcing experiments nor human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: We do not have any experiments involving crowdsourcing nor research with human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.