Hypothesis Selection with Memory Constraints

Maryam Aliakbarpour, Department of Computer Science, Rice University, Houston, TX 77005, maryama@rice.edu
Mark Bun, Department of Computer Science, Boston University, Boston, MA 02215, mbun@bu.edu
Adam Smith, Department of Computer Science, Boston University, Boston, MA 02215, ads22@bu.edu

Abstract

Hypothesis selection is a fundamental problem in learning theory and statistics. Given a dataset and a finite set of candidate distributions, the goal is to select a distribution that matches the data as well as possible. More specifically, suppose that we have sample access to an unknown distribution P over a domain X that we know is well-approximated by one of a class of n distributions (a.k.a. hypotheses), H := {H1, H2, . . . , Hn}. The goal is to design an algorithm that outputs a distribution Ĥ ∈ H whose total variation distance from P is nearly minimal.

In this work, we study the hypothesis selection problem under memory constraints. We consider a model where samples from P are presented in a stream and we access each sample x via PDF-comparison queries that allow us to compare the probability densities of any pair of hypotheses at the domain point x (i.e., is Hi(x) < Hj(x)?). This model allows us to study how much needs to be stored, at any point in time, about the portion of the stream seen so far. Our main result is an algorithm that achieves a nearly optimal tradeoff between memory usage and sample complexity. In particular, given b bits of memory (for b roughly between log n and n), our algorithm solves the hypothesis selection problem with s samples, where b · s = O(n log n). This result is optimal up to an O(log n) factor, for all b.

1 Introduction

Learning the probability density function of observed data is a fundamental question in statistics with numerous applications in machine learning. Variants of this problem have been studied for nearly a century. Hypothesis selection is a classic version of this problem where the goal is to learn a distribution within a pre-specified class. Let H := {H1, H2, . . . , Hn} be a class of n known distributions over the same domain X. Suppose that we have sample access to an unknown distribution P over X which is, or is very close to, a member of H. The goal is to design an algorithm that outputs a distribution Ĥ ∈ H whose total variation distance to P is close to that of the closest distribution in H. A great deal of effort has been dedicated to solving this problem using a number of samples that does not depend on the domain size |X|. Perhaps surprisingly, one can learn an unknown distribution P ∈ H with O(log n) samples, independent of the domain size. In contrast, learning an arbitrary distribution P over X requires Ω(|X|) samples. Refined guarantees for this problem have been studied extensively, building an understanding of the accuracy, sample complexity, and computational efficiency achievable, as well as the compatibility of hypothesis selection with robustness and privacy [Yat85, DL96, DL97, DL01, MS08, DDS12, DK14, DKS17, ADLS17, BKM19, CKM+19, BKSW19, GKK+20, ABH+20, BBK+21]. In this paper, we expand upon the emerging theory of learning with limited memory [SD15, Raz18, Sha14, DS18, DGKR19, SSV19, AMNW22, BBS22] by studying hypothesis selection from this new perspective.
We strive to answer the following question: given a working memory of b bits, how many data points (samples) do we need to solve the hypothesis selection problem? Prior work assumes that we can store all the samples in memory, which can be quite large when we aim to learn a distribution over extremely large objects, such as DNA sequences, gene expression data, text data, and brain image data. We consider a model where samples are processed one at a time in a stream. Similar to models in [GR22, CFG+23, AF23], we access each sample x via queries that allow us to compare the PDFs of the Hi's at the point x, and we measure the size of the memory retained between processing samples. Our main result is an algorithm that achieves a nearly optimal tradeoff between the memory size b and the number of samples s. Given b bits of memory (for b roughly between log n and n), our algorithm solves the hypothesis selection problem with s samples, where b · s = O(n log n). A result of Shamir [Sha14, Theorem 2] gives a class of n hypotheses H such that any algorithm that learns unknown distributions in H requires s · b = Ω(n). Our tradeoff is thus optimal up to an O(log n) factor.

1.1 Main result

Suppose we have a class of n known distributions H := {H1, H2, . . . , Hn} and an unknown distribution P. We aim to design a proper learning algorithm that outputs a distribution in H whose total variation distance to P is comparable to that of the closest distribution in H. The algorithm has access to a stream of i.i.d. samples drawn from P. At any given point, the algorithm can look at one sample in the stream, namely x, and query a PDF-comparator oracle by sending i, j ∈ [n] and receiving a bit indicating whether Hi(x) < Hj(x). This is equivalent to asking whether the sample x is in the Scheffé set of the two distributions Hi and Hj, which is defined as the set of elements to which Hj assigns higher probability than Hi. The algorithm can also discard the current sample and move on to the next one. In addition to having access to the samples of P, we have more direct access to the known distributions in H. The algorithm can query the probability masses of the Scheffé sets according to each Hi, as is customary in the literature [DL01, MS08]. This assumption can be relaxed; only estimates of such probabilities are needed, which can be obtained if we have sample access to the Hi's. Further details are available in Remark 3. To summarize our access model, our algorithm can make one of the following types of queries: 1) Request a new sample. 2) Query the PDF-comparator on the current sample x, asking whether Hi(x) < Hj(x) for any pair of indices i, j ∈ [n]. 3) Ask for the probability mass of the Scheffé set between Hi and Hj according to Hi, for any pair of indices i, j ∈ [n]. The accuracy of our output distribution is measured with respect to how far the best distribution in H is from P. We define OPT(H, P) := min_{H ∈ H} ∥P − H∥TV. When H and P are clear from context, we denote this simply by OPT. With this setup in mind, we define a proper learning algorithm (with promise) for hypothesis selection. The term promise refers to the extra parameter Γ that the algorithm receives as input: the learner may assume that OPT ≤ Γ. (See Remark 2 for the case where Γ is not provided.)
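To make the access model concrete, the following minimal Python sketch implements the three query types for discrete hypotheses. All names here are our own illustration (nothing below is from the paper's implementation); hypotheses are represented as dictionaries mapping domain points to probabilities.

```python
class HypothesisSelectionOracle:
    """Illustrative sketch of the access model over a discrete domain.
    Each hypothesis is a dict mapping domain points to probabilities."""

    def __init__(self, sample_from_P, hypotheses):
        self.sample_from_P = sample_from_P  # callable: () -> one sample from P
        self.H = hypotheses                 # list of dicts: H[i][x] = H_i(x)
        self.x = None                       # the current stream element

    def next_sample(self):
        # Query type 1: advance the stream; the previous sample is discarded.
        self.x = self.sample_from_P()

    def pdf_compare(self, i, j):
        # Query type 2: is H_i(x) < H_j(x) at the current sample x?
        return self.H[i].get(self.x, 0.0) < self.H[j].get(self.x, 0.0)

    def scheffe_mass(self, k, i, j):
        # Query type 3: the probability that H_k assigns to the Scheffé set
        # S(H_i, H_j) = {x : H_i(x) < H_j(x)}.
        Hi, Hj, Hk = self.H[i], self.H[j], self.H[k]
        domain = set(Hi) | set(Hj) | set(Hk)
        return sum(Hk.get(x, 0.0) for x in domain
                   if Hi.get(x, 0.0) < Hj.get(x, 0.0))
```

Later snippets in this document reuse this toy interface.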
Definition 1.1. Suppose A has sample access to an unknown distribution P, can query the probabilities of the Scheffé sets according to each Hi, and can query a PDF-comparator for every pair of hypotheses in H. Assume A is given an additional input parameter Γ > 0. We say algorithm A is an (α, ϵ, δ)-proper learner with a promise for H if the following holds: for every distribution P, if Γ ≥ OPT(H, P), then, with probability at least 1 − δ over its input sample (drawn i.i.d. from P) and its internal coin tosses, A outputs a distribution Ĥ ∈ H such that:

∥P − Ĥ∥TV ≤ α · Γ + ϵ.   (1)

One can generally reduce ϵ and δ by increasing the number of samples taken or the time spent, whereas the value of OPT, and hence Γ, is inherent to the problem instance. Therefore, it is more important and challenging to design algorithms that minimize the multiplicative parameter α.

Main theorem: Our main result is a proper learner (with promise) that obtains a nearly optimal tradeoff between memory and data. Formally, we have the following theorem:

Theorem 1. There exists a constant c for which the following holds. Let H be an arbitrary class consisting of n distributions, and let ϵ > 0. For every natural number b where c · (log n + log((log n)/ϵ)) ≤ b ≤ n, there exists an (α = 9, ϵ, δ = 0.1)-proper learner (with promise) for H using b bits of memory and the following number of samples:

s = O( n log n / (b · ϵ²) ) when b ≤ n/(log log n), and s = O( log n · log log n / ϵ² ) when n/(log log n) ≤ b ≤ n.

Roughly speaking, this theorem says that given b bits of memory that suffice to perform basic operations such as keeping track of indices and counting samples (i.e., b = Ω(log n + log((log n)/ϵ))), then s = O(n log(n)/(b · ϵ²)) samples suffice to solve the hypothesis selection problem. A result of Shamir [Sha14, Theorem 2] gives a class of n hypotheses H such that every algorithm that learns a random P in H requires s · b = Ω(n). Our tradeoff is thus optimal up to an O(log n) factor. We speculate that our result is tight even for the class considered in [Sha14], but we leave the proof of a tighter lower bound to future work.

Remark 2. Both versions of the hypothesis selection problem (with and without a promise that OPT ≤ Γ) have been extensively studied (e.g., [DL01, DK14]). For simplicity, we present our result when Γ is given a priori. Section C describes a reduction that yields a similar result when Γ is not provided; the accuracy guarantee is analogous to that of Equation 1, but with a slightly larger α.

1.2 Overview of our key ideas and other results

Background on comparing two hypotheses: Like many previous hypothesis selection algorithms, our algorithm relies on the ability to compare two hypotheses H1 and H2 based on the probabilities of their Scheffé set S. We estimate P(S) and declare the winner to be the Hi for which Hi(S) is closer to P(S). This natural approach works especially well when P ∈ {H1, H2} or when P is much closer to one of the two hypotheses. Intuitively, a near-optimal hypothesis H should win many comparisons against other hypotheses. A typical hypothesis selection algorithm will run a tournament to identify such an H. The primary advantage of using Scheffé-based comparisons is that they are inexpensive to perform: only O(log n/ϵ²) samples are needed to estimate the probability masses of all Scheffé sets within an error of ϵ. This exceptional sample efficiency enables hypothesis selection algorithms to circumvent the lower bound of Ω(|X|/ϵ²) for distribution learning in cases where we have prior knowledge that P is close to a particular class of distributions. That being said, Scheffé-based comparisons are not straightforward to use.
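As a toy illustration of a single Scheffé-based comparison (our own numbers, chosen only for concreteness):

```python
# Two hypothetical hypotheses on the domain {1, 2, 3}.
H1 = {1: 0.5, 2: 0.3, 3: 0.2}
H2 = {1: 0.2, 2: 0.3, 3: 0.5}

# Scheffé set S(H1, H2) = {x : H1(x) < H2(x)} = {3}.
S = [x for x in H1 if H1[x] < H2[x]]

# Suppose our estimate of P(S) from samples is 0.25.  We declare the
# winner to be the hypothesis whose mass on S is closer to P(S):
# |H1(S) - 0.25| = 0.05 and |H2(S) - 0.25| = 0.25, so H1 wins.
p_hat = 0.25
mass = lambda H: sum(H[x] for x in S)
winner = "H1" if abs(mass(H1) - p_hat) <= abs(mass(H2) - p_hat) else "H2"
```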
One challenge is that these comparisons are not ideal: even if our estimates of the probabilities of the Scheffé sets were perfect, we might pick the hypothesis that is further from P in total variation distance (because S is just one event and might not ideally distinguish P from either H1 or H2). Furthermore, comparisons are not transitive (again, because we measure only Scheffé set probabilities). Essentially the only useful property they have is this: if H1 is Γ-close to P, it will win; if H1 is much further from P (by a constant factor) than both Γ and ∥H2 − P∥TV, then it will lose. These issues make the outputs of comparisons hard to interpret and complicate their use for selection.

Our terminology: To elaborate on our approach, we start by defining acceptable hypotheses and our general terminology for referring to the quality of a hypothesis. We partition the hypotheses into three groups: 1) Excellent hypotheses, whose distance to P is at most Γ. 2) Decent hypotheses, whose distance to P is larger than Γ but smaller than 3 · Γ + ϵ. These thresholds are set in such a way that a decent hypothesis may win against an excellent one. Decent hypotheses are challenging to deal with because they may fool us into believing that an excellent hypothesis is not a good output, yet later they may lose against a very far hypothesis. 3) Unacceptable hypotheses, whose distance to P is larger than 3 · Γ + ϵ; these will lose to every excellent hypothesis. We say a hypothesis is acceptable if it is either excellent or decent. Generally, the goal is to find an acceptable hypothesis.

1.2.1 Random ladder tournament

The main technical tool we introduce is a simple algorithm we call the random ladder tournament, which is inspired by ladder tournaments, a type of tournament used in games and sports. Our algorithm finds an acceptable hypothesis using a linear number of comparisons. The algorithm proceeds in rounds as follows. Holding on to a single hypothesis in each round, we sample a uniformly random hypothesis from H and compare it to the current hypothesis. The winner proceeds to the next round. Our novel analysis of this tournament shows that after O(|H|) rounds, either the final winner produced at the end of the tournament or a winner selected randomly along its trajectory will be acceptable with high probability.

Proof overview: As previously mentioned, comparisons are often far from ideal; even when the excellent hypothesis wins at a certain step, there is no guarantee that it remains the winner till the end. To evade this issue, we utilize our knowledge of the upper bound Γ on OPT and modify the comparisons in a way that ensures the excellent hypothesis never loses once it wins. In our comparisons, we favor the current winner: if the distance of the current winner to P on the Scheffé set is at most Γ, we declare the current winner to be the winner (even when the opponent hypothesis seems closer). Clearly, the excellent hypothesis will never be more than Γ away from P on any set; hence, it will never lose again. However, what happens when the excellent hypothesis never wins? At each step, we pick a hypothesis from H uniformly at random, so the excellent hypothesis is expected to be chosen a constant number of times over the course of the algorithm. The fact that the random hypothesis loses to the current winners of those steps indicates that those winners must be acceptable hypotheses; otherwise, the excellent hypothesis would have won against them.
Additionally, the fact that the excellent hypothesis appears as the random hypothesis in a constant number of steps and loses in all of them confirms that a constant fraction of the current winners are indeed acceptable. By combining these observations, we prove that the random ladder tournament will either find the excellent hypothesis at the end or encounter many acceptable hypotheses along the way.

Implementing the tournament with memory constraints: This algorithm is a highly effective technical tool for solving our problem, and it can be implemented in a very memory-efficient manner. The algorithm processes the hypotheses in an online fashion, and it does not need to memorize the result of any past comparison. At any given moment, it only remembers two hypotheses: the current hypothesis and the next randomly drawn hypothesis. If we draw fresh samples for every comparison, we can implement this algorithm by storing essentially O(1) numbers but sampling O(n log(n)/ϵ²) data points. On the other extreme, we can implement this algorithm with very little data, i.e., O(log n/ϵ²) samples, but a large amount of memory, by storing an appropriate summary of the samples (e.g., the probabilities of all the Scheffé sets, in O(n²) bits). We leverage this flexibility in the memory usage of the random ladder tournament and provide an algorithm that only uses b bits of memory. We perform the comparisons in the random tournament in blocks of size t. For each block of comparisons, we draw new samples and store a summary of them. We pick the parameter t in such a way that the summary fits in b bits and suffices to perform the next t comparisons (a sketch of this blocked implementation appears at the end of this subsection). For example, if t is roughly √b, we can store the number of samples in all the Scheffé sets of t hypotheses in b bits; this summary is enough to perform any comparison between those hypotheses. Combined with a novel summary based on sorted lists, described in Section 2.2.2, we get a tradeoff of the form s · b ≤ O(n log³(n)/ϵ⁴). In Section 3.2, we present this tradeoff alongside another one (with better dependence on ϵ but worse dependence on log n). Both of these tradeoffs are worse than our main result, in the sense that s · b is larger by a factor of poly(log n, 1/ϵ). However, these results yield more accurate proper learners: for the random ladder tournament, α = 3, while in our main result α = 9.

Simpler linear-time selection: The random ladder tournament leads to new results for the hypothesis selection problem outside the scope of our original motivation for this paper. If we ignore memory constraints and store every sample, the random ladder tournament algorithm is simultaneously near-optimal in terms of number of samples, time, and accuracy; to our knowledge, it is the first such algorithm. Specifically, it makes a linear number of comparisons and uses O(log n/ϵ²) samples from P, both of which are optimal for worst-case choices of H. Moreover, it finds a hypothesis with optimal accuracy under a promise. More precisely, it finds a hypothesis in H that is roughly 3 · Γ-close to P, given a parameter Γ that is guaranteed to be at least OPT. This is essentially optimal, as no proper learning algorithm can achieve accuracy better than 3 · OPT [BKM19], even when Γ = OPT is given to the algorithm. Furthermore, with a simple adjustment to our algorithm, one can solve the hypothesis selection problem without any knowledge of Γ in nearly linear time, with roughly the same number of samples, and obtain a (5 · OPT)-close hypothesis in H. See Section C.
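As promised above, here is a minimal sketch of the blocked implementation of the tournament. It is our own illustration under simplifying assumptions (a uniform meta distribution, the toy oracle from Section 1.1, and the comparison rule of Algorithm 1 in Section 2.2.1); the paper's actual implementation is analyzed in Section 3.2.

```python
import random

def ladder_with_blocks(oracle, n, t, k, m, gamma, eps):
    """Process the k random opponents in blocks of roughly t hypotheses.
    Each block gets m fresh samples, summarized only by the O(t^2)
    Scheffé counts among the block members (plus the carried-over winner)."""
    winner, drawn = None, 0
    while drawn < k:
        block = [winner] if winner is not None else []
        while len(block) < t and drawn < k:
            block.append(random.randrange(n))      # next random opponent
            drawn += 1
        counts = {}                                # Scheffé counts for the block
        for _ in range(m):
            oracle.next_sample()
            for a in block:
                for b in block:
                    if a != b and oracle.pdf_compare(a, b):
                        counts[(a, b)] = counts.get((a, b), 0) + 1
        for b in block:                            # run the ladder inside the block
            if winner is None:
                winner = b
            elif b != winner:
                q_hat = counts.get((winner, b), 0) / m   # estimate of P(S(W, H_b))
                p1 = oracle.scheffe_mass(winner, winner, b)
                p2 = oracle.scheffe_mass(b, winner, b)
                # COMPARE, favoring the current winner:
                if not (abs(p1 - q_hat) <= gamma + eps / 2
                        or abs(p1 - q_hat) <= abs(p2 - q_hat)):
                    winner = b
    return winner
```

The point of the blocking is visible in the code: only `counts` (O(t²) numbers) and two indices persist between samples, so the working memory is governed by t rather than by the total number of comparisons.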
1.2.2 Improved tradeoff by hypothesis filtration

The flexibility of the random ladder tournament already yields memory-data tradeoffs, but these are suboptimal by a factor of poly(log n, 1/ϵ). To improve the result, we design an algorithm, called filter, that picks an acceptable hypothesis with a modest chance. While this chance is not high, we can use the output of the filtration and run the random ladder tournament on these filtered hypotheses. Since the quality of the input hypotheses in this second round is better, the random ladder tournament can effectively be run in fewer rounds, resulting in a better tradeoff. Hence, we obtain our main result. See Section B.4 for the proof of our main results. Below, we give a short description of the filter algorithm; we defer the full description and the proofs to Section B.3.

Hypothesis filtration: The best approach to finding an acceptable hypothesis drastically depends on the quality of the hypotheses in H. For example, if there is no decent hypothesis in H, any single-elimination tournament will easily find an excellent hypothesis, due to the fact that any comparison involving at least one excellent hypothesis will then declare the excellent hypothesis the winner. On the other hand, if there are a lot of decent hypotheses, then there is a decent chance that a randomly selected hypothesis is decent, and hence acceptable. The filtration algorithm works by combining these two observations. The algorithm randomly performs one of the following, each with probability 1/2: 1) It draws a set of random hypotheses and, assuming there is no decent hypothesis in the set, runs a group elimination tournament (see below) and outputs the result. 2) It outputs a random hypothesis in H. One can show that if decent hypotheses are scarce in H, we have a decent chance of getting an excellent hypothesis and no decent hypothesis in the random set; hence, we find an excellent hypothesis via our group elimination algorithm. On the other hand, if decent hypotheses are abundant, then we have a decent chance of picking one in the second step. By selecting each strategy with probability 1/2, we preserve half of the success probability in whichever case holds. While this probability is not large, it is larger than 1/n (the probability of picking the excellent hypothesis from H). We take advantage of this fact and show that one can run the random ladder tournament in fewer rounds and obtain the desired result. See Section B.3 for more details.

Group elimination tournament: Our second technical tool is an algorithm that finds an excellent hypothesis if no decent hypothesis exists in H. This algorithm is a combination of a single-elimination tournament and an all-go-against-all tournament. At every step, we partition the hypotheses into multiple groups, run a previously known all-go-against-all tournament within each group (say, the minimum distance estimate [DL01, MS08]), and then send the winner of each group to the next round. If a group contains an excellent hypothesis, then in an all-go-against-all tournament we will pick an excellent hypothesis as the winner. Hence, an excellent hypothesis will bubble up in each round and be declared the final winner. With a careful choice of group sizes and number of rounds, this algorithm can be implemented using O(n) bits of memory and O(log n) samples, under the assumption that the hypotheses are indexed from 1 to n.
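A compact sketch of these two components, under our simplifying assumptions (fixed group size, whereas the actual algorithm grows the group size between rounds; `all_go_against_all` is a stub for a standard tournament such as the minimum distance estimate, one version of which is sketched in Section 2.2.3):

```python
import random

def group_elimination(oracle, candidates, group_size, m):
    """Knockout tournament over groups: within each group run an
    all-go-against-all tournament, and only the winner advances."""
    survivors = list(candidates)
    while len(survivors) > 1:
        next_round = []
        for g in range(0, len(survivors), group_size):
            group = survivors[g:g + group_size]
            next_round.append(all_go_against_all(oracle, group, m))
        survivors = next_round
    return survivors[0]

def filter_step(oracle, n, subset_size, group_size, m):
    """The filter's coin flip: a random hypothesis is good when decent
    hypotheses are abundant; group elimination on a random subset is
    good when they are scarce."""
    if random.random() < 0.5:
        return random.randrange(n)
    subset = [random.randrange(n) for _ in range(subset_size)]
    return group_elimination(oracle, subset, group_size, m)
```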
A technical challenge that arises, however, is that we need to run this algorithm on a subset of roughly O(b) random hypotheses (so that we can run it within a memory bound of b). However, storing b random indices already requires roughly b · log n bits of space. We show that we can instead draw indices pseudorandomly, using a pairwise independent generator that can be implemented in small space. See Section B.1 for more details.

1.3 Related work

Prior to our work, numerous studies have investigated the problem of hypothesis selection in various settings. In their seminal work, Yatracos [Yat85] and Devroye and Lugosi [DL96, DL97, DL01] present algorithms that find a close distribution in H based only on estimates of the probabilities of the Scheffé sets of pairs of hypotheses in H. These approaches show that we only need accurate estimates of the probability that P assigns to O(|H|²) = O(n²) sets, resulting in a sample complexity proportional to O(log n) instead of Ω(|X|). For a comprehensive overview, see Chapter 6 in [DL01]. Mahalanabis and Stefankovic [MS08] provide an algorithm with linearly many probes to P and α = 3, but they require an exponential-time pre-processing of the class H. (It is worth noting that we do not have any pre-processing step in this paper.) Daskalakis and Kamath [DK14] give a nearly-linear time algorithm that, given an upper bound Γ on OPT, returns Ĥ such that ∥Ĥ − P∥TV ≤ α · Γ + ϵ for some large constant α = 512. Acharya et al. [AJOS14, AFJ+18] provide an O(n log n)-time algorithm that finds a hypothesis with an accuracy guarantee of α = 9. For proper learners, Bousquet et al. [BKM19] showed that the best α in Equation 1 that one can hope for is α = 3, as long as the sample complexity does not grow with the domain size |X|. Aamand et al. [AAC+23] also study statistical-computational (time-data) tradeoffs for hypothesis selection over discrete domains. Unlike our work, their setting allows pre-processing of the class H in polynomial time. There is a long history of research focusing on a special case of this problem where H is a specific structured class of distributions, such as mixtures of Gaussians [KMV12, DK14, SOAJ14, KSS18, ABM18, ABH+20], histograms [Pea95, CDSS14, CDKL22], and polynomials [ADLS17]. The abstract hypothesis selection problem we study here is commonly used as a subroutine in solving these problems (usually in conjunction with some sort of cover method). For a survey of results, see [Dia16]. Lastly, there is another field of study that tackles a question similar to ours: the problem of sorting items with noisy comparisons. One can view hypothesis selection as the task of finding the minimum item. With the exception of [AFJ+18], discussed above, these probabilistic noise models do not capture the geometric structure present in our problem. Therefore, none of these results yields an immediate solution to our problem.

2 Preliminaries

2.1 Notation

Let [n] denote the set {1, 2, . . . , n}. Suppose that we have a probability distribution P over a domain X. For an element x in X, P(x) denotes the probability of x. We assume X is the domain of all the probability distributions in this article. For any subset S of the domain, P(S) denotes the sum of the probabilities of all the elements in S. We denote the total variation distance between two distributions P1 and P2 by ∥P1 − P2∥TV, defined as max_{S ⊆ X} |P1(S) − P2(S)|, where S ranges over measurable subsets of the domain. We say P is ϵ-close to Q iff ∥P − Q∥TV is at most ϵ.
Also, we say P is ϵ-far from Q iff ∥P − Q∥TV is greater than ϵ. We denote the Scheffé set of two distributions as follows:

S(H1, H2) := {x ∈ X | H1(x) < H2(x)}.

2.2 Basic tools

In this section, we describe the primary tools used throughout this article. First, we present a comparison algorithm that allows us to compare two hypotheses. Next, we use the fact that one can estimate the probabilities of the O(n²) Scheffé sets up to error c · ϵ (for some small constant c < 1), with probability at least 1 − δ, using O(log(n/δ)/ϵ²) samples from P.

2.2.1 Comparing two hypotheses

Our algorithms are built on a basic operation that compares two hypotheses H1 and H2 based on the probabilities of their Scheffé sets. We pick the hypothesis that appears to be closer to P and declare it the winner (and the other hypothesis the loser). The challenging part is that these comparisons are not ideal: even when our estimates of the Scheffé set probabilities are accurate, we may pick the hypothesis that is further from P. However, one can guarantee that if the further of H1 and H2 is much further from P than the closer one, the comparison procedure will never select it as the winner. The algorithm is given three parameters: 1) Γ: an upper bound on OPT; 2) ϵ: indicating that the estimation error for the Scheffé set probabilities is Θ(ϵ); and 3) the confidence parameter δ. Upon invocation, the algorithm determines the result of the comparison with the guarantees formalized in the following lemma:

Lemma 2.1. Upon receiving three parameters Γ, ϵ, and δ, Algorithm 1 uses O(log(1/δ)/ϵ²) samples and satisfies the following guarantees with probability at least 1 − δ: If H1 is Γ-close to P, then the algorithm returns H1. If H1 is (max(Γ, ∥H2 − P∥TV) + 2 · ∥H2 − P∥TV + ϵ)-far from P, then the algorithm returns H2.

The proof of this lemma is provided in Section D.

Algorithm 1 Choosing between two hypotheses
1: procedure COMPARE(H1, H2, Γ, ϵ, δ, sample access to P)
2:   Draw m = Θ(log(1/δ)/ϵ²) samples from P.
3:   q̂ ← fraction of samples from P in S(H1, H2), computed using the PDF-comparator.
4:   p1 ← H1(S(H1, H2)).
5:   p2 ← H2(S(H1, H2)).
6:   ϵ′ ← ϵ/2.
7:   if |p1 − q̂| ≤ Γ + ϵ′ or |p1 − q̂| ≤ |p2 − q̂| then
8:     Output H1. ▷ H1 is the winner.
9:   else
10:    Output H2. ▷ H2 is the winner.

Remark 3. While we have assumed access to the exact probabilities of the Scheffé sets in the above algorithm, this assumption is not necessary. In fact, one can obtain guarantees similar to Lemma 2.1 as long as we have estimates of these probabilities up to accuracy c · ϵ, for a sufficiently small constant c < 1. Hence, our result can be adjusted to the setting where we have sample access to the Hi's instead.

Our terminology based on comparisons: We say a distribution H is excellent iff ∥H − P∥TV ≤ Γ. We usually use H∗ to denote an excellent hypothesis. We say a distribution H is decent iff ∥H − P∥TV is greater than Γ but smaller than 3 · Γ + ϵ; this implies that a decent hypothesis may win against an excellent hypothesis H∗ ∈ H. We say a distribution H is acceptable iff it is excellent or decent, and unacceptable iff it is not acceptable. Note that it is impossible for an unacceptable hypothesis to win against an excellent hypothesis with probability greater than δ. Throughout this article, when we say valid comparisons, we refer to the event in which the Scheffé sets are estimated accurately, so that the conditions of Lemma 2.1 hold. Since there are only O(n²) Scheffé sets, using O(log(n/δ)/ϵ²) samples is enough to ensure that all comparisons are valid with probability at least 1 − δ.
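The following is a direct Python transcription of Algorithm 1 (our sketch, built on the toy oracle interface from Section 1.1; indices i and j stand in for H1 and H2):

```python
def compare(oracle, i, j, m, gamma, eps):
    """COMPARE: declare a winner between H_i and H_j from m fresh
    samples, favoring H_i when it looks gamma-close to P on the
    Scheffé set (so an excellent current winner never loses)."""
    in_S = 0
    for _ in range(m):                      # m = Theta(log(1/delta)/eps^2)
        oracle.next_sample()
        if oracle.pdf_compare(i, j):        # is the sample in S(H_i, H_j)?
            in_S += 1
    q_hat = in_S / m                        # estimate of P(S(H_i, H_j))
    p1 = oracle.scheffe_mass(i, i, j)       # H_i(S(H_i, H_j))
    p2 = oracle.scheffe_mass(j, i, j)       # H_j(S(H_i, H_j))
    eps_prime = eps / 2
    if abs(p1 - q_hat) <= gamma + eps_prime or abs(p1 - q_hat) <= abs(p2 - q_hat):
        return i                            # H_i is the winner
    return j                                # H_j is the winner
```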
2.2.2 Space-efficient sample summaries for comparisons within a batch of hypotheses

To obtain sample efficiency, we need to retain enough information about the samples to enable reusing the same set of samples for multiple comparisons. Below, we describe two approaches to summarizing the sample set.

Scheffé counts. Observe that to compare two hypotheses H1 and H2, we only use the count of the number of samples in their Scheffé set S(H1, H2). Hence, instead of storing samples directly, for every pair of hypotheses in H we store the number of samples in their Scheffé set. We refer to these numbers as the Scheffé counts. If we have m samples, we can store all O(n²) Scheffé counts using O(n² log m) bits.

Storing each sample via a sorted list. In this approach, we store a succinct summary of each sample that allows us to infer its membership in each Scheffé set. This information suffices to perform the comparisons we need later. To store a sample x, we save an ordered list of hypotheses, sorted according to the probability of x (the PDF at the point x). That is, we store a list of indices i1, i2, . . . , in such that H_{i1}(x) ≤ H_{i2}(x) ≤ · · · ≤ H_{in}(x) (ties broken by hypothesis index). The summary of a single sample can be computed in O(n log n) time and takes O(n log n) bits of space to store. Now, to compare the PDFs of two hypotheses Hi and Hj, we can simply find i and j in the list and check whether i precedes j.² Alternatively, if we want to compare PDFs faster, we can store the indices in another list, call it d_x, that maps indices to list positions: we set d_x[iℓ] = ℓ for every ℓ ∈ [n]. In this case, to compare the PDFs, we can simply check whether d_x[i] < d_x[j].

²It is worth noting that we may not answer the PDF-comparison correctly when Hi(x) = Hj(x). However, since x does not contribute to the discrepancy between Hi and Hj, its inclusion in (or exclusion from) the Scheffé set is inconsequential.

2.2.3 All-go-against-all tournament

There are standard techniques in the literature to solve the hypothesis selection problem when estimates of all the Scheffé sets are available; see Chapter 6 in [DL01] and [MS08]. These approaches show that we only need accurate estimates of the probability that P assigns to O(|H|²) = O(n²) sets; therefore, the sample complexity can be proportional to O(log n) instead of O(|X|). For a formal statement, see Fact D.1.
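For concreteness, here is one standard instantiation in the spirit of the minimum distance estimate of [DL01, MS08] (our sketch; the precise guarantee the paper relies on is Fact D.1): each hypothesis is scored by how much its own Scheffé-set masses deviate from the empirical estimates, and the hypothesis with the smallest deviation wins.

```python
def all_go_against_all(oracle, group, m):
    """Estimate every Scheffé-set probability within the group from m
    fresh samples, then output the hypothesis whose own Scheffé masses
    deviate least from the estimates."""
    counts = {(i, j): 0 for i in group for j in group if i != j}
    for _ in range(m):
        oracle.next_sample()
        for (i, j) in counts:
            if oracle.pdf_compare(i, j):
                counts[(i, j)] += 1
    score = {}
    for i in group:
        score[i] = max((abs(oracle.scheffe_mass(i, i, j) - counts[(i, j)] / m)
                        for j in group if j != i), default=0.0)
    return min(score, key=score.get)
```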
3 Random ladder tournament

In this section, we focus on the random ladder tournament, which is a proper learner (with promise) of P in a finite class of n hypotheses H. The algorithm is given a parameter Γ as an upper bound on OPT(H, P). It outputs Ĥ ∈ H such that, with high probability,

∥Ĥ − P∥TV ≤ 3 · Γ + ϵ.

Bousquet et al. [BKM19] have shown that no proper learner can achieve an error better than 3 · OPT + ϵ unless the sample complexity grows with the domain size |X|. Their lower bound holds even when the algorithm is provided with Γ = OPT. Hence, the factor α = 3 above is optimal. For clarity of presentation, in Section 3.1 we present the random ladder tournament assuming that: 1) the confidence parameter is a small constant (δ = 0.1); and 2) the algorithm can call the COMPARE subroutine, without specifying how this subroutine is implemented. In Section A.1, we show that one can modify this algorithm to work for an arbitrarily small confidence parameter δ. Also, in Section 3.2, we discuss how one can implement this algorithm with memory constraints: we use the sample summary approaches described in Section 2.2.2 and obtain two (sub-optimal) memory-data tradeoffs for this algorithm.

3.1 Random ladder tournament with no memory constraints

In this section, we present the random ladder tournament, which solves the hypothesis selection problem when a parameter Γ is given to the algorithm as an upper bound on OPT. For this result, we assume there is a meta distribution over H, denoted by D, with a non-negligible probability p0 of drawing a hypothesis that is OPT-close. The algorithm considers hypotheses one at a time, where at step i we draw Ri from the meta distribution. While later we use this algorithm with other meta distributions, it may help the reader to view the meta distribution as the uniform distribution over the set of n hypotheses H. In this case, a random hypothesis drawn from H is guaranteed to be Γ-close to P with probability p0 ≥ 1/n. Our algorithm finds a sufficiently close hypothesis using Θ(1/p0) comparisons. At a high level, our algorithm goes through a list of hypotheses randomly drawn from D, compares them, and keeps track of the current winner hypothesis. We denote the current winner at step i by Wi. Initially, we start with W0 equal to a fake hypothesis that loses to any other hypothesis. Then at every step, we take a randomly drawn hypothesis Ri and compare it with Wi−1; we set the current winner of step i, Wi, to be the winner of the comparison between Wi−1 and Ri. We show that after k = Θ(1/p0) steps, either the final winner Wk is a Γ-close hypothesis, or many of the Wi's that we have encountered are close to P. That is, either Wk is an excellent hypothesis or a random Wi is an acceptable choice. To exploit this fact, at every step we add Wi to a list Q with some small probability. At the end, we also add Wk to Q. We prove that a random hypothesis in Q will be close to P with high probability. Algorithm 2 shows the formal description of our approach, and we prove its performance in Theorem 4. Figure 1 illustrates a visual representation of the algorithm.

Algorithm 2 Random ladder tournament
1: procedure RANDOM-LADDER-TOURNAMENT(H, p0, Γ, ϵ, sample access to P and D)
2:   k ← 2000/p0
3:   Q ← an empty list.
4:   W0 ← a dummy hypothesis that always loses.
5:   for i = 1, 2, . . . , k do
6:     Ri ← a random hypothesis drawn from D.
7:     Wi ← COMPARE(Wi−1, Ri, Γ, ϵ, 1/(100 · k)).
8:     Add Wi to Q with probability p0.
9:   Add Wk to Q.
10:  Output a random hypothesis in Q.

Theorem 4. Suppose we can draw i.i.d. random hypotheses from an arbitrary meta distribution D over H, and we have a distribution P that we aim to learn properly in H. Assume we are given parameters p0 and Γ such that the probability that a random hypothesis is Γ-close to P is at least p0. For any ϵ ∈ (0, 1), Algorithm 2 is an (α = 3, ϵ, δ = 0.1)-proper learner (with promise) for the class H.

Proof. First, note that each comparison invoked in Line 7 satisfies the properties of Lemma 2.1 with probability at least 1 − 1/(100 · k). Hence, by Lemma 2.1 and the union bound, with probability 0.99 all the comparisons satisfy the following: If Wi is Γ-close to P, it will not lose; it remains the current winner for the rest of the steps: Wi = Wi+1 = · · · = Wk. If Wi−1 is (3 · Γ + ϵ)-far from P and Ri is Γ-close to P, then Wi is equal to Ri. For the rest of this proof, we fix the responses of the comparisons for which the above conditions hold.
The randomness used in the rest of this proof depends only on the randomness of the meta distribution over hypotheses and the internal coin tosses of the algorithm. Our goal is to show that most of the hypotheses in the list Q are acceptable, so a randomly selected hypothesis in Q will be acceptable as well. Our first step is to show that the expected number of acceptable hypotheses in Q is high compared to the size of Q. For each i ∈ [k], we define an indicator random variable Ii that is one if we add an acceptable hypothesis to Q at step i in Line 8, and zero otherwise. Also, we define another indicator variable I_{k+1} corresponding to the event that the final Wk, which we added to Q in Line 9, is an acceptable hypothesis. More formally, we have:

Ii := 1[Wi is acceptable, and we add Wi to Q in Line 8] for i ∈ [k],   I_{k+1} := 1[Wk is acceptable].

Clearly, the sum of the Ii's is the number of acceptable hypotheses in Q, so we focus on finding a lower bound for the expected value of this quantity. Let αi be the probability of Wi being acceptable. Without loss of generality, we set α0 equal to zero. It is not hard to see that E[Ii] = p0 · αi for i ∈ [k]. Moreover, for the last indicator variable, we have:

E[I_{k+1}] = Pr[Wk is acceptable] ≥ Pr[∥Wk − P∥TV ≤ Γ] ≥ Σ_{i=1}^{k} Pr[Wi−1 loses to Ri, and ∥Ri − P∥TV ≤ Γ, and Ri does not lose to R_{i+1}, R_{i+2}, . . . , Rk].

Assuming the pairwise comparisons are done perfectly, Ri being Γ-close to P automatically implies that Ri does not lose to any of the hypotheses R_{i+1}, . . . , Rk. Thus, we have:

E[I_{k+1}] ≥ Σ_{i=1}^{k} Pr[Wi−1 loses to Ri, and ∥Ri − P∥TV ≤ Γ] = Σ_{i=1}^{k} Pr[Wi−1 loses to Ri | ∥Ri − P∥TV ≤ Γ] · Pr[∥Ri − P∥TV ≤ Γ].

Note that the probability of ∥Ri − P∥TV ≤ Γ is at least p0. Now, given that Ri is Γ-close to P, it will certainly win against any hypothesis that is (3 · Γ + ϵ)-far from P. Thus, the event that Wi−1 is (3 · Γ + ϵ)-far must have lower probability than the event that Wi−1 loses to Ri. Therefore, we obtain the following lower bound:

E[I_{k+1}] ≥ Σ_{i=1}^{k} Pr[∥Wi−1 − P∥TV > 3 · Γ + ϵ | ∥Ri − P∥TV ≤ Γ] · p0 = Σ_{i=1}^{k} Pr[Wi−1 is unacceptable | ∥Ri − P∥TV ≤ Γ] · p0.

Recall that we pick Ri independently of all the previous hypotheses R1, R2, . . . , Ri−1, so Wi−1 is independent of Ri. This implies:

E[I_{k+1}] ≥ Σ_{i=1}^{k} Pr[Wi−1 is unacceptable] · p0 = Σ_{i=1}^{k} (1 − αi−1) · p0.

Putting it all together, the expected value of the sum of the Ii's is:

E[Σ_{i=1}^{k+1} Ii] ≥ Σ_{i=1}^{k} p0 · αi + Σ_{i=1}^{k} (1 − αi−1) · p0 = k · p0 + p0 · (αk − α0) ≥ k · p0.

The last inequality above is due to the fact that we set α0 to zero. Note that the above states that the expected number of acceptable hypotheses in Q is at least k · p0. On the other hand, the expected number of hypotheses in Q is k · p0 + 1. Thus, the expected number of unacceptable hypotheses in Q is at most one. Now, by Markov's inequality, the probability that we have more than 50 unacceptable hypotheses in Q is at most 0.02. And, by the Chernoff bound, the probability of having fewer than 1000 hypotheses in Q is at most:

Pr[# hypotheses in Q < 1000] ≤ Pr[Bin(k, p0) < k · p0 / 2] ≤ e^{−k·p0/8} ≤ 0.01,

since k · p0 = 2000. When none of these bad events occurs, the ratio of unacceptable hypotheses to the total number of hypotheses in Q is at most 50/1000 = 0.05, so a random hypothesis in Q is acceptable with probability at least 0.95. Hence, by the union bound, the total error probability is bounded by:

Pr[outputting an unacceptable hypothesis] ≤ Pr[outputting an unacceptable hypothesis | valid comparisons] + Pr[at least one invalid comparison] ≤ (0.02 + 0.01 + 0.05) + 0.01 < 0.1.

Therefore, the proof is complete.
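An idealized Monte Carlo illustration of this proof (our own toy model, not an experiment from the paper): hypothesis quality is abstracted to excellent/decent/unacceptable, and comparison outcomes obey exactly the two properties fixed at the start of the proof, with all other outcomes arbitrary (random).

```python
import random

def simulate_ladder(n=20, decent_frac=0.3, trials=100):
    """Hypothesis 0 is excellent, the next decent_frac * n are decent,
    the rest unacceptable.  An excellent current winner never loses;
    an unacceptable current winner loses to an excellent challenger;
    every other outcome is modeled as a coin flip."""
    def kind(h):
        if h == 0:
            return "excellent"
        return "decent" if h <= decent_frac * n else "unacceptable"

    p0 = 1.0 / n                 # chance a uniform draw is excellent
    k = int(2000 / p0)
    failures = 0
    for _ in range(trials):
        winner, Q = None, []
        for _ in range(k):
            r = random.randrange(n)
            if winner is None:
                winner = r
            elif kind(winner) == "unacceptable" and kind(r) == "excellent":
                winner = r       # forced: unacceptable loses to excellent
            elif kind(winner) != "excellent" and random.random() < 0.5:
                winner = r       # arbitrary outcome
            if random.random() < p0:
                Q.append(winner)
        Q.append(winner)
        if kind(random.choice(Q)) == "unacceptable":
            failures += 1
    return failures / trials     # empirically well below the 0.1 bound
```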
3.2 Memory-data tradeoffs of the random ladder tournament

As we have discussed, one of the advantages of Algorithm 2 is its flexibility in the usage of memory and data. In the following, we describe the tradeoffs we can obtain for this algorithm using the sample summaries presented in Section 2.2.2. At a high level, we achieve the tradeoffs as follows: suppose we have b bits of memory. We choose the largest integer t such that we can store the sample summary needed to compare t hypotheses. Then, we run the random ladder tournament, drawing new samples and refreshing the sample summary every t steps. Our two sample summaries, the Scheffé counts and the sorted lists, lead to the following memory-data tradeoffs. Although these tradeoffs are not as tight as our main result (by factors of log n and 1/ϵ), they have better accuracy guarantees (α = 3 instead of α = 9 in our main result).

Lemma 3.1. Suppose we have n hypotheses in H. For every p0 ≥ 1/n, ϵ, and an integer t between two and k = Θ(1/p0), one can implement Algorithm 2 in such a way that it uses

s = O( log(p0⁻¹) / (t · p0 · ϵ²) ) samples, and b = O( t² · log(log(p0⁻¹)/ϵ) + t · log n ) bits of memory.

Lemma 3.2. Suppose we have n hypotheses in H. For every p0 ≥ 1/n, ϵ, and an integer t between two and k = Θ(1/p0), one can implement Algorithm 2 in such a way that it uses

s = O( log(p0⁻¹) / (t · p0 · ϵ²) ) samples, and b = O( t · log(p0⁻¹) · log n / ϵ² + t · log n ) bits of memory.

For the proofs of the above lemmas, see Section A.2.

Acknowledgement

MA was supported by NSF awards CNS-2120667, CNS-2120603, CCF-1934846, and BU's Hariri Institute for Computing. This work was predominantly done while MA was affiliated with Boston University and Northeastern University. MB was supported by NSF awards CCF-1947889 and CNS-2046425, and a Sloan Research Fellowship. AS was supported in part by NSF awards CCF-1763786 and CNS-2120667, as well as Faculty Awards from Google and Apple.

References

[AAC+23] Anders Aamand, Alexandr Andoni, Justin Y. Chen, Piotr Indyk, Shyam Narayanan, and Sandeep Silwal. Data structures for density estimation. In Proceedings of the 40th International Conference on Machine Learning. PMLR, 2023.

[ABH+20] Hassan Ashtiani, Shai Ben-David, Nicholas J. A. Harvey, Christopher Liaw, Abbas Mehrabian, and Yaniv Plan. Near-optimal sample complexity bounds for robust learning of gaussian mixtures via compression schemes. J. ACM, 67(6):32:1–32:42, 2020.

[ABM18] Hassan Ashtiani, Shai Ben-David, and Abbas Mehrabian. Sample-efficient learning of mixtures. In Sheila A. McIlraith and Kilian Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 2679–2686. AAAI Press, 2018.

[ADLS17] Jayadev Acharya, Ilias Diakonikolas, Jerry Li, and Ludwig Schmidt. Sample-optimal density estimation in nearly-linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 1278–1289. SIAM, 2017.

[AF23] Tomer Adar and Eldar Fischer. Refining the adaptivity notion in the huge object model. CoRR, abs/2306.16129, 2023.

[AFJ+18] Jayadev Acharya, Moein Falahatgar, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. Maximum selection and sorting with adversarial comparators. The Journal of Machine Learning Research, 19:59:1–59:31, 2018.

[AJOS14] Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. Sorting with adversarial comparators and application to density estimation. In 2014 IEEE International Symposium on Information Theory, pages 1682–1686, 2014.
[AMNW22] Maryam Aliakbarpour, Andrew McGregor, Jelani Nelson, and Erik Waingarten. Estimation of entropy in constant space with improved sample complexity. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 32474–32486. Curran Associates, Inc., 2022.

[BBK+21] Olivier Bousquet, Mark Braverman, Gillat Kol, Klim Efremenko, and Shay Moran. Statistically near-optimal hypothesis selection. In 62nd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2021, Denver, CO, USA, February 7-10, 2022, pages 909–919. IEEE, 2021.

[BBS22] Gavin Brown, Mark Bun, and Adam Smith. Strong memory lower bounds for learning natural models. In Proceedings of the Thirty-Fifth Conference on Learning Theory, COLT, 2022.

[BKM19] Olivier Bousquet, Daniel Kane, and Shay Moran. The optimal approximation factor in density estimation. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 318–341. PMLR, 25–28 Jun 2019.

[BKSW19] Mark Bun, Gautam Kamath, Thomas Steinke, and Zhiwei Steven Wu. Private hypothesis selection. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 156–167, 2019.

[CDKL22] Clément L. Canonne, Ilias Diakonikolas, Daniel Kane, and Sihan Liu. Nearly-tight bounds for testing histogram distributions. In Advances in Neural Information Processing Systems 35 (NeurIPS), 2022. To appear.

[CDSS14] Siu-on Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun. Near-optimal density estimation in near-linear time using variable-width histograms. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 1844–1852, 2014.

[CFG+23] Sourav Chakraborty, Eldar Fischer, Arijit Ghosh, Gopinath Mishra, and Sayantan Sen. Testing of index-invariant properties in the huge object model. In 36th Annual Conference on Learning Theory, COLT, 2023.

[CKM+19] Clément L. Canonne, Gautam Kamath, Audra McMillan, Adam D. Smith, and Jonathan R. Ullman. The structure of optimal private tests for simple hypotheses. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 310–321. ACM, 2019.

[DDS12] Constantinos Daskalakis, Ilias Diakonikolas, and Rocco A. Servedio. Learning poisson binomial distributions. In Howard J. Karloff and Toniann Pitassi, editors, Proceedings of the 44th Symposium on Theory of Computing Conference, STOC 2012, New York, NY, USA, May 19-22, 2012, pages 709–728. ACM, 2012.

[DGKR19] Ilias Diakonikolas, Themis Gouleakis, Daniel M. Kane, and Sankeerth Rao. Communication and memory efficient testing of discrete distributions. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 1070–1106. PMLR, 25–28 Jun 2019.

[Dia16] Ilias Diakonikolas. Learning structured distributions. In CRC Handbook of Big Data, pages 267–283. 2016.

[DK14] Constantinos Daskalakis and Gautam Kamath.
Faster and sample near-optimal algorithms for proper learning mixtures of gaussians. In Conference on Learning Theory, pages 1183–1213. PMLR, 2014.

[DKS17] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures. In Chris Umans, editor, 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, pages 73–84. IEEE Computer Society, 2017.

[DL96] Luc Devroye and Gábor Lugosi. A universally acceptable smoothing factor for kernel density estimates. The Annals of Statistics, 24(6):2499–2512, 1996.

[DL97] Luc Devroye and Gábor Lugosi. Nonasymptotic universal smoothing factors, kernel complexity and Yatracos classes. The Annals of Statistics, 25(6):2626–2637, 1997.

[DL01] Luc Devroye and Gábor Lugosi. Combinatorial methods in density estimation. Springer, 2001.

[DS18] Yuval Dagan and Ohad Shamir. Detecting correlations with little memory and communication. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 1145–1198. PMLR, 06–09 Jul 2018.

[GKK+20] Sivakanth Gopi, Gautam Kamath, Janardhan Kulkarni, Aleksandar Nikolov, Zhiwei Steven Wu, and Huanyu Zhang. Locally private hypothesis selection. In Jacob D. Abernethy and Shivani Agarwal, editors, Conference on Learning Theory, COLT 2020, 9-12 July 2020, Virtual Event [Graz, Austria], volume 125 of Proceedings of Machine Learning Research, pages 1785–1816. PMLR, 2020.

[GR22] Oded Goldreich and Dana Ron. Testing distributions of huge objects. In Mark Braverman, editor, 13th Innovations in Theoretical Computer Science Conference (ITCS 2022), volume 215 of Leibniz International Proceedings in Informatics (LIPIcs), pages 78:1–78:19, Dagstuhl, Germany, 2022. Schloss Dagstuhl – Leibniz-Zentrum für Informatik.

[KMV12] Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Disentangling gaussians. Commun. ACM, 55(2):113–120, 2012.

[KSS18] Pravesh K. Kothari, Jacob Steinhardt, and David Steurer. Robust moment estimation and improved clustering via sum of squares. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 1035–1046. ACM, 2018.

[MS08] Satyaki Mahalanabis and Daniel Stefankovic. Density estimation in linear time. In 21st Annual Conference on Learning Theory, COLT, pages 503–512, 2008.

[Pea95] Karl Pearson. Mathematical contributions to the theory of evolution. II. Skew variation in homogeneous material. [Abstract]. Proceedings of the Royal Society of London, 57:257–260, 1895.

[Raz18] Ran Raz. Fast learning requires good memory: A time-space lower bound for parity learning. Journal of the ACM (JACM), 66(1):1–18, 2018.

[SD15] Jacob Steinhardt and John Duchi. Minimax rates for memory-bounded sparse linear regression. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 1564–1587. PMLR, 03–06 Jul 2015.

[Sha14] Ohad Shamir. Fundamental limits of online and distributed algorithms for statistical learning and estimation. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 163–171, 2014.
[SOAJ14] Ananda Theertha Suresh, Alon Orlitsky, Jayadev Acharya, and Ashkan Jafarpour. Near-optimal-sample estimators for spherical gaussian mixtures. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27, 2014.

[SSV19] Vatsal Sharan, Aaron Sidford, and Gregory Valiant. Memory-sample tradeoffs for linear regression with small error. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC, pages 890–901. ACM, 2019.

[Vad12] Salil P. Vadhan. Pseudorandomness. Foundations and Trends in Theoretical Computer Science, 7(1–3):1–336, 2012.

[Yat85] Yannis G. Yatracos. Rates of convergence of minimum distance estimators and Kolmogorov's entropy. The Annals of Statistics, 13(2):768–774, 1985.

[Figure 1: Random ladder tournament. The Ri's are drawn from H according to the meta distribution; each current winner Wi is added to Q with some small probability; Wk is always added to Q; Ĥ is selected from the members of Q uniformly at random.]

A Random ladder tournament (cont.)

Figure 1 illustrates a visual representation of the algorithm.

A.1 Amplifying the confidence parameter

In this section, we describe how one can amplify the confidence parameter of the random ladder tournament. Note that to keep α = 3, we should avoid any two-step process that first comes up with a collection of hypotheses to narrow down the possible choices and then picks an acceptable hypothesis among them; such approaches generally lead to α = 9 or higher. Here, we argue that if we run the random ladder tournament for a longer time, a random hypothesis in Q will be acceptable with very high probability. In particular, if we set k, the number of steps, equal to Θ(1/(δ · p0)) (instead of setting k = Θ(1/p0)), we find an acceptable hypothesis with probability at least 1 − δ.

Corollary A.1. Suppose that we can draw i.i.d. random hypotheses from an arbitrary meta distribution D over H, and we have a distribution P that we aim to learn properly in H. Assume we are given parameters p0 and Γ such that the probability that a random hypothesis drawn from the meta distribution is Γ-close to P is at least p0. For any ϵ, δ ∈ (0, 1), Algorithm 2 can be modified into an (α = 3, ϵ, δ)-proper learner (with promise) for the class H by setting k := 2 · ⌈2/δ⌉/p0 = Θ(1/(δ · p0)).

Proof. Note that δ in Theorem 4 is 0.1; hence, it suffices to focus on δ < 0.1. As we have shown in the proof of Theorem 4, the expected number of unacceptable hypotheses in Q is at most one. Let Ĥ be a random hypothesis in Q. Let t := ⌈2/δ⌉, and set k to 2t/p0. It is not hard to see that the probability of Ĥ being unacceptable is bounded by δ:

Pr[Ĥ is (3 · Γ + ϵ)-far] = Pr[Ĥ is unacceptable] = Pr[a random hypothesis in Q is unacceptable] = Σ_ℓ ( E[# unacceptable hypotheses in Q | |Q| = ℓ] / ℓ ) · Pr[|Q| = ℓ].

Now, we separate the first t terms in the summation and bound them from above by the probability of |Q| being at most t. Moreover, we use the fact that |Q| is essentially a binomial random variable drawn from Bin(k, p0), plus one. Hence, we obtain:

Pr[Ĥ is (3 · Γ + ϵ)-far] ≤ Pr[|Q| ≤ t] + Σ_{ℓ > t} ( E[# unacceptable hypotheses in Q | |Q| = ℓ] / t ) · Pr[|Q| = ℓ]
 ≤ Pr[Bin(k, p0) < t] + E[# unacceptable hypotheses in Q] / t
 ≤ Pr[Bin(k, p0) < k · p0 / 2] + δ/2
 ≤ e^{−k·p0/8} + δ/2   (by the Hoeffding inequality)
 ≤ δ/4 + δ/2 < δ,

where the last inequality holds due to the fact that e^{−k·p0/8} = e^{−t/4} ≤ e^{−1/(2·δ)} < δ/4 for δ ≤ 0.1. Hence, the proof is complete.
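A quick numeric check (our addition) of the inequality invoked in the last step:

```python
import math

# e^{-1/(2*delta)} < delta/4 holds for every delta <= 0.1; e.g.:
for delta in (0.1, 0.05, 0.01, 0.001):
    assert math.exp(-1.0 / (2.0 * delta)) < delta / 4.0
```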
A.2 Proofs of memory-data tradeoffs of the random ladder tournament

Lemma 3.1. Suppose we have n hypotheses in H. For every p0 ≥ 1/n, ϵ, and an integer t between two and k = Θ(1/p0), one can implement Algorithm 2 in such a way that it uses

s = O( log(p0⁻¹) / (t · p0 · ϵ²) ) samples, and b = O( t² · log(log(p0⁻¹)/ϵ) + t · log n ) bits of memory.

Proof. Recall that it suffices to draw m := O(log(p0⁻¹)/ϵ²) samples to make sure all the comparisons we make in Line 7 are accurate with probability at least 0.99. Since we obtain this accuracy bound via the union bound, we may reuse the samples. The idea here is to process every t hypotheses together with the same set of m samples. To implement this algorithm in b bits of memory, we first draw t random hypotheses and store their indices using O(t · log n) bits. Then, we draw m samples and store the Scheffé counts of every pair of drawn hypotheses. Counting the number of samples in each Scheffé set takes O(log m) = O(log(log(p0⁻¹)/ϵ)) bits, and there are O(t²) of them. Now, we can make all the comparisons among the t hypotheses using O(t² · log m) = O(t² · log(log(p0⁻¹)/ϵ)) bits. We run the first t steps of Algorithm 2 and compute the current winner up to round t, namely Wt. Then, we repeat this process for Wt and the next t − 1 random hypotheses with fresh samples, until we reach the end of the k hypotheses. Since each round processes t − 1 hypotheses (with the exception of the first round), the number of rounds is (k − 1)/(t − 1) = O(1/(t · p0)), so we use O(m/(t · p0)) samples in total. Hence, Algorithm 2 uses O(log(p0⁻¹)/(t · ϵ² · p0)) samples and b = O(t² · log(log(p0⁻¹)/ϵ) + t · log n) bits of memory overall.

Lemma 3.2. Suppose we have n hypotheses in H. For every p0 ≥ 1/n, ϵ, and an integer t between two and k = Θ(1/p0), one can implement Algorithm 2 in such a way that it uses

s = O( log(p0⁻¹) / (t · p0 · ϵ²) ) samples, and b = O( t · log(p0⁻¹) · log n / ϵ² + t · log n ) bits of memory.

Proof. The proof is very similar to the proof of Lemma 3.1; however, instead of using the Scheffé counts, we use the sorted lists to store the samples. The only difference is that to process t hypotheses, we need O(m · t · log n) = O(t · log(p0⁻¹) · log n / ϵ²) bits instead of O(t² · log m) bits. This observation concludes the proof.

[Figure 2: The sketch of our approach for the main result. The Ri's fed to the random ladder tournament are the outputs of the filtering algorithm.]

B Main result

In this section, we present an algorithm that achieves the tradeoff described in our main theorem. As we have described in Section 1.2, we obtain our main result by running the random ladder tournament on the hypotheses filtered by a filtering algorithm. Recall that in the random ladder tournament, the algorithm can draw a random hypothesis such that, with probability at least p0, it is a Γ-close hypothesis. The number of comparisons made in the random ladder tournament is approximately k := Θ(1/p0). To reduce k, we design a randomized filtering algorithm that produces an acceptable hypothesis ((3 · Γ + ϵ′)-close to P) with a certain decent probability. Roughly speaking, this probability is min(1/√n, b/n). Although this probability is not very high, it is still greater than 1/n, which allows us to execute the random ladder in fewer steps. The only caveat is that we must now run the random ladder tournament with a new parameter Γ′ := 3 · Γ + ϵ′. Hence, the accuracy guarantee of the resulting hypothesis will also be based on Γ′, and we will obtain a (9 · Γ + Θ(ϵ′))-close hypothesis to P.
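The constant α = 9 comes from composing the two stages: the ladder tournament is run with the inflated promise Γ′ and returns a (3 · Γ′ + ϵ′)-close hypothesis, so

3 · Γ′ + ϵ′ = 3 · (3 · Γ + ϵ′) + ϵ′ = 9 · Γ + 4 · ϵ′ = 9 · Γ + Θ(ϵ′).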
Figure 2 illustrates a visual representation of this approach. The filtering algorithm operates on the basis of a simple observation: the excellent hypothesis H∗ will not lose to any unacceptable hypothesis (those that are (3 · Γ + ϵ′)-far). Therefore, if there is no decent hypothesis, one can identify the excellent hypothesis H∗ through an elimination tournament, where the algorithm pairs up all hypotheses, compares them, and repeats this process while only the winners advance to the next round. Conversely, if numerous decent hypotheses exist, a random hypothesis has a reasonable chance of being decent (and hence acceptable). The filtering algorithm combines these two strategies and performs one of the following, each with probability one half:

1. It outputs a random hypothesis.
2. It runs an elimination tournament under the assumption that there are no decent hypotheses, and then outputs the result.

This approach finds an excellent or decent hypothesis with an acceptable chance, regardless of how scarce the decent hypotheses are. Note that to run an elimination algorithm, we need to memorize the winners of each round; hence, implementing the second part above with limited memory is not trivial. Therefore, we focus on a randomly sampled subset of elements and provide an algorithm that is extremely memory efficient. Roughly speaking, the algorithm processes n hypotheses using only O(n) bits of memory. This is one of the key components of our result that allows us to avoid extra log n factors and achieve the tradeoff s · b ≤ O(n log n). The algorithm operates in rounds, employing a tree-like structure. In each round of this algorithm, we group a carefully chosen number of hypotheses together and run an all-against-all tournament within each group; the winner of each group proceeds to the next round. We continue this process until we find a single winner. In Section B.1, we describe the group elimination tournament. In Section B.2, we detail how to conduct the group elimination tournament on a random set of n hypotheses while utilizing only roughly O(n) bits of memory. In Section B.3, we outline the filtering algorithm. Finally, we conclude with Section B.4, which combines all the pieces together and presents the proof of our main result.

B.1 Group elimination tournament

In this section, we focus on the case where there is no decent hypothesis in H. That is, every hypothesis that is OPT-far from P will lose to a hypothesis H with ∥H − P∥TV = OPT. The main advantage of this assumption is that in a knockout-style tournament, only an OPT-close hypothesis will survive to the top. Here, we propose an algorithm that implements a knockout-style tournament, which we call the group elimination tournament. However, instead of comparing two hypotheses at every step, we run the all-go-against-all tournament among a small group of hypotheses and send the winner to the next round. For simplicity, in this section we describe our approach in Algorithm 3, where we assume no memory restriction. We prove the performance of this algorithm in Theorem 5. For a discussion of how to implement this algorithm using roughly O(n) bits of memory, see Section B.2.
Lemma B.1. Suppose we have a set of n hypotheses H = {H1, H2, . . . , Hn} and an unknown distribution P which we wish to properly learn in H. Suppose there is no decent hypothesis in H; that is, all hypotheses in H are either OPT-close, or they lose to any OPT-close hypothesis. For every 2 ≤ r ≤ n, there exists an algorithm that uses O((log(log_r n) · log(1/δ) + log n)/ϵ²) samples from P and outputs ˆH that is OPT-close to P with probability 1 − δ.

Round ℓ | # hypotheses | group size | # groups (= # winners) | space
1 | s1 = n | g1 = ⌈√r⌉ | s1/g1 ≈ n/√r | O(n·√r·log(log(1/δ)/ϵ))
2 | s2 = n/g1 | g2 = ⌈(√r)^{1.5}⌉ | s2/g2 ≈ n/r^{1.25} | O(n·√r·log(log(1/δ)/ϵ))
3 | s3 = n/(g1·g2) | g3 = ⌈(√r)^{2.25}⌉ | s3/g3 ≈ n/r^{2.375} | O(n·√r·log(log(1/δ)/ϵ))
ℓ | sℓ = ⌈n/∏_{i=1}^{ℓ−1} gi⌉ | gℓ = ⌈(√r)^{1.5^{ℓ−1}}⌉ | sℓ/gℓ ≈ n/r^{1.5^ℓ−1} | O((sℓ/gℓ)·gℓ²·log mℓ) = O(n·√r·log(log(1/δ)/ϵ))

Table 1: Branching factors in the tree structure

Proof. The algorithm runs in L rounds. At round ℓ ∈ [L], we start with sℓ hypotheses and partition them into groups of size gℓ. For each round, we draw mℓ := Θ(log(gℓ³/δ)/ϵ²) fresh samples and keep track of the number of samples in the Scheffé set of every pair of hypotheses that share a group. Using these Scheffé counts, we run an all-go-against-all tournament in each group Gi with overall confidence probability 1 − δ/(6·gℓ) and find the winner. We then repeat this process among the winners in the next round, until only one hypothesis remains. See Algorithm 3 for the description.

Parameters: We set the group sizes as follows:

    g1 = ⌈√r⌉,  gℓ := ⌈(√r)^{1.5^{ℓ−1}}⌉ ≈ g_{ℓ−1}^{1.5}  for ℓ ∈ {2, . . . , L}.  (2)

We start with s1 := n. The number of hypotheses in the ℓ-th round is:

    sℓ = ⌈ n / ∏_{i=1}^{ℓ−1} gi ⌉ ≤ ⌈ n / r^{1.5^{ℓ−1} − 1} ⌉  for ℓ ∈ {2, . . . , L}.  (3)

The inequality above is due to Fact D.2 and Fact D.3. Observe that we stop the process when only one hypothesis is left: the ultimate winner. Thus, L is the smallest integer such that s_{L+1} is one, i.e., ∏_{ℓ=1}^{L} gℓ ≥ n. It is not hard to show that L is O(log(log_r n)): for any L ≥ log_{1.5}(log_r n + 1), Fact D.3 gives

    ∏_{ℓ=1}^{L} gℓ ≥ r^{1.5^L − 1} ≥ n.

Therefore, L cannot be larger than ⌈log_{1.5}(log_r n + 1)⌉. We summarize how we set our parameters in Table 1.

Sample complexity: For round ℓ, we draw mℓ fresh samples. Thus, the total number of samples is:

    # samples = Σ_{ℓ=1}^{L} mℓ = Σ_{ℓ=1}^{L} O( log(gℓ³/δ)/ϵ² )
    = O( ( L·log(1/δ) + Σ_{ℓ=1}^{L} log gℓ³ ) / ϵ² )
    = O( ( L·log(1/δ) + log ∏_{ℓ=1}^{L} gℓ³ ) / ϵ² )
    = O( ( L·log(1/δ) + log n ) / ϵ² )
    = O( ( log(log_r n)·log(1/δ) + log n ) / ϵ² ).

In the last line we use the fact that ∏_{ℓ=1}^{L} gℓ is poly(n), which is not difficult to prove using the following argument: L is the smallest integer such that ∏_{ℓ=1}^{L} gℓ is at least n, so ∏_{ℓ=1}^{L−1} gℓ is at most n. Hence:

    ∏_{ℓ=1}^{L} gℓ = g_L · ∏_{ℓ=1}^{L−1} gℓ ≤ O( g_{L−1}^{1.5} ) · n ≤ O( n^{1.5} ) · n = poly(n).
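As a quick illustration of the doubly exponential group-size schedule of Equation (2) and the round bound L = O(log log_r n), the toy computation below traces the rounds; the inputs are arbitrary examples, not values fixed by the paper.

    import math

    def schedule(n, r):
        """Trace the group sizes g_l of Equation (2) and count the rounds L."""
        sizes, survivors, level = [], n, 1
        while survivors > 1:
            g = min(math.ceil(math.sqrt(r) ** (1.5 ** (level - 1))), survivors)
            sizes.append(g)
            survivors = math.ceil(survivors / g)   # winners advancing to round l+1
            level += 1
        return sizes, len(sizes)

    # schedule(10**6, 4) == ([2, 3, 5, 11, 34, 90], 6): a million hypotheses are
    # reduced to a single winner in L = 6 rounds.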
Correctness: Suppose G is a group in round ℓ that contains an OPT-close hypothesis. For the all-go-against-all tournament in round ℓ, we use mℓ = O(log(gℓ³/δ)/ϵ²) samples, which implies that we estimate the probabilities of all the Scheffé sets in G with accuracy ϵ with probability at least 1 − δ/(6·gℓ). Given that there is no decent hypothesis, Fact D.1 implies that the all-go-against-all tournament outputs an OPT-close hypothesis with probability 1 − δ/(6·gℓ). Now, let G(1) be the group in round one that H∗ belongs to, let G(2) be the group that the winner of G(1) belongs to in round two, and similarly define G(3), . . . , G(L). Clearly, the all-go-against-all tournament over G(i) returns an OPT-close hypothesis with probability 1 − δ/(6·gℓ) whenever G(i) contains an OPT-close hypothesis. Thus, by the union bound, the probability that the final group G(L) does not output an OPT-close hypothesis is bounded as follows:

    Pr[ output hypothesis is not OPT-close ] ≤ Σ_{ℓ=1}^{L} δ/(6·gℓ) ≤ (δ/6) · Σ_{ℓ=1}^{∞} (√r)^{−1.5^{ℓ−1}} ≤ (δ/6) · Σ_{ℓ=1}^{∞} (√r)^{−ℓ/2} = (δ/6) · 1/(r^{1/4} − 1) ≤ δ,

where the last inequality uses r ≥ 2.

B.2 Memory usage of Algorithm 3

In this section, we provide an approach for implementing Algorithm 3 using roughly O(n) bits of memory and prove the following theorem.

Theorem 5. Suppose we have a set of n hypotheses H = {H1, H2, . . . , Hn} and an unknown distribution P which we wish to properly learn in H. Suppose there is no decent hypothesis in H; that is, all hypotheses in H are either OPT-close, or they lose to any OPT-close hypothesis. For every 2 ≤ r ≤ n, there exists an algorithm that uses O(n·√r·log(log(1/δ)/ϵ)) bits of memory and O((log(log_r n)·log(1/δ) + log n)/ϵ²) samples from P, and it outputs ˆH that is OPT-close to P with probability 1 − δ.

Proof. There are two main sets of values we need to store: first, the Scheffé counts at every round, which allow us to perform the all-go-against-all tournaments; and second, the indices of the winning hypotheses that advance to the next round. Below, we bound the memory usage of each part.

Memory usage of Scheffé counts: Here, we show that storing the Scheffé counts requires O(n·√r·log(log(1/δ)/ϵ)) bits at every round. For a single group of size gℓ, there are O(gℓ²) pairs of hypotheses, and therefore O(gℓ²) Scheffé sets, so we require O(gℓ²·log mℓ) bits to count the number of samples in all of them. Thus, in round ℓ, we need the following number of bits:

    space = (# groups) · O(gℓ² · log mℓ) = O( (sℓ/gℓ) · gℓ² · log mℓ ) = O( sℓ · gℓ · ( log log gℓ + log(log(1/δ)/ϵ) ) ).

Next, we bound sℓ·gℓ = O(n·√r). For round 1, this holds since s1·g1 = n·⌈√r⌉. For ℓ ≥ 2, using Equation (3), we have sℓ·gℓ = O( n·gℓ/r^{1.5^{ℓ−1}−1} ) = O(n·√r), since gℓ = ⌈(√r)^{1.5^{ℓ−1}}⌉ ≤ ⌈√r·r^{1.5^{ℓ−1}−1}⌉. The log log gℓ term is likewise dominated, as sℓ·gℓ decreases doubly exponentially in ℓ while log log gℓ grows only linearly. Hence, the bit complexity at every round is:

    O( n·√r·log(log(1/δ)/ϵ) ).

Storing the indices: Note that to run the comparisons, one needs the true indices of the hypotheses in order to query the PDF-comparator or draw a sample. Thus, a trivial implementation of our algorithm would need O(n log n) bits to keep track of the indices of the winners at every round. Here, we explain how to implement our algorithm in a memory-efficient manner, so that keeping track of the winners' indices takes no more than O(n) bits at every round. We start with the assumption that there is a natural ordering among the hypotheses: H1, H2, . . . , Hn. Then, for the first round, we do not need to remember the set of indices; we only need to remember n. At every level, we group "adjacent" hypotheses together and preserve this natural ordering. More precisely, at round one with group size g1, the groups are:

    G1 := {H1, H2, . . . , H_{g1}}, G2 := {H_{g1+1}, H_{g1+2}, . . . , H_{2g1}}, . . . , G_{⌈n/g1⌉} := {H_{⌊(n−1)/g1⌋·g1+1}, . . . , Hn}.

For the next round, we keep a list of ⌈n/g1⌉ winners. However, instead of fully writing down each winner's index, we use O(log g1) bits per winner and save only the index of the winner within its group. For the round after that, we partition the hypotheses into groups of size g2, keeping "adjacent" hypotheses in the same group, and repeat the same process to store the winners. Now, however, each winner "represents" O(g1·g2) hypotheses, so we need O(log(g1·g2)) bits to store its index. We continue this approach: at round ℓ > 1, we have a list of sℓ = O(n/∏_{i=1}^{ℓ−1} gi) winners, and we require O(log ∏_{i=1}^{ℓ−1} gi) bits per entry. Thus, at every step we need O(n) bits of memory in total. Note that at every point in the process, it is easy to compute the actual index from the stored one: if right before starting round i, the j-th entry in the list stores offset k, then the true index is

    NEWINDEX(i, j, k) = (j − 1) · ∏_{ℓ<i} gℓ + k.
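A minimal sketch of the index decoding just described; the 1-indexed list convention and the function name follow NEWINDEX above.

    from math import prod

    def new_index(i, j, k, group_sizes):
        """Recover a winner's true index in H from its compressed representation
        (the index-storing scheme in the proof of Theorem 5). Right before round
        i, the j-th entry of the winners list (1-indexed) stores only the offset
        k of the winner inside the block of prod(g_1 .. g_{i-1}) consecutive
        hypotheses it represents."""
        block = prod(group_sizes[: i - 1])   # g_1 * g_2 * ... * g_{i-1}
        return (j - 1) * block + k

    # Example: with group sizes [4, 8], before round 3 each stored entry
    # represents a block of 32 consecutive hypotheses, so offset k = 5 in the
    # j = 2 entry decodes to hypothesis index 37.
    assert new_index(3, 2, 5, [4, 8]) == 37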
B.3 Filtering acceptable hypotheses

In this section, we describe a filtering algorithm that allows us to pick an acceptable hypothesis with a modest chance. Our algorithm is simple: we draw n′ < n hypotheses at random. If one of these hypotheses is OPT-close and there is no decent hypothesis among them, Algorithm 3 returns an excellent hypothesis. On the other hand, if there are many decent hypotheses, then there is a modest chance that a randomly selected hypothesis is decent, and hence acceptable.

In the description of Algorithm 3, we assumed a class of hypotheses of size n and roughly O(n) bits of memory. However, we aim to use this algorithm in a setting where we must process n′ random hypotheses in roughly O(n′) bits of memory. The primary challenge is that we would need roughly O(n′ log n) bits to memorize which hypotheses are participating, and we cannot use the "natural ordering" of the hypotheses that we used in the proof of Theorem 5. Thus, we are looking for a mapping from indices in [n′] to a random set of indices in [n] that can be stored in a memory-efficient manner. To do so, we relax the requirement that the n′ random hypotheses be drawn uniformly from H: instead, we look for a set of pairwise independent random indices such that finding the index of each selected element requires only a small amount of memory. This relaxation gives us a process that filters acceptable hypotheses:

Theorem 6. Suppose we have a set of n hypotheses H = {H1, H2, . . . , Hn} and an unknown distribution P which we wish to properly learn in H. For every positive integer n′ < n, there exists an algorithm that uses O(n′·log(1/ϵ)) bits of memory and O((log n)/ϵ²) samples from P, and it outputs ˆH that is excellent or decent with probability:

    Pr[ ˆH is excellent or decent ] ≥ min( n′/(16·n), 1/(4·n′) ).

Proof. First, we describe how we randomly pick n′ hypotheses from the set of n hypotheses, using a standard technique for generating pairwise independent random indices described in [Vad12, Chapter 3.5]. The following is a randomized mapping that, for any given index x ∈ [n′], produces an index i ∈ [n] and can be computed in a memory-efficient manner. Let q be a prime number between n and 2n (n < q < 2n); such a q always exists by the Bertrand–Chebyshev theorem for any n ≥ 2. Without loss of generality, we assume we have q hypotheses, by adding fake hypotheses to H that lose to every other hypothesis. Now, consider the finite field Zq. Proposition 3.24 in [Vad12] implies that if we pick two random numbers a and b in Zq, then the following mapping generates a set of pairwise independent indices:

    fa,b(x) := a·x + b (mod q).

More precisely, iterating x from 1 to n′, this randomized mapping selects n′ hypotheses with the indices fa,b(1), fa,b(2), . . . , fa,b(n′).³ Note that to compute this mapping, we only need to memorize a and b, which requires O(log n) bits, significantly less than the O(n′ log n) bits we would need to memorize n′ random indices.
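The following sketch regenerates the selected indices on demand from the O(log n) bits (a, b, q), as in the proof of Theorem 6. The naive primality test and the function names are choices of this sketch.

    import random

    def first_prime_at_least(n):
        """Smallest prime q >= n (Bertrand-Chebyshev guarantees q < 2n)."""
        def is_prime(m):
            if m < 2:
                return False
            d = 2
            while d * d <= m:
                if m % d == 0:
                    return False
                d += 1
            return True
        q = n
        while not is_prime(q):
            q += 1
        return q

    def sample_pairwise_independent_indices(n, n_prime, rng=random):
        """Sketch of the index selection in the proof of Theorem 6: store only
        (a, b, q) and regenerate the n' pairwise independent indices
        f_{a,b}(x) = a*x + b (mod q). Values outside [1, n] refer to the
        padding hypotheses that lose to every real one (index 0 stands for H_q
        per the footnote)."""
        q = first_prime_at_least(n)
        a = rng.randrange(1, q)   # a != 0, as in Algorithm 4
        b = rng.randrange(q)
        return [(a * x + b) % q for x in range(1, n_prime + 1)], (a, b, q)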
Using the pairwise independence of the indices generated by this mapping, it is not too difficult to show that with some modest chance we pick exactly one excellent hypothesis and no decent hypothesis:

Lemma B.2. For a prime q, assume we have a class of q hypotheses that contains te ≥ 1 excellent hypotheses and td decent hypotheses. Let a ≠ 0 and b be two random numbers in Zq selected uniformly at random. Let H′ be the following set of hypotheses: {H_{fa,b(x)} : x ∈ [n′]}, where fa,b(x) := a·x + b (mod q). If (te + td)/q ≤ 1/(2·n′), then the probability that H′ contains exactly one excellent hypothesis and no decent hypothesis is at least n′·te/(2·q).

See Section D for the proof of this lemma.

³Without loss of generality, assume H0 is Hq.

Algorithm 4 Filtering Algorithm
1: procedure HYPOTHESES-FILTER(H, n′, ϵ, sample access to P)
2:   C ← toss of a fair coin
3:   if C is heads then
4:     Draw a random i ∈ [n]
5:     ˆH ← Hi
6:   else
7:     q ← the first prime number that is at least n
8:     a ← a random number in [q − 1]
9:     b ← a random number in [q]
10:    H′ ← {H_{a·x+b (mod q)} | x ∈ [n′]}
11:    ˆH ← HYPOTHESIS-SELECTION-NO-DECENT(H′, n′, ϵ, 0.5)
12:  Output ˆH.

Figure 3: The sketch of the filtering algorithm. With probability 1/2, a random hypothesis in H is output; with probability 1/2, the group elimination algorithm is run on R1, R2, . . . , R_{n′}, where Rx is H_{a·x+b (mod q)}.

The algorithm: Assume H contains te ≥ 1 excellent hypotheses and td decent hypotheses. The algorithm is simple: with probability one half, we set ˆH to be a random hypothesis from H; and with probability one half, we use the randomized mapping to pick n′ hypotheses, run Algorithm 3 on them with parameters δ = 0.5 and r = 2, and output its result. Figure 3 shows a visual representation of the filtering algorithm. The probability of getting an acceptable hypothesis by picking a random one is (te + td)/n. Note that if this quantity is at least 1/(2·n′), then (accounting for the coin toss) ˆH is acceptable with probability at least 1/(4·n′) and the statement of the theorem holds. Thus, assume (te + td)/n is less than 1/(2·n′), which implies that (te + td)/q is less than 1/(2·n′). Given this condition, by Lemma B.2, our randomized mapping selects a set of hypotheses with exactly one excellent hypothesis and no other acceptable hypothesis with probability at least n′·te/(2·q). And since we set δ to one half, with probability at least one half, Algorithm 3 chooses the excellent hypothesis as its output. Using that q < 2n and Pr[C = tails] = 1/2, with probability at least n′·te/(8·q) ≥ n′/(16·n), ˆH is the excellent hypothesis, as desired in the statement. Hence, the proof is complete.
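A compact sketch of Algorithm 4, reusing sample_pairwise_independent_indices from the sketch above. Here select_no_decent stands in for HYPOTHESIS-SELECTION-NO-DECENT (Algorithm 3, run with r = 2) and random_hypothesis for a uniform draw from H; both callables are assumptions of this sketch.

    import random

    def hypotheses_filter(n, n_prime, eps, select_no_decent, random_hypothesis):
        """Sketch of Algorithm 4: coin-flip between the two filtering strategies."""
        if random.random() < 0.5:
            # Heads: a uniformly random hypothesis. If decent hypotheses are
            # plentiful, this branch alone succeeds with probability ~ (te+td)/n.
            return random_hypothesis(n)
        # Tails: pairwise independent subsample + group elimination. If decent
        # hypotheses are scarce, the subsample likely isolates a single excellent
        # hypothesis, which then survives the tournament with probability >= 1/2.
        indices, _ab_q = sample_pairwise_independent_indices(n, n_prime)
        return select_no_decent(indices, eps, delta=0.5)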
B.4 Putting everything together

Theorem 1. There exists a constant c for which the following holds. Let H be an arbitrary class consisting of n distributions, and let ϵ > 0. For every natural number b where c·(log n + log((log n)/ϵ)) ≤ b ≤ n, there exists an (α = 9, ϵ, δ = 0.1)-proper learner (with promise) for H using b bits of memory and the following number of samples:

    s = O( n·log n·log(1/ϵ) / (b·ϵ²) )  when b ≤ √n·log(1/ϵ), and
    s = O( n·log n·log((log n)/ϵ) / (b·ϵ²) )  when √n·log(1/ϵ) ≤ b ≤ n.

Proof. Here, we combine the filtering algorithm and the random ladder tournament to prove the upper bound. We first use the filtering algorithm, Algorithm 4, to increase the probability of drawing an acceptable hypothesis (one that is (3Γ + ϵ′)-close to P), with parameter ϵ′ = ϵ/4. Then, we use the random ladder tournament, Algorithm 2, to select an acceptable hypothesis among the hypotheses that have passed the filter. Note that the random ladder tournament must be run with the new parameter Γ′ := 3Γ + ϵ′. Here, we use the memory bounds provided in Theorem 6 and Lemma 3.1 for these two algorithms. Fix two small constants c0, c1 ≤ 1 to be determined later. Let t be the largest integer such that:

    t²·log((log n)/ϵ) + t·log n ≤ c0·b.

Given our lower bound on b, such a t ≥ 2 exists. Set n′ as follows:

    n′ := min( c1·b/log(1/ϵ), 2·√n ).

We consider two cases.

Case 1: n′ < 2·√n. In this case, n′ equals c1·b/log(1/ϵ). We run Algorithm 4, which finds an acceptable hypothesis with probability p0 := min(n′/(16·n), 1/(4·n′)) = n′/(16·n). We invoke Algorithm 4 O(1/p0) times, and each invocation uses O(log n/ϵ²) samples. Furthermore, the random ladder tournament uses O(log(1/p0)/(t·p0·ϵ²)) samples. Therefore, our overall sample complexity is:

    O( (1/p0)·(log n)/ϵ² + log(1/p0)/(t·p0·ϵ²) ) = O( n·log n·log(1/ϵ) / (b·ϵ²) ).

Case 2: n′ = 2·√n. As in the previous case, we run Algorithm 4 to find an acceptable hypothesis with probability p0 := min(n′/(16·n), 1/(4·n′)) = 1/(4·n′). Since n′ ≤ c1·b/log(1/ϵ), we can run multiple instances of Algorithm 4 in parallel while using the same samples; let r := c1·b/(n′·log(1/ϵ)) be the number of parallel instances. The number of samples used for this stage is then:

    O( (1/(r·p0))·(log n)/ϵ² ) = O( n·log n·log(1/ϵ) / (b·ϵ²) ).

Now, we focus on the number of samples needed by the random ladder tournament. Note that t satisfies t = Ω( min( b/log n, √( b/log((log n)/ϵ) ) ) ). Thus, the ladder uses the following number of samples:

    O( log(1/p0)/(t·p0·ϵ²) ) = O( (√n·log n/ϵ²) · max( (log n)/b, √( log((log n)/ϵ)/b ) ) ) = O( n·log n·log((log n)/ϵ) / (b·ϵ²) ),

where the last step uses b ≤ n. Putting these two cases together, we obtain the desired sample-memory tradeoff:

    #samples · #bits = O( (n·log n/ϵ²) · log((log n)/ϵ) ).

Regarding the accuracy guarantee, it is not hard to see that the filtering algorithm picks, with a modest probability, a distribution that is (3Γ + ϵ′)-close to P. We then run the random ladder tournament with the new parameter Γ′ = 3Γ + ϵ′ and obtain a hypothesis with accuracy:

    3Γ′ + ϵ′ = 3·(3Γ + ϵ′) + ϵ′ = 9Γ + 4ϵ′ = 9Γ + ϵ.

Hence, the proof is complete.
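For concreteness, the toy computation below traces the quantities fixed at the start of the proof of Theorem 1. The constants c0, c1 and the stopping rule for t are placeholder choices of this sketch, not values from the paper; it assumes b is large enough that t = 2 is feasible, as the theorem's lower bound on b guarantees.

    import math

    def tradeoff_parameters(n, b, eps, c0=0.5, c1=0.5):
        """Compute (t, n', p0) as in the proof of Theorem 1 (illustrative only)."""
        L = math.log(math.log(n) / eps)
        # Largest t with t^2 * log((log n)/eps) + t * log n <= c0 * b.
        t = 2
        while (t + 1) ** 2 * L + (t + 1) * math.log(n) <= c0 * b:
            t += 1
        n_prime = min(int(c1 * b / math.log(1 / eps)), int(2 * math.sqrt(n)))
        p0 = min(n_prime / (16 * n), 1 / (4 * n_prime))
        return t, n_prime, p0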
C Proper learner without promise

Throughout this paper, we mainly focus on proper learners with promise, for which the algorithm is given a parameter Γ and we are promised that OPT is at most Γ. In this section, we focus on the case where no such information is available to the algorithm. First, we formally define a proper learner without promise.

Definition C.1. Suppose A has sample access to an unknown distribution P and can query the probabilities of the Scheffé sets according to each Hi ∈ H, as well as a PDF-comparator for every pair of hypotheses in H. We say algorithm A is an (α, ϵ, δ)-proper learner for H if the following holds: for every distribution P and for every class of n distributions H, with probability at least 1 − δ over its input samples (drawn i.i.d. from P) and its internal coin tosses, A outputs a distribution ˆH ∈ H such that:

    ∥P − ˆH∥TV ≤ α·OPT + ϵ.

Here, we provide a reduction from a proper learner without promise to a proper learner with promise. At a high level, we perform a binary search over possible values of Γ in (0, 1). At every step, we try a value Γm and run A with Γ = Γm. The main challenge is that regardless of whether Γm is at least OPT, A may return a hypothesis that seems close to P, and it is hard to refute that the suggested value of Γ is at least OPT.

For this reason, we inspect the output of A, say H̃, and look for evidence that H̃ is far from P. We check the distance between H̃ and P on every Scheffé set of H̃. More precisely, we check whether W(H̃) is larger than α·OPT + ϵ, where W is defined as follows for every hypothesis Hi:

    W(Hi) := max_{j ∈ [n]\{i}} | Hi(S(Hi, Hj)) − P(S(Hi, Hj)) |.

Now, if W(H̃) is larger than α·OPT + ϵ, then, since W(H̃) is always at most ∥H̃ − P∥TV, it follows that H̃ is not (α·OPT + ϵ)-close to P, and therefore Γm must have been less than OPT. On the other hand, if W(H̃) is at most α·OPT + ϵ, one can show that H̃ is not too far from P (even when Γm is less than OPT). Putting these observations together, and accounting for errors in the estimation of W(H̃), we obtain the reduction described in Algorithm 5, whose accuracy we prove in Theorem 7.

Algorithm 5 Reduction from a proper learner without promise to a proper learner with promise
1: procedure B(H, α, ϵ, ϵ′, δ, δ′, sample access to P, and oracle access to A)
2:   Γℓ ← 0
3:   Γh ← 1
4:   ˆH ← A(H, Γh, ϵ, δ)
5:   while Γh − Γℓ > ϵ do
6:     Γm ← (Γh + Γℓ)/2
7:     H̃ ← A(H, Γm, ϵ, δ)
8:     Ŵ(H̃) ← estimate of W(H̃) with error at most ϵ′, with probability at least 1 − δ′
9:     if Ŵ(H̃) ≤ α·Γm + ϵ + ϵ′ then
10:      Γh ← Γm
11:      ˆH ← H̃
12:    else
13:      Γℓ ← Γm
14:  Output ˆH.
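A compact sketch of the binary search in Algorithm 5. The callables A (the promised learner, run as A(H, gamma)) and estimate_W (the Line-8 estimation step, accurate up to eps_p) abstract the oracles and are assumptions of this sketch.

    def learner_without_promise(H, alpha, eps, eps_p, A, estimate_W):
        """Sketch of Algorithm 5: binary-search the promise parameter Gamma."""
        gamma_lo, gamma_hi = 0.0, 1.0
        best = A(H, gamma_hi)                      # Gamma = 1 is always a valid promise
        while gamma_hi - gamma_lo > eps:
            gamma_mid = (gamma_hi + gamma_lo) / 2
            candidate = A(H, gamma_mid)
            if estimate_W(candidate) <= alpha * gamma_mid + eps + eps_p:
                # No evidence the candidate is far from P: the promise gamma_mid
                # may be valid, so search lower.
                gamma_hi, best = gamma_mid, candidate
            else:
                # Witnessed distance: gamma_mid must be below OPT, search higher.
                gamma_lo = gamma_mid
        return best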
Theorem 7. Fix five parameters ϵ, ϵ′, δ, δ′ ∈ (0, 1) and α ≥ 1. Assume that A is an (α, ϵ, δ)-proper learner (with promise). Suppose that we can estimate the probabilities of all the Scheffé sets according to P with additive error at most ϵ′ with probability at least 1 − δ′. Then B, described in Algorithm 5, is an (α + 2, (α + 1)·ϵ + 2·ϵ′, t·δ + δ′)-proper learner (without promise), where t := ⌈log₂(1/ϵ)⌉ + 1.

Proof. Our goal is to show that the output hypothesis ˆH satisfies, with probability at least 1 − (t·δ + δ′):

    ∥ˆH − P∥TV ≤ (α + 2)·OPT + (α + 1)·ϵ + 2·ϵ′.  (4)

Observe that at every iteration of the while loop, Γh − Γℓ is halved. Thus, the while loop takes at most ⌈log₂(1/ϵ)⌉ iterations, and t is the number of times that we invoke A in Algorithm 5. Each invocation works as expected with probability 1 − δ; hence, by the union bound, with probability 1 − t·δ we may assume all invocations of A return correct answers. In addition, with probability 1 − δ′, all the estimates we use for P(S(H̃, Hi)) in Line 8 are accurate up to additive error ϵ′; hence |Ŵ(H̃) − W(H̃)| ≤ ϵ′ with probability at least 1 − δ′. By the union bound, the following event happens with probability at least 1 − (t·δ + δ′): in every invocation of A, if OPT is below the given parameter Γ, then A returns a hypothesis H̃ that is (α·Γ + ϵ)-close to P, and Ŵ(H̃) is at most α·Γ + ϵ + ϵ′. For the rest of this proof, we condition on this event and prove Equation (4).

Consider Γh and Γℓ at the final iteration of the while loop in Algorithm 5. Observe that the output ˆH is the hypothesis A found when invoked with input Γ = Γh. We consider two cases:

Case 1: OPT ≤ Γh. In this case, ˆH is (α·Γh + ϵ)-close to P by our assumption that A works as expected, so ∥ˆH − P∥TV ≤ α·Γh + ϵ. This upper bound is written in terms of Γh, while we aim for an upper bound that involves only OPT. For this, we use the reason the binary search stopped at this particular Γh. For Γℓ, there are two possibilities: either Γℓ is zero, or we ran A with Γ = Γℓ and did not find a suitable H̃. In both cases, Γℓ is a lower bound on OPT. Thus, we have:

    OPT ≥ Γℓ ≥ Γh − ϵ.

Now, using this relationship between Γh and OPT, we obtain:

    ∥ˆH − P∥TV ≤ α·Γh + ϵ ≤ α·OPT + (α + 1)·ϵ,

implying Equation (4).

Case 2: OPT > Γh. In this case, Γh clearly cannot be one (since OPT is at most one). Thus, Γh was set in Line 10, implying that Ŵ(ˆH) ≤ α·Γh + ϵ + ϵ′. Since we assumed the estimation of W(ˆH) is accurate up to error ϵ′, we have:

    W(ˆH) ≤ Ŵ(ˆH) + ϵ′ ≤ α·Γh + ϵ + 2·ϵ′.

Given this bound on W(ˆH), we show that ˆH cannot be far from P. Let H∗ be a hypothesis in H that is OPT-close to P. We use the definition of the Scheffé sets and the triangle inequality to bound the distance between ˆH and P:

    ∥ˆH − P∥TV ≤ ∥ˆH − H∗∥TV + ∥H∗ − P∥TV  (by the triangle inequality)
    = | ˆH(S(ˆH, H∗)) − H∗(S(ˆH, H∗)) | + ∥H∗ − P∥TV  (by the definition of the Scheffé set)
    ≤ | ˆH(S(ˆH, H∗)) − P(S(ˆH, H∗)) | + | P(S(ˆH, H∗)) − H∗(S(ˆH, H∗)) | + ∥H∗ − P∥TV  (by the triangle inequality)
    ≤ | ˆH(S(ˆH, H∗)) − P(S(ˆH, H∗)) | + 2·∥H∗ − P∥TV.

Now, we use the definition of W(ˆH) and conclude:

    ∥ˆH − P∥TV ≤ W(ˆH) + 2·∥H∗ − P∥TV = W(ˆH) + 2·OPT ≤ α·Γh + ϵ + 2·ϵ′ + 2·OPT ≤ (α + 2)·OPT + ϵ + 2·ϵ′.

Therefore, the proof is complete.

Corollary C.2. For all parameters ϵ′ and δ′ in (0, 1), there exists an (α = 5, ϵ′, δ′)-proper learner (without promise) that uses O( log(n/δ′)/(ϵ′)² ) samples and runs in O( n·log(n/δ′)·log(1/ϵ′) / (δ′·(ϵ′)²) ) time.

Proof. We first fix the parameters: we set ϵ := ϵ′/6, use estimates of W with additive error ϵ′/6 and failure probability δ′/2, and run A with confidence parameter δ := δ′/(2·t), where t := ⌈log₂(1/ϵ)⌉ + 1 is the number of invocations of A in Algorithm 5. Using the Chernoff bound and reusing all samples, we can estimate all the Scheffé sets with additive error at most ϵ′/6 with probability at least 1 − δ′/2. Note that, based on Corollary A.1, the random ladder tournament (with an arbitrarily small confidence parameter) is an (α = 3, ϵ, δ)-proper learner (with promise) that uses O(log(n/δ)/ϵ²) samples and runs in O(n·log(n/δ)/(δ·ϵ²)) time. We use this learner as our A and run the reduction algorithm B with the parameters specified above. Now, using Theorem 7, we obtain a proper learner without promise that satisfies the following guarantee with probability at least 1 − δ′ = 1 − (δ′/2 + t·δ):

    ∥ˆH − P∥TV ≤ (α + 2)·OPT + (α + 1)·ϵ + 2·(ϵ′/6) = 5·OPT + ϵ′.

Regarding the time complexity, the while loop in Algorithm 5 repeats t times. In each iteration, we compute Ŵ(H̃), which takes O( n·log(n/δ′)/(ϵ′)² ) time, and we call the random ladder tournament (i.e., A). Hence, the overall time complexity is:

    O( t·( n·log(n/δ′)/(ϵ′)² + n·log(n/δ)/(δ·(ϵ′)²) ) ) = O( n·log(n/δ′)·log(1/ϵ′) / (δ′·(ϵ′)²) ).

Hence, the proof is complete.

Memory-data consumption of the reduction algorithm: Besides the sample and memory usage of A, B needs to keep track of O(1) numbers and to compute Ŵ(H̃). Given b bits of memory, we can partition H into consecutive blocks of size r = b/log((log n)/ϵ). In every round, we process one block of size r, keeping track of the Scheffé counts between H̃ and the hypotheses in the block. Each round uses O(log(n)/ϵ²) samples, and there are O(n/r) rounds. Thus, the total number of samples is s := O( (n/r)·(log n)/ϵ² ), and for this part we have:

    s·b = O( (n·log n/ϵ²)·log((log n)/ϵ) ).

Remark 8. Using the reduction algorithm with the memory-data tradeoff described above, one can translate our main theorem, Theorem 1, into an (α = 11, ϵ, 0.1)-proper learner (without promise), incurring additional factors of poly(log log n, log(1/ϵ)).

D Auxiliary lemmas and facts

Fact D.1 (adapted from Theorem 4 in [MS08]). For every pair i, j ∈ [n], let qij denote an estimate of the probability of the Scheffé set of Hi and Hj according to P. Let ˆH be the hypothesis that minimizes the following quantity:

    ˆH := arg min_{Hi ∈ H} max_{j ∈ [n]\{i}} | Hi(S(Hi, Hj)) − qij |.

For every ϵ and δ, if we estimate the qij using O(log(n/δ)/ϵ²) samples from P, then ˆH satisfies the following guarantee:

    ∥ˆH − P∥TV ≤ 3·OPT + ϵ.
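A sketch of the selection rule in Fact D.1. The dict-of-dicts encoding of the Scheffé masses and of their empirical estimates is an assumption of this sketch.

    def minimum_distance_estimate(H_mass, q_hat):
        """Minimum-distance estimate of Fact D.1. H_mass[i][j] is
        H_i(S(H_i, H_j)) (available via Scheffé-mass queries) and q_hat[i][j]
        is the empirical estimate of P(S(H_i, H_j)). Returns the hypothesis
        index minimizing the worst-case Scheffé-set discrepancy."""
        def witness(i):
            return max(abs(H_mass[i][j] - q_hat[i][j]) for j in H_mass if j != i)
        return min(H_mass, key=witness)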
Lemma 2.1. Upon receiving three parameters Γ, ϵ, and δ, Algorithm 1 uses O(log(1/δ)/ϵ²) samples and satisfies the following guarantees with probability at least 1 − δ:

- If H1 is Γ-close to P, then the algorithm returns H1.
- If H1 is (max(Γ, ∥H2 − P∥TV) + 2·∥H2 − P∥TV + ϵ)-far from P, then the algorithm returns H2.

Proof. Let q, p1, and p2 denote the true probability of the Scheffé set of H1 and H2 according to P, H1, and H2, respectively:

    q := P(S(H1, H2)),  p1 := H1(S(H1, H2)),  p2 := H2(S(H1, H2)).

Let q̂ be the estimate of q from the samples. Using the Hoeffding bound, |q − q̂| is at most ϵ′ := ϵ/2 with probability 1 − δ, since we use O(log(1/δ)/ϵ²) samples. Now, if ∥H1 − P∥TV is at most Γ, then |p1 − q| must be at most Γ, implying:

    |p1 − q̂| ≤ |p1 − q| + |q − q̂| ≤ Γ + ϵ′.

Therefore, the algorithm outputs H1 as the winner. Next, we show the second guarantee. If the total variation distance ∥H1 − P∥TV is greater than max(Γ, ∥H2 − P∥TV) + 2·∥H2 − P∥TV + ϵ, then by the triangle inequality and the definition of total variation distance, we have:

    |p1 − q̂| ≥ |p1 − p2| − |p2 − q̂|
    ≥ |p1 − p2| − |p2 − q| − |q − q̂|
    ≥ ∥H1 − H2∥TV − ∥H2 − P∥TV − ϵ′  (by the definition of the Scheffé set S(H1, H2))
    ≥ ∥H1 − P∥TV − 2·∥H2 − P∥TV − ϵ′  (by the triangle inequality)
    > max(Γ, ∥H2 − P∥TV) + ϵ − ϵ′
    = max(Γ, ∥H2 − P∥TV) + ϵ′
    ≥ max( Γ + ϵ′, |p2 − q| + ϵ′ )
    ≥ max( Γ + ϵ′, |p2 − q̂| ).

Thus, the algorithm outputs H2 as the winner.
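Algorithm 1 itself is not reproduced in this excerpt; the following decision rule is inferred from the proof of Lemma 2.1 and should be read as an assumption of this sketch.

    def compare_pair(p1, samples_in_scheffe, m, gamma, eps):
        """Pairwise comparison analyzed in Lemma 2.1.
        p1 = H1(S(H1, H2)); samples_in_scheffe = number of the m samples landing
        in S(H1, H2). Returns 1 if H1 wins, 2 otherwise."""
        q_hat = samples_in_scheffe / m
        # H1 survives iff its mass on the Scheffé set is consistent with the
        # data up to the promise Gamma plus the estimation slack eps/2.
        return 1 if abs(p1 - q_hat) <= gamma + eps / 2 else 2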
Lemma B.2. For a prime q, assume we have a class of q hypotheses that contains te ≥ 1 excellent hypotheses and td decent hypotheses. Let a ≠ 0 and b be two random numbers in Zq selected uniformly at random. Let H′ be the following set of hypotheses: {H_{fa,b(x)} : x ∈ [n′]}, where fa,b(x) := a·x + b (mod q). If (te + td)/q ≤ 1/(2·n′), then the probability that H′ contains exactly one excellent hypothesis and no decent hypothesis is at least n′·te/(2·q).

Proof. By Proposition 3.24 in [Vad12], the mapping is in fact a class of pairwise independent hash functions. That is, the following holds:

- Fix two indices i ∈ Zq and x ∈ [n′]. The probability of mapping x to i is Pr_{a,b}[ fa,b(x) = i ] = 1/q.
- Fix four indices i, j ∈ Zq and x, y ∈ [n′] with x ≠ y. The probability of mapping x to i and y to j is Pr_{a,b}[ fa,b(x) = i and fa,b(y) = j ] = 1/q².

Let He and Hd denote the sets of excellent and decent hypotheses, respectively. Fix an index i for which Hi is an excellent hypothesis, and suppose one of the indices in [n′], say x, is mapped to i. Now, the probability that another hypothesis in He ∪ Hd is chosen for a fixed index y ≠ x in [n′] is:

    Pr_{a,b}[ H_{fa,b(y)} ∈ He ∪ Hd | fa,b(x) = i ]
    = Σ_{j : Hj ∈ (He ∪ Hd)\{Hi}} Pr_{a,b}[ fa,b(y) = j | fa,b(x) = i ]
    = Σ_{j : Hj ∈ (He ∪ Hd)\{Hi}} Pr_{a,b}[ fa,b(y) = j and fa,b(x) = i ] / Pr_{a,b}[ fa,b(x) = i ]
    = (te + td − 1)/q.

Hence, the expected number of hypotheses (other than Hi) from He ∪ Hd that are selected is:

    E[ #{ y ∈ [n′]\{x} : H_{fa,b(y)} ∈ He ∪ Hd } | fa,b(x) = i ] = (n′ − 1)·(te + td − 1)/q ≤ n′·(te + td)/q ≤ 1/2,

where the last inequality is due to our assumption that (te + td)/q ≤ 1/(2·n′). Next, by Markov's inequality, the probability that this number is at least one (i.e., at least twice its expectation) is at most 1/2. Hence, if x is mapped to i, then with probability at least 1/2, no other excellent or decent hypothesis is selected.

Now, we compute the probability that exactly one excellent hypothesis and no decent hypothesis is selected. The events indexed by the pairs (x, i) below are disjoint, so their probabilities add:

    Pr_{a,b}[ one excellent hypothesis and no decent one ]
    ≥ Σ_{x ∈ [n′]} Σ_{i : Hi ∈ He} Pr[ fa,b(x) = i and no other hypothesis in He ∪ Hd is selected ]
    = Σ_{x ∈ [n′]} Σ_{i : Hi ∈ He} Pr[ fa,b(x) = i ] · Pr[ no y ∈ [n′]\{x} s.t. H_{fa,b(y)} ∈ He ∪ Hd | fa,b(x) = i ]
    ≥ Σ_{x ∈ [n′]} Σ_{i : Hi ∈ He} (1/q)·(1/2) = n′·te/(2·q).

Hence, the proof is complete.

Fact D.2. Let n, a, b be three positive integers. Then, we have:

    ⌈ ⌈n/a⌉ / b ⌉ = ⌈ n/(a·b) ⌉.

Proof. Let r denote the remainder of n divided by a. If r = 0, then the claim is trivial; therefore, assume r ∈ {1, 2, . . . , a − 1}. Let s be the remainder of n divided by a·b. Clearly, s must be of the form t·a + r for some t ∈ {0, 1, . . . , b − 1}. Writing n = k·a·b + s, we have:

    ⌈ ⌈n/a⌉ / b ⌉ = ⌈ (k·b + t + ⌈r/a⌉) / b ⌉ = ⌈ (k·b + t + 1) / b ⌉ = k + 1 = ⌈ n/(a·b) ⌉.

Fact D.3. Given the definition of the gℓ's in Equation (2), for any positive integer ℓ, we have:

    ∏_{i=1}^{ℓ} gi ≥ r^{1.5^ℓ − 1}.

Proof. By the formula for geometric series, we obtain:

    ∏_{i=1}^{ℓ} gi ≥ ∏_{i=1}^{ℓ} (√r)^{1.5^{i−1}} = (√r)^{Σ_{i=1}^{ℓ} 1.5^{i−1}} = (√r)^{(1.5^ℓ − 1)/(1.5 − 1)} = (√r)^{2·(1.5^ℓ − 1)} = r^{1.5^ℓ − 1}.