# Optimal Weak to Strong Learning

Kasper Green Larsen, Department of Computer Science, Aarhus University, Aarhus, Denmark. larsen@cs.au.dk
Martin Ritzert, Department of Computer Science, Aarhus University, Aarhus, Denmark. ritzert@cs.au.dk

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

The classic algorithm AdaBoost makes it possible to convert a weak learner, that is, an algorithm producing a hypothesis slightly better than chance, into a strong learner that achieves arbitrarily high accuracy when given enough training data. We present a new algorithm that constructs a strong learner from a weak learner but uses less training data than AdaBoost and all other weak-to-strong learners to achieve the same generalization bounds. A sample complexity lower bound shows that our new algorithm uses the minimum possible amount of training data and is thus optimal. Hence, this work settles the sample complexity of the classic problem of constructing a strong learner from a weak learner.

## 1 Introduction

The field of boosting started from a classic question in learning theory: can classifiers that are just slightly better than random guessing be used to create a classifier with arbitrarily high accuracy when given enough training data? This question was initially asked by Kearns and Valiant [15, 16] and ignited the line of research that eventually led to the development of AdaBoost [7], the prototypical boosting algorithm to date. AdaBoost carefully combines the predictions of several inaccurate classifiers, each trained with a focus on different parts of the training data, to come up with a voting classifier that performs well everywhere.

We quantify the performance of an inaccurate learner by its advantage $\gamma$ over random guessing. Loosely said, a $\gamma$-weak learner correctly classifies new data points with probability at least $1/2 + \gamma$. In contrast, given $0 < \varepsilon, \delta < 1$ and enough training data, a strong learner outputs, with probability $1 - \delta$ over the choice of the training data and possible random choices of the algorithm, a hypothesis that correctly classifies new data points with probability at least $1 - \varepsilon$. The number of samples $m(\varepsilon, \delta)$ needed for the learning algorithm to achieve the desired accuracy and confidence is called the sample complexity. The sample complexity is the key performance metric of a strong learner and depends on the weak learner's advantage $\gamma$, the weak learner's flexibility measured in terms of the VC-dimension, as well as on $\varepsilon$ and $\delta$. Essentially, a construction with low sample complexity makes the most out of the available training data.

AdaBoost [7] is the classic algorithm for constructing a strong learner from a $\gamma$-weak learner. If the weak learner outputs hypotheses from a base set of hypotheses $\mathcal{H}$, then AdaBoost constructs a strong learner by taking a weighted majority vote among several hypotheses $h_1, \dots, h_t$ from $\mathcal{H}$. Each of these hypotheses is obtained by invoking the $\gamma$-weak learning algorithm on differently weighted versions of a set of training samples $S$. The number of samples required by AdaBoost for strong learning depends both on the advantage $\gamma$ of the weak learner and on the complexity of the hypothesis set $\mathcal{H}$. If we let $d$ denote the VC-dimension of $\mathcal{H}$, i.e. the cardinality of the largest set of data points $x_1, \dots, x_d$ such that every classification of $x_1, \dots, x_d$ can be realized by a hypothesis $h \in \mathcal{H}$, then it
is known that AdaBoost is a strong learner which, for error $\varepsilon$ and failure probability $\delta$, requires

$$m(\varepsilon, \delta) = O\left(\frac{d\,\ln(1/(\varepsilon\gamma))\,\ln(d/(\varepsilon\gamma))}{\varepsilon\gamma^{2}} + \frac{\ln(1/\delta)}{\varepsilon}\right) \quad (1)$$

samples. This sample complexity is state-of-the-art for producing a strong learner from a $\gamma$-weak learner. However, is this the best possible sample complexity? This is the main question we ask and answer in this work. First, we present a new algorithm for constructing a strong learner from a weak learner and prove that it requires only

$$O\left(\frac{d\gamma^{-2} + \ln(1/\delta)}{\varepsilon}\right)$$

samples. In addition to improving over AdaBoost by two logarithmic factors, we complement our new algorithm by a lower bound showing that any algorithm for converting a $\gamma$-weak learner to a strong learner requires

$$\Omega\left(\frac{d\gamma^{-2} + \ln(1/\delta)}{\varepsilon}\right)$$

samples. Combining these two results, we have a tight bound on the sample complexity of weak-to-strong learning. In the remainder of the section, we give a more formal introduction to weak and strong learning, present our main results, and survey previous work.

### 1.1 Weak and strong learning

Consider a binary classification task in which there is an unknown concept $c: \mathcal{X} \to \{-1, 1\}$ assigning labels to a ground set $\mathcal{X}$. The goal is to learn or approximate $c$ to high accuracy. Formally, we assume that there is an unknown but fixed data distribution $\mathcal{D}$ over $\mathcal{X}$. A learning algorithm then receives a set $S$ of i.i.d. samples $x_1, \dots, x_m$ from $\mathcal{D}$ together with their labels $c(x_1), \dots, c(x_m)$ and produces a hypothesis $h \approx c$ based on $S$ and the labels. To measure how well $h$ approximates $c$, it is assumed that a new data point $x$ is drawn from the same unknown distribution $\mathcal{D}$, and the goal is to minimize the probability of mispredicting the label of $x$. We say that a learning algorithm is a weak learner if it satisfies the following:

**Definition 1.** Let $\mathcal{C} \subseteq \mathcal{X} \to \{-1,1\}$ be a set of concepts and $\mathcal{A}$ a learning algorithm. We say that $\mathcal{A}$ is a $\gamma$-weak learner for $\mathcal{C}$ if there is a constant $\delta_0 < 1$ and an integer $m_0 \in \mathbb{N}$ such that for every distribution $\mathcal{D}$ over $\mathcal{X}$ and every concept $c \in \mathcal{C}$, when given $m_0$ i.i.d. samples $S = x_1, \dots, x_{m_0}$ from $\mathcal{D}$ together with their labels $c(x_1), \dots, c(x_{m_0})$, it holds with probability at least $1 - \delta_0$ over the sample $S$ and the randomness of $\mathcal{A}$ that $\mathcal{A}$ outputs a hypothesis $h: \mathcal{X} \to \{-1,1\}$ such that
$$\mathcal{L}_{\mathcal{D}}(h) = \Pr_{x \sim \mathcal{D}}[h(x) \neq c(x)] \leq 1/2 - \gamma.$$

A $\gamma$-weak learner thus achieves an advantage of $\gamma$ over random guessing when given $m_0$ samples. Note that $\mathcal{A}$ knows neither the distribution $\mathcal{D}$ nor the concrete concept $c \in \mathcal{C}$, but achieves the advantage $\gamma$ for all $\mathcal{D}$ and $c$. We remark that in several textbooks (e.g. Mohri et al. [18]) a weak learner needs to work for every $\delta > 0$, while Definition 1 only requires the existence of some $\delta_0$. Thus, every weak learner satisfying the definition of Mohri et al. also satisfies Definition 1, making our results more general.

In contrast to a weak learner, a strong learner can obtain arbitrarily high accuracy:

**Definition 2.** Let $\mathcal{C} \subseteq \mathcal{X} \to \{-1,1\}$ be a set of concepts and $\mathcal{A}$ a learning algorithm. We say that $\mathcal{A}$ is a strong learner for $\mathcal{C}$ if for all $0 < \varepsilon, \delta < 1$, there is some number $m(\varepsilon, \delta)$ such that for every distribution $\mathcal{D}$ over $\mathcal{X}$ and every concept $c \in \mathcal{C}$, when given $m = m(\varepsilon, \delta)$ i.i.d. samples $S = x_1, \dots, x_m$ from $\mathcal{D}$ together with their labels $c(x_1), \dots, c(x_m)$, it holds with probability at least $1 - \delta$ over the sample $S$ and the randomness of $\mathcal{A}$ that $\mathcal{A}$ outputs a hypothesis $h: \mathcal{X} \to \{-1,1\}$ such that
$$\mathcal{L}_{\mathcal{D}}(h) = \Pr_{x \sim \mathcal{D}}[h(x) \neq c(x)] \leq \varepsilon.$$
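To make Definition 1 concrete, the following toy sketch (our own illustration, not from the paper) implements perhaps the simplest weak learner, a decision stump chosen by empirical risk minimization. On data that is close to a threshold function, its error stays bounded away from $1/2$, i.e. it has a positive empirical advantage $\gamma$. Note that Definition 1 concerns the error under $\mathcal{D}$, which one would estimate on fresh samples rather than on the training set.

```python
import numpy as np

def best_stump(X, y):
    """Pick the threshold/sign decision stump with the smallest training error.
    X: (m,) array of 1-d points, y: (m,) labels in {-1, +1}."""
    best_err, best_params = np.inf, None
    for thr in np.unique(X):
        for sign in (-1, 1):
            pred = np.where(X <= thr, sign, -sign)
            err = float(np.mean(pred != y))
            if err < best_err:
                best_err, best_params = err, (thr, sign)
    return best_err, best_params

# Toy data: a noisy threshold concept over X = [-1, 1].
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
y = np.where(X + 0.2 * rng.normal(size=200) > 0, 1, -1)
err, _ = best_stump(X, y)
print(f"stump error {err:.3f}, empirical advantage {0.5 - err:.3f}")
```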
The definition of a strong learner is essentially identical to the classic notion of $(\varepsilon, \delta)$-PAC learning in the realizable setting. Unlike for the $\gamma$-weak learner, we here require the learner to output a classifier with arbitrarily high accuracy ($\varepsilon$ small) and confidence ($\delta$ small) when given enough samples $S$.

Kearns and Valiant [15, 16] asked whether one can always obtain a strong learner when given access only to a $\gamma$-weak learner for some $\gamma > 0$. This was answered affirmatively by Schapire [21] and is the motivation behind the design of AdaBoost [7]. If we let $\mathcal{H}$ denote the set of hypotheses that a $\gamma$-weak learner may output from, then AdaBoost returns a voting classifier $f(x) = \operatorname{sign}(\sum_{i=1}^{t} \alpha_i h_i(x))$, where each $h_i \in \mathcal{H}$ is the output of the $\gamma$-weak learner when trained on some carefully weighted version of the training set $S$, and each $\alpha_i$ is a real-valued weight.

In terms of sample complexity $m(\varepsilon, \delta)$, the number of samples stated in Eq. (1) is sufficient for AdaBoost. There are several ways to prove this. For instance, it can be argued that when given $m$ samples, AdaBoost combines only $t = O(\gamma^{-2} \ln m)$ hypotheses $h_1, \dots, h_t$ from $\mathcal{H}$ in order to produce an $f$ that perfectly classifies all the training data $S$, i.e. $f(x_i) = c(x_i)$ for all $x_i \in S$. Using that the class $\mathcal{H}^t$ can generate at most $(em/d)^{dt}$ distinct classifications of $m$ points (i.e. its growth function is bounded by this), one can intuitively invoke the classic generalization bounds for PAC-learning in the realizable case to conclude that the hypothesis $f$ satisfies

$$\mathcal{L}_{\mathcal{D}}(f) \leq O\left(\frac{d\,t\,\ln(m/d) + \ln(1/\delta)}{m}\right) = O\left(\frac{d\,\ln(m/d)\,\ln m}{\gamma^{2} m} + \frac{\ln(1/\delta)}{m}\right) \quad (2)$$

with probability at least $1 - \delta$ over $S$ (and potentially the randomness of the weak learner). Setting $\mathcal{L}_{\mathcal{D}}(f) = \varepsilon$ and solving Eq. (2) for $m$ gives the sample complexity stated in Eq. (1). This is the best sample complexity bound of any weak-to-strong learner prior to this work.
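For reference, here is a minimal sketch of the weighted-majority construction just described: AdaBoost reweighting the training set and combining stumps into $f(x) = \operatorname{sign}(\sum_i \alpha_i h_i(x))$. It is a textbook-style illustration under our own naming (`weighted_stump`, `adaboost`), not the exact variant analyzed in this paper.

```python
import numpy as np

def weighted_stump(X, y, w):
    """Best threshold stump under sample weights w (weighted ERM over H)."""
    best_err, best_params = np.inf, None
    for thr in np.unique(X):
        for sign in (-1, 1):
            pred = np.where(X <= thr, sign, -sign)
            err = float(np.average(pred != y, weights=w))
            if err < best_err:
                best_err, best_params = err, (thr, sign)
    return best_err, best_params

def stump_predict(params, X):
    thr, sign = params
    return np.where(X <= thr, sign, -sign)

def adaboost(X, y, T):
    """T rounds of AdaBoost; returns the weights alpha_i and stumps h_i."""
    m = len(X)
    w = np.full(m, 1.0 / m)                     # distribution over samples
    alphas, stumps = [], []
    for _ in range(T):
        err, params = weighted_stump(X, y, w)
        err = np.clip(err, 1e-12, 0.5 - 1e-12)  # guard degenerate stumps
        alpha = 0.5 * np.log((1 - err) / err)   # standard AdaBoost weight
        w *= np.exp(-alpha * y * stump_predict(params, X))  # focus on mistakes
        w /= w.sum()
        alphas.append(alpha)
        stumps.append(params)
    return np.array(alphas), stumps

def vote(alphas, stumps, X):
    return np.sign(sum(a * stump_predict(p, X) for a, p in zip(alphas, stumps)))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 300)
y = np.where(np.sin(6 * X) > 0, 1, -1)          # not realizable by one stump
alphas, stumps = adaboost(X, y, T=40)
print("training error:", float(np.mean(vote(alphas, stumps, X) != y)))
```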
Our main upper bound result is a new algorithm with better sample complexity than AdaBoost and all other weak-to-strong learners. It guarantees the following:

**Theorem 1.** Assume we are given access to a $\gamma$-weak learner for some $0 < \gamma < 1/2$, using a base hypothesis set $\mathcal{H} \subseteq \mathcal{X} \to \{-1,1\}$ of VC-dimension $d$. Then there is a universal constant $\alpha > 0$ and an algorithm $\mathcal{A}$, such that $\mathcal{A}$ is a strong learner with sample complexity $m(\varepsilon, \delta)$ satisfying
$$m(\varepsilon, \delta) \leq \alpha \cdot \frac{d\gamma^{-2} + \ln(1/\delta)}{\varepsilon}.$$

We remark that it is often required that a strong learner run in polynomial time given a polynomial-time weak learner. This is indeed the case for our new algorithm. Next, we complement our algorithm from Theorem 1 by the following lower bound:

**Theorem 2.** There is a universal constant $\alpha > 0$ such that for all integers $d \in \mathbb{N}$ and every $2^{-d} < \gamma < 1/80$, there is a finite set $\mathcal{X}$, a concept class $\mathcal{C} \subseteq \mathcal{X} \to \{-1,1\}$ and a hypothesis set $\mathcal{H} \subseteq \mathcal{X} \to \{-1,1\}$ of VC-dimension at most $d$, such that for every $(\varepsilon, \delta)$ with $0 < \varepsilon < 1$ and $0 < \delta < 1/3$, there is a distribution $\mathcal{D}$ over $\mathcal{X}$ such that the following holds:

1. For every $c \in \mathcal{C}$ and every distribution $\mathcal{D}'$ over $\mathcal{X}$, there is an $h \in \mathcal{H}$ with $\Pr_{x \sim \mathcal{D}'}[h(x) \neq c(x)] \leq 1/2 - \gamma$.
2. For any algorithm $\mathcal{A}$, there is a concept $c \in \mathcal{C}$ such that $\mathcal{A}$ requires at least
$$\Omega\left(\frac{d\gamma^{-2} + \ln(1/\delta)}{\varepsilon}\right)$$
samples $S$ and labels $c(S)$ to guarantee $\mathcal{L}_{\mathcal{D}}(h_S) \leq \varepsilon$ with probability at least $1 - \delta$ over $S$, where $h_S$ is the hypothesis produced by $\mathcal{A}$ on $S$ and $c(S)$.

The first statement of Theorem 2 says that the concept class $\mathcal{C}$ can be $\gamma$-weakly learned. The second statement then says that any learner requires as many samples as our new algorithm. Moreover, the lower bound does not require the algorithm to use a $\gamma$-weak learner at all, nor does it need to run in polynomial time for the lower bound to apply. Furthermore, the algorithm is even allowed to use full knowledge of the concept class $\mathcal{C}$ and the distribution $\mathcal{D}$. The only thing it does not know is which concept $c \in \mathcal{C}$ provides the labels $c(S)$ of the training samples. The lower bound thus matches our upper bound except possibly for very small $\gamma < 2^{-d}$. We comment further on this case in Section 5.

In the next section, we present the overall ideas of our new algorithm as well as a new generalization bound for voting classifiers that is key to our algorithm. Finally, we sketch the main ideas of the lower bound. Due to space constraints, several proofs have been moved to the supplementary material.

### 1.2 Main ideas and voting classifiers

One of the key building blocks of our new algorithm is voting classifiers. To introduce them formally, define from a hypothesis set $\mathcal{H} \subseteq \mathcal{X} \to \{-1,1\}$ the set $\Delta(\mathcal{H})$ of all convex combinations of hypotheses in $\mathcal{H}$. That is, $\Delta(\mathcal{H})$ contains all functions $f$ of the form $f(x) = \sum_{i=1}^{t} \alpha_i h_i(x)$ with $\alpha_i > 0$ and $\sum_i \alpha_i = 1$. AdaBoost can be thought of as producing a voting classifier $g(x) = \operatorname{sign}(f(x))$ for an $f \in \Delta(\mathcal{H})$ by appropriate normalization of the weights it uses.

Classic work on understanding the surprisingly high accuracy of AdaBoost introduced the notion of margins [2]. For a function $f \in \Delta(\mathcal{H})$ and a sample $x$ with label $y$, the margin of $f$ on $(x, y)$ is $y f(x)$. Notice that the margin is positive if and only if $\operatorname{sign}(f(x))$ correctly predicts the label $y$ of $x$. It was empirically observed that AdaBoost produces voting classifiers $g(x) = \operatorname{sign}(f(x))$ where $f$ has large margins. This inspired multiple generalization bounds based on the margins of a voting classifier, considering both the minimum margin and the $k$-th smallest margin [12, 4, 3, 19, 20, 17]. The simplest such bound, which assumes all margins are at least $\gamma$, is Breiman's min-margin bound:

**Theorem 3 (Breiman [4]).** Let $c \in \mathcal{X} \to \{-1,1\}$ be an unknown concept, $\mathcal{H} \subseteq \mathcal{X} \to \{-1,1\}$ a hypothesis set of VC-dimension $d$, and $\mathcal{D}$ an arbitrary distribution over $\mathcal{X}$. With probability at least $1 - \delta$ over a set of $m$ samples $S \sim \mathcal{D}^m$, it holds for every voting classifier $g(x) = \operatorname{sign}(f(x))$ with $f \in \Delta(\mathcal{H})$ satisfying $c(x) f(x) \geq \gamma$ for all $x \in S$ that
$$\mathcal{L}_{\mathcal{D}}(g) = O\left(\frac{d\,\ln(m/d)\,\ln m}{\gamma^{2} m} + \frac{\ln(1/\delta)}{m}\right).$$

The resemblance to the generalization performance of AdaBoost in Eq. (2) is no coincidence. Indeed, a small twist to AdaBoost, implemented in the AdaBoost$_\nu$ algorithm [20], ensures that the voting classifier produced by AdaBoost$_\nu$ from a $\gamma$-weak learner has all margins at least $\gamma/2$. This gives an alternative way of obtaining the previously best sample complexity in Eq. (1). We remark that more refined margin-based generalization bounds exist, such as the $k$-th margin bound by Gao and Zhou [8], which is known to be near-tight [10]. These bounds take the whole sequence of margins $c(x_i) f(x_i)$ of all samples $x_i \in S$ into account, not only the smallest. However, none of these bounds leads to better generalization from a $\gamma$-weak learner. We note that the notion of margins has not only been considered in the context of boosting algorithms, but also plays a key role in understanding the generalization performance of Support Vector Machines; see e.g. the recent works [14, 11] giving tight SVM generalization bounds in terms of margins.

In our new algorithm, we make use of a voting classifier with good margins as a subroutine.
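As a small numerical illustration (ours, with hypothetical names), the margins $y f(x)$ from the definition above can be read off directly once the voting weights are normalized so that $f \in \Delta(\mathcal{H})$; the smallest entry is the minimum margin appearing in Theorem 3.

```python
import numpy as np

def margins(alphas, predictions, y):
    """Margins y_j * f(x_j) for f = sum_i alpha_i h_i, normalized to Delta(H).
    alphas: (T,) nonnegative weights; predictions: (T, m) with entries in
    {-1, +1}, predictions[i, j] = h_i(x_j); y: (m,) labels in {-1, +1}."""
    a = np.asarray(alphas, float)
    a /= a.sum()                       # make f a convex combination
    f = a @ np.asarray(predictions)    # f(x_j) for every sample
    return y * f                       # positive iff sign(f(x_j)) == y_j

# Three hypotheses voting on four points; the min entry is the min margin.
preds = np.array([[ 1,  1, -1,  1],
                  [ 1, -1, -1,  1],
                  [ 1,  1,  1, -1]])
y = np.array([1, 1, -1, 1])
print(margins([2.0, 1.0, 1.0], preds, y))   # -> [1.0, 0.5, 0.5, 0.5]
```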
Concretely, we invoke AdaBoost$_\nu$ to obtain margins of at least $\gamma/2$ on all training samples. At first sight, this seems to incur logarithmic losses, at least if the analysis by Breiman is tight. Moreover, GrΓΈnlund et al. [9] proved a generalization lower bound showing that there are voting classifiers with margins $\gamma$ on all training samples for which at least one of the logarithmic factors in the generalization bound must occur. To circumvent this, we first notice that the lower bound only applies when $m$ is sufficiently larger than $d\gamma^{-2}$. We carefully exploit this loophole and prove a new generalization bound for voting classifiers:

**Theorem 4.** Let $c \in \mathcal{X} \to \{-1,1\}$ be an unknown concept, $\mathcal{H} \subseteq \mathcal{X} \to \{-1,1\}$ a hypothesis set of VC-dimension $d$, and $\mathcal{D}$ an arbitrary distribution over $\mathcal{X}$. There is a universal constant $\alpha > 0$ such that with probability at least $1 - \delta$ over a set of $m \geq \alpha(d\gamma^{-2} + \ln(1/\delta))$ samples $S \sim \mathcal{D}^m$, every voting classifier $g(x) = \operatorname{sign}(f(x))$ with $f \in \Delta(\mathcal{H})$ satisfying $c(x) f(x) \geq \gamma$ for all $x \in S$ achieves $\mathcal{L}_{\mathcal{D}}(g) \leq 1/200$.

The value $1/200$ is arbitrary and chosen to match the requirements in the proof of Theorem 1. Notice how our new generalization bound avoids the logarithmic factors when aiming merely at generalization error $1/200$: Breiman's bound would only guarantee that $O(d\gamma^{-2}\ln(1/\gamma)\ln(d/\gamma))$ samples suffice for such a generalization error. While the focus of previous work on generalization bounds was not on the constant-error case, we remark that obvious modifications to the previous proofs could perhaps remove the $\ln m$ factor, but not the $\ln(m/d)$ factor. The $\ln(m/d)$ factor turns into $\Theta(\ln(1/\gamma))$ when solving $d\ln(m/d)/(\gamma^{2}m) = 1/200$ for $m$, and is thus insufficient for our purpose.
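To spell out this last step (a short calculation of ours, not from the paper): solving
$$\frac{d\ln(m/d)}{\gamma^{2}m} = \frac{1}{200} \iff m = 200\,d\gamma^{-2}\ln(m/d)$$
with the ansatz $m = \Theta(d\gamma^{-2}\ln(1/\gamma))$ gives $\ln(m/d) = \ln\big(\Theta(\gamma^{-2}\ln(1/\gamma))\big) = \Theta(\ln(1/\gamma))$, which is self-consistent. Hence even the single factor $\ln(m/d)$ forces $m = \Theta(d\gamma^{-2}\ln(1/\gamma))$ at constant error, whereas Theorem 4 needs only $m = O(d\gamma^{-2} + \ln(1/\delta))$.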
With the new generalization bound at hand, we can now construct our algorithm for producing a strong learner from a $\gamma$-weak learner. Here we use as a template the sample-optimal algorithm by Hanneke [13] for PAC learning in the realizable case (which improved over a previous near-tight result by Simon [22]). Given a training set $S$, his algorithm carefully constructs a number of sub-samples $S_1, S_2, \dots, S_k$ of $S$ and trains a hypothesis $h_i$ on each $S_i$ using empirical risk minimization. As the final classifier, he returns the voter $g(x) = \operatorname{sign}((1/k)\sum_{i=1}^{k} h_i(x))$.

For our new algorithm, we use Hanneke's approach to construct sub-samples $S_1, \dots, S_k$ of a training set $S$. We then run AdaBoost$_\nu$ on each $S_i$ to produce a voting classifier $g_i(x) = \operatorname{sign}(f_i(x))$ for an $f_i \in \Delta(\mathcal{H})$ with margins at least $\gamma/2$ on all samples in $S_i$. We finally return the voter $h(x) = \operatorname{sign}((1/k)\sum_{i=1}^{k} g_i(x))$. Our algorithm thus returns a majority of majorities.

To prove that our algorithm achieves the sample complexity $m(\varepsilon, \delta)$ claimed in Theorem 1, we revisit Hanneke's proof and show that it suffices for his argument that the base learning algorithm (in his case empirical risk minimization, in our case AdaBoost$_\nu$) achieves an error of at most $1/200$ when given $\tau$ samples. If this is the case, then his proof can be modified to show that the final error of the output voter drops to $O(\tau/m)$. Plugging in $\tau = \alpha(d\gamma^{-2} + \ln(1/\delta))$ from our new generalization bound in Theorem 4 completes the proof.

Let us remark that a lower bound by GrΓΈnlund et al. [9] shows the existence of a voting classifier that simultaneously has large margins and a generalization error with an additional log-factor. It is thus conceivable that a simple majority vote is not sufficient and a majority of majorities is indeed necessary, although the lower bound only guarantees the existence of a bad voter with good margins, not that all such voters are bad.

In the following, we start by proving our new generalization bound (Theorem 4) in Section 2. We then proceed in Section 3 to present our new algorithm and sketch the proof that it gives the guarantees in Theorem 1. Finally, in Section 4 we give the proof ideas of the lower bound in Theorem 2.

## 2 New margin-based generalization bounds for voting classifiers

In this section, we prove the new generalization bound stated in Theorem 4. For ease of notation, we write that $\mathcal{D}$ is a distribution over $\mathcal{X} \times \{-1,1\}$ (and not just a distribution over $\mathcal{X}$) and implicitly assume that the label of each $x \in \mathcal{X}$ is $c(x)$ for the unknown concept $c$. Moreover, for a voting classifier $g(x) = \operatorname{sign}(f(x))$ with $f \in \Delta(\mathcal{H})$, we simply refer to $f$ as the voting classifier and remark that one needs to take the sign to make a prediction. Finally, we think of the sample $S$ as a set of pairs $(x_i, y_i)$ with $x_i \in \mathcal{X}$ and $y_i = c(x_i) \in \{-1,1\}$.

The key step in the proof of Theorem 4 is to analyze the generalization performance of a voting classifier obtained by combining randomly drawn hypotheses among the hypotheses $h_1, \dots, h_t$ making up a voting classifier $f = \sum_i \alpha_i h_i$ from $\Delta(\mathcal{H})$. We then relate this to the generalization performance of $f$ itself. Formally, we define a distribution $\mathcal{D}_{f,t}$ for every $f$ and look at a random hypothesis from $\mathcal{D}_{f,t}$. We start by defining this distribution. Let $f(x) = \sum_h \alpha_h h(x) \in \Delta(\mathcal{H})$ be a voting classifier. Let $\mathcal{D}_f$ be the distribution over $\mathcal{H}$ (restricted to the base hypotheses used in $f$) where $h$ has probability $\alpha_h$. Consider drawing $t$ i.i.d. hypotheses $h'_1, \dots, h'_t$ from $\mathcal{D}_f$ and then throwing away each $h'_i$ independently with probability $1/2$. Let $t'$ be the number of remaining hypotheses, denote them $h_1, \dots, h_{t'}$, and let $g = \frac{1}{t}\sum_{i=1}^{t'} h_i$. One can think of $g$ as a sub-sampled version of $f$ with replacement. Denote by $\mathcal{D}_{f,t}$ the distribution of $g$.

**Key properties of $\mathcal{D}_{f,t}$.** In the following, we analyze how a random $g$ from $\mathcal{D}_{f,t}$ behaves and show that, while it behaves similarly to $f$, with good probability it makes predictions that are large in absolute value (even if $f(x) \approx 0$). The proofs of the following three lemmas are given in the supplementary material. First, we note that predictions made by a random $g$ are often close to those made by $f$, and then observe that $g$ rarely makes predictions $g(x)$ that are small in absolute value.

**Lemma 1.** For any $x \in \mathcal{X}$, any $f \in \Delta(\mathcal{H})$, and any $\mu > 0$: $\Pr_{g \sim \mathcal{D}_{f,t}}[|f(x) - g(x)| \geq \mu] < 5e^{-\mu^{2}t/32}$.

**Lemma 2.** For any $x \in \mathcal{X}$, any $f \in \Delta(\mathcal{H})$, and any $\mu \geq 1/t$: $\Pr_{g \sim \mathcal{D}_{f,t}}[|g(x)| \leq \mu] \leq 2\mu\sqrt{t}$.

Lemma 2 states that even if $f(x) \approx 0$ for an unseen sample $x$, $g(x)$ will still be large in absolute value with good probability. Thus we can think of $g$ as having large margins (although perhaps negative) also on unseen data. This is crucial for bounding the generalization error. The proof follows from an invocation of ErdΕ‘s's improved version of the Littlewood-Offord lemma [6].

As the last property, we look at the out-of-sample and in-sample error of a random $g$, and start by relating the generalization error of $f$ to that of a random $g$. To formalize this, define for any distribution $\mathcal{D}$ the loss
$$\mathcal{L}^{t}_{\mathcal{D}}(f) := \Pr_{(x,y)\sim\mathcal{D},\,g\sim\mathcal{D}_{f,t}}[y\,g(x) \leq 0],$$
and when writing $\mathcal{L}^{t}_{S}(f)$, we implicitly let $S$ also denote the uniform distribution over all $(x, y) \in S$.
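The distribution $\mathcal{D}_{f,t}$ is straightforward to simulate, which gives a quick Monte Carlo sanity check of Lemmas 1 and 2 (an illustration of ours; the function names are our own). We evaluate $g$ at a point where $f(x) = 0$, the hardest case for Lemma 2:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_g_values(alpha, h_vals, t, reps):
    """Draw reps values of g(x) for g ~ D_{f,t}, where f = sum_h alpha_h * h
    and h_vals[i] = h_i(x) in {-1, +1} at a fixed point x. Each of the t
    i.i.d. draws from D_f is kept with probability 1/2; g normalizes by t."""
    idx = rng.choice(len(alpha), size=(reps, t), p=alpha)
    keep = rng.integers(0, 2, size=(reps, t))
    return (np.asarray(h_vals, float)[idx] * keep).sum(axis=1) / t

alpha, h_vals = np.array([0.5, 0.5]), np.array([1.0, -1.0])   # f(x) = 0
t = 1024
g = sample_g_values(alpha, h_vals, t, reps=4000)

mu = 0.5        # Lemma 1: g stays close to f
print(np.mean(np.abs(g - 0.0) >= mu), "<=", 5 * np.exp(-mu**2 * t / 32))
mu = 1.0 / t    # Lemma 2: g is rarely tiny in absolute value
print(np.mean(np.abs(g) <= mu), "<=", 2 * mu * np.sqrt(t))
```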
With these definitions in place, we have:

**Lemma 3.** For any distribution $\mathcal{D}$ over $\mathcal{X} \times \{-1,1\}$, any $t \geq 36$, and any voting classifier $f \in \Delta(\mathcal{H})$ for a hypothesis set $\mathcal{H} \subseteq \mathcal{X} \to \{-1,1\}$, we have $\mathcal{L}_{\mathcal{D}}(f) \leq 3\mathcal{L}^{t}_{\mathcal{D}}(f)$.

Moreover, if $f$ has margins $\gamma$ on all training samples $(x, y) \in S$, then $g$ is correct on most of $S$, provided that we set $t$ large enough:

**Lemma 4.** Let $S$ be a set of $m$ samples in $\mathcal{X} \times \{-1,1\}$ and assume $f$ is a voting classifier with $y f(x) \geq \gamma$ for all $(x, y) \in S$. For $t \geq 1024\gamma^{-2}$, we have $\mathcal{L}^{t}_{S}(f) \leq 1/1200$.

Proof. By Lemma 1, it holds for each $(x, y) \in S$ that $|f(x) - g(x)| \geq \gamma$ with probability at most $5\exp(-\gamma^{2}t/32) \leq 5e^{-32} \leq 1/1200$. Since $y f(x) \geq \gamma$, $|f(x) - g(x)| < \gamma$ implies $\operatorname{sign}(g(x)) = \operatorname{sign}(f(x)) = y$.

The last ingredient for the proof of Theorem 4 is to relate $\mathcal{L}^{t}_{S}(f)$ and $\mathcal{L}^{t}_{\mathcal{D}}(f)$. For the proof, we use Lemma 2 to infer that with good probability $|g(x)| = \Omega(\gamma)$, i.e. $g(x)$ has large absolute value. We use this to argue that $\operatorname{sign}(g)$ often belongs to a class of small VC-dimension, and then apply a growth-function argument to relate $\mathcal{L}^{t}_{S}(f)$ and $\mathcal{L}^{t}_{\mathcal{D}}(f)$. Formally, we prove the following lemma.

**Lemma 5.** Let $\mathcal{D}$ be an arbitrary distribution over $\mathcal{X} \times \{-1,1\}$ and let $\mathcal{H} \subseteq \mathcal{X} \to \{-1,1\}$ be a hypothesis set of VC-dimension $d$. There is a universal constant $\alpha > 0$ such that for any $t \in \mathbb{N}$ and any $m \geq \alpha t d$, it holds that
$$\Pr_{S}\left[\sup_{f\in\Delta(\mathcal{H})} |\mathcal{L}^{t}_{S}(f) - \mathcal{L}^{t}_{\mathcal{D}}(f)| > \frac{1}{1200}\right] \leq \alpha\exp(-m/\alpha).$$

Before we prove Lemma 5, we show how to use it to prove Theorem 4. Since we only aim to prove generalization for voting classifiers $f$ with $y f(x) \geq \gamma$ for all samples $(x, y) \in S$, Lemma 4 tells us that such $f$ have small $\mathcal{L}^{t}_{S}(f)$ when $t \geq 1024\gamma^{-2}$. We thus fix $t = 1024\gamma^{-2}$ and get $\mathcal{L}^{t}_{S}(f) \leq 1/1200$ from Lemma 4. By Lemma 5, with probability at least $1 - \alpha\exp(-m/\alpha)$ over the sample $S$, we have for all $f \in \Delta(\mathcal{H})$ that $|\mathcal{L}^{t}_{S}(f) - \mathcal{L}^{t}_{\mathcal{D}}(f)| \leq 1/1200$ and thus $\mathcal{L}^{t}_{\mathcal{D}}(f) \leq 1/600$. Finally, Lemma 3 gives us that $\mathcal{L}_{\mathcal{D}}(f) \leq 3\mathcal{L}^{t}_{\mathcal{D}}(f)$ for all $f \in \Delta(\mathcal{H})$. Together we thus have $\mathcal{L}^{t}_{\mathcal{D}}(f) \leq 1/600$ and hence $\mathcal{L}_{\mathcal{D}}(f) \leq 1/200$ for any $m \geq \alpha t d = \alpha' d\gamma^{-2}$, where $\alpha' > 0$ is a universal constant. Observing that $\alpha\exp(-m/\alpha) \leq \delta$ for $m \geq \alpha\ln(\alpha/\delta)$ completes the proof of Theorem 4. What remains is to prove Lemma 5, which we do in the remainder of this section.

### 2.1 Relating $\mathcal{L}^{t}_{S}(f)$ and $\mathcal{L}^{t}_{\mathcal{D}}(f)$

The last remaining step towards Theorem 4 is thus to relate $\mathcal{L}^{t}_{S}(f)$ to $\mathcal{L}^{t}_{\mathcal{D}}(f)$, i.e. to prove Lemma 5. In the proof, we rely on the classic approach for showing generalization for classes $\mathcal{H}$ of bounded VC-dimension and introduce a ghost set that exists only for the sake of analysis. In addition to the sample $S$, we thus consider a ghost set $S'$ of another $m$ i.i.d. samples from $\mathcal{D}$. This allows us to prove:

**Lemma 6.** For $m \geq 2400^{2}$ and any $t$, it holds that
$$\Pr_{S}\left[\sup_{f\in\Delta(\mathcal{H})} |\mathcal{L}^{t}_{S}(f) - \mathcal{L}^{t}_{\mathcal{D}}(f)| > \frac{1}{1200}\right] \leq 2\Pr_{S,S'}\left[\sup_{f\in\Delta(\mathcal{H})} |\mathcal{L}^{t}_{S}(f) - \mathcal{L}^{t}_{S'}(f)| > \frac{1}{2400}\right].$$

As the proof is standard, it can be found in the supplementary material. We thus only need to bound $\Pr_{S,S'}[\sup_{f\in\Delta(\mathcal{H})} |\mathcal{L}^{t}_{S}(f) - \mathcal{L}^{t}_{S'}(f)| > 1/2400]$. To do this, consider drawing a data set $P$ of $2m$ i.i.d. samples from $\mathcal{D}$, then drawing $S$ as a set of $m$ uniform samples from $P$ without replacement, and letting $S'$ be the remaining samples. Then $S$ and $S'$ have the same distribution as if they were drawn as two independent sets of $m$ i.i.d. samples each. From here on, we thus think of $S$ and $S'$ as being sampled via $P$. Now consider a fixed set $P$ in the support of $\mathcal{D}^{2m}$ and define $\Delta^{\mu}_{\delta}(\mathcal{H}, P)$ as the set of voting classifiers $f \in \Delta(\mathcal{H})$ for which $\Pr_{(x,y)\sim P}[|f(x)| \geq \mu] \geq 1 - \delta$. These are the voting classifiers that make predictions of large absolute value on most of both $S$ and $S'$ (if $\delta \leq 1/2$).
The crucial point, and the whole reason for introducing $g$, is that regardless of what $f$ is, a random $g \sim \mathcal{D}_{f,t}$ often lies in the set $\Delta^{\mu}_{\delta}(\mathcal{H}, P)$:

**Lemma 7.** For any data set $P$, parameters $0 < \delta < 1$ and $t$, and every $\mu \leq \delta/(9600\sqrt{t})$, we have $\Pr_{g\sim\mathcal{D}_{f,t}}[g \notin \Delta^{\mu}_{\delta}(\mathcal{H}, P)] \leq 1/4800$.

Proof. Define an indicator $X_i$ for each $(x_i, y_i) \in P$ taking the value $1$ if $|g(x_i)| \leq \mu$. By Lemma 2, we have $\mathbb{E}[\sum_i X_i] \leq |P| \cdot 2\mu\sqrt{t} \leq |P|\delta/4800$. By Markov's inequality, $\Pr[\sum_i X_i \geq \delta|P|] \leq 1/4800$.

Had we considered just $f$, we would have no way of arguing that $f$ makes predictions of large absolute value on $S'$, since the only promise we are given is that it does so on $S$. That $g$ makes predictions of large absolute value even outside of $S$ is crucial for bounding the generalization error in the following. Let us now define $\hat{\Delta}^{\mu}_{\delta}(P) = \operatorname{sign}(\Delta^{\mu}_{\delta}(\mathcal{H}, P))$, i.e. $\hat{\Delta}^{\mu}_{\delta}(P)$ contains all the hypotheses obtained from voting classifiers in $\Delta^{\mu}_{\delta}(\mathcal{H}, P)$ by taking the sign. Since $g$ lies in $\Delta^{\mu}_{\delta}(\mathcal{H}, P)$ except with probability $1/4800$ by Lemma 7, we can prove:

**Lemma 8.** For any $0 < \delta < 1$, every $t$, and every $\mu \leq \delta/(9600\sqrt{t})$, we have
$$\Pr_{S,S'}\left[\sup_{f\in\Delta(\mathcal{H})} |\mathcal{L}^{t}_{S}(f) - \mathcal{L}^{t}_{S'}(f)| > \frac{1}{2400}\right] \leq \sup_{P}\, 2\,|\hat{\Delta}^{\mu}_{\delta}(P)|\exp(-2m/9600^{2}).$$

Again, the proof of the lemma can be found in the supplementary material. What Lemma 8 gives us is a bound on the generalization error in terms of the growth function $|\hat{\Delta}^{\mu}_{\delta}(P)|$. The key point is that $\hat{\Delta}^{\mu}_{\delta}(P)$ was obtained from voting classifiers with predictions of large absolute value on all but a $\delta$-fraction of the points in $P$. This implies that we can bound the VC-dimension of $\hat{\Delta}^{\mu}_{\delta}(P)$, when restricted to the point set $P$, using Rademacher complexity:

**Lemma 9.** Let $\mathcal{H}$ be a hypothesis set of VC-dimension $d$. For any $\delta, \mu > 0$ and point set $P$, the largest subset $P'$ of $P$ that $\hat{\Delta}^{\mu}_{\delta}(P) = \operatorname{sign}(\Delta^{\mu}_{\delta}(\mathcal{H}, P))$ can shatter has size at most $|P'| = d' < \max\{2\delta|P|, 4\alpha^{2}\mu^{-2}d\}$, where $\alpha > 0$ is a universal constant.

Proof. Recall that the VC-dimension of $\mathcal{H}$ is $d$. Thus the Rademacher complexity of $\mathcal{H}$ on any point set $P'$ satisfies
$$\mathbb{E}_{\sigma\sim\{-1,1\}^{P'}}\left[\frac{1}{|P'|}\sup_{h\in\mathcal{H}}\sum_{x\in P'} h(x)\sigma(x)\right] \leq \alpha\sqrt{d/|P'|}$$
for a universal constant $\alpha > 0$ (see e.g. [23]). Assume $P' \subseteq P$ with $|P'| = d'$ can be shattered. Fix any labeling $\sigma \in \{-1,1\}^{P'}$ and let $h_\sigma \in \hat{\Delta}^{\mu}_{\delta}(P)$ be a hypothesis generating the dichotomy $\sigma$ (which exists since $P'$ is shattered). Since $h_\sigma \in \hat{\Delta}^{\mu}_{\delta}(P)$, there must be some $g \in \Delta^{\mu}_{\delta}(\mathcal{H}, P)$ such that $h_\sigma = \operatorname{sign}(g)$ on the point set $P'$. If $|P'| \geq 2\delta|P|$, then by the definition of $\Delta^{\mu}_{\delta}(\mathcal{H}, P)$, there are at least $|P'| - \delta|P| \geq |P'|/2$ points $x \in P'$ for which $|g(x)| \geq \mu$. This means that $(1/|P'|)\sum_{x\in P'} g(x)\sigma(x) \geq \mu/2$. But $g$ is a convex combination of hypotheses from $\mathcal{H}$, hence there is also a hypothesis $h \in \mathcal{H}$ with $(1/|P'|)\sum_{x\in P'} h(x)\sigma(x) \geq \mu/2$. Since this holds for all $\sigma$, the bound on the Rademacher complexity gives $\alpha\sqrt{d/|P'|} \geq \mu/2$, i.e. $|P'| < 4\alpha^{2}\mu^{-2}d$. We thus conclude that the largest set that $\hat{\Delta}^{\mu}_{\delta}(P)$ can shatter has size less than $\max\{2\delta|P|, 4\alpha^{2}\mu^{-2}d\}$.

We remark that it was crucial to introduce the random hypothesis $g$, since all we are promised about the original voting classifier $f$ is that it has large margins on $S$, i.e. on only half the points of $P$. That case would correspond to $\delta = 1/2$ in Lemma 9 and would mean that we could potentially shatter all of $P$. In order for the bound to be useful, we thus need $\delta \ll 1/2$, and thus large margins on much more than half of $P$ (which we get by using $g$).
For a $0 < \delta < 1$ to be determined, let us now fix $\mu = \delta/(9600\sqrt{t})$ and assume that the number of samples $m$ satisfies $m \geq \max\{\alpha^{2}\mu^{-2}d/\delta, 2400^{2}\}$, where $\alpha$ is the constant from Lemma 9. By Lemma 8, we have
$$\Pr_{S,S'}\left[\sup_{f\in\Delta(\mathcal{H})} |\mathcal{L}^{t}_{S}(f) - \mathcal{L}^{t}_{S'}(f)| > \frac{1}{2400}\right] \leq \sup_{P}\, 2\,|\hat{\Delta}^{\mu}_{\delta}(P)|\exp(-2m/9600^{2}).$$

Lemma 9 gives us that the largest subset $P' \subseteq P$ that $\hat{\Delta}^{\mu}_{\delta}(P)$ shatters has size at most $d' < \max\{2\delta|P|, 4\alpha^{2}\mu^{-2}d\}$. By our assumption on $m$, the term $2\delta|P| = 4\delta m$ is at least $4\alpha^{2}\mu^{-2}d$, and thus $2\delta|P| = 4\delta m$ attains the maximum in the bound on $d'$. By the Sauer-Shelah lemma, we have $|\hat{\Delta}^{\mu}_{\delta}(P)| \leq \sum_{i=0}^{4\delta m - 1}\binom{2m}{i}$. For $\delta \leq 1/4$, this is at most $(e\delta^{-1}/2)^{4\delta m} = \exp(4\delta m\ln(e\delta^{-1}/2))$. In conclusion, we have
$$\Pr_{S,S'}\left[\sup_{f\in\Delta(\mathcal{H})} |\mathcal{L}^{t}_{S}(f) - \mathcal{L}^{t}_{S'}(f)| > \frac{1}{2400}\right] \leq 2\exp(4\delta m\ln(e\delta^{-1}/2))\exp(-2m/9600^{2}).$$

Let us now fix $\delta = 10^{-10}$. We then have
$$2\exp(4\delta m\ln(e\delta^{-1}/2))\exp(-2m/9600^{2}) = 2\exp\big(m\,(4\delta\ln(e\delta^{-1}/2) - 2/9600^{2})\big) \leq 2\exp(-m/10^{8}),$$
where the last step is a numerical calculation. By Lemma 6, this in turn implies
$$\Pr_{S}\left[\sup_{f\in\Delta(\mathcal{H})} |\mathcal{L}^{t}_{S}(f) - \mathcal{L}^{t}_{\mathcal{D}}(f)| > \frac{1}{1200}\right] \leq 4\exp(-m/10^{8}).$$
Since we only required $m \geq \max\{\alpha^{2}\mu^{-2}d/\delta, 2400^{2}\}$ and we had $\mu = \delta/(9600\sqrt{t})$, this is satisfied for $m \geq \alpha' t d$ for a large enough constant $\alpha' > 0$. This completes the proof of Lemma 5 and thus also finishes the proof of Theorem 4.

## 3 Weak to strong learning

In this section, we give our algorithm for obtaining a strong learner from a $\gamma$-weak learner with optimal sample complexity and sketch the main proof idea. The algorithm obtaining the guarantees of Theorem 1 is as follows. Let $\mathcal{A}_\nu$ be an algorithm that on a sample $S$ outputs a classifier $g = \operatorname{sign}(f)$, where $f$ is a voting classifier with margins at least $\gamma/2$ on all samples in $S$, such as AdaBoost$_\nu$ [20]. Given a set $S$ of $m$ i.i.d. samples from an unknown distribution $\mathcal{D}$, we run $\mathcal{A}_\nu$ on a number of sub-samples $C_1, C_2, \dots, C_k \subseteq S$, obtaining hypotheses $h_1, h_2, \dots, h_k$. We then return the (unweighted) majority vote among $h_1, \dots, h_k$ as our final hypothesis $h'$. The subsets $C_i$ are chosen by the algorithm Sub-Sample (Algorithm 1), as in the optimal PAC learning algorithm by Hanneke [13]. The final algorithm (Algorithm 2) calls $\mathcal{A}_\nu$ on all subsets returned by Algorithm 1 and returns the majority vote.

Algorithm 1: Sub-Sample($A$, $B$) (Hanneke [13])
Input: Sets $A$ and $B$.
1: if $|A| \leq 3$ then return $\{A \cup B\}$ // stop when $A$ is too small to recurse
2: Let $A_0$ denote the first $|A| - 3\lfloor|A|/4\rfloor$ elements of $A$, $A_1$ the next $\lfloor|A|/4\rfloor$ elements, $A_2$ the next $\lfloor|A|/4\rfloor$ elements, and $A_3$ the remaining $\lfloor|A|/4\rfloor$ elements. // split $A$ evenly
3: return Sub-Sample($A_0$, $A_2 \cup A_3 \cup B$) $\cup$ Sub-Sample($A_0$, $A_1 \cup A_3 \cup B$) $\cup$ Sub-Sample($A_0$, $A_1 \cup A_2 \cup B$) // recurse in leave-one-out fashion

Algorithm 2: Optimal weak-to-strong learner
Input: Set $S$ of $m$ samples.
1: $\{C_1, \dots, C_k\}$ = Sub-Sample($S$, $\emptyset$) // create highly overlapping sub-samples of $S$
2: for $i = 1, \dots, k$ do $h_i = \mathcal{A}_\nu(C_i)$ // run AdaBoost$_\nu$ on each sub-sample
3: return $h(x) = \operatorname{sign}\big(\sum_{i=1}^{k} h_i(x)\big)$ // return the unweighted majority vote

Note that the final hypothesis returned by Algorithm 2 is a majority of majorities, since $\mathcal{A}_\nu$ already returns a voting classifier. Let us also remark that in Definition 1 we allow a failure probability $\delta_0 > 0$, while the analysis of AdaBoost$_\nu$ assumes $\delta_0 = 0$, i.e. that the weak learner always achieves an advantage of at least $\gamma$. If one knows $\gamma$ in advance, this is not an issue: AdaBoost$_\nu$ only calls the weak learner on distributions over the training data $S$, and one can thus compute the advantage directly from the training data. After $1/(1 - \delta_0)$ invocations of the weak learner in expectation, we thus obtain a hypothesis with advantage $\gamma$. In total, there are $k = 3^{\log_4(m)} \leq m^{0.79}$ calls to the weak learner, each on a sub-sample of linear size. Since AdaBoost$_\nu$ runs in polynomial time given a polynomial-time weak learner, Algorithm 2 is polynomial under the same condition.
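For concreteness, Algorithms 1 and 2 transcribe directly into code. The following sketch is ours, with `train_voter` standing in for $\mathcal{A}_\nu$ (instantiated with AdaBoost$_\nu$ in the paper); it also verifies the count $k = 3^{\log_4 m} \leq m^{0.79}$ used above.

```python
import numpy as np

def sub_sample(A, B):
    """Algorithm 1 (Hanneke): build highly overlapping sub-samples of A u B.
    A, B: lists of training examples; returns a list of sub-samples."""
    if len(A) <= 3:                       # too small to recurse
        return [A + B]
    q = len(A) // 4
    A0, A1, A2, A3 = A[:len(A) - 3*q], A[-3*q:-2*q], A[-2*q:-q], A[-q:]
    return (sub_sample(A0, A2 + A3 + B)   # each call leaves out one quarter
          + sub_sample(A0, A1 + A3 + B)
          + sub_sample(A0, A1 + A2 + B))

def optimal_weak_to_strong(S, train_voter):
    """Algorithm 2: train a large-margin voter on every sub-sample and
    return the unweighted majority vote of the resulting classifiers."""
    voters = [train_voter(C) for C in sub_sample(list(S), [])]
    return lambda x: np.sign(sum(h(x) for h in voters))

# Number of sub-samples: k = 3^{log_4 m} <= m^{0.79} (log_4 3 ~ 0.792).
for m in (64, 256, 1024):
    k = len(sub_sample(list(range(m)), []))
    print(m, k, round(m ** (np.log(3) / np.log(4)), 1))
```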
The formal proof that Algorithm 2 has the guarantees of Theorem 1 is given in the supplementary material due to space constraints. It follows the proof of Hanneke [13] largely unchanged, although carefully using that a generalization error of $1/200$ suffices for each call to $\mathcal{A}_\nu$. The key observation is that each of the recursively generated sub-samples in Algorithm 1 leaves out a subset $A_i$ of the training data, whereas the two other recursive calls always include all of $A_i$ in their sub-samples. If one considers a hypothesis $h$ trained on the data leaving out $A_i$, then $A_i$ serves as an independent sample from $\mathcal{D}$. This implies that if $h$ has a large error probability over $\mathcal{D}$, then many of the samples in $A_i$ will be classified incorrectly by $h$. Now, since the two other recursive calls always include $A_i$, any hypothesis $h'$ trained on a sub-sample from those calls will have margin at least $\gamma/2$ on all points in $A_i$ misclassified by $h$. But the generalization bound in Theorem 4 then implies that $h'$ errs only with probability $1/200$ on the conditional distribution $\mathcal{D}(\cdot \mid h \text{ errs})$. Thus, the probability that both err at the same time is at most the probability that $h$ errs times $1/200$. Applying this reasoning inductively shows that it is very unlikely that a majority of all trained hypotheses err at the same time, which finishes the proof.

## 4 Lower bound

In this section, we sketch the proof of Theorem 2, which gives a lower bound matching the sample complexity from Theorem 1. The full proof is given in the supplementary material. Informally, Theorem 2 says that there exists a weakly learnable concept class $\mathcal{C}$ such that the hypothesis $\mathcal{A}(S)$ of any learning algorithm $\mathcal{A}$ satisfies
$$\mathcal{L}_{\mathcal{D}}(\mathcal{A}(S)) \geq \alpha\left(\frac{d}{\gamma^{2}m} + \frac{\ln(1/\delta)}{m}\right).$$
Here, the term $\ln(1/\delta)/m$ follows from previous work: in particular, we could let $\mathcal{C} = \mathcal{H}$ and invoke the tight lower bounds for PAC-learning in the realizable setting [5]. Thus, we let $\delta = 1/3$ and only prove that the loss of $\mathcal{A}(S)$ is at least $\alpha d/(\gamma^{2}m)$ with probability $1/3$ over $S$ when $|S| = m$, for some weakly learnable concept class $\mathcal{C}$.

The proof uses a construction from GrΓΈnlund et al. [9] to obtain a hypothesis set $\mathcal{H}$ over a domain $\mathcal{X} = \{x_1, \dots, x_u\}$ of cardinality $u = \alpha d\gamma^{-2}$ such that a constant fraction of all concepts in $\mathcal{X} \to \{-1,1\}$ can be $\gamma$-weakly learned from $\mathcal{H}$. We then create a distribution $\mathcal{D}$ in which the first point $x_1$ is sampled with probability $1 - u/(4m)$, and with the remaining probability we receive a uniform sample among $x_2, \dots, x_u$. The key point is that we only expect to see at most $1 + m \cdot u/(4m) = 1 + u/4$ distinct points from $\mathcal{X}$ in a sample $S$ of cardinality $m$. Thus, if we consider a random concept that can be $\gamma$-weakly learned, the labels it assigns to points not in the sample are almost uniformly random and independent. This in turn implies that the best any algorithm $\mathcal{A}$ can do is to guess the labels of the points in $\mathcal{X} \setminus S$. In that way, $\mathcal{A}$ fails with constant probability whenever it receives a sample other than $x_1$, which happens with probability $u/(4m) = \alpha d\gamma^{-2}/(4m)$, and the lower bound follows.
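The counting at the heart of this argument is easy to see numerically. This small simulation (ours, with arbitrary toy parameters) draws $m$ samples from the lower-bound distribution and confirms that most of $x_2, \dots, x_u$ is never observed, so their labels must be guessed:

```python
import numpy as np

rng = np.random.default_rng(0)
m, u = 10_000, 400                       # u plays the role of alpha * d / gamma^2
p = np.full(u, (u / (4 * m)) / (u - 1))  # x_2,...,x_u share total mass u/(4m)
p[0] = 1 - u / (4 * m)                   # heavy point x_1
sample = rng.choice(u, size=m, p=p)
print("distinct points seen:", len(np.unique(sample)), "of", u)
# Expect at most about 1 + u/4 = 101 distinct points: ~3/4 of X stays unseen.
```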
To carry out the intuitive argument above formally, we first argue that for a random concept $c \in \mathcal{C}$, the Shannon entropy of $c$ is high, even conditioned on $S$ and the labels $c(S)$. Secondly, we argue that if $\mathcal{A}(S)$ has a small error probability under $\mathcal{D}$, then the hypothesis $\mathcal{A}(S)$ must reveal a lot of information about $c$, i.e. the entropy of $c$ is small conditioned on $\mathcal{A}(S)$. Since $\mathcal{A}(S)$ is a function of $S$ and $c(S)$, the same holds when conditioning on $S$ and $c(S)$. This contradicts that $c$ has high entropy, and we conclude that $\mathcal{A}(S)$ cannot have a small error probability.

## 5 Conclusion

Overall, we presented a new weak-to-strong learner with a sample complexity that removes two logarithmic factors from the best previously known bound. By accompanying the algorithm with a matching lower bound for all $d$ and $2^{-d} < \gamma < 1/80$, we showed that the sample complexity achieved by our algorithm is indeed optimal. Our algorithm uses the same sub-sampling technique as Hanneke [13] and computes a voting classifier with large margins on each sub-sample, for example with AdaBoost$_\nu$ [20]. The analysis of our algorithm uses a new generalization bound for voting classifiers with large margins.

Although we determined the exact sample complexity of weak-to-strong learning (up to multiplicative constants), a few connected open problems remain. Currently, our construction feeds $3^{\log_4(m)} \leq m^{0.79}$ sub-samples of linear size to AdaBoost$_\nu$. For very large datasets, it would be desirable to reduce the number and size of these calls. We conjecture that the most promising way to do so is to revisit Hanneke's optimal PAC learner and improve the sub-sampling strategy there. This could lead to an improvement for the realizable case as well as to faster weak-to-strong learners.

Next, the output of our algorithm is a majority vote over majority voters. It is unclear whether a single voter could achieve the same bounds. We believe that a majority of majorities is actually necessary. This is supported by a lower bound showing that there are voters with large margins and poor generalization (paying a logarithmic factor), so the learning algorithm has to avoid these bad voters. We currently see no indication of how a variant of AdaBoost could do that.

For the regime $\gamma < 2^{-d}$, which our lower bound does not capture, is it possible to use fewer samples? A recent result by Alon et al. [1] might suggest so. Concretely, they show that if a concept class $\mathcal{C}$ can be $\gamma$-weakly learned from a base hypothesis set $\mathcal{H}$ of VC-dimension $d$, then the VC-dimension of $\mathcal{C}$ is no more than $O_d(\gamma^{-2+2/(d+1)})$, where $O_d(\cdot)$ hides factors depending only on $d$. Interestingly, the factor $\gamma^{2/(d+1)}$ becomes non-trivial precisely when our lower bound stops applying, i.e. when $\gamma < 2^{-d}$. This could hint at a possibly better dependency on $\gamma$ for $\gamma < 2^{-d}$.

Finally, we have a new generalization bound for large-margin classifiers that is better than the $k$-th margin bound (Gao and Zhou [8]) for constant error. Can the $k$-th margin bound be improved in general, perhaps by one logarithmic factor? One of our key new ideas is the application of the Littlewood-Offord lemma, which might also be helpful in the more general case of non-constant error.
## Acknowledgments and Disclosure of Funding

This work was supported by the Digital Research Centre Denmark (DIREC) and by the Independent Research Fund Denmark (DFF) Sapere Aude Research Leader grant No 9064-00068B.

## References

[1] Noga Alon, Alon Gonen, Elad Hazan, and Shay Moran. 2021. Boosting Simple Learners. In STOC '21: 53rd Annual ACM SIGACT Symposium on Theory of Computing, Virtual Event, Italy, June 21-25, 2021. ACM, 481–489.
[2] Peter Bartlett, Yoav Freund, Wee Sun Lee, and Robert E. Schapire. 1998. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics 26, 5 (1998), 1651–1686.
[3] Kristin P. Bennett, Ayhan Demiriz, and John Shawe-Taylor. 2000. A column generation algorithm for boosting. In ICML. Citeseer, 65–72.
[4] Leo Breiman. 1999. Prediction games and arcing algorithms. Neural Computation 11, 7 (1999), 1493–1517.
[5] Andrzej Ehrenfeucht, David Haussler, Michael Kearns, and Leslie Valiant. 1989. A general lower bound on the number of examples needed for learning. Information and Computation 82, 3 (1989), 247–261.
[6] Paul ErdΕ‘s. 1945. On a lemma of Littlewood and Offord. Bull. Amer. Math. Soc. 51, 12 (1945), 898–902.
[7] Yoav Freund and Robert E. Schapire. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55, 1 (1997), 119–139.
[8] Wei Gao and Zhi-Hua Zhou. 2013. On the doubt about margin explanation of boosting. Artificial Intelligence 203 (2013), 1–18.
[9] Allan GrΓΈnlund, Lior Kamma, Kasper Green Larsen, Alexander Mathiasen, and Jelani Nelson. 2019. Margin-based generalization lower bounds for boosted classifiers. Advances in Neural Information Processing Systems 32 (2019).
[10] Allan GrΓΈnlund, Lior Kamma, and Kasper Green Larsen. 2020. Margins are Insufficient for Explaining Gradient Boosting. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
[11] Allan GrΓΈnlund, Lior Kamma, and Kasper Green Larsen. 2020. Near-Tight Margin-Based Generalization Bounds for Support Vector Machines. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020 (Proceedings of Machine Learning Research, Vol. 119). PMLR, 3779–3788.
[12] Adam J. Grove and Dale Schuurmans. 1998. Boosting in the limit: Maximizing the margin of learned ensembles. In AAAI/IAAI. 692–699.
[13] Steve Hanneke. 2016. The optimal sample complexity of PAC learning. The Journal of Machine Learning Research 17, 1 (2016), 1319–1333.
[14] Steve Hanneke and Aryeh Kontorovich. 2021. Stable Sample Compression Schemes: New Applications and an Optimal SVM Margin Bound. In Proceedings of the 32nd International Conference on Algorithmic Learning Theory (Proceedings of Machine Learning Research, Vol. 132). PMLR, 697–721.
[15] Michael Kearns. 1988. Learning Boolean formulae or finite automata is as hard as factoring. Technical Report TR-14-88, Harvard University Aiken Computation Laboratory (1988).
[16] Michael Kearns and Leslie Valiant. 1994. Cryptographic limitations on learning boolean formulae and finite automata. Journal of the ACM (JACM) 41, 1 (1994), 67–95.
[17] Alexander Mathiasen, Kasper Green Larsen, and Allan GrΓΈnlund. 2019. Optimal minimal margin maximization with boosting. In International Conference on Machine Learning. PMLR, 4392–4401.
[18] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. 2018. Foundations of Machine Learning. MIT Press.
[19] Gunnar RΓ€tsch and Manfred K. Warmuth. 2002. Maximizing the margin with boosting.
In International Conference on Computational Learning Theory. Springer, 334–350.
[20] Gunnar RΓ€tsch, Manfred K. Warmuth, and John Shawe-Taylor. 2005. Efficient Margin Maximizing with Boosting. Journal of Machine Learning Research 6, 12 (2005).
[21] Robert E. Schapire. 1990. The strength of weak learnability. Machine Learning 5, 2 (1990), 197–227.
[22] Hans U. Simon. 2015. An Almost Optimal PAC Algorithm. In Proceedings of The 28th Conference on Learning Theory (Proceedings of Machine Learning Research, Vol. 40). PMLR, Paris, France, 1552–1563.
[23] Jon Wellner et al. 2013. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Science & Business Media.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] The main limitations are the points we stated in the open questions paragraph of the conclusion.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A] Our work is purely theoretical (set within learning theory) and is thus not expected to impact society outside of the scientific community.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes] Due to space constraints, a large part of the proofs had to be moved to the supplementary material. We still managed to present the largest part of the proof of our new generalization bound for classifiers with large margins within the main body of the paper.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]