Optimal Parallelization of Boosting

Arthur da Cunha (Department of Computer Science, Aarhus University, dac@cs.au.dk), Mikael Møller Høgsgaard (Department of Computer Science, Aarhus University, hogsgaard@cs.au.dk), Kasper Green Larsen (Department of Computer Science, Aarhus University, larsen@cs.au.dk)

Abstract

Recent works on the parallel complexity of Boosting have established strong lower bounds on the tradeoff between the number of training rounds $p$ and the total parallel work per round $t$. These works have also presented highly non-trivial parallel algorithms that shed light on different regions of this tradeoff. Despite these advancements, a significant gap persists between the theoretical lower bounds and the performance of these algorithms across much of the tradeoff space. In this work, we essentially close this gap by providing both improved lower bounds on the parallel complexity of weak-to-strong learners, and a parallel Boosting algorithm whose performance matches these bounds across the entire $p$ vs. $t$ compromise spectrum, up to logarithmic factors. Ultimately, this work settles the parallel complexity of Boosting algorithms that are nearly sample-optimal.

1 Introduction

Boosting is an extremely powerful and elegant idea that allows one to combine multiple inaccurate classifiers into a highly accurate voting classifier. Algorithms such as AdaBoost [Freund and Schapire, 1997] work by iteratively running a base learning algorithm on reweighted versions of the training data to produce a sequence of classifiers $h_1, \dots, h_p$. After obtaining $h_i$, the weighting of the training data is updated to put larger weights on samples misclassified by $h_i$, and smaller weights on samples classified correctly. This effectively forces the next training iteration to focus on points with which the previous classifiers struggle. After sufficiently many rounds, the classifiers $h_1, \dots, h_p$ are finally combined by taking a (weighted) majority vote among their predictions.

Many Boosting algorithms have been developed over the years, for example Grove and Schuurmans [1998], Rätsch et al. [2005], Servedio [2003], Friedman [2001], with modern Gradient Boosting [Friedman, 2001] algorithms like XGBoost [Chen and Guestrin, 2016] and LightGBM [Ke et al., 2017] often achieving state-of-the-art performance on learning tasks while requiring little to no data cleaning. See e.g. the excellent survey by Natekin and Knoll [2013] for more background on Boosting.

While Boosting enjoys many advantages, it does have one severe drawback, also highlighted in Natekin and Knoll [2013]: Boosting is completely sequential, as each of the consecutive training steps requires the output of previous steps to determine the reweighted learning problem. This property is shared by all Boosting algorithms and prohibits the use of computationally heavy training by the base learning algorithm in each iteration. For instance, Gradient Boosting algorithms often require hundreds to thousands of iterations to achieve the best accuracy. The crucial point is that even if you have access to thousands of machines for training, there is no way to parallelize the steps of Boosting and distribute the work among the machines (at least beyond the parallelization possible for the base learner). In effect, the training time of the base learning algorithm is directly multiplied by the number of steps of Boosting.
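To make this sequential bottleneck concrete, here is a minimal Python sketch of the classic AdaBoost loop (illustrative only: the `weak_learner` callable and its interface are hypothetical stand-ins, assumed to return a hypothesis that maps a sample matrix to $\pm 1$ predictions):

```python
import numpy as np

def adaboost(X, y, weak_learner, p):
    """Minimal AdaBoost sketch: round i+1 reads the weighting written by
    round i, so the p (possibly expensive) weak-learner calls cannot be
    issued in parallel."""
    m = len(y)
    D = np.full(m, 1.0 / m)            # current weighting of the training data
    hypotheses, alphas = [], []
    for _ in range(p):                 # inherently sequential loop
        h = weak_learner(X, y, D)      # train on the *current* reweighting
        err = np.sum(D * (h(X) != y))  # weighted training error of h
        alpha = 0.5 * np.log((1 - err) / err)   # assumes 0 < err < 1/2
        D = D * np.exp(-alpha * y * h(X))       # upweight mistakes,
        D = D / D.sum()                         # downweight hits, renormalize
        hypotheses.append(h)
        alphas.append(alpha)
    # Weighted majority vote over h_1, ..., h_p.
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, hypotheses)))
```

The dependence of each iteration on the distribution produced by the previous one is exactly the bottleneck that the parallel algorithms discussed below must work around.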
Multiple recent works [Long and Servedio, 2013, Karbasi and Larsen, 2024, Lyu et al., 2024] have studied parallelization of Boosting from a theoretical point of view, aiming for an understanding of the inherent tradeoffs between the number of training rounds $p$ and the total parallel work per round $t$. These works include both strong lower bounds on the cost of parallelization and highly non-trivial parallel Boosting algorithms with provable guarantees on accuracy. Previous studies however leave a significant gap between the performance of the parallel algorithms and the proven lower bounds. The main contribution of this work is to close this gap by both developing a parallel algorithm with a better tradeoff between $p$ and $t$, as well as proving a stronger lower bound on this tradeoff. To formally state our improved results and compare them to previous works, we first introduce the theoretical framework under which parallel Boosting is studied.

Weak-to-Strong Learning. Following the previous works Karbasi and Larsen [2024], Lyu et al. [2024], we study parallel Boosting in the theoretical setup of weak-to-strong learning. Weak-to-strong learning was introduced by Kearns [1988], Kearns and Valiant [1994] and has inspired the development of the first Boosting algorithms [Schapire, 1990]. In this framework, we consider binary classification over an input domain $X$ with an unknown target concept $c: X \to \{-1, 1\}$ assigning labels to samples. A $\gamma$-weak learner for $c$ is then a learning algorithm $W$ that for any distribution $D$ over $X$, when given at least some constant $m_0$ i.i.d. samples from $D$, produces with constant probability a hypothesis $h$ with $L_D(h) \le 1/2 - \gamma$. Here $L_D(h) = \Pr_{x \sim D}[h(x) \neq c(x)]$. The goal in weak-to-strong learning is then to boost the accuracy of $W$ by invoking it multiple times. Concretely, the aim is to produce a strong learner: a learning algorithm that for any distribution $D$ over $X$ and any $0 < \delta, \varepsilon < 1$, when given $m(\varepsilon, \delta)$ i.i.d. samples from $D$, produces with probability at least $1 - \delta$ a hypothesis $h: X \to \{-1, 1\}$ such that $L_D(h) \le \varepsilon$. We refer to $m(\varepsilon, \delta)$ as the sample complexity of the weak-to-strong learner.

Weak-to-strong learning has been extensively studied over the years, with many proposed algorithms, among which AdaBoost [Freund and Schapire, 1997] is perhaps the most famous. If $H$ denotes a hypothesis set such that $W$ always produces hypotheses from $H$, and if $d$ denotes the VC-dimension of $H$, then in terms of sample complexity, AdaBoost is known to produce a strong learner with sample complexity $m_{\mathrm{Ada}}(\varepsilon, \delta)$ satisfying

$m_{\mathrm{Ada}}(\varepsilon, \delta) = \tilde{O}\!\left(\frac{d}{\gamma^2 \varepsilon} + \frac{\ln(1/\delta)}{\varepsilon}\right),$ (1)

where $\tilde{O}$ hides logarithmic factors. This can be proved by observing that after $t = O(\gamma^{-2} \ln m)$ iterations, AdaBoost produces a voting classifier $f(x) = \mathrm{sign}(\sum_{i=1}^t \alpha_i h_i(x))$ with all margins on the training data being $\Omega(\gamma)$. The sample complexity bound then follows by invoking the best known generalization bounds for large margin voting classifiers [Breiman, 1999, Gao and Zhou, 2013]. Here the margin of the voting classifier $f$ on a training sample $(x, c(x))$ is defined as $c(x) \sum_{i=1}^t \alpha_i h_i(x) / \sum_{i=1}^t |\alpha_i|$. This sample complexity comes within logarithmic factors of the optimal sample complexity $m_{\mathrm{OPT}}(\varepsilon, \delta) = \Theta(d/(\gamma^2 \varepsilon) + \ln(1/\delta)/\varepsilon)$ obtained e.g. in Larsen and Ritzert [2022].

Parallel Weak-to-Strong Learning. The recent work by Karbasi and Larsen [2024] formalized parallel Boosting in the above weak-to-strong learning setup.
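As a concrete rendering of the margin quantity just defined, the following small sketch computes all training margins of a voting classifier (an illustration only; it assumes the hypothesis predictions are available as a $t \times m$ matrix of $\pm 1$ entries):

```python
import numpy as np

def margins(alpha, H, y):
    """Margins of f(x) = sign(sum_i alpha_i h_i(x)) on the training set.

    alpha : shape (t,), voting weights alpha_1, ..., alpha_t
    H     : shape (t, m), H[i, j] = h_i(x_j) in {-1, +1}
    y     : shape (m,),  labels c(x_j) in {-1, +1}
    Returns margin_j = y_j * sum_i alpha_i H[i, j] / sum_i |alpha_i|.
    """
    return y * (alpha @ H) / np.abs(alpha).sum()

# AdaBoost's guarantee in this language: after O(gamma^-2 ln m) rounds,
# margins(alpha, H, y).min() is Omega(gamma).
```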
Observing that all training happens in the weak learner, Karbasi and Larsen [2024] proposed the following definition of parallel Boosting: a weak-to-strong learning algorithm has parallel complexity $(p, t)$ if for $p$ consecutive rounds it queries the weak learner with $t$ distributions. In each round $i$, if $D^i_1, \dots, D^i_t$ denotes the distributions queried, the weak learner returns $t$ hypotheses $h^i_1, \dots, h^i_t \in H$ such that $L_{D^i_j}(h^i_j) \le 1/2 - \gamma$ for all $j$. At the end of the $p$ rounds, the weak-to-strong learner outputs a hypothesis $f: X \to \{-1, 1\}$. The queries made in each round and the final hypothesis $f$ must be computable from the training data as well as all hypotheses $h^i_j$ seen in previous rounds. The motivation for the above definition is that we could let one machine/thread handle each of the $t$ parallel query distributions in a round.

Since parallel weak-to-strong learning is trivial if we make no requirements on $L_D(f)$ for the output $f: X \to \{-1, 1\}$ (simply output $f(x) = 1$ for all $x \in X$), we from hereon focus on parallel weak-to-strong learners that are near-optimal in terms of the sample complexity and accuracy tradeoff. More formally, from the upper bound side, our goal is to obtain a sample complexity matching at least that of AdaBoost, stated in Eq. (1). That is, rewriting the loss $\varepsilon$ as a function of the number of samples $m$, we aim for output classifiers $f$ satisfying

$L_D(f) = O\!\left(\frac{d \ln(m) \ln(m/d)}{\gamma^2 m} + \frac{\ln(1/\delta)}{m}\right).$

When stating lower bounds in the following, we have simplified the expressions by requiring that the expected loss satisfies $L_D(f) = O(m^{-0.01})$. Note that this is far larger than the upper bounds, except for values of $m$ very close to $\gamma^{-2} d$. This only makes the lower bounds stronger. We remark that all the lower bounds are more general than this, but focusing on $m^{-0.01}$ in this introduction yields the cleanest bounds.

With these definitions, classic AdaBoost and other weak-to-strong learners producing voting classifiers with margins $\Omega(\gamma)$ all have a parallel complexity of $(\Theta(\gamma^{-2} \ln m), 1)$: they all need $\gamma^{-2} \ln m$ rounds to obtain $\Omega(\gamma)$ margins. Karbasi and Larsen [2024] presented the first alternative tradeoff by giving an algorithm with parallel complexity $(1, \exp(O(d \ln(m)/\gamma^2)))$. Subsequent work by Lyu et al. [2024] gave a general tradeoff between $p$ and $t$. When requiring near-optimal accuracy, their tradeoff gives, for any $1 \le R \le 1/(2\gamma)$, a parallel complexity of $(O(\gamma^{-2} \ln(m)/R), \exp(O(d R^2)) \ln(1/\gamma))$. The accuracy of both of these algorithms was proved by arguing that they produce a voting classifier with all margins $\Omega(\gamma)$.

On the lower bound side, Karbasi and Larsen [2024] showed that one of three things must hold: either $p \ge \min\{\Omega(\gamma^{-1} \ln m), \exp(\Omega(d))\}$, or $t \ge \min\{\exp(\Omega(d \gamma^{-2})), \exp(\exp(\Omega(d)))\}$, or $p \ln(tp) = \Omega(d \ln(m) \gamma^{-2})$. Lyu et al. [2024] also presented a lower bound that for some parameters is stronger than that of Karbasi and Larsen [2024], and for some is weaker. Concretely, they show that one of the following two must hold: either $p \ge \min\{\Omega(\gamma^{-2} d), \Omega(\gamma^{-2} \ln m), \exp(\Omega(d))\}$, or $t \ge \exp(\Omega(d))$. Observe that the constraint on $t$ is only single-exponential in $d$, whereas the previous lower bound is double-exponential. On the other hand, the lower bound on $p$ is essentially stronger by a $\gamma^{-1}$ factor. Finally, they also give an alternative lower bound for $p = O(\gamma^{-2})$, essentially yielding $p \ln t = \Omega(\gamma^{-2} d)$.

Even in light of the previous works, it is still unclear what the true complexity of parallel Boosting is. In fact, the upper and lower bounds only match in the single case where $p = \Omega(\gamma^{-2} \ln m)$ and $t = 1$, i.e.
when standard AdaBoost is optimal.

Our Contributions. In this work, we essentially close the gap between the upper and lower bounds for parallel Boosting. From the upper bound side, we show the following general result.

Theorem 1.1. Let $c: X \to \{-1, 1\}$ be an unknown concept, $W$ be a $\gamma$-weak learner for $c$ using a hypothesis set of VC-dimension $d$, $D$ be an arbitrary distribution, and $S \sim D^m$ be a training set of size $m$. For all $R \in \mathbb{N}$, Algorithm 1 yields a weak-to-strong learner $\mathcal{A}_R$ with parallel complexity $(p, t)$ for $p = O(\gamma^{-2} \ln(m)/R)$ and $t = e^{O(dR)} \ln(\ln(m)/(\delta \gamma^2))$ such that, with probability at least $1 - \delta$ over $S$ and the randomness of $\mathcal{A}_R$, it holds that

$L_D(\mathcal{A}_R(S)) = O\!\left(\frac{d \ln(m) \ln(m/d)}{\gamma^2 m} + \frac{\ln(1/\delta)}{m}\right).$

Observe that this is a factor $R$ better than the bound by Lyu et al. [2024] in the exponent of $t$. Furthermore, if we ignore the $\ln(\ln(m)/(\delta \gamma^2))$ factor, it gives the clean tradeoff $p \ln t = O(\gamma^{-2} d \ln m)$ for any $p$ from 1 to $O(\gamma^{-2} \ln m)$.

We complement our new upper bound by an essentially matching lower bound. Here we show that

Theorem 1.2. There is a universal constant $C \ge 1$ for which the following holds. For any $0 < \gamma < 1/C$, any $d \ge C$, any sample size $m \ge C$, and any weak-to-strong learner $\mathcal{A}$ with parallel complexity $(p, t)$, there exists an input domain $X$, a distribution $D$, a concept $c: X \to \{-1, 1\}$, and a $\gamma$-weak learner $W$ for $c$ using a hypothesis set $H$ of VC-dimension $d$ such that if the expected loss of $\mathcal{A}$ over the sample is no more than $m^{-0.01}$, then either $p \ge \min\{\exp(\Omega(d)), \Omega(\gamma^{-2} \ln m)\}$, or $t \ge \exp(\exp(\Omega(d)))$, or $p \ln t = \Omega(\gamma^{-2} d \ln m)$.

Comparing Theorem 1.2 to known upper bounds, we first observe that $p = \Omega(\gamma^{-2} \ln m)$ corresponds to standard AdaBoost and is thus tight. The term $p = \exp(\Omega(d))$ is also near-tight. In particular, given $m$ samples, by Sauer-Shelah, there are only $O((m/d)^d) = \exp(O(d \ln(m/d)))$ distinct labellings by $H$ on the training set. If we run AdaBoost, and in every iteration, we check whether a previously obtained hypothesis has advantage $\gamma$ under the current weighting, then we make no more than $\exp(O(d \ln(m/d)))$ queries to the weak learner (since every returned hypothesis must be distinct). The term $p \ln t = \Omega(\gamma^{-2} d \ln m)$ matches our new upper bound in Theorem 1.1. Thus, only the $t \ge \exp(\exp(\Omega(d)))$ term does not match any known upper bound.

Other Related Work. Finally, we mention the work by Long and Servedio [2013], which initiated the study of the parallel complexity of Boosting. In their work, they proved that the parallel complexity $(p, t)$ must satisfy $p = \Omega(\gamma^{-2} \ln m)$, regardless of $t$ (they state it as $p = \Omega(\gamma^{-2})$, but it is not hard to improve by a $\ln m$ factor for loss $m^{-0.01}$). This seems to contradict the upper bounds above. The reason is that their lower bound has restrictions on which query distributions the weak-to-strong learner makes to the weak learner. The upper bounds above thus all circumvent these restrictions. As a second restriction, their lower bound instance has a VC-dimension that grows with $m$.

2 Upper Bound

In this section, we discuss our proposed method, Algorithm 1. Here, $C_n$ refers to a universal constant shared among results. We provide a theoretical analysis of the algorithm, showing that it realizes the claims in Theorem 1.1. Our proof goes via the following intermediate theorem.

Theorem 2.1.
There exists a universal constant $C_n \ge 1$ such that for all $0 < \gamma < 1/2$, $0 < \delta < 1$, $R \in \mathbb{N}$, concept $c: X \to \{-1, 1\}$, and hypothesis set $H \subseteq \{-1, 1\}^X$ of VC-dimension $d$, Algorithm 1, given an input training set $S \in X^m$, a $\gamma$-weak learner $W$, $p \ge \frac{4 \ln m}{\gamma^2 R}$, and $t \ge e^{16 C_n d R} R \ln(pR/\delta)$, produces a linear classifier $g$ at Line 21 such that with probability at least $1 - \delta$ over the randomness of Algorithm 1, $g(x) c(x) \ge \gamma/8$ for all $x \in S$.

In Theorem 2.1 and throughout the paper, we define a linear classifier $g$ as a linear combination of hypotheses $g(x) = \sum_{i=1}^k \alpha_i h_i(x)$ with $\sum_i |\alpha_i| = 1$. A linear classifier thus corresponds to a voting classifier with coefficients normalized and no sign operation. Observe that the voting classifier $f(x) = \mathrm{sign}(g(x))$ is correct if and only if $c(x) g(x) > 0$, where $c(x)$ is the correct label of $x$. Furthermore, $c(x) g(x)$ is the margin of the voting classifier $f$ on input $x$. Theorem 1.1 follows from Theorem 2.1 via generalization bounds for linear classifiers with large margins. Namely, we apply Breiman's min-margin bound:

Theorem 2.2 (Breiman [1999]). Let $c: X \to \{-1, 1\}$ be an unknown concept, $H \subseteq \{-1, 1\}^X$ a hypothesis set of VC-dimension $d$ and $D$ an arbitrary distribution over $X$. There is a universal constant $C > 0$ such that with probability at least $1 - \delta$ over a set of $m$ samples $S \sim D^m$, it holds for every linear classifier $g$ satisfying $c(x) g(x) \ge \gamma$ for all $(x, c(x)) \in S$ that

$L_D(\mathrm{sign}(g)) \le C\!\left(\frac{d \ln(m) \ln(m/d)}{\gamma^2 m} + \frac{\ln(1/\delta)}{m}\right).$

Thus far, our general strategy mirrors that of previous works: we seek to show that, given suitable parameters, Algorithm 1 produces a linear classifier with margins of order $\gamma$ with good probability.

Algorithm 1: Proposed parallel Boosting algorithm
Input: Training set $S = \{(x_1, c(x_1)), \dots, (x_m, c(x_m))\}$, $\gamma$-weak learner $W$, number of calls to weak learner per round $t$, number of rounds $p$
Output: Voting classifier $f$
1  $\alpha \leftarrow \frac{1}{2} \ln\frac{1/2 + \gamma/2}{1/2 - \gamma/2}$
2  $n \leftarrow \lceil C_n d/\gamma^2 \rceil$
3  $D_1 \leftarrow (1/m, \dots, 1/m)$
4  for $k \leftarrow 0$ to $p - 1$ do
5    parallel for $r \leftarrow 1$ to $R$ do
6      parallel for $j \leftarrow 1$ to $t/R$ do
7        Sample $T_{kR+r,j} \sim D_{kR+1}^n$
8        $h_{kR+r,j} \leftarrow W(T_{kR+r,j}, \mathrm{Uniform}(T_{kR+r,j}))$
9      $H_{kR+r} \leftarrow \{h_{kR+r,1}, \dots, h_{kR+r,t/R}\} \cup \{-h_{kR+r,1}, \dots, -h_{kR+r,t/R}\}$
10   for $r \leftarrow 1$ to $R$ do
11     if there exists $h' \in H_{kR+r}$ s.t. $L_{D_{kR+r}}(h') \le 1/2 - \gamma/2$ then
12       $h_{kR+r} \leftarrow h'$
13       $\alpha_{kR+r} \leftarrow \alpha$
14     else
15       $h_{kR+r} \leftarrow$ arbitrary hypothesis from $H_{kR+r}$
16       $\alpha_{kR+r} \leftarrow 0$
17     for $i \leftarrow 1$ to $m$ do
18       $D_{kR+r+1}(i) \leftarrow D_{kR+r}(i) \exp(-\alpha_{kR+r} c(x_i) h_{kR+r}(x_i))$
19     $Z_{kR+r} \leftarrow \sum_{i=1}^m D_{kR+r}(i) \exp(-\alpha_{kR+r} c(x_i) h_{kR+r}(x_i))$
20     $D_{kR+r+1} \leftarrow D_{kR+r+1}/Z_{kR+r}$
21 $g \leftarrow x \mapsto \frac{1}{\sum_{j=1}^{pR} \alpha_j} \sum_{j=1}^{pR} \alpha_j h_j(x)$
22 return $f: x \mapsto \mathrm{sign}(g(x))$

Therefore, this section focuses on the lemmas that describe how, with suitable parameters, Algorithm 1 produces a classifier with large margins. With these results in hand, the proof of Theorem 2.1 becomes quite straightforward, so we defer it to Appendix B.3.

Algorithm 1 is a variant of Lyu et al. [2024, Algorithm 2]. The core idea is to use bagging to produce (in parallel) a set of hypotheses and use it to simulate a weak learner. To be more precise, we reason in terms of the following definition.

Definition 1 ($\varepsilon$-approximation). Given a concept $c: X \to \{-1, 1\}$, a hypothesis set $H \subseteq \{-1, 1\}^X$, and a distribution $D$ over $X$, a multiset $T$ is an $\varepsilon$-approximation for $D$, $c$, and $H$ if for all $h \in H$, it holds that $|L_D(h) - L_T(h)| \le \varepsilon$, where $L_T(h) := L_{\mathrm{Uniform}(T)}(h)$ is the empirical loss of $h$ on $T$. Moreover, we omit the reference to $c$ and $H$ when no confusion seems possible.

Consider a reference distribution $D_0$ over a training dataset $S$.
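For concreteness, the following is a minimal Python sketch of one outer round of Algorithm 1 (the `weak_learner` callable is a hypothetical stand-in returning a $\pm 1$-valued hypothesis, and selection details such as tie-breaking are simplified):

```python
import numpy as np

def one_round(X, y, D, weak_learner, t, R, n, gamma, rng):
    """One outer round (fixed k) of Algorithm 1, sketched.  Every bag is
    drawn from the distribution D at the start of the round, so all t
    weak-learner calls can run in parallel; the R Boosting steps that
    follow only evaluate the pre-trained pool on the training set."""
    m = len(y)
    alpha = 0.5 * np.log((0.5 + gamma / 2) / (0.5 - gamma / 2))
    # Parallel bagging step (Lines 5-9): t/R bags per Boosting step.
    pools = [[] for _ in range(R)]
    for r in range(R):
        for _ in range(t // R):
            idx = rng.choice(m, size=n, p=D)          # T ~ D^n (round-start D)
            h = weak_learner(X[idx], y[idx])          # uniform weights on T
            pools[r] += [h, lambda Z, h=h: -h(Z)]     # keep h and its negation
    # Sequential Boosting steps (Lines 10-20): inference only, no new calls.
    chosen = []
    for r in range(R):
        good = [h for h in pools[r]
                if np.sum(D * (h(X) != y)) <= 0.5 - gamma / 2]
        a, h = (alpha, good[0]) if good else (0.0, pools[r][0])  # a=0: skip step
        D = D * np.exp(-a * y * h(X))
        D = D / D.sum()                               # normalization by Z
        chosen.append((a, h))
    return D, chosen
```

The key structural point, used throughout the analysis below, is that the weak learner is only ever trained on bags drawn from the round-start distribution, while the intermediate distributions are handled purely by searching the pool.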
The bagging part of the method leverages the fact that if a subsample $T \sim D_0^n$ is a $\gamma/2$-approximation for $D_0$, then inputting $T$ (with the uniform distribution over it) to a $\gamma$-weak learner produces a hypothesis $h$ that, besides having advantage $\gamma$ on $T$, also has advantage $\gamma/2$ on the entire dataset $S$ (relative to $D_0$). Indeed, in this setting, we have that $L_{D_0}(h) \le L_T(h) + \gamma/2 \le 1/2 - \gamma + \gamma/2 = 1/2 - \gamma/2$. We can then take $h$ as if produced by a $\gamma/2$-weak learner queried with $(S, D_0)$, and compute a new distribution $D_1$ via a standard Boosting step (notice that we employ a fixed learning rate that assumes a worst-case advantage of $\gamma/2$). That is, we can simulate a $\gamma/2$-weak learner as long as we can provide a $\gamma/2$-approximation for the target distribution.

The strategy is to have a parallel bagging step in which we sample $T_1, T_2, \dots, T_t \overset{\mathrm{iid}}{\sim} D_0^n$ and query the $\gamma$-weak learner on each $T_j$ to obtain hypotheses $h_1, \dots, h_t$. Then, we search within these hypotheses to sequentially perform $R$ Boosting steps, obtaining distributions $D_1, D_2, \dots, D_R$. As argued, this approach will succeed whenever we can find at least one $\gamma/2$-approximation for each $D_r$ among $T_1, T_2, \dots, T_t$. A single parallel round of querying the weak learner is thus sufficient for performing $R$ steps of Boosting, effectively reducing $p$ by a factor $R$. Crucially, testing the performance of the returned hypotheses $h_1, \dots, h_t$ uses only inference/predictions and no calls to the weak learner.

The challenge is that the distributions $D_r$ diverge (exponentially fast) from $D_0$ as we progress in the Boosting steps. For the first Boosting step, the following classic result ensures a good probability of obtaining an approximation for $D_0$ when sampling from $D_0$ itself.

Theorem 2.3 (Li et al. [2001], Talagrand [1994], Vapnik and Chervonenkis [1971]). There is a universal constant $C > 0$ such that for any $0 < \varepsilon, \delta < 1$, $H \subseteq \{-1, 1\}^X$ of VC-dimension $d$, and distribution $D$ over $X$, it holds with probability at least $1 - \delta$ over a set $T \sim D^n$ that $T$ is an $\varepsilon$-approximation for $D$, $c$, and $H$, provided that $n \ge C\left((d + \ln(1/\delta))/\varepsilon^2\right)$.

However, we are interested in approximations for $D_r$ when we only have access to samples from $D_0$. Lyu et al. [2024] approach this problem by tracking the distance between the distributions in terms of their max-divergence

$D_\infty(D_r, D_0) := \ln\!\left(\sup_{x \in X} \frac{D_r(x)}{D_0(x)}\right).$ (2)

By bounding both $D_\infty(D_r, D_0)$ and $D_\infty(D_0, D_r)$, the authors can leverage the advanced composition theorem [Dwork et al., 2010] (Lemma 4.6 of Lyu et al. [2024]) from the differential privacy literature to bound the probability of obtaining an approximation for $D_r$ when sampling from $D_0$. In turn, this allows them to relate the number of samples $t$ and the (sufficiently small) number of Boosting steps $R$ in a way that ensures a good probability of success at each step.

Besides setting up the application of advanced composition, the use of the max-divergence also simplifies the analysis, since its locality allows one to bound the divergence between the two distributions via a worst-case study of a single entry. However, this approach sacrifices global information, limiting how much we can leverage our understanding of the distributions generated by Boosting algorithms. With that in mind, we instead track the distance between $D_r$ and $D_0$ in terms of the Kullback-Leibler divergence (KL divergence) [Kullback and Leibler, 1951] between them:

$\mathrm{KL}(D_r \,\|\, D_0) := \sum_{x \in X} D_r(x) \ln\frac{D_r(x)}{D_0(x)}.$

Comparing this expression to Eq. (2) reveals that the max-divergence is indeed a worst-case estimation of the KL divergence.
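A quick numerical illustration of this last point (a sketch, not part of the paper's argument): the KL divergence averages the log-ratio under $D_r$, while the max-divergence takes its supremum, so a distribution that deviates from the reference on a single low-mass point has small KL divergence but large max-divergence.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for distributions with identical support."""
    return float(np.sum(p * np.log(p / q)))

def max_div(p, q):
    """Max-divergence: ln of the largest pointwise ratio p(x)/q(x)."""
    return float(np.log(np.max(p / q)))

m = 1000
q = np.full(m, 1.0 / m)       # reference distribution D_0
p = q.copy()
p[0] *= 20.0                  # one entry blown up 20x ...
p /= p.sum()                  # ... as a few Boosting steps might do
print(kl(p, q))               # ~0.04: the averaged log-ratio stays tiny
print(max_div(p, q))          # ~3.0:  the worst-case log-ratio is large
```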
The KL divergence (also known as relative entropy) between two distributions $P$ and $Q$ is always non-negative and equal to zero if and only if $P = Q$. Moreover, in our setting, it is always finite due to the following remark (strictly speaking, we only need $P$ to be absolutely continuous with respect to $Q$, i.e., that for any event $A$ we have $P(A) = 0$ whenever $Q(A) = 0$; we express our results in terms of identical supports for the sake of simplicity, as they can be readily generalized to only require absolute continuity).

Remark 1. In the execution of Algorithm 1, every distribution $D_\ell$, for $\ell \in [pR]$, has the same support. This must be the case since Line 20 always preserves the support of $D_1$.

On the other hand, the KL divergence is not a proper metric, as it is not symmetric and it does not satisfy the triangle inequality, unlike the max-divergence. This introduces a number of difficulties in bounding the divergence between $D_0$ and $D_r$. Overcoming these challenges requires a deeper and highly novel analysis. Our results reveal that the KL divergence captures particularly well the behavior of our Boosting algorithm. We remark that we are not the first to relate KL divergence and Boosting, see e.g. Schapire and Freund [2012, Chapter 8 and the references therein], yet we make several new contributions to this connection.

To study the probability of obtaining a $\gamma/2$-approximation for $D_r$ when sampling from $D_0$, rather than using advanced composition, we employ the duality formula for variational inference [Donsker and Varadhan, 1975], also known as the Gibbs variational principle or Donsker-Varadhan formula, to estimate such a probability in terms of $\mathrm{KL}(D_r \,\|\, D_0)$.

Lemma 2.4 (Duality formula; a corollary of, e.g., Dembo and Zeitouni [1998, Lemma 6.2.13] or Lee [2022, Theorem 2.1], presented here in a weaker form for the sake of simplicity). Given finite probability spaces $(\Omega, \mathcal{F}, P)$ and $(\Omega, \mathcal{F}, Q)$, if $P$ and $Q$ have the same support, then for any real-valued random variable $X$ on $(\Omega, \mathcal{F}, P)$ we have that

$\ln \mathbb{E}_P\!\left[e^X\right] \ge \mathbb{E}_Q[X] - \mathrm{KL}(Q \,\|\, P).$ (3)

Lemma 2.4 allows us to prove that if $\mathrm{KL}(D_r \,\|\, D_0)$ is sufficiently small, then the probability of obtaining a $\gamma/2$-approximation for $D_r$ when sampling from $D_0$ is sufficiently large. Namely, we prove the following.

Lemma 2.5. There exists a universal constant $C_n \ge 1$ for which the following holds. Given $0 < \gamma < 1/2$, $R, m \in \mathbb{N}$, concept $c: X \to \{-1, 1\}$, and hypothesis set $H \subseteq \{-1, 1\}^X$ of VC-dimension $d$, let $D$ and $D'$ be distributions over $[m]$ and $\mathcal{G} \subseteq [m]^n$ be the family of $\gamma/2$-approximations for $D'$, $c$, and $H$. If $D$ and $D'$ have the same support and $\mathrm{KL}(D' \,\|\, D) \le 4\gamma^2 R$, then for $n = \lceil C_n d/\gamma^2 \rceil$ it holds that $\Pr_{T \sim D^n}[T \in \mathcal{G}] \ge \exp(-16 C_n d R)$.

Proof sketch. Our argument resembles a proof of the Chernoff bound: after taking exponentials on both sides of Eq. (3), we exploit the generality of Lemma 2.4 by defining the random variable $X: T \mapsto \lambda \mathbb{1}\{T \in \mathcal{G}\}$ and later carefully choosing $\lambda$. We then note that Theorem 2.3 ensures that $X$ has high expectation for $T \sim D'^n$. Setting $\lambda$ to leverage this fact, we obtain a lower bound on the probability that $T \in \mathcal{G}$ for $T \sim D^n$, yielding the thesis. We defer the detailed proof to Appendix B.1.

With Lemma 2.5 in hand, recall that our general goal is to show that, with high probability, the linear classifier $g$ produced by Algorithm 1 satisfies $c(x) g(x) = \Omega(\gamma)$ for all $x \in S$. Standard techniques allow us to further reduce this goal to that of showing that the product of the normalization factors, $\prod_{\ell=1}^{pR} Z_\ell$, is sufficiently small. Accordingly, in our next lemma, we bound the number of samples needed in the bagging step to obtain a small product of the normalization factors produced by the Boosting steps.
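Before stating that lemma, a quick numerical sanity check of the duality formula may be helpful (an illustration only, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 50
P = rng.random(k); P /= P.sum()            # two distributions, full support
Q = rng.random(k); Q /= Q.sum()
X = rng.normal(size=k)                     # arbitrary real-valued variable

lhs = np.log(np.sum(P * np.exp(X)))        # ln E_P[e^X]
rhs = np.sum(Q * X) - np.sum(Q * np.log(Q / P))   # E_Q[X] - KL(Q || P)
assert lhs >= rhs                          # Eq. (3) holds for every such Q
# Equality is attained by the Gibbs change of measure Q(x) ~ P(x) e^{X(x)},
# which is what makes the bound tight enough to be useful in Lemma 2.5.
```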
In that lemma, the analysis in terms of the KL divergence delivers a clear insight into the problem, revealing an interesting trichotomy: if $\mathrm{KL}(D_r \,\|\, D_0)$ is small, Lemma 2.5 yields the result; on the other hand, if $D_r$ has diverged too far from $D_0$, then either the algorithm has already made enough progress for us to skip a step, or the negation of some hypothesis used in a previous step has sufficient advantage relative to the distribution at hand. Formally, we prove the following.

Lemma 2.6. There exists a universal constant $C_n \ge 1$ such that for all $R \in \mathbb{N}$, $0 < \delta < 1$, $0 < \gamma < 1/2$, and $\gamma$-weak learner $W$ using a hypothesis set $H \subseteq \{-1, 1\}^X$ with VC-dimension $d$, if $t \ge R \exp(16 C_n d R) \ln(R/\delta)$, then with probability at least $1 - \delta$ the hypotheses $h_{kR+1}, \dots, h_{kR+R}$ obtained by Algorithm 1 induce normalization factors $Z_{kR+1}, \dots, Z_{kR+R}$ such that

$\prod_{r=1}^{R} Z_{kR+r} < \exp(-\gamma^2 R/2).$

Proof sketch. We assume for simplicity that $k = 0$ and argue by induction on $R' \in [R]$. After handling the somewhat intricate stochastic relationships of the problem, we leverage the simple remark that $\mathrm{KL}(D_{R'} \,\|\, D_{R'}) = 0$ to reveal the following telescopic decomposition:

$\mathrm{KL}(D_{R'} \,\|\, D_1) = \mathrm{KL}(D_{R'} \,\|\, D_1) - \mathrm{KL}(D_{R'} \,\|\, D_{R'}) = \sum_{r=1}^{R'-1} \left[\mathrm{KL}(D_{R'} \,\|\, D_r) - \mathrm{KL}(D_{R'} \,\|\, D_{r+1})\right].$

Moreover, given $r \in \{1, \dots, R'-1\}$,

$\mathrm{KL}(D_{R'} \,\|\, D_r) - \mathrm{KL}(D_{R'} \,\|\, D_{r+1}) = \sum_{i=1}^m D_{R'}(i) \ln\frac{D_{r+1}(i)}{D_r(i)} = -\ln Z_r - \alpha_r \sum_{i=1}^m D_{R'}(i) c(x_i) h_r(x_i).$

Altogether, we obtain that

$\mathrm{KL}(D_{R'} \,\|\, D_1) = -\ln \prod_{r=1}^{R'-1} Z_r + \sum_{r=1}^{R'-1} \alpha_r \sum_{i=1}^m D_{R'}(i) c(x_i) (-h_r(x_i)).$

Now, if $\mathrm{KL}(D_{R'} \,\|\, D_1)$ is small (at most $4\gamma^2 R$), Lemma 2.5 ensures that with sufficient probability there exists a $\gamma/2$-approximation for $D_{R'}$ within $T_{R',1}, \dots, T_{R',t/R}$, yielding the induction step (by Claim 1). Otherwise, if $\mathrm{KL}(D_{R'} \,\|\, D_1)$ is large, then either (i) the term $-\ln \prod_{r=1}^{R'-1} Z_r$ is large enough for us to conclude that $\prod_{r=1}^{R'-1} Z_r$ is already less than $\exp(-\gamma^2 R'/2)$ and we can skip the step; or (ii) the term $\sum_{r=1}^{R'-1} \alpha_r \sum_{i=1}^m D_{R'}(i) c(x_i) (-h_r(x_i))$ is sufficiently large to imply the existence of $h' \in \{-h_1, \dots, -h_{R'-1}\}$ satisfying

$\sum_{i=1}^m D_{R'}(i) c(x_i) h'(x_i) > \gamma,$

which implies that such $h'$ has margin at least $\gamma$ with respect to $D_{R'}$, and we can conclude the induction step as before. We defer the detailed proof to Appendix B.2.

3 Overview of the Lower Bound

In this section, we provide an overview of the main ideas behind our improved lower bound. The details are available in Appendix C. Our lower bound proof is inspired by and builds upon the work of Lyu et al. [2024], which we now summarize. Similarly to Karbasi and Larsen [2024], they consider an input domain $X = [2m]$, where $m$ denotes the number of training samples available for a weak-to-strong learner $\mathcal{A}$ with parallel complexity $(p, t)$. In their construction, they consider a uniform random concept $c: X \to \{-1, 1\}$ and give a randomized construction of a weak learner. Proving a lower bound on the expected error of $\mathcal{A}$ under this random choice of concept and weak learner implies, by averaging, the existence of a deterministic choice of concept and weak learner for which $\mathcal{A}$ has at least the same error. The weak learner is constructed by drawing a random hypothesis set $H$, using inspiration from the so-called coin problem. In the coin problem, we observe $p$ independent outcomes of a biased coin and the goal is to determine the direction of the bias.
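The following small simulation (illustrative only, not from the paper) shows the best possible guesser, a majority vote over the outcomes, failing at the rate quantified next:

```python
import numpy as np

def coin_problem_error(beta, n, trials=200_000, seed=0):
    """Empirical failure probability of the majority-vote guesser when each
    of n tosses shows the hidden sign c with probability 1/2 + beta."""
    rng = np.random.default_rng(seed)
    c = rng.choice([-1, 1], size=trials)              # hidden bias direction
    agree = rng.random((trials, n)) < 0.5 + beta      # does toss match c?
    tosses = np.where(agree, c[:, None], -c[:, None])
    guess = np.sign(tosses.sum(axis=1) + 0.5)         # majority (ties -> +1)
    return float(np.mean(guess != c))

# The error behaves like exp(-O(beta^2 n)): doubling the bias compensates
# for a 4x reduction in the number of observations.
print(coin_problem_error(0.05, 400))   # ~0.02
print(coin_problem_error(0.10, 100))   # ~0.02
```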
If a coin has a bias of $\beta$, then upon seeing $n$ outcomes of the coin, any algorithm for guessing the direction of the bias is wrong with probability at least $\exp(-O(\beta^2 n))$. Now, to connect this to parallel Boosting, Lyu et al. construct $H$ by adding $c$ as well as $p$ random hypotheses $h_1, \dots, h_p$ to $H$. Each hypothesis $h_i$ has each $h_i(x)$ chosen independently with $h_i(x) = c(x)$ holding with probability $1/2 + 2\gamma$. The weak learner $W$ now processes a query distribution $D$ by returning the first hypothesis $h_i$ with advantage $\gamma$ under $D$. If no such hypothesis exists, it instead returns $c$. The key observation is that if $W$ is never forced to return $c$, then the only information $\mathcal{A}$ has about $c(x)$, for each $x$ not in the training data (which is at least half of all $x$, since $|X| = 2m$), is the outcomes of up to $p$ coin tosses that are $2\gamma$ biased towards $c(x)$. Thus, the expected error becomes $\exp(-O(\gamma^2 p))$. For this to be smaller than $m^{-0.01}$ then requires $p = \Omega(\gamma^{-2} \ln m)$, as claimed in their lower bound.

The last step of their proof is then to argue that $W$ rarely has to return $c$ upon a query. The idea here is to show that in the $i$th parallel round, $W$ can use $h_i$ to answer all queries, provided that $t$ is small enough. This is done by observing that for any query distribution $D$ that is independent of $h_i$, the expected loss satisfies $\mathbb{E}_{h_i}[L_D(h_i)] = 1/2 - 2\gamma$ due to the bias. Using inspiration from Karbasi and Larsen [2024], they then show that for sufficiently "well-spread" queries $D$, the loss of $h_i$ under $D$ is highly concentrated around its expectation (over the random choice of $h_i$), and thus $h_i$ may simultaneously answer all (up to) $t$ well-spread queries in round $i$. To handle "concentrated" queries, i.e., query distributions with most of the weight on a few $x$, they also use ideas from Karbasi and Larsen [2024] to argue that if we add $2^{O(d)}$ uniform random hypotheses to $H$, then these may be used to answer all concentrated queries. Note that the proof crucially relies on $h_i$ being independent of the queries in the $i$th round. Here the key idea is that if $W$ can answer all the queries in round $i$ using $h_i$, then $h_{i+1}, \dots, h_p$ are independent of any queries the weak-to-strong learner makes in round $i + 1$.

In our improved lower bound, we observe that the expected error of $\exp(-O(\gamma^2 p))$ is much larger than $m^{-0.01}$ for small $p$. That is, the previous proof is in some sense showing something much too strong when trying to understand the tradeoff between $p$ and $t$. What this gives us is that we can afford to make the coins/hypotheses $h_i$ much more biased towards $c$ when $p$ is small. Concretely, we can let the bias be as large as $\beta = \Theta(\sqrt{\ln(m)/p})$, which may be much larger than $2\gamma$. This, in turn, makes it significantly more likely that $h_i$ can answer an independently chosen query distribution $D$. In this way, the same $h_i$ may answer a much larger number of queries $t$, resulting in a tight tradeoff between the parameters. As a second contribution, we also find a better way of analyzing this lower bound instance, improving one term in the lower bound on $t$ from $\exp(\Omega(d))$ to $\exp(\exp(\Omega(d)))$. We refer the reader to the full proof for details.

4 Conclusion

In this paper, we have addressed the parallelization of Boosting algorithms. By establishing both improved lower bounds and an essentially optimal algorithm, we have effectively closed the gap between theoretical lower bounds and performance guarantees across the entire tradeoff spectrum between the number of training rounds and the parallel work per round.
Given that, we believe future work may focus on better understanding the applicability of the theoretical tools developed here to other settings, since some of the lemmas obtained seem quite general. They may aid, for example, in investigating to which extent the post-processing of hypotheses obtained in the bagging step can improve the complexity of parallel Boosting algorithms, which remains an interesting research direction.

Acknowledgments and Disclosure of Funding

This research is co-funded by the European Union (ERC, TUCLA, 101125203) and Independent Research Fund Denmark (DFF) Sapere Aude Research Leader Grant No. 9064-00068B. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them.

References

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

Adam J. Grove and Dale Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In AAAI/IAAI, pages 692-699, 1998.

Gunnar Rätsch, Manfred K. Warmuth, and John Shawe-Taylor. Efficient margin maximizing with boosting. Journal of Machine Learning Research, 6(12), 2005.

Rocco A. Servedio. Smooth boosting and learning with malicious noise. Journal of Machine Learning Research, 4:633-648, 2003. doi: 10.1162/153244304773936072.

Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189-1232, 2001.

Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In KDD, pages 785-794. ACM, 2016.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In NIPS, 2017.

Alexey Natekin and Alois Knoll. Gradient boosting machines, a tutorial. Frontiers in Neurorobotics, 7, 2013.

Philip M. Long and Rocco A. Servedio. Algorithms and hardness results for parallel large margin learning. Journal of Machine Learning Research, 14(1):3105-3128, 2013.

Amin Karbasi and Kasper Green Larsen. The impossibility of parallelizing boosting. In Proceedings of The 35th International Conference on Algorithmic Learning Theory, volume 237 of Proceedings of Machine Learning Research, pages 635-653. PMLR, 2024. URL https://proceedings.mlr.press/v237/karbasi24a.html.

Xin Lyu, Hongxun Wu, and Junzhao Yang. The cost of parallelizing boosting. In Proceedings of the 2024 ACM-SIAM Symposium on Discrete Algorithms, SODA 2024, pages 3140-3155. SIAM, 2024. doi: 10.1137/1.9781611977912.112.

Michael Kearns. Learning boolean formulae or finite automata is as hard as factoring. Technical Report TR-14-88, Harvard University Aiken Computation Laboratory, 1988.

Michael Kearns and Leslie Valiant. Cryptographic limitations on learning boolean formulae and finite automata. Journal of the ACM, 41(1):67-95, 1994.

Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197-227, 1990.

Leo Breiman. Prediction games and arcing algorithms. Neural Computation, 11(7):1493-1517, 1999. doi: 10.1162/089976699300016106.
Wei Gao and Zhi-Hua Zhou. On the doubt about margin explanation of boosting. Artificial Intelligence, 203:1-18, 2013.

Kasper Green Larsen and Martin Ritzert. Optimal weak to strong learning. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/d38653cdaa8e992549e1e9e1621610d7-Abstract-Conference.html.

Yi Li, Philip M. Long, and Aravind Srinivasan. Improved bounds on the sample complexity of learning. Journal of Computer and System Sciences, 62(3):516-527, 2001. doi: 10.1006/jcss.2000.1741.

Michel Talagrand. Sharper bounds for gaussian and empirical processes. The Annals of Probability, pages 28-76, 1994.

Vladimir N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971.

Cynthia Dwork, Guy N. Rothblum, and Salil P. Vadhan. Boosting and differential privacy. In 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS 2010), pages 51-60. IEEE Computer Society, 2010. doi: 10.1109/FOCS.2010.12.

Solomon Kullback and Richard A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79-86, 1951.

Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012. doi: 10.7551/mitpress/8291.001.0001.

Monroe D. Donsker and S. R. S. Varadhan. Asymptotic evaluation of certain Markov process expectations for large time, I and II. Communications on Pure and Applied Mathematics, 28(1-2):279-301, 1975.

Amir Dembo and Ofer Zeitouni. Large Deviations Techniques and Applications. Springer, 1998. doi: 10.1007/978-1-4612-5320-4.

Se Yoon Lee. Gibbs sampler and coordinate ascent variational inference: A set-theoretical review. Communications in Statistics - Theory and Methods, 51(6):1549-1568, 2022.

A Auxiliary Results

In this section we state and prove claims utilized in our argument. The arguments behind these are fairly standard, so they are not explicitly stated in the main text.

Claim 1. Let $\ell \in \mathbb{N}$ and $0 < \gamma < 1/2$. If a hypothesis $h_\ell$ has advantage $\gamma_\ell$ satisfying $L_{D_\ell}(h_\ell) = 1/2 - \gamma_\ell \le 1/2 - \gamma/2$ and $\alpha_\ell = \alpha$, then

$Z_\ell \le \sqrt{1 - \gamma^2} \le e^{-\gamma^2/2}.$

Proof. It holds that

$Z_\ell = \sum_{i=1}^m D_\ell(i) \exp(-\alpha_\ell c(x_i) h_\ell(x_i)) = \sum_{i: h_\ell(x_i) = c(x_i)} D_\ell(i) e^{-\alpha} + \sum_{i: h_\ell(x_i) \neq c(x_i)} D_\ell(i) e^{\alpha} = \left(\frac{1}{2} + \gamma_\ell\right) \sqrt{\frac{1 - \gamma}{1 + \gamma}} + \left(\frac{1}{2} - \gamma_\ell\right) \sqrt{\frac{1 + \gamma}{1 - \gamma}},$

where we used that $e^\alpha = \sqrt{(1 + \gamma)/(1 - \gamma)}$ by the definition of $\alpha$. This expression is non-increasing in $\gamma_\ell$. Finally, since $\gamma_\ell \ge \gamma/2$ and $\gamma \in (0, 1/2)$, and, thus, $1 - \gamma^2 > 0$, we have that

$Z_\ell \le \left(\frac{1}{2} + \frac{\gamma}{2}\right) \sqrt{\frac{1 - \gamma}{1 + \gamma}} + \left(\frac{1}{2} - \frac{\gamma}{2}\right) \sqrt{\frac{1 + \gamma}{1 - \gamma}} = \sqrt{1 - \gamma^2} \le e^{-\gamma^2/2}.$

Claim 2. Algorithm 1 produces a linear classifier $g$ whose exponential loss satisfies

$\sum_{i=1}^m \exp\!\left(-c(x_i) \sum_{j=1}^{pR} \alpha_j h_j(x_i)\right) = m \prod_{j=1}^{pR} Z_j.$

Proof. It suffices to consider the last distribution $D_{pR+1}$ produced by the algorithm. It holds that

$1 = \sum_{i=1}^m D_{pR+1}(i)$ (as $D_{pR+1}$ is a distribution)
$= \sum_{i=1}^m \frac{D_{pR}(i) \exp(-\alpha_{pR} c(x_i) h_{pR}(x_i))}{Z_{pR}}$ (by Line 20)
$= \sum_{i=1}^m \frac{\exp\!\left(-c(x_i) \sum_{j=1}^{pR} \alpha_j h_j(x_i)\right)}{m \prod_{j=1}^{pR} Z_j}$ (by further unrolling the $D_j$s; as $D_1$ is uniform)

Rearranging yields the claim.

B Detailed Proofs

In this section, we provide full proofs for the results from Section 2. For convenience, we provide copies of the statements before each proof.

B.1 Proof of Lemma 2.5

Lemma 2.5.
There exists a universal constant $C_n \ge 1$ for which the following holds. Given $0 < \gamma < 1/2$, $R, m \in \mathbb{N}$, concept $c: X \to \{-1, 1\}$, and hypothesis set $H \subseteq \{-1, 1\}^X$ of VC-dimension $d$, let $D$ and $D'$ be distributions over $[m]$ and $\mathcal{G} \subseteq [m]^n$ be the family of $\gamma/2$-approximations for $D'$, $c$, and $H$. If $D$ and $D'$ have the same support and $\mathrm{KL}(D' \,\|\, D) \le 4\gamma^2 R$, then for $n = \lceil C_n d/\gamma^2 \rceil$ it holds that $\Pr_{T \sim D^n}[T \in \mathcal{G}] \ge \exp(-16 C_n d R)$.

Proof. Let $\lambda \in \mathbb{R}_{>0}$ (to be chosen later) and $X: [m]^n \to \{0, \lambda\}$ be the random variable given by $X(T) = \lambda \mathbb{1}\{T \in \mathcal{G}\}$. Since $D$ and $D'$ have the same support, so do $D^n$ and $D'^n$. Thus, taking the exponential of both sides of Eq. (3), Lemma 2.4 yields that

$\exp\!\left(-\mathrm{KL}(D'^n \,\|\, D^n) + \mathbb{E}_{D'^n}[X]\right) \le \mathbb{E}_{D^n}\!\left[e^X\right].$ (4)

We have that

$\mathbb{E}_{D'^n}[X] = \lambda \Pr_{T \sim D'^n}[T \in \mathcal{G}],$ (5)

and

$\mathbb{E}_{D^n}\!\left[e^X\right] = \mathbb{E}_{T \sim D^n}\!\left[e^\lambda \mathbb{1}\{T \in \mathcal{G}\} + \mathbb{1}\{T \notin \mathcal{G}\}\right] = 1 + (e^\lambda - 1) \Pr_{T \sim D^n}[T \in \mathcal{G}].$ (6)

Applying Eqs. (5) and (6) to Eq. (4), we obtain that

$\exp\!\left(-\mathrm{KL}(D'^n \,\|\, D^n) + \lambda \Pr_{T \sim D'^n}[T \in \mathcal{G}]\right) \le 1 + (e^\lambda - 1) \Pr_{T \sim D^n}[T \in \mathcal{G}],$

and hence

$\Pr_{T \sim D^n}[T \in \mathcal{G}] \ge \frac{\exp\!\left(-\mathrm{KL}(D'^n \,\|\, D^n) + \lambda \Pr_{T \sim D'^n}[T \in \mathcal{G}]\right) - 1}{e^\lambda - 1}$

for any $\lambda > 0$. Choosing

$\lambda = \frac{\mathrm{KL}(D'^n \,\|\, D^n) + \ln 2}{\Pr_{T \sim D'^n}[T \in \mathcal{G}]},$

we obtain that

$\Pr_{T \sim D^n}[T \in \mathcal{G}] \ge \frac{1}{e^\lambda - 1} \ge e^{-\lambda}.$ (7)

Now, by Theorem 2.3 (using $\delta = 1/2$), there exists a constant $C_n \ge 1$ such that having $n \ge C_n d/\gamma^2$ ensures that $\Pr_{T \sim D'^n}[T \in \mathcal{G}] \ge 1/2$. Also, since, by hypothesis, $\mathrm{KL}(D' \,\|\, D) \le 4\gamma^2 R$, we have that $\mathrm{KL}(D'^n \,\|\, D^n) = n \, \mathrm{KL}(D' \,\|\, D) \le 4 C_n d R$. Applying this to Eq. (7), we conclude that

$\Pr_{T \sim D^n}[T \in \mathcal{G}] \ge \exp\!\left(-\frac{4 C_n d R + \ln 2}{1/2}\right) \ge \exp(-16 C_n d R).$

B.2 Proof of Lemma 2.6

Lemma 2.6. There exists a universal constant $C_n \ge 1$ such that for all $R \in \mathbb{N}$, $0 < \delta < 1$, $0 < \gamma < 1/2$, and $\gamma$-weak learner $W$ using a hypothesis set $H \subseteq \{-1, 1\}^X$ with VC-dimension $d$, if $t \ge R \exp(16 C_n d R) \ln(R/\delta)$, then with probability at least $1 - \delta$ the hypotheses $h_{kR+1}, \dots, h_{kR+R}$ obtained by Algorithm 1 induce normalization factors $Z_{kR+1}, \dots, Z_{kR+R}$ such that

$\prod_{r=1}^{R} Z_{kR+r} < \exp(-\gamma^2 R/2).$

Proof. Assume, for simplicity, that $k = 0$. Letting $E_{R'}$ denote the event that $\prod_{r=1}^{R'} Z_r < \exp(-\gamma^2 R'/2)$, we will show that for all $R' \in [R]$ it holds that

$\Pr[E_{R'} \mid E_1, \dots, E_{R'-1}] \ge 1 - \delta/R.$ (8)

The thesis then follows by noting that

$\Pr[E_1 \cap \dots \cap E_R] = \prod_{r=1}^R \Pr[E_r \mid E_1, \dots, E_{r-1}]$ (by the chain rule) $\ge (1 - \delta/R)^R$ (by Eq. (8)) $\ge 1 - \delta$ (by Bernoulli's inequality).

Let $\mathcal{G}_{D_{R'}} \subseteq [m]^n$ be the family of $\gamma/2$-approximations for $D_{R'}$, and recall that if $T \in \mathcal{G}_{D_{R'}}$, then any $h = W(T, \mathrm{Uniform}(T))$ satisfies $L_{D_{R'}}(h) \le 1/2 - \gamma/2$. Therefore, the existence of $T_{R',j} \in \mathcal{G}_{D_{R'}}$, for some $j \in [t/R]$, implies that $h_{R',j} \in H_{R'}$ has advantage at least $\gamma/2$ relative to $D_{R'}$. Hence, Algorithm 1 can select $h_{R',j}$ at Line 11, setting $\alpha_{R'} = \alpha$ so that, by Claim 1, we have that $Z_{R'} \le \exp(-\gamma^2/2)$.

Now notice that, by the law of total probability,

$\Pr\!\left[E_{R'} \,\middle|\, \bigcap_{r=1}^{R'-1} E_r\right] = \Pr\!\left[\mathrm{KL}(D_{R'} \,\|\, D_1) \le 4\gamma^2 R \,\middle|\, \bigcap_{r=1}^{R'-1} E_r\right] \Pr\!\left[E_{R'} \,\middle|\, \bigcap_{r=1}^{R'-1} E_r, \, \mathrm{KL}(D_{R'} \,\|\, D_1) \le 4\gamma^2 R\right] + \Pr\!\left[\mathrm{KL}(D_{R'} \,\|\, D_1) > 4\gamma^2 R \,\middle|\, \bigcap_{r=1}^{R'-1} E_r\right] \Pr\!\left[E_{R'} \,\middle|\, \bigcap_{r=1}^{R'-1} E_r, \, \mathrm{KL}(D_{R'} \,\|\, D_1) > 4\gamma^2 R\right].$ (9)

We will show that, conditioned on $\bigcap_{r=1}^{R'-1} E_r$, if $\mathrm{KL}(D_{R'} \,\|\, D_1) \le 4\gamma^2 R$, we can leverage Lemma 2.5 to argue that with probability at least $1 - \delta/R$ there exists a $\gamma/2$-approximation for $D_{R'}$ within $T_{R',1}, \dots, T_{R',t/R}$, and that $E_{R'}$ follows. On the other hand, if $\mathrm{KL}(D_{R'} \,\|\, D_1) > 4\gamma^2 R$, we shall prove that $E_{R'}$ necessarily holds. Under those two claims, Eq. (9) yields that

$\Pr\!\left[E_{R'} \,\middle|\, \bigcap_{r=1}^{R'-1} E_r\right] \ge \Pr\!\left[\mathrm{KL}(D_{R'} \,\|\, D_1) \le 4\gamma^2 R \,\middle|\, \bigcap_{r=1}^{R'-1} E_r\right] \left(1 - \frac{\delta}{R}\right) + \Pr\!\left[\mathrm{KL}(D_{R'} \,\|\, D_1) > 4\gamma^2 R \,\middle|\, \bigcap_{r=1}^{R'-1} E_r\right] \left(1 - \frac{\delta}{R}\right) = 1 - \frac{\delta}{R},$

which, as argued, concludes the proof. To proceed, we ought to consider the relationships between the random variables involved. To do so, for $r \in [R]$ let $\mathcal{T}_r = \{T_{r,1}, \dots, T_{r,t/R}\}$. Notice that $D_{R'}$ is itself random and determined by $D_1$ and $\mathcal{T}_1, \dots, \mathcal{T}_{R'-1}$.

For the first part, let $D_1$ and $T_1, \dots, T_{R'-1}$ be realizations of $D_1$ and $\mathcal{T}_1, \dots$
$, \mathcal{T}_{R'-1}$ such that $\bigcap_{r=1}^{R'-1} E_r$ holds and $\mathrm{KL}(D_{R'} \,\|\, D_1) \le 4\gamma^2 R$. Notice that if there exists a $\gamma/2$-approximation for $D_{R'}$ within $\mathcal{T}_{R'}$, then we can choose some $h_{R'} \in H_{R'}$ with advantage at least $\gamma/2$ so that

$\prod_{r=1}^{R'} Z_r = Z_{R'} \prod_{r=1}^{R'-1} Z_r < Z_{R'} \exp(-\gamma^2 (R'-1)/2)$ (as we condition on $\bigcap_{r=1}^{R'-1} E_r$) $\le \exp(-\gamma^2 R'/2)$ (by Claim 1),

and, thus, $E_{R'}$ follows. That is,

$\Pr\!\left[E_{R'} \,\middle|\, \bigcap_{r=1}^{R'-1} E_r, \, \mathrm{KL}(D_{R'} \,\|\, D_1) \le 4\gamma^2 R\right] \ge \Pr_{T_{R',1}, \dots, T_{R',t/R} \overset{\mathrm{iid}}{\sim} D_1^n}\!\left[\exists j \in [t/R]\colon T_{R',j} \in \mathcal{G}_{D_{R'}}\right].$ (10)

Finally, since by Remark 1 the distributions $D_{R'}$ and $D_1$ must have the same support, and we assume that $\mathrm{KL}(D_{R'} \,\|\, D_1) \le 4\gamma^2 R$, Lemma 2.5 ensures that $\Pr_{T \sim D_1^n}[T \in \mathcal{G}_{D_{R'}}] \ge \exp(-16 C_n d R)$. Hence,

$\Pr_{T_{R',1}, \dots, T_{R',t/R} \overset{\mathrm{iid}}{\sim} D_1^n}\!\left[\forall j \in [t/R]\colon T_{R',j} \notin \mathcal{G}_{D_{R'}}\right] = \Pr_{T \sim D_1^n}\!\left[T \notin \mathcal{G}_{D_{R'}}\right]^{t/R}$ (by i.i.d.-ness) $\le (1 - \exp(-16 C_n d R))^{t/R} \le \exp\!\left(-\frac{t}{R} \exp(-16 C_n d R)\right) \le \frac{\delta}{R},$

where the second inequality follows since $1 + x \le e^x$ for all $x \in \mathbb{R}$ and the last from the hypothesis that $t \ge R \exp(16 C_n d R) \ln(R/\delta)$. Considering the complementary event and applying Eq. (10), we obtain that $E_{R'}$ holds with probability at least $1 - \delta/R$.

For the second part, consider instead $D_1$ and $T_1, \dots, T_{R'-1}$ realizations of $D_1$ and $\mathcal{T}_1, \dots, \mathcal{T}_{R'-1}$ such that $\bigcap_{r=1}^{R'-1} E_r$ holds and

$4\gamma^2 R < \mathrm{KL}(D_{R'} \,\|\, D_1),$ (11)

and argue that $E_{R'}$ necessarily follows. Observe that

$\mathrm{KL}(D_{R'} \,\|\, D_1) = \mathrm{KL}(D_{R'} \,\|\, D_1) - \mathrm{KL}(D_{R'} \,\|\, D_{R'}) = \sum_{r=1}^{R'-1} \left[\mathrm{KL}(D_{R'} \,\|\, D_r) - \mathrm{KL}(D_{R'} \,\|\, D_{r+1})\right].$ (12)

Moreover, given $r \in \{1, \dots, R'-1\}$,

$\mathrm{KL}(D_{R'} \,\|\, D_r) - \mathrm{KL}(D_{R'} \,\|\, D_{r+1}) = \sum_{i=1}^m D_{R'}(i) \ln\frac{D_{R'}(i)}{D_r(i)} - \sum_{i=1}^m D_{R'}(i) \ln\frac{D_{R'}(i)}{D_{r+1}(i)} = \sum_{i=1}^m D_{R'}(i) \ln\frac{D_{r+1}(i)}{D_r(i)} = \sum_{i=1}^m D_{R'}(i) \ln\frac{\exp(-\alpha_r c(x_i) h_r(x_i))}{Z_r} = -\ln Z_r - \sum_{i=1}^m D_{R'}(i) \alpha_r c(x_i) h_r(x_i).$

Applying this to Eqs. (11) and (12) yields that

$4\gamma^2 R < \mathrm{KL}(D_{R'} \,\|\, D_1) = -\ln \prod_{r=1}^{R'-1} Z_r + \sum_{r=1}^{R'-1} \alpha_r \sum_{i=1}^m D_{R'}(i) c(x_i) (-h_r(x_i)).$

Thus, either

$-\ln \prod_{r=1}^{R'-1} Z_r > 2\gamma^2 R$ (13)

or

$\sum_{r=1}^{R'-1} \alpha_r \sum_{i=1}^m D_{R'}(i) c(x_i) (-h_r(x_i)) > 2\gamma^2 R.$ (14)

We proceed to analyze each case. If Eq. (13) holds, then

$\prod_{r=1}^{R'-1} Z_r < \exp(-2\gamma^2 R) \le \exp(-\gamma^2 R'/2),$

and $E_{R'}$ follows by noting that $Z_{R'} \le 1$ regardless of the outcome of Line 11, so $\prod_{r=1}^{R'} Z_r \le \prod_{r=1}^{R'-1} Z_r$.

On the other hand, if Eq. (14) holds, then, letting $\mathcal{R} = \{r \in [R'-1] \mid \alpha_r \neq 0\}$,

$2\gamma^2 R < \sum_{r=1}^{R'-1} \alpha_r \sum_{i=1}^m D_{R'}(i) c(x_i) (-h_r(x_i)) = \alpha \sum_{r \in \mathcal{R}} \sum_{i=1}^m D_{R'}(i) c(x_i) (-h_r(x_i)).$

Since $|\mathcal{R}| \le R$, we obtain that there exists $r \in \mathcal{R}$ with $\alpha \sum_{i=1}^m D_{R'}(i) c(x_i) (-h_r(x_i)) > 2\gamma^2$, so that there exists $h' \in \{-h_r \mid r \in \mathcal{R}\}$ such that

$\sum_{i=1}^m D_{R'}(i) c(x_i) h'(x_i) > \frac{2\gamma^2}{\alpha}.$ (15)

Moreover, from the definition of $\alpha$,

$\alpha = \frac{1}{2} \ln\frac{1/2 + \gamma/2}{1/2 - \gamma/2} = \frac{1}{2} \ln\!\left(1 + \frac{2\gamma}{1 - \gamma}\right) \le \frac{\gamma}{1 - \gamma} < 2\gamma,$ (16)

where the last inequality holds for any $\gamma \in (0, 1/2)$. Applying this to Eq. (15) yields that

$\sum_{i=1}^m D_{R'}(i) c(x_i) h'(x_i) > \frac{2\gamma^2}{2\gamma} = \gamma,$

thus $L_{D_{R'}}(h') < 1/2 - \gamma/2$ and, as before, $E_{R'}$ follows by Claim 1 and the conditioning on $\bigcap_{r=1}^{R'-1} E_r$.

B.3 Proof of Theorem 2.1

Theorem 2.1. There exists a universal constant $C_n \ge 1$ such that for all $0 < \gamma < 1/2$, $0 < \delta < 1$, $R \in \mathbb{N}$, concept $c: X \to \{-1, 1\}$, and hypothesis set $H \subseteq \{-1, 1\}^X$ of VC-dimension $d$, Algorithm 1, given an input training set $S \in X^m$, a $\gamma$-weak learner $W$, $p \ge \frac{4 \ln m}{\gamma^2 R}$, and $t \ge e^{16 C_n d R} R \ln(pR/\delta)$, produces a linear classifier $g$ at Line 21 such that with probability at least $1 - \delta$ over the randomness of Algorithm 1, $g(x) c(x) \ge \gamma/8$ for all $x \in S$.

Proof. Let $k \in \{0, 1, \dots, p-1\}$. Applying Lemma 2.6 with failure probability $\delta/p$, we obtain that with probability at least $1 - \delta/p$,

$\prod_{r=1}^{R} Z_{kR+r} < \exp(-\gamma^2 R/2).$

Thus, by the union bound, the probability that this holds for all $k \in \{0, 1, \dots, p-1\}$ is at least $1 - \delta$. Under this event, we have that

$\sum_{i=1}^m \exp\!\left(-c(x_i) \sum_{j=1}^{pR} \alpha_j h_j(x_i)\right) = m \prod_{j=1}^{pR} Z_j$ (by Claim 2) $< m \prod_{k=0}^{p-1} \exp(-\gamma^2 R/2) = m \exp(-\gamma^2 p R/2).$ (17)

Now, let $\theta \ge 0$. If $c(x) g(x) < \theta$, then, by the definition of $g$ at Line 21, it must hold that $c(x) \sum_{j=1}^{pR} \alpha_j h_j(x) < \sum_{j=1}^{pR} \alpha_j \theta$, thus the difference $\sum_{j=1}^{pR} \alpha_j \theta - c(x) \sum_{j=1}^{pR} \alpha_j h_j(x)$ is strictly positive. Taking the exponential, we obtain that, for all $x \in S$,

$\mathbb{1}\{c(x) g(x) < \theta\} \le \exp\!\left(\sum_{j=1}^{pR} \alpha_j \theta - c(x) \sum_{j=1}^{pR} \alpha_j h_j(x)\right) \le \exp(p R \alpha \theta) \exp\!\left(-c(x) \sum_{j=1}^{pR} \alpha_j h_j(x)\right)$
(as $\alpha_j \le \alpha$ for all $j$). Hence,

$\sum_{i=1}^m \mathbb{1}\{c(x_i) g(x_i) < \theta\} < \exp(p R \alpha \theta) \sum_{i=1}^m \exp\!\left(-c(x_i) \sum_{j=1}^{pR} \alpha_j h_j(x_i)\right).$

Applying Eq. (17), we obtain that

$\sum_{i=1}^m \mathbb{1}\{c(x_i) g(x_i) < \theta\} < m \exp(p R \alpha \theta) \exp(-\gamma^2 p R/2) = m \exp\!\left(p R (\alpha \theta - \gamma^2/2)\right).$

Finally, since $0 \le \alpha \le 2\gamma$ (see Eq. (16)), we have that, for $0 \le \theta \le \gamma/8$, $\alpha \theta - \gamma^2/2 \le 2\gamma \cdot \gamma/8 - \gamma^2/2 = -\gamma^2/4$, and thus

$\sum_{i=1}^m \mathbb{1}\{c(x_i) g(x_i) < \gamma/8\} < m \exp(-p R \gamma^2/4) \le m \cdot m^{-1}$ (as $p \ge 4 R^{-1} \gamma^{-2} \ln m$) $= 1,$

and we can conclude that all points have a margin of at least $\gamma/8$.

C Lower Bound

In this section, we prove Theorem 1.2. Theorem 1.2 is a consequence of the following Theorem C.2. Before we state Theorem C.2 we will: state the assumptions that we make in the lower bound for a learning algorithm $\mathcal{A}$ with parallel complexity $(p, t)$, give the definition of a $\gamma$-weak learner used in this section, and describe the hard instance. For this, let $c: X \to \{-1, 1\}$ denote a labelling function. Furthermore, throughout Appendix C let $C_{\mathrm{size}} := C_s \ge 1$, $C_{\mathrm{bias}} := C_b \ge 1$ and $C_{\mathrm{loss}} := C_l \ge 1$ denote the same universal constants.

Assumption C.1. Let $Q^i$ with $|Q^i| \le t$ be the queries made by a learning algorithm $\mathcal{A}$ with parallel complexity $(p, t)$ during the $i$th round. We assume that a query $Q^i_j \in Q^i$ for $i = 1, \dots, p$ and $j = 1, \dots, t$ is of the form $(S^i_j, c(S^i_j), D^i_j)$, where the elements in $S^i_j$ are contained in $S$, and that the distribution $D^i_j$ has support $\mathrm{supp}(D^i_j) \subseteq \{(S^i_j)_1, \dots, (S^i_j)_m\}$. Furthermore, we assume $Q^1$ only depends on the given sample $S \in X^m$ and the sample labels $c(S)$, where $c(S)_i = c(S_i)$, and that $Q^i$ for $i = 2, \dots, p$ only depends on the labelled sample $S, c(S)$ and the previous $i - 1$ queries and the responses to these queries.

We now clarify what we mean by a weak learner in this section.

Definition 2. A $\gamma$-weak learner $W$ acting on a hypothesis set $H$ takes as input $(S, c(S), D)$, where $S \in \bigcup_{i=1}^\infty X^i$, $c(S)_i = c(S_i)$ and $\mathrm{supp}(D) \subseteq \{S_1, S_2, \dots\}$. The output of $h = W(H)(S, c(S), D)$ is such that $\sum_i D(i) \mathbb{1}\{h(i) \neq c(i)\} \le 1/2 - \gamma$.

We now define the hard instance, which is the same construction as used in Lyu et al. [2024] (which was inspired by Karbasi and Larsen [2024]). For $d \in \mathbb{N}$, sample size $m$, and $0 < \gamma < \frac{1}{4 C_b}$ we consider the following hard instance:

1. The universe $X$ we take to be $[2m]$.
2. The distribution $D$ we will use on $[2m]$ will be the uniform distribution $U$ over $[2m]$.
3. The random concept $c$ that we are going to use is the uniform random concept $c \sim \{-1, 1\}^{2m}$, i.e. all the labels of $c$ are i.i.d. and $\Pr_c[c(i) = 1] = 1/2$ for $i = 1, \dots, 2m$.
4. The random hypothesis set will depend on the number of parallel rounds $p$, a scalar $R \in \mathbb{N}$, and the random concept $c$; thus we will denote it $H^{p,c,R}$. We will view $H^{p,c,R}$ as a matrix where the rows are the hypotheses, i.e. vectors of length $2m$, where the $i$th entry specifies the prediction the hypothesis makes on element $i \in [2m]$. To define $H^{p,c,R}$ we first define two random matrices $H_u$ and $H_c$. $H_u$ is a random matrix consisting of $\lceil R \exp(C_s d) \rceil$ rows, where the rows in $H_u$ are i.i.d. with distribution $r \sim \{-1, 1\}^{2m}$ ($r$ has i.i.d. entries with $\Pr_{r \sim \{-1,1\}^{2m}}[r(1) = 1] = 1/2$). $H_c$ is a random matrix with $R$ rows, where the rows in $H_c$ are i.i.d. with distribution $b \sim \{-1, 1\}^{2m}_{C_b \gamma}$, meaning the entries of $b$ are independent and have distribution $\Pr_{b \sim \{-1,1\}^{2m}_{C_b \gamma}}[b(i) = c(i)] = 1/2 + C_b \gamma$ (so $C_b \gamma$ biased towards the sign of $c$). We now let $H^1_u, H^1_c, \dots, H^p_u, H^p_c$ denote i.i.d. copies of $H_u$ and $H_c$ respectively, and set $H^{p,c,R}$ to be these i.i.d. copies stacked on top of each other, and $H^{p,c,R}_c$ to be the random matrix whose first rows are $H^{p,c,R}$ and whose last row is $c$:

$H^{p,c,R} = \begin{pmatrix} H^1_u \\ H^1_c \\ \vdots \\ H^p_u \\ H^p_c \end{pmatrix}, \qquad H^{p,c,R}_c = \begin{pmatrix} H^{p,c,R} \\ c \end{pmatrix}.$
5. The weak learner $W$, which given a matrix/hypothesis set $M \in \mathbb{R}^{\ell \times 2m}$ (where $M_{i,\cdot}$ denotes the $i$th row of $M$) is the following algorithm $W(M)$.

Algorithm 2: $W(M)$
Input: Triple $(S, c(S), D)$ where $S \in [2m]^*$, $c(S)_i = c(S_i)$ and probability distribution $D$ with $\mathrm{supp}(D) \subseteq \{S_1, S_2, \dots\}$.
Output: Hypothesis $h = M_{i,\cdot}$ for some $i = 1, \dots, \ell$ such that $\sum_j D(j) \mathbb{1}\{M_{i,j} \neq c(j)\} \le 1/2 - \gamma$.
1 for $i \in [\ell]$ do
2   if $\sum_j D(j) \mathbb{1}\{M_{i,j} \neq c(j)\} \le 1/2 - \gamma$ then  // Notice that $W$ doesn't know $c$ but can calculate this quantity using the information in $(S, c(S), D)$ which is given as input.
3     return $M_{i,\cdot}$
4 return $M_{1,\cdot}$

We notice that with this construction we have $|H^{p,c,R}| \le p\left(\lceil R \exp(C_s d) \rceil + R\right)$, and $W(H^{p,c,R}_c)$ is a weak learner, since it either finds a row in $H^{p,c,R}$ with error less than $1/2 - \gamma$ for a query or outputs $c$, which has 0 error for any query; this follows by Assumption C.1, which ensures that the learning algorithm, given $(S, c(S))$, makes queries which are consistent with $c$. With these definitions and notation in place, we now state Theorem C.2, of which Theorem 1.2 is a consequence.

Theorem C.2. For $d \in \mathbb{N}$, $m \in \mathbb{N}$, margin $0 < \gamma < \frac{1}{4 C_b}$, $R, p, t \in \mathbb{N}$, universe $[2m]$, $U$ the uniform distribution on $[2m]$, and $c$ the uniform random concept on $[2m]$, any learning algorithm $\mathcal{A}$ with parallel complexity $(p, t)$, given labelled training set $(S, c(S))$, where $S \sim U^m$, and query access to $W(H^{p,c,R}_c)$, satisfies

$\mathbb{E}_{S,c,H}\!\left[L^c_U(\mathcal{A}(S, c(S), W(H^{p,c,R}_c)))\right] \ge \frac{\exp(-C_l C_b^2 \gamma^2 R p)}{4 C_l} \left(1 - \exp\!\left(-\frac{m \exp(-C_l C_b^2 \gamma^2 R p)}{8 C_l}\right) - p t \exp(-R d)\right).$

We now restate and give the proof of Theorem 1.2.

Theorem 1.2. There is a universal constant $C \ge 1$ for which the following holds. For any $0 < \gamma < 1/C$, any $d \ge C$, any sample size $m \ge C$, and any weak-to-strong learner $\mathcal{A}$ with parallel complexity $(p, t)$, there exists an input domain $X$, a distribution $D$, a concept $c: X \to \{-1, 1\}$, and a $\gamma$-weak learner $W$ for $c$ using a hypothesis set $H$ of VC-dimension $d$ such that if the expected loss of $\mathcal{A}$ over the sample is no more than $m^{-0.01}$, then either $p \ge \min\{\exp(\Omega(d)), \Omega(\gamma^{-2} \ln m)\}$, or $t \ge \exp(\exp(\Omega(d)))$, or $p \ln t = \Omega(\gamma^{-2} d \ln m)$.

Proof of Theorem 1.2. Fix $d \ge 1$, sample size $m \ge (80 e C_l)^{100}$, margin $0 < \gamma \le \frac{1}{4 C_b}$, $p$ such that

$p \le \min\!\left\{\exp(d/8), \, \frac{\ln(m^{0.01}/(80 C_l))}{2 C_l C_b^2 \gamma^2}\right\},$

$t \le \exp(\exp(d)/8)$, and $p \ln(t) \le \frac{d \ln(m^{0.01}/(80 C_l))}{8 C_l C_b^2 \gamma^2}$. We now want to invoke Theorem C.2 with different values of $R$ depending on the value of $p$. We consider two cases. Firstly, the case

$\frac{\ln(m^{0.01}/(80 C_l))}{2 C_l C_b^2 \gamma^2 \exp(d)} \le p \le \frac{\ln(m^{0.01}/(80 C_l))}{2 C_l C_b^2 \gamma^2}.$

In this case one can choose $R \in \mathbb{N}$ such that $1 < R \le \exp(d)$ and

$\frac{\ln(m^{0.01}/(80 C_l))}{2 C_l C_b^2 \gamma^2 R} \le p \le \frac{\ln(m^{0.01}/(80 C_l))}{2 C_l C_b^2 \gamma^2 (R-1)}.$

Let now $R$ be such. We now invoke Theorem C.2 with the above parameters and get

$\mathbb{E}_{S,c,H}\!\left[L^c_U(\mathcal{A}(S, c(S), W(H^{p,c,R}_c)))\right] \ge \frac{\exp(-C_l C_b^2 \gamma^2 R p)}{4 C_l} \left(1 - \exp\!\left(-\frac{m \exp(-C_l C_b^2 \gamma^2 R p)}{8 C_l}\right) - p t \exp(-R d)\right).$ (18)

We now bound the individual terms on the right-hand side of Eq. (18). Firstly, since

$p \le \frac{\ln(m^{0.01}/(80 C_l))}{2 C_l C_b^2 \gamma^2 (R-1)} \le \frac{\ln(m^{0.01}/(80 C_l))}{C_l C_b^2 \gamma^2 R},$

we get that $\frac{\exp(-C_l C_b^2 \gamma^2 R p)}{4 C_l} \ge 20 m^{-0.01}$, which further implies that

$\exp\!\left(-\frac{m \exp(-C_l C_b^2 \gamma^2 R p)}{8 C_l}\right) \le \exp(-10 m^{0.99}) \le e^{-10}.$

We further notice that for $R$ as above we have that $p \ln(\exp(R d/4)) \ge \frac{d \ln(m^{0.01}/(80 C_l))}{8 C_l C_b^2 \gamma^2}$. This implies that $t \le \exp(R d/4)$, since else we would have $t > \exp(R d/4)$ and $p \ln(t) > p \ln(\exp(R d/4)) \ge \frac{d \ln(m^{0.01}/(80 C_l))}{8 C_l C_b^2 \gamma^2}$, which is a contradiction with our assumption that $p \ln(t) \le \frac{d \ln(m^{0.01}/(80 C_l))}{8 C_l C_b^2 \gamma^2}$. Since we also assumed that $p \le \exp(d/8)$, we have that $p t \le \exp(d/8 + R d/4)$. Combining this with $R > 1$ and $d \ge 1$ we have that $p t \exp(-R d) \le \exp(-R d/2) \le e^{-1}$. Combining the above observations we get that the right-hand side of Eq.
(18) is at least

$\mathbb{E}_{S,c,H}\!\left[L^c_U(\mathcal{A}(S, c(S), W(H^{p,c,R}_c)))\right] \ge 20 m^{-0.01} \left(1 - e^{-10} - e^{-1}\right) \ge m^{-0.01}.$

Now in the case that

$p < \frac{\ln(m^{0.01}/(80 C_l))}{2 C_l C_b^2 \gamma^2 \exp(d)},$ (19)

we choose $R = \lceil \exp(d) \rceil$. Invoking Theorem C.2 again gives us the expression in Eq. (18) (with the parameter $R = \lceil \exp(d) \rceil$ now) and we again proceed to lower bound the right-hand side of Eq. (18). First we observe that by the upper bound on $p$ in Eq. (19), $R = \lceil \exp(d) \rceil$, and $\lceil x \rceil \le 2x$ for $x \ge 1$, we get that

$\frac{\exp(-C_l C_b^2 \gamma^2 R p)}{4 C_l} \ge \frac{\exp(-\ln(m^{0.01}/(80 C_l)))}{4 C_l} = 20 m^{-0.01},$

which further implies that

$\exp\!\left(-\frac{m \exp(-C_l C_b^2 \gamma^2 R p)}{8 C_l}\right) \le e^{-10}.$

Now since $\lceil x \rceil \ge x$ for $x \ge 1$, $R = \lceil \exp(d) \rceil$, and we assumed that $t \le \exp(\exp(d)/8)$ and $p \le \exp(d/8)$, we get that

$p t \exp(-R d) \le \exp\!\left(\exp(d)/8 + d/8 - d \exp(d)\right) \le \exp(-d \exp(d)/4) \le e^{-e/4}.$

Combining the above observations we get that the right-hand side of Eq. (18) is at least

$\mathbb{E}_{S,c,H}\!\left[L^c_U(\mathcal{A}(S, c(S), W(H^{p,c,R}_c)))\right] \ge 20 m^{-0.01} \left(1 - e^{-10} - e^{-e/4}\right) \ge m^{-0.01}.$

Thus, for any of the above parameters $d, m, \gamma, p, t$ in the specified parameter ranges, we have that the expected loss of $\mathcal{A}$ over $S, c, H^{p,c,R}$ is at least $m^{-0.01}$, so there exists a concept $c$ and hypothesis set $H$ such that the expected loss of $\mathcal{A}$ over $S$ is at least $m^{-0.01}$. Furthermore, if $\mathcal{A}$ were a randomized algorithm, Yao's minimax principle would give the same lower bound for the expected loss over $\mathcal{A}$ and $S$, as the above bound holds for any deterministic $\mathcal{A}$.

Now, as remarked before the proof, the size of the hypothesis set $H^{p,c,R}$ is at most $|H^{p,c,R}| \le p(\lceil R \exp(C_s d) \rceil + R)$, see Item 4. Combining this with the bounds $p \le \exp(d/8)$ and $R \le \lceil \exp(d) \rceil$ used in the above arguments, we conclude that $|H_c| \le \exp(\hat{C} d/2)$ for $\hat{C}$ large enough. Thus, we get a bound of $\log_2(|H_c|) \le \log_2(\exp(\hat{C} d/2)) \le \hat{C} d$, which is also an upper bound on the VC-dimension of $H_c$. Now, redoing the above arguments with $d$ scaled by $1/\hat{C}$, we get that the VC-dimension of $H_c$ is upper bounded by $d$ with the same expected loss of $m^{-0.01}$. The constraints given in the start of the proof with this rescaling of $d$ are now $d \ge \hat{C}$, $m \ge (80 e C_l)^{100}$, $0 < \gamma \le \frac{1}{4 C_b}$, $p \le \min\{\exp(d/(8\hat{C})), \frac{\ln(m^{0.01}/(80 C_l))}{2 C_l C_b^2 \gamma^2}\}$, $t \le \exp(\exp(d/\hat{C})/8)$ and $p \ln(t) \le \frac{d \ln(m^{0.01}/(80 C_l))}{8 \hat{C} C_l C_b^2 \gamma^2}$. Thus, with the universal constant $C = \max\{(80 e C_l)^{100}, 4 C_b, \hat{C}\}$ and $m, d \ge C$ and $\gamma \le 1/C$, we have that the expected loss is at least $m^{-0.01}$ when $p \le \min\{\exp(O(d)), O(\ln(m)/\gamma^2)\}$, $t \le \exp(\exp(O(d)))$ and $p \ln(t) \le O(d \ln(m)/\gamma^2)$, which concludes the proof.

We now move on to prove Theorem C.2. For this, we introduce what we will call the extension of $\mathcal{A}$, which terminates if it receives a hypothesis with loss more than $1/2 - \gamma$. We further show two results about this extension: one saying that with high probability we can replace $\mathcal{A}$ with its extension, and another saying that with high probability the loss of the extension is large; combined, these give us Theorem C.2.

6. The output of the extension $B_{\mathcal{A}}$ of $\mathcal{A}$ on input $(S, c(S), W)$ is given through the outcome of recursive query sets $Q^1, \dots$, where each of the sets contains $t$ queries. The recursion is given in the following way: Make $Q^1$ to $W$ as $\mathcal{A}$ would have done on input $(S, c(S), \cdot)$ (this is possible by Assumption C.1, which says $Q^1$ is a function of only $(S, c(S))$). For $i = 1, \dots, p$ such that for all $j = 1, \dots, t$ it is the case that $W(Q^{i-1}_j)$ has loss less than $1/2 - \gamma$ under $D^{i-1}_j$, let $Q^i$ be the query set that $\mathcal{A}$ would have made after having made query sets $Q^1, \dots, Q^{i-1}$ and received hypotheses $\{W(Q^l_j)\}_{(l,j) \in [i-1] \times [t]}$. If this loop ends, output the hypothesis that $\mathcal{A}$ would have output given responses $\{W(Q^l_j)\}_{(l,j) \in [i] \times [t]}$ to its queries.
If there is an $l, j$ such that $W(Q^l_j)$ returns a hypothesis with loss larger than $1/2 - \gamma$, return the all-1s hypothesis.

We now give the two results we need in the proof of Theorem C.2. The first result, Corollary 1, says that there exists an event $E$ which happens with high probability over $H^{p,c,R}$ such that $\mathcal{A}$ run with $W(H^{p,c,R}_c)$ is the same as $B_{\mathcal{A}}$ run with $W(H^{p,c,R})$. This corollary can be proved by following the proofs of Theorems 5 and 8 in Lyu et al. [2024] and is thus not included here.

Corollary 1. For $d \in \mathbb{N}$, $m \in \mathbb{N}$, margin $0 < \gamma < \frac{1}{4 C_b}$, labelling function $c: [2m] \to \{-1, 1\}$, $R, p, t \in \mathbb{N}$, random matrix $H^{p,c,R}$, learning algorithm $\mathcal{A}$ with extension $B_{\mathcal{A}}$, and training sample $S \in [2m]^m$, there exists an event $E$ over outcomes of $H^{p,c,R}$ such that

$\mathcal{A}(S, c(S), W(H^{p,c,R}_c)) \mathbb{1}_E = B_{\mathcal{A}}(S, c(S), W(H^{p,c,R})) \mathbb{1}_E$

and $\Pr_{H^{p,c,R}}[E] \ge 1 - p t \exp(-R d)$.

The second result that we are going to need is Lemma C.3, which relates the parameters $R$, $p$, and the bias $C_b \gamma$ to the success of any function of $(S, c(S), H^{p,c,R})$ that tries to guess the signs of $c$, which is the number of failures in our hard instance. For a training sample $S \in [2m]^m$ we will use $|S|$ to denote the number of distinct elements in $S$ from $[2m]$, so for $S \in [2m]^m$ we have $|S| \le m$.

Lemma C.3. There exist universal constants $C_s, C_l \ge 1$ such that: for $m \in \mathbb{N}$, $p \in \mathbb{N}$, random matrix $H^{p,c,R}$, and function $B$ that takes as input $S \in [2m]^m$ with labels $c(S)$ and the hypothesis set $H^{p,c,R}$, we have that

$\Pr_{c, H^{p,c,R}}\!\left[\sum_{i=1}^{2m} \mathbb{1}\{B(S, c(S), H^{p,c,R})(i) \neq c(i)\} \ge \frac{(2m - |S|) \exp(-C_l C_b^2 \gamma^2 R p)}{2 C_l}\right] \ge 1 - \exp\!\left(-\frac{(2m - |S|) \exp(-C_l C_b^2 \gamma^2 R p)}{8 C_l}\right).$

We postpone the proof of Lemma C.3 and now give the proof of Theorem C.2.

Proof of Theorem C.2. We want to lower bound $\mathbb{E}_{S,c,H}[L^c_U(\mathcal{A}(S, c(S), W(H^{p,c,R}_c)))]$. To this end, since $S$ and $c$ are independent and $H^{p,c,R}$ depends on $c$, the expected loss can be written as

$\mathbb{E}_{S,c,H}\!\left[L^c_U(\mathcal{A}(S, c(S), W(H^{p,c,R}_c)))\right] = \mathbb{E}_S\!\left[\mathbb{E}_c\!\left[\mathbb{E}_{H^{p,c,R}}\!\left[L^c_U(\mathcal{A}(S, c(S), W(H^{p,c,R}_c)))\right]\right]\right].$

Now let $S \in [2m]^m$ and $c$ be any outcomes of $S$ and $c$. Then for these $S, c$ we have by Corollary 1 that there exists some event $E$ over $H^{p,c,R}$ such that

$L^c_U(\mathcal{A}(S, c(S), W(H^{p,c,R}_c))) \mathbb{1}_E = L^c_U(B_{\mathcal{A}}(S, c(S), W(H^{p,c,R}))) \mathbb{1}_E$

and $\Pr_{H^{p,c,R}}[E] \ge 1 - p t \exp(-R d)$. Furthermore, define $E'$ to be the event that

$\sum_{i=1}^{2m} \mathbb{1}\{B_{\mathcal{A}}(S, c(S), W(H^{p,c,R}))(i) \neq c(i)\} \ge \frac{(2m - |S|) \exp(-C_l C_b^2 \gamma^2 R p)}{2 C_l}.$

Using the above, the fact that $U$ is the uniform measure on $[2m]$ (so it assigns $1/(2m)$ mass to every point), and $|S| \le m$, we now get that

$\mathbb{E}_{H^{p,c,R}}\!\left[L^c_U(\mathcal{A}(S, c(S), W(H^{p,c,R}_c)))\right] \ge \mathbb{E}_{H^{p,c,R}}\!\left[L^c_U(\mathcal{A}(S, c(S), W(H^{p,c,R}_c))) \mathbb{1}_E \mathbb{1}_{E'}\right] = \mathbb{E}_{H^{p,c,R}}\!\left[L^c_U(B_{\mathcal{A}}(S, c(S), W(H^{p,c,R}))) \mathbb{1}_E \mathbb{1}_{E'}\right] \ge \frac{(2m - |S|) \exp(-C_l C_b^2 \gamma^2 R p)}{4 C_l m} \mathbb{E}_{H^{p,c,R}}\!\left[\mathbb{1}_E \mathbb{1}_{E'}\right] \ge \frac{\exp(-C_l C_b^2 \gamma^2 R p)}{4 C_l} \left(1 - \Pr_{H^{p,c,R}}\!\left[\overline{E'}\right] - p t e^{-R d}\right).$

We can do this for any pair $c$ and $S \in [2m]^m$, so we have that

$\mathbb{E}_c\!\left[\mathbb{E}_{H^{p,c,R}}\!\left[L^c_U(\mathcal{A}(S, c(S), W(H^{p,c,R}_c)))\right]\right] \ge \frac{\exp(-C_l C_b^2 \gamma^2 R p)}{4 C_l} \left(1 - \Pr_{c,H^{p,c,R}}\!\left[\overline{E'}\right] - p t \exp(-R d)\right).$

Now by Lemma C.3 and $|S| \le m$ we have that $\Pr_{c,H^{p,c,R}}[\overline{E'}]$ is at most

$\Pr_{c,H^{p,c,R}}\!\left[\overline{E'}\right] \le \exp\!\left(-\frac{(2m - |S|) \exp(-C_l C_b^2 \gamma^2 R p)}{8 C_l}\right) \le \exp\!\left(-\frac{m \exp(-C_l C_b^2 \gamma^2 R p)}{8 C_l}\right).$

That is, we have shown that

$\mathbb{E}_c\!\left[\mathbb{E}_{H^{p,c,R}}\!\left[L^c_U(\mathcal{A}(S, c(S), W(H^{p,c,R}_c)))\right]\right] \ge \frac{\exp(-C_l C_b^2 \gamma^2 R p)}{4 C_l} \left(1 - \exp\!\left(-\frac{m \exp(-C_l C_b^2 \gamma^2 R p)}{8 C_l}\right) - p t \exp(-R d)\right)$

for any $S \in [2m]^m$. Now by taking the expectation over $S \sim U^m$ we get the same lower bound on $\mathbb{E}_{S,c,H}[L^c_U(\mathcal{A}(S, c(S), W(H^{p,c,R}_c)))]$, which concludes the proof.

We now prove Lemma C.3, which is a consequence of maximum likelihood estimation and the following Fact 1. Fact 1 gives a lower bound on how well one can, from $n$ trials of a biased $\{-1, 1\}$ random variable, where the direction of the bias itself is random, guess this random direction of the bias.

Fact 1.
Proof. This is the classic coin problem. The lower bound follows by first observing, by maximum likelihood, that the function $f$ minimizing the above error is the majority function. The result then follows from the tightness of the Chernoff bound up to constant factors in the exponent.

With Fact 1 in place, we are now ready to prove Lemma C.3.

Proof of Lemma C.3. Let $H_{C_b}$ be the matrix consisting of the i.i.d. $C_b\gamma$-biased matrices $H_1^c, \ldots, H_p^c$ in $H_{p,c,R}$, stacked on top of each other:
$$H_{C_b} = \begin{pmatrix} H_1^c \\ \vdots \\ H_p^c \end{pmatrix}.$$
Furthermore, for $i = 1, \ldots, 2m$, let $H_{C_b,i}$ denote the $i$-th column of $H_{C_b}$, which is a vector of length $pR$. Now, for $i$ inside $S$, $B$ is given the sign of $c(i)$, so the best $B$ can do is output $c(i)$. For $i$ outside $S$, $B$ does not know $c(i)$ from its input but has information about it through $H_{C_b,i}$: the signs of the hypotheses in $H_1^u, \ldots, H_p^u$ and in $H_{C_b,j}$ for $j \neq i$, as well as $c(S)$, are independent of $c(i)$ and hold no information about it. Thus, the best possible answer $B$ can give is the majority of the signs in $H_{C_b,i}$, which is the maximum-likelihood estimator. We now assume that $B$ is this "best" function, as the number of failures of this function lower bounds that of any other $B$, so it suffices to show the lower bound for this $B$.

With this $B$, let
$$X = \sum_{i=1}^{2m} \mathbb{1}\{B(S, c(S), H_{p,c,R})(i) \neq c(i)\}.$$
Then $X$ is a sum of $2m - |S|$ independent $\{0,1\}$-valued random variables (recall that $|S|$ is the number of distinct elements in $S$), and by Fact 1 the expectation of each of these random variables is at least $\exp(-C_l C_b^2 \gamma^2 Rp)/C_l$, so
$$\mathbb{E}_{c,H_{p,c,R}}[X] \geq (2m-|S|)\exp(-C_l C_b^2 \gamma^2 Rp)/C_l.$$
Thus, by a Chernoff bound, we get that
$$\Pr_{c,H_{p,c,R}}\left[X \geq \frac{(2m-|S|)\exp(-C_l C_b^2 \gamma^2 Rp)}{2C_l}\right] \geq \Pr_{c,H_{p,c,R}}\left[X \geq \mathbb{E}[X]/2\right] \geq 1 - \exp(-\mathbb{E}[X]/8) \geq 1 - \exp\left(-\frac{(2m-|S|)\exp(-C_l C_b^2 \gamma^2 Rp)}{8C_l}\right),$$
as claimed.
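As a concrete illustration of the estimator used in this proof (hypothetical data layout and names; not code from the paper), the following numpy sketch implements the "best" guesser $B$: copy the known label on sampled points and take the columnwise majority of the stacked biased-sign matrix everywhere else.

```python
import numpy as np

def best_guesser(S, c_on_S, H_cb):
    """Sketch of the maximum-likelihood guesser B from the proof of
    Lemma C.3. H_cb is the (p*R) x 2m matrix obtained by stacking the
    C_b*gamma-biased sign matrices H^c_1, ..., H^c_p; S is an array of
    training indices in [0, 2m) with corresponding labels c_on_S."""
    guess = np.sign(H_cb.sum(axis=0))  # majority of the signs in each column
    guess[guess == 0] = -1             # break ties (arbitrarily) towards -1
    guess[np.asarray(S)] = c_on_S      # on S, the label c(i) is given
    return guess
```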
NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The theoretical claims in the introduction and abstract exactly match what we prove in the paper and claim as the scope and contributions.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [NA]
Justification: The paper is theoretical, so given the assumptions of the claims, there are no limitations.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: The main paper includes a proof sketch for both the upper and the lower bound. Parts of the formal proofs of the upper and lower bounds are in the main text, and the remainder is in the appendix.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [NA]
Justification: The paper is theoretical, and we have no experiments, data or code in the paper.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [NA]
Justification: The paper is theoretical, and we have no experiments, data or code in the paper.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [NA]
Justification: The paper is theoretical, and we have no experiments, data or code in the paper.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [NA]
Justification: The paper is theoretical, and we have no experiments, data or code in the paper.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: The paper is theoretical, and we have no experiments, data or code in the paper.
9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: We see no violations of the NeurIPS Code of Ethics.

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: The result of the paper is theoretical/foundational research, so we have no experiments, data or code in the paper.

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The paper is theoretical, and we have no experiments, data or code in the paper.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited, and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: The paper is theoretical, and we have no experiments, data or code in the paper.

13. New Assets
Question: Are new assets introduced in the paper well documented, and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper is theoretical, and we have no experiments, data or code in the paper.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper is theoretical, and we have no experiments, data or code in the paper.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper is theoretical, and we have no experiments, data or code in the paper.