Sample-Optimal Agnostic Boosting with Unlabeled Data

Udaya Ghai 1  Karan Singh 2

Abstract

Boosting provides a practical and provably effective framework for constructing accurate learning algorithms from inaccurate rules of thumb. It extends the promise of sample-efficient learning to settings where direct Empirical Risk Minimization (ERM) may not be implementable efficiently. In the realizable setting, boosting is known to offer this computational reprieve without compromising on sample efficiency. However, in the agnostic case, existing boosting algorithms fall short of achieving the optimal sample complexity. We highlight a previously unexplored avenue of improvement: unlabeled samples. We design a computationally efficient agnostic boosting algorithm that matches the sample complexity of ERM, given polynomially many additional unlabeled samples. In fact, we show that the total number of samples needed, unlabeled and labeled inclusive, is never more than that for the best known agnostic boosting algorithm, so this result is never worse, while only a vanishing fraction of these samples need to be labeled for the algorithm to succeed. This is particularly fortuitous for learning-theoretic applications of agnostic boosting, which often take place in the distribution-specific setting, where unlabeled samples can be availed for free. We also prove that the resultant guarantee is resilient against mismatch between the distributions governing the labeled and unlabeled samples. Finally, we detail an application of this result in reinforcement learning.

*Equal contribution. 1Amazon, NYC. 2Tepper School of Business, Carnegie Mellon University. Correspondence to: Karan Singh .

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025.

1. Introduction

The methodology of boosting, starting with the work of Schapire (1990), has deep roots both in the practice and
theory of machine learning. In practice, combining classifiers trained on adaptively weighted data points to highlight past mistakes has proven to be a powerful idea. In theory, boosting provides a highly general and provably efficient framework to convert learning algorithms that are ever so slightly better than random into accurate learners.

Copyright 2025 by the author(s).

The historical origins of boosting began with the realization that although Empirical Risk Minimization (ERM), that is, the act of choosing the hypothesis that best fits the observed data, provides a general scheme for statistical learning, for many hypothesis classes it may not be approachable in terms of computational efficiency. Often instead, in theory and practice, it is possible to quickly construct weakly accurate rules-of-thumb (or weak learners). Can one use such weak learners to construct an accurate learning algorithm? Boosting provides an affirmative answer to this question by giving a principled way of combining many weak learners, each incrementally focusing on regions of the sample space where others have failed. Part of the original motivation here, as laid out by Kearns & Valiant (1994), was that such an aggregation of hypotheses would no longer be in-class (in modern parlance, improper) and hence could potentially circumvent the computational hardness of direct ERM. Interestingly, Kearns & Valiant (1994) were interested in the representation-independent hardness of learning, that is, the hardness that persists independently of algorithmic parameterization. However, the main impact of the boosting methodology has been as a positive theory, that is, in constructing powerful new learning algorithms, a view that the present work also takes.

A natural follow-up question is: Does the computational reprieve that boosting provides come at a cost? Here, we are primarily concerned with sample efficiency.
In the realizable case, that is, when a perfect classifier exists, the general answer is no. This serves to reaffirm our faith in the realizable boosting framework. The celebrated Adaboost algorithm (Freund & Schapire, 1997), for instance, achieves the same (near-optimal) sample complexity as ERM for VC classes¹. In fact, Green Larsen & Ritzert (2022) recently showed that for a carefully designed variant the equivalence holds even up to logarithmic terms.

¹ But also see the note in Section 2.1 in Blanc et al. (2024) for potentially a gap given arbitrary weak learners.

Let us now move to the agnostic case, where we make no assumptions about the conditional distribution of labels given features; in many ways, this has become the default stance in machine learning. Firstly, the classical notion of realizable weak learning requires that weak learners attain a 0/1 loss strictly better than half, or equivalently, a correlation strictly better than zero on signed labels. However, such notions are unattainable in the agnostic setting. Early works on agnostic boosting using weak learning definitions that mended this issue did not produce results that were additively competitive with the underlying hypothesis class; thus, a direct comparison to ERM remained elusive. Following this, Kanade & Kalai (2009) gave a computationally efficient agnostic boosting algorithm that in fact achieves ε-excess population loss compared to the best-in-class hypothesis, mirroring the guarantee ERM provides when coupled with generalization bounds (for example, in Shalev-Shwartz & Ben-David (2014)). However, the sample complexity of their boosting algorithm scales as (log |H|)/ε⁴, where H is the (let us say for now, finite²) hypothesis class. A recent result of Ghai & Singh (2024) provides a booster that requires (log |H|)/ε³ samples.
However, both exhibit significantly higher sample complexity than the familiar (log |H|)/ε² required for ERM, which is also optimal. Although agnostic boosting has seen considerable recent progress both in methodology, for example, through extensions to online (Brukhim et al., 2020) and multiclass (Raman & Tewari, 2022) settings, and in finding new applications (Brukhim et al., 2022; Kothari & Livni, 2018), this limitation persists.

² Purely for ease of presentation, in the introduction, we stick to finite classes and drop dependencies on the weak learning edge and the failure probability, later denoted by γ and δ respectively. In fact, our results hold for infinite VC classes. See Theorem 3.1.

Figure 1. The potential ϕ(z, 1)/2 compared against the Madaboost potential ϕMADA(z, 1) (Domingo, 2000), along with the Huber loss ψ(z).

In this work, we highlight an unexplored avenue of improvement: unlabeled samples. As usual, to measure the statistical efficiency of the learner, we shall keep track of the number of labeled samples it needs. However, we also imbue the learner with the ability to additionally draw polynomially many unlabeled samples from the underlying feature distribution. Under this stipulation, our main result is a computationally efficient agnostic boosting algorithm with (labeled) sample complexity scaling as (log |H|)/ε², and therefore for the first time matching that of ERM. This is fortuitous and relevant both on practical and theoretical fronts. In practice, it is widely acknowledged that, in many domains, unlabeled samples are substantially cheaper to collect than labeled ones. Furthermore, as exemplified by previous works, many applications of agnostic boosting in learning theory take place in distribution-specific settings where unlabeled samples can be drawn for free. Here, we expressly utilize the fact that the distribution on features is
parsimonious, for example, uniformly distributed or Gaussian, to guarantee that simple classifiers can act as weak learners for the hypothesis class of interest, utilizing tools from L1-approximation theory. Since the feature distributions are explicitly specified in these settings, even closing the aforementioned gap would yield results no better than those we obtain. We offer further refinements and applications of our main result, as we detail next.

1.1. Overview of contributions and techniques

Sample-optimal boosting with unlabeled data. Our first and main result is a computationally efficient agnostic boosting algorithm that produces a classifier with ε excess error, given (log |H|)/ε² labeled and (log |H|)/ε⁴ unlabeled samples. Our labeled sample complexity is essentially optimal. Using unlabeled data to accelerate boosting is a novel idea. Our work also differs qualitatively from semi-supervised learning methods (Chapelle et al., 2006), which often require a tight clustering of the data or smoothness of the labels, in that it does not require additional assumptions.

Despite differing in algorithmic techniques, previous works on agnostic boosting (Kanade & Kalai, 2009; Ghai & Singh, 2024) almost invariably use the Madaboost potential (Domingo, 2000) or a closely related variant:

ϕMADA(z, y) = e^{−zy} if zy ≥ 0, and 1 − zy if zy < 0,

where z is a real-valued prediction of the discrete label y. Our key innovation is to design a new bivariate potential

ϕ(z, y) = −yz + |z| − 1/2 if |z| > 1, and −yz + z²/2 if |z| ≤ 1,

whose derivative (with respect to z) is decomposable as one term exclusively dependent on the label and another exclusively involving the features. Thus, by linearity of expectation, the population analogues of these parts can be estimated separately with labeled and unlabeled data.
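Concretely, writing ϕ′(z, y) = ψ′(z) − y, the directional derivative E[ϕ′(H(x), y)·h(x)] splits, by linearity, into a feature-only term E_x[ψ′(H(x))h(x)] and a label term E[y·h(x)] that does not involve H. A minimal numerical illustration of this split (toy arrays and hypothetical helper names, not the paper's code):

```python
import numpy as np

def psi_prime(z):
    # Derivative of the Huber loss: clip(z, -1, 1).
    return np.clip(z, -1.0, 1.0)

def phi_prime(z, y):
    # Derivative in z of the separable potential phi(z, y) = psi(z) - y*z.
    return psi_prime(z) - y

# Toy data: ensemble margins H(x), weak hypothesis values h(x), labels y.
rng = np.random.default_rng(0)
H = rng.normal(size=1000)                # ensemble predictions H(x)
h = rng.choice([-1.0, 1.0], size=1000)   # weak hypothesis h(x)
y = rng.choice([-1.0, 1.0], size=1000)   # labels in {-1, +1}

# The joint estimate of the directional derivative E[phi'(H(x), y) h(x)] ...
joint = np.mean(phi_prime(H, y) * h)
# ... equals a label-free part minus an H-free labeled part, by linearity.
unlabeled_part = np.mean(psi_prime(H) * h)   # needs only features
labeled_part = np.mean(y * h)                # needs only labels, never H
assert abs(joint - (unlabeled_part - labeled_part)) < 1e-12
```

The first term can thus be estimated from unlabeled draws of the feature marginal alone, and the second from a fixed labeled pool, since it does not change as the ensemble grows.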
Furthermore, it is essential that the labeled-data part, or formally its derivative, does not depend on the ensemble whose value is being assessed; this allows us to reuse the same set of labeled data across all rounds of boosting, unlike Kanade & Kalai (2009), without needing a uniform convergence argument as in Brukhim et al. (2020), or a martingale argument as in Ghai & Singh (2024).

This non-conforming potential function violates some key tenets of prior work. For example, the centerpiece of the Madaboost potential is that it, or formally, its functional derivative, does not downweight points that are misclassified by the ensemble. This property is best captured in the definition of a conservative relabeling in Kanade & Kalai (2009). Intuitively, it makes sense; one wants not to withdraw any focus from wrongly classified data points. Not only is this false for us, it is incompatible with the requirement of having a (smooth) separable potential function of the manner we have just described. This can be seen in Figure 1: the negative slope of the potential, always between 0 and 1, specifies the weight of each sample. The negative slope of Madaboost on the negative domain is always 1, but for us this is not true: our potential curves upward between −1 and 0.

Improved unlabeled sample efficiency. Since our main innovation, as described above, is orthogonal to previous approaches, it can be layered on top of existing algorithmic techniques. Specifically, applying techniques from Ghai & Singh (2024) selectively to the part of the potential that arises from unlabeled data, we further reduce the number of unlabeled samples needed to (log |H|)/ε³. With this result, our total number of samples (labeled and unlabeled) matches the number of labeled samples needed by the previous best result, yet for us, only a vanishing, concretely ε, fraction of these needs to be labeled.

Resilience to covariate shift.
Since, in addition to the labeled dataset, our algorithm has access to a stream of unlabeled examples, a natural practical concern may be that the underlying law governing the latter's generation may not exactly match the distribution of features in the labeled examples. Unlike realizable learning, where every sample helps to narrow down to the correct classifier, agnostic learning is often brittle against changes in the covariate distribution, since no classifier is the best in all regions of the feature space. However, we show that if there is a mismatch between the labeled and unlabeled distributions available to the learner, our learner still succeeds in learning an arbitrarily accurate classifier as long as these distributions have the same support on the feature space. Moreover, the labeled sample complexity is unaffected by such a mismatch. We point out that for the covariate shift setting, one must suitably generalize the measure of the progress being made in each round of boosting. This requires a few changes, the first among which is that the potential function is no longer the population (expectation) version of a scalar potential. We keep track of the progress on the labeled and unlabeled distributions separately.

Applications. We apply our results to boosting for reinforcement learning, where we reduce the required number of reward-annotated episodes, and to learning halfspaces.

1.2. Related work

The early theory of boosting was developed in a sequence of papers (Freund, 1995; Schapire & Singer, 1998; Bartlett et al., 1998) starting with Schapire (1990), leading to a breakthrough in the form of Adaboost (Freund & Schapire, 1997). Even earlier, the possibility of boosting was posed in Kearns (1988); Kearns & Valiant (1994). A comprehensive survey can be found in Schapire & Freund (2013). Early boosting algorithms were quite sensitive to noise (Long & Servedio, 2008; Domingo, 2000).
To mitigate this, boosting was then studied in the random classification noise model (Diakonikolas et al., 2021; Kalai & Servedio, 2003). Agnostic boosting started with Ben-David et al. (2001); Mansour & McAllester (2002); Kalai et al. (2008); Kale (2007); Chen et al. (2016), where new types of weak learners suitable for the agnostic setting were defined. Kanade & Kalai (2009) (see also Feldman (2010)) gave a weak learning definition that led to additive excess error guarantees. Boosting in the online setting is also well studied (Beygelzimer et al., 2015; Chen et al., 2012; Jung et al., 2017; Brukhim et al., 2020; Raman & Tewari, 2022; Hazan & Singh, 2021). See Alon et al. (2021); Green Larsen & Ritzert (2022); Lyu et al. (2024) for more recent work.

An ostensible alternative to our approach, especially given that realizable boosting achieves a near-optimal sample complexity, is to reduce agnostic boosting to the realizable case à la Hopkins et al. (2022). The concurrent work of da Cunha et al. (2025) pursues this and, in fact, obtains much more refined fat-shattering bounds. Although statistically optimal, this reduction requires pruning the hypothesis class by enumeration while considering all possible labelings of samples, and hence takes exponential time, rivaling ERM. In contrast, the present work offers computationally efficient algorithms that run in polynomial time, a requirement that has been central to the theory of boosting since its origin in computational learning theory.

2. Problem setting

We consider the fundamental setting of binary classification, where X represents the set of feature descriptions, and the possible labels are {±1}. There is an underlying joint distribution D over X × {±1}, which, while unknown to the learner, is crucial to determining its performance. Let DX
Given any binary classifier h : X { 1}, its success on D can be measured as corr D(h) = E (x,y) D [yh(x)] . Correlation can readily be translated to the more commonly used metric, the 0/1-loss l0/1 D (h) = E(x,y) D 1y =h(x) , as corr D(h) = 1 2l0/1 D (h). Consider a hypothesis class H X { 1} against which the learner aims to be competitive. The objective of the learner is to produce a hypothesis h : X { 1} such that with probability at least 1 δ, corr D(h) maxh H corr D(h) ε, where ε, δ are pre-specified error tolerances. A crucial remark here is that this final hypothesis may not belong to the hypotheses class H. The learners who are given such flexibility are called improper, and all known boosting algorithms fall into this class. Definition 2.1 (Agnostic Weak Learner). For any ε0, δ0 > 0, a γ-agnostic weak learner for a hypothesis class H and a base class B draws m(ε0, δ0) independently and identically distributed samples from any distribution D supported on X { 1} and outputs a base hypothesis W B such that with probability at least 1 δ0, corr D (W) γ max h H corr D (h) ε0. This definition of agnostic weak learning was introduced in Kanade & Kalai (2009), where it is noted that typically m(ε, δ) = O(log(|B|/δ)/ε2). Mirroring the presentation of results in Kanade & Kalai (2009); Brukhim et al. (2020); Ghai & Singh (2024), we present the formal statement of results for fixed ε0, δ0. In this way, the magnitudes of contribution to the final error for the boosting algorithm and the weak learner are made distinct. Finally, a further refinement of the agnostic boosting framework is the distribution-specific setting (Kanade & Kalai, 2009; Feldman, 2010), in which the set of input distributions to the weak learner are constrained so that their marginals on the feature space match that of the true underlying distribution. 
Thus, any regularity present in the feature distribution DX, e.g., if it follows a uniform or a Gaussian distribution, is also made available to the weak learner, which makes the design of such weak learners easier. Our results will also apply under this restriction. As in previous work, our main algorithm works by relabeling examples; there is no need to adaptively reweigh them. This ensures that, for any sample fed to the weak learner, the overall marginal distribution follows DX.

3. The algorithm and the main result

Our main result and its proof are completely contained in this section. We describe some notation and essential algorithmic ingredients in the following.

Notation. Let h* = arg max_{h∈H} corr_D(h) be the best-in-class hypothesis. For brevity of notation, for any finite set D ⊆ X × {±1}, we denote the empirical average over it by Ê_D[·] = (1/|D|) Σ_{(x,y)∈D}(·). We define sign(z) as 1 if z ≥ 0 and −1 otherwise. For a real-valued function f, we take sign f to mean its composition with sign, that is, sign(f(·)).

Potential function. Consider the potential function

ϕ(z, y) = ψ(z) − yz,   (1)

where ψ(z) is the Huber loss (Huber, 1992):

ψ(z) = |z| − 1/2 if |z| > 1, and z²/2 if |z| ≤ 1.

Since we never differentiate ϕ(z, y) with respect to y, let ϕ′(z, y) = ∂ϕ(z, y)/∂z and ϕ″(z, y) = ∂²ϕ(z, y)/∂z². To measure the progress of any real-valued H : X → R, consider the population potential

Φ_D(H) = E_{(x,y)∼D}[ϕ(H(x), y)],  Φ′_D(H, h) = E_{(x,y)∼D}[ϕ′(H(x), y)·h(x)].

The quantity Φ′_D(H, h) is the directional derivative of Φ_D(H) along h.

Description of the Algorithm. Algorithm 1 roughly follows the potential-based boosting framework of Kanade & Kalai (2009). The algorithm simulates the process where the true label is kept with probability −y·ϕ′(H_t(x), y)/2 = (1 − y·ψ′(H_t(x)))/2 and chosen uniformly at random otherwise. As can be seen in Figure 1, the probability of flipping increases monotonically in y·H_t(x), so the more certain H_t is of the correct label, the closer to uniformly random the label will be in D_t for the weak learner.
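In Algorithm 1's pseudo-labeling step, the pseudo-label on an unlabeled point x is +1 with probability p_t(x) = (1 − ψ′(H_t(x)))/2, so its conditional mean is exactly −ψ′(H_t(x)). A small simulation (illustrative, with hypothetical helper names) confirms this property, which the concentration argument later relies on:

```python
import numpy as np

def psi_prime(z):
    # Huber loss derivative: sign(z) * min(1, |z|), i.e. clip(z, -1, 1).
    return np.clip(z, -1.0, 1.0)

def pseudo_label_prob(Ht_x):
    # P(pseudo-label = +1 | x) = (1 - psi'(H_t(x))) / 2.
    return (1.0 - psi_prime(Ht_x)) / 2.0

rng = np.random.default_rng(2)
Ht_x = 0.7                        # the ensemble's prediction at some point x
p = pseudo_label_prob(Ht_x)
samples = np.where(rng.random(200_000) < p, 1.0, -1.0)

# By construction, E[pseudo-label | x] = 2p - 1 = -psi'(H_t(x)).
assert abs(samples.mean() - (-psi_prime(Ht_x))) < 0.01
```

Note that for |H_t(x)| ≥ 1 the pseudo-label is deterministic, and for H_t(x) = 0 it is uniformly random, matching the intuition that confident regions are pushed toward random labels.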
Since predicting on random labels is impossible, this randomized relabeling increases the relative importance of data that H_t does not predict accurately. The main trick we employ in this work is that the potential ϕ(z, y) in (1) is split into two parts such that the first part ψ(z) has no dependence on the label, and hence can be estimated via unlabeled examples. The second part −yz is linear in z, hence its derivative has no dependence on z. As a result, estimating this part does not depend on H_t, but only on a weak hypothesis. Since this is a simple class, the samples required for this estimation are few, and we can use uniform convergence to assure concentration across all boosting rounds. The formal concentration argument can be seen in Lemma 3.3. One interesting observation is that, in this construction, samples from the labeled part of the distribution are never relabeled.

Line 6 of Algorithm 1 can be interpreted as providing a regularization, wherein predictions on the unlabeled data are pushed towards 0, because p_t(x) ≥ 1/2 if and only if H_t(x) ≤ 0. The relabeling is designed so that correlation on the relabeled distribution corresponds to the derivative of a population potential (see Lemma 3.3). As such, a weak learner produces a hypothesis that has non-negligible correlation with the (negative) functional gradient of the population potential Φ_D(H_t). With properly chosen η, this assures a descent in potential as long as the weak learner has sufficient edge on the current distribution (Case A of Theorem 3.1 and line 10 of Algorithm 1). However, this need not be the case, because we have access to an agnostic weak learner, and it is possible that no h ∈ B performs well on D_t. If this is the case and if Φ′_D(H_t, sign(H_t)) ≥ ε, we are in Case B of Theorem 3.1. In this case, this condition on the derivative assures us that −sign(H_t) is a descent direction (line 12 of Algorithm 1).
Now, because the potential is bounded from below, only a certain number of such descent steps can occur. Eventually, we must reach a point where neither Case A nor Case B holds. Here, Φ′_D(H_t, sign(H_t)) is small and there is no h ∈ B that performs well on D_t. Stitching these facts together with Lemma 3.5, which relates the correlation gap to the potential gradient, provides the result.

3.1. Main result on sample-optimal agnostic boosting

Theorem 3.1 (Main theorem). For any ε, δ > 0, there is an instantiation of parameters η = O(γ²ε), T = O(1/γ²ε²), τ = O(γε), S = O(VC(B)/γ²ε²), U = O(VC(B)/γ²ε²), S0 = O(1/ε²), m = m(ε0, δ0) + O(1/γ²ε²) for which Algorithm 1 guarantees with probability at least 1 − δ − Tδ0 that

corr_D(h) ≥ max_{h∈H} corr_D(h) − 2ε0/γ − ε.

During its execution, Algorithm 1 makes T = O(1/γ²ε²) calls to the weak learner, and uses S + S0 = O(VC(B)/γ²ε²) labeled samples and TU = O(VC(B)/γ⁴ε⁴) unlabeled samples.

3.2. Proof of the main result

To maintain the continuity of presentation, our organization and notation closely mirror Kanade & Kalai (2009). We will use the following properties of ϕ and ψ.

Proposition 3.2. |ψ′(z)| ≤ 1 and zψ′(z) ≥ 0 for all z ∈ R. For all y ∈ {±1}, ϕ(·, y) is continuously differentiable, 1-smooth, and satisfies ϕ(0, y) − min_z ϕ(z, y) ≤ 1/2.

Algorithm 1 Agnostic Boosting with Unlabeled Data
1: Inputs: Samplers for labeled data from D and unlabeled data from DX, γ-agnostic weak learning oracle W, parameters η, T, τ, S, U, S0, m.
2: Initialize a zero hypothesis H1 = 0.
3: Sample S labeled examples to create dataset D̂.
4: for t = 1 to T do
5:   Sample U unlabeled examples to create dataset D̂_t.
6:   Construct a resampling distribution D_t that chooses between steps A and B with equal probability.
     A. Return (x, y) picked uniformly from D̂.
     B. Pick x uniformly from D̂_t, and return (x, ŷ), where the pseudo-label ŷ is +1 with probability p_t(x) = (1 − ψ′(H_t(x)))/2, and −1 with the remaining probability.
7:   Sample m times from D_t to create dataset D̂′_t.
8:   Call the weak learner on D̂′_t to get W_t = W(D̂′_t).
9:   if corr_{D̂′_t}(W_t) = Ê_{D̂′_t}[ŷ·W_t(x)] > τ then
10:    Update H_{t+1} = H_t + ηW_t/γ.
11:  else
12:    Update H_{t+1} = H_t − η·sign(H_t).
13:  end if
14: end for
15: Sample S0 labeled examples to create dataset D̂0.
16: Output h = arg max_{h∈{sign(H_t) : t∈[T]}} Ê_{D̂0}[y·h(x)].

Proof of Proposition 3.2. Let us begin by noting that ψ′(z) = sign(z)·min{1, |z|}, from which the first two properties can be seen. From this, all but the last property follow. The last part can be verified by elementary calculations.

We show that Φ′_D(H, h) can be estimated efficiently on the base class B.

Lemma 3.3. There exists a universal constant C > 0 such that with probability 1 − δ, for all t ∈ [T], h ∈ B ∪ {h*},

|(1/2)·Φ′_D(H_t, h) + corr_{D_t}(h)| ≤ ε_Gen := C·√((VC(B) + log(1/δ))/min{S, U}).

Proof of Lemma 3.3. By the definition of D_t, we have that

corr_{D_t}(h) = (1/2)·Ê_{D̂}[y·h(x)] − (1/2)·Ê_{D̂_t}[ψ′(H_t(x))·h(x)],

where we use the fact that Line 6.B in Algorithm 1 ensures E[ŷ|x] = −ψ′(H_t(x)). Since D̂ and D̂_t are composed of IID draws from D and DX respectively, we use the following uniform convergence result (e.g., see Anthony & Bartlett (2009)), originally due to Talagrand (1994).

Theorem 3.4 (Talagrand (1994)). Fix a hypothesis class B ⊆ {±1}^X and a distribution D over X × {±1}. There is a universal constant C ≥ 0 such that with probability 1 − δ, for all h ∈ B, it holds that

|E_{(x,y)∼D}[y·h(x)] − (1/m)·Σ_{i=1}^m y_i·h(x_i)| ≤ C·√((VC(B) + log(1/δ))/m),

where {(x_i, y_i)}_{i∈[m]} are sampled IID from D.

Hence, for some constant C ≥ 0, we have with probability 1 − δ that |Ê_{D̂}[y·h(x)] − corr_D(h)| and |Ê_{D̂_t}[ψ′(H_t(x))·h(x)] − E_{x∼DX}[ψ′(H_t(x))·h(x)]| are at most ε_Gen. Noting that ϕ′(z, y) = ψ′(z) − y, and hence

Φ′_D(H_t, h) = E_{x∼DX}[ψ′(H_t(x))·h(x)] − E_{(x,y)∼D}[y·h(x)],

completes the proof. Including h* changes the VC dimension by at most a constant (Eisenstat & Angluin, 2007).
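To make the control flow of Algorithm 1 concrete, here is a self-contained toy sketch in Python. The 1-D data distribution, the decision-stump base class standing in for B, the exhaustive-ERM weak learner, and all constants are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(3)

def psi_prime(z):
    # Derivative of the Huber loss used by the potential in Eq. (1).
    return np.clip(z, -1.0, 1.0)

def draw_labeled(n):
    # Toy distribution D: x ~ Unif[-1, 1], label sign(x) flipped w.p. 0.1.
    x = rng.uniform(-1, 1, n)
    y = np.where(rng.random(n) < 0.9, 1.0, -1.0) * np.where(x >= 0, 1.0, -1.0)
    return x, y

def draw_unlabeled(n):
    return rng.uniform(-1, 1, n)  # the feature marginal D_X alone

def weak_learner(x, y):
    # Illustrative agnostic weak learner: exhaustive ERM over a tiny base
    # class of decision stumps x -> s * sign(x - b).
    best, best_c = (1.0, 0.0), -np.inf
    for b in np.linspace(-1, 1, 21):
        for s in (1.0, -1.0):
            c = np.mean(y * s * np.where(x >= b, 1.0, -1.0))
            if c > best_c:
                best, best_c = (s, b), c
    s, b = best
    return lambda x, s=s, b=b: s * np.where(x >= b, 1.0, -1.0)

eta, gamma, tau, T = 0.05, 1.0, 0.05, 40
ensemble = []  # pairs (coefficient, hypothesis) making up H_t

def H(x):
    out = np.zeros_like(x, dtype=float)
    for c, h in ensemble:
        out += c * h(x)
    return out

xl, yl = draw_labeled(500)  # the labeled pool, reused in every round
for t in range(T):
    xu = draw_unlabeled(500)
    # Step 6.B: pseudo-labels with P(+1 | x) = (1 - psi'(H_t(x))) / 2.
    p = (1.0 - psi_prime(H(xu))) / 2.0
    yu = np.where(rng.random(len(xu)) < p, 1.0, -1.0)
    xt, yt = np.concatenate([xl, xu]), np.concatenate([yl, yu])
    Wt = weak_learner(xt, yt)
    if np.mean(yt * Wt(xt)) > tau:        # Case A: the weak learner has edge
        ensemble.append((eta / gamma, Wt))
    else:                                 # Case B: step along -sign(H_t)
        snap = list(ensemble)
        def sgn_H(x, snap=snap):
            out = np.zeros_like(x, dtype=float)
            for c, h in snap:
                out += c * h(x)
            return np.where(out >= 0, 1.0, -1.0)
        ensemble.append((-eta, sgn_H))

xe, ye = draw_labeled(2000)
final_corr = np.mean(ye * np.where(H(xe) >= 0, 1.0, -1.0))
print("correlation of sign(H_T):", final_corr)  # typically near 0.8 here
```

The sketch keeps the labeled pool fixed across all rounds, exactly the reuse that the separable potential permits, while fresh unlabeled draws are pseudo-labeled each round.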
Finally, we will use the following lemma, which upper bounds the correlation gap, our ultimate concern, by the difference in the directional derivatives of the potential.

Lemma 3.5. For any real-valued classifier H : X → R, we have

corr_D(h*) − corr_D(sign(H)) ≤ Φ′_D(H, sign(H)) − Φ′_D(H, h*).

Proof. We start by noting that

Φ′_D(H, sign(H)) − Φ′_D(H, h*) = E_{(x,y)∼D}[(ψ′(H(x)) − y)(sign(H(x)) − h*(x))] = E_{(x,y)∼D}[(1 − yψ′(H(x)))·y·(h*(x) − sign(H(x)))],

where in the last line we use the fact that y² = 1. Consider any (x, y) such that y·H(x) > 0: here y(h*(x) − sign(H(x))) ≤ 0. Furthermore, since y and H(x) have the same sign, so do y and ψ′(H(x)) by Proposition 3.2, and hence 1 − yψ′(H(x)) ≤ 1. Similarly, whenever y·H(x) < 0: then y(h*(x) − sign(H(x))) ≥ 0, and y and ψ′(H(x)) have opposite signs, which implies 1 − yψ′(H(x)) ≥ 1. Now the claim follows as

Φ′_D(H, sign(H)) − Φ′_D(H, h*) = E_{(x,y)∼D}[1{y·H(x) ≥ 0}·(1 − yψ′(H(x)))·y(h*(x) − sign(H(x))) + 1{y·H(x) < 0}·(1 − yψ′(H(x)))·y(h*(x) − sign(H(x)))] ≥ E_{(x,y)∼D}[y(h*(x) − sign H(x))] = corr_D(h*) − corr_D(sign(H)).

This wraps up the lemma.

We are now ready to prove the main result.

Proof of Theorem 3.1. Let us dispense with the random events at once. The success of Lemma 3.3, the event that max_{t∈[T]} |corr_{D̂0}(sign H_t) − corr_D(sign H_t)| ≤ ε/10, and the event that max_{t∈[T]} |corr_{D̂′_t}(W_t) − corr_{D_t}(W_t)| ≤ γε/10 can be ensured with probability 1 − δ by a simple application of Hoeffding's inequality and a union bound, given the setting of m and S0 in the statement of the theorem. Similarly, ε_Gen ≤ γε/10 holds in Lemma 3.3 for S, U = Ω((VC(B) + log δ⁻¹)/γ²ε²).

Let h_t = (H_{t+1} − H_t)/η. Since ϕ is 1-smooth by Proposition 3.2:

Φ_D(H_{t+1}) − Φ_D(H_t) ≤ E_{(x,y)∼D}[η·ϕ′(H_t(x), y)·h_t(x) + (η²/2)·(h_t(x))²] ≤ η·Φ′_D(H_t, h_t) + η²/(2γ²).   (2)

Case A: Consider any step t where corr_{D̂′_t}(W_t) > τ. Here h_t = W_t/γ. It follows from Lemma 3.3 that

Φ′_D(H_t, h_t) = Φ′_D(H_t, W_t)/γ ≤ (−2·corr_{D_t}(W_t) + 2ε_Gen)/γ ≤ (−2·corr_{D̂′_t}(W_t) + 2γε/10 + 2γε/10)/γ < (−2τ + 2γε/5)/γ = −8ε/5,

where we set τ = γε and η = γ²ε. By Equation (2), the potential drops as Φ_D(H_{t+1}) − Φ_D(H_t) ≤ −γ²ε²/2.
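Before moving to Case B, note that the pointwise inequality at the heart of Lemma 3.5's proof, (1 − yψ′(H(x)))·y·(h*(x) − sign(H(x))) ≥ y·(h*(x) − sign(H(x))), can be checked exhaustively over a grid; an illustrative sanity check:

```python
import numpy as np

def psi_prime(z):
    # Huber loss derivative, equal to sign(z) * min(1, |z|).
    return float(np.clip(z, -1.0, 1.0))

# Pointwise inequality behind Lemma 3.5: averaging it over D gives
# corr(h*) - corr(sign H) <= Phi'(H, sign H) - Phi'(H, h*).
count = 0
for y in (-1.0, 1.0):
    for h_star in (-1.0, 1.0):
        for Hx in np.linspace(-3, 3, 601):
            s = 1.0 if Hx >= 0 else -1.0   # sign(H(x)), with sign(0) = 1
            lhs = (1 - y * psi_prime(Hx)) * y * (h_star - s)
            rhs = y * (h_star - s)
            assert lhs >= rhs - 1e-12
            count += 1
print("checked", count, "grid points")
```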
Case B: Consider any step t where corr_{D̂′_t}(W_t) ≤ τ and, crucially, Φ′_D(H_t, sign H_t) ≥ ε. Here h_t = −sign H_t. Since Φ′_D(H_t, h_t) = −Φ′_D(H_t, sign H_t) ≤ −ε, by Equation (2), we have Φ_D(H_{t+1}) − Φ_D(H_t) ≤ −γ²ε²/2.

By Proposition 3.2, at initialization, Φ_D(0) is at most a half away from the minimum. Thus, setting T = 2/γ²ε², there must arise an iterate at which neither Case A nor Case B holds. That is, there is some s ∈ [T] such that corr_{D̂′_s}(W_s) ≤ τ and Φ′_D(H_s, sign H_s) ≤ ε. Now using Lemma 3.3 and the fact that the weak learner γ-approximately maximizes correlation (Definition 2.1), we have

Φ′_D(H_s, h*) ≥ −2·corr_{D_s}(h*) − 2ε_Gen ≥ −2·(corr_{D_s}(W_s) + ε0)/γ − 2ε_Gen ≥ −2·(corr_{D̂′_s}(W_s) + γε/10 + ε0)/γ − 2ε_Gen ≥ −2ε0/γ − 12ε/5,

where in the last line we recall τ = γε and ε_Gen ≤ γε/10. By Lemma 3.5, we have

corr_D(h*) − corr_D(sign(H_s)) ≤ Φ′_D(H_s, sign(H_s)) − Φ′_D(H_s, h*) ≤ 2ε0/γ + 17ε/5.

To complete the proof, we observe that

corr_D(h) ≥ corr_{D̂0}(h) − ε/10 ≥ corr_{D̂0}(sign H_s) − ε/10 ≥ corr_D(sign H_s) − ε/5 ≥ corr_D(h*) − 2ε0/γ − 18ε/5,

where we use the fact that h maximizes the empirical correlation on the dataset D̂0. Substituting ε appropriately yields the claim.

4. Improving unlabeled sample efficiency

In this section, we reduce the number of unlabeled samples needed to O(log |B|/γ³ε³), using the data reuse scheme from Ghai & Singh (2024), which is crucially applied only to the unlabeled data. The key idea behind the scheme is that since H_t changes by a small amount each round, the change it induces on any twice-continuously differentiable potential is also small. Therefore, the desired relabeling distributions are not too different across rounds, and one may be able to reuse the distribution of past rounds to some extent. This difference is reflected in Line 6 of Algorithm 2, which allows recursive use of unlabeled data from past rounds. To do this, we first construct a twice-continuously differentiable potential using a pseudo-Huber loss:

ϕ(z, y) = ψ(z) − yz, where ψ(z) = √(1 + z²) − 1.   (4)

Theorem 4.1 (Main theorem with unlabeled data reuse).
For any ε, δ > 0, there is an instantiation of parameters η = O(γ²ε/log |B|), T = O(log |B|/γ²ε²), τ = O(γε), S = O(VC(B)/γ²ε²), U = O(1/γε), S0 = O(1/ε²), m = m(ε0, δ0) + O(1/γ²ε²) for which Algorithm 2 guarantees with probability at least 1 − δ − Tδ0 that

corr_D(h) ≥ max_{h∈H} corr_D(h) − 3ε0/γ − ε.

During its execution, Algorithm 2 makes T = O(log |B|/γ²ε²) calls to the weak learner, and needs S + S0 = O(log |B|/γ²ε²) labeled samples and TU = O(log |B|/γ³ε³) unlabeled samples.

Algorithm 2 Agnostic Boosting with Selective Reuse of Unlabeled Data
1: Inputs: Samplers for labeled data from D and unlabeled data from DX, γ-agnostic weak learning oracle W, parameters η, T, τ, S, U, S0, m, σ.
2: Initialize a zero hypothesis H1 = 0.
3: Sample S labeled examples to create dataset D̂.
4: for t = 1 to T do
5:   Sample U unlabeled examples to create dataset D̂_t.
6:   Construct a resampling distribution D′_t as follows. If t = 1, pick x uniformly from D̂_t, pick ŷ ∈ {±1} uniformly, and return (x, ŷ). For t > 1, do:
     A. With probability 1 − σ, return a sample (x, ŷ) from D′_{t−1}.
     B. Else return (x, ŷ), where x is uniformly chosen from D̂_t, η′ ∼ Unif[0, η], and ŷ is chosen as +1 with probability p_t(x, η′), and −1 with the remaining probability, where
        p_t(x, η′) = 1/2 − (σ·ψ′(H_{t−1}(x)) + η·ψ″(H_{t−1}(x) + η′·h_{t−1}(x))·h_{t−1}(x))/(2σ).
7:   Construct a resampling distribution D_t that chooses between steps A and B with equal probability.
     A. Return (x, y) picked uniformly from D̂.
     B. Return (x, ŷ) sampled uniformly from D′_t.
8:   Sample m times from D_t to create dataset D̂′_t.
9:   Call the weak learner on D̂′_t to get W_t = W(D̂′_t).
10:  if corr_{D̂′_t}(W_t) = Ê_{D̂′_t}[ŷ·W_t(x)] > τ then
11:    Update H_{t+1} = H_t + ηW_t/γ.
12:  else
13:    Update H_{t+1} = H_t − η·sign(H_t).
14:  end if
15: end for
16: Sample S0 labeled examples to create dataset D̂0.
17: Output h = arg max_{h∈{sign(H_t) : t∈[T]}} Ê_{D̂0}[y·h(x)].

Although the above result is always better in terms of the demand for unlabeled samples, it comes at the cost of
increased oracle complexity, that is, the number of calls to the weak learner, which now carries a log |B| factor, unlike Theorem 3.1. Again, mirroring techniques in Ghai & Singh (2024) provides some mitigation. In particular, in Appendix A.2, we provide a different guarantee for the same algorithm that makes O(1/γ²ε²) calls to the weak learner, while needing O(log |B|/γ²ε²) labeled and O(log |B|/γ³ε³ + (log |B|)³/γ²ε²) unlabeled samples.

5. Resiliency against covariate shift

We consider the setting where the learner has access to a distribution D supported over X × {±1} and a different, possibly unrelated, distribution Q supported over the feature space X. We show that under mild conditions the learner can still produce an arbitrarily accurate classifier. To measure how different Q and DX are, we define C_X = ‖dDX/dQ‖_∞, a uniform upper bound on the Radon-Nikodym derivative of DX with respect to Q.

Theorem 5.1 (Main theorem for covariate shift). For any ε, δ > 0, there is an algorithm that makes O(C_X/γ²ε²) calls to the weak learner, draws O(VC(B)/γ²ε²) labeled samples from D and TU = O((C_X)³·VC(B)/γ⁴ε⁴) unlabeled samples from Q, and outputs h such that with probability at least 1 − δ − Tδ0,

corr_D(h) ≥ max_{h∈H} corr_D(h) − (1 + C_X)·ε0/γ − ε.

Recall that ε0 can be made arbitrarily small by feeding more samples from the empirical distribution to the weak learner. Although this comes at a (polynomial) computational cost, in particular, the weak learner now needs to be more accurate, the sample complexity remains unaffected. The key step is in Lemma B.3; it proves an analogue of Lemma 3.5 implying that oversampling the part of the potential connected to the Huber loss does not hurt the correlation gap. From there, we set up a non-scalar potential measure to keep track of the progress of the learner.

6. Applications

6.1.
Agnostic learning of halfspaces

We illustrate how our method, used as a black box, agnostically learns halfspaces over the n-dimensional Boolean hypercube under uniform marginals on the features. Since unlabeled samples can be drawn from the uniform distribution at essentially no statistical cost, the added complexity of acquiring unlabeled data becomes solely a computational concern. This procedure improves upon existing boosting-based approaches. In particular, building on Kanade & Kalai (2009), we rely on empirical risk minimization (ERM) over all parities of degree at most d ∝ 1/ε⁴. Such an ERM rule achieves a weak learner advantage γ = n^{−d}. As shown in Theorem 6.1 below (proved in Appendix D), this instantiation of our boosting framework reduces the labeled sample complexity from O(ε^{−7} n^{60ε^{−4}}) to O(ε^{−6} n^{40ε^{−4}}).

Theorem 6.1. Let D be any distribution over {±1}ⁿ × {±1} with uniform feature marginals, and let H = {sign(w·x − θ) | (w, θ) ∈ R^{n+1}} denote the class of halfspaces. There exists a degree d = O(ε^{−4}) such that running Algorithm 1 with ERM over parities of degree at most d produces a classifier h satisfying ℓ_D(h) ≤ min_{h′∈H} ℓ_D(h′) + ε, while using only O(ε^{−6} n^{40ε^{−4}}) labeled samples in n^{poly(1/ε)} time.

6.2. Boosting for reinforcement learning

The construction of near-optimal policies for reinforcement learning (RL) via boosting was first pursued in Brukhim et al. (2022); Ghai & Singh (2024) improve on these results. Modifying the RL setting to include the ability to sample trajectories without observing rewards, we can apply our results to reduce the number of samples that require reward feedback. Such a feedback model is useful where rollouts are cheap but reward feedback is not, for instance because it comes from human labeling or an expensive process (Finn et al., 2016). In Appendix E, we provide a formal description of this modified setting. Plugging Algorithm 1 into a modified meta-algorithm of Brukhim et al.
(2022) in a manner that allows for trajectories without reward yields the following result for binary-action MDPs. Our result improves upon Ghai & Singh (2024) in that it requires fewer with-reward episodes. Here, V π is the expected discounted reward of a policy π and V ∗ is its maximum; β is the discount factor of the underlying MDP; and C∞, D∞ and E, Eν are distribution mismatch and policy completeness terms (related to the inherent Bellman error). In the episodic model, the learner interacts with the MDP in episodes. In the ν-reset model, the learner can seed the initial state with a fixed, well-dispersed distribution ν as a means of exploration. See Appendix E for a complete statement of results and details of the setting.

Theorem 6.2 (Informal; stated formally in Theorem E.1). Let W be a γ-weak learner for the policy class Π operating with a base class B, with sample complexity m(ε0, δ0) = O(log(|B|/δ0)/ε0²). Fix a tolerance ε and failure probability δ. In the episodic access model, there is an algorithm that uses the weak learner W to produce a policy π such that with probability 1 − δ, we have V ∗ − V π ≤ C∞E/(1 − β) + ε, while sampling O((log |B|)/γ³ε⁴) episodes of length O((1 − β)^{−1}) without reward feedback and O((log |B|)/γ²ε³) episodes of length O((1 − β)^{−1}) with reward feedback. In the ν-reset access model, there is a setting of parameters such that Algorithm 4, when given access to W, produces a policy π such that with probability 1 − δ, we have V ∗ − V π ≤ D∞Eν/(1 − β)² + ε, while sampling O((log |B|)/γ³ε⁵) episodes of length O((1 − β)^{−1}) without reward feedback and O((log |B|)/γ²ε⁴) episodes of length O((1 − β)^{−1}) with reward feedback.
| Dataset    | PAB (0%)    | Ours (0%)   | PAB (5%)    | Ours (5%)   | PAB (10%)   | Ours (10%)  | PAB (20%)   | Ours (20%)  |
|------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|
| Ionosphere | 0.87 ± 0.05 | 0.91 ± 0.04 | 0.88 ± 0.05 | 0.90 ± 0.04 | 0.84 ± 0.06 | 0.90 ± 0.04 | 0.81 ± 0.06 | 0.83 ± 0.06 |
| Diabetes   | 0.84 ± 0.09 | 0.89 ± 0.07 | 0.86 ± 0.08 | 0.86 ± 0.09 | 0.79 ± 0.09 | 0.79 ± 0.10 | 0.76 ± 0.10 | 0.80 ± 0.10 |
| Spambase   | 0.91 ± 0.02 | 0.94 ± 0.02 | 0.90 ± 0.03 | 0.92 ± 0.03 | 0.89 ± 0.03 | 0.90 ± 0.02 | 0.83 ± 0.04 | 0.87 ± 0.03 |
| German     | 0.79 ± 0.07 | 0.86 ± 0.07 | 0.84 ± 0.08 | 0.84 ± 0.08 | 0.76 ± 0.08 | 0.87 ± 0.07 | 0.75 ± 0.08 | 0.77 ± 0.08 |
| Sonar      | 0.78 ± 0.08 | 0.92 ± 0.05 | 0.68 ± 0.09 | 0.89 ± 0.06 | 0.84 ± 0.08 | 0.87 ± 0.07 | 0.69 ± 0.10 | 0.77 ± 0.08 |
| Waveform   | 0.89 ± 0.02 | 0.89 ± 0.02 | 0.88 ± 0.03 | 0.87 ± 0.03 | 0.86 ± 0.03 | 0.86 ± 0.03 | 0.83 ± 0.03 | 0.83 ± 0.03 |
| Average    | 0.84        | 0.89        | 0.84        | 0.88        | 0.81        | 0.84        | 0.78        | 0.81        |

Table 1. 50-fold cross-validated accuracies of the Potential-based Agnostic Booster (PAB) (Kanade & Kalai, 2009) and our proposed boosting algorithm on six datasets with 0%, 5%, 10%, and 20% added label noise (during training). Sonar and Ionosphere have 50% of labels dropped, while the remaining datasets have 90% of labels dropped. A final row reports the average accuracy (evenly weighted) over all 6 datasets.

7. Experiments

In this section, we demonstrate the empirical viability of our approach. Table 1 showcases the results from our initial experiments comparing Algorithm 1 with the agnostic boosting method introduced by Kanade & Kalai (2009), herein referred to as the Potential-based Agnostic Booster (PAB). These evaluations were performed on various UCI classification datasets (Sigillito et al., 1989; Hopkins et al., 1999; Smith et al., 1988; Hofmann, 1994; Sejnowski & Gorman, 1988; Breiman & Stone, 1984), employing decision stumps (Pedregosa et al., 2011) as the weak learners. Notably, Algorithm 1 extends PAB to handle unlabeled data by incorporating the new potential function defined in Equation (1).
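Since the experiments use decision stumps as weak learners, it may help to see what such a weak learner looks like in code. Below is a minimal pure-Python stump fitter that maximizes the weighted empirical correlation; the paper's experiments use scikit-learn's decision stumps, so this standalone version is only an illustrative sketch.

```python
def fit_stump(X, y, w):
    """Exhaustively fit a decision stump h(x) = s * sign(x[j] - t),
    maximizing the weighted correlation sum_i w_i * y_i * h(x_i)."""
    d = len(X[0])
    best = (-float("inf"), 0, 0.0, 1)  # (correlation, feature, threshold, sign)
    for j in range(d):
        vals = sorted({x[j] for x in X})
        # Candidate thresholds: below all values, and midpoints between neighbors.
        cuts = [vals[0] - 1.0] + [(a + b) / 2 for a, b in zip(vals, vals[1:])]
        for t in cuts:
            corr = sum(wi * yi * (1 if x[j] > t else -1)
                       for x, yi, wi in zip(X, y, w))
            for s in (+1, -1):
                if s * corr > best[0]:
                    best = (s * corr, j, t, s)
    _, j, t, s = best
    return lambda x: s * (1 if x[j] > t else -1)

# Toy usage: the first feature perfectly separates the labels.
X = [[0.0, 5.0], [1.0, 3.0], [2.0, 8.0], [3.0, 1.0]]
y = [-1, -1, +1, +1]
w = [0.25] * 4
h = fit_stump(X, y, w)
preds = [h(x) for x in X]
```

Accepting a weight vector w is what lets a booster reweight (or, here, resample) the data to emphasize past mistakes without changing the weak learner itself.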
To evaluate the robustness of both algorithms against label noise, we introduced noise levels of 5%, 10%, and 20% during the training phase. We also randomly remove a fraction of the labels from each dataset to create scenarios with both labeled and unlabeled instances: specifically, we omitted 50% of labels for the smaller datasets (Sonar and Ionosphere) and 90% for the other datasets. Our findings indicate that incorporating unlabeled examples leads to improved performance. This enhancement is likely attributable to PAB's inability to reuse samples, which restricts the number of boosting iterations when the sample size is constrained. For a comprehensive overview of the experimental setup, please refer to Appendix C.

8. Conclusion

This paper leverages unlabeled data to reduce the sample complexity of agnostic boosting. The theoretical improvements are stark: when given as much unlabeled data as the amount of labeled data required by existing approaches, the resultant labeled sample complexity reduces to that of ERM, becoming essentially optimal. This is accomplished via a novel decomposable potential function, whose derivative naturally splits into two parts that are estimable independently from labeled and unlabeled data, respectively. We end with a few concrete directions for future work. The possibility of achieving an optimal sample complexity for agnostic boosting in polynomial time without any concession remains open. In our view, an equally important and likely fruitful direction is to improve the oracle complexity of agnostic boosting, where known results scale as 1/ε², substantially worse than the log(1/ε) achievable in the realizable case. Developing algorithms that adapt, on a per-round basis, to the weak learning edge γt is a key unresolved step toward making agnostic boosting practical. Finally, it would be worthwhile to extend the sample complexity improvements in recent works to the filtering framework (cf.
sub-sampling; see, for example, the discussion in Domingo (2000)), which essentially treats the weak learners as black-box learning algorithms and hence avoids appeals to uniform convergence arguments on the weak hypothesis class.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Alon, N., Gonen, A., Hazan, E., and Moran, S. Boosting simple learners. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pp. 481–489, 2021.

Anthony, M. and Bartlett, P. L. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.

Bartlett, P., Freund, Y., Lee, W. S., and Schapire, R. E. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998.

Ben-David, S., Long, P. M., and Mansour, Y. Agnostic boosting. In Computational Learning Theory: 14th Annual Conference on Computational Learning Theory, COLT 2001 and 5th European Conference on Computational Learning Theory, EuroCOLT 2001, Amsterdam, The Netherlands, July 16–19, 2001, Proceedings, pp. 507–516. Springer, 2001.

Beygelzimer, A., Kale, S., and Luo, H. Optimal and adaptive algorithms for online boosting. In International Conference on Machine Learning, pp. 2323–2331. PMLR, 2015.

Blanc, G., Hayderi, A., Koch, C., and Tan, L.-Y. The sample complexity of smooth boosting and the tightness of the hardcore theorem. In 2024 IEEE 65th Annual Symposium on Foundations of Computer Science (FOCS), pp. 1431–1450. IEEE, 2024.

Breiman, L. and Stone, C. Waveform Database Generator (Version 1). UCI Machine Learning Repository, 1984. DOI: https://doi.org/10.24432/C5CS3C.

Brukhim, N., Chen, X., Hazan, E., and Moran, S. Online agnostic boosting via regret minimization.
Advances in Neural Information Processing Systems, 33:644–654, 2020.

Brukhim, N., Hazan, E., and Singh, K. A boosting approach to reinforcement learning. Advances in Neural Information Processing Systems, 35:33806–33817, 2022.

Chapelle, O., Schölkopf, B., and Zien, A. Semi-Supervised Learning. The MIT Press, Cambridge, Massachusetts, 2006.

Chen, S.-T., Lin, H.-T., and Lu, C.-J. An online boosting algorithm with theoretical justifications. In Proceedings of the 29th International Conference on Machine Learning, pp. 1873–1880, 2012.

Chen, S.-T., Balcan, M.-F., and Chau, D. H. Communication efficient distributed agnostic boosting. In Artificial Intelligence and Statistics, pp. 1299–1307. PMLR, 2016.

da Cunha, A., Høgsgaard, M. M., Paudice, A., and Sun, Y. Revisiting agnostic boosting. arXiv preprint arXiv:2503.09384, 2025.

Diakonikolas, I., Impagliazzo, R., Kane, D. M., Lei, R., Sorrell, J., and Tzamos, C. Boosting in the presence of Massart noise. In Conference on Learning Theory, pp. 1585–1644. PMLR, 2021.

Domingo, C. MadaBoost: A modification of AdaBoost. In Proceedings of the 13th Conference on Computational Learning Theory, COLT '00, 2000.

Eisenstat, D. and Angluin, D. The VC dimension of k-fold union. Information Processing Letters, 101(5):181–184, 2007.

Feldman, V. Distribution-specific agnostic boosting. Innovations in Theoretical Computer Science (ITCS), pp. 241–250, 2010.

Finn, C., Yu, T., Fu, J., Abbeel, P., and Levine, S. Generalizing skills with semi-supervised reinforcement learning. arXiv preprint arXiv:1612.00429, 2016.

Freund, Y. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256–285, 1995.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Ghai, U. and Singh, K. Sample-efficient agnostic boosting.
In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=ufKBRvYxtp.

Green Larsen, K. and Ritzert, M. Optimal weak to strong learning. Advances in Neural Information Processing Systems, 35:32830–32841, 2022.

Hazan, E. and Singh, K. Boosting for online convex optimization. In International Conference on Machine Learning, pp. 4140–4149. PMLR, 2021.

Hofmann, H. Statlog (German Credit Data). UCI Machine Learning Repository, 1994. DOI: https://doi.org/10.24432/C5NC77.

Hopkins, M., Reeber, E., Forman, G., and Suermondt, J. Spambase. UCI Machine Learning Repository, 1999. DOI: https://doi.org/10.24432/C53G6X.

Hopkins, M., Kane, D. M., Lovett, S., and Mahajan, G. Realizable learning is all you need. In Conference on Learning Theory, pp. 3015–3069. PMLR, 2022.

Huber, P. J. Robust estimation of a location parameter. In Breakthroughs in Statistics: Methodology and Distribution, pp. 492–518. Springer, 1992.

Jung, Y. H., Goetz, J., and Tewari, A. Online multiclass boosting. Advances in Neural Information Processing Systems, 30, 2017.

Kalai, A. and Servedio, R. A. Boosting in the presence of noise. In Proceedings of the Thirty-Fifth Annual ACM Symposium on Theory of Computing, pp. 195–205, 2003.

Kalai, A. T., Mansour, Y., and Verbin, E. On agnostic boosting and parity learning. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pp. 629–638, 2008.

Kale, S. Boosting and hard-core set constructions: a simplified approach. In Electronic Colloquium on Computational Complexity (ECCC), volume 14. Citeseer, 2007.

Kanade, V. and Kalai, A. Potential-based agnostic boosting. Advances in Neural Information Processing Systems, 22, 2009.

Kearns, M. Learning Boolean formulae or finite automata is as hard as factoring. Technical Report TR-14-88, Harvard University Aiken Computation Laboratory, 1988.

Kearns, M. and Valiant, L.
Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the ACM (JACM), 41(1):67–95, 1994.

Klivans, A. R., O'Donnell, R., and Servedio, R. A. Learning intersections and thresholds of halfspaces. Journal of Computer and System Sciences, 68(4):808–840, 2004.

Kothari, P. K. and Livni, R. Improper learning by refuting. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.

Long, P. M. and Servedio, R. A. Random classification noise defeats all convex potential boosters. In Proceedings of the 25th International Conference on Machine Learning, pp. 608–615, 2008.

Lyu, X., Wu, H., and Yang, J. The cost of parallelizing boosting. In Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 3140–3155. SIAM, 2024.

Mansour, Y. and McAllester, D. Boosting using branching programs. Journal of Computer and System Sciences, 64(1):103–112, 2002.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Raman, V. and Tewari, A. Online agnostic multiclass boosting. Advances in Neural Information Processing Systems, 35:25908–25920, 2022.

Schapire, R. E. The strength of weak learnability. Machine Learning, 5:197–227, 1990.

Schapire, R. E. and Freund, Y. Boosting: Foundations and algorithms. Kybernetes, 42(1):164–166, 2013.

Schapire, R. E. and Singer, Y. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 80–91, 1998.

Sejnowski, T. and Gorman, R. Connectionist Bench (Sonar, Mines vs. Rocks). UCI Machine Learning Repository, 1988. DOI: https://doi.org/10.24432/C5T01Q.

Shalev-Shwartz, S.
and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Sigillito, V., Wing, S., Hutton, L., and Baker, K. Ionosphere. UCI Machine Learning Repository, 1989. DOI: https://doi.org/10.24432/C5W01B.

Smith, J. W., Everhart, J. E., Dickson, W., Knowler, W. C., and Johannes, R. S. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Annual Symposium on Computer Application in Medical Care, pp. 261. American Medical Informatics Association, 1988.

Talagrand, M. Sharper bounds for Gaussian and empirical processes. The Annals of Probability, pp. 28–76, 1994.

Map of the appendix. In Appendix A, we complete the proofs for results concerning improved unlabeled sample efficiency. Appendix B discusses the proofs for covariate shift. Appendix C provides experimental details not found in the main paper. Appendix D and Appendix E provide further details on applications of boosting to learning halfspaces and reinforcement learning, respectively.

A. Improving unlabeled sample efficiency

A.1. Proofs for the result

Notice that in this section we use different choices of ϕ and ψ, namely those stated in Equation (4). However, Lemma 3.5 continues to hold; in fact, the latter only requires that zψ′(z) ≥ 0 for all z. To maintain continuity of presentation, our organization and notation closely mirror Ghai & Singh (2024). Throughout this section, we always set σ = η/γ. Define ΨD(H) = E_{(x,y)∼D}[ψ(H(x))] and Ψ′_D(H, h) = E_{(x,y)∼D}[ψ′(H(x))h(x)].

Proof of Theorem 4.1. This proof is almost identical to that of Theorem 4 in Ghai & Singh (2024); we reproduce it for completeness. In fact, the only change stems from the new upper bound on εGen in Lemma A.1, which, unlike the previous work, makes a distinction between labeled and unlabeled samples.

Lemma A.1.
There exists a universal constant C > 0 such that with probability 1 − δ, for all t ∈ [T] and h ∈ B ∪ {h∗}:

|Φ′_D(Ht, h) + 3 corr_{Dt}(h)| ≤ C ( √((log |B| + log(1/δ))/S) + η (1 + √(log(|B|T/δ)/(σU))) ) =: εGen.

But before that, let us dispense with the random events at once. The success of Lemma A.1, the event that max_{t∈[T]} |corr_{D̂0}(sign Ht) − corr_D(sign Ht)| ≤ ε′/10, and the event that max_{t∈[T]} |corr_{D̂′t}(Wt) − corr_{Dt}(Wt)| ≤ ε″/10 can all be ensured with probability 1 − δ by a simple application of Hoeffding's inequality and a union bound, given appropriate settings of m and S0. We will set precise values of ε′ and ε″ shortly. Recall that ht = (Ht+1 − Ht)/η. Equation (2) can be rearranged to get

Σ_{t=1}^{T} Φ′_D(Ht, ht) ≥ −(1/η) Σ_{t=1}^{T} (ΦD(Ht) − ΦD(Ht+1)) − ηT/(2γ²) ≥ −1/η − ηT/(2γ²),

where we use the fact that ϕ(0, y) − min_z ϕ(z, y) ≤ 1.

Case A: If ht = Wt/γ, observe that corr_{Dt}(Wt) ≥ corr_{D̂′t}(Wt) − ε″/10 ≥ τ − ε″/10. Now apply Lemma A.1 to get

Φ′_D(Ht, ht) ≤ (−3 corr_{Dt}(Wt) + εGen)/γ.

Case B: If ht = −sign(Ht), then corr_{Dt}(Wt) ≤ corr_{D̂′t}(Wt) + ε″/10 ≤ τ + ε″/10. Applying Lemma A.1 together with the weak learning guarantee, we get

3 corr_{Dt}(Wt) ≥ 3γ corr_{Dt}(h∗) − 3ε0 ≥ −γ Φ′_D(Ht, h∗) − 3ε0 − γεGen.

Using Lemma 3.5, this translates to

Φ′_D(Ht, ht) = −Φ′_D(Ht, sign(Ht)) ≤ −Φ′_D(Ht, h∗) − (corr_D(h∗) − corr_D(sign(Ht))) ≤ −(corr_D(h∗) − corr_D(sign(Ht))) + (3 corr_{Dt}(Wt) + 3ε0)/γ + εGen.

In either case, we have

Φ′_D(Ht, ht) ≤ min{ (−3 corr_{Dt}(Wt) + εGen)/γ, −(corr_D(h∗) − corr_D(sign(Ht))) + (3/γ)(τ + ε0 + ε″/10) + εGen }.

Hereafter, let s be a time step satisfying

corr_D(h∗) − corr_D(sign(Hs)) ≤ (3/γ)(2τ + ε0) + 2εGen.

Such a choice must exist, since otherwise we would get for all t that Φ′_D(Ht, ht) ≤ (−3τ + εGen + 3ε″/10)/γ, which, upon summing over t, contradicts Equation (2). Combining this with the observation that h maximizes the empirical correlation on D̂0, we get

corr_D(h∗) − corr_D(h) ≤ (3/γ)(2τ + ε0) + 2εGen + ε′/5.

Setting ε′ = ε″ = γε/100 and plugging in the proposed hyper-parameters with appropriate constants yields the claimed result.

Proof of Lemma A.1.
By the definition of Dt, we have that

corr_{Dt}(h) = (1/3) Ê_{D̂}[yh(x)] + (2/3) E_{D′t}[ŷh(x)] = (1/3) Ê_{D̂}[yh(x)] + (2/3) corr_{D′t}(h).

Since D̂ is composed of IID draws from D, the standard uniform convergence result via a union bound yields that, for some constant C′ > 0, with probability 1 − δ, |Ê_{D̂}[yh(x)] − corr_D(h)| is at most C′√((log |B| + log(1/δ))/S). By inspecting Line 6 in Algorithm 2 here and Line 5 in Algorithm 1 in Ghai & Singh (2024) with the substitution y = 1, we note that the two are identical. Therefore, we can apply the following result from Ghai & Singh (2024).

Lemma A.2 (Lemma 6 in Ghai & Singh (2024)). Set σ = η/γ. There exists a universal constant C > 0 such that with probability 1 − δ, for all t ∈ [T] and h ∈ B ∪ {h∗}:

|Ψ′_D(Ht, h) + 2 corr_{D′t}(h)| ≤ Cη (1 + √(log(|B|T/δ)/(σU))).

Since ΦD(H) = ΨD(H) − corr_D(H), we get

|(1/3) Φ′_D(Ht, h) + corr_{Dt}(h)| ≤ (1/3) |Ê_{D̂}[yh(x)] − corr_D(h)| + (1/3) |Ψ′_D(Ht, h) + 2 corr_{D′t}(h)|,

where Φ′_D(Ht, h) = E_{x∼DX}[ψ′(Ht(x))h(x)] − E_{(x,y)∼D}[yh(x)], completing the proof.

A.2. Trading off oracle complexity and unlabeled sample complexity

Theorem A.3 (Main theorem with unlabeled data reuse). For any ε, δ > 0, there is an instantiation of parameters for which Algorithm 2 guarantees with probability 1 − δ − Tδ0 that

corr_D(h) ≥ max_{h′∈H} corr_D(h′) − 3ε0 − ε.

During its execution, Algorithm 2 makes O(1/γ²ε²) calls to the weak learner, and samples S + S0 = O(log |B|/γ²ε²) labeled samples and TU = O(log |B|/γ³ε³ + (log |B|)³/γ³ε²) unlabeled samples.

Proof. The proof is identical to that of Theorem 4.1, except crucially for the substitution of the following bound on εGen: there exists a universal constant C > 0 such that with probability 1 − δ, for all t ∈ [T] and h ∈ B ∪ {h∗}:

|(1/3) Φ′_D(Ht, h) + corr_{Dt}(h)| ≤ C ( √((log |B| + log(1/δ))/S) + σ + η √(log(|B|T/δ)/(σU)) + η (log(|B|T/δ))^{3/2} ) =: εGen.

To prove this claim itself, we follow the same recipe as in the proof of Lemma A.1.
Once again, we observe that Line 6 in Algorithm 2 here is identical to Line 5 in Algorithm 1 in Ghai & Singh (2024) with the substitution y = 1. Therefore, we can apply the following result from Ghai & Singh (2024).

Lemma A.4 (Lemma 15 in Ghai & Singh (2024)). Set σ = η/γ. There exists a universal constant C > 0 such that with probability 1 − δ, for all t ∈ [T], h ∈ B ∪ {h∗}:

|Ψ′_D(Ht, h) + 2 corr_{D′t}(h)| ≤ C ( σ + η √(log(|B|T/δ)/(σU)) + η (log(|B|T/δ))^{3/2} ).

B. Resiliency against covariate shift

Let ψ(z) be the Huber loss. Instead of defining a scalar potential, this time we directly define a population potential measure that involves both D and Q. Recall that CX = ‖dDX/dQ‖∞ is necessarily at least one.

Φ(H) = CX E_{x∼Q}[ψ(H(x))] − E_{(x,y)∼D}[yH(x)]
Φ′(H, h) = CX E_{x∼Q}[ψ′(H(x))h(x)] − E_{(x,y)∼D}[yh(x)]

We describe a variant of Algorithm 1 that can tolerate a mismatch between DX and Q. The key modification happens in Line 6 of Algorithm 3, where pseudo-labeled samples created from the unlabeled distribution Q are sampled at a higher rate than labeled samples. The uniform convergence argument still applies after some minor adjustments.

Lemma B.1. There exists a universal constant C > 0 such that with probability 1 − δ, for all t ∈ [T] and h ∈ B ∪ {h∗},

|Φ′(Ht, h) + (1 + CX) corr_{Dt}(h)| ≤ εGen := C ( √((VC(B) + log(1/δ))/S) + CX √((VC(B) + log(1/δ))/U) ).

Proof of Lemma B.1. By the definition of Dt, we have that

corr_{Dt}(h) = (1/(1 + CX)) ( Ê_{D̂}[yh(x)] − CX Ê_{D̂t}[ψ′(Ht(x))h(x)] ),

Algorithm 3 Covariate-shift Resistant Agnostic Boosting with Unlabeled Data
1: Inputs: Samplers for labeled data from D and unlabeled data from Q, γ-agnostic weak learning oracle W, parameters η, T, τ, S, U, S0, m.
2: Initialize a zero hypothesis H1 = 0.
3: Sample S labeled examples to create dataset D̂.
4: for t = 1 to T do
5: Sample U unlabeled examples to create dataset D̂t.
6: Construct a resampling distribution Dt that:
   A. With probability 1/(1 + CX), returns (x, y) picked uniformly from D̂.
   B.
With remaining probability, picks x uniformly from D̂t and returns (x, ŷ), where ŷ is chosen as +1 with probability pt(x) = (1 − ψ′(Ht(x)))/2, and −1 with the remaining probability.
7: Sample m times from Dt to create dataset D̂′t.
8: Call the weak learner on D̂′t to get Wt = W(D̂′t).
9: if corr_{D̂′t}(Wt) = Σ_{(x,ŷ)∈D̂′t} ŷWt(x) > τ then
10: Update Ht+1 = Ht + ηWt/γ.
11: else
12: Update Ht+1 = Ht − η sign(Ht).
13: end if
14: end for
15: Sample S0 labeled examples to create dataset D̂0.
16: Output h = arg max_{h∈{sign(Ht) : t∈[T]}} Σ_{(x,y)∈D̂0} yh(x).

where we use the fact that Line 6.B in Algorithm 3 ensures E[ŷ|x] = −ψ′(Ht(x)). Since D̂ and D̂t are composed of IID draws from D and Q respectively, we have that, for some constant C′ > 0, with probability 1 − δ, |Ê_{D̂}[yh(x)] − corr_D(h)| ≤ C′√((VC(B) + log(1/δ))/S) and |Ê_{D̂t}[ψ′(Ht(x))h(x)] − E_{x∼Q}[ψ′(Ht(x))h(x)]| ≤ C′√((VC(B) + log(1/δ))/U). Consequently,

|(1 + CX) corr_{Dt}(h) + Φ′(Ht, h)| ≤ CX |Ê_{D̂t}[ψ′(Ht(x))h(x)] − E_{x∼Q}[ψ′(Ht(x))h(x)]| + |Ê_{D̂}[yh(x)] − corr_D(h)|.

The decomposition above completes the proof.

We will need the following well-known property of Radon-Nikodym derivatives.

Lemma B.2. For any non-negative function f : X → R≥0, it holds that CX E_{x∼Q}[f(x)] ≥ E_{x∼DX}[f(x)].

Proof. Since CX ≥ dDX/dQ, we have CX E_{x∼Q}[f(x)] ≥ E_{x∼Q}[f(x) (dDX/dQ)(x)] = E_{x∼DX}[f(x)].

The key idea in the analysis occurs in the following lemma. Essentially, it says that oversampling the ψ part from unlabeled data does not hurt the correlation gap.

Lemma B.3. For any real-valued classifier H : X → R, we have Φ′(H, sign(H)) − Φ′(H, h∗) ≥ corr_D(h∗) − corr_D(sign(H)).

Proof. Using Lemma B.2 above, we arrive at

Φ′(H, sign(H)) − Φ′(H, h∗) = CX E_{x∼Q}[ψ′(H(x))(sign(H(x)) − h∗(x))] − E_{(x,y)∼D}[y(sign(H(x)) − h∗(x))]
≥ E_{x∼DX}[ψ′(H(x))(sign(H(x)) − h∗(x))] − E_{(x,y)∼D}[y(sign(H(x)) − h∗(x))]
= E_{(x,y)∼D}[(ψ′(H(x)) − y)(sign(H(x)) − h∗(x))].

Recall that z and ψ′(z) always have the same sign, and hence so do ψ′(z) and sign(z).
This ensures the non-negativity required by Lemma B.2, as ψ′(H(x))(sign(H(x)) − h∗(x)) = |ψ′(H(x))| − ψ′(H(x))h∗(x) ≥ 0, since h∗(x) is restricted to {±1}. From here onward, our original proof strategy works. In particular, since y² = 1, we get

Φ′(H, sign(H)) − Φ′(H, h∗) ≥ E_{(x,y)∼D}[(1 − yψ′(H(x))) · y(h∗(x) − sign(H(x)))].

As before, consider any (x, y) such that yH(x) > 0: here y(h∗(x) − sign(H(x))) ≤ 0; furthermore, since y and H(x) have the same sign, so do y and ψ′(H(x)), and hence (1 − yψ′(H(x))) ≤ 1. Similarly, whenever yH(x) < 0: then y(h∗(x) − sign(H(x))) ≥ 0, and y and ψ′(H(x)) have opposite signs, implying (1 − yψ′(H(x))) ≥ 1. Now the claim follows, as

Φ′(H, sign(H)) − Φ′(H, h∗) ≥ E_{(x,y)∼D}[ 1_{yH(x)≥0} · (1 − yψ′(H(x))) · y(h∗(x) − sign(H(x))) + 1_{yH(x)<0} · (1 − yψ′(H(x))) · y(h∗(x) − sign(H(x))) ]
≥ E_{(x,y)∼D}[y(h∗(x) − sign(H(x)))] = corr_D(h∗) − corr_D(sign(H)).

We are finally ready to prove the main result.

Proof of Theorem 5.1. Let us dispense with the random events at once. The success of Lemma B.1, the event that max_{t∈[T]} |corr_{D̂0}(sign Ht) − corr_D(sign Ht)| ≤ ε/10, and the event that max_{t∈[T]} |corr_{D̂′t}(Wt) − corr_{Dt}(Wt)| ≤ γε/20CX can be ensured with probability 1 − δ by a simple application of Hoeffding's inequality and a union bound, given the settings m = m(ε0, δ0) + O((CX)²/γ²ε²) and S0 = O(1/γ²ε²). Similarly, εGen ≤ γε/10 holds in Lemma B.1 for S = Ω((VC(B) + log(1/δ))/γ²ε²) and U = Ω((CX)²(VC(B) + log(1/δ))/γ²ε²).

Let ht = (Ht+1 − Ht)/η. Since ψ is 1-smooth, we have

Φ(Ht+1) − Φ(Ht) ≤ ηΦ′(Ht, ht) + η²CX/(2γ²).    (5)

Case A: Consider any step t where corr_{D̂′t}(Wt) > τ. Here ht = Wt/γ. It follows from Lemma B.1 that

Φ′(Ht, ht) ≤ ( −(1 + CX) corr_{Dt}(Wt) + εGen )/γ ≤ ( −(1 + CX) corr_{D̂′t}(Wt) + (1 + CX)γε/20CX + εGen )/γ ≤ −ε,

where τ = 2γε/(1 + CX) and η = γ²ε/CX. By Equation (5), the potential drops: Φ(Ht+1) − Φ(Ht) ≤ −γ²ε²/2CX.

Case B: Consider any step t where corr_{D̂′t}(Wt) ≤ τ and, crucially, Φ′(Ht, sign Ht) ≥ ε. Here ht = −sign Ht. Since Φ′(Ht, ht) = −Φ′(Ht, sign Ht) ≤ −ε, by Equation (5), we again have Φ(Ht+1) − Φ(Ht) ≤ −γ²ε²/2CX. At initialization, Φ(0) = 0.
Further, for any H : X → R, using the non-negativity properties of the Huber loss ψ, we have

Φ(H) = CX E_{x∼Q}[ψ(H(x))] − E_{(x,y)∼D}[yH(x)] ≥ E_{x∼DX}[ψ(H(x))] − E_{(x,y)∼D}[yH(x)] ≥ −1/2.

Thus, at initialization, Φ is at most half away from its minimum. Hence, setting T = 2CX/γ²ε², there must arise an iterate at which neither Case A nor Case B holds. That is, there is some s ∈ [T] such that corr_{D̂′s}(Ws) ≤ τ and Φ′(Hs, sign Hs) ≤ ε.

Now, using Lemma B.1 and the fact that the weak learner γ-approximately maximizes correlation (Definition 2.1), we have

Φ′(Hs, h∗) ≥ −(1 + CX) corr_{Ds}(h∗) − 2εGen ≥ −((1 + CX)/γ)(corr_{Ds}(Ws) + ε0) − 2εGen ≥ −((1 + CX)/γ)(corr_{D̂′s}(Ws) + ε0) − 3εGen ≥ −(1 + CX)ε0/γ − 12ε/5,

where in the last step we recall τ = 2εγ/(1 + CX) and εGen ≤ εγ/10. By Lemma B.3, we have

corr_D(h∗) − corr_D(sign(Hs)) ≤ Φ′(Hs, sign(Hs)) − Φ′(Hs, h∗) ≤ (1 + CX)ε0/γ + 17ε/5.

To complete the proof, we observe that

corr_D(h) ≥ corr_{D̂0}(h) − ε/10 ≥ corr_{D̂0}(sign Hs) − ε/10 ≥ corr_D(sign Hs) − ε/5 ≥ corr_D(h∗) − (1 + CX)ε0/γ − 18ε/5,

where we use the fact that h maximizes the empirical correlation on the dataset D̂0.

C. Additional experimental details

For PAB, the number of samples that can be fed to the weak learner in a round scales inversely with the number of boosting rounds, as the algorithm requires fresh samples each round. As such, we perform a grid search over the number of boosting rounds T ∈ {25, 50, 100}, while we simply use 100 for our implementation of Algorithm 1. In both algorithms, we search over the parameter m, the number of samples fed to the weak learner each round, with a grid of {5, 20, 50, 100}, though if such a setting is invalid for PAB, we continue until all samples are used. Our experiments use the fractional relabeling scheme stated in (Kanade & Kalai, 2009), intended to reduce the stochasticity the algorithm is subject to. In particular, rather than sampling labels, we provide both (x, +1) and (x, −1) in our dataset, with weights equal to their sampling probabilities.
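The fractional relabeling scheme just described can be sketched as follows, assuming the probability p of a +1 pseudo-label for each example has already been computed (e.g., as (1 − ψ′(H(x)))/2); the helper name is ours, not from the paper's code.

```python
def fractional_relabel(xs, p_plus):
    """Fractional relabeling (Kanade & Kalai, 2009): instead of sampling one
    pseudo-label per example, emit both signed copies weighted by their
    sampling probabilities, removing the label-sampling variance entirely."""
    data = []
    for x, p in zip(xs, p_plus):
        data.append((x, +1, p))         # (example, label, weight)
        data.append((x, -1, 1.0 - p))
    return data

# Two unlabeled points with +1-probabilities 0.5 and 0.9.
data = fractional_relabel([0.0, 1.0], [0.5, 0.9])
```

A weak learner that accepts weighted examples then sees the exact conditional label distribution rather than a noisy draw from it.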
Experiments were all run on an M1 MacBook Pro and complete within an hour. Multiclass datasets are converted to binary.

D. Proof of Theorem 6.1

Proof. We observe that ERM on the Fourier basis χS(x) = ∏_{i∈S} xi, namely parities on subsets S, can be used to produce a weak learner (Klivans et al., 2004). In particular, an n-dimensional halfspace can be approximated under the uniform weighting on the hypercube to ε² ℓ2-error using the degree-limited basis Bn,d = {±χS : |S| ≤ d}, where d = 20ε^{−4}. As a result, at least one h ∈ Bn,d must have high correlation.

Lemma D.1 (Lemma 5 in (Kalai et al., 2008)). Let D be any data distribution over {±1}ⁿ × {−1, 1} with marginal distribution Unif({±1}ⁿ) on the features. For any fixed ε and d = 20ε^{−4}, there exists some h ∈ Bn,d such that corr_D(h) ≥ n^{−d} (max_{c∈H} corr_D(c) − ε).

The result follows directly from the preceding lemma, which provides a weak learner for the task, and Theorem 3.1. We note that |Bn,d| < n^d and γ = n^{−d}, so log |Bn,d|/(γ²ε²) ≤ d n^{2d} log(n)/ε².

The unlabeled samples used in Algorithm 1 can be produced by sampling from the hypercube, adding to the n^{poly(1/ε)} runtime but not to the sample complexity.

E. Boosting for reinforcement learning

In this section, we consider boosting in the reinforcement learning setting. We wish to separately account for the number of reward-annotated episodes and the number of reward-free episodes needed to learn a near-optimal policy. Consider a Markov Decision Process M = (S, A, r, P, β, µ0), where S is a set of states, A = {±1} is a binary set of actions, r : S × A → [0, 1] determines the (expected) reward at any state-action pair (which is sometimes available), P : S × A → ∆(S) captures the transition dynamics of the MDP, i.e., P(s′|s, a) is the probability of moving to state s′ upon taking action a at state s, β ∈ [0, 1) is the discount factor, and µ0 is the initial state distribution. Let Qπ(s, a) and V π(s) be the state-action and state value functions.
Let V π_µ = E_{s∼µ}[V π(s)] be the expected total reward when starting from the state distribution µ, and write V π = V π_{µ0}. Finally, the occupancy measure µπ_µ induced by a policy π starting from an initial state distribution µ is the discounted distribution over states visited, µπ_µ(s) ∝ Σ_{h≥0} β^h Pr[sh = s | s0 ∼ µ, π]; we take µπ = µπ_{µ0} as a matter of convention. In the episodic model, the learner interacts with the MDP in a limited number of episodes of reasonable length (i.e., (1 − β)^{−1}), and the starting state of the MDP is always drawn from µ0. In the second, termed rollouts with ν-resets, the learner's interaction is still limited to a small number of episodes; however, the MDP now samples its starting state from ν. It is important to stress that in both cases the learner's objective is the same: to maximize V π starting from µ0. However, ν could be more spread out over the state space than µ0, providing an implicit source of exploration, and the learner's guarantee, as shown next, benefits from its dependence on a milder notion of distribution mismatch in this case. In this setting, we do not always assume the reward is revealed. We consider a model where we can roll out a policy and observe rewards, or alternatively can only observe the state trajectories. Since we have binary actions, our weak learners are policies, which we denote π instead of h. This notion is equivalent to that used by Brukhim et al. (2022) and Ghai & Singh (2024), because for binary actions, a random policy induces an accuracy of one half regardless of the distribution over features and labels. Let π∗ ∈ arg max_π V π be a reward-maximizing policy, and V ∗ be its value. Let Π be the convex hull of the boosted policy class, i.e., of the outputs of the boosting algorithm. For any state distribution µ′, define the policy completeness term Eµ′ as

Eµ′ = max_{π∈Π} min_{π′∈Π} E_{s∼µπ_{µ′}}[ max_{a∈A} Qπ(s, a) − E_{a∼π′(·|s)} Qπ(s, a) ].
In words, this term captures how well the greedy policy improvement operator is approximated by $\Pi$, in a state-averaged sense, over the distribution induced by any policy in $\bar{\Pi}$. Finally, we define the distribution mismatch coefficients below:
$$C_\infty = \max_{\pi \in \bar{\Pi}} \left\| \mu^{\pi^\star} / \mu^{\pi} \right\|_\infty, \qquad D_\infty = \left\| \mu^{\pi^\star} / \nu \right\|_\infty.$$

Theorem E.1. Let $\mathcal{W}$ be a $\gamma$-weak learner for the policy class $\Pi$ operating with a base class $B$, with sample complexity $m(\varepsilon_0, \delta_0) = (\log(|B|/\delta_0))/\varepsilon_0^2$. Fix a tolerance $\varepsilon$ and failure probability $\delta$. In the episodic access model, there is a setting of parameters such that Algorithm 4, when given access to $\mathcal{W}$, produces a policy $\bar{\pi}$ such that with probability $1-\delta$, we have
$$V^\star - V^{\bar{\pi}} \le \frac{C_\infty\, \mathcal{E}_{\mu_0}}{(1-\beta)^2} + \varepsilon,$$
while sampling $O\!\left(\frac{C_\infty^5 \log |B|}{(1-\beta)^{9}\gamma^3\varepsilon^4}\right)$ episodes of length $O((1-\beta)^{-1})$ without reward feedback (via Algorithm 6) and $O\!\left(\frac{C_\infty^4 \log |B|}{(1-\beta)^{7}\gamma^2\varepsilon^3}\right)$ episodes of length $O((1-\beta)^{-1})$ with reward feedback (via Algorithm 5).

Algorithm 4 RL Boosting, adapted from Brukhim et al. (2022)
1: Input: iteration budget $T$, state distribution $\mu$, step sizes $\eta_t$, post-selection sample size $P$.
2: Initialize a policy $\pi_0 \in \Pi$ arbitrarily.
3: for $t = 1$ to $T$ do
4: Run Algorithm 2 to get $\pi'_t$, using Algorithm 5 (ignoring $\widehat{Q}$), executed with the current policy $\pi_{t-1}$ from the initial state distribution $\mu$, to produce the distribution over state-actions serving as the labeled samples, and Algorithm 6, executed with the current policy $\pi_{t-1}$ from the initial state distribution $\mu$, to produce the distribution over states serving as the unlabeled samples.
5: Update $\pi_t = (1-\eta_t)\pi_{t-1} + \eta_t \pi'_t$.
6: end for
7: Run each policy $\pi_t$ for $P$ rollouts to compute an empirical estimate $\widehat{V}^{\pi_t}$ of the expected return.
8: return $\bar{\pi} = \pi_{t^\star}$, where $t^\star = \arg\max_t \widehat{V}^{\pi_t}$.

Algorithm 5 Trajectory Sampler, adapted from Brukhim et al. (2022)
1: Sample state $s_0 \sim \mu$ and action $a' \sim \mathrm{Unif}(\mathcal{A})$.
2: Sample $s_h \sim \mu^\pi$ as follows: at every step $h$, with probability $\beta$, execute $\pi$; else, accept $s_h$.
3: Take action $a'$ at state $s_h$, then continue to execute $\pi$, with a termination probability of $1-\beta$ at each step.
Upon termination, set $R(s_h, a')$ to the sum of rewards from time $h$ onwards.
4: Define the vector $\widehat{Q}$ such that, for all $a \in \mathcal{A}$, $\widehat{Q}(a) = 2R(s_h, a')\,\mathbb{I}[a = a']$.
5: With probability $C'\,\widehat{Q}(a')$, set $y = a'$; else set $y \in \mathcal{A} \setminus \{a'\}$, where $C' = (1-\beta)/2$.
6: return $(s_h, \widehat{Q}, y)$.

In the $\nu$-reset access model, there is a setting of parameters such that Algorithm 4, when given access to $\mathcal{W}$, produces a policy $\bar{\pi}$ such that with probability $1-\delta$, we have
$$V^\star - V^{\bar{\pi}} \le \frac{D_\infty\, \mathcal{E}_\nu}{(1-\beta)^2} + \varepsilon,$$
while sampling $O\!\left(\frac{D_\infty^5 \log |B|}{(1-\beta)^{15}\gamma^3\varepsilon^5}\right)$ episodes of length $O((1-\beta)^{-1})$ without reward feedback (via Algorithm 6) and $O\!\left(\frac{D_\infty^4 \log |B|}{(1-\beta)^{12}\gamma^2\varepsilon^4}\right)$ episodes of length $O((1-\beta)^{-1})$ with reward feedback (via Algorithm 5).

Algorithm 6 Reward-free Trajectory Sampler
1: Sample state $s_0 \sim \mu$ and action $a' \sim \mathrm{Unif}(\mathcal{A})$.
2: Sample $s_h \sim \mu^\pi$ as follows: at every step $h$, with probability $\beta$, execute $\pi$; else, accept $s_h$.
3: Take action $a'$ at state $s_h$, then continue to execute $\pi$, with a termination probability of $1-\beta$ at each step.
4: return $s_h$.

Proof. The proof follows by applying the result of Theorem 4.1 within the proof of Theorem 22 from Ghai & Singh (2024). For the episodic model, applying the second part of Theorem 9 in Brukhim et al. (2022), while noting the smoothness of $V^\pi$, and combining the result with Lemma 18 and Lemma 11 in Brukhim et al. (2022), yields the per-round guarantee with probability $1 - T\delta'$. Following the logic of Theorem 22 from Ghai & Singh (2024), we need to ensure that the output of Algorithm 2, as instantiated in Algorithm 4, has in every round an excess correlation gap over the best policy in $\Pi$ of no more than $(1-\beta)^2\varepsilon/C_\infty$, which Algorithm 2 assures us can be accomplished with $O\!\left(\frac{C_\infty^3 \log |B|}{(1-\beta)^6\gamma^3\varepsilon^3}\right)$ unlabeled samples and $O\!\left(\frac{C_\infty^2 \log |B|}{(1-\beta)^4\gamma^2\varepsilon^2}\right)$ labeled samples per round. The total number of samples is a factor of $T = O\!\left(\frac{C_\infty^2}{(1-\beta)^3\varepsilon}\right)$ greater. Similarly, for the $\nu$-reset model, we need to ensure that the output of Algorithm 2, as instantiated in Algorithm 4, has in every round an excess correlation gap over the best policy in $\Pi$ of no more than $(1-\beta)^3\varepsilon/D_\infty$, which Algorithm 2 assures us can be accomplished with $O\!\left(\frac{D_\infty^3 \log |B|}{(1-\beta)^9\gamma^3\varepsilon^3}\right)$ unlabeled samples and $O\!\left(\frac{D_\infty^2 \log |B|}{(1-\beta)^6\gamma^2\varepsilon^2}\right)$ labeled samples per round. The total number of samples is a factor of $T = O\!\left(\frac{D_\infty^2}{(1-\beta)^6\varepsilon^2}\right)$ greater.
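Algorithms 5 and 6 share the same geometric-stopping scheme for drawing $s_h \sim \mu^\pi$, and Algorithm 5 additionally converts the observed return into a $\pm 1$ label via a Bernoulli draw; with rewards in $[0,1]$ and effective horizon $(1-\beta)^{-1}$, the acceptance probability $C'\,\widehat{Q}(a') = (1-\beta)R$ stays in $[0,1]$. The following is our illustrative sketch of both samplers on a toy deterministic chain (the environment, policy, and all names here are hypothetical and ours, not the paper's):

```python
import random

def trajectory_sample(mu0, policy, step, reward, beta, rng, labeled=True):
    """Sketch of Algorithms 5 and 6.

    Geometric stopping draws s_h ~ mu^pi (accept the current state with
    probability 1 - beta at each step); in the labeled case we then play a
    uniformly random action a', roll out pi with the same termination
    probability to accumulate the return R, and convert R into a +-1 label y
    with acceptance probability C' * Q_hat(a') = (1 - beta) * R.
    """
    s = mu0(rng)
    while rng.random() < beta:          # with prob beta, take one more step of pi
        s = step(s, policy(s, rng), rng)
    if not labeled:                      # Algorithm 6: return the bare state
        return s
    a_prime = rng.choice([-1, +1])       # uniform exploratory action
    R, cur = reward(s, a_prime), step(s, a_prime, rng)
    while rng.random() < beta:           # continue pi; terminate w.p. 1 - beta
        a = policy(cur, rng)
        R += reward(cur, a)
        cur = step(cur, a, rng)
    q_hat = {a: 2.0 * R if a == a_prime else 0.0 for a in (-1, +1)}
    y = a_prime if rng.random() < (1.0 - beta) / 2.0 * q_hat[a_prime] else -a_prime
    return s, q_hat, y

# Toy deterministic chain: a single absorbing state 0, unit reward everywhere.
rng = random.Random(0)
mu0 = lambda rng: 0
policy = lambda s, rng: +1
step = lambda s, a, rng: 0
reward = lambda s, a: 1.0
s, q_hat, y = trajectory_sample(mu0, policy, step, reward, beta=0.5, rng=rng)
```

On this toy chain the return is $1$ plus a geometric number of further steps, so $\widehat{Q}(a')/2$ averages to $(1-\beta)^{-1} = 2$; the boosting loop of Algorithm 4 would feed $(s_h, y)$ pairs from this sampler to the weak learner as labeled data, and bare states from the `labeled=False` branch as unlabeled data.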