# provable_weaktostrong_generalization_via_benign_overfitting__a36c4a97.pdf

Published as a conference paper at ICLR 2025

PROVABLE WEAK-TO-STRONG GENERALIZATION VIA BENIGN OVERFITTING

David X. Wu Department of EECS UC Berkeley Berkeley, CA 94720 david_wu@berkeley.edu

Anant Sahai Department of EECS UC Berkeley Berkeley, CA 94720 sahai@eecs.berkeley.edu

The classic teacher-student model in machine learning posits that a strong teacher supervises a weak student to improve the student s capabilities. We instead consider the inverted situation, where a weak teacher supervises a strong student with imperfect pseudolabels. This paradigm was recently brought forth by Burns et al. (2023) and termed weak-to-strong generalization. We theoretically investigate weak-to-strong generalization for binary and multilabel classification in a stylized overparameterized spiked covariance model with Gaussian covariates where the weak teacher s pseudolabels are asymptotically like random guessing. Under these assumptions, we provably identify two asymptotic phases of the strong student s generalization after weak supervision: (1) successful generalization and (2) random guessing. Our techniques should eventually extend to weak-to-strong multiclass classification. Towards doing so, we prove a tight lower tail inequality for the maximum of correlated Gaussians, which may be of independent interest. Understanding the multilabel setting reinforces the value of using logits for weak supervision when they are available.

1 INTRODUCTION

Motivated by the problem of aligning increasingly capable models, Burns et al. (2023) introduced the framework of weak-to-strong generalization1, which draws an analogy between humans supervising superhuman AIs and weaker teacher models supervising stronger student models. This inverts the classic teacher-student model framework, which typically assumes that the teacher is stronger than the student. Burns et al. (2023) found that using GPT-2 to finetune GPT-4 can recover most of the performance of standard supervised finetuning with human-annotated data across standard NLP benchmarks, but struggles on more difficult tasks such as chess puzzles or reward modeling. One failure mode observed by Burns et al. (2023) is that the strong student sometimes learns to mimic the weak teacher, i.e. the strong model overfits to the limitations of the weaker one. Since Burns et al. (2023), this phenomenon has been studied in a variety of other empirical settings (see e.g. Ji et al. (2024); Guo et al. (2024); Liu & Alahi (2024); Yang et al. (2024); Tao & Li (2024)).

These empirical observations naturally lead to the question that we aim to answer in this paper: Can we identify a simple, concrete theoretical setting where we can provably exhibit different phases of weak-to-strong generalization? Below, we survey closely related areas of research.

Pseudolabeling and synthetic data. In semi-supervised learning, pseudolabeling refers to the method of using one model s outputs to generate pseudolabels for unlabeled data (Lee et al., 2013; Arazo et al., 2020; Rizve et al., 2021; Zhang et al., 2021; Cascante-Bonilla et al., 2021; He et al., 2024). These pseudolabels are then used to supervise the target model. We consider using the weak model to generate pseudolabels for the strong model in our concrete setting.

1This situation is related, but slightly different, from the earlier identified problem of easy-to-hard generalization (see e.g. Schwarzschild et al. (2021); Hase et al. (2024)) wherein a model is trained on easy cases but has to generalize to hard cases. The most important distinction is that in weak-to-strong generalization, the weak teacher actually gets things wrong.

Published as a conference paper at ICLR 2025

Synthetic data generation is another widely popular paradigm for scaling up models in the absence of human-annotated data; see, e.g. (Nikolenko, 2021; Liu et al., 2024; Chen et al., 2021; Figueira & Vaz, 2022) and references therein. In this approach, one uses generative models to generate synthetic data, which can then be labeled and used to supervise other models. The use of synthetic data to train models resembles the weak-to-strong setup that we study, although we assume the unlabeled datapoints are generated from the ground-truth distribution. Nevertheless, we believe that extending our techniques may yield interesting insights on the success and failure modes of synthetic data.

Supervised finetuning and scaling laws. In contemporary deep learning, the dominant paradigm is to pretrain large (e.g. billions of parameters) models on copious (e.g. trillions of tokens) unlabeled data in a self-supervised fashion and then finetune the model on relatively tiny (e.g. less than 100K) finetuning datasets. The prevailing wisdom suggests that (1) pretraining helps the model learn useful features that can be repurposed for specific tasks and (2) the optimal size of models should scale with the amount of data (see, e.g. (Kaplan et al., 2020; Wei et al., 2021; Wang et al., 2022b; Liu et al., 2022; Hoffmann et al., 2022)). Much effort has gone into improving the effectiveness and efficiency of finetuning (Houlsby et al., 2019; Hu et al., 2021; Lester et al., 2021; Dettmers et al., 2024).

According to NTK theory, if the model weights do not move far from initialization during finetuning, we can approximate finetuning as training a generalized linear model using (tangent) features. These features arise from the gradients of the network s outputs with respect to the parameters at finetuning initialization. This motivates our studying the behavior of interpolating linear models in the context of supervised finetuning. Our concrete theoretical setting sits within the Gaussian-features linearmodel style of Wei et al. (2022); Belkin et al. (2020); Mei & Montanari (2022); Bartlett et al. (2020); Muthukumar et al. (2020; 2021); Chatterji & Long (2021); Wang & Thrampoulidis (2021); Subramanian et al. (2022); Wang et al. (2022a); Wang et al. (2021); Cornacchia et al. (2023).

Knowledge distillation. The topic of weak-to-strong generalization is related to the extensive literature on knowledge distillation, which originated in the desire to compress large powerful models (teachers) into smaller ones (students) (Hinton et al., 2015; Buciluˇa et al., 2006). See Gou et al. (2021) for a general survey. Theoretical perspectives for the essentially underparameterized case were developed in Phuong & Lampert (2019); Ji & Zhu (2020), and the theoretically engaged literature in this area has continued to expand; see Yuan et al. (2024); Ojha et al. (2023); Safaryan et al. (2023); Alballa & Canini (2024); Hong et al. (2024); Sarnthein et al. (2023); Zhao & Zhu (2023); Das & Sanghavi (2023); Xu et al. (2024); Nagarajan et al. (2023); Borup & Andersen (2023); Harutyunyan (2023); Stanton et al. (2021); Mobahi et al. (2020) for a few representative more recent examples.

The question of how knowledge distillation can improve the student s generalization to surpass the teacher (especially when they have the same architecture) has led to the identification of several different underlying mechanisms (Yuan et al., 2024; Safaryan et al., 2023; Sarnthein et al., 2023; Das & Sanghavi, 2023; Nagarajan et al., 2023; Mobahi et al., 2020). Of these, the closest in spirit to our approach is the regularization viewpoint of Mobahi et al. (2020), which studies a kernel-regression model. They call out the crucial role of the spectrum of the Gram matrix more specifically, self-distillation accentuates the importance of the larger eigenvalues, which has a regularizing effect by making the corresponding basis functions more prominent in the learned pattern.

Concurrent theoretical work. Zhang et al. (2024) engages with generative models and observes that temperature can play an important role in allowing a trained model to surpass its training sources. Somerstep et al. (2024) takes a transfer-learning perspective and asserts that naive fine-tuning on weak pseudolabels tends not to work; our results show conditions where this does in fact succeed. Charikar et al. (2024) takes a representation-centric perspective in a regression setting and zooms in on the question of how much better the representation is for the stronger model. Lang et al. (2024) studies the classification setting and takes a neighborhood perspective that posits that the stronger model s neighborhood structure allows it to average over the weak labels to get generalization. At a high level, our work along with Mobahi et al. (2020); Charikar et al. (2024); Lang et al. (2024) all circle around the idea that weak-to-strong generalization works when the cascading learning process purifies representations in the true direction and contracts in false directions.

Published as a conference paper at ICLR 2025

1.1 CONTRIBUTIONS

In this paper, we explore weak-to-strong generalization in a stylized theoretical model that captures the dynamics of finetuning with weak supervision studied in Burns et al. (2023). Under a simple overparameterized spiked covariance model for the pretraining features, we prove that finetuning an overparameterized linear classifier using minimum ℓ2 norm interpolation on top of these features provably exhibits two distinct phases of weak-to-strong generalization. In particular, under certain scalings, as we increase the number of weakly labeled finetuning examples, the strong learner s asymptotic accuracy transitions from (1) random guessing to (2) perfect generalization; see Section 3 for a precise statement.

To be specific, we study the generalization of interpolating linear models with and without weak supervision. Although our results can be generalized to any weak teacher which produces logits using a linear head, we assume for the sake of concreteness that the weak teacher is also an interpolating linear model which can be fully expressed by the strong student. We discuss how our theoretical results connect to realistic supervised finetuning scenarios in Section 4.

Our results strongly hinge on the tight analysis for benign overfitting in multiclass classification in Wu & Sahai (2024), although the study of benign overfitting for regression and binary classification was already carried out by Bartlett et al. (2020); Muthukumar et al. (2021).

2 PRELIMINARIES AND SETUP

Below, we set up the weak-to-strong learning task in Section 2.1, along with the data assumptions in Section 2.2, and finally specify the concrete end-to-end learning algorithm we study in Section 2.3.

2.1 WEAK-TO-STRONG SETUP

To study weak-to-strong generalization, we will consider a simple setup which encapsulates the dynamics of standard supervised finetuning as well as training linear probes on intermediate activations. We will now give a high level description of the weak-to-strong setting, and in subsequent sections formally define the specific assumptions we make to theoretically study weak supervision.

Suppose we have n labeled datapoints and m = nu unlabeled datapoints, where u > 1. This matches the modern ML paradigm where labeled data is scarce and unlabeled data is abundant.

We assume that we have access to two sets of features extracted from the datapoints: the weak and strong features. Using these features, we will create a weak-to-strong setup with a weak model and a strong model, where the weak model is used to generate hard pseudolabels to train the strong model (see Procedure 1 for a formal definition). The learning task at hand is binary classification, where each model uses its respective features obtained from the datapoints see Section 4 for how our techniques apply to the multiclass setting.

Clearly, to get nontrivial learning guarantees, there needs to be some relationship between the weak and strong features. To specify this, we will assume that the weak and strong features come from an appropriate weak-to-strong ensemble of features, which we formally define in Section 2.2. One way to interpret the weak-to-strong ensemble is that the true hidden direction is highlighted more in the strong features than the weak features; see Figure 1.

In supervised finetuning and linear probing, the strong and weak features come from pretraining via the neural tangent features and intermediate activations, respectively. For example, in the GPT2 to GPT4 weak-to-strong setup of Burns et al. (2023), the weak features come from GPT2 pretraining, whereas the strong features come from GPT4 pretraining.

Broadly speaking, we study the case where the true labels are generated by a distinguished (but unknown) low-rank subspace hidden in very high dimensional space.

To be concrete, we will study two different classifiers in this setup:

(1) fweak: train/finetune on n datapoints using weak features in Rdweak and ground-truth labels. (2) fw2s: train/finetune on m n datapoints using strong features in Rd and hard pseudolabels generated from fweak.

Published as a conference paper at ICLR 2025

We study a scheme where fweak and fw2s are linear models trained by performing minimum ℓ2 interpolation (MNI) on their respective training sets; see Section 2.3 for a formal definition of MNI. Remark 2.1. Instead of training with hard (categorical) pseudolabels, one could use the real-valued scores from fweak. This would only affect constants which are not crucial to any of our results.

We will measure generalization via the test accuracy of these classifiers. In particular, let ℓ( ) be the 0-1 loss function, and let E[ℓ( )] be the expected test error over a fresh test sample. We introduce the following desiderata to define weak-to-strong generalization for binary classification. Desiderata 1. The main desiderata are the following:

(i) The strong model asymptotically generalizes2 when trained on the m weakly labeled datapoints: E[ℓ(fw2s)] = on(1).

(ii) The strong model can fully represent the weak model.

(iii) The weak model asymptotically does not generalize: E[ℓ(fweak)] = 1

In Section 3, we show that the above desiderata are achievable in a simple toy model; see Theorem 3.3 for a formal statement. This paints a rather striking picture: there are situations where the weak labels asymptotically look like random guessing (Desideratum 1.iii), the strong model can perfectly imitate the weak labels (Desideratum 1.ii), yet the strong model still asymptotically generalizes by extracting enough signal out of the plentiful weak labels (Desideratum 1.i).3

We also include some bonus desiderata, which paint a comparison to natural alternative training methods. We can also provably achieve these bonus desiderata in certain regimes; see Remark 3.4. Desiderata 2. The extra desiderata are the following:

(i) PCA cannot recover the low-rank structure from n + m observations of the strong features.

(ii) Let fstrong be the strong model when trained on n datapoints using strong features in Rd

and ground-truth labels. Then fstrong asymptotically fails: E[ℓ(fstrong)] = 1

2.2 DATA MODEL

Throughout, we will consider data with zero mean Gaussian covariates, which can be viewed as an idealized version of the pretraining features or representations. These covariates are generated from an ambient standard Gaussian vector g = (g1, . . . , g D) RD. We make Gaussianity assumptions for the sake of theoretical tractability; in Section 4 we discuss potential extensions to other settings.

Covariates. A learner observes iid features xi N(0, Σ), where we emphasize that the feature covariances Σ Rd d are unknown and different for each learner. The different Σ s capture how weak or strong the features are for each model; we make this precise below. Note that xi is a linear transformation of the underlying randomness gi N(0, ID). We will often refer to the eigendecomposition Σ = UΛU , where U Rd d is orthogonal and Λ Rd d is diagonal.

Labels. For binary classification, we consider hard labels generated by the signs of Gaussians, so that y = sgn( g, v ), where v is an unknown unit-norm direction. To analyze the generalization of various learners, we will allow ourselves to study labels generated by various directions v, not just the true v , all of which we assume are unknown.4 For multiclass classification, we assume the labels are generated via y = arg maxj [k] g, v(j) , where v(1) , . . . , v(k) are all unknown.

Following Muthukumar et al. (2021); Subramanian et al. (2022); Wu & Sahai (2024), we will make the following assumption on the true label directions to simplify the analysis. However, to study weak-to-strong generalization, we will eventually have to analyze a weak label direction which is not 1-sparse. One of our main contributions is showing how to analyze this case.

2The natural extension to multiclass settings with k different classes would require E[ℓ(fweak)] = 1 Θ( 1

k). 3In particular, Desideratum 1.ii allows the failure mode of the strong model imitating the weak model and also rules out more trivial sources of weak labels, such as independent label noise. If the strong model cannot represent the noise in the weak labels, then weak-to-strong generalization is intuitively much simpler. 4To study weak supervision, we take v = wweak, the direction that the weak model learns.

Published as a conference paper at ICLR 2025

Assumption 1 (1-sparse assumption). We say that the labels satisfy the 1-sparse assumption relative to covariance Σ if the following holds. The label defining direction v (directions v(1) , . . . , v(k) for multiclass) is aligned with a top eigenvector (top-k eigenvectors for multiclass) of Σ such that, in an eigenbasis where v = e1 and v(i) = ei, respectively, we have y = sgn(x1) (Binary) y = arg max j [k] xj (Multiclass)

For example, if the 1-sparse assumption holds for the strong covariance, then in a strong eigenbasis where the strong features are independent, the labels are generated by axis-aligned directions.

Bi-level ensemble and weak-to-strong ensemble. To simplify the analysis, we follow Muthukumar et al. (2021); Subramanian et al. (2022); Wu & Sahai (2024) and assume the eigenvalues Λ of the covariance are parameterized by the bi-level ensemble defined shortly. The bi-level ensemble is a simple overparameterized version of the well-known spiked covariance model for PCA. Definition 1 (Bi-level ensemble). Let x N(0, Σ), where the covariance Σ = UΛU . The bi-level ensemble parameterizes Λ = Λ(p, q, r), where p > 1, 0 r < 1, and 0 < q < (p r). The number of features (d), number of spiked directions (s), and degree of favoring (a) all scale with the number of training points (n) as follows: d = np , s = nr , a = n q . (1) Then Λ = diag(λi)i [d], where

s , 1 j s (1 a)d

d s , otherwise . (2)

If the above holds, we refer to x, Λ, or Σ as being drawn from the bi-level ensemble. We use the shorthands λF ad

s and λU (1 a)d

d s to denote the favored and unfavored eigenvalues, respectively.

In particular, the parameterization controls the total number of features, d = np n, as well as the dimension of the low-rank subspace s = nr n. For multiclass classification, we allow the number of classes to also scale with the number of datapoints n. Definition 2 (Scaling for multiclass). For multiclass classification with k classes, we have k = ck nt , where 0 t < r and ck is a positive integer.

We now pin down a concrete weak-to-strong ensemble, which boils down to specifying the joint distribution of the weak and strong features. We impose a subset relationship between the weak and strong features in the strong basis; see Figure 1 for a diagram. Assumption 2 (Weak-to-strong subset ensemble). Let Λ = Λ(p, q, r) Rd d denote the strong eigenvalues and Λweak = Λ(pweak, qweak, rweak) Rdweak dweak denote the weak eigenvalues, both drawn from the bi-level ensemble. Let λF,weak aweakdweak

sweak and λU,weak (1 aweak)dweak

dweak sweak denote the weak favored and unfavored eigenvalues, respectively. Let U be any distinguished eigenbasis of Σ where v = e1 (Assumption 1). The weak and strong features in the basis U are related as follows.

(1) xstrong N(0, Λ), where Λ = λF I[s] + λUI[d]\[s].

(2) There exists subsets of coordinates S [s], T [d] \ [s], with 1 S and |S| = sweak, such that

xweak = ( q

λU ΠT )xstrong d= N(0, λF,weak IS + λU,weak IT ) .

Here, ΠS denotes the projection onto the axis-aligned subspace indexed by S.

We will often abuse notation and restrict xweak Rd to the coordinates in S T, viewing xweak Rdweak. Hence, the weak favored features are a subset of the strong favored features, the weak unfavored features are a subset of the strong unfavored features, with different bi-level scalings, and furthermore the 1-sparse assumption holds for both Σ and Σweak.

Clearly, under Assumption 2, desideratum 1.ii is satisfied, and the subset ensemble is essentially the simplest relationship between the features one could impose to achieve the desideratum. In Section 4, we discuss how we expect the results to change if we relax these data modeling assumptions.

Published as a conference paper at ICLR 2025

Figure 1: Visualization of subset ensemble (Assumption 2) relating weak and strong features. Notice the decreased favoring for the weak features, and how the weak features are a subset of the strong features in their respective category. Hence, a linear model on the strong features can simulate one on the weak features (Desideratum 1.ii). Moreover, the label defining directions, represented by the green shaded box, are in the span of both the weak and strong features.

2.3 LEARNING ALGORITHM

Our models are all trained using minimum ℓ2-norm interpolation (MNI), which corresponds to the asymptotic behavior of gradient descent with zero initialization (Ji & Telgarsky, 2021). Define the data matrix X Rn d by X [x1 xn]. Let y(i) Rn be the (centered) one-hot vector for class i [k]. In MNI, we learn a linear score function f (i) via the optimization problem

f (i) = arg min f f 2 s.t. Xf = y(i). (MNI)

The coefficients of f (i) can be easily computed to be f (i) = X (XX ) 1y(i) X A 1y(i), where the matrix A XX Rn n is the unnormalized Gram matrix and invertible almost surely.

At test time, given a fresh sample xtest N(0, Σ), in the binary case we predict by = sgn( f, xtest ), and for multiclass we predict the class with the highest score: by = arg maxi [k] f (i), xtest . To be very explicit, let us formally define the end-to-end traning procedure for weak-to-strong binary classification (cf. the description before Remark 2.1), with an obvious extension to multiclass settings.

Procedure 1 (Weak-to-strong training). The weak learner observes an initial dataset of n datapoints (exi,weak, yi)i [n], where exi,weak are the weak features for the ith datapoint and yi = sgn( gi, v ) is the corresponding clean hard label. We train fweak Rdweak using MNI on these n clean datapoints.

Then, both learners observe m extra unlabeled datapoints, where the weak model sees weak features (xj,weak)j [m] and the strong model sees the corresponding strong features (xj,strong)j [m]. Generate m hard pseudolabels via byj,weak = sgn( fweak, xj,weak ), and use MNI to train fw2s Rd on (xj,strong, byj,weak)j [m].

3 MAIN RESULTS

In this section, we state our main results for binary classification (Theorem 3.3) and multilabel classification (Theorem 3.5). We focus on the regime q + r > u and qweak + rweak > 1, since Muthukumar et al. (2021, Theorem 13) implies that these are the only nontrivial regimes for binary classification under the bi-level ensemble.

In this regime, we also tighten the previous error rates for binary and multiclass classification in the bi-level ensemble from (Wu & Sahai, 2024, Theorem 3.2); the proof can be found in Appendix E. To that end, we establish a new concentration inequality for the lower tail of the maximum of correlated Gaussians, which may be of independent interest; we defer its statement to the end of this section.

Theorem 3.1 (Regimes with clean labels). Suppose the strong features have bi-level covariance Σ = Σ(p, q, r) (Definition 1), where q+r > 1, the true multiclass labels are 1-sparse (Assumption 1), and the number of classes k = nt (Definition 2). Then the test error for fstrong MNI-trained with n

Published as a conference paper at ICLR 2025

clean multiclass labels satisfies

E[ℓ(fstrong)] = on(1), if t < min {1 r, τstrong} 1 Θ 1

k , if t > min {1 r, τstrong} , (3)

where τstrong p + 1 2(q + r). Furthermore, for binary classification (i.e. k = 2), the explicit error rates are given by

E[ℓ(fstrong)] = 1

π arctan(Θ(nτstrong)). (4)

By generalizing other results from Wu & Sahai (2024) to handle imperfect labels, we formally establish weak-to-strong generalization for binary classification in the subset ensemble. We give a brief technical overview of the proof in Section 3.1 and prove it formally in Appendix C. Theorem 3.2 (Weak-to-strong generalization for subset ensemble). Consider the setup in Procedure 1 where the weak model fweak is MNI-trained on n correctly labeled examples and the strong model fw2s is MNI-trained on m = nu weakly pseudolabeled examples, where q + r > u and qweak + rweak > 1. In addition, we make the following data assumptions:

(1) The true binary labels are 1-sparse (Assumption 1) for the strong covariance.

(2) The weak and strong features follow the subset ensemble (Assumption 2) with bi-level eigenvalues Λ(p, q, r) and Λ(pweak, qweak, rweak), respectively, scaled relative to n.

(3) There are not too many weakly labeled examples: u < p+1+q+r (qweak+rweak)

Recall τstrong p + 1 2(q + r). Then, the resulting test error for fw2s satisfies

E[ℓ(fw2s)] = on(1), if u > qweak + rweak min {1 r, τstrong} 1 2 on(1), if u < qweak + rweak min {1 r, τstrong}. (5)

In other words, under the stated parameter regimes, weak-to-strong generalization occurs when the strong model is trained on sufficiently many weak binary labels.5 Figure 2 demonstrates that the conditions for weak-to-strong generalization are not vacuous. We make the third enumerated assumption for technical reasons, but we believe this condition is essentially tight.

By combining Theorems 3.1 and 3.2, we obtain the exact conditions where the main desiderata from Section 2 are satisfied in the bi-level ensemble. The regimes where our theorem applies are depicted in Figures 2a and 2b, and we validated our theory with numerical simulations of MNI with n = 50 in Figures 2c and 2d; see Appendix F for more details on the experiments. Theorem 3.3 (Main result). Under the same setup as Theorem 3.2, all the core weak-to-strong desiderata Desiderata 1.i to 1.iii are satisfied under the following conditions:

u > qweak + rweak min {1 r, τstrong} (Weak-to-strong generalization) 0 > τweak, (Weak model does not generalize)

where τweak pweak + 1 2(qweak + rweak) and τstrong p + 1 2(q + r). In particular, the conditions on the number of weak examples m = nu are non-vacuous as long as

τw2s p + 1 (q + r + qweak + rweak) > 0.

Note that under Assumption 2, Desideratum 1.ii is immediately satisfied since the weak features are spanned by the strong features. Also, as a sanity check, we confirm that whenever weak-to-strong generalization occurs with m weak labels, the strong model would generalize with m clean labels. By Theorem 3.1, if the strong model was instead trained on m = nu clean examples, it would generalize whenever u > 2(q + r) p. By expanding the definition of τstrong and using qweak + rweak > 1,

u > qweak + rweak τstrong = 2(q + r) p + (qweak + rweak 1) > 2(q + r) p,

which recovers the condition with m clean labels stated above.

5We expect weak imitation to occur as soon as m d if the models are trained using gradient descent, i.e. the classic underparameterized regime with more weakly labeled datapoints than features.

Published as a conference paper at ICLR 2025

Remark 3.4 (Bonus desiderata). If q+r > u, as in the setup of Theorem 3.3, PCA fails to recover the spiked subspace (Shen et al., 2013; Fan & Wang, 2015), so the bonus Desideratum 2.i automatically holds. Additionally, by using the condition for where the strong model fails with n clean examples from Theorem 3.1 (τstrong < 0), one can construct parameter regimes where the bonus Desideratum 2.ii

also holds. The theorem also implies the strong model cannot bootstrap its own performance. Indeed, some basic algebra shows τw2s = τstrong = τweak if (p, q, r) = (pweak, qweak, rweak).

We can extend the above result to the multilabel setting, which is a variant of multiclass classification where a datapoint can have multiple positive labels. The basic idea is that multilabel training can be approximated as several independent binary classification problems, and then we can apply the binary analysis. Since this requires additional setup, we defer the formal details to Appendix D.

Theorem 3.5 (Informal, see Theorem D.3). Under the analogous assumptions as Theorem 3.3 for multilabel classification, there is weak-to-strong generalization for fw2s trained on m weakly hard multilabeled examples from fweak in the exact same regimes.

The above result also suggests an interesting result for true multiclass problems with one-hot labels. In particular, one can use the multilabel scheme to generate weak multilabels for a multiclass weak-tostrong classifier fw2s. One can argue that whenever multilabel weak-to-strong generalization occurs, so too does multiclass weak-to-strong generalization (see Appendix D.1 for more justification). By comparing to the regimes for clean multiclass labels due to Theorem 3.1, there exists regimes where the strong model would fail if trained using the same number of clean multiclass labels (i.e. only one positive label per example). At a high level, this can happen because there are too many classes, which sparsifies the label vectors. In contrast, the weak multilabels do not suffer from the sparsity issue, even though the underlying signal is noisier. This appears to be related to the strategy of using soft labels or logits in the pseudolabeling and knowledge distillation literature.

1.0 1.5 2.0 2.5 3.0 3.5 4.0 p

Regime Diagram

fails generalizes

(a) (q, r) = (0.9, 0.8), (pweak, qweak, rweak) = (1.4, 0.9, 0.4).

1.0 1.5 2.0 2.5 3.0 3.5 4.0 p

Regime Diagram

fails generalizes

(b) (q, r) = (0.9, 0.5), (pweak, qweak, rweak) = (1.1, 0.9, 0.2)

w2s weak ground truth

(c) (p, q, r) = (2, 0.6, 0.6), (pweak, qweak, rweak) = (1.4, 0.9, 0.5).

w2s weak ground truth

(d) (p, q, r) = (1.5, 0.6, 0.8), (pweak, qweak, rweak) = (1.4, 0.9, 0.5).

Figure 2: (Top): Regime plots for weak-to-strong generalization based on Theorem 3.3. The blue region is the successful w2s regime, and the red region is where w2s training fails. The white region corresponds to regimes where either the hypotheses of the theorem fail to hold, or invalid settings of parameters for the bi-level ensemble. (Bottom): Comparison of simulations of MNI test accuracies for two different regimes where n = 50. Observe how the weak accuracy is close to random guessing and how the weak-to-strong accuracy increases as m increases. As corroborated by the plots, Theorem 3.3 predicts w2s success in Figure 2c and failure in Figure 2d.

Published as a conference paper at ICLR 2025

Finally, we state the aforementioned concentration inequality for the maximum of correlated Gaussians, which may be of independent interest. Our result complements the sharp bound of Lopes & Yao (2022) for more moderate deviations; the proof can be found in Appendix E. Theorem 3.6 (Lower tail for correlated Gaussians). Let ρ0 (0, 1) be a parameter bounded away from 0 and 1, and let (gi)i [N] be jointly Gaussian with zero mean and unit variance. Suppose E[gigj] ρ0 for all distinct i, j [N]. For any 0 t N = δ0 p

2(1 ρ0) log N where δ0 [0, 1) is bounded away from 1, there is a constant C > 0 depending only on ρ0 such that

Pr max i [N] gi t N

C N (1 δ0)2(1 1

ρ0 )(log N)

1 ρ0(2 δ0) δ0

In particular, one can take C = q

ρ0 1 ρ0 . If we further have E[gigj] = ρ0 for all distinct i, j and

t N = O log log N log N

, then Pr maxi [N] gi t N = Θ N 1 1

3.1 TECHNICAL OVERVIEW

In this section, we illustrate the main ideas of the proof of Theorem 3.2. Recall that the true label is sgn(x ), where x = gtest, v . Thus, studying weak-to-strong generalization amounts to analyzing

Pr[sgn( fw2s, xtest ) = sgn(x )], (6)

where the fresh test point xtest N(0, Σ). Observe that conditioned on the training data X, the true Gaussian x and fw2s, xtest are jointly Gaussian random variables, so there exist λ R and g N(0, 1) independent of x such that

fw2s, xtest = λx + g. (7)

The exact value of λ depends on exactly how much of x s signal survives through the weak-to-strong training process. Intuitively, the larger λ is, the higher the accuracy. Indeed, it is well-known from the classical study of noise stability that the probability in Equation (6) is precisely 1

π arctan(λ).

To compute λ, we use Gram-Schmidt to decompose fw2s to obtain the signal and noise with respect to x . We term these quantities the survival and contamination of x , respectively. Without loss of generality, we can rotate to the basis U where the strong features are drawn iid from N(0, Λ), where Λ = diag(λ1, . . . , λd). After applying the basis change, one easily computes fw2s, xtest d= N(0, Λfw2s 2 2). The survival and contamination are then defined as follows.

Definition 3 (Survival and contamination). Let v RD be a unit vector and let f Rd be a linear classifier for features drawn from N(0, P

i [d] λiviv i ), where v1, . . . , vd RD are orthonormal. The survival and contamination of v in f is defined as

λif[i] vi, v , CN(v) q

Λf 2 2 SU(v)2.

If f is trained by MNI on y = sgn( g, w ), we write SUn(v|w) for the survival of v given n labels generated from w, and similarly for CNn(v|w). If n or w is clear from context, we omit them.

To interpret these notions, consider the simple case where v = e1 and vi = ei, which is the 1-sparse setting studied by previous papers. There, the survival is simply just λ1f[1] and the contamination is p P

i>1 λif[i]2; this was used to analyze binary classification in (Muthukumar et al., 2021).

In the weak-to-strong setting, we have fw2s, xtest = SUm(v |vweak)x + CNm(v |vweak)g, where vweak = fweak/ fweak is the unit-norm direction learned by the weak model, and g is a standard Gaussian independent of x . Plugging in x = g, v into (7) yields

Pr[sgn(SUm(v |vweak)x + CNm(v |vweak)g) = sgn(x )], (8)

from which we can read off λ = SUm(v |vweak)

CNm(v |vweak). Thus, the explicit formula for the probability yields

that fw2s approaches perfect accuracy if and only if SUm(v |vweak)

CNm(v |vweak) = ω(1), whereas its accuracy

approaches random guessing if and only if SUm(v |vweak)

CNm(v |vweak) = o(1).

Published as a conference paper at ICLR 2025

We now turn to the strategy for controlling the survival and contamination. Using the explicit MNI formula for fweak and fw2s, this analysis reduces to controlling certain random bilinear forms. Applying standard eigenvalue concentration results and a version of the Hanson-Wright concentration inequality for bilinear forms developed by Wu & Sahai (2024), we can precisely control the coordinates of fweak, and hence vweak. Once we have control over vweak, we can then study fw2s.

One technical difficulty is that the weak labels are no longer 1-sparse, even if the true labels are. This marks a departure from previous analyses of benign overfitting in classification, which all assumed 1-sparse labels (Muthukumar et al., 2021; Wang et al., 2021; Wu & Sahai, 2024). To overcome this obstacle, we develop tools to analyze MNI beyond the 1-sparse regime, namely by using the Woodbury inversion formula and carefully bounding error terms using Hanson-Wright. Under the subset ensemble, the explicit parameterizations for the survival and contamination then yield the stated conditions for weak-to-strong generalization.

4 DISCUSSION

We now discuss further extensions which suggest interesting new weak-to-strong phenomena.

Modeling assumptions. In this paper, we assumed the features were Gaussian for the sake of theoretical tractability. One could relax this to vector-subgaussian, as this notion is preserved under rotation. On the other hand, the bi-level and subset ensembles were chosen as a minimal set of tractable theoretical conditions to study weak-to-strong generalization. In principle, our techniques could be extended to more complicated feature ensembles, but it would likely be quite involved.

As corroborated by our experiments in Figs. 2c and 2d, in practice one does not observe sharp transitions in test error from 1

2 to 0 using weak supervision such a sharp transition corresponds to a Performance Gap Recovered (PGR) equal to 1 in Burns et al. (2023) This phenomenon can nevertheless be justified by our theory. First, the transition we prove is asymptotic; the rate of convergence matters to predict the PGR for finite n. Another factor is the 1-sparse assumption for the true labels. We argue in Appendix G that this is essentially necessary to get this sharp transition in asymptotic test error. In practice, the true directions are rarely aligned with the eigenvectors of the covariance, so the transition would instead between different constant levels of test error.

Weak imitation. Our main result Theorem 3.2 uses the subset ensemble Assumption 2 to establish two distinct phases of weak-to-strong generalization. By adding a third intermediate eigenvalue level thus relaxing the bi-level ensemble to a tri-level ensemble and changing the relationship between the strong and weak basis, we expect a third distinct regime of weak imitation to occur in an overparameterized setting. At a high level, this can happen if the weak classifier puts most of its mass in its own unfavored direction v, but in the strong feature space v has an intermediate level of favoring. Given enough weakly labeled datapoints, the strong model will eventually learn to imitate the weak model. Nevertheless, it is still possible for weak-to-strong generalization to occur, with the regime for weak-to-strong generalization widening as the intermediate favoring level decreases.

Multiclass classification. By leveraging Theorem 3.1 and the techniques developed for binary and multilabel classification in the appendix, we expect our result to extend to the multiclass setting. The main technical challenge is controlling the expected weak-to-strong training behavior, which is significantly more complicated. However, multilabel weak supervision overcomes this challenge and is more tractable to analyze, as discussed after Theorem 3.5.

Relevance to realistic settings. One connection to more realistic settings comes through the NTK perspective, as touched upon in the introduction. To reiterate, if finetuning remains in the lazy training phase, then the supervised finetuning dynamics are captured by an appropriate generalized linear model defined by the NTK approximation. The strong and weak features are therefore learned from the pretraining phase. Another relevant setting is training linear probes on intermediate activations to interpret large models (Alain & Bengio, 2016; Marks & Tegmark, 2023; Nanda et al., 2023; Zou et al., 2023). Hence, under appropriate overparameterization of the intermediate activations, our results also apply there. To go beyond classification problems, one could hope to use some recent implicit bias results for next token prediction (Thrampoulidis, 2024) to study language models.

Published as a conference paper at ICLR 2025

ACKNOWLEDGMENTS

We would like to acknowledge support from an Open AI Superalignment grant and NSF AST-2132700.

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. ar Xiv preprint ar Xiv:1610.01644, 2016.

Norah Alballa and Marco Canini. Practical insights into knowledge distillation for pre-trained models, 2024.

Eric Arazo, Diego Ortego, Paul Albert, Noel E O Connor, and Kevin Mc Guinness. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In 2020 International joint conference on neural networks (IJCNN), pp. 1 8. IEEE, 2020.

Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063 30070, 2020.

Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. SIAM Journal on Mathematics of Data Science, 2(4):1167 1180, 2020.

Kenneth Borup and Lars Nørvang Andersen. Self-distillation for gaussian process regression and classification, 2023.

Cristian Buciluˇa, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 535 541, 2006.

Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. ar Xiv preprint ar Xiv:2312.09390, 2023.

Paola Cascante-Bonilla, Fuwen Tan, Yanjun Qi, and Vicente Ordonez. Curriculum labeling: Revisiting pseudo-labeling for semi-supervised learning. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pp. 6912 6920, 2021.

Moses Charikar, Chirag Pabbaraju, and Kirankumar Shiragur. Quantifying the gain in weak-to-strong generalization. ar Xiv preprint ar Xiv:2405.15116, 2024.

Niladri S Chatterji and Philip M Long. Finite-sample analysis of interpolating linear classifiers in the overparameterized regime. Journal of Machine Learning Research, 22(129):1 30, 2021.

Richard J Chen, Ming Y Lu, Tiffany Y Chen, Drew FK Williamson, and Faisal Mahmood. Synthetic data in machine learning for medicine and healthcare. Nature Biomedical Engineering, 5(6): 493 497, 2021.

Elisabetta Cornacchia, Francesca Mignacco, Rodrigo Veiga, Cédric Gerbelot, Bruno Loureiro, and Lenka Zdeborová. Learning curves for the multi-class teacher student perceptron. Machine Learning: Science and Technology, 4(1):015019, 2023.

Rudrajit Das and Sujay Sanghavi. Understanding self-distillation in the presence of label noise. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 7102 7140. PMLR, 23 29 Jul 2023. URL https://proceedings.mlr.press/v202/das23d.html.

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024.

Jianqing Fan and Weichen Wang. Asymptotics of empirical eigen-structure for ultra-high dimensional spiked covariance model. ar Xiv preprint ar Xiv:1502.04733, 2015.

Published as a conference paper at ICLR 2025

Alvaro Figueira and Bruno Vaz. Survey on synthetic data generation, evaluation methods and gans. Mathematics, 10(15):2733, 2022.

Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao. Knowledge distillation: A survey. International Journal of Computer Vision, 129(6):1789 1819, 2021.

Jianyuan Guo, Hanting Chen, Chengcheng Wang, Kai Han, Chang Xu, and Yunhe Wang. Vision superalignment: Weak-to-strong generalization for vision foundation models. ar Xiv preprint ar Xiv:2402.03749, 2024.

Hrayr Harutyunyan. On information captured by neural networks: connections with memorization and generalization, 2023.

Peter Hase, Mohit Bansal, Peter Clark, and Sarah Wiegreffe. The unreasonable effectiveness of easy training data for hard tasks. ar Xiv preprint ar Xiv:2401.06751, 2024.

Chunming He, Kai Li, Yachao Zhang, Guoxia Xu, Longxiang Tang, Yulun Zhang, Zhenhua Guo, and Xiu Li. Weakly-supervised concealed object segmentation with sam-based pseudo labeling and multi-scale feature grouping. Advances in Neural Information Processing Systems, 36, 2024.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. ar Xiv preprint ar Xiv:1503.02531, 2015.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. ar Xiv preprint ar Xiv:2203.15556, 2022.

Chi Hong, Robert Birke, Pin-Yu Chen, and Lydia Y. Chen. On dark knowledge for distilling generators. In De-Nian Yang, Xing Xie, Vincent S. Tseng, Jian Pei, Jen-Wei Huang, and Jerry Chun-Wei Lin (eds.), Advances in Knowledge Discovery and Data Mining, pp. 235 247, Singapore, 2024. Springer Nature Singapore. ISBN 978-981-97-2253-2.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International conference on machine learning, pp. 2790 2799. PMLR, 2019.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. ar Xiv preprint ar Xiv:2106.09685, 2021.

Guangda Ji and Zhanxing Zhu. Knowledge distillation in wide neural networks: Risk bound, data efficiency and imperfect teacher. Advances in Neural Information Processing Systems, 33: 20823 20833, 2020.

Jiaming Ji, Boyuan Chen, Hantao Lou, Donghai Hong, Borong Zhang, Xuehai Pan, Juntao Dai, and Yaodong Yang. Aligner: Achieving efficient alignment through weak-to-strong correction. ar Xiv preprint ar Xiv:2402.02416, 2024.

Ziwei Ji and Matus Telgarsky. Characterizing the implicit bias via a primal-dual analysis. In Algorithmic Learning Theory, pp. 772 804. PMLR, 2021.

Jared Kaplan, Sam Mc Candlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. ar Xiv preprint ar Xiv:2001.08361, 2020.

Hunter Lang, David Sontag, and Aravindan Vijayaraghavan. Theoretical analysis of weak-to-strong generalization. ar Xiv preprint ar Xiv:2405.16043, 2024.

Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, pp. 896. Atlanta, 2013.

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. ar Xiv preprint ar Xiv:2104.08691, 2021.

Published as a conference paper at ICLR 2025

Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950 1965, 2022.

Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, et al. Best practices and lessons learned on synthetic data for language models. ar Xiv preprint ar Xiv:2404.07503, 2024.

Yuejiang Liu and Alexandre Alahi. Co-supervised learning: Improving weak-to-strong generalization with hierarchical mixture of experts. ar Xiv preprint ar Xiv:2402.15505, 2024.

Miles E Lopes and Junwen Yao. A sharp lower-tail bound for gaussian maxima with application to bootstrap methods in high dimensions. Electronic Journal of Statistics, 16(1):58 83, 2022.

Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. ar Xiv preprint ar Xiv:2310.06824, 2023.

Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75 (4):667 766, 2022.

Hossein Mobahi, Mehrdad Farajtabar, and Peter Bartlett. Self-distillation amplifies regularization in hilbert space. Advances in Neural Information Processing Systems, 33:3351 3361, 2020.

Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, and Anant Sahai. Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory, 1(1): 67 83, 2020.

Vidya Muthukumar, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu, and Anant Sahai. Classification vs regression in overparameterized regimes: Does the loss function matter? Journal of Machine Learning Research, 22(222):1 69, 2021.

Vaishnavh Nagarajan, Aditya K Menon, Srinadh Bhojanapalli, Hossein Mobahi, and Sanjiv Kumar. On student-teacher deviations in distillation: does it pay to disobey? In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 5961 6000. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/ file/12d286282e1be5431ea05262a21f415c-Paper-Conference.pdf.

Neel Nanda, Andrew Lee, and Martin Wattenberg. Emergent linear representations in world models of self-supervised sequence models. ar Xiv preprint ar Xiv:2309.00941, 2023.

Sergey I Nikolenko. Synthetic data for deep learning, volume 174. Springer, 2021.

Utkarsh Ojha, Yuheng Li, Anirudh Sundara Rajan, Yingyu Liang, and Yong Jae Lee. What knowledge gets distilled in knowledge distillation? In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 11037 11048. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/ file/2433fec2144ccf5fea1c9c5ebdbc3924-Paper-Conference.pdf.

Mary Phuong and Christoph Lampert. Towards understanding knowledge distillation. In International conference on machine learning, pp. 5142 5151. PMLR, 2019.

Damián Pinasco, Ezequiel Smucler, and Ignacio Zalduendo. Orthant probabilities and the attainment of maxima on a vertex of a simplex. Linear Algebra and its Applications, 610:785 803, 2021.

Mamshad Nayeem Rizve, Kevin Duarte, Yogesh S Rawat, and Mubarak Shah. In defense of pseudolabeling: An uncertainty-aware pseudo-label selection framework for semi-supervised learning. ar Xiv preprint ar Xiv:2101.06329, 2021.

Published as a conference paper at ICLR 2025

Mher Safaryan, Alexandra Peste, and Dan Alistarh. Knowledge distillation performs partial variance reduction. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 75229 75258. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/ 2023/file/ee1f0da706829d7f198eac0edaacc338-Paper-Conference.pdf.

Felix Sarnthein, Gregor Bachmann, Sotiris Anagnostidis, and Thomas Hofmann. Random teachers are good teachers. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 30022 30041. PMLR, 23 29 Jul 2023. URL https://proceedings.mlr.press/v202/sarnthein23a.html.

Avi Schwarzschild, Eitan Borgnia, Arjun Gupta, Furong Huang, Uzi Vishkin, Micah Goldblum, and Tom Goldstein. Can you learn an algorithm? generalizing from easy to hard problems with recurrent networks. Advances in Neural Information Processing Systems, 34:6695 6706, 2021.

Dan Shen, Haipeng Shen, Hongtu Zhu, and JS Marron. Surprising asymptotic conical structure in critical sample eigen-directions. ar Xiv preprint ar Xiv:1303.6171, 2013.

Seamus Somerstep, Felipe Maia Polo, Moulinath Banerjee, Ya acov Ritov, Mikhail Yurochkin, and Yuekai Sun. A statistical framework for weak-to-strong generalization. ar Xiv preprint ar Xiv:2405.16236, 2024.

Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alexander A Alemi, and Andrew G Wilson. Does knowledge distillation really work? Advances in Neural Information Processing Systems, 34: 6906 6919, 2021.

Vignesh Subramanian, Rahul Arya, and Anant Sahai. Generalization for multiclass classification with overparameterized linear models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=ik Wv MRVQBWW.

Leitian Tao and Yixuan Li. Your weak llm is secretly a strong teacher for alignment. ar Xiv preprint ar Xiv:2409.08813, 2024.

Christos Thrampoulidis. Implicit bias of next-token prediction. ar Xiv preprint ar Xiv:2402.18551, 2024.

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. ar Xiv preprint ar Xiv:1011.3027, 2010.

Guillaume Wang, Konstantin Donhauser, and Fanny Yang. Tight bounds for minimum ℓ1-norm interpolation of noisy data. In International Conference on Artificial Intelligence and Statistics, pp. 10572 10602. PMLR, 2022a.

Ke Wang and Christos Thrampoulidis. Benign overfitting in binary classification of Gaussian mixtures. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4030 4034. IEEE, 2021.

Ke Wang, Vidya Muthukumar, and Christos Thrampoulidis. Benign Overfitting in Multiclass Classification: All Roads Lead to Interpolation. ar Xiv e-prints, art. ar Xiv:2106.10865, June 2021.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. ar Xiv preprint ar Xiv:2212.10560, 2022b.

Alexander Wei, Wei Hu, and Jacob Steinhardt. More than a toy: Random matrix models predict how real-world neural representations generalize. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 23549 23588. PMLR, 17 23 Jul 2022. URL https://proceedings.mlr.press/v162/ wei22a.html.

Published as a conference paper at ICLR 2025

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. ar Xiv preprint ar Xiv:2109.01652, 2021.

David Wu and Anant Sahai. Precise asymptotic generalization for multiclass classification with overparameterized linear models. Advances in Neural Information Processing Systems, 36, 2024.

Liuchi Xu, Jin Ren, Zhenhua Huang, Weishi Zheng, and Yunwen Chen. Improving knowledge distillation via head and tail categories. IEEE Transactions on Circuits and Systems for Video Technology, 34(5):3465 3480, 2024. doi: 10.1109/TCSVT.2023.3325814.

Wenkai Yang, Shiqi Shen, Guangyao Shen, Zhi Gong, and Yankai Lin. Super (ficial)-alignment: Strong models may deceive weak models in weak-to-strong generalization. ar Xiv preprint ar Xiv:2406.11431, 2024.

Hua Yuan, Yu Shi, Ning Xu, Xu Yang, Xin Geng, and Yong Rui. Learning from biased soft labels. Advances in Neural Information Processing Systems, 36, 2024.

Bowen Zhang, Yidong Wang, Wenxin Hou, Hao Wu, Jindong Wang, Manabu Okumura, and Takahiro Shinozaki. Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling. Advances in Neural Information Processing Systems, 34:18408 18419, 2021.

Edwin Zhang, Vincent Zhu, Naomi Saphra, Anat Kleiman, Benjamin L Edelman, Milind Tambe, Sham M Kakade, and Eran Malach. Transcendence: Generative models can outperform the experts that train them. ar Xiv preprint ar Xiv:2406.11741, 2024.

Qingyue Zhao and Banghua Zhu. Towards the fundamental limits of knowledge transfer over finite domains, 2023.

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to ai transparency. ar Xiv preprint ar Xiv:2310.01405, 2023.

A PRELIMINARIES

In this section, we lay out some preliminaries for the rigorous justification of our main results, and give a technical overview in Section 3.1 which motivates the study of the survival and contamination of the signal, defined formally in Definition 3. In Appendix B, we will introduce tools from high dimensional probability which can rigorously control the survival and contamination. In Appendix C, we will then use these tools to prove the main results of the paper for binary classification. Then, in Appendix D, we will discuss how the same exact proofs extend to the multilabel setting, and in Appendix E we will prove a tight lower tail inequality for correlated Gaussians which yields Theorem 3.1. In Appendix F we include some simulations validating in our theory, and finally in Appendix G we give a heuristic calculation which arrives at the same predictions for the weak-to-strong regimes which should lead to rigorous proofs even without the 1-sparse assumption.

Notation. For positive integers n, we use the shorthand [n] {1, . . . , n}. For a vector v Rn, v 2 always denotes the Euclidean norm. We index entries by using square brackets or subscripts, whichever is clearer, so v[j] and vj both denote the jth entry of v. For any matrix M Rm n, we denote its ijth entry by mij, M 2 denotes the spectral norm, and M F = Tr M M denotes the Frobenius norm. If M Rn n is symmetric, we write µ1(M) µ2(M) . . . µn(M) to denote the ordered eigenvalues of M.

We make extensive use of asymptotic notation. We say f(n) g(n) if f(n) = Θ(g(n)), f(n) g(n) if f(n) = O(g(n)), and f(n) = e O(g(n)) if f(n) = O(g(n) log(n)c) for some constant c 0. When we write f(n) g(n), we mean that f(n)

g(n) nε for some constant ε > 0. An event E is said to hold with high probability if it holds with probability 1 1/nc for some constant c > 0, and very high probability if it holds with probability 1 exp( nc) for some constant c > 0. For us, high probability statements will be taken with respect to n, the number of datapoints we scale the weak

Published as a conference paper at ICLR 2025

and strong features with respect to, but under our terminology it makes no difference whether we use n or m because they are polynomially related.

In the weak-to-strong ensemble analysis, many of the expressions will depend on q + r, but the number of datapoints involved will change. In an effort to reduce confusion, we introduce the following notation. Definition 4 (Bi-level prefactor). If the covariance Σ = Σ(p, q, r) is drawn from the bi-level ensemble scaled with respect to n, then we define

s = n1 q r. (9)

In particular, if q + r > 1 then µn 1. During weak supervision, the strong learner has bi-level features with covariance Σ(p, q, r) scaled with respect to n and receives m = nu weakly labeled examples. Hence, we define

s = nu q r. (10)

Remark A.1. Note that once a, s are fixed, µn is linear in n.

B TOOLS FOR ANALYZING SURVIVAL AND CONTAMINATION

In this section, we introduce the basic machinery which can justify the heuristic calculations carried out in Appendix A. We will heavily rely on the analysis of Wu & Sahai (2024), so in an attempt to not replicate effort we will comment on how the proofs change in the weak-to-strong setting.

Basis change. Since we have bi-level data and the true direction v is 1-sparse (Assumption 1), there are convenient basis changes which will greatly simplify the calculations. In particular, let S be the spiked subspace of Σ, and w be any unit vector. Then any vectors in S are eigenvectors of Σ, so we can pick an eigenbasis U so that xstrong has independent coordinates, v is 1-sparse, and w is 3-sparse. We state this more formally in the following definition. Definition 5 (Strong eigenbasis). Given Σ, v , and w as above, let U be the distinguished eigenbasis of Σ such that after rotating to U , we have xstrong N(0, Λ), v = e1, and w = w1e1 + w2e2 + ws+1es+1.

Hanson Wright inequality. Now, we recall a few versions of the Hanson-Wright inequality which handles bilinear forms between subgaussian vectors.

For the multiclass case, we also need to handle potentially sparse vectors; this version was proved in Wu & Sahai (2024). Theorem B.1 (Hanson-Wright for bilinear forms with soft sparsity). Let x = (X1, . . . , Xn) Rn and y (Y1, . . . , Yn) Rn be random vectors such that (Xi, Yi) are independent pairs of (possibly correlated) centered random variables such that Xi ψ2 K and Yi has soft sparsity at level σ, i.e. |Yi| 1 almost surely, and E[Y 2 i ] σ. Assume that conditioned on Yj, Xj ψ2 K. Then there exists an absolute constant c > 0 such that for all M Rn n and ϵ 0 we have

Pr |x My E[x My]| > ϵ 2 exp

K2σ M 2 F , ϵ K M 2

Because we are working with MNI, in a typical application we hope to set M = A 1. However, in order to apply Hanson-Wright, one needs to ensure that the matrix M is independent of x and y.

Decorrelated Gram matrix. As we are working with Gaussian covariates, one way to achieve this is to decorrelate A with x and y. Definition 6 (Decorrelated Gram matrix). Let x1, . . . , xn N(0, Λ) be iid samples from a Gaussian distribution in Rd. Let R = v(1), . . . , v(ℓ) be an arbitrary set of orthonormal vectors in Rd and for each j [ℓ], let h(j) = (h1, . . . , hn) Rn be the corresponding Gaussian observations, where hi Λ1/2gi, v(j) .

Published as a conference paper at ICLR 2025

Define the projection matrix onto R by ΠR = P

v R vv . The R-projection of the data matrix XR Rn d is defined by

and the R-decorrelated data matrix X R Rn d is defined by

X R X(I ΠR) = X XR.

The decorrelated Gram matrix is defined by

A R X RX R = X(I ΠR)X .

When R = {v}, we will simply abbreviate {v} with v.

Remark B.2. If R does not consist of orthonormal vectors, one can apply Gram-Schmidt first so that the above definition still makes sense.

It is not hard to show that as defined above, A R is mutually independent of h(j)

Fact B.3. In the same setting as Definition 6, A R is mutually independent of h(j)

Proof. The fact that X R is independent of the hi s is due to the fact that if g, h are jointly Gaussian with zero mean and unit variance, then g E[gh]h is independent of h. This immediately generalizes to each row of X, since the entries are all independent for any row. Finally, as the rows are independent, the statement is proved.

To relate a bilinear form against A 1 R to the original bilinear form against A 1, we will also need the Woodbury inversion formula for inverting the Gram matrix after decorrelating it with various features.

Define the hat matrix for R by

HR X R A 1 RXR. (12)

These hat matrices appear in the Woodbury inversion formula. For the sake of notational compactness, define

MR XR(I + HR) 1X R . (13)

Fact B.4 (Woodbury inversion formula). We have

A 1 = (XRX R + A R) 1

= A 1 R A 1 RXR(I + HR) 1X R A 1 R (14)

= A 1 R A 1 RMRA 1 R. (15)

Concentration of spectrum. Finally, in view of Hanson-Wright, we will need to control A 1 R 2 and Tr A 1 R . To this end, we control the spectrum of A R. For any PSD matrix Σ with eigenvalues

Λ = diag(λi)i [d] sorted in descending order and k [d], define the effective rank rk(Σ)

λk . We will use the following bounds adapted from Bartlett et al. (2020), which show that the effective rank of the covariance matrix controls the spectrum of the Gram matrix (and hence the sample covariance). Lemma B.5 (Eigenvalue bounds from Bartlett et al. (2020)). Suppose that x N(0, Σ), where Σ = UΛU . Write Λ = diag(λi), and let A = XX Rn n denote the Gram matrix for n iid observations of x. If x = U Λ1/2z, where z has independent, unit variance, O(1)-subgaussian coordinates, the following holds.

If r1(Σ) = ω(n), then with very high probability

Tr(Λ)(1 o(1)) µn(A) µ1(A) Tr(Λ)(1 + o(1))

As a corollary, we show that in the regime q + r > 1 in the bi-level ensemble, decorrelating the Gram matrix with any sublinear number of jointly Gaussian random variables will yield a flat matrix.

Published as a conference paper at ICLR 2025

Corollary B.6. Suppose that x N(0, Σ), where Σ Rd d satisfies µ1(Σ) = o( d

n) and for any ℓ= o(d), we have µℓ(Σ)rℓ(Σ) = d(1 o(1)) (in particular, this holds if Σ follows the bilevel ensemble and q + r > 1). Let R be an arbitrary set of orthonormal vectors in Rd such that |R| = o(d), and let A R be the corresponding decorrelated Gram matrix for n iid observations of x as in Definition 6. Then with very high probability,

d(1 on(1)) µn(A R) µ1(A R) d(1 + on(1))

Proof. First, note that the rows of X R are iid samples from N(0, Σ R), where Σ R = (I ΠR)Σ(I ΠR).

Let ℓ= |R|; by assumption ℓ= o(d). Since I ΠR is a projection matrix, we conclude by Cauchy Interlacing that µ1(Σ R) µ1(Σ) and for i {ℓ+ 1, . . . , d}, µi(Σ) µi(Σ R). It follows that

r1(Σ R) = Tr(Σ R)

µ1(Σ R) µℓ(Σ)rℓ(Σ)

ω d(1 o(1))

(Assumption on Σ)

Thus Lemma B.5 implies that µi(A R) = d(1 o(1)) with high probability.

This leads to the following bounds on the spectra of A 1 R and the hat matrix HR; this generalizes Wu & Sahai (2024, Corollary B.5, Proposition B.6).

Lemma B.7. In the bi-level model, for any set of orthonormal vectors R with |R| = o(d), if q+r > 1, then with very high probability we have

Tr A 1 R = n1 p(1 o(1)) A 1 R 2 (1 + o(1))n p XRX R 2 o(np)

µi(HR) = 1 o(1), i [d]

Proof. The first two statements are an immediate consequence of Corollary B.6. For the third, let S, T Rd be the spiked and unspiked subspaces of Σ. Since XR = (XS + XT )ΠR and using (A + B)M(A + B) 2AMA + 2BMB, we have

XRX R 2XSX S + 2XT ΠRX T .

Wu & Sahai (2024, Lemma I.1) implies that XSX S 2 n1 q r np = o(np) since q + r > 1. On the other hand, since the law of XT is rotationally invariant, one can view XT ΠRX T as a Wishart matrix Y Y where Y Rn |R|. Hence the concentration of singular values for a matrix with independent subgaussian rows (see, e.g. Vershynin (2010, Theorem 5.39)) implies that with probability 1 e n, XT ΠRX T 2 ( p

|R| + n + n1/4)2 = O(R + n) = o(np).

Combining this with the bound on XSX S 2, we conclude that XRX R 2 = o(np).

For the last statement, since HR = I+X R A 1 RXR, by applying the earlier bounds that µi(A 1 R) = np(1 o(1)) and 0 X R XR o(np), we conclude that µi(HR) = 1 o(1).

Hanson-Wright calculations. Combining the above results, we can establish a concentration inequality for the bilinear form z i A 1y.

Published as a conference paper at ICLR 2025

Proposition B.8. Let y = sgn( g, w ) for some unit vector w Rd and R be any set of orthonormal vectors such that span(R) {ei, w} and |R| = o(d). Then conditioned on A 1 R, with probability 1 1

n, z i A 1 Ry E[z i A 1 Ry] c1

where c1 is a positive constant. Furthermore, under the bi-level scaling, if q + r > 1, then with very high probability over A 1 R we have

E[z i A 1 Ry|A 1 R] = 2

3/2 (1 on(1)) arcsin(wi) n

A 1 R F = (1 on(1)) n

Proof. Since A 1 R is independent of (zi, y), we can apply Theorem B.1 to obtain the first statement.

For the other two expressions, we can apply Lemma B.7 since we assumed q + r > 1. We now compute the expectation using the noise stability formula:

E[z i A 1 Ry|A 1 R] = Tr A 1 RE[yz i ]

3/2 arcsin(wi) Tr A 1 R (Noise stability)

3/2 (1 on(1)) arcsin(wi) n

d , (Lemma B.7)

where the last line holds with very high probability over A 1 R.

Finally, for the last statement we have A 1 R F = s X

i [n] µi(A 1 R)2

i [n] d2(1 o(1)) (Lemma B.7)

= (1 o(1)) n

B.1 CONTROLLING THE ERROR FROM WOODBURY

In the previous subsection, we introduced some tools for obtaining concentration of the coefficients of the MNI classifier fstrong. The remaining survival and contamination reduce to understanding the typical behavior of bilinear forms z i A 1y. To do so, we work in the distinguished basis U introduced in Definition 5. We can thus pick R = {e1, e2, es+1} and apply Hanson-Wright to the bilinear form z i A 1 Ry. The next lemma bounds the error from replacing A 1 with A 1 R. Lemma B.9. Suppose we are in the bi-level model, y = sgn( g, w ), and q + r > 1. Suppose in the basis U , w = w1e1 + w2e2 + ws+1es+1. Let R = {e1, e2, es+1}. With probability 1 O(1/n), we have uniformly over i [d] that z i A 1y z i A 1 Ry z i A 1 RXR 2 X R A 1 Ry 2.

Furthermore, in the bi-level scaling we have with probability 1 O(1/n) that

z i A 1 RXR 2 =

e O( µnn 1 p

2 ) i [2] e O( µnn p

2 + n1 p) i = s + 1 e O( µnn p

2 ) otherwise and X R A 1 Ry 2 µn((|w1| + |w2|)n 1 p

2 + e O(n p

2 )) + |ws+1|n1 p + e O(n 1 2 p).

Published as a conference paper at ICLR 2025

Proof. By relating these two bilinear forms using Woodbury, we have z i A 1y z i A 1 Ry = z i A 1 RMRA 1 Ry (Fact B.4)

HR 2 z i A 1 RXR 2 X R A 1 Ry 2 (Cauchy-Schwarz)

(1 + o(1)) z i A 1 RXR 2 X R A 1 Ry 2 . (Lemma B.7)

It remains to control the above norms, which we achieve through another application of Hanson Wright. Indeed, we can compute z i A 1 RXR 2

j {1,2,s+1} λj(z i A 1 Rzj)2

Now, by Hanson-Wright, we have with probability 1 O( 1

λU(z i A 1 Rzs+1)2 =

( e O(n2 2p) i = s + 1 e O(n1 2p) otherwise

Similarly we have with probability 1 O( 1

j [2] (z i A 1 Rzj)2 =

( e O(np q r n2 2p) i [2] e O(np q r n1 2p) otherwise

Since np q r 1 as p > q + r and np q r n1 p = µn, it follows that with probability 1 O( 1

n) we have simultaneously for i [d] that

z i A 1 RXR 2 =

e O( µnn 1 p

2 ) i [2] e O( µnn p

2 + n1 p) i = s + 1 e O( µnn p

2 ) otherwise

A similar calculation, applying Proposition B.8, yields that

(z i A 1 Ry)2 =

π 3/2 arcsin(wi)n e O( n) i2 i {1, 2, s + 1}

d2 ) otherwise

X R A 1 Ry 2 X

i {1,2,s+1}

d (|wi|n + e O( n))

µn( w S 1n 1 p

2 + e O(n p

2 )) + |ws+1|n1 p + e O(n 1 2 p).

This concludes the proof.

We can simplify the above error bound greatly if w satisfies the 1-sparse assumption.

Corollary B.10 (Woodbury error under 1-sparse assumption). Suppose, in addition to the assumptions of Lemma B.9, we have w = v , where v satisfies the 1-sparse assumption. In other words, in the U basis we have v = e1. Then we pick R = {e1}, and with probability 1 O( 1

z i A 1y z i A 1 Ry

e O(µn n1 p) i = 1

e O(µn n 1 2 p) otherwise

B.2 SURVIVAL ANALYSIS

Let us first establish a general survival bound using Lemma B.9. Afterward, we present special cases under additional assumptions.

Published as a conference paper at ICLR 2025

Proposition B.11. Let Σ = Σ(p, q, r) be drawn from the bi-level ensemble scaled with respect to n with q + r > 1. Suppose in the distinguished basis U , we have w = w1e1 + w2e2 + ws+1es+1, and set R = {e1, e2, es+1}. Then, given n labels generated by w, we have z 1 A 1y 2

d (1 + o(1)) arcsin w1

+ z i A 1 RXR 2 X R A 1 Ry 2.

Proof. Recall our expressions for the survival:

SUn(v|w) = X

i [d] λiz i A 1y vi, v , (MNI)

where each label is generated by yi = sgn( gi, w ).

Setting v = v , in the basis U from Definition 5, the above expression simplifies to

SUn(v |w) = λ1z 1 A 1y

Applying Woodbury (Fact B.4) with R = {v, w}, we have

z 1 A 1y = z 1 A 1 Ry z 1 A 1 RMRA 1 Ry.

Now we can apply Proposition B.8 to conclude that z 1 A 1 Ry 2

d (1 + o(1)) arcsin(w1)

with probability 1 1

n. Now, triangle inequality and Lemma B.9 shows that z 1 A 1y 2

d (1 + o(1)) arcsin w1

+ z i A 1 RXR 2 X R A 1 Ry 2.

Thus the statement follows.

By way of Corollary B.10, we get a clean expression for the survival if w = v .

Corollary B.12. If q + r > 1, v satisfies the 1-sparse assumption, and y = sgn( g, v ), then with probability 1 O(1/n)

d z i A 1y e O n

SUn(v |v ) µn

Proof. If w = v , then by Corollary B.10 the error bound simplifies to

z i A 1y z i A 1 Ry

d ) i = 1 e O(µn n

d ) otherwise

Since q + r > 1, these error terms are lower order, and the results follow.

We can get a similarly clean bound under a few additional assumptions on w and q + r. Corollary B.13. If, in addition to the assumptions of Proposition B.11, we have

(1) |wi| ω( q

n ) for i {1, 2, s + 1}

(2) |w2| = o(|w1|)

(3) µnn 1 p

Published as a conference paper at ICLR 2025

then with yi = sgn( gi, w ), we have

d arcsin w1.

Furthermore, the survival of v given n labels generated by w is with high probability

SUn(v |w) µnw1 (16)

Proof. Under the additional assumptions, then from Lemma B.9

X R A 1 Ry 2 µn(|w1| + |w2|)n 1 p

2 + n1 p (|wi| = ω( p

µn|w1|n 1 p

2 + n1 p (|w2| = o(|w1|))

Since i = 1 here, we can apply Lemma B.9 to see that z 1 A 1 RXR 2 = e O( µnn 1 p

2 ). Thus, the Woodbury error term is upper bounded by

e O( µnn 1 p

2 )( µn|w1|n 1 p

2 + n1 p) e O(µn|w1|n1 p + µnn 3 3p

e O(µn|w1|n1 p) + o(n1 p|w1|)

o(n1 p|w1|)

where the second line follows because we assumed µnn 1 p

2 |w1|, and the last line follows because we assumed q + r > 1 so µn 1. Therefore, Proposition B.8 implies that z 1 A 1y 2

3/2 n1 p(1 + o(1)) arcsin w1

e O(n 1 2 p) + o(n1 p arcsin(|w1|)

o n1 p arcsin w1 , (|w1| ω( q

which recovers the stated result.

B.3 CONTAMINATION ANALYSIS

We now move onto the contamination.

Proposition B.14. Suppose v satisfies the 1-sparse assumption, and in the distinguished basis U we have v = e1 and w = w1e1 + w2e2 + ws+1es+1. Under the same assumptions as Corollary B.13, if there are n datapoints, and q + r > 1, then

CNn(v |w)2 o(µ2 n|w1|2) + µ2 nnr 1 log(n)2 + n1 p log(n)

CNn(v |w)2 µ2 nnr 1 + n1 p.

Furthermore, the lower bound holds even without the additional assumptions from Corollary B.13.

Proof. Since v is 1-sparse, we have

CN(v |w)2 = X

i {2,s+1} (λiz i A 1y)2 + X

i {1,2,s+1} (λiz i A 1y)2 . (17)

Since by definition wi = 0 for i {1, 2, s + 1}, the second term in Equation (17) can be upper bounded up to constant factors using Wu & Sahai (2024, Proposition A.2) by

µ2 n s n log(n)2 + n

d log(n) = µ2 nnr 1 log(n)2 + n1 p log(n).

Now, set R = {e1, e2, es+1}. For the first term, we know from Lemma B.9 that z i A 1y z i A 1 Ry z i A 1 RXR 2 X R A 1 Ry 2,

Published as a conference paper at ICLR 2025

z i A 1 RXR 2 =

( e O( µnn 1 p

2 ) i = 2 e O( µnn p

2 + n1 p) i = s + 1

and from our bounds in Proposition B.11, we have X R A 1 Ry 2 µn|w1|n 1 p

Hence we have z 2 A 1 RXR 2 X R A 1 Ry 2 o(n1 p|w1|).

On the other hand, we have z s+1A 1 RXR 2 X R A 1 Ry 2 µn|w1|n 1 2 p + µn|w1|n 3 3p

2 + µnn1 3p

o(n 1 2 p) + o(n1 p|w1|) + n2 2p,

where the last line follows from the assumptions and p > 1. Putting these together with Hanson Wright, we have z 2 A 1y e O(n 1 2 p) + o(n1 p|w1|) z s+1A 1y e O(n 1 2 p) + o(n1 p|w1|) + n2 2p

This yields an upper bound of

o(µ2 n|w1|2) + e O(n1 2p) + o(n2 2p|w1|2) + n4 4p o(µ2 n|w1|2) + n1 p,

as p > 1 and |w1| 1. This completes the proof of the upper bound.

For the lower bound, a careful inspection of the proof of Wu & Sahai (2024, Proposition A.2) reveals that, for q + r > 1, the lower bound holds for any multiclass problem with t 0. This recovers the desired bound by simply lower bounding the second term in the expansion Equation (17), completing the proof.

C ANALYZING THE SUBSET ENSEMBLE

In this section, we prove Theorem 3.2, which establishes weak-to-strong generalization in the simple weak-to-strong subset ensemble. We restate it below for convenience. Theorem 3.2 (Weak-to-strong generalization for subset ensemble). Consider the setup in Procedure 1 where the weak model fweak is MNI-trained on n correctly labeled examples and the strong model fw2s is MNI-trained on m = nu weakly pseudolabeled examples, where q + r > u and qweak + rweak > 1. In addition, we make the following data assumptions:

(1) The true binary labels are 1-sparse (Assumption 1) for the strong covariance.

(2) The weak and strong features follow the subset ensemble (Assumption 2) with bi-level eigenvalues Λ(p, q, r) and Λ(pweak, qweak, rweak), respectively, scaled relative to n.

(3) There are not too many weakly labeled examples: u < p+1+q+r (qweak+rweak)

Recall τstrong p + 1 2(q + r). Then, the resulting test error for fw2s satisfies

E[ℓ(fw2s)] = on(1), if u > qweak + rweak min {1 r, τstrong} 1 2 on(1), if u < qweak + rweak min {1 r, τstrong}. (5)

Proof. The key idea of the proof is that, despite the error rate of the weak classifier being 1

2 o(1), the weak classifier still has a noticeable amount of mass on the true label defining direction. So long as the error rate is not 1

2 O( 1 m), which random guessing can achieve, the strong learner can

Published as a conference paper at ICLR 2025

still pick up on the signal contained in the weak labels. To make this precise, we will control the coordinates of fweak in the distinguished basis (Definition 5).

Quantitatively, for the weak learner fweak, since we have clean labels and qweak + rweak > 1, we can apply Corollary B.12. Let µn,weak aweakn

sweak denote the bi-level prefactor for the weak model. Then the corollary shows that in the basis where v = e1, we have with high probability

fweak[1] = p

λF,weakz 1 A 1y

whereas for all other coordinates we have with high probability

λF,weak n dweak log n i {2, . . . , sweak} p

λU,weak n dweak log n i {sweak + 1, . . . , dweak}

( µn,weakn pweak

2 log n i {2, . . . , sweak}

n 1 2 pweak log n i {sweak + 1, . . . , dweak}

Hence, to obtain wweak in U , we first normalize fweak and then do the basis change. Some calculation yields

fweak 2 µn,weakn1 pweak + n1 pweak log n sweakλF,weak + (dweak sweak)λU,weak

dweak = (µn,weak + log n)n1 pweak

n1 pweak log n (qweak + rweak > 1)

where the second line follows because sweakλF,weak + (dweak sweak)λU,weak = dweak. Similarly, we can compute a simple lower bound:

i>s+1 (z i A 1y)2 = y A 1 X

i>s+1 ziz i A 1y

n1 pweak. (Lemma B.7)

We therefore obtain the sandwiching bounds

log n. (18)

By rotating to the distinguished basis, since qweak + rweak > 1, this yields

wweak = fweak fweak = w1e1 + w2e2 + ws+1es+1,

1 log n µn,weak w1 µn,weak (19)

1 log nn qweak

2 |w2| n qweak

1 |ws+1| . (21)

Let us now check the conditions to apply Corollary B.13, keeping in mind we need to scale with respect to m instead of n. In particular, we require

m ) for all i {1, 2, s + 1}.

Published as a conference paper at ICLR 2025

Since we are going to check the second condition, and clearly the first condition is satisfied for i = s + 1, it suffices to check the first condition for i = 1. By plugging in the scaling for w1 in

Equation (19), we have w1 = ω( q

1 qweak rweak > u qweak + rweak < u + 1. (22)

Let us now verify the second condition. It turns out that the second condition always holds under the bi-level parameterization. Indeed, from Equations (19) and (20), we always have

|w2| n qweak

log n 1 log nn

1 qweak rweak

as rweak < 1. Finally, for the third condition, we have µmn u p

2 = n u q r

1 qweak rweak

2 2u (p + q + r) < 1 (qweak + rweak) q + r > 2u (p + 1) + qweak + rweak (23)

With these preparations in hand, we now prove the positive side of the result.

Sufficient condition for weak-to-strong generalization. First we lower bound SUm(v |wweak).

Under Equations (22) and (23), we conclude that

SUm,strong(v |wweak) µmw1 1 log nµm n

1 qweak rweak

On the other hand, by Proposition B.14 we have

CNm,strong(v |w)2 o(µ2 m|w1|2) + µ2 mnr u log n + nu p log n.

We will now analyze the survival-to-contamination ratio for the strong learner. The first term is clearly at least ω(1). For the second term, we get 1 log nn

1 qweak rweak

which is ω(1) if

qweak + rweak < u + 1 r (24)

Finally, for the third term, we get 1 log nnu q r+ 1 qweak rweak

which is ω(1) if

qweak + rweak < u + p + 1 2(q + r). (25)

Collecting the conditions Equations (22), (24) and (25) yields

u > qweak + rweak min {1, 1 r, p + 1 2(q + r)} = qweak + rweak min {1 r, p + 1 2(q + r)},

which establishes the positive side of the result.

Sufficient condition for failure of weak-to-strong generalization. To get the negative result, we need to upper bound SUm,strong(v |wweak) and lower bound CNm,strong(v |wweak). Our earlier calculations and Proposition B.14 yield

|SUm,strong(v |wweak)| µm max n w1, n u 2 pp

log n o (26)

CNm,strong(v |wweak) µmn r u

The above bound only differ by log factors from the upper bounds. Hence, in this case, the sufficient condition merely flips the direction of the inequality, which recovers the stated result. Otherwise, if u < qweak + rweak 1, then certainly u < qweak + rweak 1 + r, and n u 2 p log n dominates. The latter is in turn dominated by n u p

2 , so the survival-to-contamination ratio is o(1). This completes the proof.

Published as a conference paper at ICLR 2025

D MULTILABEL CLASSIFICATION

In this section, we analyze multilabel classification with k classes, which is a variant of multiclass classification where a single datapoint can be positive examples of multiple classes. Given n datapoints, we encode this label enformation using k label vectors y(1), . . . , y(k), where for i [k], we encode the ith label vector as y(i) { 1}n, with 1 representing positive examples. We make the following 1-sparse assumption for multilabel data.

Assumption 3 (1-sparse assumption for multilabel). The label defining directions v(1) , . . . , v(k) are aligned with a top k eigenbasis of the strong covariance Σ. In a strong eigenbasis where v(i) = ei, we have

y(j) = sgn(xj), j [k]

Remark D.1. In fact, the analysis below will turn out to work even if the label defining directions are not orthogonal; they could all be the same! The analysis only relies on there existing some basis Ui such that v(i) = e1. However, the tight misclassification rate would be different depending on the relationship between the label defining directions.

The weak-to-strong setup is defined as follows.

(1) fweak Rdweak k: train on n datapoints using weak features and ground-truth labels. (2) fw2s Rd k: train on m n datapoints using strong features and hard pseudolabels generated from fweak.

As before, we will study weak-to-strong generalization using overparameterized linear classifiers. Hence, fweak will now consist of k different linear classifiers f (i) weak Rdweak for i [k], all trained by MNI on n clean multilabel datapoints. To train f (i) w2s, we generate hard pseudolabel vectors by(i) = sgn( f (i) weak, xweak ), and then perform MNI on these hard pseudolabel vectors. We deem that a multilabel classifier f generalizes if, for a fresh test sample xtest and for every class i [k], f correctly labels whether xtest is a positive example of class i. More formally, define for a collection of classifiers f = (f (1), . . . , f (k)), the loss function ℓ(f) = Pr[ i [k] : sgn(f (i)(xtest)) = y(i) test], where the probability is taken over a fresh test sample (xtest, ytest). Remark D.2. The weak model can also be trained on clean multiclass data rather than clean multilabel data. This will only affect the regimes for the weak model to satisfy Desideratum 1.iii, based on Theorem 3.1.

The subset ensemble definition is essentially the same as before (Assumption 2), with the main difference being that we require all k of the label defining directions to be favored and axis-alignable in both the weak and strong favored feature subspace. Assumption 4 (Subset ensemble for multilabel classification). Let Λ = Λ(p, q, r) Rd d denote the strong eigenvalues and Λweak = Λ(pweak, qweak, rweak) Rdweak dweak denote the weak eigenvalues, both drawn from the bi-level ensemble. Suppose the 1-sparse assumption (Assumption 1) holds for the strong covariance, with any distinguished eigenbasis U where v(i) = ei for i [k]. The following conditions relate the weak and strong features after rotating to U.

(1) xstrong N(0, Λ), where Λ = λF I[s] + λUI[d]\[s].

(2) There exists subsets of coordinates S [s], T [d] \ [s], with [k] S and |S| = sweak, such that

xweak = ( q

λU ΠT )xstrong d= N(0, λF,weak IS + λU,weak IT ) .

The crucial observation is that multilabel training boils down to k (nearly) independent binary classification problems. The main difference to establish successful generalization is that we now need to union bound over all k multilabel classifiers. The high probability bounds from the binary analysis come from two sources: (1) applying spectral bounds (Lemma B.5), which holds with very high probability (exp n1/2 ), and (2) Hanson-Wright calculations (Proposition B.8), where bounds

Published as a conference paper at ICLR 2025

that hold with probability δ have deviation poly log(1/δ). Hence, for k a constant, δ is only affected by a constant, and this will only change the bounds in our analysis by a constant, which does not shift the regimes. Furthermore, even if we allow k to scale with n as in Definition 2, since k is polynomial in n, the dependence for the high probability bounds on k is at most polylogarithmic in k, so again nothing changes with the analysis, which looks at polynomial regime shifts.

For the converse direction, one can use a crude bound of just analyzing the probability of one classifier failing, which gives a error rate bounded from below by 1

2 on(1). However, we expect that a more refined analysis would give the expected error rate of 1 O(2 k); we sketch an argument for how to get this improved error rate after the theorem statement. With these minor modifications, the formal details go through unchanged, and we arrive at our main theorem for weak-to-strong multilabel classification using hard pseudo-multilabels. Theorem D.3 (Weak-to-strong generalization for multilabel subset ensemble). Consider the setup where the weak model fweak is trained on n correctly labeled examples and the strong model fw2s is trained on m = nu weakly labeled examples using MNI, where u < p, q + r > u, and qweak + rweak > 1. Assume the following:

(1) The true multilabels satisfy the 1-sparse assumption (Assumption 3) for the strong covariance.

(2) The weak and strong features follow the subset ensemble (Assumption 2) with bi-level eigenvalues Λ(p, q, r) and Λ(pweak, qweak, rweak), respectively, scaled relative to n.

(3) There are not too many weakly labeled examples: u < p+1+q+r (qweak+rweak)

Let τstrong p + 1 2(q + r). Then, the resulting test error for fw2s satisfies

E[ℓ(fw2s)] = on(1), if u > qweak + rweak min {1 r, τstrong} Ω(1), if u < qweak + rweak min {1 r, τstrong}. (28)

We now sketch out an approach which should give the correct error rate for constant k (or even k growing slowly with n). Note that the complement of the failure event is

i [k] : sgn( f (i) w2s, xtest) = y(i) test.

From the Gram-Schmidt decomposition Equation (8) and the noise stability formula, we can decompose this event as

i [k] : sgn( SU(v(i) |w)

CN(v(i) |w)x(i) + g(i)) = sgn(x(i) ),

where x(i) = D gtest, v(i) E and g(i) are iid standard Gaussians. In the converse regime, the survival

to contamination ratio is polynomially decaying, so for typical g(i), with probability exp( nc) we

have sgn( SU(v(i) |w)

CN(v(i) |w)x(i) + g(i)) = sgn(g(i)). Furthermore, by unpacking on the analysis of f (i) w2s (specifically, see Equation (20)) for i = j, g(i) and g(j) only differ by a Gaussian with polynomially decaying variance, so up to a failure event with probability exp( nc), we can replace these with the same Gaussian eg which is independent of both x(i) and x(j) . We can do this for all pairs (i, j) and union bound over k (this is where we use the slow growing condition on k). Hence, up to these error terms, which are negligible, the complementary event occurs with probability 2 k, so the test error will indeed be 1 O(2 k).

D.1 MULTILABEL SUPERVISION FOR MULTICLASS CLASSIFICATION

In this section, we expand upon the arguments to use weak multilabel supervision to train a weakto-strong multiclass classifier fw2s. The difficulty of the multiclass analysis is studying the MNI behavior for weak supervision. However, since the multilabel MNI behavior is well under control by the arguments above, it is tractable to study this setup instead.

The key insight is that, for multiclass classification to succeed, it suffices to look at pairwise comparisons between the score functions for the k different classes (see Wu & Sahai (2024) for more

Published as a conference paper at ICLR 2025

justification). In particular, one studies the relative survival given different class label vectors:

SU(v|by(i) weak) SU(v|by(j) weak),

with the signal component for class i being the above quantity with v = v(i) . One can then define the relative contamination in analogous way. However, the path towards studying the survival comes from understanding the coefficients of fweak and fw2s, which is feasible.

Thus, to determine whether fw2s generalizes for multiclass classification, it reduces back down to the multilabel/binary behavior. As argued in Wu & Sahai (2024), because of the margin between the features and the survivals, it suffices to get a polynomially increasing SU/CN ratio. Hence, in the successful regime for weak-to-strong multilabel generalization, this relative SU/CN ratio is polynomially increasing, which implies that multiclass classification succeeds.

E IMPROVING THE BOUNDS FOR MISCLASSIFICATION RATE

In this section we tighten the bounds on misclassification rate for the multiclass setting. Theorem E.1 (Tightening of (Wu & Sahai, 2024, Proposition A.7)). Assume we are in the bi-level ensemble model (Definition 1), the true data generating process is 1-sparse (Assumption 1), and the number of classes follows the scaling defined in Definition 2. Then, in the negative regime where the model does not achieve vanishing error, we have

E[ℓ(fstrong)] = 1 Θ 1

where the expectation is taken over the randomness of the training data and the test point.

At a high level, the misclassification event Eerr is governed by the following inequality holding:

SUn CNn max j [k] |xj| max i [k] gi,

where (xj)j [k] are iid standard normals, and (gi)i [k] are jointly gaussian with some correlation structure.

Now, (Wu & Sahai, 2024, Proposition A.3) implies that SUn

CNn n u for some constant u > 0 with probability 1 O(1/n). Also, we can upper bound maxj [k] |xj| O( log k) with probability at least 1 O(1/k). Moreover, in the regime where classification fails, we know that from (Wu & Sahai, 2024, Proposition F.1), we have that E[gigj] 1

2 + n δ for some constant δ > 0 with probability at least 1 O(1/n). Since k = o(n), when we union bound over the above events, they get absorbed by the O( ) and Ω( ) terms.

To tighten the misclassification rate, we will improve upon (Lopes & Yao, 2022, Theorem 2.1). In particular, we shows the following lower tail inequality for the maximum of correlated gaussians. We will prove the following theorem, which captures the lower tail behavior of the maximum of correalted Gaussians very far from its expectation. This result complements the lower tail bound of Lopes & Yao (2022), which holds for more moderate deviations. Theorem 3.6 (Lower tail for correlated Gaussians). Let ρ0 (0, 1) be a parameter bounded away from 0 and 1, and let (gi)i [N] be jointly Gaussian with zero mean and unit variance. Suppose E[gigj] ρ0 for all distinct i, j [N]. For any 0 t N = δ0 p

2(1 ρ0) log N where δ0 [0, 1) is bounded away from 1, there is a constant C > 0 depending only on ρ0 such that

Pr max i [N] gi t N

C N (1 δ0)2(1 1

ρ0 )(log N)

1 ρ0(2 δ0) δ0

In particular, one can take C = q

ρ0 1 ρ0 . If we further have E[gigj] = ρ0 for all distinct i, j and

t N = O log log N log N

, then Pr maxi [N] gi t N = Θ N 1 1

Before we prove the theorem, let us see how it tightens the misclassiifcation rate.

Published as a conference paper at ICLR 2025

Proof of Theorem E.1. We will apply Theorem 3.6 with N = k 1, ρ0 = 1

2 + n δ, and δ0 = 1 log k.

Noting that δ0 p

2(1 ρ0) log k n u if k = cknt for any t [0, 1), we see that

Pr[max i [k] gi n u] Pr[max i [k] gi δ0 p

2(1 ρ0) log k] (δ0 p

2(1 ρ0) log k n u)

O k (1 δ0)2(1+n δ)(log k)n δ δ0

2 (Theorem 3.6)

O k (1 δ0)2(log k) δ0

2 (1 + on(1)) (kn δ = 1 + on(1))

O(k (1 δ0)2) ((log k) δ0/2 = 1 ok(1))

. (δ0 = 1/ log k)

Inverting this bound, we see that

Pr[max i [k] gi > n u] 1 O 1

For the upper bound on the above probability, we can use Slepian s lemma and then compare to Gaussians gi which have correlation 1

2 n δ. Writing it all out, we have

Pr[max i [k] gi n u] Pr[max i [k] gi n u] (Slepian s lemma)

Pr[max i [k] gi 0]

((Pinasco et al., 2021, Theorem 2.1))

. (kn δ = 1 + on(1))

from which we get a nearly matching upper bound on the probability of 1 Ω( 1

We return to the proof of Theorem 3.6.

Proof of Theorem 3.6. The lower bound directly follows from the proof of Lopes & Yao (2022), so we focus on proving the upper bound. We remark that the constant hidden by Θ can be pinpointed to (1 + o N(1)) q

ρ0 1 ρ0 , but we will not discuss this further.

To reduce confusion, we will attempt to follow the notation and treatment from Lopes & Yao (2022). To prove the upper bound, we can use Slepian s lemma to reduce to the case where E[gigj] = ρ0 for all i = j. Then, we can explicitly decompose

gi = ρ0x + p

where x, hi are iid standard Gaussians. Via this decomposition, we can write an integral representation for the desired probability:

Pr[max i [N] gi t N] = Z

ψ(s)ds, (30)

ψ(s) ϕ(s)ΦN t N ρ0s 1 ρ0

where ϕ( ) and Φ( ) are the standard Gaussian density and CDF, respectively. We will estimate this integral by splitting it into a couple pieces. To this end, we will bound the Gaussian CDF using Mills inequality. For any t > 0, we have

t 1 + t2 ϕ(t) 1 Φ(t) = Φ( t) 1

Published as a conference paper at ICLR 2025

We will need the following multiplicative estimate on ΦN( ), which holds for any δ [0, 1]:

2 log N(1 δ)) = (1 Φ( p

2 log N(1 δ)))N

2πN(1 δ)2( 2 log N(1 δ)+1)

N (Mill s inequality)

exp N 1 (1 δ)2

2π( 2 log N+1)

The above bound (32) yields nontrivial bounds only when δ > c log log N

log N for a constant c > 1

4, and decays superpolynomially if c > 3

These bounds motivate splitting (31) into a few different pieces, based on which term dominates the behavior of the integral ψ. To this end, let c N, d N > 0 be parameters we specify shortly. Then we split the integral into three pieces: Z c N

ψ(s) ds + Z d N

c N ψ(s) ds + Z

d N ψ(s) ds .

As in Lopes & Yao (2022), we define α0 ( 1

ρ0 1)(1 δ0)2 and β0 α0 1 δ0 . Then the ultimate

bound we want to prove is R ψ(s) ds O N α0(log N) β0 1

2 . We explain the choice of

c N, d N > 0 as follows. For succinctness, we introduce the following two functions on R:

2(1 ρ0) log N

ρ0 u = 2α0 log N

f N(s) t N ρ0s 1 ρ0 ,

and reparameterize the interval [ c N, d N] as another interval I on u below.

1. We want c N to satisfy Φ( c N) = O N α0(log N) β0 1

2 . Given this, the first term can

be bounded by Z c N

ψ(s) ds Z c N

ϕ(s) ds = Φ( c N) O 1 N α0 (log N) β0 1

By inverting Mills inequality, we see that it suffices to pick

2(1 ρ0) log N

1 δ0 log log N

1 δ0 log log N

Indeed, we have for sufficiently large N that

Φ( c N) ϕ(c N)

c N exp α0 log N(1 log log N 4(1 δ0) log N 2)

ρ0 1) log N(1 o N(1))

r ρ0 1 ρ0 N α0(log N)

2 (β0 = α0 1 δ0 )

2. We pick d N such that ΦN(f N( d N)) 1 Nα0 (log N) β0 1

2 . Given this, by monotonicity we evidently have Z

d N ψ(s) ds ΦN(f N( d N)) 1 N α0 (log N) β0 1

Published as a conference paper at ICLR 2025

Owing to (32), to achieve the desired superpolynomial decay it suffices to pick d N 0 such that t N+ ρ0d N 1 ρ0 2 log N(1 ( 3

4 + ε) log log N

log N ), where ε > 0 is a constant.

Recalling that t N = δ0 p

2(1 ρ0) log N, we see that f N( d N) = 2 log Nδ0 + ρ0 1 ρ0 d N. Hence the desired inequality holds by picking

2(1 ρ0) log N

4 + ε)log log N

4 + ε)log log N

Given the above choices of c N, d N in Eqs. (33) and (34), we see that the interval I that u belongs to is I [1 δ0 ( 3

4 + ε) log log N

log N , 1 δ0 1

4 log log N

log N ]. Hence, it is natural to reparameterize the integral in terms of η [0, 1

2 + ε] via the following change of variables

u N(η) 1 δ0 (η + 1

4) log log N

and an easy computation yields

ds = s N(u N(η))u N(η) dη = 2α0 1 δ0 log log N

It is not hard to see that u N([0, 1

2 + ε]) = [ c N, d N]. Let us introduce the following abbreviations for the compositions of our changes of variable:

es N(η) s N(u N(η))

ef N(η) f N(es N(η)) = p

2 log N 1 (η + 1

4) log log N

In this notation, we have Z d N

c N ψ(s) ds = 2α0 1 δ0 log log N

0 ψ(es N(η)) dη .

Now the argument as in (Lopes & Yao, 2022, Eq. 4.22) shows that this integral is O(N α0(log N) β0 1

2 ), which completes the proof of the first part of the theorem.6 However, let us give an alternative proof which is a bit simpler. We will estimate the integral by performing a Riemann sum with subintervals of width τ > 0, which we pick to satisfy β0τ < 1 and such that τ evenly divides 1

2 + ε. Then, we have

0 ψ(es N(η)) dη

kτ ψ(es N(η)) dη .

By monotonicity, for any k we have Z (k+1)τ

kτ ψ(es N(η)) dη τ ϕ(es N((k + 1)τ))ΦN( ef N(kτ))

C τ N α0(log N)β0 ( 1

2 +2(k+1)τ) exp (log N)2kτ ,

where C is a universal constant. Here, the last line used ϕ(es N(η)) = O(N α0(log N)2β0 ( 1

4 +η)) and ΦN( ef N(η)) = O(exp( log N)2η).

If 2β0(k + 1)τ < 1

2, then log log N

log N (log N)2β0(k+1)τ (log N) 1

2 , as desired. On the other hand, if 2β0(k + 1)τ 1

2, then as β0τ < 1

8, we have 2kτ 1 4β0 > 0. Since exp (log N)2kτ dominates

any polylog terms, so it is not hard to see that the total contribution here is N α0(log N) β0 1

2 . Hence, we conclude that R d N c N ψ(s) ds q

ρ0 1 ρ0 N α0(log N) β0 1

2 , as desired.

6To be explicit, their calculation requires us to verify that 2 log N(δ0 + u N(η)) = ω(1), which is certainly true.

Published as a conference paper at ICLR 2025

F EXPERIMENTS

In this section, we describe the simulations we conducted to validate the theory. We generated Gaussian data following the subset ensembles specified in the figures, and constructed two linear models from them: the MNI classifier and the simple averaging classifier. In the averaging classifier, we average over the positive examples of a label, which approximates the behavior of the first few iterations of gradient descent. In contrast, the MNI classifier governs the asymptotic behavior of gradient descent.

The weak-to-strong behavior for these two learning algorithms was compared to two other baselines: the weak accuracy for fweak and the accuracy for the strong model trained on m clean labels: fstrong. The test accuracies were evaluated on ntest = 100 fresh datapoints. We ran 8 independent trials to train fweak with n = 50 so that we could explore how the weak-to-strong behavior scales with p and u. For each fweak, we conducted 16 independent trials to train fw2s.

We swept out u using five equally spaced points in [1, 1.3]. In Figures 3 and 4, we show the results of the averaging and MNI experiments, respectively, for four different slices. In the top row, we show two slices where the theory predicts weak-to-strong generalization to occur for MNI, and in the bottom row we show two slices where the theory predicts failure of weak-to-strong generalization. The error bars show the estimated 95% CI over all sources of uncertainty in the inner and outer loop (fw2s and fweak). In both the averaging and MNI plots, the theory successfully predicts whether weak-to-strong generalization occurs. Furthermore, in every plot the ground truth trained strong model fstrong trained on m clean labels has better test accuracy than fweak and fw2s, as expected. Another interesting experimental observation is that the averaging classifier does significantly better than MNI in non-asymptotic settings. This corroborates the view of practitioners of the benefits of early-stopping for gradient descent.

G HEURISTIC CALCULATIONS

Recall from the definition of MNI that

fstrong = X A 1y,

where y { 1}n is a label vector generated by either the true feature x or a weak feature xweak and A = XX Rn n is the Gram matrix. The key to our analysis is studying the survival and contamination of various features when the labels are possibly generated by another feature. Recall that we performed a basis change X 7 XU, so that the strong features are drawn iid from N(0, Λ).

Writing the transformed data matrix now as

X = λ1z1 λ2z2 . . . λdzd ,

where each zi N(0, In), we obtain for any unit norm v RD that

i [d] λiz i A 1y vi, v (MNI)

CN(v) = s X

i [d] (λiz i A 1y)2(1 vi, v 2). (Orthonormality of vi)

Since A is close to d Id in the isotropic case, and we are working in the regime where PCA fails to extract the bi-level structure (q + r > 1), one could hope for the best and pretend that A 1 = 1

d Id. This step is not rigorous, but we will justify these approximations in Appendix B.

Based on the decompositions in Definition 3, we study the case where y = sgn( g, w ) for some w RD but we want to recover the planted direction v RD. For a subspace V RD and a vector u RD, let u V denote the projection of u onto V . For axis-aligned subspaces V [d], this just corresponds to restricting to the coordinates in V . To simplify the heuristic calculation, we will make the following assumptions. Assumption 5. Let S = [s] denote the spiked subspace after the basis change, and let α, ρ > 0 be parameters possibly depending on n. For any vector u Rd, let Tu n i S : |ui| = ω( 1 n) o

Published as a conference paper at ICLR 2025

w2s weak ground truth

(a) (p, q, r) = (2, 0.6, 0.6), (pweak, qweak, rweak) = (1.4, 0.9, 0.5).

w2s weak ground truth

(b) (p, q, r) = (2, 0.9, 0.4), (pweak, qweak, rweak) = (1.4, 0.9, 0.4)

w2s weak ground truth

(c) (p, q, r) = (1.5, 0.6, 0.8), (pweak, qweak, rweak) = (1.4, 0.9, 0.5).

w2s weak ground truth

(d) (p, q, r) = (2, 0.9, 0.4), (pweak, qweak, rweak) = (1.4, 0.9, 0.4)

Figure 3: Comparison of test accuracies for four different models using averaging training. The x-axis plots m, the number of additional labeled datapoints. The models are trained using class averaging, which approximates the behavior of the initial few gradient descent iterations. Note how the weak model has low accuracy, whereas the weak-to-strong model and ground truth have higher accuracies that increase as m increases. The top row Figures 3a and 3b are in a regime where we predict MNI weak-to-strong generalization to succeed, whereas the bottom row Figures 3c and 3d depict regimes where we expect MNI weak-to-strong generalization to fail.

denote the spiked coordinates where u is large, and Ru = S \ Tu denote the spiked coordinates where u is small.

We assume the following holds for T = Tw and R = Rw:

(1) We have v T , w T = α v T 2 w T 2. In other words, v and w have correlation α restricted to the heavy coordinates for w.

(2) We have P

i T v2 i w2 i = ρ2 w T 2 2. Note that by Lp norm inequalities we always have ρ2 1.

(3) We have |R| = Ω(s), i.e. a constant fraction of w s spiked coordinates are small.

Remark G.1. When v = w, Item (2) can be thought of as a relaxed notion of v being 1-sparse; for a given ρ one should roughly think of v as being 1

Published as a conference paper at ICLR 2025

w2s weak ground truth

(a) (p, q, r) = (2, 0.6, 0.6), (pweak, qweak, rweak) = (1.4, 0.9, 0.5).

w2s weak ground truth

(b) (p, q, r) = (2, 0.9, 0.5), (pweak, qweak, rweak) = (1.4, 0.9, 0.5)

w2s weak ground truth

(c) (p, q, r) = (1.5, 0.6, 0.8), (pweak, qweak, rweak) = (1.4, 0.9, 0.5).

w2s weak ground truth

(d) (p, q, r) = (2, 0.9, 0.4), (pweak, qweak, rweak) = (1.4, 0.9, 0.4)

Figure 4: Comparison of MNI test accuracies for four different models. Observe how the weak-tostrong accuracy increases as m increases. Again, the top row Figures 4a and 4b are in a regime where we predict MNI weak-to-strong generalization to succeed, whereas the bottom row Figures 4c and 4d depict regimes where we expect MNI weak-to-strong generalization to fail. The plots corroborate these theoretical predictions.

Survival bound. Recall that y = sgn( g, w ). By applying the noise stability formula again and the fact that zij N(0, 1), we deduce that

E[z i y] = n E[zijyj] = Pr[sgn(zij) = yj]E[|zij|] (1 Pr[sgn(zij) = yj])E[|zij|]

2 π(2Pr[sgn(zij) = yj] 1)

π)3/2 arcsin wi,

where in the second to last line we have used the fact that the expected magnitude of a standard Gaussian is p

2/π, and in the last line we have used the noise stability formula. By standard concentration inequalities, the deviations will be of order O( n). As the behavior of 2

π arcsin(x) 2 πx for small x, we deduce that the expectation will dominate whenever |wi| 1 n).

We will now plug in the bi-level scaling. Recall that λi = λF = ad

s for i [s] and λi = λU = (1 a)d

d s for i > s.

Published as a conference paper at ICLR 2025

Thus, ignoring constants, with high probability we should have

SU(v|w) = λF n

i Tw vi arcsin wi λF n d

i Rw vi λU n d

Observe that 2 π |x| 2

π |arcsin x| |x|,

for all x [ 1, 1].

Consequently, for i Tw, each summand contributes vi arcsin(wi) Θ(viwi).7 Thus, by Item (1) and the definition of λF , the first term is Θ(α an

s v Tw 2 w Tw 2).

For the second term, we will upper bound its magnitude by

n v Rw 2 (Rw S)

n . ( v 2 = 1)

Finally, for the third term, since P

d, after plugging in the definition of λU, we obtain the asymptotics

s α v Tw 2 w Tw 2 r s

Note that for the first term to dominate, we must have α = Ω( s

n). Also, if we want to improve our estimate on the second term, we can further split it by Tv, we can gain and get deviations of order

i Rw Tv vi + X

v Rw Tv 1 + |Rv| n (Definition of Rv)

v Rw Tv 1 + s n,

yielding an ultimate relative deviation of v Rw Tv 1 n + s

n. This is significantly better if, say |Tv| =

o(s), as it allows us to beat the contamination bounds which have relative deviations p s

Contamination bound. For succinctness, introduce the shorthand h2 i = (1 v2 i ) zi, y 2. Then the squared contamination is

CN(v|w)2 = X

i [d] λ2 i (1 v2 i ) y A 1ziz i A 1y

i [d] λ2 i h2 i (A 1 1

= λ2 F d2 X

i Tw h2 i + λ2 F d2 X

i Rw h2 i + λ2 U d2 X

i Tw (1 v2 i )( 2

π arcsin wi)2 + λ2 F n d2 X

i Rw (1 v2 i ) + λ2 Un d2 (d s),

where in the last line we have used the observation that the expectation of hi dominates if and only if i Tw.

7The reason this is not an equality is that there might be some heavy w is which disagree in sign with v is, but for the settings we consider this estimate will be true.

Published as a conference paper at ICLR 2025

The first term can be bounded (up to constants) as an

i Tw (1 v2 i )w2 i = an

2 w Tw 2 2(1 ρ2) . (Item (2))

For the second term, we can bound up to constants as

i Rw (1 v2 i ) = an

2 |Rw| v Rw 2 2 n

n . ( v 2 = 1)

Finally, the third term can be bounded by n

d . Putting these together, we conclude that

1 ρ2 w Tw 2 +

Let µn = an

s . In our regime, µn 1 because q + r > 1. Combining Equations (35) and (36) yields

SU(v|w) CN(v|w) µn α v Tw 2 w Tw 2 v Rw Tv 1 n s

1 ρ2 w Tw 2 + q

Hence, for the survival to contamination ratio to grow with n, we need

1 ρ2 (Weak supervision)

α v Tw 2 w Tw 2

n (Favored contamination)

µnα v Tw 2 w Tw 2 rn

Let us put these scalings together to predict the scaling regimes for weak-to-strong generalization. For strong generalization, we have v = v and y = sgn( g, v ). From the discussion in Section 3.1, we know that the strong learner generalizes if

SU(v |v ) CN(v |v ) = ωn(1),

and fails to generalize if the ratio is on(1). Under Assumption 1, we have v = e1, so Tv = Tw = {1}, α = 1, and ρ = 1, and the expression simplifies to

SU(v |v ) CN(v |v ) µn µn p s

which under the bi-level parameter scaling verifies the conditions for ground truth supervision.

This completes the proof sketch; it remains to justify the above estimates rigorously. In the subsequent subsections, we will assume that the above scalings of the survival and contamination are correct and use them to deduce that the 1-sparse assumption is necessary to get a sharp transition in the test error. These calculations can be upgraded to rigorous proofs using the tools are developed in Appendix B.

G.1 THE NECESSITY OF 1-SPARSE ASSUMPTION

Let s suppose we get clean labels from sgn( g, v ) and want to learn the unit vector v. We will

abbreviate T = Tv = Tw. In this case, we have α = 1 and ρ2 = v T 4 4 v T 2 2 in Assumption 5.

Published as a conference paper at ICLR 2025

Lemma G.2. Suppose we are given labels according to v and want to learn v. Then, the survival to contamination ratio is ωn(1) only if

v T 4 4 v T 2 2 = 1 o(1).

In particular, the above condition holds only if the following two conditions hold:

(1) v T 2 2 = 1 o(1).

(2) For each i T, either vi = o(1) or vi = 1 o(1).

The upshot is that having 1-sparse labels is necessary for obtaining asymptotically perfect generalization.

Proof. Hence, focusing only on the survival terms coming from T, which are the only relevant coordinates for learning, and lower bounding the contamination with just the T terms, we have

SU(v|v) CN(v|v)

v T 2 2 v T 4 4

1 v T 4 4 v T 2 2

1 v T 4 4 v T 2 2

where the last inequality used the fact that v is unit norm. This proves the first necessary condition.

We show that if v T 4 4/ v T 2 2 = 1 o(1), then the second set of necessary conditions hold.

Indeed, for the first claim, suppose v T 2 2 1 ε for some constant ε > 0. The Lp norm inequalities imply that v T 4 4 v T 4 2 (1 ε) v T 2 2, so the ratio is at most 1 ε, a contradiction. For the second claim, suppose instead there is a coordinate i with v2 i = 1 ε for some constant ε (0, 1). Then, we have v T \{i} 4 4 v T \{i} 4 2 ε, but then v T 4 4 (1 ε)2 + ε 1 Ω(ε), a contradiction.