# Understanding the Gains from Repeated Self-Distillation

Divyansh Pareek, Simon S. Du, Sewoong Oh
Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA
{dpareek,ssdu,sewoong}@cs.washington.edu

Abstract

Self-Distillation is a special type of knowledge distillation where the student model has the same architecture as the teacher model. Despite using the same architecture and the same training data, self-distillation has been empirically observed to improve performance, especially when applied repeatedly. For such a process, there is a fundamental question of interest: How much gain is possible by applying multiple steps of self-distillation? To investigate this relative gain, we propose studying the simple but canonical task of linear regression. Our analysis shows that the excess risk achieved by multi-step self-distillation can significantly improve upon a single step of self-distillation, reducing the excess risk by a factor as large as d, where d is the input dimension. Empirical results on regression tasks from the UCI repository show a reduction in the learnt model's risk (MSE) by up to 47%.

1 Introduction

Knowledge distillation [12] was initially proposed as a way to transfer the knowledge learnt by a larger teacher model to a smaller student model, which can then be deployed in limited resource settings. The process is as follows: train a teacher (T) model using ground-truth labels, then use its predictions to supervise the training of a student (S) model via a combined per-sample loss,

$\xi\,\ell\big(\hat{y}_{\mathrm{T}},\, y_{\mathrm{S}}(\theta)\big) \;+\; (1-\xi)\,\ell\big(y,\, y_{\mathrm{S}}(\theta)\big)$,   (1)

where ℓ denotes the loss function, y is the ground-truth label, ŷ_T denotes the teacher's prediction, and y_S(θ) denotes the student's prediction, parameterized by the learnable θ. The extra hyperparameter ξ is called the imitation parameter [25], generally restricted to ξ ∈ [0, 1]. It gives additional freedom to the student to balance importance between the labels and the teacher's predictions. The student trained via this distillation objective (i.e., utilizing the teacher's predictions through ξ ≠ 0) has been widely observed to generalize better than when trained only on the labels (i.e., ξ = 0). This gain has been attributed to "dark knowledge" that is (i) impossible to be directly extracted from the training data by the small model, but (ii) easily learnt by the large model and transferred to the small model.

Challenging this interpretation, Li et al. [20] and Furlanello et al. [10] empirically observed performance gains through distillation even when the teacher and student are same-sized models. One can set T and S to have the same architecture, and S trained with the objective in Eq. (1) outperforms T. This is referred to as Born-Again Networks (BANs) or Self-Distillation (SD). Furthermore, repeatedly applying self-distillation on the same training data with a student model having the same architecture provides additional gains on benchmark datasets and architectures [10, 36, 44]. At each step, the student from the previous step acts as the teacher used to train a new student model under the self-distillation loss of Eq. (1). For such multi-step self-distillation, there is a fundamental question of interest: How much more gain can we get by repeatedly applying self-distillation?

Recently, Das and Sanghavi [9] provided theoretical understanding of the original one-step self-distillation.
For the canonical task of fixed design linear regression, considering the standard ridge estimator as both the teacher and student model, [9] showed that there is indeed a regime of problem instances in which the optimal student (i.e., with optimally tuned ridge parameter λ and imitation parameter ξ) can provably achieve a strictly lower test error than the optimal teacher (i.e., with optimally tuned λ). However, the amount of this gain has not been characterized in closed form, and can only be numerically evaluated for a given problem instance. Inspired by this work, we aim to study the performance gains from multi-step self-distillation under linear regression.

Contributions. We summarize our contributions below.

- Under the fixed design linear regression defined in Section 3.1, we show that the optimal multi-step self-distilled model (i.e., each ξ value at each step is optimized for the validation accuracy of the final multi-step self-distilled model) can achieve a test error that is a factor of d smaller than the optimal one-step self-distillation (Theorem 1), under certain assumptions on the problem parameters (Assumption 2). Here, d is the dimension of the input.
- Our analysis in Theorem 1 suggests that the sequence of ξ parameters provides additional freedom that can control the spectrum of eigenvalues of the linear estimator. Optimally choosing these ξ parameters can significantly reduce the variance of the estimator, leading to a factor of (up to) d difference in the overall test errors of multi-step SD compared to 1-step SD. We note that Das and Sanghavi [9] also observed a bias-variance tradeoff associated with the ξ parameter for 1-step SD compared to the ridge, which was the reason behind 1-step SD strictly outperforming the ridge.
- We demonstrate the necessity of the main assumption (Assumption 2) both theoretically (Theorems 2 and 3) and numerically (Figure 3). Further, we provide a lower bound for the test error that any repeated SD can achieve, and show that only r steps of SD (with optimally chosen ξ at each step) are sufficient to achieve it, when the input data matrix has rank r (Theorem 4).
- By capturing the functional form of the test error in ξ (Theorem 5), we also show a method to practically select the ξ parameters for real-world regression tasks. In Section 5, we empirically show that this theoretical insight leads to selecting effective ξ values, which can indeed achieve a lower test error on real-world regression tasks.

2 Related Work

Knowledge distillation and self-distillation. Hinton et al. [12] and Ba and Caruana [3] proposed knowledge distillation to transfer knowledge learnt by large teacher models into smaller student models without any substantial performance drop (e.g., [30, 31, 14, 8, 33, 24, 32] and surveys in [11, 13]). Distillation also provides interpretability [23], robustness to adversarial examples [28], and defense against backdoor attacks [39, 21, 35], although stronger backdoor attacks have been proposed that bypass the distillation defense [17]. Perhaps surprisingly, empirical observations show that performance improves when a teacher model is distilled into a student model with the same architecture on the same training data (self-distillation). Performance gains with one-step self-distillation of the form Eq. (1) were first demonstrated by Li et al. [20] for AlexNet on YFCC100M.
Further gains can be achieved by repeating self-distillation, as shown for the Dense Net architecture family on CIFAR10 and CIFAR100 [10, Table 2]. To empirically explain such gains, Zhang and Sabuncu [44] measured prediction uncertainty on the same multi-step experiments and offered an interpretation that soft labels capture sample-level uncertainties. Yang et al. [36] also reproduced the same experiments and explained the gains as knowledge refinement on the class similarities. We will analytically study the gains that can be achieved with such repeated self-distillation. Many variations of self-distillation have also been proposed. Snapshot Distillation [37] tries to treat previous checkpoints (snapshots) of the same model as the teacher. Zhang et al. [43] employ a group of collaborative students with no teacher. Zhang et al. [42, 41] use it for model self-improvement, and DINO [7] adopts self-distillation for self-supervised learning. Knowledge distillation is also popular for transfer learning, where the student model is trained on a different dataset than the teacher model [38, 40, 1], which is not a setting we address. With the recent scaling of data, [45], [22] are relevant works using a teacher model for either label editing or data reweighing. Theory of distillation and self-distillation. Theoretical understanding of distillation started with Phuong and Lampert [29] studying linear student networks. Mobahi et al. [27] studied self-distillation theoretically in the restricted setting of ξ = 1 (i.e. only teacher supervision, no ground-truth labels), showing that in this setting, the SD process acts as a regularizer, with a few steps of SD helping, but further steps hurting model performance. We study the setting where ξ is not restricted to 1, and show a different conclusion. In particular, we observe that more steps of SD always provide an improvement, if the ξ parameters are chosen optimally. Allen-Zhu and Li [2] analyzed a stylized setting, where a different view of the data is learned by different models, and show how ensemble methods can combine the views, achieving improved test accuracy. This framework is used to show how self-distillation can also improve model accuracy by implicitly performing ensembling. Menon et al. [26] theoretically studied distillation in the classification setting, and also observed a bias-variance tradeoff underlying teacher supervision. Das and Sanghavi [9] theoretically studied one-step self-distillation for fixed design linear regression and binary classification, and showed that the student can provably achieve a lower test error than the teacher. We take inspiration from them and study the multi-step SD to characterize this performance gain, showing that the multi-step SD can outperform one-step SD by a large factor. Borup and Andersen [4] also studied multi-step SD and obtained an analytical form for the k-step SD similar to ours [4, Theorem 4.1]. The crucial difference is the freedom of the ξ parameters being different at each step of self-distillation. Whereas [4, Lemma 4.2 and Theorem 4.3] assume the ξ values at each step are equal (similar to [27]), concluding that after a point, more steps of SD will result in a poorer performing model (similar to [27]); our main result (Theorem 1) is different as it says that subsequent steps of SD strictly provide more freedom, and that the best multi-step SD can outperform the best 1-step SD by an Ω(r) factor. 
Similar to us, Jeong and Chung [16] also take inspiration from [9]; however, [16] aims to provide understanding of multi-step self-distillation in the multi-class classification setting.

3 Problem formulation and background on self-distillation

Focusing on the simple but canonical task of linear regression, we investigate the performance gain from applying repeated self-distillation.

3.1 Linear regression

For the observed response Y ∈ ℝ and the covariate X ∈ ℝ^d, the following assumption is standard in linear regression, e.g., [9].

Assumption 1. There exist θ* ∈ ℝ^d and γ > 0 such that (i) E[Y | X] = ⟨θ*, X⟩; (ii) Var[Y | X] = γ² for all X ∈ ℝ^d; and (iii) (Y − E[Y | X]) ⫫ X, i.e., the label noise is independent of X.

The training set of size n is denoted by X ∈ ℝ^{d×n}, the collection of covariates, and Y ∈ ℝ^n, the responses; Y = Xᵀθ* + η, with η satisfying E[η] = 0, E[ηηᵀ] = γ²I_n. The problem instance is defined by its parameters (X, θ*, γ²), treating X = [X₁, X₂, …, X_n] as fixed but Y as random. The training set (X, Y) is one occurrence of the random noise η ∈ ℝ^n. In this fixed design setup, the excess risk of an estimator θ̂ is defined using the standard Σ̂_n-norm, ‖v‖_{Σ̂_n} = ‖Σ̂_n^{1/2} v‖₂, as

$\mathrm{Excess\ Risk}(\hat{\theta}) := \mathbb{E}_\eta\big[\|\hat{\theta} - \theta^*\|^2_{\hat{\Sigma}_n}\big]$,   (2)

where Σ̂_n := (1/n) XXᵀ is the covariance matrix, and the expectation is over the randomness in η. Measuring the error in the Σ̂_n-norm ensures that the signal-to-noise ratio is uniform in all directions. The popular ridge estimator serves as a baseline, using a single hyperparameter λ > 0:

$\hat{\theta}(\lambda) := \arg\min_{\theta \in \mathbb{R}^d} \|Y - X^\top\theta\|^2 + \lambda\|\theta\|^2 = \big(XX^\top + \lambda I_d\big)^{-1} XY$.   (3)

We use Ω_λ := XXᵀ + λI_d throughout. We consider only λ > 0, but surprisingly, Kobak et al. [19] showed that the optimal penalty λ* (the one with the lowest risk) can indeed be negative. However, we will largely work in the non-overparameterized case (n > d), where λ* > 0 holds.

3.2 Self-distillation

Applying the self-distillation loss in Eq. (1) to linear regression with hyperparameters λ and ξ,

$\hat{\theta}(\lambda, \xi) := \arg\min_{\theta \in \mathbb{R}^d} \ \xi\,\|X^\top\hat{\theta}(\lambda) - X^\top\theta\|^2 + (1-\xi)\,\|Y - X^\top\theta\|^2 + \lambda\|\theta\|^2$   (4)

$= \big(XX^\top + \lambda I_d\big)^{-1} X \underbrace{\big(\xi\, X^\top\hat{\theta}(\lambda) + (1-\xi)\, Y\big)}_{\text{New label}}$   (5)

$= \underbrace{\big((1-\xi)\, I_d + \xi\, \Omega_\lambda^{-1} XX^\top\big)}_{\text{Pre-conditioner: function of }(\lambda,\xi)} \ \underbrace{\Omega_\lambda^{-1} XY}_{\text{Ridge }\hat{\theta}(\lambda)}$,   (6)

where ξ ∈ ℝ is not restricted to the conventional [0, 1] interval. This additional freedom is meaningful since it can result in a strictly better solution, as noted by Das and Sanghavi [9] both theoretically (Remark 3.6) and empirically (Table 1). It is worth noting that the optimization problem in Eq. (4) remains convex for any ξ ∈ ℝ, since its Hessian evaluates to ξ·2XXᵀ + (1−ξ)·2XXᵀ = 2XXᵀ ⪰ 0. On the other hand, the teacher and student use the same ridge penalty λ for simplicity. We call this estimator 1-step self-distillation. It can be interpreted as (i) assigning new labels that combine the ground-truth labels with the teacher's predictions, or (ii) pre-multiplying the usual ridge estimator with a pre-conditioner. Note that ξ = 0 recovers ridge. Das and Sanghavi [9, Theorem 3.8] show that under a certain condition, 1-step self-distillation strictly dominates ridge, i.e.,

$\min_{\lambda \geq 0,\ \xi \in \mathbb{R}} \ \mathbb{E}_\eta\big[\|\hat{\theta}(\lambda,\xi) - \theta^*\|_2^2\big] \ <\ \min_{\lambda \geq 0} \ \mathbb{E}_\eta\big[\|\hat{\theta}(\lambda) - \theta^*\|_2^2\big]$,   (7)

where the risk is measured in the non-standard Euclidean norm. The same strict inequality can be shown under the standard Σ̂_n-norm under a slightly modified condition stated in Proposition B.1. This naturally leads to a fundamental question: How much more gain can we get by repeatedly applying self-distillation?
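Before moving to repeated self-distillation, the closed forms above can be made concrete with a short numerical sketch. The snippet below is our own illustration (assuming numpy; it is not code from the paper, and all function names are ours): it builds the ridge map of Eq. (3) and the 1-step SD map of Eqs. (4)-(6) as matrices A with θ̂ = A·Y, and evaluates the fixed-design excess risk of Eq. (2) through its bias-variance decomposition.

```python
import numpy as np

def excess_risk(A, X, theta_star, gamma2):
    """Fixed-design excess risk E_eta[ ||A Y - theta*||^2_{Sigma_n} ] of the
    linear estimator theta_hat = A Y, where Y = X^T theta* + eta (cf. Eq. (2))."""
    n = X.shape[1]
    Sigma_n = X @ X.T / n
    bias = A @ X.T @ theta_star - theta_star           # E[theta_hat] - theta*
    bias_term = bias @ Sigma_n @ bias                   # squared bias in the Sigma_n-norm
    var_term = gamma2 * np.trace(A.T @ Sigma_n @ A)     # E ||A eta||^2_{Sigma_n}
    return bias_term + var_term

def ridge_map(X, lam):
    """A such that theta_hat(lambda) = A Y  (Eq. (3))."""
    d = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X)

def sd1_map(X, lam, xi):
    """A such that theta_hat(lambda, xi) = A Y  (Eqs. (4)-(6)):
    the pre-conditioner (1-xi) I + xi Omega^{-1} X X^T applied to the ridge map."""
    d = X.shape[0]
    Omega_inv = np.linalg.inv(X @ X.T + lam * np.eye(d))
    P = (1 - xi) * np.eye(d) + xi * Omega_inv @ X @ X.T
    return P @ Omega_inv @ X

# A toy fixed-design instance (X, theta*, gamma^2).
rng = np.random.default_rng(0)
d, n, gamma2 = 4, 200, 0.1 ** 2
X = rng.standard_normal((d, n))
theta_star = np.zeros(d)
theta_star[0] = 1.0

lam = 1.0
print("ridge     :", excess_risk(ridge_map(X, lam), X, theta_star, gamma2))
print("1-step SD :", excess_risk(sd1_map(X, lam, xi=2.0), X, theta_star, gamma2))
```

Sweeping λ and ξ over a grid with these helpers reproduces the qualitative message of Eq. (7): on suitable instances, some ξ ≠ 0 (possibly negative or larger than one) attains a strictly lower excess risk than the best ridge.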
Figure 1: The standard 1-step self-distillation defined in Eq. (1) with parameter ξ, and k-step self-distillation that repeatedly applies Eq. (1) with parameters ξ^(k) = [ξ^(k)_1, ξ^(k)_2, …, ξ^(k)_k] ∈ ℝ^k.

3.3 Repeated self-distillation

The standard repeated application of self-distillation starts with the teacher model, T (which we also refer to as the zeroth model, S_0), and applies self-distillation sequentially for k steps. At each step i, Eq. (1) is applied with the (i−1)-th model S_{i−1} as the teacher and the i-th model S_i as the student, with an imitation parameter ξ^(k)_i, i.e., θ̂ ∈ arg min_θ ξ^(k)_i ℓ(ŷ_{S_{i−1}}, y_{S_i}(θ)) + (1−ξ^(k)_i) ℓ(y, y_{S_i}(θ)) for i ∈ [k]. The collection of parameters is denoted by ξ^(k) = [ξ^(k)_1, ξ^(k)_2, …, ξ^(k)_k] ∈ ℝ^k. This repeated self-distillation has been studied, for example, theoretically in [27] and empirically in [10]. We aim to understand its gain under linear regression, where we prove that

$\hat{\theta}\big(\lambda,\ \xi^{(k)}\big) \;=\; \underbrace{\Big(\big(1 - \textstyle\sum_{i=1}^{k}\bar{\xi}^{(k)}_i\big)\, I_d \;+\; \textstyle\sum_{i=1}^{k}\bar{\xi}^{(k)}_i\,\big(\Omega_\lambda^{-1} XX^\top\big)^i\Big)}_{\text{Pre-conditioner: } P(\lambda,\, \bar{\xi}^{(k)})}\ \underbrace{\Omega_\lambda^{-1} XY}_{\text{Ridge }\hat{\theta}(\lambda)}$   (8)

with ξ̄^(k)_i := (1 − ξ^(k)_{k−i}) ∏_{l=k−i+1}^{k} ξ^(k)_l for each i ∈ [k], where we let ξ^(k)_0 = 0. The proof that repeated SD with ξ^(k) ∈ ℝ^k results in Eq. (8) is provided in Appendix C.2. Here ξ^(k) ∈ ℝ^k denotes the imitation parameters, λ denotes the ridge coefficient for all the models, and ξ̄^(k) ∈ ℝ^k is a reparametrization of ξ^(k) ∈ ℝ^k (details in Appendix C.2). We call this k-step self-distillation. Note the increasing flexibility in the pre-conditioner matrix. The increasing powers of Ω_λ^{-1}XXᵀ in the above expression are still numerically stable, since, for λ > 0, Ω_λ^{-1}XXᵀ is PSD with all eigenvalues in [0, 1]. As an aside, one can also consider a version of SD where the i-th model receives supervision from all previous models S_0, …, S_{i−1}; Appendix C.1 shows that this "full" version is representationally no larger than the repeated version.

Assumption 2.
1. The non-zero singular values of X are distinct: s_1 > s_2 > ⋯ > s_r > 0, where {s_j}_{j=1}^r denote the non-zero singular values of the input data matrix X, whose rank is denoted by r.
2. For a β ∈ [0, 1), there exists an index j ∈ [r] such that ⟨θ*, u_j⟩² ≥ (1−β)‖θ*‖², where {u_j}_{j=1}^d denote the eigenvectors of XXᵀ, u_1 being the leading one.

Assumption 2 is needed to show that r-step SD achieves a small excess risk in Eq. (9). The quantity β captures the similarity of the (unknown) θ* to the eigenbasis directions. The case of β = 0 entails perfect alignment of θ* with one of the u_j, j ∈ [r]. As we will see in the result in Theorem 1, the gains provided by multi-step SD are large when β is small. In general, both these conditions are somewhat necessary for the separation, as we show in Theorems 2 and 3. We now state our main result. We show that under the above assumption, there exists a family of problem instances, (X, θ*, γ²), such that the excess risk achieved by r-step SD is a factor of r := rank(X) smaller than that of the ridge estimator and the 1-step SD.

Theorem 1. Under the fixed design linear regression in Assumption 1, there exists a family of problem instances satisfying Assumption 2 such that for any instance (X, θ*, γ²) in the family, it holds that

$\exists\, \lambda > 0,\ \xi^{(r)} \in \mathbb{R}^r: \quad \mathrm{Excess\ Risk}\big(\hat{\theta}(\lambda, \xi^{(r)})\big) \ \leq\ \frac{\gamma^2}{n}\Big(1 + \frac{\beta\,\|\theta^*\|^2 s_1^2}{\gamma^2}\Big)$,   (9)

$\forall\, \lambda > 0,\ \xi \in \mathbb{R}: \quad \mathrm{Excess\ Risk}\big(\hat{\theta}(\lambda, \xi)\big) \ \geq\ \frac{0.99}{211}\,(1-\beta)\,\frac{r\gamma^2}{n}$, and   (10)

$\forall\, \lambda > 0: \quad \mathrm{Excess\ Risk}\big(\hat{\theta}(\lambda)\big) \ \geq\ 0.98\,\Big(\frac{1-\beta}{1-0.99\beta}\Big)^2\,\frac{r\gamma^2}{n}$,   (11)

where r := rank(X), n is the number of samples, θ̂(λ, ξ^(r)) and θ̂(λ, ξ) are the r-step and 1-step SD estimators defined in Eqs. (8) and (4) respectively, and θ̂(λ) is the ridge estimator defined in Eq. (3).

This theorem captures a general result for the gains of multi-step SD.
In particular, the special case of β = 0 (i.e., θ* being completely aligned with one of the eigenvectors u_j, j ∈ [r]) presents an Ω(r) multiplicative separation between the excess risk of r-step SD and {1, 0}-step SD. We provide precise conditions and a proof in Appendix E. Since each k-step SD includes (k−1)-step SD as a special case with the proper choice of ξ^(k), the hyperparameter-tuned excess risk of repeated SD is monotonically non-increasing. However, it is perhaps unexpected that the multiplicative separation between r-step SD and 1-step SD can be as large as Ω(r), demonstrating the gains of repeated SD. Figure 2 illustrates this Ω(r) multiplicative separation on a synthetic family of problems. Note that an Ω(d) separation can be achieved by choosing the problem instance to have rank r = d, at the cost of requiring many more steps of SD. This Ω(d) factor is the largest multiplicative separation possible with self-distillation, as shown by the fundamental lower bound in Theorem 4 for any pre-conditioning based approach. In general, if β = O(γ²/(‖θ*‖²s_1²)) (akin to the inverse signal-to-noise ratio), then there exists an Ω(r) multiplicative separation between r-step SD and {1, 0}-step SD. In our particular construction of the problem instance for Theorem 1, the additional technical condition (i.e., Condition #2 in the detailed theorem statement in Appendix E) effectively translates this to β = O(1/r).

Remark 4.1. SD significantly outperforms ridge by primarily reducing the variance. For the lower bound on ridge's excess risk, i.e., Eq. (11), we ignored the bias term and only used the variance term. The repeated SD (Eq. (9)) primarily reduces the variance to improve the excess risk over Eq. (11).

Figure 2: On a synthetic problem family with dimension d = 100, noise variance γ = 0.1, and θ* = u_1 (agreement with Asmp. 2.2), we set the singular values of X with a power law from s_1 = 1 to s_r ∈ {0.8, 0.5} (left and right panels) and vary r = rank(X). Both plots show a linear increase of the relative gain of r-step self-distillation in excess risk, i.e., the ratio A/B where A := min_{λ>0} Excess Risk(θ̂(λ)) and B := min_{λ>0, ξ^(r)∈ℝ^r} Excess Risk(θ̂(λ, ξ^(r))), demonstrating that r-step SD outperforms ridge by a factor of Ω(r), with the constant inside the Ω (i.e., the slope of the line) changing with the effective condition number s_1/s_r.

4.2 Necessity of Assumption 2

In Figure 3, we empirically show on synthetic tasks how violating Assumption 2.1 or 2.2 leads to higher excess risks, even for the r-step SD (r = 4 in the example). This supports the necessity of both assumptions, which we analytically investigate in the following.

Necessity of Assumption 2.1 on X. We assume that the non-zero singular values of X are distinct. This allows us to tightly upper bound the excess risk achieved by r-step SD in Eq. (9) via Theorem 4. We show in the following that some version of Assumption 2.1 is also necessary. For a more detailed explanation of why we need Assumption 2.1, we refer the reader to Remark 4.2.

Theorem 2. Under the hypotheses of Theorem 1 except for Assumption 2.1, if the singular values of X satisfy s_1 = ⋯ = s_r = 1, where r = rank(X), then for all k ≥ 1, λ > 0, and ξ^(k) ∈ ℝ^k, we have

$\mathrm{Excess\ Risk}\big(\hat{\theta}(\lambda, \xi^{(k)})\big) \ \geq\ \frac{r\gamma^2}{n}\cdot\frac{1}{1 + \frac{r\gamma^2}{\sum_{j=1}^{r}\langle\theta^*, u_j\rangle^2}}$.   (12)

Furthermore, there exists λ > 0 such that the ridge, θ̂(λ), achieves this lower bound with equality.

We provide a proof in Appendix F. This implies that when there is no gap in the singular values of the input X, there is no separation between ridge and SD estimators (repeated or not).
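As a numerical companion to Theorem 2 (and to the Ω(r) gain illustrated in Figure 2), one can instantiate the k-step SD estimator of Eq. (8) and minimize its excess risk over ξ̄^(k) directly. The sketch below is our own illustration (assuming numpy and scipy; function names are ours, not the paper's code): it compares the best ridge against the best 4-step SD on two rank-4 fixed designs, one with distinct singular values and one with all singular values equal. In the first case the optimized 4-step SD attains a noticeably lower excess risk; in the second, consistent with Theorem 2, it offers no improvement over the best ridge.

```python
import numpy as np
from scipy.optimize import minimize

def excess_risk(A, X, theta_star, gamma2):
    """Fixed-design excess risk of the linear estimator theta_hat = A Y (cf. Eq. (2))."""
    Sigma_n = X @ X.T / X.shape[1]
    bias = A @ X.T @ theta_star - theta_star
    return bias @ Sigma_n @ bias + gamma2 * np.trace(A.T @ Sigma_n @ A)

def ridge_map(X, lam):
    """theta_hat(lambda) = A Y, Eq. (3)."""
    return np.linalg.solve(X @ X.T + lam * np.eye(X.shape[0]), X)

def kstep_sd_map(X, lam, xi_bar):
    """k-step SD estimator of Eq. (8): pre-conditioner P(lambda, xi_bar) applied to ridge."""
    d = X.shape[0]
    Omega_inv = np.linalg.inv(X @ X.T + lam * np.eye(d))
    B = Omega_inv @ X @ X.T
    P, B_pow = (1 - np.sum(xi_bar)) * np.eye(d), np.eye(d)
    for xi in xi_bar:                       # accumulate xi_bar_i * B^i for i = 1..k
        B_pow = B_pow @ B
        P = P + xi * B_pow
    return P @ Omega_inv @ X

def best_kstep_risk(X, theta_star, gamma2, lam, k):
    """Minimize the excess risk over xi_bar in R^k at a fixed lambda."""
    obj = lambda xb: excess_risk(kstep_sd_map(X, lam, xb), X, theta_star, gamma2)
    return minimize(obj, np.zeros(k), method="BFGS").fun

def make_design(sing_vals, n, seed=0):
    """Random fixed design X in R^{d x n} with prescribed non-zero singular values."""
    rng = np.random.default_rng(seed)
    d = r = len(sing_vals)
    U, _ = np.linalg.qr(rng.standard_normal((d, d)))
    V, _ = np.linalg.qr(rng.standard_normal((n, r)))
    return (U[:, :r] * np.array(sing_vals)) @ V.T, U

gamma2, n, lams = 0.1 ** 2, 100, np.logspace(-3, 1, 25)
for s in ([1.0, 0.5, 1 / 3, 0.25], [1.0, 1.0, 1.0, 1.0]):   # distinct vs. equal s_j
    X, U = make_design(s, n)
    theta_star = U[:, 0]                                    # theta* = u_1, i.e. beta = 0
    ridge_best = min(excess_risk(ridge_map(X, l), X, theta_star, gamma2) for l in lams)
    sd_best = min(best_kstep_risk(X, theta_star, gamma2, l, k=4) for l in lams)
    print(f"s = {s}: best ridge = {ridge_best:.2e}, best 4-step SD = {sd_best:.2e}")
```

Minimizing over the reparametrized ξ̄ rather than the original ξ is convenient because the excess risk is a quadratic function of ξ̄ (Theorem 5 in Section 4.4), so any local minimizer is global.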
Intuitively, if the sj are all equal, the pre-conditioner for ridge (i.e., Ω 1 λ ) and the pre-conditioner for the repeated SD, both are restricted to have all eigenvalues to be equal. (Repeated) SD has no degrees of freedom to deviate from this. However, sj s being unequal provides the freedom for the ξ(k) to control the SD s pre-conditioner matrix, and reduce the excess risk. This is also why in Remark 4.2, we hypothesize that numerically, the optimal ξ(k) depends inversely on the min-gap of the singular values. Figure 4 demonstrates this increasing relationship of the magnitude of the optimal ξ parameters w.r.to the decreasing singular gap. Necessity of Assumption 2.2 on θ . Theorem 1 shows that there is a large separation in the performance of repeated SD over ridge when Assumption 2.2 holds with a small β. In general, this translates to θ being highly aligned with any one of the eigenvectors of XX (not necessarily the leading eigenvector). We show next that if θ is equally (mis)aligned with all the eigenvectors {uj}r j=1 of XX , then again there is no separation between ridge and repeated SD. (a) s := [1, 1/2, 1/3, 1/4], θ = u1 (b) s := [1, 1, 1/2, 1/3] (c) θ := 0.5(u1 + u2 + u3 + u4) Figure 3: On a synthetic task (explained in Section 5.1), X has rank 4 with (a) θ = u1 and distinct sj s; (b) s = [1, 1, 1/2, 1/3]; (c) θ = 0.5(u1 + u2 + u3 + u4). Each additional step of SD with optimal choice of ξ(k) reduces Excess Risk(ˆθ(λ, (ξ(k)) )) for any choice of λ on the x-axis. Panel (a) satisfies Asmp. 2 and hence 4-step SD is necessary to achieve the optimal excess risk. This is no longer true when Asmp. 2.1 is violated (b) or Asmp. 2.2 is violated (c). Excess risk achieved by 4-step SD (i.e. the green lines) in panels (a) and (c) exactly match the numerical value given by RHS of eq. (14), i.e. the fundamental lower bound for any SD estimator. But this is not the case in panel (b) [which has the same lower bound from eq. (14) as panel (a)], because Asmp. 2.1 is violated. Figure 4: On the synthetic problem from Figure 3a, we fix λ = 0.125 and set the singular values of X as sj = {1 (j 1)ϵ}, i.e. consecutive values are separated by ϵ. For k-step SD with k = {1, 2, 3}, we plot (ξ(k)) (λ) (i.e. optimal values of the ξ parameters) by varying ϵ {0.2, 0.1, 0.05, 0.02, 0.01}. The magnitude of ξ(k) k values increases as the singular gap ϵ decreases, verifying Remark 4.2. Theorem 3. Under the hypotheses of Theorem 1 except for Assumption 2.2, if the true parameter θ satisfies θ , uj 2 = z for all j [r], it holds that for all z > 0, k 1, λ > 0, and ξ(k) Rk, Excess Risk ˆθ(λ, ξ(k)) γ2 Furthermore, there exists λ > 0 such that the ridge, ˆθ(λ), achieves this lower bound with equality. We provide a proof in Appendix G. Similar conditions are needed when analyzing 1-step SD in [9] as well; [9, Eq. (9) in Theorem 3.8] is required for the 1-step SD to strictly outperform ridge. Observe that Eq. (9) is violated when either (i) sj s are all equal or (ii) θ , uj 2 s are all equal. 4.3 r steps of self-distillation are sufficient For a given problem instance (X, θ , γ2), the excess risk achieved by the k-step SD with parameter ξ(k) can be exactly characterized (Theorem 5), but it is complicated and can only be evaluated numerically in general. On the other hand, we show that there exists a fundamental lower bound that holds for a linear family of estimators including all repeated SD, and this lower bound has a simple characterization (Lemma 4.1). 
Furthermore, we show that under a mild assumption on the eigenvalues of XXᵀ (Assumption 2.1), the r-step SD achieves the lower bound (Theorem 4). This allows a precise characterization of the performance of r-step SD.

Theorem 4. Under the fixed design linear regression in Assumption 1, the excess risk of any k-step SD estimator on an instance (X, θ*, γ²) is lower bounded, for all k ≥ 1, λ > 0, and ξ^(k) ∈ ℝ^k, by

$\mathrm{Excess\ Risk}\big(\hat{\theta}(\lambda, \xi^{(k)})\big) \ \geq\ \frac{\gamma^2}{n}\sum_{j=1}^{r}\frac{\langle\theta^*, u_j\rangle^2\, s_j^2}{\langle\theta^*, u_j\rangle^2\, s_j^2 + \gamma^2}$,   (14)

where (s_j, u_j) are the j-th singular value and left singular vector of X and r := rank(X). Furthermore, if Assumption 2.1 holds, then there exist λ > 0 and ξ^(r) ∈ ℝ^r such that the equality is achieved by the r-step SD estimator θ̂(λ, ξ^(r)).

Proof sketch. (Proof in Appendix H). The lower bound in Eq. (14) is an instantiation of Lemma 4.1, since θ̂(λ, ξ^(k)) is a specific member of the linear family, with P = P(λ, ξ̄^(k)) Ω_λ^{-1} defined in Eq. (8). To show achievability, we need to show that P(λ, ξ̄^(k)) Ω_λ^{-1} = P* for some value of k, λ, and ξ̄^(k). This holds when the below system of r linear equations admits a solution for the k parameters (i.e., ξ̄^(k)), with an extra free parameter λ > 0. We show that with k = r and under Assumption 2.1, there exists λ > 0 that ensures the existence of a solution for this system of equations:

$\Big(1 - \sum_{i=1}^{k}\bar{\xi}^{(k)}_i + \sum_{i=1}^{k}\bar{\xi}^{(k)}_i\Big(\frac{s_j^2}{\lambda + s_j^2}\Big)^{i}\Big)\,\frac{s_j^2}{\lambda + s_j^2} \ =\ \frac{\langle\theta^*, u_j\rangle^2}{\langle\theta^*, u_j\rangle^2 + \gamma^2/s_j^2}, \qquad j \in [r].$   (15)

Remark 4.2 (Necessity of Assumption 2.1). This assumption is required for (15). Otherwise, the LHS would be the same for indices j and j+1 if s_j = s_{j+1}, but the RHS could still be different, since ⟨θ*, u_j⟩ ≠ ⟨θ*, u_{j+1}⟩ in general. If Assumption 2.1 does not hold, there might not be any ξ̄^(k) satisfying the set of equations for a general θ* ∈ ℝ^d. Further, the system of linear equations in Eq. (15) becomes more ill-conditioned as the singular values s_j, j ∈ [r], get closer to each other. Capturing this dependence explicitly is outside the scope of this paper.

Lower bound for a linear family. Consider a linear family of estimators of the form θ̂(P) := P X Y, for P := U_d S̄ U_dᵀ, whose eigenspace coincides with that of XXᵀ (i.e., U_d = [u_1, …, u_d]) and which has d degrees of freedom represented by the eigenvalues S̄ = diag[s̄_1, …, s̄_d]. This is a generic form of any linear estimator, albeit with the restriction of the eigenvectors matching the underlying U_d. In particular, k-step SD is an instantiation of this with P = P(λ, ξ̄^(k)) Ω_λ^{-1} (refer to Eq. (8)).

Lemma 4.1. The Excess Risk of θ̂(P) = P X Y, where P := U_d S̄ U_dᵀ, satisfies

$\mathrm{Excess\ Risk}\big(\hat{\theta}(P)\big) \ \geq\ \frac{\gamma^2}{n}\sum_{j=1}^{r}\frac{\langle\theta^*, u_j\rangle^2\, s_j^2}{\langle\theta^*, u_j\rangle^2\, s_j^2 + \gamma^2}$,   (16)

with equality achieved at P = P* = U_d S̄* U_dᵀ, given by

$\bar{s}^*_j = \begin{cases} \dfrac{\langle\theta^*, u_j\rangle^2}{\langle\theta^*, u_j\rangle^2\, s_j^2 + \gamma^2}, & j \leq r \ \text{(i.e., } s_j > 0\text{)} \\ \text{any real value}, & j \geq r+1 \ \text{(i.e., } s_j = 0\text{)}. \end{cases}$   (17)

Proof sketch. (Proof in Appendix H.1). One can expand the excess risk for θ̂(P), which is a quadratic expression in s̄_j, j ∈ [d]. Completing the squares gives the result immediately.

4.4 The excess risk of the k-step SD estimator is quadratic in ξ̄^(k) ∈ ℝ^k

We give an explicit formula for the excess risk achieved by the k-step SD estimator from Eq. (8). Since θ̂(λ, ξ^(k)) is linear in each of the ξ̄^(k)_i, i ∈ [k] (recall that ξ̄^(k) is a reparametrization of ξ^(k)), the overall excess risk is quadratic in ξ̄^(k), as shown below. Appendix I provides a proof and the expressions for M^(k), m^(k), and c. This quadratic form will be especially useful in experiments.

Theorem 5 (Informal version of Theorem 7 in Appendix I).
Under the fixed design linear regression in Assumption 1, the excess risk achieved by the k-step SD is quadratic in ξ̄^(k) ∈ ℝ^k:

$\mathrm{Excess\ Risk}\big(\hat{\theta}(\lambda, \xi^{(k)})\big) \;=\; \bar{\xi}^{(k)\top}\underbrace{M^{(k)}}_{\in\,\mathbb{R}^{k\times k}}\bar{\xi}^{(k)} \;+\; 2\,\bar{\xi}^{(k)\top}\underbrace{m^{(k)}}_{\in\,\mathbb{R}^{k}} \;+\; c\,.$   (18)

From the detailed expressions given in Appendix I, we note that M^(k) is a sum of r symmetric rank-1 matrices, which means it can have a maximum rank of r. This implies that M^(k) ∈ ℝ^{k×k} for k > r is rank-deficient (causing no additional decrease in the excess risk if the ξ^(r) ∈ ℝ^r were chosen optimally to minimize the excess risk). This indicates that r steps of SD might be sufficient to achieve the optimal excess risk, which is indeed what we observe in Theorem 4.

5 Experiments

In this section, we empirically show that multi-step SD can outperform the ridge and single-step SD. We first present a synthetic setting (Section 5.1) to validate our theory. In Section 5.2, we discuss a strategy to select the ξ parameters based on the theoretical insight from Section 4.4. In Section 5.3, we implement that strategy on real-world regression tasks and show that it can indeed select performant ξ values that provide multi-step SD estimators achieving a smaller test risk.

5.1 Synthetic Experiments

We validate our theoretical results with a fixed design synthetic experiment. We consider a problem with d = r = 4, and set the problem parameters (X, θ*, γ²). Namely, X's singular values are set as s_j := 1/j for j ∈ [4], and θ* := u_1 as in Theorem 1. Figure 3 shows the result for γ = 0.125, along with two more settings that validate the necessity of our assumptions (validating Theorems 2 and 3). Figure 3a confirms that repeated steps of SD do provide a reduction in the excess risk, since the lowest point of the curve for each k reduces as k increases. Also note that the optimal λ for each k (the one that produces the lowest excess risk estimator) is different. Appendix J presents some more settings, including θ* := (1/√2)(u_1 + u_2) for comparison with [9]. An interesting phenomenon in Figure 3 is that local maxima in the k-step SD's curve coincide with local minima in the (k−1)-step SD's curve, which was proven for k = 1 in [9], and which we observe empirically for all values of k.

Explanation of Figures 3, 5. For the fixed design synthetic experiment in Figure 3 and the random design real-world experiment in Figure 5 (Section 5.3), the curves plotted are with the optimal (ξ^(k))* for each λ. Hence, the curve of k-step SD will point-wise be lower than or equal to the curve of (k−1)-step SD, since more steps of SD only provide more freedom. We say k-step SD strictly dominates (k−1)-step SD when the minimum value of k-step SD's excess risk is strictly lower than that of (k−1)-step SD. For Figure 3, the optimal (ξ^(k))* is found analytically from the problem parameters. For real-world datasets in Figure 5, we use the strategy in Section 5.2 to find (ξ^(k))*.

5.2 Choosing the hyperparameters ξ for real-world datasets

We have shown that at the cost of introducing additional hyperparameters and setting them to their optimal values, one can extract a large (up to Ω(d)) performance gain. However, how does one select these ξ's for real-world datasets? The standard practice is to use a validation set and perform a grid search. But this becomes infeasible for k-step SD with larger values of k, since a search over parameters in ℝ^{k+1} (i.e., k values of ξ^(k) and one value of λ) quickly becomes impractical.
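The next paragraph describes how the quadratic structure of Theorem 5 replaces this grid search; as a rough preview, here is our own minimal sketch (assuming numpy; not the paper's code) of the underlying idea: the coefficients of a quadratic in ξ̄^(k) ∈ ℝ^k can be recovered from k(k+3)/2 + 1 evaluations, and the minimizer is then obtained in closed form. Here `val_mse` is a hypothetical callable, assumed to train the k-step SD estimator for a given ξ̄ (at a fixed λ) and return its validation MSE.

```python
import numpy as np
from itertools import combinations_with_replacement

def fit_quadratic(points, values, k):
    """Recover (M, m, c) of f(x) = x^T M x + 2 x^T m + c (cf. Eq. (18)) from evaluations."""
    quad_idx = list(combinations_with_replacement(range(k), 2))   # entries M_ij with i <= j
    rows = []
    for x in points:
        quad = [x[i] * x[j] * (1.0 if i == j else 2.0) for i, j in quad_idx]
        rows.append(quad + list(2.0 * np.asarray(x)) + [1.0])
    coef, *_ = np.linalg.lstsq(np.array(rows), np.array(values), rcond=None)
    M = np.zeros((k, k))
    for (i, j), v in zip(quad_idx, coef[: len(quad_idx)]):
        M[i, j] = M[j, i] = v
    m = coef[len(quad_idx): len(quad_idx) + k]
    c = coef[-1]
    return M, m, c

def argmin_quadratic(M, m):
    """Minimizer of x^T M x + 2 x^T m + c when M is positive definite: x* = -M^{-1} m."""
    return -np.linalg.solve(M, m)

def select_xi_bar(val_mse, k, scale=1.0, seed=0):
    """Estimate the quadratic from k(k+3)/2 + 1 validation evaluations and minimize it."""
    rng = np.random.default_rng(seed)
    n_evals = k * (k + 3) // 2 + 1
    points = [scale * rng.standard_normal(k) for _ in range(n_evals)]
    M, m, _ = fit_quadratic(points, [val_mse(p) for p in points], k)
    return argmin_quadratic(M, m)
```

The recovered ξ̄* can then be mapped back to the original imitation parameters ξ^(k) via the reparametrization in Appendix C.3; the choice of random evaluation points above is ours, and any k(k+3)/2 + 1 points in general position suffice.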
However, our theoretical analysis provides an insight that can be used to directly compute the optimal ξ(k) Rk (for a chosen λ) given a few evaluations on the validation set with certain chosen ξ(k) values. Namely, Theorem 5 tells us that the Excess Risk is quadratic in ξ(k) (the reparameterized version). Now the coefficients of the quadratic depend on unknown quantities (like θ , γ2), however just knowing the quadratic nature of the functional, we can use the validation set to estimate these coefficients. To estimate the coefficients for k-step SD, we need k(k+3)/2 + 1 evaluations on the validation set. Appendix K provides more discussion, and a detailed illustration of the above process for k = 1, 2. Note that this is feasible when the cost/time needed for a single training run of a k-step SD is small (since we need to perform it O(k2) times), which holds true for linear regression. 5.3 Real-world regression experiments We implement multi-step SD for real-world regression tasks from the UCI repository [18], and demonstrate that 2-step SD can outperform ridge and 1-step SD. Note that for this section, the test set will contain fresh samples of X Rd, i.e. random design linear regression instead of fixed design. Our metric for an estimator s performance will now be mean squared error (MSE) on a test set of unseen examples, which is the empirical version of total risk (i.e. excess risk translated by the unknown noise variance γ2). Using the training and validation splits, we compute (i) Optimal ridge: ˆθ(λ 0), (ii) Optimal 1-step SD: ˆθ(λ 1, ξ ), and (iii) Optimal 2-step SD: ˆθ λ 2, (ξ 1, ξ 2) . The procedure is to plot the MSE on the validation set for a grid of λ values for all three estimators, and choose the λ that achieves the lowest error for each one (for any given λ, the optimal ξ (λ) is chosen by the strategy described in Section 5.2). Finally, we evaluate the MSE of all three selected estimators (i.e. with the chosen optimal hyperparameters) on the test set, which serves as our performance metric (refer to Table 1). Appendix L explains the overall methodology in greater detail. We apply this methodology on three datasets (dataset descriptions in Appendix L.1). Table 1 describes the results we observe. For two of the three datasets, 2-step SD can outperform both ridge and 1-step SD. For the Air Quality dataset, 2-step SD significantly outperforms both ridge and 1-step SD, reducing the MSE by 47.2% compared to the optimal ridge. In contrast, for the AEP dataset, we observe that the SD process cannot improve upon the ridge at all. The MSE curves in Figure 5 also shed light on these observations. Notice how Figures 5a, 5b show a gap in the ridge and 2-step SD (similar to Figure 3a), whereas Figure 5c shows no such gap (similar to Figure 3c). In Appendix L.2, we explain the lack of gains from self-distillation on the AEP dataset from the lens of Assumption 2.2. Recall that Theorem 1 says that the gains of repeated SD are most prominent with a small β in Assumption 2.2. Figure 10 shows that the θ for the AEP dataset is at most 35% aligned with any of the eigenbasis directions, corresponding to a β 0.65. The same alignment for the other two datasets is 80%, corresponding to a much smaller β 0.2. In Appendix L.3, we also verify that the strategy described in section 5.2 indeed selects performant ξ values. Table 1: Chosen hyperparameter values and the achieved test set MSE for ridge and 1, 2-step SD. 
| Dataset | | Optimal ridge | Optimal 1-step SD | Optimal 2-step SD |
|---|---|---|---|---|
| Air Quality | Optimal hyperparameters | λ*_0 = 10^2 | (λ*_1, ξ*) = (10^3, 4.1) | (λ*_2, (ξ*_1, ξ*_2)) = (10^3, (0.9, 16.2)) |
| Air Quality | Test set MSE | 2.01 | 1.99 | 1.06 |
| Airfoil | Optimal hyperparameters | λ*_0 = 10^2 | (λ*_1, ξ*) = (10^0, 66.5) | (λ*_2, (ξ*_1, ξ*_2)) = (10^3, (1.8, 7.8)) |
| Airfoil | Test set MSE | 1.34 | 1.22 | 1.19 |
| AEP | Optimal hyperparameters | λ*_0 = 10^2.5 | (λ*_1, ξ*) = (10^2.5, 0.1) | (λ*_2, (ξ*_1, ξ*_2)) = (10^2.5, (2.4, 2.3)) |
| AEP | Test set MSE | 0.62 | 0.62 | 0.63 |

Figure 5: Validation set MSE vs. λ for three estimators (Ridge, 1-step SD, and 2-step SD) on (a) the Air Quality dataset, (b) the Airfoil dataset, and (c) the AEP dataset.

6 Conclusion and Broader Impacts

In this paper, we theoretically studied multi-step self-distillation for fixed design linear regression, with the goal of characterizing its performance compared to single-step SD. Perhaps surprisingly, we demonstrated that the optimal multi-step SD can outperform the optimal single-step SD by a factor as large as d in the estimator's excess risk, where d is the input dimension of the regression. Our analysis is limited by the fixed design assumption, and it would be useful to study the case of random design linear regression as well. We empirically demonstrated the gains from using 2-step SD on simple linear regression tasks. Larger scale empirical studies of multi-step SD, especially leveraging the insights of Section 5.2 on hyperparameter search, remain as a direction of future work.

Our contributions are largely on the theoretical understanding of multi-step self-distillation and its potential performance gains. At a high level, self-distillation can use data more effectively, since it allows us to extract more knowledge from the same training dataset. In today's age, with data being one of the most important resources, this has positive potential impacts through more judicious use of data. On the other hand, we propose to use multiple steps of self-distillation, requiring more compute and potentially contributing to higher environmental costs.

Acknowledgements

We thank Raghav Somani for numerous insightful discussions at various stages of this project. This work is supported by NSF grants no. 2019844, 2112471, 2229876, 2134106, 2143493, 2134012, and 2229881; and a Microsoft Grant for Customer Experience Innovation.

References

[1] S. Ahn, S. X. Hu, A. Damianou, N. D. Lawrence, and Z. Dai. Variational information distillation for knowledge transfer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9163-9171, 2019.
[2] Z. Allen-Zhu and Y. Li. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. In The Eleventh International Conference on Learning Representations, 2023.
[3] J. Ba and R. Caruana. Do deep nets really need to be deep? Advances in Neural Information Processing Systems, 27, 2014.
[4] K. Borup and L. N. Andersen. Even your teacher needs guidance: Ground-truth targets dampen regularization imposed by self-distillation. Advances in Neural Information Processing Systems, 34:5316-5327, 2021.
[5] T. Brooks, D. Pope, and M. Marcolini. Airfoil Self-Noise. UCI Machine Learning Repository, 2014. DOI: https://doi.org/10.24432/C5VW2C.
[6] L. Candanedo. Appliances Energy Prediction. UCI Machine Learning Repository, 2017. DOI: https://doi.org/10.24432/C5VC8G.
[7] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650-9660, 2021.
[8] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker. Learning efficient object detection models with knowledge distillation. Advances in neural information processing systems, 30, 2017. [9] R. Das and S. Sanghavi. Understanding self-distillation in the presence of label noise. In Proceedings of the 40th International Conference on Machine Learning, pages 7102 7140. PMLR, 2023. [10] T. Furlanello, Z. Lipton, M. Tschannen, L. Itti, and A. Anandkumar. Born again neural networks. In International Conference on Machine Learning, pages 1607 1616. PMLR, 2018. [11] J. Gou, B. Yu, S. J. Maybank, and D. Tao. Knowledge distillation: A survey. International Journal of Computer Vision, 129(6):1789 1819, 2021. [12] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015. URL http://arxiv.org/ abs/1503.02531. [13] C. Hu, X. Li, D. Liu, X. Chen, J. Wang, and X. Liu. Teacher-student architecture for knowledge learning: A survey. ar Xiv preprint ar Xiv:2210.17332, 2022. [14] Z. Huang and N. Wang. Like what you like: Knowledge distill via neuron selectivity transfer. ar Xiv preprint ar Xiv:1707.01219, 2017. [15] V. Jankovic. Quadratic functions in several variables. http://elib.mi.sanu.ac.rs/files/ journals/tm/15/tm821.pdf. [16] H. Jeong and H. W. Chung. Understanding self-distillation and partial label learning in multiclass classification with label noise. ar Xiv preprint ar Xiv:2402.10482, 2024. [17] R. Jha, J. Hayase, and S. Oh. Label poisoning is all you need. Advances in Neural Information Processing Systems, 36:71029 71052, 2023. [18] M. Kelly, R. Longjohn, and K. Nottingham. The uci machine learning repository. https: //archive.ics.uci.edu. [19] D. Kobak, J. Lomond, and B. Sanchez. The optimal ridge penalty for real-world highdimensional data can be zero or negative due to the implicit ridge regularization. Journal of Machine Learning Research, 21(169):1 16, 2020. [20] Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and L.-J. Li. Learning from noisy labels with distillation. In Proceedings of the IEEE international conference on computer vision, pages 1910 1918, 2017. [21] Y. Li, X. Lyu, N. Koren, L. Lyu, B. Li, and X. Ma. Neural attention distillation: Erasing backdoor triggers from deep neural networks, 2021. [22] Z. Lin, Z. Gou, Y. Gong, X. Liu, Y. Shen, R. Xu, C. Lin, Y. Yang, J. Jiao, N. Duan, et al. Rho-1: Not all tokens are what you need. ar Xiv preprint ar Xiv:2404.07965, 2024. [23] X. Liu, X. Wang, and S. Matwin. Improving the interpretability of deep neural networks with knowledge distillation. In 2018 IEEE International Conference on Data Mining Workshops (ICDMW), pages 905 912. IEEE, 2018. [24] Y. Liu, K. Chen, C. Liu, Z. Qin, Z. Luo, and J. Wang. Structured knowledge distillation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2604 2613, 2019. [25] D. Lopez-Paz, L. Bottou, B. Schölkopf, and V. Vapnik. Unifying distillation and privileged information. ar Xiv preprint ar Xiv:1511.03643, 2015. [26] A. K. Menon, A. S. Rawat, S. Reddi, S. Kim, and S. Kumar. A statistical perspective on distillation. In International Conference on Machine Learning, pages 7632 7642. PMLR, 2021. [27] H. Mobahi, M. Farajtabar, and P. Bartlett. Self-distillation amplifies regularization in hilbert space. Advances in Neural Information Processing Systems, pages 3351 3361, 2020. [28] N. Papernot, P. Mc Daniel, X. Wu, S. Jha, and A. Swami. 
Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE symposium on security and privacy (SP), pages 582 597. IEEE, 2016. [29] M. Phuong and C. Lampert. Towards understanding knowledge distillation. In International conference on machine learning, pages 5142 5151. PMLR, 2019. [30] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets; 2014. ar Xiv preprint ar Xiv:1412.6550, 3, 2014. [31] A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell. Policy distillation. In International Conference on Learning Representations, 2016. [32] S. Sun, Y. Cheng, Z. Gan, and J. Liu. Patient knowledge distillation for bert model compression. ar Xiv preprint ar Xiv:1908.09355, 2019. [33] G. Urban, K. J. Geras, S. E. Kahou, O. Aslan, S. Wang, A. Mohamed, M. Philipose, M. Richardson, and R. Caruana. Do deep convolutional nets really need to be deep and convolutional? In International Conference on Learning Representations, 2017. [34] S. Vito. Air Quality. UCI Machine Learning Repository, 2016. DOI: https://doi.org/10.24432/C59K5F. [35] J. Xia, T. Wang, J. Ding, X. Wei, and M. Chen. Eliminating backdoor triggers for deep neural networks using attention relation graph distillation, 2022. [36] C. Yang, L. Xie, S. Qiao, and A. L. Yuille. Training deep neural networks in generations: A more tolerant teacher educates better students. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5628 5635, 2019. [37] C. Yang, L. Xie, C. Su, and A. L. Yuille. Snapshot distillation: Teacher-student optimization in one generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2859 2868, 2019. [38] J. Yim, D. Joo, J. Bae, and J. Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4133 4141, 2017. [39] K. Yoshida and T. Fujino. Countermeasure against backdoor attack on neural networks utilizing knowledge distillation. Journal of Signal Processing, 2020. [40] S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In International Conference on Learning Representations, 2017. [41] L. Zhang, J. Song, A. Gao, J. Chen, C. Bao, and K. Ma. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3713 3722, 2019. [42] L. Zhang, C. Bao, and K. Ma. Self-distillation: Towards efficient and compact neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4388 4403, 2021. [43] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu. Deep mutual learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4320 4328, 2018. [44] Z. Zhang and M. Sabuncu. Self-distillation as instance-specific label smoothing. Advances in Neural Information Processing Systems, 33:2184 2195, 2020. [45] B. Zhu, M. I. Jordan, and J. Jiao. Iterative data smoothing: Mitigating reward overfitting and overoptimization in rlhf. ar Xiv preprint ar Xiv:2401.16335, 2024. In this short section, we collect some notation used throughout the proofs. Decomposition of X. 
Let rank(X) = r and the SVD of X be X = Pr j=1 sjujv T j where s1 s2 sr > 0, and each uj Rd and vj Rn. Further, let {u1, u2, , ud} be the full set of left singular vectors of X (even those corresponding to zero singular values), forming an orthonormal basis of Rd. Let Ud Rd d and Ur Rd r denote the left singular matrix of X for the full and truncated set of left singular vectors respectively. Similarly, let Vd Rn d and Vr Rn r denote the full and truncated right singular matrix. Let Sd Rd d and Sr Rr r denote the collection of the singular values (with and without zeros respectively). Then, it holds that X = Ur Sr V r = Ud Sd V d . (19) Indices. Throughout the text, the indices i lie in [k], i.e. they denote subsequent steps of selfdistillation. The indices j lie in [r] or [d], ie they denote dimensions of the d-dimensional space (aligned with vectors of Ur or Ud). vj, Vi,j will denote indexing into a vector v, matrix V . There is one exception to vj denoting a vector s jth element, which is the below. Components of θ on Ud. We will denote θ j := θ , uj 2, j [d] as the components of θ onto X s left singular space. Note that Pd j=1 θ j = θ 2 2. B Discussion on norm used in excess risk metric We point out that Das and Sanghavi [9] used the . 2 norm instead of the more natural . ˆΣn norm to measure their fixed design excess risk (in eq (2)). Almost all our results have an equivalent version in the . 2 norm setting also, since the only difference is in the relative weighing of the underlying dimensions of variation, i.e. with the . ˆΣn norm, j [d], direction uj is weighed by s2 j/n instead of a constant 1 weight (independent of j). In particular, we also present the version of [9, Eq. (9) from Theorem 3.8] that will result in a strict dominance like Eq. (7) under the ˆΣn norm, i.e. min λ 0,ξ R Eη h ˆθ(λ, ξ) θ 2 ˆΣn =Excess Risk(ˆθ(λ,ξ)) < min λ 0 Eη h ˆθ(λ) θ 2 ˆΣn =Excess Risk(ˆθ(λ)) Proposition B.1. Let λ := argminλ>0Excess Risk(ˆθ(λ)). Then Eq. (20) holds on a problem instance (X, θ , γ2) when s4 js4 k s2 j s2 k θ , uk 2 θ , uj 2 λ + s2 j 4 (λ + s2 k)4 < 0 . (21) Note that this differs from [9, Eq. (9)] in just one respect: it has s4 js4 k instead of s2 js2 k. C Details on ξ parameters for general k-step SD C.1 Full k-step SD is representationally no larger than Repeated k-step SD We first illustrate the Full k-step SD in Figure 6. The repeated version introduces k extra hyperparameters in the form of ξ(k) Rk parameters, whereas the full version introduces k(k+1)/2. S0 S1 S2 Sk Repeated k-step self-distillation S0 S1 S2 Sk Full k-step self-distillation Figure 6: Illustrating two possible generalizations of 1-step SD to a k-step process. Consider the case of k = 2, since that is the lowest value of k for which the full version and the repeated version differ. Figure 7 illustrates this difference explicitly. We will show that the freedom of ξ R3 is no more than the freedom of ξ R2. This shows the equivalence of the full 2-step and the repeated 2-step versions, when ξ R3, ξ R2 are free parameters optimized over the entire respective spaces. Such equivalence for the general k-step case is then easy to see. Nodes S0, S0 are both solving the ridge regression problem (3). Let θ0 = θ0 = Ω 1 λ XY denote the estimator for both these nodes. Similarly S1 and S1 are solving the same problem, although with different parameters. We have θ1(ξ1) = Ω 1 λ X ξ1 X θ0 + (1 ξ1) Y (and similarly, one can write θ1 with ξ1 instead of ξ1). 
Now the node S2 is also solving a 1 parameter supervised SD problem, so θ2(ξ1, ξ2) = Ω 1 λ X ξ2 X θ1(ξ1) + (1 ξ2) Y . Expanding this gives θ2(ξ1, ξ2) = (1 ξ2) Id + (ξ2 ξ1ξ2) Ω 1 λ XX + ξ1ξ2 (Ω 1 λ XX )2 Ω 1 λ XY . (22) But the optimization problem for S2 is a 2 parameter supervised SD. It evaluates to 2 X θ0 X θ 2 + ξ2b 2 X θ1 X θ 2 + (1 ξ2a ξ2b) 2 Y X θ 2 + λ Following through a similar calculation, we observe that θ2 for node S2 is given by θ2( ξ1, ξ2a, ξ2b) = n (1 ξ2a ξ2b) Id + ( ξ2a + ξ2b ξ1 ξ2b) Ω 1 λ XX + ξ1 ξ2b (Ω 1 λ XX )2o Ω 1 λ XY . (23) S0 S1 S2 Repeated 2-step self-distillation ξ1 ξ2 S0 S1 S2 Full 2-step self-distillation ξ1 ξ2b Figure 7: Repeated vs Full 2-step SD. We show that the extra freedom of the parameter ξ2a does not provide any additional freedom, when the other two ξ1, ξ2b are optimized as free parameters. Equations (22) vs (23) show that the full 2-step offers the same freedom as the repeated 2-step. However the repeated version has one shortcoming. To generate the optimal 2-step SD estimator, ξ1 needs to be different than the value needed for generating the optimal 1-step SD estimator. That is, the repeated k-step version, with its k free parameters, allows only to generate the final kth-step estimator as the optimal one (i.e. if we choose the (ξ1, ξ2) values so that the 2nd estimator is the optimal 2-step SD estimator, then the 1st estimator with the chosen ξ1 won t be the optimal 1-step SD estimator). Whereas the full k-step version, with all its k(k + 1)/2 free parameters, allows us to generate a sequence of all k optimal estimators. C.2 Proof that k-step SD estimator with ξ(k) Rk will have the form given in eq (8) Consider Figure 1. ξ(k) Rk is the set of actual imitation parameters used for running k-step (repeated) SD. Let ˆθ(λ, ξ(k)) denote the k-step estimator generated by using ξ(k) parameters. In what follows, we will prove that ˆθ(λ, ξ(k)) will have the form given in eq (8), with ξ(k) Rk as the described reparametrization of ξ(k) Rk. Since we re in the repeated version, the kth step objective will be a combination of losses w.r.to ground-truth labels and predictions from the (k 1)th step. So, the objective for the kth step is ˆθ(λ, ξ(k)) := argminθ Rd 2 X ˆθ(λ, ξ(k 1)) X θ 2 + (1 ξ(k) k ) 2 Y X θ 2 + λ (24) Note that ˆθ(λ, ξ(k)) recursively depends on predictions from the previous ˆθ(λ, ξ(k 1)). This objective is of the form in eq (4), so similar to eq (5), we have the following expression ˆθ(λ, ξ(k)) = XX + λId 1 X n ξ(k) k X ˆθ(λ, ξ(k 1)) + (1 ξ(k) k ) Y o (25) = n (1 ξ(k) k ) Ω 1 λ XY + ξ(k) k Ω 1 λ XX ˆθ(λ, ξ(k 1)) o . (26) Using this, we can inductively prove that the general form of this estimator is captured by Eq. (8). Claim. With the described reparametrization, i.e., ξ(k) i := (1 ξ(k) k i) Qk l=k i+1 ξ(k) l for each i [k] (where we let ξ(k) 0 = 0), Eq. (8) is the solution to the recursive form Eq. (26). Proof. We will prove this by induction. Base Case. From eqs (8) and (26), the case for k = 1 is true (with ξ(1) = ξ(1)). Inductive Step. Assuming Eq. (8) captures the solution of Eq. (26) for k 1, we get ˆθ(λ, ξ(k)) = (n 1 ξ(k) k o Id + ξ(k) k Ω 1 λ XX ( i=1 ξ(k 1) i i=1 ξ(k 1) i Ω 1 λ XX i !) This again satisfies the form in equation (8), when the following coefficients match i=1 ξ(k) i , i=1 ξ(k 1) i ) = ξ(k) 1 , ξ(k) k ξ(k 1) i 1 = ξ(k) i i {2, 3, , k} . One can then see that the described reparametrization makes the above hold true. Remark. 
Since ˆθ(λ, ξ(1)) is simply the 1-step SD estimator with the form ˆθ(λ, ξ(1)) = P Ω 1 λ XY for some preconditioner P, plugging this in the equation eq (26), we realize that we can factor out Ω 1 λ XY (i.e. the ridge solution) from the expression on the right side. That is, ˆθ(λ, ξ(2)) = P Ω 1 λ XY for some different pre-conditioner P . This is why inductively we get that ˆθ(λ, ξ(k)), k 1 all produce a pre-conditioning on the ridge solution, as shown in eq (8). Further, eq (26) also dictates why we get increasing powers of the term Ω 1 λ XX in the final expression. C.3 Explicit reparametrization for k = 2, 3 We explicitly demonstrate the reparametrization for k = 2, 3. As noted in Section 5.2 (and Appendix K), owing to the quadratic form of excess risk in ξ(k), one can find the optimal ξ(k) analytically. It can then be translated back to the original ξ(k) as follows. For k = 2, the form is (dropping the .(2) for ease) ξ1 = 1 ξ1 ξ1 + ξ2 , ξ2 = ξ1 + ξ2 . (27) For k = 3, the form is (dropping the .(3) for ease) ξ1 = 1 ξ2 ξ2 + ξ3 , ξ2 = 1 ξ1 ξ1 + ξ2 + ξ3 , ξ3 = ξ1 + ξ2 + ξ3 . (28) D Algebraic expansion of ˆθ(λ, ξ(k)) from eq (8) Lemma D.1. The estimator ˆθ(λ, ξ(k)) given in eq (8) can be expanded as ˆθ(λ, ξ(k)) = s2 j λ + s2 j s2 j λ + s2 j θ , uj uj s2 j λ + s2 j sj λ + s2 j η, vj uj Proof. The expansion relies on XX = Ud S2 d U d and XX + λId = Ud(S2 d + λId)U d . Ud is orthonormal (Ud U d = Id = U d Ud), which neatly cancels all occurences of Ud in the middle, and allows us to directly combine the matrices in the eigenvalue space. E Detailed version and a proof of Theorem 1 We first write a detailed version of the theorem that exactly characterizes the family of problem instances that admits the separation. We then present a proof. We point the reader to Appendix A for notations used throughout the proof. Theorem 6 (Detailed version). Consider the following conditions on θ , X that characterize a family of problem instances {(X, θ , γ2)}: 1. θ satisfies j [r], θ , uj 2 (1 β) θ 2 [Assumption 2.2] 2. θ 2 198 rγ2/s2 r (s1/sr)4 3. Assumption 2.1 holds on the singular values of X [Assumption 2.1] 4. (s2 1/s2 r) 2 holds on the singular values of X Under Condition #1 + Condition #3, it holds that λ > 0, ξ(r) Rr, Excess Risk ˆθ(λ, ξ(r)) γ2 1 + β θ 2s2 1 γ2 (Rewriting eq (9)) Under Condition #1 + Condition #2 + Condition #4, it holds that λ > 0, ξ R, Excess Risk ˆθ(λ, ξ) 0.99 211 (1 β) rγ2 n (Rewriting eq (10)) Under Condition #1 + Condition #2, it holds that λ > 0, Excess Risk ˆθ(λ) 0.98 1 β 1 0.99β n (Rewriting eq (11)) Proof. We will analyze all three: k-step SD, ridge, and 1-step SD. Let j [r] be the index that corresponds to the maximum (unsigned) alignment between θ and uj, among all j [r]. If it is not unique, pick any of the corresponding indices. That is, j satisfies θ , uj 2 θ , uj 2 for all j [r]. Then, Condition #1 translates to the below two inequalities, θ j θ , uj 2 (1 β) θ 2 , (29) j=1,j =j θ j j=1,j =j θ , uj 2 j=1,j =j θ , uj 2 β θ 2 . (30) k-step SD: Since Assumption 2.1 holds, from Theorem 4 we directly have λ > 0, ξ(r) Rr, Excess Risk ˆθ λ, ξ(r) = γ2 θ j θ j + γ2 θ j θ j + γ2 θ j θ j + γ2 θ j θ j + γ2 θ j γ2 s2 j 1 + s2 1 γ2 j=1,j =j θ j Using eq (30) in the above, we have 1 + β θ 2s2 1 γ2 This completes the proof for eq (9). Ridge: We will now show (11), by characterizing the Excess Risk for the ridge estimator in this regime, showing that it can be upto r times worse. 
From Lemma I.1, we realize that the Excess Risk expression for the simple ridge estimator ˆθ(λ) is given by λ > 0, Excess Risk ˆθ(λ) = 1 λ2θ j + γ2s2 j s2 j (λ + s2 j)2 (37) s4 j (λ + s2 j)2 | {z } Just the Variance term Inequality (38) above comes from ignoring the bias term (since it is non-negative). Note that the variance term is a decreasing function of λ. And again from Lemma I.1, we get the following expression for λ > 0 that minimizes the Excess Risk. Pr j=1 s4 j (λ +s2 j) 3 Pr j=1 θ j s4 j (λ +s2 j) 3 (39) Pr j=1 s4 j (λ +s2 j) 3 θ j s4 j λ +s2 j 3 (Ignoring positive terms in the denominator above) (Rearranging the above) (Using eq (29)) λ + s2 j 3 + s4 j s4 j + (Since sj sj for j j ) s4 j s4 j + s4 j s4 j s6 j s6 j (Since λ +s2 j λ +s2 j s2 j s2 j for j > j , as λ > 0) s4 j s4 j | {z } r ( s1 sr ) 4 s2 j s2 j | {z } r ( s1 sr ) 2 Now since the variance term in (38) is a decreasing function of λ, we can use the upper bound of λ from (40) to lower bound the Excess Risk of optimal ridge as Excess Risk ˆθ(λ ) γ2 2 (Rewriting eq (38)) 2 (Since 0 < sr sj for all j) n 1 1 + 1 1 β 2rγ2 θ 2s2r s1 sr 4 2 (Using eq (40)) n 1 1 + 1 99(1 β) 2 (Using Condition #2) n (99(1 β))2 (100 99β)2 (41) | {z } 0.98 (1 0.99β)2 . (42) This completes the proof of eq (11). 1-step SD: To evaluate 1-step SD s Excess Risk, we make use of Theorem 5. Observe that since k = 1, the Excess Risk is a simple quadratic in one variable. We will call ξ(1) R as just ξ (similar to eq (6)). And similarly, we will call M (1), m(1) as just M, m, both real-valued. We then have Excess Risk ˆθ(λ, ξ) = Mξ2 + 2mξ + c M ξ R , (By simple quadratic min) Note that M, m, c are all functions of λ. Now we evaluate their expressions. c is simply Excess Risk of ridge, which we will fetch from Lemma I.1. Evaluate M from Theorem 5: λθ j ρj + γ2 (1 + ρj)2 Cj(1) Cj(1) λθ j ρj + γ2 (1 + ρj)2 ρ2 j (1 + ρj)2 (Using Cj(1) = 1 1 1+ρj ) λθ j ρj + γ2ρ2 j λ2θ j s6 j + γ2λ2s4 j (λ + s2 j)4 . (Using ρj = λ Evaluate m from Theorem 5: (1 + ρj)2 Cj(1) (1 + ρj)2 ρj 1 + ρj (Using Cj(1) = 1 1 1+ρj ) λ2θ j s4 j γ2λs4 j (λ + s2 j)3 . (Using ρj = λ We now write the expressions together for comparison, before we use them. That is, n M = λ2 r X θ j s6 j (λ + s2 j)4 + λ2γ2 r X s4 j (λ + s2 j)4 , n m = λ2 r X θ j s4 j (λ + s2 j)3 λγ2 r X s4 j (λ + s2 j)3 , n c = λ2 r X θ j s2 j (λ + s2 j)2 + γ2 r X s4 j (λ + s2 j)2 . (Directly from Lemma I.1) With all these pieces, for the numerator in eq (43), we have n2 Mc m2 = T1 + T2 + T3 , where we use T1, T2, T3 to capture terms of different forms. T1 will capture the θ j product terms, T2 will capture the γ2 product terms, and T3 will capture the cross terms. Namely, T1 = λ4 r X θ j θ l s2 js2 l (λ + s2 j)2(λ + s2 l )2 s4 j (λ + s2 j)2 + s4 l (λ + s2 l )2 2s2 js2 l (λ + s2 j)(λ + s2 l ) θ j θ l s2 js2 l (λ + s2 j)2(λ + s2 l )2 s2 j (λ + s2 j) s2 l (λ + s2 l ) T2 = λ2γ4 r X s4 js4 l (λ + s2 j)2(λ + s2 l )2 1 (λ + s2 j)2 + 1 (λ + s2 l )2 2 (λ + s2 j)(λ + s2 l ) s4 js4 l (λ + s2 j)2(λ + s2 l )2 1 (λ + s2 j) 1 (λ + s2 l )2 θ j s6 j (λ + s2 j)4 s4 j (λ + s2 j)2 θ j s2 j (λ + s2 j)2 s4 j (λ + s2 j)4 θ j s4 j (λ + s2 j)3 s4 j (λ + s2 j)3 θ j s6 j (λ + s2 j)4 s4 j (λ + s2 j)2 θ j s2 j (λ + s2 j)2 s4 j (λ + s2 j)4 (Ignoring the 3rd term) We lower bound the terms Z1, Y1 and Z2, Y2 in the above lower bound on T3 using simple functional analysis. Consider the fact that the function fp(z) := z2p (λ+z2)p over z > 0 is strictly monotonically increasing for any p N, since f p(z) = 2pλz2p 1 (λ+z2)p+1 > 0 for all z > 0. 
We will use the instantiation f4 for Z1 and Y2, and f2 for Z2 and Y1. For Z1, we get s2 j s8 j (λ + s2 j)4 s8 r (λ + s2r)4 s2 j (Using monotone increasingness of f4) s8 r (λ + s2r)4 1 (Using monotone decreasingness of f(z) := 1/z2) s8 r (λ + s2r)4 (1 β) θ 2 s2 1 (Using eq (29)) (1 β) θ 2 1 s2 1 s8 r (λ + s2r)4 . (44) Following a similar analysis scheme, we get Z2 (1 β) θ 2 1 s2 1 s4 r (λ + s2r)2 , (45) Y1 r s4 r (λ + s2r)2 , (46) s4 1 s8 r (λ + s2r)4 . (47) Using eqs (44), (45), (46), (47), we get an overall lower bound on T3. For the overall lower bound on the numerator of eq (43), we combine this with eqs ( ), ( ) for the terms T1 and T2. We get n2 Mc m2 rγ2 (1 β) θ 2 s2 1 s12 r (λ + s2r)6 And for the denominator in eq (43), we have n M = λ2 r X θ j s6 j (λ + s2 j)4 + λ2γ2 r X s4 j (λ + s2 j)4 θ j s6 j (λ + s2 j)4 + γ2 r X s4 j (λ + s2 j)4 Upper Bound λ2 from eq (49): θ j s6 j s8 j + γ2 r X θ j s2 j + γ2 r X j=1 θ j + γ2 j=1 θ j + γ2 r X θ 2s2 r + rγ2 λ2 θ 2s2 1 + rγ2 . (50) Upper Bound 1/λ2 from eq (49): θ j s6 j λ4 + γ2 r X j=1 θ j s6 j + γ2 r X j=1 θ j + γ2s4 1 s4 1 λ2 θ 2s2 1 + rγ2 . (51) Note that we used Pr j=1 θ j Pd j=1 θ j = θ 2 in eqs ( , ). Now we put together the numerator lower bound from eq (48), with the denominator upper bound from eq (50) for the first term, and from eq (51) for the second term. We then get M rγ2 (1 β) θ 2 s2 1 s12 r (λ + s2r)6 s4 r ( θ 2s2 1 + rγ2) + λ6 s8 1 ( θ 2s2 1 + rγ2) = (1 β) rγ2 θ 2s4 r s2 1 ( θ 2s2 1 + rγ2) s12 r (λ + s2r)6 + s8 r s8 1 λ6 = (1 β) rγ2 s4 r s4 1 1 1 + rγ2 θ 2s2 1 s12 r (λ + s2r)6 + s8 r s8 1 λ6 Under Condition #2, it holds that rγ2 θ 2s2 1 1 198 s6 r s6 1 1 198 (since sr s1). This gives 199 0.99 . (53) And now we analyze Q2 using simple calculus. Note that Q2 = t1 + t2λ6 (λ + s2r)6 , where t1 = s12 r , t2 = s8 r s8 1 . Simple calculus shows that this function is minimized at λ = (t1/(t2s2 r))0.2 = s8 1s2 r 0.2. And the min value of the function (evaluated at λ) can be used as a lower bound for this, giving Q2 t1 s2r λ + s2r 5 = 1 1 + s1 sr Combining eqs (43), (52), (53), (54), we get λ > 0, ξ R Excess Risk ˆθ(λ, ξ) (1 β) rγ2 Using Condition #4 in the above gives the desired eq (10). F Proof of Theorem 2 We point the reader to Appendix A for notations used throughout the proof. Proof. Under the condition of j [r], sj = 1, we need to analyze the Excess Risk for both k-step SD and ridge. Denote Q := Pr j=1 θ j for simplicity. k-step SD: Since assumption 2.1 is violated, we cannot use Theorem 4 for the k-step SD. Instead, we will work with the quadratic expansion of Excess Risk from Theorem 5. We will rewrite the expressions from Appendix I for the quadratic coefs. Note that ρj = λ/s2 j = λ for all j [r] becomes independent of j in this case. And so Cj(i) = 1 1 (1+λ)i for all j [r], i [k] also becomes independent of j. Let C(i) := 1 1 (1+λ)i denote the coefs Cj(i), since they re now independent of j. Let ω := [C(1), C(2), , C(k)] Rk. Then, we get (i1, i2) [k] [k] M (k) i1,i2 = 1 (1 + λ)2 C(i1) C(i2) (1 + λ)2 C(i1) C(i2) = M (k) = 1 (1 + λ)2 ωω Rk k becomes a rank-1 matrix Similarly, for m(k) we get i1 [k] m(k) i1 = 1 (1 + λ)2 C(i1) (1 + λ)2 ω Rk And similarly, for c we get n λ2Q + rγ2 Using the above expressions, we rewrite the overall Excess Risk for any k-step SD (k 1) as Excess Risk ˆθ(λ, ξ(k)) = 1 (1 + λ)2 ω, ξ(k) 2 + 2 λQ rγ2 (1 + λ)2 ω, ξ(k) + λ2Q + rγ2 We re aiming for a lower bound on the above quantity, so we can first minimize with respect to ξ(k), and then analyze the remaining as a function of λ. 
Note that this is a quadratic in the scalar ω, ξ(k) . Since q(x) := ax2 + 2bx + c = c b2 a + a x + b a , we have k 1, ξ(k) Rk Excess Risk ˆθ(λ, ξ(k)) 1 n 1 (1 + λ)2 n 1 (1 + λ)2 λ2Qrγ2 + Qrγ2 + 2λQrγ2 Note this expression is independent of λ. Hence the above lower bound holds k 1, ξ(k) Rk, λ > 0. This concludes the proof of eq (12). Ridge: For ridge, we can simply borrow the expression of c from above for its Excess Risk. That is Excess Risk ˆθ(λ) = 1 n λ2Q + rγ2 Let λ denote its minimizer over λ > 0. Simple calculus gives λ = rγ2 Q . Plugging this in, we get the same expression Excess Risk ˆθ(λ ) = rγ2 Q This completes the proof of ridge achieving the lower bound in eq (12). G Proof of Theorem 3 We point the reader to Appendix A for notations used throughout the proof. Proof. Under the condition of j [r], θ j = z > 0, we need to analyze the Excess Risk for both k-step SD and ridge. k-step SD: Again from Theorem 4, any k-step SD estimator s Excess Risk is k 1, λ > 0, ξ(k) Rk, Excess Risk ˆθ(λ, ξ(k)) γ2 θ j θ j + γ2 (Since j [r], θ j = z) This completes the proof of eq (13). Ridge: Now for the ridge estimator, we will use Lemma I.1. From eq (61), we get an exact expression for λ > 0 that minimizes the Excess Risk. Namely z Substituting this in eq (60), we get Excess Risk ˆθ(λ ) = 1 (λ )2θ j + γ2s2 j s2 j (λ + s2 j)2 λ + s2 j s2 j (λ + s2 j)2 (Since λ θ j = γ2 s2 j ) (Substituting λ = γ2 This completes the proof of ridge achieving the lower bound in eq (13). H Proof of Theorem 4 We point the reader to Appendix A for notations used throughout the proof. Proof. The proof of (14) is a simple instantiation of Lemma 4.1. Since the SD estimator is a particular instance of the general ˆθ(P), i.e. ˆθ λ, ξ(k) = ˆθ P P λ, ξ(k) Ω 1 λ Eq (14) follows from the lower bound in Lemma 4.1. For proving the equality being achieved, we will work with the general k-step SD estimator, and show that: Under assumption 2.1, k = r steps are sufficient to provide enough freedom to ξ(k) so that the k-step SD estimator achieves the lowest possible Excess Risk. Using lemma 4.1, note that the condition P λ, ξ(k) Ω 1 λ = P is sufficient to ensure that ˆθ λ, ξ(k) = ˆθ(P ), which would mean that the k-step SD estimator admits the lowest possible Excess Risk. Since the eigenspaces for both sides of the equation are the same (Ud), we only need to ensure that the eigenvalues match on both sides. That is, we need the following condition (for indices j r + 1, there s no condition since lemma 4.1 tells us that any real value of sj suffices). s2 j λ + s2 j 1 λ + s2 j = s j (55) i=1 ξ(k) i + s2 j λ + s2 j s2 j λ + s2 j = θ j θ j + γ2 Let aj(λ) := s2 j λ+s2 j . Since λ > 0, aj (0, 1), j [r]. The above condition can then be written succinctly in matrix form as A(k)(λ) | {z } Rr k ξ(k) |{z} Rk = α(λ) |{z} Rr (57) With the following describing the elements of A(k)(λ), α(λ) as [α(λ)]j := 1 λ + s2 j s j j [r] h A(k)(λ) i j,i := 1 (aj(λ))i j [r], i [k] The notation A(k)(λ), α(λ) explicitly denotes the dependence on λ. Now, A(k)(λ) being invertible would ensure that ξ(k) Rk that makes equation (57) hold true (i.e. the system of equations admits a solution). For that, we need (1) k = r and (2) A(r)(λ) being full-rank. The first condition is stated in the lemma. In what follows, we will prove that λ > 0 such that the second condition is satisfied, i.e. A(r)(λ) being full-rank. Decompose A(r)(λ) as 1 1 1 1 1 1 ... ... ... ... 1 1 1 r r | {z } W a1(λ) a1(λ)2 a1(λ)r a2(λ) a2(λ)2 a2(λ)r ... ... ... ... 
ar(λ) ar(λ)2 ar(λ)r r r | {z } V (λ) Where W = 11 is the matrix of all ones, and V (λ) is akin to the square Vandermonde matrix (only difference being that the standard definition of Vandermonde also has a column of ones). Using the Matrix Determinant Lemma, we can write det(A(r)(λ)) = 1 1 V (λ) 11 det( V (λ)) First note that, with the determinant expansion rule based on the first row det V (λ) = det 1 0 1 V (λ) (r+1) (r+1) The matrix on the right is exactly the Vandermonde matrix with a0 = 0, a1(λ), a2(λ), ar(λ). Using the formula for the det of a standard (with a row of ones) Vandermonde matrix, we get det V (λ) = Y 0 i 0. With the SVD of V (λ) = CDE 1 where C, E are orthonormal and D is diagonal with positive entries, one can expand this term as: 1 V (λ) 11 = (E 11) D 1(C 11) = |1 V 11| 1 dmax | C 11, E 11 | Now observe that as λ , aj(λ) 0 for all j [r], which means that V (λ) 0r r, which means that (1) dmax 0+, and (2) C E (since V (λ) becomes closer to a symmetric matrix, allowing its left and right singular matrices to approach equality to each other). That is, 1 dmax and C 11, E 11 1 2 = r. This would ensure that |1 V (λ) 11| > 1, meaning that the scalar (1 1 V (λ) 11) would be non-zero. Hence λ > 0, such that A(r)(λ) is full-rank. H.1 Proof of Lemma 4.1 Proof. The estimator ˆθ(P) expands as ˆθ(P) = Ud SS2 d U d θ + Ud SSd V d η . Expanding the Excess Risk shows Excess Risk ˆθ(P) = sj s2 j 1 2 θ j wj | {z } Bias s2 j s2 j wj | {z } V ariance where wj = s2 j n for all j [d]. This is a simple quadratic expression in s. Completing the squares gives the desired optimal values of s j, j [d]. I Quadratic Excess Risk: detailed version of Theorem 5 and a proof We point the reader to Appendix A for notations used throughout the proof. Theorem 7 (Formal version of Theorem 5). The Excess Risk is quadratic in ξ(k) Rk. Namely Excess Risk ˆθ(λ, ξ(k)) = ξ(k) M (k) | {z } Rk k ξ(k) + 2 ξ(k) m(k) |{z} Rk + c (58) where the below holds, for indices (i1, i2) in [k] [k], M (k) i1,i2 = λ θ j ρj(1 + ρj)2 Cj(i1) Cj(i2) 1 (1 + ρj)2 Cj(i1) Cj(i2) V (k) i1,i2 λθ j ρj + γ2 (1 + ρj)2 Cj(i1) Cj(i2) , m(k) i1 = λ θ j (1 + ρj)2 Cj(i1) ( 1) (1 + ρj)2 Cj(i1) (1 + ρj)2 Cj(i1) , θ j ρj (1 + ρj)2 | {z } From Bias 1 (1 + ρj)2 | {z } From Variance λθ j ρj + γ2 (1 + ρj)2 . The B, b and V, v notation is used to indicate the (squared) Bias and Variance terms respectively. Here ρj, j [r] is used to simplify the notation, and is defined as ρj := λ s2 j , j [r]. And the coefs Cj, j [r] with i [k] have the form Cj(i) := 1 1 (1 + ρj)i j [r], i [k] . (59) Proof. We first use the expansion from Lemma D.1. Secondly, expand θ = Pd j=1 θ , uj uj. Using these, we can expand Excess Risk as Excess Risk ˆθ(λ, ξ(k)) = Eη ˆθ(λ, ξ(k)) θ 2 s2 j n |{z} ˆΣn weighing s2 j λ + s2 j s2 j λ + s2 j s2 j n |{z} ˆΣn weighing γ2 s2 j (λ + s2 j)2 s2 j λ + s2 j Writing down the above expansion and collecting the corresponding quadratic, linear and constant terms coefficients in the ξ(k), give the desired expressions. I.1 Excess Risk for ridge Since we will compare the ridge estimator s Excess Risk to the k-step SD estimator, we state the Excess Risk expression for the ridge estimator (eq (3)) formally here. Lemma I.1. The ridge estimator ˆθ(λ) satisfies λ > 0, Excess Risk ˆθ(λ) = 1 λ2θ j + γ2s2 j s2 j (λ + s2 j)2 . (60) And the optimal penalty λ > 0 that minimizes this Excess Risk satisfies Pr j=1 s4 j (λ +s2 j) 3 Pr j=1 θ j s4 j (λ +s2 j) 3 . (61) Proof. Eq (60) is a simple instantiation of Theorem 5 in the vacuous case of k = 0. 
Specifically, the quantity c captures exactly what we need; borrowing its expression from the detailed theorem statement (Appendix I) gives us eq (60). Now, taking the derivative of the expression in eq (60) and setting it to zero, we get the stated expression for λ* > 0.

J Discussion on synthetic experiments

This section follows up on Section 5.1 with more details and examples for the synthetic problem. Figure 8 shows four more settings, with γ ∈ {0.125, 0.25} and θ* ∈ {u1, (u1 + u2)/√2}. Figure 8a validates that repeated steps of SD do provide a reduction in the excess risk, since the lowest point of the curve for each k decreases as k increases. Observe that at k = r = 4 steps of SD, the curves in Figure 8 become flat. This is because we stated Theorem 4 with an "∃ λ > 0" such that r-step SD can achieve the lower bound, but in practice it can happen that this holds for all λ > 0 ("∀ λ > 0").

Figure 8: Plot of excess risk of k-step SD with optimal ξ(k) for each λ, for k ∈ {0, 1, 2, 3, 4} on a synthetic problem (Section 5.1). Panels: (a) γ = 0.125, θ* = u1; (b) γ = 0.25, θ* = u1; (c) γ = 0.125, θ* = (u1 + u2)/√2; (d) γ = 0.25, θ* = (u1 + u2)/√2.

K Discussion on choosing ξ parameters

For the k-step SD estimator (eq (8)), for any chosen λ, Theorem 5 tells us that the Excess Risk is quadratic in ξ(k) ∈ R^k. To estimate the coefficients of this quadratic from chosen evaluations of ξ(k), we need a total of k(k+3)/2 + 1 evaluations. This is because the number of unknown coefficients is (i) k for the square terms, (ii) k(k−1)/2 for the cross terms, (iii) k for the linear terms, and (iv) 1 for the constant term. After estimating the coefficients of the quadratic, we only need to perform a grid search over one parameter, namely λ. We now illustrate this in detail for k = 1, 2.

K.1 Illustration for k = 1

For 1-step SD, we know that Excess Risk(θ̂(λ, ξ)) = Aξ^2 + 2Bξ + C for unknown A, B, C (that depend on λ). Training 3 specific 1-step SD estimators, with ξ ∈ {−1, 0, 1}, and measuring each of those 3 estimators' risk on the validation set, lets us estimate A, B, C. We then know that ξ* = −B/A for the chosen value of λ.

K.2 Illustration for k = 2

Figure 9: Illustrating 2-step SD (models S0 → S1 → S2, trained with parameters ξ1 and ξ2).

For 2-step SD, the Excess Risk for a chosen λ has the following form:
Excess Risk(θ̂(λ, ξ ∈ R^2)) = A ξ̃1^2 + B ξ̃2^2 + 2C ξ̃1 ξ̃2 + 2D ξ̃1 + 2E ξ̃2 + F ,
where the reparametrization is ξ̃1 ← ξ2(1 − ξ1) and ξ̃2 ← ξ2 ξ1 (refer to Appendix C.3 for details on this reparametrization). Let EVAL(ξ2, ξ1) denote the result of measuring this estimator's risk on the validation dataset. We get the below system of equations:
F = EVAL(ξ2 = 0, ξ1 = 0)
A + 2D + F = EVAL(ξ2 = 1, ξ1 = 0)
A − 2D + F = EVAL(ξ2 = −1, ξ1 = 0)
B + 2E + F = EVAL(ξ2 = 1, ξ1 = 1)
B − 2E + F = EVAL(ξ2 = −1, ξ1 = 1)
(A + B + 2C)/4 + D + E + F = EVAL(ξ2 = 1, ξ1 = 0.5)
The above 6 EVAL operations identify all 6 unknown coefficients. Given AB − C^2 > 0 ([15]), the values ξ̃1*, ξ̃2* below minimize the validation loss (i.e., give the optimal 2-step SD coefficients for the chosen λ). Using them, we recover the actual ξ1*, ξ2* via the inverse mapping of the reparametrization:
ξ̃1* = (CE − DB)/(AB − C^2) , ξ̃2* = (CD − AE)/(AB − C^2) ,
ξ1* = ξ̃2*/(ξ̃1* + ξ̃2*) , ξ2* = ξ̃1* + ξ̃2* .

L Details on regression experiments

We note that all experiments run on a single CPU within 60 seconds (wall-clock time). We utilize sklearn's implementation of ridge regression.

Methodology. We now explain our methodology in detail, which is followed for all datasets:
1. First, split the original dataset into three parts for a Train-Validation-Test split. We divide all datasets in a 30-30-40 split.
Note that 30% of the data suffices for training since we work with small datasets which have d on the order of ten (and a total number of samples n on the order of thousands). For datasets that have a temporal notion (e.g., date/timestamps), we do the three-way split sequentially.
2. Perform two standard transformations on all three splits of the data: (i) remove records with missing values, and (ii) coordinate-wise whitening of all X features and the Y target, i.e., subtracting the mean (computed on the train set) and dividing by the standard deviation (also computed on the train set).
3. Select a grid of λ values (and ensure that it is large enough so that the optimal λ lies in it). The grid has a factor of 10 between consecutive values (e.g., {1, 10, 10^2, ..., 10^4}). Then, for all λ in the grid, compute:
The ridge estimator θ̂(λ) using the train set.
The 1-step SD estimator θ̂(λ, ξ*) using the train set, where ξ* for each λ is found using the validation set by the strategy described in Section 5.2.
The 2-step SD estimator θ̂(λ, (ξ(2))*) using the train set, where (ξ(2))* for each λ is found using the validation set by the strategy described in Appendix K.2.
Let λ0* denote the optimal penalty for ridge that minimizes the MSE on the validation set. Similarly, let (λ1*, ξ*) and (λ2*, (ξ1*, ξ2*)) denote the optimal parameters for 1-step and 2-step SD, again chosen via the validation set.
4. Evaluate the MSE on the test set (unseen as yet) for all three computed estimators: θ̂(λ0*), θ̂(λ1*, ξ*), and θ̂(λ2*, (ξ1*, ξ2*)). A minimal illustrative code sketch of this pipeline (for 1-step SD) is provided below.

L.1 Description of the datasets used

UCI Air Quality. The UCI Air Quality dataset [34] contains records of hourly averaged readings from 5 metal oxide chemical sensors (which serve as covariates), along with ground-truth hourly averaged concentrations of the pollutants from a co-located reference certified analyzer. The recordings are from an Italian city over a one-year period in 2004-2005. After removing records with missing entries, this dataset has n = 6941 rows and we consider d = 8 relevant covariates (the 5 metal oxide chemical sensor readings + 3 values of Temperature, Relative Humidity, and Absolute Humidity). We explicitly mention the field names (all real-valued and non-negative).
X covariate names in the dataset: [PT08.S1(CO), PT08.S2(NMHC), PT08.S3(NOx), PT08.S4(NO2), PT08.S5(O3), T, RH, AH]
Y target's name in the dataset: NO2(GT)

UCI Airfoil Self-Noise. The UCI Airfoil Self-Noise dataset [5] contains data obtained from NASA's aerodynamic and acoustic tests of airfoil blade sections. It contains 5 real-valued covariates, which are relevant physical quantities. The target is also real-valued: the sound pressure created (in dB). There are no missing entries. This dataset has n = 1503 rows and d = 5 covariates. We explicitly mention the field names:
X covariate names in the dataset: [frequency, attack-angle, chord-length, free-stream-velocity, suction-side-displacement-thickness]
Y target's name in the dataset: scaled-sound-pressure

UCI Appliances Energy Prediction. The UCI AEP dataset [6] contains energy appliances data collected at 10-minute frequency for about 4.5 months. This dataset has no missing entries, and a total of 19735 instances with 28 covariates. We downsample the dataset to hourly frequency, giving n = 3290 rows with d = 24 relevant covariates (removing the degenerate covariates: date, lights, windspeed, visibility).
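To make the above methodology and the ξ-selection strategy of Appendix K.1 concrete, the following is a minimal, self-contained Python sketch. It is illustrative only and is not the code used for the reported experiments: synthetic data stands in for the UCI datasets, the 1-step SD estimator is assumed to take the closed form of ridge regression on the mixed targets ξ·ŷ_T + (1 − ξ)·y with the same penalty λ (which follows from the distillation objective with squared loss), and the helper names (ridge, sd1, mse) are ours.

```python
# Illustrative sketch of the Appendix L pipeline with the Appendix K.1 strategy
# for choosing xi in 1-step SD. Synthetic data stands in for the UCI datasets.
import numpy as np

rng = np.random.default_rng(0)

def ridge(X, y, lam):
    """Closed-form ridge solution (no intercept); X is (n, d), y is (n,)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def sd1(X, y, lam, xi, theta_teacher):
    """1-step SD student: ridge regression on the mixed targets
    xi * (teacher predictions) + (1 - xi) * (ground-truth labels)."""
    y_mix = xi * (X @ theta_teacher) + (1.0 - xi) * y
    return ridge(X, y_mix, lam)

def mse(theta, X, y):
    return float(np.mean((X @ theta - y) ** 2))

# Synthetic stand-in data (already roughly whitened); sequential 30-30-40 split.
n, d, noise = 3000, 10, 0.5
theta_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ theta_star + noise * rng.normal(size=n)
n_tr, n_va = int(0.3 * n), int(0.6 * n)
Xtr, ytr = X[:n_tr], y[:n_tr]
Xva, yva = X[n_tr:n_va], y[n_tr:n_va]
Xte, yte = X[n_va:], y[n_va:]

best = {"ridge": (np.inf, None), "1-step SD": (np.inf, None)}
for lam in [10.0 ** p for p in range(0, 5)]:  # grid with factor-10 spacing
    th_r = ridge(Xtr, ytr, lam)               # teacher = ridge at this lambda
    if mse(th_r, Xva, yva) < best["ridge"][0]:
        best["ridge"] = (mse(th_r, Xva, yva), th_r)

    # Appendix K.1: validation MSE is quadratic in xi, A*xi^2 + 2*B*xi + C,
    # so three evaluations at xi in {-1, 0, 1} identify A, B, C and xi* = -B/A.
    e = {x: mse(sd1(Xtr, ytr, lam, x, th_r), Xva, yva) for x in (-1.0, 0.0, 1.0)}
    A = (e[1.0] + e[-1.0]) / 2.0 - e[0.0]
    B = (e[1.0] - e[-1.0]) / 4.0
    xi_star = -B / A if A > 0 else 0.0        # fall back to ridge if not convex
    th_s = sd1(Xtr, ytr, lam, xi_star, th_r)
    if mse(th_s, Xva, yva) < best["1-step SD"][0]:
        best["1-step SD"] = (mse(th_s, Xva, yva), th_s)

for name, (val_mse, theta) in best.items():
    print(f"{name}: validation MSE = {val_mse:.4f}, test MSE = {mse(theta, Xte, yte):.4f}")
```

The same recipe extends to 2-step SD by replacing the three evaluations with the six EVAL calls of Appendix K.2 and mapping the resulting optimal (ξ̃1*, ξ̃2*) back through the reparametrization of eq (27).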
The full list of AEP covariates is 24 long, and so we do not list it here (we already mentioned the 4 we removed from the total of 28).

L.2 Measuring the alignment between θ* and the eigenbasis directions for the datasets

The goal here is to compute the alignment between θ* and the eigenbasis directions {uj}, j ∈ [d], for the three real-world datasets. The eigenspace Ud is known from the design matrix X, but θ* is unknown for real-world tasks. As a proxy, we use the ridge solution θ̂(λ) (with a small λ).
Methodology for choosing λ: We considered using the OLS solution θ̂_OLS := (XXᵀ)⁻¹XY as the proxy for θ*, but (XXᵀ)⁻¹ was numerically unstable for these datasets, so we instead used the ridge solution θ̂(λ) with a small λ. We calculated this λ methodically for all datasets as a constant fraction of the sum of squared singular values. Explicitly, we (i) computed the SVD of the design matrix X, and (ii) set λ := C · Σ_{j=1}^d s_j^2 using the obtained singular values. The value C := 10^(-5) was chosen arbitrarily (and the observed trend is stable across other reasonably small values of C).
Figure 10 shows the result. The sum of all bars in a single plot is one, since {uj}, j ∈ [d], are unit-norm vectors that form an orthogonal basis of R^d. We infer two things. Firstly, for multi-step SD to outperform ridge (as is the case for the Air Quality and Airfoil datasets), θ* can be well-aligned with any of the uj, j ∈ [d]; not necessarily u1. Secondly, this gives insight into why multi-step SD could not outperform ridge on the AEP task (Table 1). Unlike the other two datasets, the θ* for the AEP dataset is not strongly aligned with any of the uj, j ∈ [d]. The top component for AEP explains only 35% of the total θ* norm, whereas that number is close to 80% for the Air Quality and Airfoil datasets. This indicates a large β value for the AEP dataset in Assumption 2.2.

L.3 Observed quadratic nature of MSE w.r.t. ξ parameters

In this section, we show that the optimal ξ values given in Table 1, selected via the strategy in Section 5.2, are indeed the ones that minimize the validation MSE. By inspecting the minima of each of the curves in Figure 11, we observe that the values of ξ* for 1-step SD in Table 1 (found through the strategy described in Section 5.2) indeed coincide with the minimizing values of those curves. Further, Figure 11 empirically demonstrates the quadratic nature of risk (MSE) vs. ξ described in Theorem 5.

Figure 10: The alignment of θ* to the eigenbasis directions {uj}, j ∈ [d], for the three datasets used in the experiments. Panels: (a) Air Quality dataset (λ = 0.167); (b) Airfoil dataset (λ = 0.023); (c) AEP dataset (λ = 0.237). The low alignment of θ* to any of the eigenbasis directions for the AEP dataset explains why SD provides no gain over ridge in the test-set MSE values observed in Table 1. Details of the methodology used to compute the alignment are provided in Appendix L.2.

Figure 11: Observed quadratic nature of MSE (on the validation set) of 1-step SD vs. ξ for λ = λ1*. Panels: (a) Air Quality dataset; (b) Airfoil dataset; (c) AEP dataset. This agrees with Theorem 5 and validates the hyperparameter tuning strategy outlined in Section 5.2.

NeurIPS Paper Checklist
1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: We present a transparent top-level summary of all our results in the abstract and introduction.
Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper.
The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: Wherever applicable, limitations have been discussed (e.g. Remark 4.2). Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: All theorems and lemmas have been stated with the precise assumptions used, and the Appendix provides all the proofs. Further, proof sketches and main ideas have been presented in the main paper. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. 
Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: Details of the experiments (dataset, methodology used) are provided in the main paper and in the appendix. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: Our main contributions are theoretical results, however we plan to release relevant code on github. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. 
While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: We specify all relevant details in the paper (especially refer to Section 5.2 for how we exactly select our hyperparameters). Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [NA] Justification: Since our experiments don t have randomized components (estimators have a closed-form solution), we do not report error bars. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). 
If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: Our experiments are smaller scale and all experiments run on a single CPU within a minute. We mention this in the Appendix. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: We uphold the code of ethics. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [Yes] Justification: Although our work is largely theoretical, it does contribute to the understanding of self-distillation, and we mention potential societal impacts in the Conclusion section. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. 
If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: This work poses no such risk. Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: We write our own code. All data used is indeed permissible to use for research purposes. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: We don t release new assets. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. 
Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: Not relevant. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: Not relevant. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.