# Anisotropic Random Feature Regression in High Dimensions

Published as a conference paper at ICLR 2022

Gabriel C. Mel, Neurosciences Graduate Program, Stanford University (meldefon@gmail.com)
Jeffrey Pennington, Brain Team, Google Research (jpennin@google.com)

ABSTRACT

In contrast to standard statistical wisdom, modern learning algorithms typically find their best performance in the overparameterized regime in which the model has many more parameters than needed to fit the training data. A growing number of recent works have shown that random feature models can offer a detailed theoretical explanation for this unexpected behavior, but typically these analyses have utilized isotropic distributional assumptions on the underlying data generation process, thereby failing to provide a realistic characterization of real-world models that are designed to identify and harness the structure in natural data. In this work, we examine the high-dimensional asymptotics of random feature regression in the presence of structured data, allowing for arbitrary input correlations and arbitrary alignment between the data and the weights of the target function. We define a partial order on the space of weight-data alignments and prove that generalization performance improves in response to stronger alignment. We also clarify several previous observations in the literature by distinguishing the behavior of the sample-wise and parameter-wise learning curves, finding that sample-wise multiple descent can occur at scales dictated by the eigenstructure of the data covariance, but that parameter-wise multiple descent is limited to double descent, although strong anisotropy can induce additional signatures such as wide plateaus and steep cliffs. Finally, these signatures are related to phase transitions in the spectrum of the feature kernel matrix, and unlike the double descent peak, persist even under optimal regularization.

1 INTRODUCTION

While deep learning races ahead with its impressive list of practical successes, it has often done so by building larger models and training on ever larger datasets. For example, recent large language models leverage billions (Brown et al., 2020; Raffel et al., 2019) or even trillions (Fedus et al., 2021) of parameters. As the state-of-the-art pushes the computational frontier, model development, experimentation, and exploration become increasingly costly, both in terms of computational resources and person-hours. To facilitate the development of improved models, it is becoming ever more necessary to develop theoretical models that can help guide these efforts in high-dimensional scenarios. Indeed, our theoretical understanding is so limited that even linear regression, the most basic machine learning method of all, is still yielding surprises in high dimensions (Hastie et al., 2019).

While linear regression can capture many salient features of high-dimensional data, it is not a reasonable stand-in for high-dimensional learning models. In particular, there is no natural data-independent notion of model size: the number of parameters in the model is intimately tied (usually equal) to the dimensionality of the datapoints. This limitation means that linear regression may not accurately capture the effect of overparameterization, which appears to be a crucial ingredient in modern deep learning models (Zhang et al., 2016).
Beyond linear regression, the next simplest model is arguably the random feature model of Rahimi et al. (2007), which turns out to be tractable in high dimensions while also allowing for a natural data-independent lever to vary model size: the number of random features. Random feature models have also gained independent interest through their relationship to neural kernels and Gaussian processes (Neal, 1996; Lee et al., 2017; Novak et al., 2018; Lee et al., 2019; Jacot et al., 2018). While recent works have examined the asymptotic performance of random feature models, most have done so in the restricted setting of isotropic covariates and target functions (Mei & Montanari, 2019; d'Ascoli et al., 2020; Adlam & Pennington, 2020a;b). Based on the conjectured Gaussian equivalence principle of Goldt et al. (2020b), a few recent works have begun relaxing these simplifying assumptions in an effort to describe more natural data distributions (Loureiro et al., 2021; d'Ascoli et al., 2021). In this work, we extend the high-dimensional random feature linearization technique of Tripuraneni et al. (2021a;b) to allow for anisotropic target functions, revealing a host of novel phenomena related to the alignment of the data and the target weights.

1.1 CONTRIBUTIONS

Our primary contributions are to:

1. Derive theoretical predictions for the test error, bias, and variance of random feature regression for anisotropic Gaussian covariates under general linear target functions in the high-dimensional asymptotic limit (Section 3.1);
2. Prove that overparameterization reduces the total test error, bias, and variance, even in the presence of anisotropy (Section 3.2);
3. Define a partial order on the space of weight-data alignments and prove that generalization performance improves in response to stronger alignment (Definition 2.1, Section 3.3);
4. Demonstrate that anisotropy can induce sample-wise multiple descent, but prove that parameter-wise multiple descent is limited to double descent (Section 3.4);
5. Provide a theoretical explanation for steep cliffs observed in parameter-wise learning curves that persist under optimal regularization, and relate them to phase transitions in the spectrum of the feature kernel matrix (Section 3.4).

1.2 RELATED WORK

Our analysis builds on a series of works that have studied the exact high-dimensional asymptotics of the test error for a growing class of model families and data distributions. Early work in this direction examined minimum-norm interpolating least squares and ridge regression in the high-dimensional random design setting (Belkin et al., 2019; Dobriban et al., 2018; Hastie et al., 2019), finding surprising phenomena such as double descent and a negative optimal ridge constant (Kobak et al., 2020). These analyses were generalized in (Richards et al., 2021; Wu & Xu, 2020; Hastie et al., 2019) to allow for anisotropic covariance matrices and weights that generate the targets. This anisotropic configuration was studied in further detail in (Mel & Ganguli, 2021), uncovering nuanced behavior such as sample-wise multiple descent and a sequence of phase transitions in the performance and optimal regularization. While linear regression can provide insight into generalization performance in high dimensions, it cannot faithfully reproduce the features of non-linear models, such as the effects of overparameterization. Random feature models, on the other hand, can capture many such features while still maintaining analytical tractability.
Recent results have developed a detailed understanding of isotropic random feature regression in the high-dimensional asymptotic limit, providing a precise characterization of peaks in the test error and the benefits of overparameterization (Mei & Montanari, 2019; d Ascoli et al., 2020; Adlam & Pennington, 2020a). The origin of these peaks was explained by means of a fine-grained decomposition of the bias and variance in (Adlam & Pennington, 2020b; Lin & Dobriban, 2020). While most prior work examining the high-dimensional asymptotics of random feature models have leveraged isotropy of the covariate distribution, this assumption was recently relaxed in (Liao et al., 2020) for random Fourier feature models, in (Tripuraneni et al., 2021a;b) in the Published as a conference paper at ICLR 2022 study of covariate shift, and in several works based on the Gaussian equivalence principle of Goldt et al. (2020a;b) (Gerace et al., 2020; Loureiro et al., 2021). Though most of these works did not pursue a detailed investigation of the anisotropy itself, while finalizing this manuscript we became aware of concurrent work (d Ascoli et al., 2021) which does focus on the interplay of the anisotropic input features and target functions, as we do here. Whereas d Ascoli et al. (2021) leverage the Gaussian equivalence principle and the replica method from statistical physics, we use a completely different set of techniques stemming from the literature on random matrix theory and operator-valued free probability (Pennington & Worah, 2018; 2019; Adlam et al., 2019; Adlam & Pennington, 2020a; Louart et al., 2018; Péché et al., 2019; Far et al., 2006; Mingo & Speicher, 2017), which provides a complementary perspective that could be made completely rigorous. d Ascoli et al. (2021) investigates the role of anisotropy not just under squared loss, as we do here, but also under logistic loss, and obtains good numerical support for the Gaussian equivalence conjecture in that setting. In contrast, our narrower and deeper focus on squared loss allows us not only to uncover a host of novel phenomena, including multiple descent, steep error cliffs, and spectral phase changes, but also to rigorously prove several propositions about their existence and properties. We provide a more in-depth discussion of these and other related works in App. A. Our work relies heavily on the concept of alignment (see Def. 2.1) between the data covariance and the target vector coefficients. An analogous definition appears in (Tripuraneni et al., 2021a;b) in the context of covariate shift, leading to a superficial similarity of results. However, the implications and interpretations are otherwise quite different, so we refrain from repeatedly commenting on these recurring connections. 2 PRELIMINARIES 2.1 PROBLEM SETUP AND NOTATION We study random feature regression (Rahimi et al., 2007) as a model for learning an unknown linear function of the data, y(xi) = β xi/ n0 + ϵi , (1) from m independent samples (xi, yi) Rn0 R, i = 1, . . . , m, where the covariates are Gaussian, xi N(0, Σ), and where ϵi N(0, σ2 ϵ ) is additive noise (present on the training samples only). We focus on the high-dimensional proportional asymptotics in which the input feature dimension n0, the hidden layer size n1, and the number of samples m all tend to infinity at the same rate, with φ := n0/m and ψ := n0/n1 held constant. The overparameterization ratio φ/ψ = n1/m is a measure of the normalized complexity of the random feature model. 
Interestingly, under this model in our high-dimensional setup, any nonlinear component of the signal to be learned behaves like additive noise, so the linear function in Eq. (1) does not sacrifice generality (Mei & Montanari, 2019; Adlam & Pennington, 2020a). Following many works on high-dimensional regression (see e.g. Dobriban et al. (2018)), we assume the coefficient vector β is random, β N(0, Σβ), though allowing for deterministic β would be a straightforward extension and would simply involve replacing Σβ ββ throughout. Denoting the training dataset as X = [x1, . . . , xm] and the test point as x, the random features are, F := σ(WX/ n0) and f := σ(Wx/ n0) , (2) where W Rn1 n0 is a random weight matrix with i.i.d. standard Gaussian entries, and σ( ) : R R is an activation function that is applied elementwise. These random features define the kernel K(x1, x2) := 1 n1 σ(Wx1/ n0) σ(Wx2/ n0) , (3) which can be used to compute the model s predictions on a point x as ˆy(x) = Y K 1Kx. Here we have defined the training labels as Y := [y(x1), . . . , y(xm)], and the regularized kernel matrix as K := K(X, X) + γIm, as well as Kx := K(X, x) for γ 0. The training error is given by, EΣ train = EβE[(y(X) ˆy(X))2] = E[(β X/ n0 Y K 1K(X, X))2] , (4) Published as a conference paper at ICLR 2022 and the test error (without label noise on the test point) can be written as EΣ = EβEx E[(y(x) ˆy(x))2] σ2 ϵ = Ex E[(β x/ n0 Y K 1Kx)2] , (5) where the outermost expectation over β has been suppressed since the quantity concentrates sharply around its mean and the inner expectation is computed over all the randomness from training, i.e. W, X, and ϵ. The test error can be decomposed into its bias and variance components as EΣ = Ex E[(E[ˆy(x)] y(x))2] | {z } BΣβ + Ex[V[ˆy(x)]] | {z } VΣβ The bias and variance are computed with respect to all the randomness from training, i.e. W, X, ϵ, which differs from common practice in the statistics literature, but which is necessary to obtain a decomposition that is intuitive and unambiguous when the predictive function relies on randomness from multiple sources (Adlam & Pennington, 2020b; Lin & Dobriban, 2020). As noted by (Hastie et al., 2019; Wu & Xu, 2020; Mel & Ganguli, 2021), the test error of linear regression depends on the geometry of (Σ, Σβ) (or (Σ, β) in the case of nonrandom β), and we will find the same to be true for random feature regression. As such, we decompose the training and β covariance matrices into eigenbases as Σ = Pn0 i=1 λivivi and Σβ = Pn0 i=1 λβ i vβ i vβ i , where the eigenvalues are in nondecreasing magnitude, i.e. λ1 λ2 . . . λn0 and λβ 1 λβ 2 . . . λβ n0. Following (Tripuraneni et al., 2021a;b), we then define the overlap coefficients as qi := v i Σβvi = j=1 (vβ j vi)2λβ j , (7) to measure the alignment of Σβ with the ith eigendirection of Σ. Note that for deterministic β this definition simply gives the square of the component of β in the ith eigendirection, i.e. qi = (v i β)2. 2.2 ASSUMPTIONS In order to define the limiting behavior of the test error in Eq. (5), it is necessary to impose some regularity conditions on Σ and Σβ.1 As in (Wu & Xu, 2020; Tripuraneni et al., 2021a;b), the limiting spectra of these matrices cannot be considered independently because their eigenspaces may align. This alignment is most conveniently described in an eigenbasis of Σ. Assumption 1. We define the joint spectral distribution (JSD) as i=1 δ(λi,qi) (8) and assume it converges in distribution to some µ, a distribution on R2 + as n0 . 
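Before passing to the asymptotic limit, the finite-size setup of Section 2.1 is easy to simulate directly, in the spirit of the finite-m simulations referenced in Fig. 1. The NumPy sketch below is illustrative only: the sizes, the tanh activation, and the particular covariance spectra are arbitrary choices rather than the paper's experimental settings. It draws correlated Gaussian covariates, forms the random features and kernel of Eqs. (2)-(3), evaluates the predictor ŷ(x) = Y K⁻¹ K_x, estimates the test error of Eq. (5), and computes the overlap coefficients q_i of Eq. (7).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; the theory takes n0, n1, m -> infinity with phi = n0/m, psi = n0/n1 fixed.
n0, n1, m, m_test = 200, 300, 250, 2000
gamma, sigma_eps = 1e-3, 0.1
act = np.tanh

# Anisotropic covariances sharing an eigenbasis V: a crude 2-scale Sigma and an aligned Sigma_beta.
V = np.linalg.qr(rng.standard_normal((n0, n0)))[0]
evals = np.repeat([1.8, 0.2], n0 // 2)          # eigenvalues of Sigma
evals_beta = np.repeat([1.6, 0.4], n0 // 2)     # eigenvalues of Sigma_beta
Sigma = V @ np.diag(evals) @ V.T
Sigma_beta = V @ np.diag(evals_beta) @ V.T

# Overlap coefficients q_i = v_i' Sigma_beta v_i of Eq. (7); with a shared eigenbasis, q_i = evals_beta[i].
q = np.sum(V * (Sigma_beta @ V), axis=0)

# Training data and labels, Eq. (1).
beta = rng.multivariate_normal(np.zeros(n0), Sigma_beta)
X = rng.multivariate_normal(np.zeros(n0), Sigma, size=m).T             # n0 x m
Y = beta @ X / np.sqrt(n0) + sigma_eps * rng.standard_normal(m)

# Random features, kernel, and predictor, Eqs. (2)-(3).
W = rng.standard_normal((n1, n0))
F = act(W @ X / np.sqrt(n0))                                           # n1 x m
K = F.T @ F / n1 + gamma * np.eye(m)                                   # K(X, X) + gamma * I_m

X_test = rng.multivariate_normal(np.zeros(n0), Sigma, size=m_test).T
K_x = F.T @ act(W @ X_test / np.sqrt(n0)) / n1                         # K(X, x) for each test point x
y_hat = Y @ np.linalg.solve(K, K_x)

# Test error of Eq. (5): no label noise on the test point.
test_err = np.mean((beta @ X_test / np.sqrt(n0) - y_hat) ** 2)
print(f"phi/psi = {n1 / m:.2f}, empirical test error = {test_err:.4f}")
```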
We refer to µ as the limiting joint spectral distribution (LJSD), and emphasize that this defines the relevant limiting properties of the covariate and weight distributions2. Often we use (λ, q) for random variables sampled jointly from µ and denote the marginal of λ under µ with µdata. Since the conditional expectation E[q|λ] is an important object in our study, we assume the following for simplicity. Assumption 2. µ is either absolutely continuous or a finite sum of delta masses and the expectations of λ and q are finite. Moreover, Eµ[λq] = 1. The normalization condition Eµ[λq] = 1 is not necessary for our analysis, but, as in (Mel & Ganguli, 2021), it enforces consistent signal-to-noise ratios across different target functions and thereby facilitates meaningful comparisons of the test error. 1In fact, the necessary assumptions are identical to those of Tripuraneni et al. (2021a;b) since Σβ plays an analogous role to the test covariance Σ of that work. 2The JSD depends not only on Σ and Σβ but also on a choice of eigendecomposition for Σ when it has repeated eigenvalues; however, as in (Tripuraneni et al., 2021a;b), all possible choices lead to the same formulas and conclusions. Published as a conference paper at ICLR 2022 As we will eventually consider the high-dimensional limit, it is convenient to define the asymptotic covariance scale as s = limn0 1 n0 tr(Σ) = Eµ[λ] under the limiting behavior specified in Assumption 1. One important case is when Σβ = In0, in which case the LJSD degenerates to µ defined by µ (λ, q) := µdata(λ)δ1(q) . (9) Following Tripuraneni et al. (2021a;b), we also enforce the following standard regularity assumptions on the activation functions to ensure the existence of the moments and derivatives we compute. Assumption 3. The activation function σ : R R is assumed to be differentiable almost everywhere. We assume there exists a universal constant C such that, |σ(x)|, |σ (x)| = O(exp(Cx)). 2.3 A SIMPLE FAMILY OF ANISOTROPIC DISTRIBUTIONS The assumptions above allow for a wide class of possible asymptotic covariance structures and spectra. To allow for concrete examples that have intuitive interpretations, we introduce the following simple family of distributions that capture multi-scale structure in the input, which we refer to as d-scale LJSDs. In particular, for d N and real 0 < α 1 and θ, we define µd-scale α,θ := 1 dδ(Cαj,Dαθj), (10) where C := 1 d 1 αd 1 α 1 and D := 1 C 1 d 1 α(θ+1)d 1 α(θ+1) 1 enforce the normalization conditions s = Eµd-scale α,θ [λ] = 1 and Eµd-scale α,θ [qλ] = 1. These simple covariance models capture the hierarchy of scales that often characterize the structure of natural datasets. Note that by setting α = 1 and s = 1, the distribution reduces to the isotropic case with identity covariance. For the nontrivial setting in which α < 1, the data exhibits a sequence of d distinct scales where each scale is a factor α smaller than the previous one. The exponent θ parameterizes the strength of the weight-data alignment in an intuitive way. When θ = 0, there is no alignment, and Σβ is proportional to the identity. When θ > 0, there is strong alignment, as the large eigendirections of the training distribution correspond to large eigendirections of Σβ. When θ < 0, there is anti-alignment, as the large eigendirections of the training distribution correspond to small eigendirections of Σβ. 
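To make the d-scale family concrete: Eq. (10) places mass 1/d on the atoms (λ_j, q_j) = (C α^j, D α^{θj}) for j = 0, ..., d − 1, and the normalization conditions E_µ[λ] = 1 and E_µ[qλ] = 1 force C = d(1 − α)/(1 − α^d) and D = d(1 − α^{θ+1})/(C(1 − α^{(θ+1)d})); these closed forms follow directly from the two stated conditions and assume 0 < α < 1 and θ ≠ −1. The short sketch below builds the atoms and checks the normalizations numerically; the particular values of d, α, θ are illustrative.

```python
import numpy as np

def d_scale_ljsd(d, alpha, theta):
    """Atoms (lambda_j, q_j) and weights 1/d of the d-scale LJSD of Eq. (10).

    C and D are fixed by E[lambda] = 1 and E[q * lambda] = 1 (closed forms
    reconstructed from the stated normalization conditions; assumes alpha < 1
    and theta != -1).
    """
    j = np.arange(d)
    C = d * (1 - alpha) / (1 - alpha ** d)
    D = d * (1 - alpha ** (theta + 1)) / (C * (1 - alpha ** ((theta + 1) * d)))
    lam = C * alpha ** j           # d distinct data scales, each a factor alpha smaller
    q = D * alpha ** (theta * j)   # overlaps; theta > 0 puts large q on large lambda (alignment)
    w = np.full(d, 1.0 / d)
    return lam, q, w

lam, q, w = d_scale_ljsd(d=3, alpha=1e-3, theta=0.5)
print("E[lambda]   =", np.sum(w * lam))       # ~1 by construction
print("E[q*lambda] =", np.sum(w * q * lam))   # ~1 by construction
```

Varying θ at fixed α and d in this construction traces out the alignment axis explored in Figs. 1 and 3.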
2.4 DEFINITION OF THE STRENGTH OF WEIGHT-DATA ALIGNMENT As the preceding discussion has suggested and as we will see explicitly in the following section, the LJSD µ captures all of the information about the pair of covariance matrices (Σ, Σβ) that is relevant for describing the asymptotic test error, bias, and variance. As the alignment between Σ and Σβ can vary in strength among the different eigendirections, the concept of alignment is inherently multi-dimensional. Nevertheless, by requiring strong alignment along the larger eigendirections, it is possible to define the following natural partial order on the space of possible weight-data alignments3 Definition 2.1. Let µ1 and µ2 be LJSDs with the same marginal distribution of λ. If the asymptotic overlap coefficients are such that Eµ1 [λq|λ] /Eµ2 [λq|λ] = Eµ1 [q|λ] /Eµ2 [q|λ] is nondecreasing in λ, we say that µ1 is more strongly aligned than µ2 and write µ1 µ2. Comparing against the case of isotropic weight distribution, µ , we say µ1 is aligned when µ1 µ and anti-aligned when µ1 µ . The parameter θ of the d-scale model in Eq. (10) provides a quantitative measure of alignment under Definition 2.1, formalizing the intuition given in Section 2.3. As we will see in Section 3.3, the asymptotic generalization performance of random feature regression is strictly ordered under this definition of alignment strength. 3Note that this definition is nearly identical to that of Tripuraneni et al. (2021a;b), with the concept of hardness replaced by alignment. Published as a conference paper at ICLR 2022 Figure 1: Test error, bias, and variance as a function of the overparameterization ratio (φ/ψ = n1/m) and the alignment θ for the 2-scale LJSD (Eq. (10)) with φ = n0/m = 10/9, σ = tanh, γ = 10 8 and σ2 ε = 1/100. (a) The total test error exhibits the characteristic double descent behavior for all shift powers. (b) The bias is a nonincreasing function of φ/ψ for all shift powers, as in Proposition 3.1. (c) The variance is the source of the peak, and is a nonincreasing function of φ/ψ for all shift powers in the overparameterized regime, as in Proposition 3.2. Blue, orange, and green lines in (a,b,c) indicate locations of vertical and horizontal slices shown in (d,e,f). (d) 1D horizontal slices of (a,b,c) more clearly demonstrate the monotonicity in φ/ψ predicted by Propositions 3.1 and 3.2. (e) 1D vertical slices demonstrate the monotonicity in θ of the test error and the bias predicted by Proposition 3.3, and illustrate that the variance may not decrease in response to stronger alignment (solid green trace). (f) Test error, bias, and variance normalized by the training error (ie. Eµ+σ2 ε Etrain/Etrain(θ=0); similarly for Bµ, Vµ). Same legend as (e). Constancy of the blue traces illustrates how the train and test error respond identically to alignment. Simulations for m = 4000 (d,e; crosses) agree well with formulas. 3 MAIN RESULTS Our main result characterizes the high-dimensional asymptotic limits of the test error, bias, and variance of the nonlinear random feature model of Section 2. Before stating the result, we introduce some additional notation that captures the effect of the nonlinearity σ, η := Vz N(0,s)[σ(z)] , ρ:= ( 1 s Ez N(0,s)[zσ(z)])2 , ζ := sρ , ω:= s(η/ζ 1) . (11) where, as above, s = limn0 1 n0 tr(Σ) = Eµ[λ]. The constant ω 0 is a measure of the degree of nonlinearity, with ω = 0 corresponding to linear activation functions (see Lemma B.1). Analogously to Tripuraneni et al. 
(2021a;b), we also introduce two functionals of µ, which capture all the relevant spectral information needed for describing the test loss, Ia,b := φ Eµ λa (φ + xλ) b and Iβ a,b := φ Eµ qλa (φ + xλ) b . (12) 3.1 FORMULAS FOR ASYMPTOTIC BIAS, VARIANCE, AND TEST ERROR Theorem 3.1. Under Assumptions 1-3, as n0, n1, m with φ = n0/m and ψ = n0/n1 held constant, the training error EΣ train tends toward the value of Eµ train where Eµ train = γ2 γ(τ1Iβ 1,1) + σ2 ε γτ1 , (13) Published as a conference paper at ICLR 2022 Figure 2: Sample-wise multiple descent for anisotropic data as a function of nonlinearity strength ω. All panels refer to a 3-scale covariance model with α = 103, and ψ = 1 2, s = ρ = 1, γ = 10 13 and σ2 ε = 10. (a) Error exhibits multiple peaks as a function of the the sampling density φ 1. As ω is increased, peaks are attenuated and eventually disappear in sequence starting from those due to the weakest data scales (at m = n0) to those due to the strongest data scales (at m = n0/3). (b) Limiting spectral density of the kernel matrix K(X, X) = F T F/n1. Vertical slices correspond to the spectral density at the corresponding value of φ 1 for ω = 10 10. Peaks in the error curves in (a) occur at critical values m = 1 3n0, n0 where a new spectral component first appears. (c) Consistent with this, sample-wise learning curves exhibiting multiple descent at low ω pass through several transitions in the number of spectral components, while those at higher ω with fewer peaks pass through fewer component transitions. and the bias, variance, and test error BΣβ, VΣβ, EΣβ tend toward Bµ, Vµ, Eµ with Eµ = Eµ train γ2τ 2 1 σ2 ε , Bµ = φIβ 1,2 , and Vµ = Eµ Bµ , (14) where ρ, ω are defined in (11), Ia,b, Iβ a,b are defined in (12), and x is the unique nonnegative real root of x = 1 γτ1 ω+I1,1 with τ1 = (ψ φ)2+4xψφγ/ρ+ψ φ Remark 3.1. As noted in (Adlam & Pennington, 2020a; Hastie et al., 2019) for isotropic covariates, the test error is simply related to the training error through the generalized cross-validation (GCV) formula (Golub et al., 1979). Curiously, because τ1 is independent of β, it follows that the test error depends on β entirely through its effect on the training error; in contrast, the bias-variance decomposition retains explicit β dependence, even conditional on the training error. The results in Theorem 3.1 depend on a single scalar self-consistent equation for x, x = 1 γτ1 ω+I1,1 , which is in fact the same as the one appearing in Tripuraneni et al. (2021a;b), and which significantly simplifies the expressions relative to those recently obtained using the replica method (d Ascoli et al., 2021). Owing to its simplicity, this equation admits straightforward analysis that we pursue in the Appendix, yielding numerous inequalities and bounds that ultimately prove the propositions presented throughout this section. We will occasionally refer to the ridgeless limit of Theorem 3.1, which is given in Corollary G.1. Finally, as a consistency check, taking σ(x) = x and ψ 0 in Theorem 3.1 yields an expression for the test error, bias, and variance for linear regression that agrees with the results of (Wu & Xu, 2020; Mel & Ganguli, 2021) (see Section E). 3.2 THE BENEFITS OF OVERPARAMETERIZATION While there is abundant empirical evidence that overparameterization can improve the generalization performance of practical models (see e.g. (Zhang et al., 2016)), rigorous explanations for this behavior have been offered solely in the setting of isotropic covariates and target functions. 
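To illustrate how little machinery the self-consistent equation requires, the sketch below evaluates the ridgeless (γ → 0) specialization for a discrete LJSD. It computes the nonlinearity constants as read off Eq. (11) by Gauss-Hermite quadrature, solves x(ω + I_{1,1}(x)) = min(1, φ/ψ) by bisection (this γ → 0 form of the equation in Theorem 3.1 is inferred from the ridgeless discussion in Section 3.4 and the proofs of Propositions 3.2 and 3.4, so treat it as an assumption of the sketch), and evaluates the bias Bµ = φ I^β_{1,2}. The example LJSD, activation, and ratios (φ, ψ) are illustrative, and the full finite-γ formulas, which also involve τ₁ and the training error, are not reproduced here.

```python
import numpy as np

# --- nonlinearity constants of Eq. (11), via Gauss-Hermite quadrature ---
def nonlinearity_constants(act, s, n_quad=101):
    z, wts = np.polynomial.hermite_e.hermegauss(n_quad)  # weight exp(-z^2 / 2)
    wts = wts / np.sqrt(2.0 * np.pi)                     # normalize to N(0, 1)
    z = np.sqrt(s) * z                                   # z ~ N(0, s)
    mean = np.sum(wts * act(z))
    eta = np.sum(wts * act(z) ** 2) - mean ** 2          # Var[sigma(z)]
    rho = (np.sum(wts * z * act(z)) / s) ** 2
    zeta = s * rho
    omega = s * (eta / zeta - 1.0)                       # omega = 0 for linear activations
    return eta, rho, zeta, omega

# --- spectral functionals of Eq. (12) for a discrete LJSD (atoms lam, q with weights w) ---
def I(a, b, x, phi, lam, w):
    return phi * np.sum(w * lam ** a / (phi + x * lam) ** b)

def I_beta(a, b, x, phi, lam, q, w):
    return phi * np.sum(w * q * lam ** a / (phi + x * lam) ** b)

# --- ridgeless self-consistent equation: x * (omega + I_{1,1}(x)) = min(1, phi/psi) ---
def solve_x(phi, psi, omega, lam, w, tol=1e-12):
    target = min(1.0, phi / psi)
    g = lambda x: x * (omega + I(1, 1, x, phi, lam, w)) - target  # increasing in x
    hi = 1.0
    while g(hi) < 0:                                              # bracket the root (assumes omega > 0)
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol * max(hi, 1.0):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

# Illustrative 2-scale LJSD, normalized so that E[lam] = E[q * lam] = 1.
w = np.array([0.5, 0.5])
lam = np.array([1.8, 0.2])
q = np.array([1.4, 0.6]); q = q / np.sum(w * q * lam)

s = np.sum(w * lam)                                  # = E[lam]
_, _, _, omega = nonlinearity_constants(np.tanh, s)

phi, psi = 10.0 / 9.0, 0.5                           # n0/m and n0/n1, illustrative
x = solve_x(phi, psi, omega, lam, w)
bias = phi * I_beta(1, 2, x, phi, lam, q, w)         # B_mu = phi * I^beta_{1,2}
print(f"omega = {omega:.4f}, x = {x:.4f}, bias = {bias:.4f}")
```

Sweeping ψ at fixed φ with this routine gives a quick numerical check of the monotonicity statements discussed next in Section 3.2.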
It is therefore natural to wonder whether the benefits of overparmeterization persist in the presence of anisotropy, and indeed recent theoretical work has provided numerical evidence hinting that overparameterization is beneficial (d Ascoli et al., 2021). The following two results prove this to be the case. First, we show that the bias decreases (or stays constant) in response to an increase in the number of random features, regardless of any anisotropy in the covariates or target function weights. Published as a conference paper at ICLR 2022 Proposition 3.1. In the setting of Theorem 3.1, the bias Bµ is a nonincreasing function of the overparameterization ratio φ/ψ. The same is true for the variance in the overparameterized regime. Note that our proof requires the ridgeless setting, but numerical simulations suggest that the result may hold generally (see Fig. 1). Proposition 3.2. In the ridgeless limit and in the overparameterized regime (ψ < φ), the variance Vµ is a nonincreasing function of overparameterization ratio φ/ψ. In Fig. 1, Propositions 3.1 and 3.2 are illustrated. Moving left to right in panels (a)-(d) leads to models with more parameters. The monotonicity of the bias across the whole range of parameterization is evident, as is the monotonicity of the variance in the overparmaeterized regime. 3.3 WEIGHT-DATA ALIGNMENT REDUCES THE BIAS AND TEST ERROR A number of recent works have confirmed the intuition that generalization performance should improve if the weights of the target function align with the large eigendirections of the training covariance, with formal theoretical arguments in the case of linear regression (Hastie et al., 2019; Mel & Ganguli, 2021), and with informal numerical simulations for random feature models (d Ascoli et al., 2021). The following result proves that this informal observation holds in full generality, not only for the total test error, but for the bias and the bias-to-variance ratio as well. Proposition 3.3. Let µ1, µ2 be two LJSDs such that µ1 µ2 (see Definition 2.1). Then Bµ1 Bµ2, Eµ1 Eµ2, and Bµ1/Vµ1 Bµ2/Vµ2. We illustrate Proposition 3.3 in Fig. 1 for the 2-scale LJSD of Eq. (10). Following vertical lines upward in panels (a-c) or the x-axis left-to-right in (e,f) yields stronger alignment and a corresponding decrease in the bias and test error. Panel (f) shows the alignment-dependence of the bias, variance, and test error when normalized by the training error. As suggested by Remark 3.1, the normalized test error is independent of the alignment strength, and as intuition would suggest and as predicted by Proposition 3.3, the normalized bias decreases in response to stronger alignment. 3.4 ANISOTROPY INDUCES STRUCTURED LEARNING CURVES Strong anisotropy has been observed to induce structured learning curves in the context of linear regression (Mel & Ganguli (2021)). It is natural to wonder whether these effects generalize to the case of nonlinear random features. Figure 2 shows that this is indeed the case for sample-wise learning curves, where φ 1 = m/n0 is varied while ψ = n0/n1 is held fixed. Horizontal slices through Fig. 2 (a) exhibit sample-wise multiple descent. Fig. 2 (b) and (c) show that these peaks are associated with phase transitions at critical values of φ 1 where the spectrum of the kernel matrix 1 n1 F T F acquires a new component. Increased nonlinearity ω attenuates this effect. 
In contrast, the following proposition shows that parameter-wise multiple descent does not occur even in the presence of strong anisotropy, as long as µ is aligned. Proposition 3.4. If µ is aligned (see Definition 2.1), then, in the ridgeless limit, the test error has at most two interior critical points as a function of the overparameterization ratio φ/ψ. Remark 3.2. Since k-fold descent requires at least k critical points, we conclude that multiple (k > 2) descent does not occur, even in the presence of covariate anisotropy, so long as µ is aligned. Fig. 3(a,b) shows parameter-wise learning curves for various values of θ in the 2and 3-scale models. As predicted by Proposition 3.4, the curves for aligned LJSDs (θ 0) have at most two critical points (not shown). In fact, even for anti-aligned LJSDs (θ < 0), most curves also have at most two critical points; however, for small enough θ, an additional critical point emerges, and indeed the upper-most traces in Fig. 3(b) showcase a second peak in the learning curves. Even though parameter-wise multiple descent is not possible for aligned LJSDs, strong anisotropy can nevertheless induce other signatures, including wide plateaus and steep cliffs. To understand the origin of these effects, consider the ridgeless limit, for which the self-consistent equation for x is, φ + Eµ λ λ + φ Published as a conference paper at ICLR 2022 Figure 3: Strong anisotropy causes steep cliffs in the test error as a function of the overparameterization ratio φ/ψ = n1/m. φ = 5/6, γ = 10 22, s = ρ = 1, ω = 10 16 and σ2 ε = 10 4. Top row: d = 2-scale model with α = 10 5; bottom row: 3-scale model with α = 10 6. (a,b) Test error exhibits the characteristic peak at the interpolation threshold, but additionally displays steep cliffs at critical values of n1 corresponding to multiples of n0/d (b, dashed lines; m also shown). The θ value associated to each trace is indicated by the colored ticks in (a). (c) Unlike the peak at the interpolation threshold, cliffs in the error traces in (b) persist under optimal regularization (error values from analytical formulas were numerically optimized over γ). (d) The spectrum of the kernel matrix K(X, X) = 1 n1 F F undergoes a phase transition at each of the critical values of n1 where a cliff occurs (dashed lines). The phase transition associated to the interpolation threshold at n1 = m is also visible in the lower panel, but is out of range (at 10 17) in the upper panel. For small ω, the left side becomes relatively insensitive to changes in x whenever λ φ/x λ+ for some wide spectral gap with boundary (λ , λ+). It follows that, in the underparameterized regime (φ < ψ), the derivative (φ/ψ)/ x is small, or, equivalently, x/ (φ/ψ) is large. This strong sensitivity of x to changes in the overparameterization ratio φ/ψ induces similarly strong changes to the error because Eµ/ x is bounded. As a result, the parameter-wise learning curves exhibit sharp downward cliffs in the underparameterized regime for sufficiently small ω and sufficiently large spectral gaps λ+/λ . We make this argument more precise in Appendix F. Fig. 3(a) illustrates these cliffs in the context of the d = 2and 3-scale LJSD. Along horizontal slices, the error turns sharply downward as a function of n1 φ/ψ at integer multiples of n0/d. As can be seen in Fig. 
3(b), when the alignment is increased and β overlaps more with large λs, the first error cliff, corresponding to the largest scales in the data, strengthens relative to others (cliff at φ/ψ = 1/2 in the 2-scale model and φ/ψ = 1/3 in the 3-scale model). Fig. 3(c) shows that, unlike the peak at the interpolation threshold, cliffs persist under optimal regularization. Finally, Fig. 3(d) illustrates how steep decreases in the error are associated with the appearance of a new component in the spectrum of the kernel matrix at critical values of φ/ψ. 4 CONCLUSIONS We presented an exact calculation of the limiting test error, bias, and variance for random feature kernel regression with anisotropic covariates and target function weights. We defined a partial order over weight-data alignments and proved that stronger alignment decreases the bias and test error. We also proved that the benefits of overparameterization persist in the anisotropic setting, and that weight-data alignment limits parameter-wise multiple descent to double descent. In contrast, we demonstrated that anisotropy can induce sample-wise multiple descent, and parameter-wise cliffs, and that their structure is dictated by the eigenstructure of the data covariance. Future directions include extending our results to the non-asymptotic regime, accommodating feature learning and more general neural network models, and investigating the impact of anisotropy for other loss functions. Published as a conference paper at ICLR 2022 Ben Adlam and Jeffrey Pennington. The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. In International Conference on Machine Learning, pp. 74 84. PMLR, 2020a. Ben Adlam and Jeffrey Pennington. Understanding double descent requires a fine-grained biasvariance decomposition. ar Xiv preprint ar Xiv:2011.03321, 2020b. Ben Adlam, Jake Levinson, and Jeffrey Pennington. A random matrix perspective on mixtures of nonlinearities for deep learning. ar Xiv preprint ar Xiv:1912.00827, 2019. Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849 15854, 2019. Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. ar Xiv preprint ar Xiv:2005.14165, 2020. Stéphane d Ascoli, Levent Sagun, and Giulio Biroli. Triple descent and the two kinds of overfitting: Where & why do they appear? ar Xiv preprint ar Xiv:2006.03509, 2020. Stéphane d Ascoli, Marylou Gabrié, Levent Sagun, and Giulio Biroli. On the interplay between data structure and loss function in classification problems, 2021. Edgar Dobriban, Stefan Wager, et al. High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247 279, 2018. Reza Rashidi Far, Tamer Oraby, Wlodzimierz Bryc, and Roland Speicher. Spectra of large block matrices. ar Xiv preprint cs/0610045, 2006. William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. ar Xiv preprint ar Xiv:2101.03961, 2021. Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Generalisation error in learning with random features and the hidden manifold model, 2020. 
Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Limitations of lazy training of two-layers neural networks, 2019. Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. When do neural networks outperform kernel methods?, 2021. Sebastian Goldt, Marc Mézard, Florent Krzakala, and Lenka Zdeborová. Modeling the influence of data structure on learning in neural networks: The hidden manifold model. Physical Review X, 10 (4):041044, 2020a. Sebastian Goldt, Galen Reeves, Marc Mézard, Florent Krzakala, and Lenka Zdeborová. The gaussian equivalence of generative models for learning with two-layer neural networks. ar Xiv e-prints, pp. ar Xiv 2006, 2020b. Gene H Golub, Michael Heath, and Grace Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215 223, 1979. Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in highdimensional ridgeless least squares interpolation. ar Xiv preprint ar Xiv:1903.08560, 2019. J. William Helton, Scott A. Mc Cullough, and Victor Vinnikov. Noncommutative convexity arises from linear matrix inequalities. Journal of Functional Analysis, 240:105 191, 2006. Hong Hu and Yue M. Lu. Universality laws for high-dimensional learning with random features, 2021. Published as a conference paper at ICLR 2022 Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. ar Xiv preprint ar Xiv:1806.07572, 2018. Noureddine El Karoui. The spectrum of kernel random matrices. Annals of Statistics, 38:1 50, 2010. Dmitry Kobak, Jonathan Lomond, and Benoit Sanchez. The optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the implicit ridge regularization. J. Mach. Learn. Res., 21:169 1, 2020. Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as gaussian processes. ar Xiv preprint ar Xiv:1711.00165, 2017. Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. Advances in neural information processing systems, 32:8572 8583, 2019. Zhenyu Liao, Romain Couillet, and Michael W Mahoney. A random matrix analysis of random fourier features: beyond the gaussian kernel, a precise phase transition, and the corresponding double descent. ar Xiv preprint ar Xiv:2006.05013, 2020. Licong Lin and Edgar Dobriban. What causes the test error? going beyond bias-variance via anova. ar Xiv preprint ar Xiv:2010.05170, 2020. Cosme Louart, Zhenyu Liao, Romain Couillet, et al. A random matrix approach to neural networks. The Annals of Applied Probability, 28(2):1190 1248, 2018. Bruno Loureiro, Cédric Gerbelot, Hugo Cui, Sebastian Goldt, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Learning curves of generic features maps for realistic datasets with a teacher-student model, 2021. Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. ar Xiv preprint ar Xiv:1908.05355, 2019. Gabriel Mel and Surya Ganguli. A theory of high dimensional regression with arbitrary correlations between input features and target functions: sample complexity, multiple descent curves and a hierarchy of phase transitions. In International Conference on Machine Learning, pp. 7578 7587. PMLR, 2021. 
James A Mingo and Roland Speicher. Free probability and random matrices, volume 35. Springer, 2017. Radford M Neal. Priors for infinite networks. In Bayesian Learning for Neural Networks, pp. 29 53. Springer, 1996. Roman Novak, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Greg Yang, Jiri Hron, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks with many channels are gaussian processes. ar Xiv preprint ar Xiv:1810.05148, 2018. S Péché et al. A note on the pennington-worah distribution. Electronic Communications in Probability, 24, 2019. Jeffrey Pennington and Pratik Worah. The spectrum of the fisher information matrix of a singlehidden-layer neural network. In Neur IPS, pp. 5415 5424, 2018. Jeffrey Pennington and Pratik Worah. Nonlinear random matrix theory for deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124005, 2019. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. ar Xiv preprint ar Xiv:1910.10683, 2019. Ali Rahimi, Benjamin Recht, et al. Random features for large-scale kernel machines. In NIPS, volume 3, pp. 5. Citeseer, 2007. Published as a conference paper at ICLR 2022 Dominic Richards, Jaouad Mourtada, and Lorenzo Rosasco. Asymptotics of ridge(less) regression under general source condition. In Arindam Banerjee and Kenji Fukumizu (eds.), Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pp. 3889 3897. PMLR, 13 15 Apr 2021. URL http://proceedings.mlr.press/v130/richards21b.html. Nilesh Tripuraneni, Ben Adlam, and Jeffrey Pennington. Covariate shift in high-dimensional random feature regression. ar Xiv preprint ar Xiv:2111.08234, 2021a. Nilesh Tripuraneni, Ben Adlam, and Jeffrey Pennington. Overparameterization improves robustness to covariate shift in high dimensions. Advances in Neural Information Processing Systems, 34, 2021b. Denny Wu and Ji Xu. On the optimal weighted ℓ2 regularization in overparameterized linear regression. ar Xiv preprint ar Xiv:2006.05800, 2020. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. ar Xiv preprint ar Xiv:1611.03530, 2016. Published as a conference paper at ICLR 2022 Supplementary Material: Anisotropic Random Feature Regression in High Dimensions TABLE OF CONTENTS: SUPPLEMENTARY MATERIAL A Additional discussion of related work 2 A.1 Asymptotic error formulas and Theorem 3.1 . . . . . . . . . . . . . . . . . . . . . 2 A.2 Gaussian equivalents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 A.3 Weight-data alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 B Useful inequalities 3 B.1 Basic properties of the self-consistent equation for x . . . . . . . . . . . . . . . . 3 B.2 I and Iβ inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 C Weight-data alignment is a partial order 6 D Proofs of propositions 6 D.1 Proposition 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 D.2 Proposition 3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 D.3 Proposition 3.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 D.4 Proposition 3.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . 8 E Linear regression limit 10 E.1 Comparison to Mel & Ganguli (2021) . . . . . . . . . . . . . . . . . . . . . . . . 10 E.2 Comparison to Wu & Xu (2020) . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 F Structured learning curves 12 F.1 Effect of spectral gap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 F.2 Analysis of the D-scale model in the separated limit . . . . . . . . . . . . . . . . . 14 G Proof of Theorem 3.1 15 G.1 Decomposition of the test loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 G.2 Decomposition of the bias and total variance . . . . . . . . . . . . . . . . . . . . . 16 G.3 Summary of linearized trace terms . . . . . . . . . . . . . . . . . . . . . . . . . . 17 G.4 Calculation of error terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 G.5 Final result for bias, variance, and test error . . . . . . . . . . . . . . . . . . . . . 31 Published as a conference paper at ICLR 2022 A ADDITIONAL DISCUSSION OF RELATED WORK Since preparing the initial version of this manuscript, we became aware of several related recent and concurrent works. Our main contribution relative to these works is the detailed phenomenology explored throughout the main text, but here we provide a more detailed discussion of some of the connections between the current paper and other work, highlighting the main similarities and differences. A.1 ASYMPTOTIC ERROR FORMULAS AND THEOREM 3.1 Alternative asymptotic formulas for the test error of anisotropic random feature regression have been presented in a set of concurrent works, yielding overlapping results with Theorem 3.1. The result that is most closely related is that of d Ascoli et al. (2021), who also study the random feature model with anisotropic input data and target function. They derive formulas for arbitrary convex loss function (generalizing our setup) and then focus on two specialized learning scenarios: (1) linear target function with additive noise and quadratic loss, corresponding to the case studied here; and (2) discrete class labels sign(βT x/ n0) with label-flipping noise and logistic cost function. Their main results rely on a Gaussian equivalence theorem for anisotropic data, which they derive assuming the Gaussian equivalence principle of Goldt et al. (2020a;b), and then proceed via a standard replica calculation. Another salient paper is that of Loureiro et al. (2021), who study generic feature maps for student-teacher models. A Gaussian covariate model is proposed and rigorous asymptotic solutions are derived for it using Gaussian comparison inequalities, which are shown to agree with calculations from the replica method. The model is general enough to facilitate comparisons to realistic datasets, and numerical evidence and universality arguments support the utility of the model for exactly describing the random feature model we examine here, among many other applications. Our technical approach proceeds in a substantially different manner, using tools from random matrix theory and operator-valued free probability, rather than statistical physics techniques or the replica method. The results could be made entirely rigorous, though here we simply present the pertinent calculations and defer justification of the underlying linearization techniques to future work and to (Tripuraneni et al., 2021a;b). 
Our analysis ultimately yields final expressions with a relatively simple form, involving only a single scalar self-consistent equation, which lends itself to more straightforward downstream calculations and analysis (e.g. Propositions 3.1-3.4 and Corollary G.1). Finally, beyond the total error, we also derive formulas for the bias and variance, which aid significantly in the interpretation of the phenomenology, and are novel results. Interestingly, the order parameter Q from d Ascoli et al. (2021) (and others) is interpreted as the variance of the student s outputs, but actually differs from the variance defined in Eq. (6). The reason is that the bias-variance decomposition is defined conditionally on x. Because the conditional mean is nonzero, i.e. E[ˆy|x] = 0, Q actually corresponds to an uncentered second moment, and corresponds to our term E3 (defined in Eq. (S147)) which differs from the total variance by the non-trivial additive term E4 = Ex[E[ˆy|x]2] (defined in Eq. (S162)). A more thorough discussion of these and related concepts is given by Adlam & Pennington (2020b). A.2 GAUSSIAN EQUIVALENTS Our calculations utilize the concept of Gaussian equivalents, in particular a linear-signal-plus-noise surrogate F lin for the random feature matrix F. This approach originates from Karoui (2010) in the context of kernel random matrices of the form Kij = σ(X i Xj/n0) or Kij = σ( Xi Xj 2/n0) and from Pennington & Worah (2019); Adlam et al. (2019); Péché et al. (2019) for the covariance matrices F F/n1 studied here. This linearization technique was further developed by Adlam et al. (2019) for anisotropic covariates (and in the presence of bias), where it was shown to be sufficient for predicting the training error for random feature ridge regression with isotropic linear target functions. In the setting of spherical data and weights, Mei & Montanari (2019) extended these results to cover the test error as well. In this case, many of the main technical results could build directly from Karoui (2010) owing to a decomposition of kernel inner product matrices (Mei & Montanari, 2019, Section C.4) that ultimately relies on the orthogonality of Gegenbauer polynomials and cannot (immediately) be extended to the Gaussian settings studied here and elsewhere. For random feature regression with the neural tangent kernel (which subsumes the standard random feature setting), a proof of the linearization was outlined for the test error in the setting of isototropic Gaussian covariates by Adlam Published as a conference paper at ICLR 2022 & Pennington (2020a). As mentioned previously, an extension to anisotropic Gaussian covariates was later developed by Tripuraneni et al. (2021a;b), which is the basis for our analysis in this work. In a parallel and largely independent line of work stemming from Goldt et al. (2020a), a nearly identical approach is developed under the name of the Gaussian equivalence property, under which a possibly nonlinear target function and/or prediction are replaced with simple linear Gaussian equivalents. This is a crucial step in the analysis in Goldt et al. (2020b); Gerace et al. (2020); d Ascoli et al. (2021) and related work. For example, Gerace et al. (2020) use this principle (in fact, a stronger version they refer to as replicated Gaussian equivalence") in order to perform their replica analysis of the isotropic random feature model. 
Subsequently, in the isotropic setting, this principle was rigorously justified using the Lindeburg exchange method under a variety of technical assumptions on the data distribution, weight distributions, nonlinear activation function, and target function (Hu & Lu, 2021). Goldt et al. (2020b) relaxes some of these conditions and provides extensive tests of the resulting formulas on real-world datasets. To perform the analysis for anisotropic input data and target function weights, as is pursued in d Ascoli et al. (2021), an anisotropic extension of the Gaussian equivalence theorem is required. Substantial numerical evidence and theoretical arguments are presented by d Ascoli et al. (2021); Loureiro et al. (2021) for the validity of this extension, but to the best of our knowledge a rigorous proof in this context has not been established. A.3 WEIGHT-DATA ALIGNMENT One of the basic conclusions of our study of anisotropy is that weight-data alignment generally improves performance. Similar observations appear in several recent works, albeit in slightly different contexts. For example, Ghorbani et al. (2019) study the random feature model with isotropic inputs, but anisotropic weights, in the case of a fixed quadratic target function and derives an asymptotic formula for the test error in the population limit (ie. m n0, n1). For wide networks, n1 n0, the error simplifies and is exactly proportional to a simple measure of alignment between the random feature weights and the target, that is loosely related to the measure we propose in Definition 2.1. Ghorbani et al. (2021) also study the random feature model in the population limit, and makes the assumption that the target function is sensitive to a much lower dimensional subspace of the input by positing sub-linear scaling of the dimensionality of the relevant subspace. They show that increasing the power of the input data in this subspace generally decreases test error and the number of random features required to learn a function of fixed complexity. Although the learning contexts and the final scaling limits for m, n0, n1 are distinct, these phenomena parallel our main result on alignment (see e.g. Fig. 3b for illustration in the context of the d-scale model). A main contribution of the current paper is the partial order on the space of weight-data alignments, which allows us to prove that the total error and the bias decrease in response to stronger alignment (Proposition 3.3). Our results in this vein are most directly related to those of d Ascoli et al. (2021), who informally observe a basic relationship between weight-data alignment and performance, though the impact of alignment is also investigated elsewhere, e.g. Loureiro et al. (2021, Fig. 2). While these works informally examine concept of alignment, the conclusions about it derive from numerical evaluation of the formulas, and as such the generality of some of the results remains unclear and some of the underlying phenomena are partially obfuscated. For example, it is not clear why the isotropic and misaligned curves cross each other in of d Ascoli et al. (2021, Fig. 2c): naively, one might expect the misaligned model to always perform worse. Our results provide a nice perspective on this behavior: owing to the differing covariate distributions, the two forms of alignment are incomparable under the partial order. B USEFUL INEQUALITIES Here we include the statements and proofs of several auxiliary inequalities that we use throughout the Supplementary Material. 
B.1 BASIC PROPERTIES OF THE SELF-CONSISTENT EQUATION FOR x We begin by reviewing the basic inequalities, first given in (Tripuraneni et al., 2021a;b). The definitions of the following quantities can be found in Theorem 3.1. Published as a conference paper at ICLR 2022 Lemma B.1 (Adapted from (Tripuraneni et al., 2021a;b)). We have the following bounds: ω, τ1, τ1, x, Ia,b, Iβ a,b 0 and x Proof. As shown in (Pennington & Worah, 2018) for the unit-variance case, a simple Hermite expansion argument establishes the relation η ζ, which implies ω = s(η/ζ 1) 0. From Appendix G.4.1, τ1 and τ1 are traces of positive semi-definite matrices and are therefore nonnegative. From the same equations, it follows that x = γρτ1 τ1 0. Nonnegativity of x implies Ia,b 0 and Iβ a,b 0 from their definitions in (12). Finally, using the nonnegativity of ω, τ1, τ1, x, and Ia,b, the expression for x γ in Theorem 3.1 immediately gives, x γ = x γ + ργ( ψ φ τ1 + τ1)(ω + φI1,2) 0. (S1) Next we show that the self-consistent equation x = 1 γτ1 ω+I1,1 appearing in Theorem 3.1 and defined in (S237) admits a unique positive real solution for x. Lemma B.2 (Adapted from (Tripuraneni et al., 2021a;b)). There is a unique real x 0 satisfying x = 1 γτ1 Proof. Let t = 1/x 0 and define, h(t) = t ρ(ψ φ) + p ρ2(ψ φ)2 + 4γρφψ/t 2ρψ 1 + ω + I1,1(1/t) , (S2) which is a rewriting of eqn. (S237), so it suffices to show that h admits a unique real positive root. To that end, first observe that limt 0 I1,1(1/t) = 0 and limt I1,1(1/t) = s so that h(0) = ω > 0 and lim t h(t)/t = min{1, φ/ψ} < 0 , (S3) which together imply that h has an odd number of positive real roots. Next, we show that h is concave for t 0: γ2ρφψ (ρ2(ψ φ)2 + 4γρφψ/t)3/2 + I2,3(1/t) (S4) which implies that h has at most two positive real roots. Therefore, we conclude that h has exactly one positive real root. To provide a bounding interval for this root, we first observe that, lim t h(t) min{1, φ/ψ}t + ω + s + γφ ρ|ψ φ| = 0 , (S6) so that h(t) can be upperand lower-bounded by linear functions , ω min{1, φ/ψ}t h(t) ω + s + γφ ρ|ψ φ| min{1, φ/ψ}t . (S7) The roots of these linear functions bound the root of h, so we have min{1, φ/ψ} γφ ρ|ψ φ| + ω + s x min{1, φ/ψ} Published as a conference paper at ICLR 2022 B.2 I AND Iβ INEQUALITIES We now establish some useful properties of the I and Iβ functionals defined in (12). To begin, we note that simple algebraic manipulations establish the following raising and lowering identities: Ia 1,b 1 = φIa 1,b + x Ia,b and Iβ a 1,b 1 = φIβ a 1,b + x Iβ a,b . (S9) Next, we consider how the partial order of LJSDs given in Definition 2.1 leads to inequalities on the Iβ functionals. Letting (Iβ a,b)1 and (Iβ a,b)2 to denote the corresponding functionals with the LJSDs µ1 and µ2 respectively, we can establish the following useful lemma. Lemma B.3 (Adapted from (Tripuraneni et al., 2021a;b)). Let µ1 µ2, so µ1 is more strongly aligned than µ2 (recall Definition 2.1). Suppose the functions f, g, h : R R are such that f(λ) = g(λ)h(λ) and h(λ) is nonincreasing for all λ > 0, then Eµ1[qf(λ)] Eµ2[qf(λ)] Eµ1[qg(λ)] Eµ2[qg(λ)]. (S10) Proof. By the law of iterated expectation, we have Eµ1[qf(λ)] = Eµ2[qg(λ)]Eλ Eµ2[qg(λ)|λ] Eµ2[qg(λ)] Eµ1[q|λ] Eµ2[q|λ]h(λ) . (S11) Note that the expectation Eλ in (S11) over λ is the same under µ1 and µ2 by assumption. Moreover, the function h(λ) is nonincreasing in λ by assumption. 
Finally, observe that the factor Eµ2[rg(λ)|λ]/Eµ2[rg(λ)] defines a change in distribution for the random variable λ, since taking its expectation over λ yields 1. Denote a new random with this distribution by λ. Then, we may apply the Harris inequality to see Eµ1[qf(λ)] = Eµ2[qg(λ)]E λ " Eµ1[q| λ] Eµ2[q| λ] h( λ) Eµ2[qg(λ)]E λ " Eµ1[q| λ] Eµ2[q| λ] E λ h h( λ) i (S13) Eµ2[qg(λ)]Eλ Eµ2[qg(λ)|λ] Eµ2[qg(λ)] Eµ1[q|λ] Eµ2[q|λ] Eµ2[qg(λ)|λ] Eµ2[qg(λ)] h(λ) (S14) = Eµ1[qg(λ)] Eµ2[qg(λ)]Eµ2 [qf(λ)] . (S15) Corollary B.1. Let µ1 µ2 and (Iβ a,b)i := φ Eµi qλa (φ + xλ) b . Then, for a 1 and b 0, 2 1 Eµ1 [λq] Proof. Note that h : λ 7 φλa 1(φ + xλ) b is a nonincreasing function of λ 0. Then, setting g = λ and f = gh in Lemma B.3 gives the desired result. Lemma B.4. Suppose the functions f, g, h : R R are such that f(λ) = λg(λ)h(λ) and h(λ) is nonincreasing for all λ > 0. Then, if the LJSD µ is aligned (see Definition 2.1), then Eλ[g]Eµ[qf] Eµ[qg]Eλ[f]. Published as a conference paper at ICLR 2022 Eµ[qf] = Eµ[qλgh] (S17) = Eλ[Eµ[qλ|λ]g(λ)h(λ)] (S18) Eµ[qλ|λ]h(λ) g(λ) Eµ[qλ|λ] g(λ) = 1 Eλ[g]Eµ[qλg]Eλ[f] , (S21) where Eµ[qλ|λ] is nondecreasing in λ because µ is aligned, so the inequality again follows from the Harris inequality. Corollary B.2. If µ is aligned, Ia,b Iβ a,b Ia 1,b Iβ a+1,b. Proof. Take g : λ φλa(φ + xλ) b and h : λ 1/λ in Lemma B.4. C WEIGHT-DATA ALIGNMENT IS A PARTIAL ORDER We restate Definition 2.1 for reference, and prove that it defines a partial order. The definition and proof are identical to those of Tripuraneni et al. (2021a;b), but differ in notation so we repeat them here for clarity. Definition C.1 (Restatement of Definition 2.1). Let µ1 and µ2 be LJSDs with the same marginal distribution of λ. If the asymptotic overlap coefficients are such that Eµ1 [λq|λ] /Eµ2 [λq|λ] = Eµ1 [q|λ] /Eµ2 [q|λ] is nondecreasing in λ, we say that µ1 is more strongly aligned than µ2 and write µ1 µ2. Comparing against the case of isotropic weight distribution, µ , we say µ1 is aligned when µ1 µ and anti-aligned when µ1 µ . Proposition C.1. Definition 2.1 is a partial order over over weight-data alignments µ. Proof. Reflexivity is satisfied as Eµ[q|λ]/Eµ[q|λ] = 1 is nondecreasing for all µ. For antisymmetry, we see µ1 µ2 and µ2 µ1 imply Eµ1[q|λ]/Eµ2[q|λ] is constant in λ as it is nonincreasing and nondecreasing. However, setting Eµ1[q|λ] = c Eµ2[q|λ] and taking expectation over λ and rearranging yields 1 = Eµ1[q]/Eµ2[q] = c, so in fact Eµ1[q|λ] = Eµ2[q|λ]. Assuming that µ1 and µ2 are absolutely continuous (the case where they are a sum of point masses is similar), we can write their densities as pi(λ, q) = pi(λ)pi(q|λ). By assumption p1(λ) = p2(λ), so it suffices to show p1(q|λ) = p2(q|λ) almost everywhere. Next note 0 = Eµ1[q|λ] Eµ2[q|λ] = Z R+ q (p1(q|λ) p2(q|λ)) dq, (S22) we have that p1(q|λ) p2(q|λ) = 0 almost everywhere. Finally, for transitivity assume µ1 µ2 and µ2 µ3, then Eµ1[q|λ] Eµ3[q|λ] = Eµ1[q|λ] Eµ2[q|λ] Eµ2[q|λ] Eµ3[q|λ], (S23) so Eµ1[q|λ]/Eµ3[q|λ] is the product of two nondecreasing, positive functions and is thus also nondecreasing. D PROOFS OF PROPOSITIONS D.1 PROPOSITION 3.1 Proposition D.1 (Restatement of Proposition 3.1). In the setting of Theorem 3.1, the bias Bµ is a nonincreasing function of overparameterization ratio φ/ψ. Published as a conference paper at ICLR 2022 Proof. Recall from Theorem 3.1 that the bias is given by Bµ = φIβ 1,2 , (S24) where x is the unique positive real root of the self-consistent equation, ω + I1,1 . 
(S25) Differentiating (S24) with respect to φ/ψ gives, Bµ (φ/ψ) = ψ2 ψ Iβ 1,3 . (S26) Since Lemma B.1 gives Iβ a,b 0, it suffices to show x ψ 0, which immediately follows by implicitly differentiating (S25)) and simplifying the expression, x ψ = ρxτ1(ω + I1,1) φ 1 + ρ( τ1 + ψ φ τ1)(ω + φI1,2) 0 , (S27) where the inequality also follows from Lemma B.1. Therefore we conclude that Bµ (φ/ψ) 0. D.2 PROPOSITION 3.2 Proposition D.2 (Restatement of Proposition 3.2). In the setting of Corollary G.1 and in the overparameterized regime (ψ < φ), the variance Vµ is a nonincreasing function of overparameterization ratio φ/ψ. Proof. In the overparameterized regime, Corollary G.1 gives the expression for the variance as, Vµ = ψ φ ψ (σ2 ε + Iβ 1,1) + x I2,2 ω + φI1,2 (σ2 ε + Iβ 1,2) , (S28) and, since the self-consistent equation x = 1 ω+I1,1 is independent of ψ, we have x ψ = φ (φ ψ)2 (σ2 ε + Iβ 1,1) 0 , (S29) which implies that the variance is nonincreasing in the overparameterized regime. D.3 PROPOSITION 3.3 Proposition D.3 (Restatement of Proposition 3.3). Let µ1, µ2 be two LJSDs such that µ1 µ2 (see Definition 2.1). Then Bµ1 Bµ2, Eµ1 Eµ2, and Bµ1/Vµ1 Bµ2/Vµ2. Proof. For the bias, Corollary B.1 implies (Iβ 1,2)1 (Iβ 1,2)2 and therefore Bµ1 Bµ2. For the test error, we use the explicit expression for the variance from Eq. (S378) and the identity Iβ 2,2 = 1 x Iβ 1,2 which follows from Eq. (S9) to write, Eµ = C0 + C1Iβ 1,1 + C2Iβ 1,2 , (S30) Published as a conference paper at ICLR 2022 where the Ci 0 and depend on µ only through the marginal λ (i.e. they are independent of the weight distribution): φ x γ σ2 ε (ω + φI1,2)(ω + I1,1) + φ ψ γ τ1I2,2 0 (S31) (ω + φI1,2)(ω + I1,1) + γτ1 x (ω + φI1,2) 0 (S32) ψ γ τ1I2,2 φγτ1 x (ω + φI1,2) (S33) ψ γ τ1I2,2 φγτ1 x (ω + φI1,2) φ2 φ τ1I2,2 ψτ1 x (ω + φI1,2) + φ ρx(1 + ρ(τ1ψ/φ + τ1)(ω + φI1,2)) (S35) φ τ1I2,2 + φ ρx(1 + ρ τ1(ω + φI1,2)) (S36) It is now straightforward to write, Eµ2 Eµ1 = C1(Iβ 1,1)2 + C2(Iβ 1,2)2 C1(Iβ 1,1)1 + C2(Iβ 1,2)1 (S38) = C1 (Iβ 1,1)2 (Iβ 1,1)1 + C2 (Iβ 1,2)2 (Iβ 1,2)1 (S39) where the inequality follows from Corollary B.1. Similarly, we can write, Bµ1 Bµ2 Eµ2 Eµ1 = C0 (Iβ 1,2)1 (Iβ 1,2)2 + C1 (Iβ 1,2)1 (Iβ 1,2)2 (Iβ 1,1)2 + C2(Iβ 1,2)1 C0 C1(Iβ 1,1)1 C2(Iβ 1,2)1 = C0 (Iβ 1,2)1 (Iβ 1,2)2 1 + C1 (Iβ 1,2)1 (Iβ 1,2)2 (Iβ 1,1)2 (Iβ 1,1)1 (S42) where the inequality follows from Corollary B.1 and from Lemma B.3 with g : λ φλ(φ + λx) 1 and h : λ (φ + λx) 1. Finally, using Eµi = Bµi + Vµi, the above implies Bµ1/Vµ1 Bµ2/Vµ2. D.4 PROPOSITION 3.4 Proposition D.4 (Restatement of Proposition 3.4). If the LJSD is aligned (see Definition 2.1), then, in the setting of Corollary G.1, the test error has at most two interior critical points as a function of the overparameterization ratio φ/ψ. Proof. From Corollary G.1, there is a critical point at the interpolation threshold φ/ψ = 1. Therefore it suffices to show that there is at most one additional interior critical point. Focusing first on the overparameterized regime φ > ψ, the test error reads, Eµ = φIβ 1,2 + ψ φ ψ (σ2 ε + Iβ 1,1) + x I2,2 ω + φI1,2 (σ2 ε + Iβ 1,2) , (S44) and, since x ψ = φ (φ ψ)2 (σ2 ε + Iβ 1,1) > 0 , (S45) which implies that the test error is monotone decreasing in the overparameterized regime. Published as a conference paper at ICLR 2022 Next, let us consider the case φ < ψ. 
In this case, Eµ = φIβ 1,2 + φ ψ φ(σ2 ε + Iβ 1,1) + x Iβ 2,2 , (S46) so that, Eµ ψ x Iβ 1,2 φ (ψ φ)2 (σ2 ε + Iβ 1,1) + φ ψ φ x ψ x Iβ 1,1 + x ψ x(x Iβ 2,2) (S47) = φ (ψ φ)2 (σ2 ε + Iβ 1,1) + x x Iβ 1,2 + φ ψ φ x Iβ 1,1 + x(x Iβ 2,2) (S48) = φ (ψ φ)2 (σ2 ε + Iβ 1,1) + x 2φIβ 2,3 φ ψ φIβ 2,2 + Iβ 2,2 2x Iβ 3,3 = φ (ψ φ)2 (σ2 ε + Iβ 1,1) x ψ ψ ψ φIβ 2,2 (S50) = φ (ψ φ)2 (σ2 ε + Iβ 1,1) + φ ψ(ψ φ) Iβ 2,2 ω + φI1,2 . (S51) Therefore we see that E ψ = 0 implies φ ψ = x(ω + I1,1) = 1 (ω + φI1,2)σ2 ε + Iβ 1,1 Iβ 2,2 , (S52) or, equivalently, g(x) = 0 for g(x) = 1 (ω + φI1,2)σ2 ε + Iβ 1,1 Iβ 2,2 x(ω + I1,1) . (S53) First we note that g has at most one real root since its derivative is never positive, g (x) = 2φI2,3 σ2 ε + Iβ 1,1 Iβ 2,2 + (ω + φI1,2) 1 2σ2 ε + Iβ 1,1 Iβ 3,3 (ω + φI1,1) + x I2,2 (S54) = 2φI2,3 σ2 ε + Iβ 1,1 Iβ 2,2 2(ω + φI1,2)σ2 ε + Iβ 1,1 Iβ 3,3 (S55) = 2σ2 ε + Iβ 1,1 (Iβ 2,2)2 φI2,3Iβ 2,2 (ω + φI1,2)Iβ 3,3 (S56) = 2σ2 ε + Iβ 1,1 (Iβ 2,2)2 φ2I2,3Iβ 2,3 (ω + φI1,2 xφI2,3)Iβ 3,3 (S57) = 2σ2 ε + Iβ 1,1 (Iβ 2,2)2 φ2I2,3Iβ 2,3 (ω + φ2I1,3)Iβ 3,3 (S58) 2φ2 σ2 ε + Iβ 1,1 (Iβ 2,2)2 I2,3Iβ 2,3 I1,3Iβ 3,3 (S59) where the last line follows from Corollary B.2 since we are assuming µ is aligned. Next, regarding x as a function of φ/ψ, we consider the interval (x , x+) for x = x(φ/ψ = 0) and x+ = x(φ/ψ = 1). From the self-consistent equation for x, we immediately see that x+(ω + I1,1(x+)) = 1 and x = 0 so that g(x+) = (ω + φI1,2)σ2 ε + Iβ 1,1 Iβ 2,2 (S61) < 0 . (S62) Published as a conference paper at ICLR 2022 g(x ) = 1 (ω + φ2Eµ[λ])σ2 ε + Eµ[qλ]) Eµ[qλ2]) . (S63) Observe that, g(x ) > 0 σ2 ε < σ2 c Eµ[qλ2] ω + φEµ[λ] Eµ[qλ] . (S64) Therefore, from the intermediate value theorem, we conclude that g has no real roots in (x , x+) for σ2 ε > σ2 c, and exactly one real root if σ2 ε < σ2 c. E LINEAR REGRESSION LIMIT To reduce to the linear case, we need to take ψ 0 and σ(x) x, in which case we have that η = ζ = ρ 1 and ω 0, so that τ1 x and τ1 1 x I1,1 (S66) x φEs2 µdata s2 φ + xs2 . (S67) E.1 COMPARISON TO MEL & GANGULI (2021) To compare with (Mel & Ganguli, 2021), note that φ = 1/α, γ = 1/φλ, x = τ1 = φ/ λ, so we have φ φEs2 µdata s2 λ/φ λ + s2 = λ φEs2 µdata s2 λ λ + s2 , (S69) which is the expression appearing in Eq. (8) in (Mel & Ganguli, 2021). To compare expressions for the test error, note that x γ x γ + φI1,2 , (S70) x2 x γ (S73) = 1 x(γ + φI1,2) , (S74) Published as a conference paper at ICLR 2022 E = φIβ 1,2 + 1 φIβ 1,2 + σ2 ε x2I2,2 (S75) = φIβ 1,2 + 1 ρf φIβ 1,2(x I1,1 xφI1,2) + σ2 ε ρf x2I2,2 (S76) = φIβ 1,2 + 1 ρf φIβ 1,2(1 x(γ + φI1,2)) + σ2 ε ρf x2I2,2 (S77) = φIβ 1,2 + 1 ρf φIβ 1,2(1 ρf) + σ2 ε ρf x2I2,2 (S78) φIβ 1,2 + σ2 εx2I2,2 . (S79) In contrast to our conventions, the error F in (Mel & Ganguli, 2021) does include an additive constant induced by the label noise, and also normalizes by the total output variance i.e. F = E+σ2 ε Var[y] . Taking this relation into account and using the definitions of I and Iβ, and finally translating the notation via the substitutions φ 1/α, λq v = (SUT w)2, σ2 ε Var[y] fn, |v|2 Var[y] fs, we find F = E + σ2 ε Var [y] (S80) = σ2 ε Var [y] + 1 1 Var [y]Eµ + φ σ2 ε Var [y]Eµ ˆv2 λ λ + λ which is Eq. (6) of (Mel & Ganguli, 2021). E.2 COMPARISON TO WU & XU (2020) Wu & Xu (2020) study the case of anisotropic regularizer: ˆβλ = X X + λΣw 1 X y (S83) with n samples, p features, X Rn p and p/n γ. After simplifying the error expression they arrive at eq. 
3.1: E ˆy x ˆβλ 2 = σ2 1 + 1 ntr Σx/w X /w X/w + λI 1 λΣx/w X /w X/w + λI 2 n tr Σx/w X /w X/w + λI 1 Σwβ X /w X/w + λI 1 (S85) Setting Σw I must give the expression for isotropic regularization, thus the effect of the weighting matrix Σw can be accounted for by just changing the parameters of the isotropic model. The effective feature covariance is Σ Σx/w and the effective weight covariance is Σβ Σwβ. The error expression given in eqs. 4.1-4.3 is E y x ˆβλ 2 m ( λ) (h m ( λ) + 1)2 + σ2 ! λ = 1 m ( λ) γE h 1 + h m ( λ) (S87) 1 m2 ( λ) γE h2 (h m ( λ) + 1)2 m ( λ) (S88) Published as a conference paper at ICLR 2022 In our notation, the predicted output on a new input x is ˆy = 1 n0 β X + ϵtr n1 F F + γIm n1 F f (x) (S89) 1 n0 β X + ϵtr n0 X X + γIm n0 X x (S90) = 1 n0 β X + ϵtr n0 XX + γIn0 = ˆy X X X + φγIn0 1 x (S92) where X has 1 m = 1 samples normalization. Thus translating our notation involves setting φ γ, γ λ/γ, Σ Σx/w, λ h, Σβ γΣwβ, and q γg. In this new notation, our equation for x reads λ = γ 1 + h x γ (S93) which shows x γm ( λ), and therefore x λ/γ = γ2m ( λ). Next, note that γ I2,2 = x γ + φI1,2 I2,2 (S94) = I1,1 φI1,2 γ + φI1,2 (S95) γ + φI1,2 1 (S96) = 1/x γ + φI1,2 1 (S97) x2 x γ 1 (S98) So the full error is E = φIβ 1,2 x φIβ 1,2I2,2 + σ2 e I2,2 Iβ 1,2 σ2 e x γ I2,2 (S100) φIβ 1,2 + σ2 e x2 x γ 1 (S101) φIβ 1,2 + σ2 e σ2 e (S102) (1 + m( λ)h)2 + σ2 ! which, after removing the additive shift, matches the expressions given in (Wu & Xu, 2020) eq. 4.1. F STRUCTURED LEARNING CURVES F.1 EFFECT OF SPECTRAL GAP Here we demonstrate that a large gap in the spectrum of Σ can induce steep cliffs in the learning curves as a function of the overparameterization φ/ψ: Suppose there is a gap in the spectrum of Σ of size g. That is, there are λ < λ+ such that there is no eigenvalue λ (λ , λ+) and λ+ λ = g. Assuming φ < 1 and µ is aligned, and working in the Published as a conference paper at ICLR 2022 noiseless ridgeless limit, we will show the slope of the learning curve log Eµ (φ/ψ) becomes arbitrarily negative for small ω. From Theorem (3.1), x, τ1 satisfy x = 1 γτ1 ω + φE λ φ+xλ (S104) (ψ φ)2 + 4xψφγ/ρ + ψ φ Since x min{1,φ/ψ} ω (Eq. (S8)), for ω > 0, x stays finite in the ridgeless limit γ 0, so γτ1 |ψ φ| + ψ φ 2ψ . (S106) We have the numerator 1 γτ1 min(1, φ/ψ), and x ω + φE λ φ + xλ Since x = 0 is not a solution for 0 < ψ, φ < , we can change variables to γ = φ γ + E λ γ + λ = 1 which implies γ is a continuous decreasing function of φ/ψ (keeping φ fixed). Taking the limit of (S108) directly shows that γmax := limφ/ψ 0 γ = , while γmin := limφ/ψ γ satisfies ω 1 γmin + E λ γmin + λ = 1 By the intermediate value theorem, γ takes all values in the interval ( γmin, ). For φ < 1, using E λ γmin+λ 1 we obtain γmin ω φ 1 φ. We assume that ω φ 1 φ λ , so the previous bound gives γmin λ and thus γ attains all values in (λ , λ+). In particular, there is some 0 < φ/ψ < 1 such that γ (φ/ψ) = p λ λ+. At this point, differentiating (S108) gives γ 1 ψ = ω 1 ( γ + λ)2 (S110) λ+ ( γ + λ+)2 p (λ λ+) + λ ( γ + λ )2 p (λ λ ) γ + g g + 1 2 (S112) φ log x (φ/ψ) 1 , we get γ + g ( g+1) 2 log x (φ/ψ) (S113) For large spectral gap g this tends toward (φ/ψ) (S114) Published as a conference paper at ICLR 2022 If the nonlinearity ω is small compared to the middle of the spectral gap p λ+λ , x undergoes large fractional change as a function of the overparameterization ratio φ/ψ. 
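The mechanism just described is easy to see numerically. The following sketch (illustrative parameter values only) solves the ridgeless change-of-variables equation ω/γ̄ + Eλ[λ/(γ̄ + λ)] = 1/max(φ, ψ) used above (cf. (S129)) for a two-scale spectrum with a large gap, once for a small ω and once for a moderate ω, and tracks x = φ/γ̄ as φ/ψ varies with φ fixed.

```python
import numpy as np

# Two-scale spectrum with gap lambda_+/lambda_- = 100 (illustrative values).
lam = np.array([1.0, 1e-2])          # lambda_+ and lambda_-
p = np.array([0.5, 0.5])             # equal spectral mass on the two scales
phi = 0.5                            # held fixed while phi/psi varies

def solve_gbar(omega, target, lo=1e-12, hi=1e12):
    """Solve omega/gbar + sum_n p_n lambda_n/(gbar + lambda_n) = target by bisection."""
    lhs = lambda g: omega / g + np.sum(p * lam / (g + lam))   # strictly decreasing in g
    for _ in range(200):
        mid = np.sqrt(lo * hi)                                # bisect in log space
        lo, hi = (mid, hi) if lhs(mid) > target else (lo, mid)
    return np.sqrt(lo * hi)

# For omega << sqrt(lambda_- * lambda_+) = 0.1, gbar sweeps through the gap and
# x = phi/gbar changes by orders of magnitude over a narrow range of phi/psi;
# for a moderate omega the change is gradual.
for omega in [1e-3, 3e-1]:
    print(f"omega = {omega:g}")
    for ratio in [0.10, 0.20, 0.25, 0.30, 0.40, 0.70, 1.50]:  # ratio = phi/psi
        psi = phi / ratio
        gbar = solve_gbar(omega, 1.0 / max(phi, psi))
        print(f"  phi/psi = {ratio:4.2f}   gbar = {gbar:9.3e}   x = {phi / gbar:9.3e}")
```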
To see how this affects the test error, we can use the lowering identity Iβ a 1,b 1 = φIβ a 1,b + x Iβ a,b to write the ridgeless error expression from Eq. (S46) as Eµ = φIβ 1,2 + φ ψ φ σ2 ϵ + Iβ 1,1 + x Iβ 2,2 (S115) = φ ψ φσ2 ϵ + ψ ψ φIβ 1,1. (S116) So we can write (φ/ψ) log Eµ φ ψ φσ2 ϵ = (φ/ψ) log ψ ψ φIβ 1,1 (S117) = ψ ψ φ + (φ/ψ) log Iβ 1,1 (S118) For general a, b, we have (φ/ψ) log Iβ a,b = b log x ( γ+λ)b+1 q i E h λa ( γ+λ)b q i (S119) Specializing to a = b = 1, and using the fact that λ γ+λEq [q|λ] is a nondecreasing function of λ (guaranteed since q is aligned), we may apply the Harris inequality to obtain ψ log Iβ 1,1 = log x Eλ h λ γ+λ λ γ+λEq [q|λ] i Eλ h λ ( γ+λ)Eq [q|λ] i (S120) p (λ > λ+) (S122) ω p (λ > λ+) , (S123) which implies (φ/ψ) log Eµ φ ψ φσ2 ϵ = ψ ψ φ (φ/ψ) log Iβ 1,1 (S124) ω p (λ > λ+) (S125) In particular, if σ2 ε = 0, then (φ/ψ) ψ ψ φ + 1 ω p (λ > λ+) (S126) Thus as ω 0 the learning curve becomes arbitrarily steep at the critical value x = φ/ p F.2 ANALYSIS OF THE D-SCALE MODEL IN THE SEPARATED LIMIT We will consider the d-scale covariance model: λn = Cαn, pn = 1 d, n = 0, 1, , d 1 (S127) where C is chosen so that 1 = s = tr [Σ] = 1 n=0 Cαn = C 1 Published as a conference paper at ICLR 2022 We will obtain expressions for x in the limit of small λ. Consider the ridgeless limit of γ := φ/x: 1 γ ω + X n pn λn γ + λn = 1 max (φ, ψ) (S129) Suppose, ω sits between the scales Cαj, Cαj+1. To enforce this constraint, we will take ω = ˆωαj+ 1 2 where ˆω is a constant independent of α. The α scaling of γ will depend on the value of max(φ, ψ). Discarding the second term in (S129) we obtain max (φ, ψ) ω γ, and thus the lowest possible scaling for γ is γ = Cj+ 1 2 . Substituting this ansatz into (S129) and taking the limit α 0, we obtain 1 max (φ, ψ) = 1 2 + Cαn (S130) γ ω + j + 1 Solving for γ gives γ = max(φ,ψ) 1 max(φ,ψ) j+1 d ω. For other values of max(φ, ψ), γ may have higher scaling, ie. γ = Ckαk with k j. Substituting and solving for γ we obtain γ = max(φ,ψ) k+1 d 1 1 max(φ,ψ) k d λk. Thus we obtain the following self-consistent solutions for γ: max(φ,ψ) 1 max(φ,ψ) j+1 d ω max (φ, ψ) < d j+1 max(φ,ψ) k+1 d 1 1 max(φ,ψ) k d λk d k+1 < max (φ, ψ) < d Thus γ takes on the scale of a single eigenvalue λk for a range of overparameterization ratios corresponding to d k+1 < ψ max φ ψ, 1 < d k. To understand what happens at the transitions between these regimes, we can apply the results from the previous subsection F.1 for generic Σ with large spectral gap. In the notation of F.1, the D-scale model has a spectral gap between each pair of consecutive scales of size g = λj/λj+1 = Cαj/Cαj+1 = 1/α and as a consequence, γ will exhibit near infinite slop as it passes through the middle of a gap p λj+1λj = Cαj+ 1 2 . Comparing to the self-consistent solutions (S132) these transitions must happen at the critical values max(φ, ψ) = d k+1 for k j. At these transition points, the error exhibits steep cliffs in the parameter regime descried in F.1. G PROOF OF THEOREM 3.1 The proof closely follows the methods described in (Adlam et al., 2019; Adlam & Pennington, 2020a;b; Tripuraneni et al., 2021a;b). Indeed, precisely the same techniques from operator-valued free probability used in those works apply here. The main and only difference is the anisotropic weight covariance Σβ, which changes the details of the computations but not the arguments justifying the linearized Gaussian equivalents and the application of operator-valued free probability. 
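Returning briefly to the d-scale model of Section F.2, the self-consistent solutions (S132) can be checked by solving the same ridgeless equation numerically. The sketch below uses hypothetical values of d, α, and ω̂, and shows γ̄ settling onto the scale of a single eigenvalue (or onto ω) within each window of max(φ, ψ), with rapid transitions near the window boundaries.

```python
import numpy as np

# d-scale spectrum (S127): lambda_n = C * alpha^n with equal mass 1/d, C fixed by (S128).
d, alpha, j = 4, 1e-4, 1
lam = alpha ** np.arange(d)
lam = lam / lam.sum()                       # scales sum to one
omega = alpha ** (j + 0.5)                  # omega_hat = 1, sitting between scales j and j+1

def solve_gbar(m, lo=1e-16, hi=1e4):
    """Solve omega/gbar + mean_n[lambda_n/(gbar+lambda_n)] = 1/m by bisection in log space."""
    res = lambda g: omega / g + np.mean(lam / (g + lam)) - 1.0 / m   # decreasing in g
    for _ in range(300):
        mid = np.sqrt(lo * hi)
        lo, hi = (mid, hi) if res(mid) > 0 else (lo, mid)
    return np.sqrt(lo * hi)

print("scales lambda_n:", lam, "  omega:", omega)
for m in [1.1, 1.25, 1.45, 1.6, 1.9, 2.2, 2.8, 3.6]:    # m = max(phi, psi)
    print(f"max(phi,psi) = {m:4.2f}   gbar = {solve_gbar(m):9.3e}")
# Well inside each window gbar is pinned near a single scale (near omega below m = 2,
# near lambda_1 for 2 < m < 4); near the boundaries it moves quickly between scales.
```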
We therefore refer the reader to those previous works for an in-depth discussion of methods and merely focus here on the details of the requisite calculations. Throughout this section, we use tr to denote the dimension-normalized trace, i.e. tr(A) = 1 ntr(A) for a matrix A Rn n. G.1 DECOMPOSITION OF THE TEST LOSS The test loss can be written as, EΣ = E(x,y)(y ˆy(x))2 = E1 + E2 + E3 (S133) with E1 = E(x,ε)tr(y(x)y(x) ) (S134) E2 = 2E(x,ε)tr(K x K 1Y y(x)) (S135) E3 = E(x,ε)tr(K x K 1Y Y K 1Kx) . (S136) Published as a conference paper at ICLR 2022 Recall the kernels K = K(X, X) and Kx = K(X, x) are given by, n1 + γIm and Kx = 1 n1 F f . (S137) Using the cyclicity and linearity of the trace, the expectation over x requires the computation of Ex Kx K x , Exy(x)K x , Exy(x)y(x) . (S138) As described in detail in (Tripuraneni et al., 2021a;b; Adlam et al., 2019; Adlam & Pennington, 2020a; Mei & Montanari, 2019), asympotically the trace terms E1, E2, and E3 are invariant to a linearization of the random feature vector f, f f lin = ρ n0 Wx + p η ζθ , (S139) where θ Rn1 is a vector of iid standard normal variates. Similarly, we will take the linearization of the training features to be ρ n0 WX + η ζΘ where Θ Rn1 m has standard normal components. The expectations over x are now trivial and we readily find, Ex Kx K x = 1 n0 WΣW + (η ζ)In1 F (S140) Exy(x)K x = ρ n0n1 β ΣW F (S141) Exy(x)y(x) = 1 n0 βΣβ (S142) Next, we recall the definition, Y = β X/ n0 + ϵ, and, using the above substitution, we find n0 X ΣβX + σ2 εIm (S143) Eϵ Y Exy(x)K x = ρ n3/2 0 n1 X ΣβΣW F . (S144) Putting these pieces together, we have E1 = tr(ΣβΣ) E2 = E21 (S146) E3 = E31 + E32 , (S147) n3/2 0 n1 Etr X ΣβΣW FK 1 (S148) E31 = σ2 εEtr K 1Σ3K 1 (S149) n0 Etr K 1Σ3K 1X ΣβX (S150) Σ3 = ρ n0n2 1 F WΣW F + η ζ n2 1 F F . (S151) G.2 DECOMPOSITION OF THE BIAS AND TOTAL VARIANCE Note that it is sufficient to calculate the bias term given the total test loss, since the total variance can be obtained as VΣ = EΣ BΣ. Following the total multivariate bias-variance decomposition Published as a conference paper at ICLR 2022 of (Adlam & Pennington, 2020b), for each random variable in question we introduce an iid copy of it denoted by either the subscript 1 or 2. We can then write, BΣ = E(x,y)(y E(W,X,ε)ˆy(x; W, X, ε))2 (S152) = E(x,y)E(W1,X1,ε1)E(W2,X2,ϵ2)(y ˆy(x; W1, X1, ε1))(y ˆy(x; W2, X2, ϵ2)) (S153) n0 + E21 + H000 , (S154) where an expression for E21 was given previously and H000 satisfies H000 = Eˆy(x; W1, X1, ε1)ˆy(x; W2, X2, ϵ2) , (S155) where the expectations are over x, W1, X1, ε1, W2, X2, and ϵ2. Recalling the definition of ˆy, ˆy(x; W, X, ε) := Y (X, ϵ)K(X, X; W) 1K(X, x; W) (S156) and the techniques described in the previous section, it is straightforward to analyze the above term. First note we can write, Ex K(X1, x; W1)K(x, X2; W2) = ρ n0n2 1 F 11W1ΣW 2 F22 . (S157) Here we have defined F11 F(W1, X1) and F22 F(W2, X2). Now we proceed to calculate H000 as H000 = Eˆy(x; W1, X1, ε)ˆy(x; W2, X2, ϵ2) (S158) = EK(x, X2; W2)K(X2, X2; W2) 1Y (X2, ϵ2) Y (X1, ε1)K(X1, X1; W1) 1K(X1, x; W) (S159) = Etr K(X2, X2; W2) 1X 2 X1K(X1, X1; W1) 1K(X1, x; W)K(x, X2; W2) (S160) = ρ n2 0n2 1 Etr K 1 22 X 2 ΣβX1K 1 11 F 11W1ΣW 2 F22 (S161) E4 , (S162) where in the second-to-last line we have defined K11 K(X1, X1; W1) and K22 K(X2, X2; W2). G.3 SUMMARY OF LINEARIZED TRACE TERMS We now summarize the requisite terms needed to compute the total test error, bias, and variance after using cyclicity of the trace to rearrange several of them. 
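Before carrying out the asymptotic evaluation, the decomposition above can be sanity-checked at finite size by direct simulation. The sketch below is a toy instantiation (hypothetical dimensions, a tanh nonlinearity as a stand-in for σ, and Gaussian target weights whose covariance is proportional to Σ, so it is not the paper's experimental setup); it estimates the test error directly and the bias via the two-independent-copies identity (S152)–(S155), using the predictor (S156) with the kernels in (S137).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instantiation (illustrative only).
n0, n1, m, gamma, sig_eps = 60, 80, 50, 1e-2, 0.1
evals = np.geomspace(1.0, 1e-2, n0)               # eigenvalues of Sigma (work in its eigenbasis)
sqrt_Sigma = np.sqrt(evals)[:, None]              # scale rows by sqrt(eigenvalues)
beta = np.sqrt(evals) * rng.standard_normal(n0)   # fixed target weights, cov ~ Sigma (aligned)

def rf_predict(x_test, rng):
    """One draw of (W, X, eps) and the RF ridge predictions (S156) at the test points."""
    W = rng.standard_normal((n1, n0))
    X = sqrt_Sigma * rng.standard_normal((n0, m))          # training inputs ~ N(0, Sigma)
    Y = beta @ X / np.sqrt(n0) + sig_eps * rng.standard_normal(m)
    F = np.tanh(W @ X / np.sqrt(n0))                       # training features
    K = F.T @ F / n1 + gamma * np.eye(m)                   # K = F^T F / n1 + gamma I   (S137)
    Kx = F.T @ np.tanh(W @ x_test / np.sqrt(n0)) / n1      # K_x = F^T f(x) / n1
    return Y @ np.linalg.solve(K, Kx)

n_test, n_pairs = 400, 30
x_test = sqrt_Sigma * rng.standard_normal((n0, n_test))
y_test = beta @ x_test / np.sqrt(n0)                       # noiseless test targets

err = bias = 0.0
for _ in range(n_pairs):
    yhat1 = rf_predict(x_test, rng)                        # two independent copies of (W, X, eps)
    yhat2 = rf_predict(x_test, rng)
    err  += np.mean((y_test - yhat1) ** 2) / n_pairs
    bias += np.mean((y_test - yhat1) * (y_test - yhat2)) / n_pairs   # cf. (S153)

print(f"test error ~ {err:.4f}   bias ~ {bias:.4f}   variance ~ {err - bias:.4f}")
```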
In the following, we slightly change notation in order to make explicit the dependence on the covariance matrix Σ. To be specific, whereas above we assumed that the columns of X1 and X2 were drawn from multivariate Gaussians with covariance Σ, below we assume that they are drawn from multivariate Gausssians with identity covariance. This change is equivalent to replacing X1 Σ1/2X1 and X2 Σ1/2X2 in the above expressions. We utilize this definition so that X1, X2, W1, W2, and Θ all have iid standard Gaussian entries. From the previous computations, we can now write the requisite terms as, Σ3 = ρ n0n2 1 F 11W1ΣW 1 F11 + η ζ n2 1 F 11F11 (S163) n3/2 0 n1 tr X 1 Σ1/2ΣβΣW 1 F11K 1 11 E31 = σ2 ϵ tr K 1 11 Σ3K 1 11 E32 = 1 n0 tr K 1 11 Σ3K 1 11 X 1 Σ1/2ΣβΣ1/2X1 E4 = ρ n2 0n2 1 tr F22K 1 22 X 2 Σ1/2ΣβΣ1/2X1K 1 11 F 11W1ΣW 2 EΣ = 1 n0 tr (ΣΣβ) + E21 + E31 + E32 (S168) BΣ = 1 n0 tr (ΣΣβ) + E21 + E4 (S169) VΣ = EΣ BΣ (S170) Published as a conference paper at ICLR 2022 G.4 CALCULATION OF ERROR TERMS To compute the test error, bias, and total variance, we need to evaluate the asymptotic trace objects appearing in the expressions for E21, E31, E32, and E4, defined in the previous section. As these expressions are essentially rational functions of the random matrices X, W, Θ, Σ, and Σβ, these computations can be accomplished by representing the rational functions as single blocks of a suitablydefined block matrix inverse - the so-called linear pencil method (see eg . Far et al., 2006) - and then applying the theory of operator-valued free probability (Mingo & Speicher, 2017). These techniques and their application to problems of this type have been well-established elsewhere (Adlam et al., 2019; Adlam & Pennington, 2020a;b), we only lightly sketch the mathematical details, referring the reader to the literature for a more pedagogical overview. Instead, we focus on presenting the details of the requisite calculations. Relative to prior work, the main challenge in the current setting is generalizing the calculations to include an arbitrary weight covariance matrix Σβ. This generalization is facilitated by the general theory of operator-valued free probability, and in particular through the subordinated form of the operator-valued self-consistent equations that we first present in eqn. (S201). The form of this equation enables the simple computation of the operator-valued R-transform of the remaining random matrices, W, X, and Θ, which are all iid Gaussian and can therefore be obtained simply by using the methods of (Far et al., 2006). The remaining complication amounts to performing the trace in eqn. (S201), which asymptotically becomes an integral over the LJSD µ. While this might in general lead to a complicated coupling of many transcendental equations, it turns out that the trascendentality can be entirely factored into a single scalar fixed-point equation, whose solution we denote by x (see eqn. (S237)), and the remaining equations are purely algebraic given x. To facilitate this particular simplification, it is necessary to first compute all of the entries in the operator-valued Stieltjes transform of the kernel matrix K, which we do in Sec. G.4.1. Using these results, we compute the remaining error terms in the subsequent sections. As a matter of notation, note that throughout this entire section whenever a matrix X, X1, or X2 appears it is composed of iid N(0, 1) entries as in Appendix G.3. 
This differs from the notation of the main paper, but we follow this prescription to ease the already cumbersome presentation. This definition of X allows us to explicitly extract and represent the training covariance Σ in our calculations. The NCAlgebra Mathematica package (NCRealization method; algorithm described in Helton et al., 2006) was used to generate the following matrix pencil QK 1: γ n0 0 0 0 0 0 0 Θ η ζ n1 In1 0 0 ρW n1 0 0 0 0 0 0 In0 Σ1/2 0 0 0 0 0 0 W n1 0 In0 0 0 Σβ ρ 0 0 0 0 0 0 In0 Σ1/2 0 0 0 X n0 0 0 0 0 In0 0 0 0 0 0 0 0 0 0 In0 Σ1/2 0 0 0 0 0 0 0 0 In0 X n0 0 0 0 0 0 0 0 0 Im (S171) This matrix is specifically chosen so that inverting [QK 1] and taking the normalized trace of its first block gives exactly γ tr K 1, the quantity of interest. Computing the full inverse of [QK 1] via repeated applications of the Schur complement formula and taking block-wise traces shows that GK 1 1,1 = γ tr(K 1) (S172) GK 1 9,1 = φ tr ΣβΣ1/2XK 1X Σ1/2 GK 1 2,2 = γ tr( ˆK 1) (S174) Published as a conference paper at ICLR 2022 GK 1 3,3 = GK 1 6,6 = 1 ρ tr Σ1/2W FK 1X GK 1 4,3 = GK 1 6,5 = tr(Σ1/2) ρ tr ΣW FK 1X GK 1 5,3 = GK 1 6,4 = γ ρ tr Σ1/2W ˆK 1W GK 1 6,3 = γ ρ tr ΣW ˆK 1W GK 1 7,3 = tr ΣβΣ1/2W FK 1X Σ1/2 n0n1 tr(ΣβΣ1/2) ρ (S179) GK 1 8,3 = tr ΣβΣW FK 1X Σ1/2 n0n1 tr(ΣβΣ) ρ (S180) GK 1 3,4 = GK 1 5,6 = ρ tr FK 1X W n0n1ψ (S181) GK 1 4,4 = GK 1 5,5 = 1 ρ tr Σ1/2W FK 1X GK 1 5,4 = γ ρ tr ˆK 1WW GK 1 7,4 = tr ΣβΣ1/2XF ˆK 1W n0n1 tr(Σβ) ρ (S184) GK 1 8,4 = tr ΣβΣ1/2W FK 1X Σ1/2 n0n1 tr(ΣβΣ1/2) ρ (S185) GK 1 3,5 = GK 1 4,6 = ρ tr Σ1/2XK 1X GK 1 4,5 = ρ tr ΣXK 1X GK 1 7,5 = tr ΣβΣ1/2XK 1X Σ1/2 GK 1 8,5 = tr ΣβΣXK 1X Σ1/2 GK 1 3,6 = ρ tr K 1X X GK 1 7,6 = tr ΣβΣ1/2XK 1X GK 1 8,6 = tr ΣβΣ1/2XK 1X Σ1/2 GK 1 7,7 = GK 1 8,8 = GK 1 9,9 = 1 (S193) GK 1 8,7 = tr(Σ1/2) , (S194) where GK 1 := id9 tr [(QK 1) ] 1 M9(C) is a scalar 9 9 matrix whose i, j entry GK 1 i,j is the normalized trace of the (i, j)-block of the inverse of [QK 1] . We have also defined ˆK = 1 n1 FF + γIn1 (note that K is m m while ˆK is n1 n1). It is straightforward to verify that when Published as a conference paper at ICLR 2022 the n0, n1, m limit is eventually taken, each entry of GK 1 is properly scaled and will tend toward a finite value. We aim to compute the limiting values of these trace terms as n0, n1, m , as they will be related to the error terms of interest. To proceed, recall that the asymptotic block-wise traces of the inverse of QK 1 can be determined from its operator-valued Stieltjes transform (Mingo & Speicher, 2017). The simplest way to apply the results of (Far et al., 2006; Mingo & Speicher, 2017) is to augment QK 1 to form the the self-adjoint matrix QK 1, and observe that we can write QK 1 as, QK 1 = Z QK 1 W,X,Θ QK 1 Σ = 0 I9 I9 0 0 [QK 1 W,X,Θ] QK 1 W,X,Θ 0 0 [QK 1 Σ ] QK 1 W,X,Θ = γ n0 0 0 0 0 0 0 Θ η ζ n1 0 0 0 ρW n1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 W n1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X n0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X n0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Σ1/2 0 0 0 0 0 0 0 0 0 0 0 Σβ ρ 0 0 0 0 0 0 0 Σ1/2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Σ1/2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 and the addition in (S196) is performed block-wise. Note that we have separated the iid Gaussian matrices W, X, Θ from the constant terms and from the Σ-dependent terms. Denote by GK 1 M18(C) the block matrix = id18 tr QK 1 1 , (S199) and by GK 1 Σ M18(C) the operator-valued Stieltjes transform of QK 1 Σ . 
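As a toy illustration of the linear-pencil idea (on a far smaller pencil than (S171)), the sketch below builds a 2 × 2-block matrix whose inverse contains the resolvent K⁻¹ as its top-left block, which is exactly the property the construction above exploits; the dimensions and the value of γ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy pencil: for Q = [[gamma*I_m, F^T/sqrt(n1)], [-F/sqrt(n1), I_n1]], the Schur
# complement of the lower-right block gives
#   [Q^{-1}]_(1,1 block) = (gamma*I_m + F^T F / n1)^{-1} = K^{-1},
# so a single block of inv(Q) encodes the resolvent whose normalized trace we need.
m, n1, gamma = 40, 60, 0.3
F = rng.standard_normal((n1, m))

Q = np.block([
    [gamma * np.eye(m),  F.T / np.sqrt(n1)],
    [-F / np.sqrt(n1),   np.eye(n1)],
])
block_11 = np.linalg.inv(Q)[:m, :m]
K_inv = np.linalg.inv(F.T @ F / n1 + gamma * np.eye(m))

print(np.allclose(block_11, K_inv))                   # True: the pencil block reproduces K^{-1}
print(np.trace(block_11) / m, np.trace(K_inv) / m)    # dimension-normalized traces agree
```

The same mechanism, applied to the much larger pencils used in this section, is what reduces each trace term of interest to a single block of a block-matrix inverse.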
Using (S196) and the definition of the operator-valued Stieltjes transform G QK 1 W,X,Θ+ QK 1 Σ , we can write GK 1 = id18 tr Z QK 1 W,X,Θ QK 1 Σ 1 = G QK 1 W,X,Θ+ QK 1 Σ ( Z) (S200) Thus using the subordinated form of the equations for addition of free variables (Mingo & Speicher, 2017; section 9.2 Thm. 11), and the defining equation for GK 1 Σ , the operator-valued theory of free probability shows that in the limit n0, n1, m , the Stieltjes transform GK 1 satisfies the Published as a conference paper at ICLR 2022 following 18 18 matrix equation: GK 1 = GK 1 Σ ( Z RK 1 W,X,Θ( GK 1)) = id tr Z RK 1 W,X,Θ( GK 1) QK 1 Σ 1 , (S201) where RK 1 W,X,Θ( GK 1) M18(C) is the operator-valued R-transform of QK 1 W,X,Θ. Note that (S201) is a coupled set of 18 18 scalar equations and thus eliminates all reference to large random matrices. To see this, note that Z, GK 1, RK 1 W,X,Θ( GK 1) are all scalar-entried 18 18 matrices. The righthand side of (S201) is defined by expanding the inverse to obtain an 18 18 block matrix whose blocks involve various rational functions of Σ, Σβ and the scalar entries of Z, GK 1, RK 1 W,X,Θ( GK 1). Finally one computes the normalized traces of these blocks, giving scalar values and eliminating all reference to random matrices. Below, when writing out these equations explicitly, we will use the fact that traces of rational functions of Σ, Σβ tend toward expectations of the corresponding rational functions over the LJSD µ. Both here and in the sequel, to ease the already cumbersome presentation, we use GK 1 to also denote the limiting value satisfying (S201). As described in (Adlam & Pennington, 2020a;b), since QK 1 W,X,Θ is a block matrix whose blocks are iid Gaussian matrices (and their transposes), an explicit expression for RK 1 W,X,Θ( GK 1) can be obtained through a covariance map, denoted by η (Far et al., 2006). In particular, η : Md(C) Md(C) is defined by, [η(D)]ij = X kl σ(i, k; l, j)αk Dkl , (S202) where αk is dimensionality of the kth block and σ(i, k; l, k) denotes the covariance between the entries of the blocks ij block of QK 1 W,X,Θ and entries of the kl block of QK 1 W,X,Θ. Here d = 18 is the number of blocks. When the constituent blocks are iid Gaussian matrices and their transposes, as is the case here, then RK 1 W,X,Θ = η (Mingo & Speicher, 2017; section 9.1 and 9.2 Thm. 11), and therefore the entries of RK 1 W,X,Θ can be read off from eqn. (S195). To simplify the presentation, we only report the entries of RK 1 W,X,Θ(GK 1) that are nonzero, given the specific sparsity pattern of GK 1. The latter follows from eqn. (S201) in the manner described in (Mingo & Speicher, 2017; Far et al., 2006). Practically speaking, the sparsity pattern can be obtained by iterating an eqn. (S201), starting with an ansatz sparsity pattern determined by Z, and stopping when the iteration converges to a fixed sparsity pattern. In this case (and all cases that follow in the subsequent sections), the number of necessary iterations is small and can be done explicitly. 
We omit the details and instead simply report the following results for the nonzero entries: RK 1 W,X,Θ( GK 1) = 0 RK 1 W,X,Θ(GK 1) RK 1 W,X,Θ(GK 1) 0 [RK 1 W,X,Θ(GK 1)]1,1 = GK 1 2,2 (ζ η) ρGK 1 6,3 γ (S204) [RK 1 W,X,Θ(GK 1)]1,9 = ρGK 1 8,3 γ (S205) [RK 1 W,X,Θ(GK 1)]2,2 = ψGK 1 1,1 (ζ η) γφ + ρψGK 1 4,5 (S206) [RK 1 W,X,Θ(GK 1)]4,5 = ρGK 1 2,2 (S207) [RK 1 W,X,Θ(GK 1)]6,3 = ρGK 1 1,1 γφ (S208) [RK 1 W,X,Θ(GK 1)]8,3 = ρGK 1 1,9 γφ , (S209) Published as a conference paper at ICLR 2022 and the remaining entries of RK 1 W,X,Θ(GK 1) are zero. Owing to the large degree of sparsity, the matrix inverse in (S201) can be performed explicitly and yields relatively simple expressions that depend on the entries of GK 1 and the matrices Σ and Σβ. For example, the (16, 4) entry of the self-consistent equation reads, GK 1 7,4 = id tr Z RK 1 W,X,Θ( GK 1) QK 1 Σ 1 16,4 (S210) = tr h 1 ρΣβ In0 + ρ φγ GK 1 1,1 GK 1 2,2 Σ) 1i (S211) n0 = Eµ h q/ ρ = Iβ 0,1 ρ , (S213) where to compute the asymptotic normalized trace we moved to an eigenbasis of Σ and recalled the definition of the LJSD µ and the definition of Iβ in Eq. (12). The remaining entries of the (S201) can be obtained in a similar manner and together yield the following set of coupled equations for the entries of GK 1, GK 1 1,1 = γ GK 1 2,2 ( ζ + η + ρ) + ρGK 1 2,2 ρGK 1 6,3 γ (S214) GK 1 2,2 = γφ ψGK 1 1,1 (η ζ) γφ ρψGK 1 4,5 1 (S215) GK 1 3,6 = Eµ h ρGK 1 1,1 λρGK 1 1,1 GK 1 2,2 γφ GK 1 4,5 = Eµ h λ ρGK 1 1,1 λρGK 1 1,1 GK 1 2,2 γφ GK 1 5,4 = Eµ h γ ρφGK 1 2,2 λρGK 1 1,1 GK 1 2,2 γφ GK 1 6,3 = Eµ h γλ ρφGK 1 2,2 λρGK 1 1,1 GK 1 2,2 γφ GK 1 7,4 = Eµ h qγφ ρ λρGK 1 1,1 GK 1 2,2 + γφ i GK 1 7,6 = Eµ h q λGK 1 1,1 λρGK 1 1,1 GK 1 2,2 + γφ GK 1 8,3 = Eµ h qγλφ ρ λρGK 1 1,1 GK 1 2,2 + γφ i GK 1 8,5 = Eµ h qλ3/2GK 1 1,1 λρGK 1 1,1 GK 1 2,2 + γφ GK 1 8,7 = Eµ h GK 1 9,1 = ρGK 1 8,3 GK 1 2,2 ( ζ + η + ρ) + ρGK 1 2,2 ρGK 1 6,3 γ (S225) GK 1 3,4 = GK 1 5,6 = Eµ h λρGK 1 1,1 GK 1 2,2 λρGK 1 1,1 GK 1 2,2 γφ GK 1 3,5 = GK 1 4,6 = Eµ h λ ρGK 1 1,1 λρGK 1 1,1 GK 1 2,2 + γφ Published as a conference paper at ICLR 2022 GK 1 4,3 = GK 1 6,5 = Eµ h γ λφ λρGK 1 1,1 GK 1 2,2 γφ GK 1 5,3 = GK 1 6,4 = Eµ h γ λ ρφGK 1 2,2 λρGK 1 1,1 GK 1 2,2 + γφ GK 1 7,3 = GK 1 8,4 = Eµ h qγ λφ ρ λρGK 1 1,1 GK 1 2,2 + γφ i GK 1 7,5 = GK 1 8,6 = Eµ h qλGK 1 1,1 λρGK 1 1,1 GK 1 2,2 + γφ GK 1 7,7 = GK 1 8,8 = GK 1 9,9 = 1 (S232) GK 1 3,3 = GK 1 4,4 = GK 1 5,5 = GK 1 6,6 = Eµ h γφ λρGK 1 1,1 GK 1 2,2 γφ where we have used the fact that, asymptotically, the normalized trace becomes equivalent to an expectation over µ. After eliminating GK 1 6,3 and GK 1 4,5 from the first two equations, it is straightforward to show that τ1 tr(K 1) = 1 γ GK 1 1,1 = (ψ φ)2 + 4xψφγ/ρ + ψ φ τ1 tr( ˆK 1) = 1 γ GK 1 2,2 = 1 n0 X Σ1/2ΣβΣ1/2XK 1) = τ1Iβ 1,1 (S236) where we have used the notation τ1 and τ2 from (Adlam & Pennington, 2020a;b), and τ1 is the companion transform of τ1, and where x satisfies the self-consistent equation, ω + I1,1 = 1 (ψ φ)2+4xψφγ/ρ+ψ φ 2ψ ω + I1,1 . (S237) Here we utilized the two-index set of functionals of µ, Ia,b defined in Eq. (12). Note that the product τ1 τ1 is simply related to x, x = γρτ1 τ1 , (S238) so that, given x, the equations for the remaining entries of GK 1 completely decouple. 
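For concreteness, the scalar fixed point (S237) can be solved numerically once the spectrum of Σ and the scalars (φ, ψ, γ, ρ, ω) are specified; the sketch below does so by bisection, writing the numerator exactly as it appears in (S237) and taking I_{1,1}(x) = φ E_µ[λ/(φ + xλ)] as in (12). The parameter values are hypothetical.

```python
import numpy as np

# Hypothetical spectrum and scalars (illustrative only).
lam = np.geomspace(1.0, 1e-3, 200)       # eigenvalues of Sigma
p = np.full_like(lam, 1.0 / lam.size)    # their weights in the LJSD marginal
phi, psi, gamma, rho, omega = 0.8, 0.4, 1e-3, 1.0, 0.1

def I11(x):
    return phi * np.sum(p * lam / (phi + x * lam))

def numerator(x):                        # the 1 - gamma*tau-bar term displayed in (S237)
    return 1.0 - (np.sqrt((psi - phi) ** 2 + 4 * x * psi * phi * gamma / rho)
                  + psi - phi) / (2.0 * psi)

def residual(x):                         # strictly increasing in x => unique root (Lemma B.2)
    return x * (omega + I11(x)) - numerator(x)

lo, hi = 0.0, min(1.0, phi / psi) / omega + 1.0
for _ in range(200):                     # plain bisection
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if residual(mid) < 0 else (lo, mid)
x = 0.5 * (lo + hi)
print(f"x = {x:.6f}   (ridgeless approximation: {min(1.0, phi/psi) / (omega + I11(x)):.6f})")
```

With x in hand, the decoupled entries listed next follow by direct substitution.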
In particular, GK 1 3,6 = ρGK 1 1,1 I0,1 γφ (S239) GK 1 4,5 = ρGK 1 1,1 I1,1 γφ (S240) GK 1 5,4 = ρGK 1 2,2 I0,1 (S241) GK 1 6,3 = ρGK 1 2,2 I1,1 (S242) GK 1 7,4 = Iβ 0,1 ρ (S243) GK 1 7,6 = Iβ 1 2 ,1GK 1 1,1 γφ (S244) GK 1 8,3 = Iβ 1,1 ρ (S245) Published as a conference paper at ICLR 2022 GK 1 8,5 = Iβ 3 2 ,1GK 1 1,1 γφ (S246) GK 1 8,7 = I 1 2 ,0 φ (S247) GK 1 9,1 = ρGK 1 1,1 GK 1 8,3 γ (S248) GK 1 3,4 = GK 1 5,6 = x I 1 2 ,1 φ (S249) GK 1 3,5 = GK 1 4,6 = ρGK 1 1,1 I 1 2 ,1 γφ (S250) GK 1 4,3 = GK 1 6,5 = I 1 2 ,1 (S251) GK 1 5,3 = GK 1 6,4 = ρGK 1 2,2 I 1 2 ,1 (S252) GK 1 7,3 = GK 1 8,4 = Iβ 1 2 ,1 ρ (S253) GK 1 7,5 = GK 1 8,6 = Iβ 1,1GK 1 1,1 γφ (S254) GK 1 7,7 = GK 1 8,8 = GK 1 9,9 = 1 (S255) GK 1 3,3 = GK 1 4,4 = GK 1 5,5 = GK 1 6,6 = I0,1 , (S256) which will be important intermediate results for the subsequent sections. Finally, we note that these results are sufficient to compute the training error. The expected training loss can be written as, m Etr (Y ˆy(X))(Y ˆy(X)) (S257) m Etr Y Y K 2 (S258) n0 (X Σ1/2ΣβΣ1/2X + σ2 εIm)K 2 (S259) = γ2 γτ2 + σ2 ε γτ1 (S260) = γ2 γ(τ1Iβ 1,1) + σ2 ε γτ1 . (S261) The calculation of E21 proceeds exactly as in (Tripuraneni et al., 2021a;b) with the simple modification of including an additional factor Σβ inside the final trace term, yielding φIβ 2,1 . (S262) The calculation of E31 proceeds exactly as in (Tripuraneni et al., 2021a;b) with no modifications since there is no dependence on Σβ. The result is, σ2 ε (ω + φI1,2)(ω + I1,1) + φ ψ γ τ1I2,2 , (S263) Published as a conference paper at ICLR 2022 Define the block matrix QE32 [QE32 1 QE32 2 ] by, γ n0 0 0 0 η ζΘ (ζ η) Θ η ζ n1 In1 0 0 ρW n1 0 0 0 0 0 In0 Σ1/2 0 0 0 Σ1/2 (η ζ) n1 0 In0 0 0 0 0 0 0 0 0 In0 Σ1/2 0 n1Σρ n0 ρ X n0 0 0 0 0 In0 0 0 0 0 0 0 0 0 In1 0 0 0 0 0 0 0 W 0 0 0 0 0 0 η ζΘ γ n1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 W n1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (S264) and, 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Θ η ζ n1 ρW n1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 γ n0 0 0 0 0 0 In0 Σ1/2 0 0 0 0 0 X n0 0 In0 0 0 0 0 0 0 0 0 In0 Σ1/2 0 0 0 0 0 0 0 In0 Σβ ρ 0 0 0 0 0 0 0 In0 Σ1/2 0 0 0 0 0 0 0 In0 X n0 0 0 0 0 0 0 0 Im Then block matrix inversion (i.e. repeated applications of the Schur complement formula) shows that, GE32 8,8 = GE32 14,14 = GE32 15,15 = GE32 16,16 = 1 (S266) GE32 1,1 = GE32 9,9 = GK 1 1,1 (S267) GE32 2,2 = GE32 7,7 = GK 1 2,2 (S268) GE32 13,8 = GK 1 3,3 1 (S269) GE32 3,3 = GE32 6,6 = GE32 11,11 = GE32 12,12 = GE32 4,4 = GE32 5,5 = GE32 10,10 = GE32 13,13 = GK 1 3,3 (S270) GE32 3,4 = GE32 5,6 = GE32 10,11 = GE32 12,8 = GE32 12,13 = GK 1 3,4 (S271) GE32 3,5 = GE32 4,6 = GE32 12,10 = GE32 13,11 = GK 1 3,5 (S272) GE32 3,6 = GE32 12,11 = GK 1 3,6 (S273) GE32 4,3 = GE32 6,5 = GE32 11,10 = GE32 13,12 = GK 1 4,3 (S274) GE32 4,5 = GE32 13,10 = GK 1 4,5 (S275) GE32 5,3 = GE32 6,4 = GE32 10,12 = GE32 11,8 = GE32 11,13 = GK 1 5,3 (S276) GE32 5,4 = GE32 10,8 = GE32 10,13 = GK 1 5,4 (S277) GE32 6,3 = GE32 11,12 = GK 1 6,3 (S278) GE32 14,12 = GE32 15,13 = GK 1 7,3 (S279) Published as a conference paper at ICLR 2022 GE32 14,13 = GK 1 7,4 (S280) GE32 14,11 = GK 1 7,6 (S281) GE32 15,12 = GK 1 8,3 (S282) GE32 15,10 = GK 1 8,5 (S283) GE32 15,14 = GK 1 8,7 (S284) GE32 16,9 = GK 1 9,1 (S285) GE32 14,10 = GE32 15,11 = GK 1 9,1 GE32 16,1 = φ ψ E32 , (S287) where GE32 i,j denotes the normalized trace of the (i, j)-block of the inverse of QE32 . 
For brevity, we have suppressed the expressions for the other non-zero blocks. To compute the limiting values of these traces, we require the asymptotic block-wise traces of QE32, which may be determined from the operator-valued Stieltjes transform. To proceed, we first augment QE32 to form the the self-adjoint matrix QE32, QE32 = 0 [QE32] and observe that we can write QE32 as, QE32 = Z QE32 W,X,Θ QE32 Σ = 0 I16 I16 0 0 [QE32 W,X,Θ] QE32 W,X,Θ 0 0 [QE32 Σ ] where QE32 W,X,θ [[QE32 W,X,θ]1 [QE32 W,X,θ]2] and, [QE32 W,X,θ]1 = γ n0 0 0 0 η ζΘ (ζ η) Θ η ζ n1 0 0 0 ρW n1 0 0 0 0 0 0 0 0 0 0 0 n1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X n0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 W 0 0 0 0 0 0 η ζΘ γ n1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 W n1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 [QE32 W,X,θ]2 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Θ η ζ n1 ρW n1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 γ n0 0 0 0 0 0 0 0 0 0 0 0 0 X n0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X n0 0 0 0 0 0 0 0 0 Published as a conference paper at ICLR 2022 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Σ1/2 0 0 0 Σ1/2 (η ζ) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Σ1/2 0 n1Σρ n0 ρ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Σ1/2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Σ1/2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Σβ ρ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Σ1/2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (S292) The operator-valued Stieltjes transforms satisfy, GE32 = GE32 Σ ( Z RE32 W,X,Θ( GE32)) = id tr Z RE32 W,X,Θ( GE32) QE32 Σ 1 , (S293) where RE32 W,X,Θ( GE32) is the operator-valued R-transform of QE32 W,X,Θ. As discussed above, since QE32 W,X,Θ is a block matrix whose blocks are iid Gaussian matrices (and their transposes), an explicit expression for RE32 W,X,Θ( GE32) can be obtained from the covariance map η, which can be read off from eqn. (S288). As above, we utilize the specific sparsity pattern for GE32 that is induced by Eq. 
(S293), to obtain, RE32 W,X,Θ( GE32) = 0 RE32 W,X,Θ(GE32) RE32 W,X,Θ(GE32) 0 [RE32 W,X,θ(GE32)]1,1 = GE32 2,2 (ζ η) γ ρGE32 6,3 γ + GE32 2,7 (ζ η) (ζ η) [RE32 W,X,θ(GE32)]1,9 = GE32 7,2 (ζ η) γ ρGE32 11,3 γ + GE32 7,7 (ζ η) (ζ η) [RE32 W,X,θ(GE32)]1,16 = ρGE32 15,3 γ (S297) [RE32 W,X,θ(GE32)]2,2 = ψGE32 1,1 (ζ η) γφ + ρψGE32 4,5 (S298) [RE32 W,X,θ(GE32)]2,7 = ψGE32 9,1 (ζ η) γφ + ρψGE32 8,5 + ρψGE32 13,5 + ψGE32 1,1 (ζ η) (ζ η) [RE32 W,X,θ(GE32)]4,5 = ρGE32 2,2 (S300) [RE32 W,X,θ(GE32)]4,10 = ρGE32 7,2 (S301) [RE32 W,X,θ(GE32)]6,3 = ρGE32 1,1 γφ (S302) [RE32 W,X,θ(GE32)]6,12 = ρGE32 9,1 γφ (S303) [RE32 W,X,θ(GE32)]7,2 = ψGE32 1,9 (ζ η) γφ + ρψGE32 4,10 (S304) [RE32 W,X,θ(GE32)]7,7 = ψGE32 9,9 (ζ η) γφ + ρψGE32 8,10 + ρψGE32 13,10 + ψGE32 1,9 (ζ η) (ζ η) Published as a conference paper at ICLR 2022 [RE32 W,X,θ(GE32)]8,5 = ρGE32 2,7 (S306) [RE32 W,X,θ(GE32)]8,10 = ρGE32 7,7 (S307) [RE32 W,X,θ(GE32)]9,1 = GE32 2,7 (ζ η) γ ρGE32 6,12 γ (S308) [RE32 W,X,θ(GE32)]9,9 = GE32 7,7 (ζ η) γ ρGE32 11,12 γ (S309) [RE32 W,X,θ(GE32)]9,16 = ρGE32 15,12 γ (S310) [RE32 W,X,θ(GE32)]11,3 = ρGE32 1,9 γφ (S311) [RE32 W,X,θ(GE32)]11,12 = ρGE32 9,9 γφ (S312) [RE32 W,X,θ(GE32)]13,5 = ρGE32 2,7 (S313) [RE32 W,X,θ(GE32)]13,10 = ρGE32 7,7 (S314) [RE32 W,X,θ(GE32)]15,3 = ρGE32 1,16 γφ (S315) [RE32 W,X,θ(GE32)]15,12 = ρGE32 9,16 γφ , (S316) and the remaining entries of RE32 W,X,θ(GE32) are zero. As above, plugging these expressions into eqn. (S293) and explicitly performing the block-matrix inverse yields the following set of coupled equations, GE32 7,2 = γ2 ρ τ 2 1 ψGE32 8,5 + γ2 ρ τ 2 1 ψGE32 13,5 + γ τ 2 1 ψGE32 9,1 (ζ η) φ + γ2τ1 τ 2 1 ψ(ζ η) (ζ η) GE32 8,3 = I 1 2 ,1η γ τ1I 3 GE32 8,4 = γ τ1I1,1 (ρτ1ψ (ζ η) + φρ) GE32 8,5 = I1,1 (ρτ1ψ (ζ η) + φρ) ρψφ (S320) 2 ,1η γ τ1I 3 GE32 9,1 = γτ 2 1 GE32 7,2 (ζ η) γ ρτ 2 1 GE32 11,3 + γ2τ 2 1 τ1(ζ η) (ζ η) (S322) GE32 10,3 = ρφGE32 7,2 I 1 2 ,2 γρ3/2 τ 2 1 GE32 9,1 I 3 2 ,2 γ ρ τ1φ ψI 1 2 ,2ζ + ψI 1 2 ,2η + γ τ1I 3 GE32 10,4 = ρφGE32 7,2 I0,2 γρ3/2 τ 2 1 GE32 9,1 I1,2 γ2 ρ τ 2 1 I1,2 (ρτ1ψ (ζ η) + φρ) GE32 10,5 = ρτ1GE32 7,2 I1,2 ρ τ1GE32 9,1 I1,2 I1,2 (γ τ1φρ + xψζ xψη) GE32 10,6 = ρτ1GE32 7,2 I 1 2 ,2 ρ τ1GE32 9,1 I 1 2 ,2 + γ2ρτ1 τ 2 1 I 3 2 ,2 (η ζ) (S326) Published as a conference paper at ICLR 2022 GE32 11,3 = ρφGE32 7,2 I1,2 γρ3/2 τ 2 1 GE32 9,1 I2,2 γ ρ τ1φ ( ψI1,2ζ + ψI1,2η + γ τ1I2,2ρ) GE32 11,4 = ρφGE32 7,2 I 1 2 ,2 γρ3/2 τ 2 1 GE32 9,1 I 3 2 ,2 γ2 ρ τ 2 1 I 3 2 ,2 (ρτ1ψ (ζ η) + φρ) GE32 11,5 = ρτ1GE32 7,2 I 3 2 ,2 ρ τ1GE32 9,1 I 3 2 ,2 (γ τ1φρ + xψζ xψη) GE32 12,4 = ρτ1GE32 7,2 I 1 2 ,2 ρ τ1GE32 9,1 I 1 γ2ρτ1 τ 2 1 ρ ψ + x2 (ζ η) GE32 12,5 = ρGE32 9,1 I 1 2 ,2 γ + ρ3/2τ 2 1 GE32 7,2 I 3 2 ,2 φ + I 3 2 ,2 γρ2τ 2 1 τ1ψ (ζ η) + xφρ GE32 12,6 = ρGE32 9,1 I0,2 γ + ρ3/2τ 2 1 GE32 7,2 I1,2 φ + γρ2τ 2 1 τ1I1,2 (ζ η) x2I2,2ρ ψ ρφ (S332) GE32 13,3 = ρτ1GE32 7,2 I 3 2 ,2 ρ τ1GE32 9,1 I 3 2 ,2 + γ2ρτ1 τ 2 1 I 5 2 ,2 (η ζ) (S333) GE32 13,4 = ρτ1GE32 7,2 I1,2 ρ τ1GE32 9,1 I1,2 + I2,2 γ2ρτ1 τ 2 1 ρ ψ + x2 (ζ η) GE32 13,5 = ρGE32 9,1 I1,2 γ + ρ3/2τ 2 1 GE32 7,2 I2,2 φ + I2,2 γρ2τ 2 1 τ1ψ (ζ η) + xφρ GE32 13,6 = ρGE32 9,1 I 1 2 ,2 γ + ρ3/2τ 2 1 GE32 7,2 I 3 2 ,2 φ + γρ2τ 2 1 τ1I 3 2 ,2 (ζ η) x2I 5 ψ ρφ (S336) GE32 13,8 = x I1,1 GE32 14,3 = ρτ1Iβ 3 2 ,2GE32 7,2 + ρ τ1Iβ 3 2 ,2GE32 9,1 + xψIβ 3 2 ,2 (ζ η) γ2ρτ1 τ 2 1 Iβ 5 2 ,2ρ ρψ (S338) GE32 14,4 = ρτ1Iβ 1,2GE32 7,2 + ρ τ1Iβ 1,2GE32 9,1 Iβ 2,2 γ2ρτ1 τ 2 1 φρ + ψx2ζ ψx2η GE32 14,5 = Iβ 1,2GE32 9,1 γ ρτ 2 1 Iβ 2,2GE32 7,2 φ + Iβ 2,2 γρ2τ 2 1 τ1(η ζ) GE32 14,6 = Iβ 1 2 ,2GE32 9,1 γ ρτ 2 1 Iβ 3 2 ,2GE32 7,2 φ + γρ2τ 2 1 τ1Iβ 3 2 ,2 (η 
ζ) + x2Iβ 5 2 ,2ρ ψ ρφ (S341) GE32 14,8 = x Iβ 1,1 ρφ (S342) GE32 14,11 = τ1Iβ 1 2 ,1 φ (S343) GE32 14,13 = Iβ 0,1 ρ (S344) GE32 15,3 = ρτ1Iβ 2,2GE32 7,2 + ρ τ1Iβ 2,2GE32 9,1 + xψIβ 2,2 (ζ η) γ2ρτ1 τ 2 1 Iβ 3,2ρ ρψ (S345) GE32 15,4 = ρτ1Iβ 3 2 ,2GE32 7,2 + ρ τ1Iβ 3 2 ,2GE32 9,1 Iβ 5 2 ,2 γ2ρτ1 τ 2 1 φρ + ψx2ζ ψx2η Published as a conference paper at ICLR 2022 GE32 15,5 = Iβ 3 2 ,2GE32 9,1 γ ρτ 2 1 Iβ 5 2 ,2GE32 7,2 φ Iβ 5 2 ,2 γρ2τ 2 1 τ1ψ (ζ η) + xφρ GE32 15,6 = Iβ 1,2GE32 9,1 γ ρτ 2 1 Iβ 2,2GE32 7,2 φ + γρ2τ 2 1 τ1Iβ 2,2 (η ζ) + x2Iβ 3,2ρ ψ ρφ (S348) GE32 15,8 = x Iβ 3 2 ,1 ρφ (S349) GE32 15,10 = τ1Iβ 3 2 ,1 φ (S350) GE32 15,12 = Iβ 1,1 ρ (S351) GE32 15,14 = I 1 2 ,0 φ (S352) GE32 16,1 = τ 2 1 Iβ 1,1GE32 7,2 (ζ η) ρτ 2 1 Iβ 1,1GE32 11,3 + γτ 2 1 τ1Iβ 1,1(ζ η) (ζ η) ρτ1GE32 15,3 (S353) GE32 16,9 = τ1Iβ 1,1 (S354) GE32 1,1 = GE32 9,9 = γτ1 (S355) GE32 2,2 = GE32 7,7 = γ τ1 (S356) GE32 3,6 = GE32 12,11 = ρτ1I0,1 GE32 4,5 = GE32 13,10 = ρτ1I1,1 GE32 6,3 = GE32 11,12 = γ ρ τ1I1,1 (S359) GE32 11,6 = GE32 12,3 = ρτ1GE32 7,2 I1,2 ρ τ1GE32 9,1 I1,2 + γ2ρτ1 τ 2 1 I2,2ρ ψ + x I1,2 (η ζ) (S360) GE32 14,10 = GE32 15,11 = τ1Iβ 1,1 φ (S361) GE32 14,12 = GE32 15,13 = Iβ 1 2 ,1 ρ (S362) GE32 5,4 = GE32 10,8 = GE32 10,13 = γ ρ τ1I0,1 (S363) GE32 3,5 = GE32 4,6 = GE32 12,10 = GE32 13,11 = 2 ,1 φ (S364) GE32 4,3 = GE32 6,5 = GE32 11,10 = GE32 13,12 = I 1 2 ,1 (S365) GE32 8,8 = GE32 14,14 = GE32 15,15 = GE32 16,16 = 1 (S366) GE32 3,4 = GE32 5,6 = GE32 10,11 = GE32 12,8 = GE32 12,13 = x I 1 2 ,1 φ (S367) GE32 5,3 = GE32 6,4 = GE32 10,12 = GE32 11,8 = GE32 11,13 = γ ρ τ1I 1 2 ,1 (S368) GE32 3,3 = GE32 4,4 = GE32 5,5 = GE32 6,6 = GE32 10,10 = GE32 11,11 = GE32 12,12 = GE32 13,13 = I0,1 , (S369) Here we have used the relations in eqns. (S266)-(S287), the definition of Iβ a,b, as well as the results in Sec. G.4.1 to simplify the expressions. It is straightforward algebra to solve these equations for the undetermined entries of GE32 and thereby obtain the following expression for E32, E32 = (η ζ)A32 + ρB32 D32 , (S370) Published as a conference paper at ICLR 2022 A32 = ρ3τ1ψ2x4I1,1I2,2Iβ 2,2 + ρ2τ1ψx3I2,2Iβ 2,2(ρφ + xψ(ζ η)) ρ3τ1ψ2x3φI1,1I1,2Iβ 2,2 + ρ2τ1ψ2x2I1,1Iβ 1,1(η ζ) + ρ2τ1ψ2x2I1,1Iβ 2,2(ρ + x(ζ η)) + ρ2τ1ψx2φI1,2Iβ 2,2(ρφ + xψ(ζ η)) + ρ3τ1ψ2x2φI1,1I1,2Iβ 1,1 ρ2τ1ψxφI1,2Iβ 1,1(ρφ + xψ(ζ η)) + ρτ1ψx Iβ 1,1(ζ η)(ρφ + xψ(ζ η)) ρτ1ψx Iβ 2,2(ρ + x(ζ η))(ρφ + xψ(ζ η)) (S371) B32 = ρ2ψx6I2 2,2Iβ 3,2 2ρ2ψx5φI2 2,2Iβ 2,2 + 2ρψx4φI1,2Iβ 3,2(η ζ) 2ρ2ψx4φ2I1,2I2,2Iβ 2,2 + ρ2ψx4φ2I2 1,2Iβ 3,2 + ρ2ψx4φI2 2,2Iβ 1,1 + ρ2ψx4φI1,1I2,2Iβ 2,2 + ρ2x4I2,2Iβ 3,2(ψ + φ) + ρx3φI2,2Iβ 2,2(ρ(ψ + φ) + 2xψ(ζ η)) + ρ2ψx3φ2I1,2I2,2Iβ 1,1 + ρ2ψx3φ2I1,1I1,2Iβ 2,2 + ρψx2φI1,1Iβ 1,1(ζ η) ρx2φI2,2Iβ 1,1(ρφ + xψ(ζ η)) ρψx2φI1,1Iβ 2,2(ρ + x(ζ η)) ρ2ψx2φ2I1,1I1,2Iβ 1,1 + Iβ 3,2 x4ψ(ζ η)2 ρ2x2φ (S372) D32 = ρ3ψx4φI2 2,2 + 2ρ2ψx2φ2I1,2(η ζ) + ρ3ψx2φ3I2 1,2 + ρ3x2φI2,2(ψ + φ) + ρφ x2ψ(ζ η)2 ρ2φ . (S373) Further simplifications are possible using the raising and lowering identities in eqn. (S9), as well as the results in Sec. G.4.1, to obtain, φ Iβ 3,2 ρψ Iβ 1,1(ω + φI1,2)(ω + I1,1) + φ2 ψ γ τ1Iβ 1,2I2,2 + γτ1Iβ 2,2(ω + φI1,2) , where x γ = x γ + ργ(τ1ψ/φ + τ1)(ω + φI1,2) . (S375) The calculation of E4 proceeds exactly as in (Tripuraneni et al., 2021a;b) with the simple modification of including an additional factor Σβ inside the final trace term, yielding φ Iβ 3,2 . 
(S376)
G.5 FINAL RESULT FOR BIAS, VARIANCE, AND TEST ERROR
Putting the above pieces together, we have,
Bµ = φIβ 1,2 . (S377)
Iβ 1,1(ω + φI1,2)(ω + I1,1) + φ2 ψ γ τ1Iβ 1,2I2,2 + γτ1Iβ 2,2(ω + φI1,2) + σ2 ε (ω + φI1,2)(ω + I1,1) + φ ψ γ τ1I2,2 . (S378)
Some algebra shows that Eµ = Bµ + Vµ (S380) = γ(τ1(σ2 ε + Iβ 1,1)) τ 2 1 σ2 ε (S381) γ2τ 2 1 σ2 ε . (S382)
Corollary G.1. In the setting of Theorem 3.1, as the ridge regularization constant γ → 0, Eµ = Bµ + Vµ with Bµ = φIβ 1,2 and Vµ given by
$$
V_\mu \;\xrightarrow{\,\gamma\to 0\,}\; \frac{\min(\phi,\psi)}{|\phi-\psi|}\big(\sigma_\varepsilon^2 + I^\beta_{1,1}\big)
\;+\;
\begin{cases}
x\, I^\beta_{2,2} & \text{if } \phi < \psi,\\[4pt]
\dfrac{x\, I_{2,2}}{\omega + \phi I_{1,2}}\big(\sigma_\varepsilon^2 + I^\beta_{1,2}\big) & \text{otherwise},
\end{cases}
\qquad \text{(S383)}
$$
where x is the unique positive real root of x = min(1, φ/ψ)
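As a final numerical sketch (hypothetical parameters and a hypothetical discrete LJSD, used only for illustration), the ridgeless expressions of Corollary G.1 can be evaluated directly: the fixed point for x is obtained by iteration, and the bias and variance are then compared for an aligned versus an anti-aligned overlap profile q(λ) with the same λ-marginal and the same signal power E_µ[λq], illustrating the alignment ordering of Proposition 3.3. Here I_{a,b} = φ E_µ[λ^a/(φ + xλ)^b] and I^β_{a,b} = φ E_µ[qλ^a/(φ + xλ)^b], as in (12) and Corollary B.1.

```python
import numpy as np

# Hypothetical discrete LJSD and parameters (overparameterized regime: psi < phi).
lam = np.geomspace(1.0, 1e-2, 50)
p = np.full_like(lam, 1.0 / lam.size)
phi, psi, omega, sig_eps2 = 2.0, 0.5, 0.2, 0.05

def normalize(q):                        # fix E_mu[lambda * q] for a fair comparison
    return q / np.sum(p * lam * q)

q_aligned = normalize(lam)               # overlap increasing with eigenvalue
q_anti = normalize(1.0 / lam)            # overlap decreasing with eigenvalue

def I(a, b, x, q=None):
    w = np.ones_like(lam) if q is None else q
    return phi * np.sum(p * w * lam ** a / (phi + x * lam) ** b)

# Ridgeless fixed point x = min(1, phi/psi) / (omega + I_{1,1}(x)); it depends only on the
# lambda-marginal, so it is shared by both overlap profiles.
x = 1.0
for _ in range(500):
    x = min(1.0, phi / psi) / (omega + I(1, 1, x))

def bias_var(q):
    B = phi * I(1, 2, x, q)                              # B_mu = phi * I^beta_{1,2}
    V = min(phi, psi) / abs(phi - psi) * (sig_eps2 + I(1, 1, x, q))
    if phi < psi:                                        # underparameterized branch of (S383)
        V += x * I(2, 2, x, q)
    else:                                                # overparameterized branch of (S383)
        V += x * I(2, 2, x) / (omega + phi * I(1, 2, x)) * (sig_eps2 + I(1, 2, x, q))
    return B, V

for name, q in [("aligned", q_aligned), ("anti-aligned", q_anti)]:
    B, V = bias_var(q)
    print(f"{name:>13s}:  bias = {B:.4f}   variance = {V:.4f}   test error = {B + V:.4f}")
```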