# Minimum-Norm Interpolation Under Covariate Shift

Neil Mallinar*1, Austin Zane*2, Spencer Frei3, Bin Yu2,4

*Equal contribution. 1Department of Computer Science, University of California San Diego, CA, USA. 2Department of Statistics, University of California Berkeley, CA, USA. 3Department of Statistics, University of California Davis, CA, USA. 4Department of Electrical Engineering and Computer Sciences, University of California Berkeley, CA, USA. Correspondence to: Neil Mallinar, Austin Zane. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Transfer learning is a critical part of real-world machine learning deployments and has been extensively studied in experimental works with overparameterized neural networks. However, even in the simplest setting of linear regression, a notable gap still exists in the theoretical understanding of transfer learning. In-distribution research on high-dimensional linear regression has led to the identification of a phenomenon known as benign overfitting, in which linear interpolators overfit to noisy training labels and yet still generalize well. This behavior occurs under specific conditions on the source covariance matrix and input data dimension. Therefore, it is natural to wonder how such high-dimensional linear models behave under transfer learning. We prove the first non-asymptotic excess risk bounds for benignly-overfit linear interpolators in the transfer learning setting. From our analysis, we propose a taxonomy of beneficial and malignant covariate shifts based on the degree of overparameterization. We follow our analysis with empirical studies that show these beneficial and malignant covariate shifts for linear interpolators on real image data, and for fully-connected neural networks in settings where the input data dimension is larger than the training sample size.

1. Introduction

Practical deployments of machine learning models are almost always in a transfer learning setting, where models trained on a source data distribution with noisy labels are expected to perform well on a different target data distribution, referred to as the out-of-distribution (OOD) dataset (Oglic et al., 2022; D'Amour et al., 2022). There have been many experimental works on transfer learning with complex models and datasets (Recht et al., 2019; Koh et al., 2021; Miller et al., 2021; Hendrycks et al., 2021; Wenzel et al., 2022; Liang et al., 2023), but remarkably fewer attempts to study it theoretically, even in the simplest case of linear models, which have been of great interest in recent years (Dwivedi et al., 2020; Bartlett et al., 2020; Hastie et al., 2022; Tsigler & Bartlett, 2023). There has been extensive in-distribution (ID) theoretical interest in high-dimensional linear regression and specifically in interpolation, meaning a model reaches zero training loss (Belkin et al., 2019a;b). Frameworks such as benign overfitting, or harmless interpolation (Bartlett et al., 2020; Muthukumar et al., 2020), emerged as an attempt to explain why interpolating neural networks often do not overfit catastrophically (Zhang et al., 2017). They found that, in specific cases, overfitting can be benign, meaning that a model interpolates noisy training labels and yet has vanishing excess risk. In linear regression, this occurs if and only if the training (source) covariance matrix satisfies very specific conditions.
Under these conditions, the minimum-norm interpolator (MNI) approximately acts like a ridge regression solution. This sparked an initial wave of in-distribution theoretical research into benign overfitting in high-dimensional linear models (Chatterji et al., 2022; Tsigler & Bartlett, 2023; Chatterji & Long, 2023), kernel regression (Rakhlin & Zhai, 2019; Haas et al., 2023; Barzilai & Shamir, 2023; Belkin et al., 2018), and even some shallow neural networks (Frei et al., 2022a; Kou et al., 2023; Kornowski et al., 2023; Xu et al., 2024). Although these works were motivated by a desire to understand overfitting in modern deep learning, recent works have shown that in many practical settings of interest, overfitting is not benign (Mallinar et al., 2022; Haas et al., 2023; Lai et al., 2023). Thus, deeper investigations into the generalization behavior of overfit models are warranted.

Given the increasing prevalence of overparameterized models, it is natural to ask how such models perform in the transfer learning setting. There have been some efforts to answer this in the theoretically tractable cases of linear regression and random feature and kernel regression (Pathak et al., 2022; Wang, 2023). However, these works either provide asymptotic bounds that require the training sample size and data dimension to go to infinity at the same rate (Tripuraneni et al., 2021), study minimax settings which only consider worst-case risk (Lei et al., 2021), or focus on augmented gradient-based training algorithms, like importance weighting (Wang et al., 2022).

Summary of contributions. In this paper, we investigate the generalization behavior of the minimum ℓ2-norm linear interpolator (MNI) under distribution shifts when the source distribution satisfies the conditions necessary for benign overfitting. We summarize our main contributions as follows. We provide the first non-asymptotic, instance-wise risk bounds for covariate shifts in interpolating linear regression when the source covariance matrix satisfies benign overfitting conditions and commutes with the target covariance matrix. We use our risk bounds to propose a taxonomy of covariate shifts for the MNI. We show how the ratio of target eigenvalues to source eigenvalues and the degree of overparameterization affect whether a shift is beneficial or malignant, meaning OOD risk is better or worse than ID risk, respectively. The degree of overparameterization is determined by the eigenspectrum's head and tail properties. We empirically show that our taxonomy of shifts holds: (1) for the MNI on real image data under natural shifts like blur (a beneficial shift) and noise (a malignant shift), underscoring the significance of our findings beyond the idealized source and target covariances for which our theory is applicable; (2) for neural networks in settings where the input data dimension is larger than the training sample size, showing that our findings for the MNI are also reflective of the behavior of more complex models.

1.1. Prior Work and Comparisons to this Work

Excess risk analysis under distribution shifts: Tripuraneni et al. (2021) give an asymptotic analysis of high-dimensional random feature regression under covariate shift. They require the number of samples, n, the data dimension, p, and the random feature dimension to go to infinity at the same rate. In contrast, our non-asymptotic analysis considers finite sample cases and differing rates.
This allows us to draw new conclusions about how the degree of overparameterization changes the way in which interpolating linear models exhibit out-of-distribution (OOD) generalization. Additionally, our bounds let us analyze any sequence of eigenvalues for the target feature covariance matrix, which is not possible within their framework. Lei et al. (2021) study linear regression under distribution shifts in the minimax setting. Their minimax bounds consider the worst-case risk over an ℓ2-ball of target models, whereas we compute risk bounds specific to any model instantiation, with no restriction on the target model class. Furthermore, their experimental results only consider the underparameterized regime.

Several other works study OOD generalization in more distant settings. Wang et al. (2022) study linear interpolators for classification, when trained with gradient descent and importance weighting, whereas we consider the closed-form MNI for linear regression. Simchowitz et al. (2023) study covariate shifts when the target function class is the sum of two other function classes, and shifts are defined with regard to metric entropy between classes, whereas we focus on well-specified linear models. Pathak et al. (2022); Ma et al. (2023); Feng et al. (2023) consider covariate shift in kernel regression based on likelihood ("importance") ratios between source and target distributions, while we consider source and target eigenvalue ratios; eigenvalue ratios offer granular insights into feature-scale changes, whereas likelihood ratios capture shifts that affect the global data distribution. Pathak et al. (2022); Ma et al. (2023) also analyze worst-case, minimax risk for nonparametric function classes. Kausik et al. (2023) work in the proportional asymptotic regime and consider the errors-in-variables setting with noisy features and clean labels, while our work focuses on the linear regression setting with clean features and noisy labels. Finally, we note that risk bounds in these prior works do not sufficiently account for the behavior of the high-rank covariance tail that benign overfitting requires.

Experimental work on distribution shifts: Hendrycks & Dietterich (2019) propose the CIFAR-10C dataset as an OOD counterpart to CIFAR-10, featuring test set images corrupted by visual filters like blurs and noise. Koh et al. (2021) present benchmarks on more realistic datasets, with modern models, that can be seen "in the wild". Miller et al. (2021) experimentally show a linear relationship between ID accuracy and OOD accuracy for a wide range of modern neural networks and datasets, though their results show ID accuracy is almost always better than OOD accuracy. On a subset of CIFAR-10C, we find settings in which OOD accuracy is better than ID accuracy for linear interpolators.

Benign overfitting in-distribution: Bartlett et al. (2020) propose benign overfitting, give a non-asymptotic analysis of the MNI, and show specific, necessary conditions under which the MNI achieves zero excess risk in-distribution. Tsigler & Bartlett (2023) extend this work by considering benign overfitting in the case of ridge regression. Our proof techniques follow most closely the ideas presented in these two papers for the in-distribution setting. Frei et al. (2022a) show benign overfitting in shallow non-linear MLPs trained with gradient descent on the logistic loss if the data dimension grows faster than the number of training samples.
Mallinar et al. (2022) experimentally show that interpolating neural networks do not benignly overfit, due to the low input data dimension. Our experiments build on this by looking at settings in which n < p and n > p, where n is the training sample size and p is the input data dimension. Other works study benign overfitting under a variety of conditions (Kou et al., 2023; Chatterji & Long, 2023; Frei et al., 2023b).

2. Preliminaries

We extend notations in Bartlett et al. (2020) and Tsigler & Bartlett (2023) to the transfer learning setting with OOD generalization risk as our performance metric. Appendix A formalizes our setting of linear regression under distribution shift, and we provide necessary details here.

2.1. Linear Models for Source and Target Data

Let D_s and D_t be source and target distributions over (x, y) ∈ R^p × R. We consider linear regression problems defined as follows.

Definition 1 (Linear regression). Let the training dataset be comprised of n i.i.d. pairs (x_i, y_i)_{i=1}^n ∼ D_s^n concatenated into a data matrix X ∈ R^{n×p} and a response vector y_s ∈ R^n, where n < p. We define
1. the covariance matrix Σ_s = E_{D_s}[xx^⊤],
2. (centered rows) E_{D_s}[x] = 0,
3. (well-specified) the optimal parameter vector θ*_s ∈ R^p such that y = x^⊤θ*_s + ε_s for (x, y) ∼ D_s, where ε_s is a centered random variable with variance v²_{ε_s} and E_{D_s}[y | x] = x^⊤θ*_s.

We test on D_t with Σ_t, θ*_t, ε_t defined in the same way. Note that (x, y) is used to denote single observation pairs for both source and target data. We will differentiate between the two by explicitly denoting the distribution from which the pair is drawn. To facilitate our analysis, we introduce the following assumptions on the covariance matrices and the distribution of the data.

Assumption 2.1. For linear regression problems (Def. 1), with source and target covariance matrices Σ_s, Σ_t ∈ R^{p×p}, we assume:
1. (simultaneous diagonalizability) Σ_s and Σ_t commute; that is, there exists an orthogonal matrix V ∈ R^{p×p} such that V^⊤Σ_s V and V^⊤Σ_t V are both diagonal:
$$\Sigma_s = \mathbb{E}_{x\sim D_s}[xx^\top] = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p), \qquad \Sigma_t = \mathbb{E}_{x\sim D_t}[xx^\top] = \mathrm{diag}(\tilde\lambda_1, \tilde\lambda_2, \ldots, \tilde\lambda_p),$$
where λ_1 ≥ λ_2 ≥ ... ≥ λ_p and λ_i, λ̃_i ≥ 0 for all i;
2. (subgaussianity) the whitened observations z = Σ_s^{-1/2}x are centered i.i.d. vectors with independent coordinates and subgaussian norm σ_x; that is, for all γ ∈ R^p,
$$\mathbb{E}[\exp(\gamma^\top z)] \le \exp\big(\sigma_x^2\|\gamma\|^2/2\big).$$

Simultaneous diagonalizability is a common assumption in recent studies of high-dimensional linear regression (Lei et al., 2021; Kausik et al., 2023; LeJeune et al., 2024), and we show in Section 4 with experiments that our results hold even when this is violated. Subgaussianity is also frequently used in statistical learning theory research and encompasses a wide array of distributions of interest (Vershynin, 2018).

2.2. Min-Norm Interpolator and Target Excess Risk

Given a source data matrix X, the minimum-norm interpolator (MNI) for any vector ξ ∈ R^n is defined as
$$\hat\theta(\xi) := \operatorname*{argmin}_{\theta}\big\{\|\theta\|_2 : X\theta = \xi\big\} = X^\top(XX^\top)^{-1}\xi.$$
If we consider ξ = y_s, then we recover the MNI for the labels given by the response model, but our analysis will also involve implicit MNIs for different label vectors in R^n. The quantity that we seek to bound is the excess risk on the target distribution, which we define for an estimator θ ∈ R^p as
$$R(\theta, D_t) := \mathbb{E}_{D_t}\big[(y - x^\top\theta)^2 - (y - x^\top\theta_t^*)^2\big]. \qquad (1)$$
We now derive bounds for the target excess risk and its expectation over the source response noise. The proof of the following can be found in Appendix C.
Theorem 2.2 (Target excess risk decomposition). The excess risk of the MNI trained on the source data, when evaluated on the target distribution, satisfies
$$R(\hat\theta(y_s), D_t) \le 4B_1 + 4B_2 + 2V_{\varepsilon_s}, \qquad (2)$$
$$\mathbb{E}_{\varepsilon_s} R(\hat\theta(y_s), D_t) = B_1 + B_2 + \mathbb{E}_{\varepsilon_s} V_{\varepsilon_s} + 2(\theta_t^* - \theta_s^*)^\top \Sigma_t (\theta_s^* - \hat\theta(X\theta_s^*)),$$
where we define
$$B_1 := \|\theta_s^* - \theta_t^*\|^2_{\Sigma_t}, \qquad (3)$$
$$B_2 := \|\theta_s^* - \hat\theta(X\theta_s^*)\|^2_{\Sigma_t}, \qquad (4)$$
$$V_{\varepsilon_s} := \|\hat\theta(\varepsilon_s)\|^2_{\Sigma_t}, \qquad (5)$$
and ‖x‖²_M := x^⊤Mx.

We observe that B_1 is a deterministic model shift term and that no further analysis can improve its dependency on θ*_s, θ*_t, or Σ_t. The cross-term, (θ*_t − θ*_s)^⊤Σ_t(θ*_s − θ̂(Xθ*_s)), is dominated by the bias and variance, as evidenced by the upper bound. Therefore we focus our analysis on B_2 and V_{ε_s}. A useful normalized version of V_{ε_s} is defined by
$$V := V_{\varepsilon_s}/v^2_{\varepsilon_s}. \qquad (6)$$
Note that B_2 and V are reminiscent of the ID bias and variance in prior work (Bartlett et al., 2020; Tsigler & Bartlett, 2023).

2.3. Separation of Components and Effective Ranks

For an index k, we define the following quantities related to the effective rank of the tail of Σ_s (Tsigler & Bartlett, 2023):
$$\rho_k = \frac{\sum_{i>k}\lambda_i}{n\lambda_{k+1}}, \qquad R_k = \frac{\big(\sum_{i>k}\lambda_i\big)^2}{\sum_{i>k}\lambda_i^2}.$$
ρ_k measures the ratio of the energy of the source covariance tail to the number of training data observations, after normalizing the tail eigenvalues. R_k measures the quantity of noisy features and how evenly distributed their eigenvalues are. It is minimized when there is only one nonzero eigenvalue and maximized when there are many equal eigenvalues.

Benign overfitting occurs if the MNI is overfit to noisy training labels and yet ID excess risk decays to zero. The central finding of Bartlett et al. (2020) is that the only way benign overfitting happens for the MNI is if the following occurs: (1) there exists a k* = min{k : ρ_k ≥ b} for a universal constant b > 1, meaning that the last p − k* components of Σ_s have a high effective rank relative to the number of training samples, n; (2) the magnitudes of the bottom p − k* eigenvalues are small relative to the top k*; and (3) k* ≪ n. More formally, consider quantities p = p(n), a sequence of source covariance matrices Σ_n = diag(λ_1, ..., λ_p), k* = k*(n) as defined above, R_{k*} = R_{k*}(Σ_n), and ρ_k = ρ_k(Σ_n). A sufficient condition for benign overfitting is
$$\lim_{n\to\infty} \rho_0 = \lim_{n\to\infty} k^*/n = \lim_{n\to\infty} n/R_{k^*} = 0. \qquad (7)$$
If this occurs, then the MNI behaves similarly to an estimator with two components. One component has variance similar to the ordinary least squares (OLS) estimator in k* dimensions and bias similar to the ridge regression solution with ridge parameter proportional to Σ_{i>k*} λ_i, a sort of data-induced regularization. The other component is a high-dimensional component, which has vanishing variance when the data is sufficiently high-dimensional and a bias which is proportional to Σ_{i>k*} λ_i (θ*_s)_i² (Tsigler & Bartlett, 2023). From these conditions, we see that the top k* components are like signal components of the data and the bottom p − k* components are noise components.
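To make these objects concrete, here is a minimal NumPy sketch (ours, not from the paper) of the MNI closed form and the effective-rank quantities ρ_k and R_k; function names, the default threshold b, and the example spectrum are illustrative choices.

```python
import numpy as np

def min_norm_interpolator(X, xi):
    # MNI: theta_hat(xi) = X^T (X X^T)^{-1} xi, defined for n < p (XX^T invertible a.s.).
    return X.T @ np.linalg.solve(X @ X.T, xi)

def effective_ranks(lam, k, n):
    # lam: source eigenvalues sorted non-increasing (0-indexed, so lam[k] = lambda_{k+1}).
    # rho_k = sum_{i>k} lam_i / (n * lam_{k+1}),  R_k = (sum_{i>k} lam_i)^2 / sum_{i>k} lam_i^2.
    tail = lam[k:]
    rho_k = tail.sum() / (n * lam[k])
    R_k = tail.sum() ** 2 / (tail ** 2).sum()
    return rho_k, R_k

def smallest_benign_k(lam, n, b=1.0):
    # k* = min{k : rho_k >= b}; b is an illustrative stand-in for the universal constant.
    for k in range(min(n, len(lam) - 1)):
        rho_k, _ = effective_ranks(lam, k, n)
        if rho_k >= b:
            return k
    return None

# Example: a spiked source spectrum with k signal directions and a flat, low-energy tail.
n, p, k, delta, eps = 60, 2000, 10, 1.0, 1e-6
lam = np.concatenate([np.full(k, delta), np.full(p - k, eps)])
rho_k, R_k = effective_ranks(lam, k, n)
print(f"k* = {smallest_benign_k(lam, n)}, rho_k = {rho_k:.2f}, k/n = {k/n:.3f}, n/R_k = {n/R_k:.4f}")
```

For this flat-tailed example, ρ_k = (p − k)/n, so the tail has high effective rank relative to n and the sketch reports k* = k.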
2.4. Spiked Covariance Models

We will consider a special case of the (k, ϵ)-spike model, a canonical covariance structure that exhibits benign overfitting for the MNI (Chatterji et al., 2022; Chatterji & Long, 2023), to experimentally show properties of interest.

Definition 2 ((k, δ, ϵ)-spike model). For a source distribution D_s, δ > 0 and ϵ > 0 such that δ ≫ ϵ, let
$$\mathbb{E}_{x\sim D_s}[xx^\top] = \mathrm{diag}(\underbrace{\lambda_1, \ldots, \lambda_k}_{=\,\delta},\ \underbrace{\lambda_{k+1}, \ldots, \lambda_p}_{=\,\epsilon}).$$

In this simplified setting, there are k high-energy signal directions and p − k low-energy noise directions. For a target distribution D_t, we use different hyperparameters k, δ̃, ϵ̃ to similarly characterize a shifted covariance matrix.

3. Main Theorems

This section provides upper and lower bounds for the variance and bias terms in Equation 6 and Equation 4, respectively. Appendix D gives a high-level overview of our proof techniques. Subsequent appendices provide complete proofs. The variance bounds are adapted from Bartlett et al. (2020), while the bias lower bound is derived from Tsigler & Bartlett (2023). Our contributions include a novel bias upper bound and a unique characterization of overparameterization degrees. We start with the bounds for the variance term. Appendix E contains a proof of the following theorem.

Theorem 3.1 (Upper and lower bounds for the variance term). There exist universal constants b, c_1 > 1 given in Lemma B.1, a universal constant c_2 given in Lemma B.4, and a constant c > 1 that only depends on σ_x, c_1, c_2, such that for k ∈ (0, n/c), with probability at least 1 − 10e^{−n/c},
$$V \;\ge\; \underline{V} := \frac{1}{cn}\sum_{i=1}^{p} \frac{\tilde\lambda_i}{\lambda_i}\,\min\left\{1,\ \frac{\lambda_i^2}{\lambda_{k+1}^2(\rho_k+1)^2}\right\}. \qquad (8)$$
If in addition ρ_k ≥ b, with probability at least 1 − 7e^{−n/c},
$$V \;\le\; \overline{V} := c\left(\frac{1}{n}\sum_{i\le k}\frac{\tilde\lambda_i}{\lambda_i} \;+\; n\,\frac{\sum_{i>k}\tilde\lambda_i\lambda_i}{\big(\sum_{i>k}\lambda_i\big)^2}\right). \qquad (9)$$

We first note that the variance lower bound does not require ρ_k ≥ b, and so it holds for any interpolating linear model, even when benign source conditions are not satisfied. However, we will see that if ρ_k ≥ b for some k, then the upper and lower bounds are tight. In the case where Σ_t = Σ_s, these bounds reduce to their in-distribution counterparts (Bartlett et al., 2020). Our variance bounds show that the excess risk contribution of each feature is scaled by the ratio of the target and source eigenvalues, λ̃_i/λ_i. We immediately see that scaling down the target eigenvalues will lessen the overall contribution to variance and that scaling up the target eigenvalues will increase the contribution. We investigate these scaling factors and the separation of the first k components and last p − k components in Section 3.1.

We now state upper and lower bounds for the bias term, B_2, given in Equation 4. The proof of the following theorem can be found in Appendix F.

Theorem 3.2 (Upper and lower bounds for the bias term). For the lower bound only, assume that random models θ are obtained from the underlying θ*_s as (θ)_i = γ_i(θ*_s)_i, where each γ_i is an independent Rademacher random variable. There exists a universal constant b > 1, constants c, C that depend only on b and σ_x, and k < n/C such that if ρ_k ≥ b, then with probability at least 1 − 10e^{−n/c},
$$\mathbb{E}_{\theta} B_2 \;\ge\; \underline{B_2} := \frac{1}{c}\left(\sum_{i\le k} \frac{\tilde\lambda_i(\theta_s^*)_i^2}{\big(1 + \frac{\lambda_i}{\lambda_{k+1}\rho_k}\big)^{2}} \;+\; \sum_{i>k}\tilde\lambda_i(\theta_s^*)_i^2\right).$$
If we assume that p is at most exponential in n, then with probability at least 1 − 5e^{−n/c},
$$B_2 \;\le\; \overline{B_2} := c\,\|\theta_s^*\|^2\sum_{i=1}^{p}\frac{\tilde\lambda_i}{\big(1 + \frac{\lambda_i}{\lambda_{k+1}\rho_k}\big)^{2}}.$$

Note that while the lower bound is in expectation over the random models θ, the resulting expression only depends on the ground-truth θ*_s. This Bayesian approach also appears in prior work, i.e., Tsigler & Bartlett (2023). In studying the bias lower bound, we observe a similar separation of signal and noise components and dependence on eigenvalue ratios as in the variance bounds.

To show tightness of our bounds, we assume there exists a k such that ρ_k ≥ b for some universal constant b > 1. When this condition is satisfied, the variance bounds are tight up to constant factors. The bias bounds leave a model-dependent and source covariance-dependent gap, which we discuss in the proof overview in Appendix D and in the complete proof found in Appendix G.
Theorem 3.3 (Tightness of variance and bias bounds). Let the lower and upper bounds of V be given by \underline{V} and \overline{V}, respectively. There exists a universal constant b ≥ 1, the constant c defined in Theorem 3.1, and k ∈ (0, n/c) such that if ρ_k ≥ b, then
$$\frac{\underline{V}}{\overline{V}} \;\ge\; \min\left\{\frac{b^{-2}(1+b)^{-2}}{c^{2}},\ 1\right\}.$$
Let the lower and upper bounds of B_2 be given by \underline{B_2} and \overline{B_2}, respectively, and let the assumptions of Theorem 3.2 be satisfied. Then
$$\frac{\underline{B_2}}{\overline{B_2}} \;\ge\; \frac{\min_i\{(\theta_s^*)_i^2 : (\theta_s^*)_i \ne 0\}}{\|\theta_s^*\|^2}\,\Big(1 + b^{-1}\frac{\lambda_1}{\lambda_{k+1}}\Big)^{-2}.$$
Note that the gap between our bias upper and lower bounds is independent of the target distribution.

3.1. A Taxonomy of Shifts

We now present a taxonomy of covariate shifts on the target distribution inspired by our prior analysis. We first consider OOD generalization and formally categorize shifts as beneficial or malignant.

Definition 3 (Beneficial and Malignant shifts). For a source distribution D_s, a target distribution D_t, excess risk R, and MNI θ̂, we say that a shift is
1. beneficial if R(θ̂, D_s) > R(θ̂, D_t),
2. malignant if R(θ̂, D_s) < R(θ̂, D_t).

We define these shifts for excess risk and note in Appendix J.1 that, empirically, the variance is the primary contributor to excess risk and the bias contributions are negligible when Σ_s satisfies benign overfitting conditions. This is in keeping with prior literature that focuses on studying variance in interpolating methods (Bartlett et al., 2020). We will thus focus on variance in the following discussion.

Prior work shows that if n, p → ∞ at the same rate, tr(Σ_s) < tr(Σ_t) results in malignant shifts on excess risk and tr(Σ_s) > tr(Σ_t) results in beneficial shifts on excess risk (Tripuraneni et al., 2021). In this section we generalize these conditions by considering differing rates of n, p and measuring overparameterization by the modified effective rank measure R_k/n rather than p. This leads us to a novel characterization of the role of overparameterization in covariate shifts. For completeness, we describe their trace conditions in terms of our shifts in Appendix H.1.

We first consider a simplified example with separate multiplicative shifts on the signal and noise components to facilitate intuition. While this is a special case, it provides valuable insights into the dynamics of overparameterization and covariate shift that are relevant to practice. Our general results, which allow for arbitrary multiplicative shifts in every direction, are presented in Appendix H.3.

[Figure 1: ln excess risk vs. ln input dimension for the MNI under the spiked covariance model; curves: ID, OOD (α = 2.00, β = 0.10), OOD (α = 0.10, β = 2.00).] Figure 1. We experiment with the (k, δ, ϵ) spiked covariance models and examine conditions for beneficial and malignant shifts as given in Theorem 3.4. We take n = 60, k = 10, δ = 1.0, ϵ = 1e−6, δ̃ = 2.0, ϵ̃ = 1e−7, and vary p. We see a cross-over from mild to severe overparameterization on the right side of p = n where both OOD shifts swap between beneficial and malignant. For both ID and OOD curves, we observe that excess risk is a decreasing function of input dimension. Curves are averaged over 100 independent runs.

Let Σ_s be a covariance matrix that satisfies benign source conditions and denote the ID variance term by V_id. Define Σ_t by λ̃_i = αλ_i for i ≤ k, and λ̃_i = βλ_i for i > k, with α, β ≥ 0. Let V_ood denote the OOD variance term. By Theorem 3.1, up to constants,
$$V_{\mathrm{id}} \asymp k/n + n/R_k, \qquad (10)$$
$$V_{\mathrm{ood}} \asymp \alpha(k/n) + \beta(n/R_k), \qquad (11)$$
where R_k = (Σ_{i>k}λ_i)²/(Σ_{i>k}λ_i²). It is clear that if V_ood − V_id > 0 then we have a malignant shift on the variance, and if V_ood − V_id < 0 then we have a beneficial shift on the variance.
Observe that, up to constants,
$$V_{\mathrm{ood}} - V_{\mathrm{id}} \;\asymp\; (\alpha - 1)(k/n) + (\beta - 1)(n/R_k). \qquad (12)$$
In this expression, we see that the scales of signal and noise shifts, α and β, are important, as is the relationship between k/n (the classical rate) and n/R_k (the high-dimensional rate). The quantity n/R_k can be interpreted as an inverse measure of overparameterization, where smaller values correspond to higher levels of overparameterization. The rate of overparameterization relative to the classical rate of k/n determines whether the shift on the first k components (α) or the shift on the last p − k components (β) contributes more to the difference in excess risk. Based on this intuition, we define two regimes of overparameterization: mild and severe.

Definition 4 (Mild and severe overparameterization for multiplicative shifts). Let Σ_s be a source covariance that satisfies benign source conditions and let k < n. Define Σ_t as λ̃_i = αλ_i for i ≤ k and λ̃_i = βλ_i for i > k, with α, β ≥ 0. Let
$$C_{\alpha\beta} := \left|\frac{\alpha - 1}{\beta - 1}\right|.$$
We are in the mildly overparameterized regime if n/R_k = ω(C_{αβ} · k/n). We are in the severely overparameterized regime if n/R_k = o(C_{αβ} · k/n). For β = 1 we define C_{αβ} = ∞ and thus are effectively in the severely overparameterized regime with regard to the types of shift we observe.

It is clear that k is important in defining regimes of overparameterization. The aforementioned definitions hold for any k < n; however, we derive our taxonomy of shifts in the case in which k < n is such that ρ_k ≥ b for a universal constant b > 1. We note that even for non-linear models or settings that do not exhibit benign overfitting, we can still think about a notion of a k akin to the intrinsic dimension of the data. We empirically show in Figures 6 and 7 that our taxonomy of shifts is reflective of shift behavior in realistic settings by heuristically taking k small enough to sufficiently capture the low-dimensional signal in the data.

In Equation 12, we see that the limit of the severe overparameterization regime would take R_k → ∞ first, while holding other problem parameters fixed. In this case, we are only left with α shifts on the top k components, as any shift on the bottom components is suppressed by the high-rank covariance tail. This leads to behaviors akin to classical intuitions for an underparameterized linear regression estimator where k = p < n. In this case, α > 1 leads to more variance and thus harder learning, whereas α < 1 leads to less variance and thus easier learning. These notions of hard vs. easy learning naturally correspond to tr(Σ_t) > tr(Σ_s) and tr(Σ_t) < tr(Σ_s), respectively. This is shown experimentally in Figs. 1 and 7 by looking at the left and right sides of the figures.

On the other hand, in the mildly overparameterized regime, covariance tail shifts are not sufficiently suppressed and lead to non-negligible interactions with shifts on the signal components. An increase in energy in the signal components can be counteracted by a decrease in energy in the noise components, effectively increasing the contrast between signal and noise in favor of the signal. Similarly, a decrease in energy in the signal components can be harmfully counteracted by an increase in energy in the noise components. This is visible in Figs. 1 and 7 just above the threshold of interpolation. Interestingly, in the mildly overparameterized regime we can also define settings in which tr(Σ_t) > tr(Σ_s) and yet still obtain a beneficial shift, and settings in which tr(Σ_t) < tr(Σ_s) and yet still obtain malignant shifts.
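The following NumPy sketch (ours) illustrates how the proxies in (10)-(12) and the C_{αβ} threshold of Definition 4 classify a multiplicative shift; the ω/o asymptotics are replaced by a finite-sample comparison and all constants are ignored, so the labels are heuristic.

```python
import numpy as np

def variance_proxies(lam, k, n, alpha, beta):
    # Up-to-constants proxies from (10)-(11): V_id ~ k/n + n/R_k,  V_ood ~ alpha*k/n + beta*n/R_k.
    tail = lam[k:]
    R_k = tail.sum() ** 2 / (tail ** 2).sum()
    return k / n + n / R_k, alpha * (k / n) + beta * (n / R_k), R_k

def classify_shift(lam, k, n, alpha, beta):
    """Heuristic labels following Definition 4 and Theorem 3.4 (constants ignored)."""
    v_id, v_ood, R_k = variance_proxies(lam, k, n, alpha, beta)
    shift = "beneficial" if v_ood < v_id else ("malignant" if v_ood > v_id else "neutral")
    if beta == 1.0:
        regime = "severe"  # C_ab = infinity by convention
    else:
        c_ab = abs(alpha - 1.0) / abs(beta - 1.0)
        regime = "mild" if n / R_k > c_ab * k / n else "severe"
    return shift, regime

# Spiked example with alpha > 1, beta < 1: the label flips from beneficial (mild regime)
# to malignant (severe regime) as the tail grows, mirroring the cross-over in Figure 1.
n, k, delta, eps, alpha, beta = 60, 10, 1.0, 1e-6, 2.0, 0.1
for p in (80, 200, 2000, 20000):
    lam = np.concatenate([np.full(k, delta), np.full(p - k, eps)])
    print(p, classify_shift(lam, k, n, alpha, beta))
```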
[Figure 2: test excess MSE vs. train label noise variance for ID (λ_i ∝ i^{-1} ln^{-3}(i+1)), OOD Beneficial (λ̃_i ∝ i^{-1} ln^{-4}(i+1)), and OOD Malignant (λ̃_i ∝ i^{-1} ln^{-2}(i+1)) test distributions. Panels: (a) n = 200, p = 20, h = 512; (b) n = 200, p = 20, h = 2048; (c) n = 200, p = 2k, h = 512; (d) n = 200, p = 2k, h = 2048.] Figure 2. We train 3-layer ReLU dense neural networks with hidden width h on n samples from p-dimensional Gaussians. ID test data is sampled from the same distribution and OOD test sets are constructed based on beneficial and malignant covariate shifts in our theory. Ground truth models are sampled as θ*_s ∼ S^{p−1}; no model shift is invoked. For training data X, train labels are given by y_s = Xθ*_s + ε_s with label noise ε_s ∼ N(0, σ²). All runs reach train loss < 5e−6. Points are averaged over 20 independent runs with standard error bars reported.

We formalize these observations in the following theorem, the proof of which can be found in Appendix H.2.

Theorem 3.4 (Beneficial and Malignant Multiplicative Shifts on Variance). Let Σ_s be a source covariance that satisfies benign source conditions. That is, there exists k such that ρ_k ≥ b for a universal constant b > 1. Define Σ_t as λ̃_i = αλ_i for i ≤ k and λ̃_i = βλ_i for i > k, with α, β ≥ 0.
1. If α < 1, β ≤ 1 or α ≤ 1, β < 1, then we obtain a beneficial shift in variance.
2. If α > 1, β ≥ 1 or α ≥ 1, β > 1, then we obtain a malignant shift in variance.
3. If we are in the mildly overparameterized regime: α > 1 and β < 1 leads to beneficial shifts; α < 1 and β > 1 leads to malignant shifts.
4. If we are in the severely overparameterized regime: α > 1 and β < 1 leads to malignant shifts; α < 1 and β > 1 leads to beneficial shifts.

Figs. 1 and 5 demonstrate the relationship between the n/R_k and k/n rates in the cases of C_{αβ} = 1.11 and C_{αβ} = 1, respectively, for spiked covariance models. In both, we clearly see a cross-over from beneficial to malignant shifts when we transition from mild to severe overparameterization.

Overparameterization improves OOD robustness. Focusing on just the target excess risk, let α = α(n) and β = β(n, p). We say that the benignly-overfit MNI is robust if its excess risk decays to zero despite the presence of multiplicative covariate shifts. In order for the variance upper bound to decay to 0, it is sufficient to have the shifts in the signal and noise components satisfy α = o(n/k) and β = o(R_k/n). The condition β = o(R_k/n) allows the shift strength to increase at a rate determined by the level of overparameterization, so we conclude that increasing the amount of overparameterization improves robustness to multiplicative distribution shifts. Note that α has no dependence on R_k, and so robustness to shifts on the signal components is independent of the degree of overparameterization.

4. Experiments

Our theoretical results have provided insight into distribution shifts in high-dimensional linear regression. We now present experiments with linear models and neural networks, relaxing many of the assumptions used for theoretical results.
Specifically, we empirically: (1) observe beneficial and malignant shifts on synthetic and real data for linear models (benignly overfit and otherwise) and even high-dimensional dense neural networks; (2) show the benefit of overparameterization in covariate shift for interpolating linear estimators; (3) validate that our findings hold when the source and target covariance matrices are not simultaneously diagonalizable, as well as under model misspecification; (4) provide experimental insight that high-dimensional neural networks, i.e., when the input data dimension is large relative to the training sample size, act similarly to the MNI, whereas low-dimensional neural networks do not, regardless of the level of overparameterization. Details of experimental setup, data, and models are given in Appendix I. We now discuss key observations and takeaways from the experiments.

[Figure 3: panels (a) and (d) show eigenvalue index i vs. log eigenvalue ln(λ̃_i) for the "noise then blur" and "blur then noise" covariances at severities 1-5 compared with the in-distribution spectrum; panels (b), (c), (e), (f) show test excess classification error vs. train label noise for "noise then blur" (n = 500 and n = 2k) and "blur then noise" (n = 500 and n = 2k).] Figure 3. We experiment with a custom variant of CIFAR-10C in which we apply the blur and noise image filters directly to the test set images of CIFAR-10 at each severity level, e.g., Severity 1 means that we add a small amount of noise and a small amount of blurring to the image. In the top row we first use the noise filter and then the blur filter. In the bottom row we first use the blur filter and then the noise filter. In (a) and (d), we observe that the eigenvalue decay of the shifts is non-monotonic and mirrors the α < 1, β > 1 setting in our taxonomy. Indeed, we also see in (b) and (e) that when we are severely overparameterized the noisy tail effects appear to be suppressed and we still obtain beneficial shifts. On the other hand, in (c) and (f) we are in the mildly overparameterized regime and observe that the noisy tail effects hurt generalization, even for severity 4 in the top row which only adds a small amount of noise in the tail. These results are exactly in keeping with our taxonomy for the α < 1, β > 1 case. All curves are averaged over 50 independent runs.

4.1. Synthetic Data Experiments

Fig. 1 shows excess risk vs. input dimension for data sampled from the (k, δ, ϵ)-spike covariance model with k = 10, δ = 1.0, and ϵ = 1e−6. Beneficial and malignant shifts are seen in the setting of Theorem 3.4 with α = 2.0, β = 0.1. That is, we see two cross-over points: one in the underparameterized regime and one in the overparameterized regime (going from mild to severe). This suggests that non-negligible covariance tail effects are a property of shifting when a model is in a region around the double descent peak. The further we are from the double descent peak, the more classical our behavior is, in that the top k components are the only ones that influence shift and the bottom p − k components either don't exist or have negligible effects.
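As a complement to Fig. 1, the following self-contained simulation sketch (ours) evaluates the MNI under the spiked covariance model with a multiplicative shift (α, β) = (2, 0.1). It is only meant to reproduce the qualitative ID/OOD cross-over as p grows, not the exact curves; the noise level, seed, and grid of dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def spiked_eigs(p, k, delta, eps):
    return np.concatenate([np.full(k, delta), np.full(p - k, eps)])

def mni(X, y):
    return X.T @ np.linalg.solve(X @ X.T, y)

def excess_risk(theta_hat, theta_star, lam_target):
    # ||theta_hat - theta*||^2_{Sigma_t} for a diagonal target covariance (no model shift).
    d = theta_hat - theta_star
    return float(np.sum(lam_target * d * d))

n, k, delta, eps, noise_sd = 60, 10, 1.0, 1e-6, 0.5
alpha, beta = 2.0, 0.1  # target spectrum: alpha*delta on the spike, beta*eps on the tail

for p in (100, 400, 1600, 6400):
    lam_s = spiked_eigs(p, k, delta, eps)
    lam_t = spiked_eigs(p, k, alpha * delta, beta * eps)
    risks_id, risks_ood = [], []
    for _ in range(20):
        theta_star = rng.standard_normal(p)
        theta_star /= np.linalg.norm(theta_star)          # ground truth on the unit sphere
        X = rng.standard_normal((n, p)) * np.sqrt(lam_s)  # Gaussian data with covariance diag(lam_s)
        y = X @ theta_star + noise_sd * rng.standard_normal(n)
        theta_hat = mni(X, y)
        risks_id.append(excess_risk(theta_hat, theta_star, lam_s))
        risks_ood.append(excess_risk(theta_hat, theta_star, lam_t))
    print(p, np.mean(risks_id), np.mean(risks_ood))
```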
Appendix J.1 explores this setup for different values of α, β and for interpolating linear models with eigendecay rates that lead to harmful interpolation, i.e., tempered or catastrophic overfitting (Mallinar et al., 2022). Figs. 2 and 8 show similar results for 3-layer dense ReLU neural networks trained until near-interpolation (train MSE < 5e−6) on synthetic data with benign overfitting eigendecay rates (Bartlett et al., 2020). For the neural network, we consider p to be the dimension of the input data, rather than the number of network parameters. From Fig. 2 we observe similar trends predicted by our theory for beneficial and malignant shifts when p > n, indicating that while our theory is developed for linear models we are able to extrapolate to more complex models. In both the p > n and p < n experiments, our results are agnostic to the hidden width of the network, further suggesting that overparameterization is qualitatively different from high-dimensionality. When p > n, a neural network appears to act like the interpolating MNI under distribution shifts. For p < n, the interpolating dense net does not exhibit the properties of an interpolating MNI under distribution shift and the ID excess error is better than both beneficial and malignant OOD excess errors. However, the relative difference between beneficial and malignant shifts is still preserved. Note that we observe the exact same behavior in Fig. 12 when training ResNets to interpolation on CIFAR-10 and testing on CIFAR-10C blur and noise corruptions (Hendrycks & Dietterich, 2019).

4.2. CIFAR-10 Experiments

Next, we consider experiments with linear interpolators on a binarized CIFAR-10 and CIFAR-10C with Gaussian noise and blur corruptions at varying levels of corruption severity. For details, see Appendix I. This experiment breaks the assumption of simultaneous diagonalizability, as well as the well-specified assumption, since the labels for CIFAR are not obtained by a ground-truth linear model.

Fig. 9 shows empirical results on the MNI fit to this problem. We plot the eigenspectra of the blur and noise covariances from CIFAR-10C compared to the eigenspectra of CIFAR-10 in Figs. 9a and 9b. We identified these two shifts due to their eigenspectra reflecting what we expect to lead to beneficial and malignant shifts based on Theorem 3.4. We observe a tight relationship between changes in the eigenspectra of the target data and excess classification error when evaluated by the MNI. We notice that blurs reduce covariance energy with increased blurring, like a "denoising"-style operation. Experimentally this leads to improved OOD accuracy. In contrast, noise corruptions add energy to the covariance tail and lead to worsened OOD accuracy. Fig. 11 also shows that further overparameterization in this setting leads to improved behavior of the MNI on both corruptions.

Fig. 3 extends these results to a more realistic setting with custom variants of CIFAR-10C that involve applying both noise and blur filters to test set images. Using both blur and noise filters together leads to covariate shifts that feature non-monotonic behaviors when comparing source to target eigenvalues. This experiment highlights the α < 1, β > 1 case and we see that the OOD accuracy matches the predictions of our taxonomy based on whether we are in the mildly overparameterized regime (n = 500) or the severely overparameterized regime (n = 2k).
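For readers who want to probe their own corrupted datasets in this way, the following diagnostic (our heuristic, not a procedure from the paper) projects an empirical target second-moment matrix onto the source eigenbasis and summarizes head vs. tail eigenvalue ratios, which is roughly how we read Figs. 3 and 9 through the lens of Theorem 3.4. The function name, its return keys, and the use of medians are illustrative choices.

```python
import numpy as np

def spectral_shift_summary(X_src, X_tgt, k):
    """Heuristic diagnostic: when source and target covariances are not simultaneously
    diagonalizable (as with CIFAR-10 / CIFAR-10C), the per-direction "ratios" below are
    only an approximation obtained in the source eigenbasis."""
    X_src = X_src - X_src.mean(axis=0)
    X_tgt = X_tgt - X_tgt.mean(axis=0)
    cov_s = X_src.T @ X_src / len(X_src)
    cov_t = X_tgt.T @ X_tgt / len(X_tgt)
    lam_s, V = np.linalg.eigh(cov_s)                  # ascending order
    order = np.argsort(lam_s)[::-1]
    lam_s, V = lam_s[order], V[:, order]
    lam_t = np.einsum("ij,jk,ki->i", V.T, cov_t, V)   # diag(V^T Sigma_t V)
    ratio = lam_t / np.maximum(lam_s, 1e-12)
    return {"alpha_like": float(np.median(ratio[:k])),   # shift on the top-k signal directions
            "beta_like": float(np.median(ratio[k:]))}    # shift on the remaining noise directions

# Usage sketch: X_clean and X_corrupted are (num_images, num_pixels) arrays. Blur-like
# corruptions tend to give beta_like < 1, while additive noise tends to give beta_like > 1.
```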
In Fig. 10 we show plots for an artificially constructed version of this same experiment in which Gaussian noise is injected into the high-variance directions of the blur test set, simulating the equivalent of α > 1, β < 1 in our taxonomy. We similarly find the OOD accuracy to match the predictions of our taxonomy.

5. Conclusion and Future Work

Our work provides the first finite-sample, instance-wise analysis of the MNI under transfer learning with high-dimensional linear models. We show a taxonomy of beneficial and malignant covariate shifts depending on whether we are in a mildly or severely overparameterized regime. In the mildly overparameterized regime, variance contributions on the top k components interact with those of the bottom p − k components in non-negligible ways, leading to non-standard shifts. In the severely overparameterized regime, the high-rank covariance tail suppresses variance contributions in the bottom p − k components and so OOD generalization acts more "classical", akin to underparameterized linear regression where k = p < n.

Benign overfitting literature commonly claims to be motivated by overparameterized neural networks, referring to the number of parameters in the network rather than the dimension of the data. However, recent works have challenged this, suggesting the role of the ambient dimension and source covariance is more important than parameter count in determining whether overfitting is benign or catastrophic in neural networks (Frei et al., 2023a; Kornowski et al., 2023). Prior work has also shown that gradient descent on 2-layer neural networks has an implicit bias towards linear decision boundaries when p ≫ n, independent of the degree of overparameterization (Frei et al., 2022b). Our experiments further support the view that high-dimensional neural networks behave similarly to high-dimensional linear models, whereas low-dimensional neural networks do not. They provide a new and important perspective on the difference between high-dimensionality and overparameterization in neural networks in the case of distribution shift, which has yet to be appreciated in the literature. While dimensionality and degree of overparameterization are inextricably linked in linear regression, practical deep learning tends to operate in the overparameterized setting, not the high-dimensional one.

An important future direction is to investigate the extent to which our results hold for distribution shifts on more complex high-dimensional datasets. It is also of interest to extend our finite-sample theoretical analysis to shallow ReLU neural networks, other nonlinear models, and learning algorithms that overfit in a tempered manner (Mallinar et al., 2022). Finally, future work might seek to extend our understanding of neural networks by carefully studying the interplay between the data dimension, number of network parameters, number of training samples, and the optimization algorithm and loss function, and how this interplay can affect ID & OOD generalization.

Another important future direction is to relax key assumptions in this work. These results rely on a simultaneous diagonalizability assumption on the source and target covariance matrices which frequently appears in related works on high-dimensional linear regression (Lei et al., 2021; Kausik et al., 2023; LeJeune et al., 2024). This enables us to highlight the different effects of covariate shifts in the signal components vs. in the noise components.
In contrast, prior works (including those which can tolerate target covariances that do not satisfy simultaneous diagonalizability (Tripuraneni et al., 2021; Lei et al., 2021)) all rely on averages or traces over matrices involving the target covariance. Exploring techniques that can effectively handle non-simultaneously diagonalizable covariance matrices while preserving the insights gained from separating the impact of covariate shifts in signal and noise components is a promising avenue for future work.

Acknowledgements

The authors thank Alexander Tsigler, Peter Bartlett, Emmanuel Abbe, and Libin Zhu for useful discussion and feedback on the manuscript. We gratefully acknowledge partial support from NSF grants DMS-2209975, 2015341, NSF grant 2023505 on Collaborative Research: Foundations of Data Science Institute (FODSI), the NSF and the Simons Foundation for the Collaboration on the Theoretical Foundations of Deep Learning through awards DMS-2031883 and 814639, and NSF grant MC2378 to the Institute for Artificial Cyber Threat Intelligence and OperatioN (ACTION). NM gratefully acknowledges funding and support for this research from the Eric and Wendy Schmidt Center at The Broad Institute of MIT & Harvard. AZ additionally acknowledges support from NSF RTG Grant #1745640. This work used Delta GPU compute nodes at NCSA and HPE and Expanse GPU compute nodes at Dell and SDSC through allocation CIS220009 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. Real-world machine learning deployments are always in a transfer learning setting, where testing data reflects various types of distribution shifts. The broader impact of our work is our contribution of a deeper understanding of covariate shifts for high-dimensional linear models and showing, experimentally, when our intuitions carry over to neural networks and when they do not.

References

Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063-30070, 2020.

Barzilai, D. and Shamir, O. Generalization in kernel regression under realistic assumptions. Preprint, arXiv:2312.15995, 2023.

Belkin, M., Hsu, D. J., and Mitra, P. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Belkin, M., Hsu, D. J., Ma, S., and Mandal, S. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849-15854, 2019a.

Belkin, M., Rakhlin, A., and Tsybakov, A. B. Does data interpolation contradict statistical optimality? In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019b.

Boerner, T. J., Deems, S., Furlani, T. R., Knuth, S. L., and Towns, J. ACCESS: Advancing innovation: NSF's advanced cyberinfrastructure coordination ecosystem: Services & support. In Practice and Experience in Advanced Research Computing (PEARC '23), July 2023. doi: 10.1145/3569951.3597559.

Chatterji, N. S. and Long, P. M. Deep linear networks can benignly overfit when shallow ones do. Journal of Machine Learning Research, 24(117):1-39, 2023.

Chatterji, N.
S., Long, P. M., and Bartlett, P. L. The interplay between implicit bias and benign overfitting in two-layer linear networks. Journal of Machine Learning Research, 23(263):1 48, 2022. D Amour, A. et al. Underspecification presents challenges for credibility in modern machine learning. Journal of Machine Learning Research, 23(226):1 61, 2022. Dwivedi, R., Singh, C., Yu, B., and Wainwright, M. J. Revisiting minimum description length complexity in overparameterized models. Preprint, ar Xiv:2006.10189, 2020. Feng, X., He, X., Wang, C., Wang, C., and Zhang, J. Towards a unified analysis of kernel-based methods under covariate shift. In Conference on Neural Information Processing Systems (Neur IPS), 2023. Frei, S., Chatterji, N. S., and Bartlett, P. Benign overfitting without linearity: Neural network classifiers trained by gradient descent for noisy linear data. In Conference on Learning Theory (COLT), 2022a. Frei, S., Vardi, G., Bartlett, P. L., Srebro, N., and Hu, W. Implicit bias in leaky relu networks trained on highdimensional data. In International Conference on Learning Representations (ICLR), 2022b. Frei, S., Chatterji, N. S., and Bartlett, P. L. Random feature amplification: Feature learning and generalization in neural networks. Journal of Machine Learning Research, 24 (303):1 49, 2023a. Frei, S., Vardi, G., Bartlett, P. L., and Srebro, N. Benign overfitting in linear classifiers and leaky Re LU networks from KKT conditions for margin maximization. In Conference on Learning Theory (COLT), 2023b. Minimum-Norm Interpolation Under Covariate Shift Haas, M., Holzm uller, D., von Luxburg, U., and Steinwart, I. Mind the spikes: Benign overfitting of kernels and neural networks in fixed dimension. In Conference on Neural Information Processing Systems (Neur IPS), 2023. Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. Surprises in high-dimensional ridgeless least squares interpolation. Annals of Statistics, 50(2):949 986, 2022. Hendrycks, D. and Dietterich, T. G. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations (ICLR), 2019. Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., Dorundo, E., Desai, R., Zhu, T., Parajuli, S., Guo, M., Song, D., Steinhardt, J., and Gilmer, J. The many faces of robustness: A critical analysis of out-of-distribution generalization. In International Conference on Computer Vision (ICCV), 2021. Kausik, C., Srivastava, K., and Sonthalia, R. Generalization error without independence: Denoising, linear regression, and transfer learning. ar Xiv preprint ar Xiv:2305.17297, 2023. Koh, P. W., Sagawa, S., Marklund, H., Xie, S. M., Zhang, M., Balsubramani, A., Hu, W., Yasunaga, M., Phillips, R. L., Gao, I., et al. Wilds: A benchmark of in-thewild distribution shifts. In International Conference on Machine Learning (ICML), 2021. Kornowski, G., Yehudai, G., and Shamir, O. From tempered to benign overfitting in relu neural networks. In Conference on Neural Information Processing Systems (Neur IPS), 2023. Kou, Y., Chen, Z., Chen, Y., and Gu, Q. Benign overfitting in two-layer relu convolutional neural networks. In International Conference on Machine Learning (ICML), 2023. Lai, J., Yu, Z., Tian, S., and Lin, Q. Generalization ability of wide residual networks. Preprint, ar Xiv:2305.18506, 2023. Lei, Q., Hu, W., and Lee, J. D. Near-optimal linear regression under distribution shift. In International Conference on Machine Learning (ICML), 2021. 
Le Jeune, D., Liu, J., and Heckel, R. Monotonic risk relationships under distribution shifts for regularized risk minimization. Journal of Machine Learning Research, 25(54):1 37, 2024. Liang, W., Mao, Y., Kwon, Y., Yang, X., and Zou, J. Y. Accuracy on the curve: On the nonlinear correlation of ml performance between data subpopulations. In International Conference on Machine Learning (ICML), 2023. Ma, C., Pathak, R., and Wainwright, M. J. Optimally tackling covariate shift in rkhs-based nonparametric regression. Annals of Statistics, 51(2):738 761, 2023. Mallinar, N., Simon, J., Abedsoltan, A., Pandit, P., Belkin, M., and Nakkiran, P. Benign, tempered, or catastrophic: Toward a refined taxonomy of overfitting. In Conference on Neural Information Processing Systems (Neur IPS), 2022. Miller, J. P., Taori, R., Raghunathan, A., Sagawa, S., Koh, P. W., Shankar, V., Liang, P., Carmon, Y., and Schmidt, L. Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization. In International Conference on Machine Learning (ICML), 2021. Muthukumar, V., Vodrahalli, K., Subramanian, V., and Sahai, A. Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory, 1 (1):67 83, 2020. Oglic, D., Cvetkovic, Z., Sollich, P., Renals, S., and Yu, B. Towards robust waveform-based acoustic models. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:1977 1992, 2022. Pathak, R., Ma, C., and Wainwright, M. A new similarity measure for covariate shift with applications to nonparametric regression. In International Conference on Machine Learning (ICML), 2022. Rakhlin, A. and Zhai, X. Consistency of interpolation with laplace kernels is a high-dimensional phenomenon. In Conference on Learning Theory (COLT), 2019. Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning (ICML), 2019. Simchowitz, M., Ajay, A., Agrawal, P., and Krishnamurthy, A. Statistical learning under heterogeneous distribution shift. In International Conference on Machine Learning (ICML), 2023. Tripuraneni, N., Adlam, B., and Pennington, J. Overparameterization improves robustness to covariate shift in high dimensions. In Conference on Neural Information Processing Systems (Neur IPS), 2021. Tsigler, A. and Bartlett, P. L. Benign overfitting in ridge regression. Journal of Machine Learning Research, 24 (123):1 76, 2023. Vershynin, R. High-dimensional probability. Cambridge University Press, 2018. Wang, K. Pseudo-labeling for kernel ridge regression under covariate shift. Preprint, ar Xiv:2302.10160, 2023. Minimum-Norm Interpolation Under Covariate Shift Wang, K. A., Chatterji, N. S., Haque, S., and Hashimoto, T. Is importance weighting incompatible with interpolating classifiers? In International Conference on Learning Representations (ICLR), 2022. Wenzel, F., Dittadi, A., Gehler, P., Simon-Gabriel, C.-J., Horn, M., Zietlow, D., Kernert, D., Russell, C., Brox, T., Schiele, B., Scholkopf, B., and Locatello, F. Assaying out-of-distribution generalization in transfer learning. In Conference on Neural Information Processing Systems (Neur IPS), 2022. Xu, Z., Wang, Y., Frei, S., Vardi, G., and Hu, W. Benign overfitting and grokking in relu networks for xor cluster data. In International Conference on Learning Representations (ICLR), 2024. Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. 
In International Conference on Learning Representations (ICLR), 2017.

A. Formal assumptions

Definition 5 (Linear regression under distribution shift). We consider a training dataset comprised of n i.i.d. pairs (x_i, y_i)_{i=1}^n ∼ D_s^n concatenated into a data matrix X ∈ R^{n×p} and a response vector y_s ∈ R^n. The setting is overparameterized, meaning we have more input features than training samples, or n < p. We define
1. the covariance matrix Σ_s = E_{D_s}[xx^⊤],
2. the optimal parameter vector θ*_s ∈ R^p, satisfying E_{D_s}[(y − x^⊤θ*_s)²] = min_θ E_{D_s}[(y − x^⊤θ)²].
We test on the distribution D_t with Σ_t and θ*_t defined in the same way. We assume
1. (centered rows) E_{D_s}[x] = 0;
2. (well-specified, source) For (X, y) ∼ D_s, y = Xθ*_s + ε_s. We assume that the components of the source noise vector ε_s are i.i.d. centered random variables with positive variance v²_{ε_s} and that E_{D_s}[y | x] = x^⊤θ*_s;
3. (well-specified, target) For (X, y) ∼ D_t, y = Xθ*_t + ε_t. We assume that the components of the target noise vector ε_t are i.i.d. centered random variables with noise variance v²_{ε_t} and that E_{D_t}[y | x] = x^⊤θ*_t;
4. (simultaneous diagonalizability) Σ_s and Σ_t commute; that is, there exists an orthogonal matrix V ∈ R^{p×p} such that V^⊤Σ_s V and V^⊤Σ_t V are both diagonal. This allows us to fix an orthonormal basis in which we can express the covariance matrices as
$$\Sigma_s = \mathbb{E}_{x\sim D_s}[xx^\top] = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p), \qquad \Sigma_t = \mathbb{E}_{x\sim D_t}[xx^\top] = \mathrm{diag}(\tilde\lambda_1, \tilde\lambda_2, \ldots, \tilde\lambda_p),$$
where the source eigenvalues are a non-increasing sequence, λ_1 ≥ λ_2 ≥ ... ≥ λ_p. Note that we do not require the target eigenvalues to be a non-increasing sequence; however, we require that λ_i, λ̃_i ≥ 0 for all i;
5. (subgaussianity) the whitened data matrix, denoted Z = XΣ_s^{-1/2}, has centered i.i.d. row vectors with independent coordinates. We assume that the rows are subgaussian with subgaussian norm σ_x; that is, for all γ ∈ R^p, E[exp(γ^⊤z)] ≤ exp(σ_x²‖γ‖²/2).

B. Key results from prior work and technical lemmas

For ease on the reader, we replicate some key lemma statements from Bartlett et al. (2020) and Tsigler & Bartlett (2023) and provide new lemmas and corollaries that we use in our work. Recall that ρ_k = (1/(nλ_{k+1})) Σ_{i>k} λ_i, X ∈ R^{n×p}, and Σ_s = diag(λ_1, ..., λ_p) ∈ R^{p×p}. Let X_{0:k} ∈ R^{n×k} denote the matrix comprised of the first k feature columns. Similarly, let X_{k:p} ∈ R^{n×(p−k)} denote the matrix of the last p − k feature columns. The Gram matrix of the data, denoted here by A = XX^⊤ ∈ R^{n×n}, plays a central role in the investigation of high-dimensional linear regression. Analogous to the above, we express A_{0:k} = X_{0:k}X_{0:k}^⊤ ∈ R^{n×n} and similarly for A_{k:p} ∈ R^{n×n}. Letting Z = XΣ_s^{-1/2} ∈ R^{n×p} and denoting the independent column vectors of Z by z_i ∈ R^n, we have the following expressions:
$$A = \sum_i \lambda_i z_i z_i^\top, \qquad A_{-i} = \sum_{j\ne i}\lambda_j z_j z_j^\top, \qquad A_k = \sum_{i>k}\lambda_i z_i z_i^\top.$$

The following lemma from Bartlett et al. (2020) is key in controlling the largest and smallest eigenvalues of the data Gram matrix and its variants A_{-i} and A_k. Importantly, it also shows that if the energy in the bottom p − k components of the covariance matrix is sufficiently large (ρ_k is lower bounded by a constant), then the largest and smallest eigenvalues of A_k are equal up to constants.

Lemma B.1 (Lemma 5 from Bartlett et al. (2020)). There are constants b, c ≥ 1 such that for any k ≥ 0, with probability at least 1 − 2e^{−n/c},
1. for all i ≥ 1,
$$\mu_{k+1}(A_{-i}) \le \mu_{k+1}(A) \le \mu_1(A_k) \le c\Big(\sum_{j>k}\lambda_j + \lambda_{k+1}n\Big);$$
2. for all 1 ≤ i ≤ k,
$$\mu_n(A) \ge \mu_n(A_{-i}) \ge \mu_n(A_k) \ge \frac{1}{c}\sum_{j>k}\lambda_j - c\lambda_{k+1}n;$$
3. if ρ_k ≥ b, then
$$\frac{1}{c}\lambda_{k+1}\rho_k n \le \mu_n(A_k) \le \mu_1(A_k) \le c\lambda_{k+1}\rho_k n.$$
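A quick numerical illustration (ours) of Lemma B.1, item 3: for a benign spiked spectrum, the extreme eigenvalues of A_k concentrate within constant factors of λ_{k+1} ρ_k n. The problem sizes below are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p, k = 100, 5000, 5
lam = np.concatenate([np.full(k, 1.0), np.full(p - k, 1e-4)])  # benign spiked spectrum
rho_k = lam[k:].sum() / (n * lam[k])

Z = rng.standard_normal((n, p))        # whitened data, Z = X Sigma_s^{-1/2}
X = Z * np.sqrt(lam)
A_k = X[:, k:] @ X[:, k:].T            # Gram matrix of the tail features
mu = np.linalg.eigvalsh(A_k)           # ascending eigenvalues
scale = lam[k] * rho_k * n             # lambda_{k+1} * rho_k * n = sum_{i>k} lambda_i
print(f"rho_k = {rho_k:.1f}")
print(f"mu_n(A_k)/scale = {mu[0]/scale:.2f},  mu_1(A_k)/scale = {mu[-1]/scale:.2f}")
```

Both ratios land within a small constant of 1, matching the claim that A_k is well-conditioned when ρ_k is bounded below.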
A consequence of the prior eigenvalue bounds is that when ρ_k is lower bounded by a constant, the condition number of A_k is upper bounded by a constant. Therefore, even as problem parameters such as training sample size and input dimension grow to ∞, A_k is still well-conditioned. This is important as non-benign overfitting occurs when the condition number bound on A_k grows with problem parameters. This would happen if the lower bound on the smallest eigenvalue of A_k decayed to zero too quickly, which would cause the condition number of A_k to diverge. If this occurs, then the excess risk of the MNI would be lower bounded. This is shown for the in-distribution case in Bartlett et al. (2020).

Corollary B.2. Following from Lemma B.1, there are constants b, c ≥ 1 such that for any k ≥ 0, with probability at least 1 − 2e^{−n/c}, if ρ_k ≥ b then
$$\frac{\mu_1(A_k)}{\mu_n(A_k)} \le c^2, \qquad (13)$$
which is the equivalent of the assumption CondNum(k, 2e^{−n/c}, c²) as defined in Tsigler & Bartlett (2023).

The following definition and lemma omit all references to NonCritReg and the ridge parameter in Tsigler & Bartlett (2023).

Definition 6 (StableLowerEig(k, δ, L) from Tsigler & Bartlett (2023)). Assume that for any j ∈ {1, 2, ..., p}, with probability (separate for every j) at least 1 − δ,
$$\mu_n(A_{-j}) \ge \mu_n(\mathbb{E}A_k)/L = \Big(\sum_{i>k}\lambda_i\Big)/L. \qquad (14)$$

We now state key assumptions that are necessary in order to obtain an explicit bias lower bound. Exchangeable coordinates (ExchCoord) is a weaker assumption than independent components of the data vector. It is used in Tsigler & Bartlett (2023) instead of independent components. We assume that components of Z are independent and so we immediately satisfy ExchCoord, which we define here.

Definition 7 (ExchCoord). Assume the sequence of coordinates of Σ_s^{-1/2}x, for any x ∈ X, is exchangeable (any deterministic permutation of the coordinates of whitened data vectors doesn't change their distribution).

The PriorSigns assumption is necessary to obtain lower bounds on the bias term. It allows us to use bounds on the expectation of a quadratic form, E_v[v^⊤Mv], in order to separately analyze the contributions of v and M. As the bias takes the form
$$\theta_s^{*\top}(I - X^\top A^{-1}X)\,\Sigma_t\,(I - X^\top A^{-1}X)\,\theta_s^*,$$
we see that such a bound would separate the contributions of the model from those of data-dependent matrix expressions.

Definition 8 (PriorSigns). Assume that θ is sampled from a prior distribution in the following way: one starts with vector θ* and flips signs of all its coordinates with probability 0.5 independently.

Under PriorSigns, the random model vector is obtained by flipping signs on the components of the ground-truth model vector. This does not affect our bounds, as we see in Theorem 3.2 that our bias lower bound only relies on squared components of the random model vector, which are equivalent to the squared components of the ground-truth model. An important consequence of having a bounded condition number and independent coordinates is that with high probability the smallest eigenvalue of A_{-i} for all i ≥ 1 is lower bounded by nλ_{k+1}ρ_k up to constants. These assumptions allow Bartlett et al. (2020) to prove Lemma B.1, which in turn allows us to derive the StableLowerEig condition. This is a simple consequence of Lemma B.1 and we provide details here for completeness.

Corollary B.3 (Our variant of StableLowerEig from Tsigler & Bartlett (2023)). For all i ≥ 1, with probability at least 1 − 2e^{−n/c_2},
$$\mu_n(A_{-i}) \ge \frac{1}{c_2}\,\mu_n(\mathbb{E}A_k) = \frac{1}{c_2}\sum_{j>k}\lambda_j.$$

Proof.
By Lemma B.1, for some absolute constant c1 1 with probability at least 1 2e n/c1 i>k λi c1λk+1n. The assumption ρk b for some b 1 gives us i>k λi c1λk+1n = 1 c1 λk+1nρk c1λk+1n Choosing b > c2 1 and c2 = max{c1, (1/c1 c1/b) 1}, we get that with probability at least 1 2e n/c2 The next step is to extend this result to A i for all i. For i k, observe that A i Ak gives us µn(A i) µn(Ak). For the case of i > k, we have j =i λjzjz j j k λjzjz j + X j>k,j =i λjzjz j λ1z1z 1 + X j>k,j =i λjzjz j λiz1z 1 + X j>k,j =i λjzjz j . We assume that the features are independent and zi is centered and whitened, so λiz1z 1 + P j>k,j =i λjzjz j has the same Minimum-Norm Interpolation Under Covariate Shift distribution as Ak = P j>k λjzjz j . Therefore, λiz1z 1 + X j>k,j =i λjzjz j The following corollaries provide high-probability bounds on random subgaussian vectors with independent coordinates. Corollary B.4 (Corollary 1 from Bartlett et al. (2020)). There is a universal constant c such that for any centered random vector z Rn with independent σ2-subgaussian coordinates with unit variances, any random subspace L of Rn of codimension k that is independent of z, and any t > 0, with probability at least 1 3e t, z 2 n + cσ2(t + ΠL z 2 n cσ2(k + t + where ΠL is the orthogonal projection on L . In our proofs, we will need to control the norm of zi for all i p on the same high-probability event. In these cases we need to apply a union bound over the events in the summation. The following corollary shows how to invoke a union bound over ℓof these events in such a way that the probability over the union of all such events holds with high probability that depends n. Corollary B.5. There is a universal constant c as defined in Corollary B.4. Let z Rn be a centered random vector with σ2-subgaussian coordinates and unit variances. Let L be a random subspace of Rn of codimension k that is independent of z. For 0 < t < n/c0 and k (0, n/c1) for c1 > c0 with c0 sufficiently large, with probability 1 3e t, ΠL z 2 n/c3 where c2, c3 only depends on c, c0, σ. We obtain a union bound over the intersection of ℓof these events so long as ln(ℓ) n/c0 ℓ en/c0. Then for k (0, n/c1) for c1 > c0 with c0 sufficiently large, if ℓ en/c0, with probability at least 1 3e n/c0, ℓof the above events independently hold. Proof. Let Corollary B.4 hold with universal constant c. Then, with probability 1 3e t for t > 0, z 2 n + cσ2(t + ΠL z 2 n cσ2(k + t + c0 . Then we have that, Minimum-Norm Interpolation Under Covariate Shift Plugging in for z 2, z 2 n + cσ2(t + c0 + n c0 ) = n(1 + cσ2(c 1 0 + c 1/2 0 )) for c1 only dependent on c, c0, σ. Now, plugging in for ΠL z 2, ΠL z 2 n cσ2(k + t + n cσ2(k + n c0 + n c0 ) = n(1 cσ2(k n + c 1 0 + c 1/2 0 )). c2 for c2 > c0. Then it is clear that k n + c 1 0 + c 1/2 0 )) n(1 cσ2(c 1 2 + c 1 0 + c 1/2 0 )) for constant c3 that only depends on c, σ2, c0. We finally require that 1 cσ2(c 1 1 + c 1 0 + c 1/2 0 ) > 0 which we can achieve by taking c0 sufficiently large. We now proceed to bound the union of ℓof the complement events, in order to obtain a bound over the intersection of ℓof these events. For multiple zi s, define by Ai the events shown above, that zi 2 c2n and ΠLizi 2 n/c3 where zi and Li are defined analogous to z, L above. Then P( ℓ i=1(Ai)c) i=1 P((Ai)c) Then P( ℓ i=1Ai) 1 3ℓe t. Observing that 3ℓe t = 3eln(ℓ)e t = 3e t+ln(ℓ) = 3e (t ln(ℓ)) we can set the per-event t accordingly and obtain the necessary bound. We want 0 < t ln(ℓ) n/c0 to complete the bound. 
Therefore, we need that, per-event, $\ln(\ell) < t \le n/c_0 + \ln(\ell)$. If $\ln(\ell) \le n/c_0$, then this reduces to needing $\ln(\ell) < t \le 2n/c_0$. Since each event is defined for $t \in (0, n/c_0]$, the union bound proof is complete by taking $t = n/c_0$ and requiring that $\ln(\ell) \le n/c_0$.

The following lemma is necessary in order to extend a summation over random variables, each lower bounded by a real number with equal probability, to a unified lower bound over the entire summation.

Lemma B.6 (Lemma 9 from Bartlett et al. (2020)). Suppose that $\{\eta_i\}_{i=1}^n$ is a sequence of non-negative random variables and $\{t_i\}_{i=1}^n$ is a sequence of non-negative real numbers (at least one of which is strictly positive) such that for some $\delta \in (0, 1)$ and any $i \le n$, $P(\eta_i > t_i) \ge 1 - \delta$. Then
$$P\Big(\sum_{i=1}^n \eta_i > \frac{1}{2}\sum_{i=1}^n t_i\Big) \ge 1 - 2\delta.$$

We now provide a minor generalization of Corollary S.6 in Bartlett et al. (2020) that comes from replacing $a_1$ in a non-increasing sequence of non-negative numbers $\{a_i\}_{i=1}^p$ with $\max_i a_i$ and only requiring that $\{a_i\}_{i=1}^p$ is a sequence of non-negative numbers.

Corollary B.7. There is a universal constant $c$ such that for any sequence $\{a_i\}_{i=1}^p$ of non-negative numbers such that $\sum_{i=1}^p a_i < \infty$, any independent, centered, $\sigma$-subexponential random variables $\{\xi_i\}_{i=1}^p$, and any $x > 0$, with probability at least $1 - 2e^{-cx}$,
$$\Big|\sum_i a_i\xi_i\Big| \le \sigma\,\max\Big(x\max_i a_i,\ \sqrt{x\sum_i a_i^2}\Big).$$

Lastly, the following identity will allow us to use the PriorSigns assumption to derive a new form for the bias term, which will be used for the proof of the lower bound.

Lemma B.8 (Identity for the expectation of a quadratic form). Assume $M \in \mathbb{R}^{p \times p}$ is a symmetric matrix. For a random vector $x \in \mathbb{R}^p$ with mean $\mathbb{E}[x]$ and covariance $\mathrm{Cov}(x)$,
$$\mathbb{E}_x[x^\top M x] = \mathbb{E}[x]^\top M\,\mathbb{E}[x] + \mathrm{tr}(M\,\mathrm{Cov}(x)).$$
Proof.
$$\mathbb{E}_x[x^\top M x] = \mathbb{E}[\mathrm{tr}(x^\top M x)] = \mathbb{E}[\mathrm{tr}(M x x^\top)] = \mathrm{tr}(M\,\mathbb{E}[xx^\top]) = \mathrm{tr}\big(M\,\mathrm{Cov}(x) + M\,\mathbb{E}[x]\mathbb{E}[x]^\top\big) = \mathrm{tr}(M\,\mathrm{Cov}(x)) + \mathbb{E}[x]^\top M\,\mathbb{E}[x].$$

C. Proof of excess risk bound

We start by restating Theorem 2.2.

Theorem 2.2 (Target excess risk decomposition). The excess risk of the MNI trained on the source data, when evaluated on the target distribution, satisfies
$$R(\hat\theta(y_s), D_t) \le 4B_1 + 4B_2 + 2V_{\varepsilon_s}, \qquad (2)$$
$$\mathbb{E}_{\varepsilon_s} R(\hat\theta(y_s), D_t) = B_1 + B_2 + \mathbb{E}_{\varepsilon_s} V_{\varepsilon_s} + 2(\theta_t^* - \theta_s^*)^\top\Sigma_t\big(\theta_s^* - \hat\theta(X\theta_s^*)\big),$$
where we define
$$B_1 := \|\theta_s^* - \theta_t^*\|_{\Sigma_t}^2, \quad (3) \qquad B_2 := \|\theta_s^* - \hat\theta(X\theta_s^*)\|_{\Sigma_t}^2, \quad (4) \qquad V_{\varepsilon_s} := \|\hat\theta(\varepsilon_s)\|_{\Sigma_t}^2, \quad (5)$$
and $\|x\|_M^2 := x^\top M x$.

Proof. Let us begin by noting that the excess risk of any $\theta$ is given by
$$R(\theta) = \mathbb{E}_{(x,y)\sim D_t}\big[(y - x^\top\theta)^2\big] - \mathbb{E}_{(x,y)\sim D_t}\big[(y - x^\top\theta_t^*)^2\big] = \mathbb{E}_{(x,y)\sim D_t}\big[(y - x^\top\theta_t^* + x^\top\theta_t^* - x^\top\theta)^2\big] - \mathbb{E}_{(x,y)\sim D_t}\big[(y - x^\top\theta_t^*)^2\big]$$
$$= \mathbb{E}_{(x,y)\sim D_t}\big[(x^\top\theta_t^* - x^\top\theta)^2\big] + 2\,\mathbb{E}_{(x,y)\sim D_t}\big[(y - x^\top\theta_t^*)(x^\top\theta_t^* - x^\top\theta)\big] \overset{(i)}{=} \mathbb{E}_{x\sim D_t}\big[(x^\top\theta_t^* - x^\top\theta)^2\big]. \qquad (15)$$
Equality (i) uses that, conditional on $x$, $y - x^\top\theta_t^*$ is mean-zero, which is given in Assumption 3 (well-specified, target), so that
$$\mathbb{E}_{(x,y)\sim D_t}\big[(y - x^\top\theta_t^*)(x^\top\theta_t^* - x^\top\theta)\big] = \mathbb{E}\big[(x^\top\theta_t^* - x^\top\theta)\,\mathbb{E}[y - x^\top\theta_t^* \mid x]\big] = 0.$$
We now note that the source-data MNI can be decomposed as follows,
$$\hat\theta(y_s) = X^\top(XX^\top)^{-1}y_s = X^\top(XX^\top)^{-1}(X\theta_s^* + \varepsilon_s) = \hat\theta(X\theta_s^*) + \hat\theta(\varepsilon_s).$$
We can thus continue from (15) to characterize the excess risk of the source-data MNI as
$$R(\hat\theta(y_s)) = \mathbb{E}_{x\sim D_t}\big[(x^\top\theta_t^* - x^\top\hat\theta(y_s))^2\big] = \mathbb{E}_{x\sim D_t}\Big[\big(x^\top\theta_t^* - x^\top(\hat\theta(X\theta_s^*) + \hat\theta(\varepsilon_s))\big)^2\Big] = \mathbb{E}_{x\sim D_t}\Big[\big(x^\top(\theta_t^* - \hat\theta(X\theta_s^*)) - x^\top\hat\theta(\varepsilon_s)\big)^2\Big]$$
$$\overset{(i)}{\le} \mathbb{E}_{x\sim D_t}\Big[2\big(x^\top(\theta_t^* - \hat\theta(X\theta_s^*))\big)^2 + 2\big(x^\top\hat\theta(\varepsilon_s)\big)^2\Big] = \mathbb{E}_{x\sim D_t}\Big[2\big(x^\top(\theta_t^* - \theta_s^* + \theta_s^* - \hat\theta(X\theta_s^*))\big)^2 + 2\big(x^\top\hat\theta(\varepsilon_s)\big)^2\Big]$$
$$\overset{(ii)}{\le} \mathbb{E}_{x\sim D_t}\Big[4\big(x^\top(\theta_t^* - \theta_s^*)\big)^2 + 4\big(x^\top(\theta_s^* - \hat\theta(X\theta_s^*))\big)^2 + 2\big(x^\top\hat\theta(\varepsilon_s)\big)^2\Big]. \qquad (16)$$
In inequalities (i) and (ii), we have used Young's inequality, which implies $(a - b)^2 \le 2(a - c)^2 + 2(b - c)^2$ for any $a, b, c \in \mathbb{R}$.
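Before identifying each of the three terms in (16) with $B_1$, $B_2$, and $V_{\varepsilon_s}$, the following short numerical sketch (ours, not from the original paper; the problem sizes, eigendecay, and variable names are illustrative) checks the resulting bound $R(\hat\theta(y_s)) \le 4B_1 + 4B_2 + 2V_{\varepsilon_s}$ on a small Gaussian instance with commuting diagonal covariances.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 80, 400
i = np.arange(1, p + 1)
lam_s = 1.0 / (i * np.log(i + 1) ** 2)
lam_t = lam_s * np.where(i <= 10, 1.5, 0.7)      # illustrative commuting target covariance

theta_s = rng.standard_normal(p); theta_s /= np.linalg.norm(theta_s)
theta_t = theta_s.copy()
theta_t[:5] += 0.1                                # perturb a few coordinates so that B1 > 0

Z = rng.standard_normal((n, p))
X = Z * np.sqrt(lam_s)
eps = rng.standard_normal(n)

Xpinv = np.linalg.pinv(X)
theta_hat_signal = Xpinv @ (X @ theta_s)          # \hat{theta}(X theta_s^*)
theta_hat_noise = Xpinv @ eps                     # \hat{theta}(eps_s)
theta_hat = theta_hat_signal + theta_hat_noise    # MNI fit on noisy source labels

def sq_norm(v, lam):                              # ||v||_M^2 for diagonal M = diag(lam)
    return float(np.sum(lam * v ** 2))

B1 = sq_norm(theta_s - theta_t, lam_t)
B2 = sq_norm(theta_s - theta_hat_signal, lam_t)
V_eps = sq_norm(theta_hat_noise, lam_t)
excess_risk = sq_norm(theta_t - theta_hat, lam_t)   # E_x (x^T theta_t^* - x^T theta_hat)^2 for diagonal Sigma_t

print(excess_risk, 4 * B1 + 4 * B2 + 2 * V_eps)
assert excess_risk <= 4 * B1 + 4 * B2 + 2 * V_eps + 1e-12   # pointwise consequence of Young's inequality
```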
Recalling that $\|x\|_M^2 := x^\top M x$, it is apparent that the first term is just the weighted distance between the source and target vectors,
$$\mathbb{E}_{x\sim D_t}\Big[\big(x^\top(\theta_t^* - \theta_s^*)\big)^2\Big] = (\theta_t^* - \theta_s^*)^\top\,\mathbb{E}_{x\sim D_t}[xx^\top]\,(\theta_t^* - \theta_s^*) = \|\theta_s^* - \theta_t^*\|_{\Sigma_t}^2. \qquad (17)$$
The second term looks quite similar to the bias term, $B$, in Bartlett et al. (2020) and Tsigler & Bartlett (2023):
$$\mathbb{E}_{x\sim D_t}\Big[\big(x^\top(\theta_s^* - \hat\theta(X\theta_s^*))\big)^2\Big] = \big(\theta_s^* - \hat\theta(X\theta_s^*)\big)^\top\,\mathbb{E}_{x\sim D_t}[xx^\top]\,\big(\theta_s^* - \hat\theta(X\theta_s^*)\big) = \|\theta_s^* - \hat\theta(X\theta_s^*)\|_{\Sigma_t}^2. \qquad (18)$$
The key difference with the standard supervised setting is that now the quantity in the middle is $\Sigma_t$, not $\Sigma_s$. Equivalently, the norm on $\theta_s^* - \hat\theta(X\theta_s^*)$ is induced by $\Sigma_t$ rather than $\Sigma_s$. And finally, the third term is similar to the variance term, $C$, in Bartlett et al. (2020):
$$\mathbb{E}_{x\sim D_t}\Big[\big(x^\top\hat\theta(\varepsilon_s)\big)^2\Big] = \hat\theta(\varepsilon_s)^\top\,\mathbb{E}_{x\sim D_t}[xx^\top]\,\hat\theta(\varepsilon_s) = \hat\theta(\varepsilon_s)^\top\Sigma_t\hat\theta(\varepsilon_s) = \|\hat\theta(\varepsilon_s)\|_{\Sigma_t}^2. \qquad (19)$$
As in the bias term, the only difference is that the middle term is $\Sigma_t$ rather than $\Sigma_s$. Equivalently, the norm on $\hat\theta(\varepsilon_s)$ is induced by $\Sigma_t$ rather than $\Sigma_s$. Putting it all together, we get the following upper bound for the excess risk of the minimum-norm interpolator on the training data,
$$R(\hat\theta(y_s)) \le 4\|\theta_s^* - \theta_t^*\|_{\Sigma_t}^2 + 4\|\theta_s^* - \hat\theta(X\theta_s^*)\|_{\Sigma_t}^2 + 2\|\hat\theta(\varepsilon_s)\|_{\Sigma_t}^2.$$
This completes the upper bound for the risk. For the lower bound, we have
$$\mathbb{E}_{\varepsilon_s} R(\hat\theta(y_s)) = \mathbb{E}_{\varepsilon_s, x\sim D_t}\Big[\big(x^\top\theta_t^* - x^\top\hat\theta(y_s)\big)^2\Big] = \mathbb{E}_{\varepsilon_s, x\sim D_t}\Big[\big(x^\top(\theta_t^* - \hat\theta(X\theta_s^*)) - x^\top\hat\theta(\varepsilon_s)\big)^2\Big]$$
$$= \mathbb{E}_{\varepsilon_s, x\sim D_t}\Big[\big(x^\top(\theta_t^* - \hat\theta(X\theta_s^*))\big)^2 - 2\big(x^\top(\theta_t^* - \hat\theta(X\theta_s^*))\big)\big(x^\top\hat\theta(\varepsilon_s)\big) + \big(x^\top\hat\theta(\varepsilon_s)\big)^2\Big] \overset{(i)}{=} \mathbb{E}_{x\sim D_t}\Big[\big(x^\top(\theta_t^* - \hat\theta(X\theta_s^*))\big)^2\Big] + \mathbb{E}_{\varepsilon_s, x\sim D_t}\Big[\big(x^\top\hat\theta(\varepsilon_s)\big)^2\Big].$$
The equality (i) uses that, conditional on $X$, $\varepsilon_s$ is zero-mean. Note that the second term above is just $\mathbb{E}_{\varepsilon_s}\|\hat\theta(\varepsilon_s)\|_{\Sigma_t}^2$, so we need only deal with the first term. Adding and subtracting $\theta_s^*$ inside the square and expanding, we have
$$\mathbb{E}_{x\sim D_t}\Big[\big(x^\top(\theta_t^* - \hat\theta(X\theta_s^*))\big)^2\Big] = \mathbb{E}_{x\sim D_t}\Big[\big(x^\top(\theta_t^* - \theta_s^*)\big)^2\Big] + \mathbb{E}_{x\sim D_t}\Big[\big(x^\top(\theta_s^* - \hat\theta(X\theta_s^*))\big)^2\Big] + 2\,\mathbb{E}_{x\sim D_t}\Big[(\theta_t^* - \theta_s^*)^\top xx^\top(\theta_s^* - \hat\theta(X\theta_s^*))\Big]$$
$$= \|\theta_t^* - \theta_s^*\|_{\Sigma_t}^2 + \|\theta_s^* - \hat\theta(X\theta_s^*)\|_{\Sigma_t}^2 + 2(\theta_t^* - \theta_s^*)^\top\Sigma_t\big(\theta_s^* - \hat\theta(X\theta_s^*)\big).$$

D. Overview of variance and bias proof techniques

The central pillar of both proofs is controlling the eigenvalues of $A_k$, which in turn provides certain bounds on the eigenvalues of $A$ and $A_{-i}$. A key finding of Bartlett et al. (2020) is that once $\rho_k$ is large enough, all eigenvalues of $A_k$ are identical up to a constant factor. Specifically, $z^\top A z \approx n^2\lambda_{k+1}\rho_k$ and $z^\top A^{-1}z \approx n(n\lambda_{k+1}\rho_k)^{-1}$.

D.1. Variance

Due to independence between the components of $\varepsilon_s$, the variance term from Eqn. 6 can be expressed as
$$V = \mathbb{E}_{\varepsilon_s}[V_{\varepsilon_s}/v_\varepsilon^2] = \mathrm{tr}(A^{-1}X\Sigma_t X^\top A^{-1}) = \sum_{i=1}^p \tilde\lambda_i\lambda_i\, z_i^\top A^{-2}z_i.$$
Now that we are dealing with a sum of quadratic forms, we consider the first $k$ signal and last $p - k$ noise components separately. Using the Sherman-Morrison formula, the former can be written as
$$\sum_{i\le k}\tilde\lambda_i\lambda_i\, z_i^\top A^{-2}z_i = \sum_{i\le k}\frac{\tilde\lambda_i}{\lambda_i}\cdot\frac{\lambda_i^2\, z_i^\top A_{-i}^{-2}z_i}{(1 + \lambda_i z_i^\top A_{-i}^{-1}z_i)^2} \approx \sum_{i\le k}\frac{\tilde\lambda_i}{\lambda_i}\cdot\frac{\lambda_i^2\, n(n\lambda_{k+1}\rho_k)^{-2}}{\lambda_i^2\, n^2(n\lambda_{k+1}\rho_k)^{-2}} = \frac{1}{n}\sum_{i\le k}\frac{\tilde\lambda_i}{\lambda_i},$$
where $\lambda_i z_i^\top A_{-i}^{-1}z_i$ dominates the $1$ in the denominator for $i \le k$. For the sum over the noise components, the $1$ in the denominator dominates the other term, and so we directly analyze the tail contribution as
$$\sum_{i>k}\frac{\tilde\lambda_i}{\lambda_i}\,\lambda_i^2\, z_i^\top A^{-2}z_i \approx \sum_{i>k}\frac{\tilde\lambda_i}{\lambda_i}\,\lambda_i^2\, n(n\lambda_{k+1}\rho_k)^{-2}.$$
The result is that the variance term is upper and lower bounded, up to constant factors, by
$$\frac{1}{n}\sum_{i\le k}\frac{\tilde\lambda_i}{\lambda_i} + \sum_{i>k}\frac{\tilde\lambda_i\lambda_i}{n\lambda_{k+1}^2\rho_k^2}.$$

D.2. Bias

As in Eqn. 4, the bias term is given by
$$B_2 = \|\theta_s^* - X^\top A^{-1}X\theta_s^*\|_{\Sigma_t}^2 = \mathrm{tr}\big(\theta_s^{*\top}(I - X^\top A^{-1}X)\Sigma_t(I - X^\top A^{-1}X)\theta_s^*\big) \le \mathrm{tr}(\theta_s^*\theta_s^{*\top})\,\mathrm{tr}\big((I - X^\top A^{-1}X)\Sigma_t(I - X^\top A^{-1}X)\big) = \|\theta_s^*\|^2\sum_{i=1}^p\tilde\lambda_i\sum_{j=1}^p\big(e_i[j] - \sqrt{\lambda_i\lambda_j}\,z_i^\top A^{-1}z_j\big)^2,$$
where we use the Cauchy-Schwarz inequality to separate the parameter vector from the quadratic form. A quick application of the Sherman-Morrison formula allows us to write
$$B_2 \le \|\theta_s^*\|^2\sum_{i=1}^p\tilde\lambda_i\,\frac{1}{1 + \lambda_i z_i^\top A_{-i}^{-1}z_i}.$$
From here, we once again exert control over the eigenvalues of A i to get 1 1 + λiz T i A 1 i zi 1 1 + λi λk+1ρk , Minimum-Norm Interpolation Under Covariate Shift which completes the upper bound proof sketch. Note that the looseness of the bias bounds largely stems from the application of the Cauchy-Schwarz inequality. The only situations in which the bound becomes an equality are when cθ s = (I XT A 1X)Σ1/2 t for some scalar c R or when θ s is the zero vector. Between the upper and lower bounds, the latter is likely tighter due to the use of the Prior Signs assumption. As detailed in Appendix F.2, it allows us to write B θ s T (I diag(XT A 1X))Σt(I diag(XT A 1X))θ s , where for a matrix Q Rm m, we use diag(Q) Rm m to denote zeroed off-diagonal entries. The contribution of the off-diagonal entries is non-negative and dominated by the diagonals, so they can be dropped in the lower bound while preserving tightness under the Prior Signs assumption. In general, non-negative terms cannot be discarded in the proof of an upper bound, so we resort to the Cauchy-Schwarz inequality in order to avoid addressing the off-diagonals directly. However, decoupling the model vector θ s from the matrix (I XT A 1X)Σ1/2 t introduces another degree of looseness, contributing to the gap between our bounds. Improving our upper bound will require controlling the off-diagonals of this matrix product with a technique more appropriate than Cauchy-Schwarz. E. Proof of variance bounds Theorem 3.1. (Upper and lower bounds for the variance term) There exist universal constants b, c1 > 1 given in Lemma B.1, a universal constant c2 given in Lemma B.4 and a constant c > 1 that only depends on σx, c1, c2, such that for k (0, n/c), with probability at least 1 10e n/c, λi λi min 1, λ2 i λ2 k+1(ρk + 1)2 If in addition ρk b, with probability at least 1 7e n/c, λi λi + n P i>k λiλi (P i>k λi)2 =: V /c. (9) Proof. We derive the variance terms necessary here and finish the proof of the upper bound in Appendix E.1 and the lower bound in Appendix E.2. We follow the proof techniques in Bartlett et al. (2020); Tsigler & Bartlett (2023). Observe that we can express the variance term as follows, V = E εs[Vεs/v2 ε] = E εs[ X (XX ) 1εs 2 Σt/v2 ε]. Defining A = XX , V = E εs[ X A 1εs 2 Σt/v2 ε] = E εs[(ε s A 1XΣt X A 1εs)/v2 ε] = E εs[tr(ε s A 1XΣt X A 1εs)/v2 ε]. Minimum-Norm Interpolation Under Covariate Shift Using the trace trick, V = tr(A 1XΣt X A 1 E εs[εsε s ])/v2 ϵ = tr(A 1XΣt X A 1v2 ϵ In)/v2 ϵ = tr(A 1XΣt X A 1) = tr(XΣt X A 2) i=1 λixi(xi) )A 2) i=1 λiλiziz i )A 2) where xi Rn and xi λi = zi Rn are columns of X Rn p and XΣ 1/2 s Rn p, respectively. Continuing the calculation, we have that i=1 λiλitr(z T i A 2zi) i=1 λiλitr(z T i (A i + ziz T i λi) 2zi) i=1 λiλi z T i A 2 i zi (1 + λiz T i A 1 i zi)2 where A i = XXT λiziz T i = P j =i λjzjz T j . This expression will serve as the starting point for the variance term, which we will now proceed to upper and lower bound. E.1. Upper bound After isolating the contribution of λi λi , most of the components of this proof are as given in the proof of Lemma 6 in Bartlett et al. (2020). For completeness, we replicate them here and refer the reader to their paper for further details and intuitions. We start by separating the variance term into the top k components and the bottom p k components as follows, λ2 i z T i A 2 i zi (1 + λiz T i A 1 i zi)2 + X λi λi (λ2 i z T i A 2zi). Fix constants b, c1 1 as defined in Lemma B.1. 
Then, with probability 1 2e n/c1, if ρk b then for all z Rn and i [1, k], z T i A 2 i zi µ1(A 2 i ) zi 2 µn(A i) 2 zi 2 and on the same event, z T i A 1 i zi (ΠLizi)T A 1 i (ΠLizi) µn(A 1 i ) ΠLizi 2 µk+1(A i) 1 ΠLizi 2 Minimum-Norm Interpolation Under Covariate Shift where ΠLi is the orthogonal projection onto the span of the bottom n k eigenvectors of A i. It is important to use the projection onto the bottom eigenvectors of A i in lower bounding the denominator term because we have to use the fact that µn(A 1 i ) µ1(A i) 1. When we don t do the projection, then zi is affected by all of A i and so the largest eigenvalue that affects this expression is µ1(A i) = λ1. After doing this projection, we no longer have contributions from the top k eigenvectors / eigenvalues in the summation of z T i A 1 i zi. Therefore, the largest eigenvalue that affects this summation is now λk+1 instead of λ1, and so we can use this in our lower bound instead, as desired. Putting it together, for i k, λ2 i z T i A 2 i zi (1 + λiz T i A 1 i zi)2 z T i A 2 i zi (z T i A 1 i zi)2 We now invoke Corollary B.5 with a union bound over k events. Let t < n/c0 and k (0, n/c) for c > c0 and c0 sufficiently large. Since k < n/c we also satisfy the union bound condition that ln(k) < n/c. Then, with probability at least 1 3e t, ΠLizi 2 n/c3 for constants c2, c3 that only depend on σx, c0, and a universal constant c as defined in Corollary B.4. Altogether, with probability 1 5e n/c0 for c0 sufficiently large, λ2 i z T i A 2 i zi (1 + λiz T i A 1 i zi)2 λi λi c4 1 zi 2 λi λi c4 1 c2c2 3 n On the same event we use to bound µk+1(A i) via Lemma B.1, we also have that µ1(A 2) µn(A) 2. As such, λi λi (λ2 i z T i A 2zi) c2 1 P i>k λi λi λ2 i zi 2 (nλk+1ρk)2 . Then by Corollary B.7, there is a universal constant a such that with probability at least 1 2e t for t < n/c0 and c0 > a 1, λi λi λ2 i zi 2 σ2 x max(t max i>k ( λiλi), s i>k ( λiλi)2) i>k λiλi + σ2 x max(t max i>k ( λiλi), s i>k ( λiλi)2) i>k λiλi + σ2 x max(t X λi λi λ2 i . Minimum-Norm Interpolation Under Covariate Shift Altogether, λi λi (λ2 i z T i A 2zi) c2 1 P i>k λi λi λ2 i zi 2 c2 1c5n (nλk+1ρk)2 X λ2 i nλ2 k+1ρ2 k By taking c > max(c0, c4, c6) we have that with probability 1 7e n/c, λ2 i nλ2 k+1ρ2 k λi λi + n P i>k λiλi (P i>k λi)2 . E.2. Lower bound Recall that the variance takes the form, i=1 λiλi z T i A 2 i zi (1 + λiz T i A 1 i zi)2 . By Cauchy-Schwartz, (z T i A 1 i zi)2 = | zi, A 1 i zi |2 zi 2 (z T i A 2 i zi). We plug this identity into our lower bound, and further multiply by λi λi , resulting in i=1 λiλi z T i A 2 i zi (1 + λiz T i A 1 i zi)2 i=1 ( λi λi ) λ2 i z T i A 2 i zi (1 + λiz T i A 1 i zi)2 i=1 ( λi λi ) λ2 i (z T i A 1 i zi)2 ||zi||2(1 + λiz T i A 1 i zi)2 i=1 ( λi λi ) 1 ||zi||2(1 + λiz T i A 1 i zi)2(λiz T i A 1 i zi) 2 i=1 ( λi λi ) 1 ||zi||2(1 + (λiz T i A 1 i zi) 1)2 . Then, let k (0, n) and Li be the span of the bottom n k eigenvectors of A i and ΠLi be the projection onto the orthogonal complement of Li. We have that z T i A 1 i zi (ΠLizi)T A 1 i (ΠLizi) ΠLizi 2µk+1(A i) 1. Minimum-Norm Interpolation Under Covariate Shift From Lemma B.1, there is a constant c1 1, such that for any k 0, with probability 1 2e n/c1, µk+1(A i) c1(P j>k λj + λk+1n). Additionally, by Corollary B.5, let t < n/c3 and k (0, n/c) for c > c3 and c3 sufficiently large. Then, with probability at least 1 3e t ΠLizi 2 n/c4 where c4 only depends on c3, σx and the universal constant given in Corollary B.4. 
Then, for c max{c1, c3}, with probability 1 5e n/c, z T i A 1 i zi ΠLizi 2µk+1(A i) 1 j>k λj + λk+1n). By again applying Corollary B.5 on the same event we have where c5 has the same dependencies as c4. Altogether, we have for each i, with probability 1 5e n/c, 1 ||zi||2(1 + (λiz T i A 1 i zi) 1)2 1 c5n(1 + ( c4(P j>k λj+λk+1n) c5n(1 + c4λk+1 j>k λj λk+1n + 1))2 c5c2 4n(1/c4 + λk+1 λi (ρk + 1))2 c6n(1 + λk+1 λi (ρk + 1))2 where c6 = c5c2 4 and c > max{c1, c3} as defined above. Finally, we invoke Lemma B.6 and that 1/(a + b)2 min(a 2, b 2)/4 to get that, with probability 1 10e n/c, λi λi min(1, λ2 i λ2 k+1(ρk + 1)2 ). For c7 max{8c6, c} we have that with probability 1 10e n/c7, λi λi min(1, λ2 i λ2 k+1(ρk + 1)2 ). F. Proof of bias bounds Theorem 3.2. (Upper and lower bounds for the bias term) For the lower bound only, assume that random models θ are obtained from the underlying θ s as (θ)i = γi(θ s )i, where each γi is an independent Rademacher random variable. There exists a universal constant b > 1, constants c, C that depend only on b and σx, and k < n/C such that if ρk b, then with probability at least 1 10e n/c, λi(θ s )2 i (1 + λi λk+1ρk )2 + X i>k λi(θ s )2 i If we assume that p is at most exponential in n, then with probability at least 1 5e n/c, B2/c θ s 2 p X λi 1 + λi λk+1ρk =: B2/c. Minimum-Norm Interpolation Under Covariate Shift F.1. Upper bound Proof. As defined in Eqn. 4, B2 = θ s bθ(Xθ s ) 2 Σt = θ s XT A 1Xθ s 2 Σt = θ s T (I XT A 1X)Σt(I XT A 1X)θ s . (20) The ith row of Ip XT A 1X is given by ei λiz T i A 1X. It follows that ... Pp j=1 θj(ei[j] p λiλjz i A 1zj) ... ... Pp j=1 θj(ei[j] p λiλjz i A 1zj) ... ith row shown j=1 θj(ei[j] p λiλjz i A 1zj) 2 j=1 θ2 j p X λiλjz i A 1zj 2 λiλjz i A 1zj 2 . Next we look at ith term in the outer sum. j=1 (ei[j] p λiλjz i A 1zj)2 = λi(1 λiz i A 1zi)2 + λi X j =i λiλj(z i A 1zj)2 = λi(1 2λiz T i A 1zi + λ2 i (z T i A 1zi)2 + X j =i λiλj(z i A 1zj)2) = λi(1 2λiz T i A 1zi + i=1 λiλj(z i A 1zj)2) = λi(1 2λiz T i A 1zi + λiz i A 1 p X i=1 λjzjz T j A 1zi) = λi 1 2λiz T i A 1zi + λiz i A 1AA 1zi = λi 1 2λiz T i A 1zi + λiz i A 1zi = λi 1 λiz T i A 1zi . Using the Sherman-Morrison formula, we get that 1 λiz T i A 1zi = 1 λiz T i A i + λiziz T i 1zi = 1 λiz T i A 1 i λi A 1 i zi(1 + λiz T i A 1 i zi) 1z T i A 1 i zi = 1 λiz T i A 1 i zi + (λiz T i A 1 i zi)2 1 + λiz T i A 1 i zi = 1 1 + λiz T i A 1 i zi . We now provide an upper bound for the remaining term. Let ΠLi be the orthogonal projection onto the bottom n k eigenvectors of A i. By Lemma B.1, there exist constants b, c0 1 such that if ρk b, then with probability at least Minimum-Norm Interpolation Under Covariate Shift µk+1(A i) c0λk+1ρkn, 1 + λiz T i A 1 i zi 1 + λi(ΠLizi)T A 1 i (ΠLizi) 1 + λi ΠLizi 2 c0λk+1nρk . By Corollary B.5, there exist constants c1 and c2 with c2 > c1 and c1 sufficiently large such that for 0 < k < n/c2, we have with probability at least 1 3e n/c1, ΠLizi 2 n/c3, where c3 depends only on c1 and σ. Plugging these in gives us with probability at least 1 5e n/c4, λi 1 λiz T i A 1zi λi 1 + c2 5λi λk+1ρk λi 1 + c2 5λi λk+1ρk , where c4 = max(c0, c1) and c5 = min(c0, c3). Therefore by union bound over the application of Corollary B.5, λi 1 + c2 5λi λk+1ρk λi 1 + λi λk+1ρk , where c6 = min(c2 5, 1). Taking c = max(c 1 6 , c4) gives us the result. F.2. Lower bound After isolating the contribution of λi λi , many of the components of this proof are as given in Tsigler & Bartlett (2023). For completeness, we replicate them here. Proof. 
Assume that the vector θ s is randomly distributed according to the Prior Signs(θs) assumption. Using Lemma B.8, the bias term can be rewritten as B = E θ s [Bθ s ] = E θ s [ θ s bθ(Xθ s ) 2 Σt] = E θ s [(θ s )T (Ip XT (XXT ) 1X)Σt(Ip XT (XXT ) 1X)θ s ] = E θ s [(θ s )T Mθ s ] = E θ s [θ s ]T M E θ s [θ s ] + tr(MCov(θ s )) = tr(MCov(θ s )). Minimum-Norm Interpolation Under Covariate Shift where M = (Ip XT (XXT ) 1X)Σt(Ip XT (XXT ) 1X). The last equality follows from the assumption Eθ s [(θ s )] = 0. The diagonal elements of Cov(θ s ) are the component-wise variances of θ s , which are given by (θ s )2 i = (θs)2 i . The offdiagonal elements are 0 since the components of θ s are independent. As such, we need only consider the diagonal elements of M. Note that the ith row of Ip XT (XXT ) 1X is equal to ei λiz T i (XXT ) 1X, where ei is the ith vector of the standard orthonormal basis. It follows that the ith diagonal element of M is given by j=1 λj(ei[j] p λiλjz T i A 1zj)2 = λi(1 λiz T i A 1zi)2 + X j =i λjλiλj(z T i A 1zj)2. Hence, we can express the bias term as i=1 (θs)2 i λi(1 λiz T i A 1zi)2 + X j =i λjλiλj(z T i A 1zj)2 λi λi λi(θs)2 i (1 λiz T i A 1zi)2. We are able to eliminate the second term because it is non-negative. Substituting A = A i + λiziz T i and using the Sherman-Morrison identity, we have that 1 λiz T i A 1zi = 1 1+λiz T i A 1 i zi (see proof of bias upper bound in Appendix F.1). λi(θs)2 i (1 + λiz T i A 1 i zi)2 Let s bound each term in that sum from below with high probability. By Corollary B.3, there exist constants b, c0 1 such that for any i 0 with probability at least 1 2e n/c0, if ρk b, then c0 nλk+1ρk. λi (1 + λiz T i A 1 i zi)2 λi (1 + λiµn(A i) 1 zi 2)2 . By Corollary B.5, for constants c1, c2 such that k < n/c2 with c2 > c1 for sufficiently large c1 with probability at least 1 3e n/c1 we have zi 2 c3n, where c3 depends only on c1 and σ. We obtain that w.p. at least 1 5e n/c4, λi θ2 i (1 + λiz T i A 1 i zi)2 λi θ2 i 1 + c2 4λi λk+1ρk where c4 = max(c0, c1, c3). All the terms are non-negative so Lemma B.6 provides a lower bound on their sum. With probability at least 1 10e n/c4, λi θ2 i (1 + c2 4λi λk+1ρk )2 λi θ2 i (1 + λi λk+1ρk )2 , Minimum-Norm Interpolation Under Covariate Shift where c5 = 2 max(c2 4, 1). Finally, we notice that on i > k we have ρk b > 1 and λi λk+1 giving us, λi θ2 i (1 + λi λk+1ρk )2 X λi θ2 i (1 + λi λk+1)2 i>k λi θ2 i . Letting c = 4 max(c4, c5) gives us the result. G. Proof of tightness of bounds Theorem 3.3. (Tightness of variance and bias bounds) Let the lower bound and upper bound of V be given by V and V , respectively. There exists a universal constant b 1, and constant c as defined in Theorem 3.1, and k (0, n/c) such that if ρk b, then V /V b 2(1 + b) 2/c2, 1 . Let the lower bound and upper bound of B2 be given by B2 and B2, respectively, and the assumptions of Theorem 3.2 be satisfied. Then mini (θ s )2 i : (θ s )i = 0 θ s 2 1 + b 1 λ1 λk+1 Proof. We split the proof into the variance proof in Appendix G.1 and the bias proof in Appendix G.2. G.1. Variance Proof Proof. Recall that λi λi min 1, λ2 i λ2 k+1(ρk + 1)2 λ2 i nλ2 k+1ρ2 k Since k is the smallest ℓsuch that ρℓ b, it is clear by definition that ρk 1 < b. 
Then we observe that ρk 1 = 1 nλk j>k 1 λj = λk + P j>k λj nλk = λk + nλk+1ρk λk + nλk+1ρk < nbλk λk > λk + nλk+1ρk nb > nλk+1ρk nb = λk+1ρk Minimum-Norm Interpolation Under Covariate Shift V : V = 1 8c6n 1, λ2 i λ2 k+1(ρk + 1)2 1, λ2 i λ2 k+1(ρk + 1)2 If the min is 1 then we are okay otherwise, using the identity above and that fact that λi λk, we have that λ2 i λ2 k+1(ρk + 1)2 > (λk+1ρk)2 b2λ2 k+1(ρk + 1)2 = ρ2 k b2(ρk + 1)2 . Examining the ρk terms: ρ2 k (ρk + 1)2 = 1 ρ 2 k (ρk + 1)2 = 1 (1 + ρ 1 k )2 . As ρk b we have that ρ 1 k b 1 + ρ 1 k 1 + b (1 + ρ 1 k ) 2 (1 + b) 2. Putting it together we get that λ2 i λ2 k+1(ρk + 1)2 1 8c6c 1 b2(1 + b)2 = k 8c6c b2(1 + b)2 1 8c6c b2(1 + b)2 . On i > k, it is clear that the min is always given by the second term, as λi λk+1, so we get V : V = 1 8c6n 1, λ2 i λ2 k+1(ρk + 1)2 λ2 i nλ2 k+1ρ2 k ρ2 k (ρk + 1)2 1 (1 + b)2 = 1 8c6c p k (1 + b)2 > 1 8c6c(1 + b)2 . Finally we note that for b 1 it is clear that min(b 2(1 + b) 2, (1 + b) 2) = b 2(1 + b) 2. Therefore, V : V 1 8c6cb 2(1 + b) 2. By setting c in the upper bound such that c > 8c6, we get c2 b 2(1 + b) 2. Minimum-Norm Interpolation Under Covariate Shift G.2. Bias proof Proof. We will bound the ratio of the lower and upper bounds by bounding the ratios of the corresponding terms in each sum. Observe that for all i, the ratio of the terms is equal to θ 2 1 1 + λi λk+1ρk θ 2 1 1 + λi λk+1ρk min i (θ i )2 θ 2 1 1 + λ1 λk+1 b 1 . On i > k, we have λi/λk+1 1, so θ 2 1 1 + λi λk+1ρk min i (θ i )2 θ 2 1 (1 + b 1). Unfortunately, the looseness in the top k components coming from the gap λ1/λk+1 dominates the tighter ratios in the bottom p k components which only contain a model-dependent gap, mini θ2 i / θ 2. Future work would seek to resolve this and provide tight upper and lower bounds for the bias terms. H. Proof of beneficial and malignant shifts H.1. Trace conditions for simple shifts Let Σs be any source covariance and define Σt as λi = αλi for i k and λi = βλi for i > k with α, β 0. Then tr(Σs) = Pk i=1 λi + P i>k λi and tr(Σt) = α(Pk i=1 λi) + β(P For α > 1, β < 1, if P i>k λi Pk i=1 λi < α 1 then we have that tr(Σs) < tr(Σt) and if the inequality is flipped then we obtain tr(Σs) > tr(Σt). For α < 1, β > 1, if Pk i=1 λi P i>k λi < β 1 then we have that tr(Σs) < tr(Σt) and if the inequality is flipped then we obtain tr(Σs) > tr(Σt). H.2. Proof of beneficial and malignant shifts for simple shifts We restate the theorem for ease. Theorem 3.4. (Beneficial and Malignant Multiplicative Shifts on Variance) Let Σs be a source covariance that satisfies benign source conditions. That is, k such that ρk b for a universal constant b > 1. Define Σt as λi = αλi for i k and λi = βλi for i > k, with α, β 0. Minimum-Norm Interpolation Under Covariate Shift 1. If α < 1, β 1 or α 1, β < 1 then we obtain a beneficial shift in variance. 2. If α > 1, β 1 or α 1, β > 1 then we obtain a malignant shift in variance. 3. If we are in the mildly overparameterized regime: α > 1 and β < 1 leads to beneficial shifts; α < 1 and β > 1 leads to malignant shifts. 4. If we are in the severely overparameterized regime: α > 1 and β < 1 leads to malignant shifts; α < 1 and β > 1 leads to beneficial shifts. Proof. 
From Theorem 3.1 and Theorem 3.3, we have that for a universal constant b > 1 if ρk b we get the following upper and lower bounds on the out-of-distribution variance for some constants c1, c2, λi λi + n P i>k λiλi (P λi λi + n P i>k λiλi (P Analogously, the in-distribution variance is upper and lower bounded by, where Rk = (P i>k λi)2/ P Let Σs be any source covariance model that satisfies benign source conditions. Define Σt by, ( αλi, i k βλi, i > k for α, β 0. Beneficial shifts. We use the upper bound to specify requirements for the beneficial shifts. Let α > 1, β < 1. To obtain a beneficial shift in this setting we need, n Rk (1 β) > k In the case α < 1, β > 1, to obtain a beneficial shift we need, n Rk (β 1) < k Minimum-Norm Interpolation Under Covariate Shift In the case where α = 1 then any β < 1 leads to beneficial shifts. Similarly when β = 1, any α < 1 leads to beneficial shifts. Malignant shifts. We use the lower bound to specify requirements for the malignant shift. Let α < 1 and β > 1. To obtain a malignant shift in this setting we need, In the case of α > 1, β < 1, to obtain a malignant shift we need, In the case where α = 1 then any β > 1 leads to malignant shifts. Similarly when β = 1, any α > 1 leads to malignant shifts. Mild and severe overparameterization. We see that the four cases separate into settings in which we are mildly overparameterized, meaning and settings in which we are severely overparamterized, meaning In each of these regimes of overparameterization, the above proof has delineated whether we achieve beneficial or malignant shifts in all settings of α, β. H.3. Generalized (necessary) conditions for beneficial and malignant shifts Let Σs be any source covariance matrix that satisfies benign source conditions and define Σt as ( αiλi i k, βiλi i > k with αi, βi 0 for all i. Then the OOD variance upper bound is given by, i=1 αi + n P i>k βiλ2 i (P (Pk i=1 αi) k i>k λ2 i (βi 1) (P i>k βiλ2 i P i>k λ2 i 1 ! Minimum-Norm Interpolation Under Covariate Shift and the OOD variance lower bound is given by, i=1 αi + n P i>k βiλ2 i (P (Pk i=1 αi) k i>k λ2 i (βi 1) (P i>k βiλ2 i P i>k λ2 i 1 ! where Vid is the ID variance bound. Again, we use the upper bounds to prove conditions for beneficial shifts and the lower bounds to prove conditions for malignant shifts. Beneficial shifts. From the upper bound we consider two separate cases for non-trivial beneficial shifts: 1. Pk i=1 αi < k and P i>k βiλ2 i > P 2. Pk i=1 αi > k and P i>k βiλ2 i < P We start with the case of Pk i=1 αi < k and P i>k βiλ2 i > P i>k λ2 i . If this is satisfied, the only way to achieve a beneficial shift is if i>k βiλ2 i P i>k λ2 i 1 < k 1 Pk i=1 αi We also have in this setting that, 0 < 1 Pk i=1 αi In Equation 21 we see a notion of severe overparameterization that leads to beneficial shifts. For instance as Rk we see the left-hand-side (LHS) of Equation 21 0. So as Rk we have that finite n always leads to a beneficial shift in this setting. We note that equivalently if βi = 1 for all i then we also have the LHS 0, just as in the case of severe overparameterization. We will return to the definitions of mild and severe overparameterization for arbitrary shifts after showing the remaining conditions for beneficial and malignant shifts. Now consider the case of Pk i=1 αi > k and P i>k βiλ2 i < P i>k λ2 i . If this is satisfied, the only way to achieve a beneficial shift is if i>k βiλ2 i P In this setting it is clear that i>k βiλ2 i P i>k λ2 i 1. 
In Equation 22, it is clear that we have a notion of mild overparameterization that leads to beneficial shifts. As above if αi = 1 for all i then we always obtain a beneficial shift in this setting. Otherwise if Rk does not grow too quickly (as in the case with mild overparameterization) then this is a necessary condition to achieve beneficial shifts when P i>k βiλ2 i < P Minimum-Norm Interpolation Under Covariate Shift Malignant shifts. From the lower bound we once again consider two separate cases for non-trivial malignant shifts: 1. Pk i=1 αi > k and P i>k βiλ2 i < P 2. Pk i=1 αi < k and P i>k βiλ2 i > P We start with the case of Pk i=1 αi > k and P i>k βiλ2 i < P i>k λ2 i . If this is satisfied then the only way to achieve a malignant shift is if, i>k βiλ2 i P In the case of Pk i=1 αi < k and P i>k βiλ2 i > P i>k λ2 i the only way to achieve a malignant shift is if, i>k βiλ2 i P i>k λ2 i 1 > k 1 Pk i=1 αi We now are ready to define mild and severe overparameterization for arbitrary multiplicative shifts. Theorem H.1. (Mild and severe overparameterization for arbitrary multiplicative shifts) Let Σs be any source covariance matrix that satisfies benign source conditions, meaning k such that ρk b for a universal constant b > 1. Furthermore, let Σt be defined by λi = αiλi for i k and λi = βiλi for i > k. We will define ! 1 P i>k βiλ2 i P Then we are mildly overparameterized if n Rk = ω C k and we are severely overparameterized if n Rk = o C k We now state our taxonomy of covariate shifts for arbitrary multiplicative shifts. Theorem H.2. (Beneficial and Malignant (Arbitrary) Multiplicative Shifts on Variance) Let Σs be any source covariance matrix that satisfies benign source conditions, meaning k such that ρk b for a universal constant b > 1. Furthermore, let Σt be defined by λi = αiλi for i k and λi = βiλi for i > k. 1. If Pk i=1 αi k and P i>k βiλ2 i < P i>k λ2 i then we obtain a beneficial shift. 2. If Pk i=1 αi < k and P i>k βiλ2 i P i>k λ2 i then we obtain a beneficial shift. 3. If Pk i=1 αi k and P i>k βiλ2 i > P i>k λ2 i then we obtain a malignant shift. 4. If Pk i=1 αi > k and P i>k βiλ2 i P i>k λ2 i then we obtain a malignant shift. 5. If we are in the mildly overparameterized regime: Pk i=1 αi > k and P i>k βiλ2 i < P i>k λ2 i leads to beneficial shifts, Pk i=1 αi < k and P i>k βiλ2 i > P i>k λ2 i leads to malignant shifts. 6. If we are in the severely overparameterized regime: Pk i=1 αi < k and P i>k βiλ2 i > P i>k λ2 i leads to beneficial shifts, Pk i=1 αi > k and P i>k βiλ2 i < P i>k λ2 i leads to malignant shifts. Minimum-Norm Interpolation Under Covariate Shift I. Experiment details I.1. Synthetic data experiments Our synthetic data experiments use source data generated from random Gaussians with covariance structures that are known to exhibit benign overfitting. These structures include the (k, δ, ϵ) spiked covariance models and eigendecay rates given by Bartlett et al. (2020) such as λi = i α ln β(i + 1) for α = 1, β > 1. Target data is generated from random Gaussians with covariances that lead to beneficial and malignant shifts based on our theories and modifications of the aforementioned source covariance structures. All ground truth models are sampled uniformly on the p-dimensional hypersphere, as θ s Sp 1. Label noise is sampled as ε N(0, 1), unless otherwise specified. For a data matrix X Rn p, training labels are obtained as y = Xθ s + ε. Excess risk is computed for unseen testing data from source and target distributions of interest using clean labels. 
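A minimal sketch of this synthetic pipeline is given below (an illustrative re-implementation of ours, not the authors' released code; the specific eigendecay, shift parameters, sample sizes, and function names are assumptions). It draws source Gaussians with a benign-style eigendecay, applies a multiplicative (α, β) shift in the style of Theorem 3.4, fits the MNI on noisy labels, and computes ID and OOD excess risk in closed form.

```python
import numpy as np

def run_trial(n=500, p=2000, k=10, alpha=2.0, beta=0.1, noise_std=1.0, seed=0):
    """One synthetic trial in the spirit of Appendix I.1 (illustrative, not the authors' code):
    benign-style source eigendecay, multiplicative target shift, MNI fit on noisy source
    labels, and excess risk evaluated in closed form against clean labels."""
    rng = np.random.default_rng(seed)
    i = np.arange(1, p + 1)
    lam_s = 1.0 / (i * np.log(i + 1) ** 2)           # lambda_i = i^{-1} ln^{-2}(i+1)
    lam_t = lam_s * np.where(i <= k, alpha, beta)    # alpha on the head, beta on the tail

    theta = rng.standard_normal(p)
    theta /= np.linalg.norm(theta)                   # theta_s^* uniform on the unit sphere

    X = rng.standard_normal((n, p)) * np.sqrt(lam_s)
    y = X @ theta + noise_std * rng.standard_normal(n)
    theta_hat = np.linalg.pinv(X) @ y                # minimum-norm interpolator

    err = theta_hat - theta                          # theta_t^* = theta_s^*: covariate shift only
    return np.sum(lam_s * err ** 2), np.sum(lam_t * err ** 2)   # (ID, OOD) excess risk

id_risk, ood_risk = np.mean([run_trial(seed=s) for s in range(20)], axis=0)
print(f"ID excess risk {id_risk:.4f}  vs  OOD excess risk {ood_risk:.4f}")
```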
In Figure 4 we take the source to be the $(k, \delta, \epsilon)$ spiked model with parameters given by $k = 70$, $\delta = 1$, and $\epsilon = 0.005$. The beneficial shift scales the first $k$ eigenvalues by $\alpha = 1.125$ and the last $p - k$ eigenvalues by $\beta = 0.65$. For the malignant shift we use $\alpha = 0.875$ and $\beta = 1.35$. The minimum-norm linear interpolator is fit to 500 data points sampled from a centered multivariate Gaussian with unit variance and dimension $p = 4900$. The model vector is sampled from a centered Gaussian and scaled to unit norm. The x-axis represents the amount of additive label noise in training. All evaluation is done on clean data. Each point is the average of 40 runs.

In Figure 5, we take the source to be the $(k, \delta, \epsilon)$ spiked model with source parameters $k = 10$, $\delta = 1.0$, $\epsilon = 10^{-6}$ and target parameters $k = 10$, $\delta = 1.35$, $\epsilon = 6.5 \times 10^{-7}$. We use $n = 50$ training data points, 10k held-out testing data points in each OOD test set, and vary $p$ from 75 to 1000 dimensions. We solve OLS using the closed-form MNI solution on the source data. Each experiment is averaged over 100 independent runs.

In Figure 2 we train fully-connected neural networks with ReLU activation functions. Data is sampled as above from the covariance structures given by $\lambda_i = i^{-\alpha}\ln^{-\beta}(i+1)$ with varying $\beta$ to obtain beneficial and malignant shifts. The network architecture is 3 hidden layers, with hidden widths 512 and 2048. Networks are trained with stochastic gradient descent with momentum 0.9 until the training MSE has reached $< 5 \times 10^{-6}$. We start with a learning rate of 0.01 and decay by a stepped cosine schedule for 1,500 epochs. We take a batch size of 64 and train without weight decay. Each experiment is averaged over 20 independent runs. We train in PyTorch with a single A100 NVIDIA GPU. In these experiments we take $n = 200$ and compare $p = 20$ with $p = 2000$. Label noise is sampled as $\mathcal{N}(0, \sigma^2)$ and we vary $\sigma^2$ to show the behaviors at varying train label noise.

In Figure 8 we train fully-connected neural networks with ReLU activation functions. Source data is sampled from a mean-centered Gaussian with diagonal covariance matrix with eigenvalues $\lambda_i = i^{-1}\ln^{-1.5}(i+1)$. Target covariate shifts are implemented in the style of Theorem 3.4, where the top $k$ source eigenvalues are multiplied by $\alpha$ and the bottom $p - k$ source eigenvalues are multiplied by $\beta$. In this experiment, we take $k = 10$, $\alpha = 2$, $\beta = 0.1$ and experiment with $n = 400$ source data samples for $p = 200$ and $p = 4{,}000$. The network architecture is 3 hidden layers with hidden width 2,048. Our training setup is the same as given above for the prior MLP experiments.

I.2. CIFAR-10 and CIFAR-10C experiments

In Figures 9 and 11 we use a binary variant of CIFAR-10 and CIFAR-10C. For details on the CIFAR-10C dataset, see Hendrycks & Dietterich (2019). The binary problem is constructed by selecting only the dog and truck classes. To stay overparameterized, we subsample $n = 500, 1000, 2000$ points in a class-balanced manner. Images are flattened into $p = 3072$ dimensional vectors. We fit our model using the OLS solution for the MNI against $\{0, 1\}$ class labels. We test on the same two classes from CIFAR-10 and from the CIFAR-10C Gaussian blur and Gaussian noise corruptions. Recall that these two corruptions were selected for their eigenspectra's similarity to beneficial and malignant shifts, respectively. Label noise is injected by flipping class labels with a given probability.
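For reference, a compact sketch of this binary CIFAR-10 pipeline is given below (an illustrative re-implementation rather than the authors' code). It assumes torchvision is available for the CIFAR-10 download; swapping the clean test images for CIFAR-10C blur or noise images would yield the corresponding OOD evaluations.

```python
import numpy as np
from torchvision import datasets   # assumed dependency, used only to download CIFAR-10

CLASSES = (5, 9)   # dog vs. truck

def binary_split(split, n_per_class=None, seed=0):
    """Flattened dog-vs-truck subset of CIFAR-10 (a sketch of the Appendix I.2 setup,
    not the authors' exact preprocessing)."""
    rng = np.random.default_rng(seed)
    ds = datasets.CIFAR10(root="./data", train=(split == "train"), download=True)
    X = ds.data.reshape(len(ds.data), -1).astype(np.float64) / 255.0   # p = 3072 features
    y = np.array(ds.targets)
    Xs, ys = [], []
    for label, cls in enumerate(CLASSES):
        idx = np.where(y == cls)[0]
        if n_per_class is not None:                                    # class-balanced subsample
            idx = rng.choice(idx, size=n_per_class, replace=False)
        Xs.append(X[idx]); ys.append(np.full(len(idx), label, dtype=np.float64))
    return np.vstack(Xs), np.concatenate(ys)

def mni_test_error(n=1000, flip_prob=0.2, seed=0):
    rng = np.random.default_rng(seed)
    X_tr, y_tr = binary_split("train", n_per_class=n // 2, seed=seed)
    X_te, y_te = binary_split("test")          # swap in CIFAR-10C images here for OOD curves
    y_noisy = np.where(rng.random(len(y_tr)) < flip_prob, 1.0 - y_tr, y_tr)   # random label flips
    theta = np.linalg.pinv(X_tr) @ y_noisy     # ridgeless OLS; the MNI since p = 3072 > n
    preds = (X_te @ theta > 0.5).astype(float)
    return float(np.mean(preds != y_te))

print(mni_test_error())
```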
In Figure 12, we train ResNet18 models on the entire CIFAR-10 dataset and evaluate on the CIFAR-10C test sets for the Gaussian blur and Gaussian noise corruptions. The setting is not high-dimensional because we train on 50,000 images with 3072 dimensions. However, the ResNet18 architecture has around 11.7 million parameters, so the level of overparameterization is very high. The training procedure is similar to that used for our MLP experiments. Networks are trained with stochastic gradient descent with a learning rate of 0.1 and a stepped cosine decay schedule for 60 epochs. Each point in the plot is an average over 30 independent runs. As before, we train in PyTorch with a single A100 NVIDIA GPU. Label noise takes the form of random label flips with probabilities 0.1 to 0.9.

J. Additional experiments

We present a number of additional supporting experiments that show: (1) more cases of the behavior of the MNI and MLPs under covariate shift on synthetic datasets; (2) underparameterized and overparameterized regimes for linear regression under covariate shift for more realistic eigendecay rates outside of $(k, \delta, \epsilon)$ spiked covariance models; (3) cases in which the MNI is overfit in a tempered or catastrophic manner and evaluated on OOD datasets constructed based on our results in Theorem 3.4, indicating that our insights hold up for the MNI even when benign source conditions are not satisfied; (4) the value of overparameterization for the MNI trained on CIFAR-10 and evaluated on CIFAR-10C; (5) experiments training ResNet-18 models to interpolation on the full CIFAR-10 dataset and evaluating on CIFAR-10C blur and noise corruptions.

J.1. MNI on Synthetic Data

[Figure 4: three panels plotting test excess MSE against train label noise level for the ID, OOD (beneficial), and OOD (malignant) covariances.]

Figure 4. We fit interpolating linear models to random Gaussian data sampled from spiked covariance models with parameters $k, \delta, \epsilon$. In this setting, $k = 70$, $n = 500$, $p = 4900$, $\delta = 1$ and $\epsilon = 0.005$. To illustrate a beneficial shift, we scale the first $k$ eigenvalues by $\alpha = 1.125$ and the last $p - k$ eigenvalues by $\beta = 0.65$. Similarly, for the malignant shift we use $\alpha = 0.875$ and $\beta = 1.35$. All experiments are averaged over 25 independent runs with standard error bars displayed. Note that the bias is consistently below $10^{-16}$.

In Figure 4, we experiment with interpolating linear models where $\Sigma_s, \Sigma_t$ are given by $(k, \delta, \epsilon)$-spiked covariances with $k = 70$, $n = 500$, and $p = 4900$. We design problem parameters to show settings in which $\mathrm{tr}(\Sigma_t) > \mathrm{tr}(\Sigma_s)$ and we get a beneficial shift, and $\mathrm{tr}(\Sigma_t) < \mathrm{tr}(\Sigma_s)$ and we get a malignant shift. To do this, the source covariance matrix is constructed using $\delta = 1$ and $\epsilon = 0.005$. To illustrate a beneficial shift, we scale the first $k$ eigenvalues by $\alpha = 1.125$ and the last $p - k$ eigenvalues by $\beta = 0.65$. Similarly, for the malignant shift we use $\alpha = 0.875$ and $\beta = 1.35$. The resulting plots are significant because they highlight the distinct effects that the first $k$ and last $p - k$ components have on the excess risk. As illustrated by our main theorems, increasing the energy of an eigenvalue has a negative impact on the risk. Nonetheless, these plots show that where the increase happens plays an important role in how the shift affects generalization.
We are able to improve performance by decreasing the energy on the tail and increasing the energy on the head in such a way that the total energy is increased. In short, this setting is a direct connection to our theory and shows clearly that our constructions for beneficial and malignant shifts, when mildly overparameterized, hold up in low and high train label noise regimes, with higher noise exacerbating the effects of the shifts. In addition, Figure 4 demonstrates that the variance generally contributes much more significantly to the overall risk.

[Figure 5: excess MSE vs. log dimension, $\ln(p)$, for in-distribution and out-of-distribution test sets.]

Figure 5. We experiment with the $(k, \delta, \epsilon)$ spiked covariance models and examine conditions for beneficial and malignant shifts as given in Theorem 3.4. We take $n = 50$, $k = 10$, $\delta = 1.0$, $\epsilon = 10^{-6}$, $\tilde\delta = 1.5$, $\tilde\epsilon = 5 \times 10^{-7}$, and vary $p$. In all cases, $\mathrm{tr}(\Sigma_t) > \mathrm{tr}(\Sigma_s)$, showing that beneficial shifts of this form can occur. As we increase $p$ while keeping other problem parameters fixed, we observe the transition from mild to severe overparameterization and see the cross-over point where the shift goes from beneficial to malignant. For both ID and OOD excess risk, we observe that excess risk is a decreasing function of input dimension. Curves are averaged over 100 independent runs.

Figure 5 shows another example of the transition from mild overparameterization to severe overparameterization in the case of $(k, \delta, \epsilon)$ spiked covariance models. In this example we take $k = 10$, $n = 50$, $\delta = 1.0$, $\epsilon = 10^{-6}$. Using our shifts defined in Theorem 3.4 we set $\alpha = 1.5$ and $\beta = 0.5$. We plot excess MSE on both ID and OOD test sets vs. the input data dimension, while holding all other problem parameters fixed, and clearly observe the transition from beneficial to malignant shifts in keeping with our theorem.

Next, we experimentally show that while our theory is built for benign source covariance structures, it holds for non-benign covariances. In particular, we examine eigendecay rates that are known to lead to tempered overfitting and catastrophic overfitting (Mallinar et al., 2022). Bartlett et al. (2020) identify the covariance structure given by $\lambda_i = i^{-1}\ln^{-2}(i+1)$ as sufficient for benign overfitting. The rate $i^{-\alpha}$ for $\alpha > 1$ is akin to a ridgeless Laplace kernel and corresponds to tempered overfitting. Finally, the rate $i^{-\ln(i)}$ is akin to a ridgeless Gaussian kernel and corresponds to catastrophic overfitting. This relative ordering is determined by how high-dimensional the tail eigenvalues are, in decreasing order.

[Figure 6 panels: ln excess risk vs. ln dimension, with the $p = n$ threshold marked and curves for ID, OOD ($\alpha = 2.00, \beta = 0.10$), and OOD ($\alpha = 0.10, \beta = 2.00$), for (a) ridgeless, $\lambda_i = i^{-1}\ln^{-2}(i+1)$; (b) ridgeless, $\lambda_i = i^{-2}$; (c) ridgeless, $\lambda_i = i^{-\ln(i)}$.]

Figure 6. Comparing covariate shift in underparameterized vs. overparameterized linear regression for three different eigendecay rates. In the $p > n$ setting: (a) leads to benign overfitting, (b) leads to tempered overfitting, and (c) leads to catastrophic overfitting. We implement simple multiplicative shifts with $\alpha, \beta$ as defined in Section 3.1, where we take $n = 50$, $k = 10$. Ground-truth models are sampled uniformly from $\mathbb{S}^{p-1}$ and training label noise is sampled from $\mathcal{N}(0, 2)$. Every curve is averaged over 50 independent runs.
It is clear that even though Theorem 3.4 is stated for the case in which $\Sigma_s$ satisfies benign source conditions, the style of beneficial and malignant shift we identify holds for the MNI even when it is overfit in a non-benign manner. That is, when $\Sigma_s$ has eigendecay rates that are tempered or catastrophic, we can still obtain non-trivial beneficial and malignant shifts by changing the energy on the signal and noise components in a heterogeneous way. We also notice in Figure 6 that, even when varying the dimension up to $p = 2000$ at $n = 50$, $k = 10$, we do not quite observe the cross-over from beneficial to malignant shifts in the overparameterized regime. However, we observe in Figure 6(a) that the two OOD curves begin to cross over. Given our compute budget, we run a variant of Figure 6 where we extend up to $p = 5000$ and take smaller $n$, e.g. $n = 20, 30, 40$, in order to more closely examine the different regimes of overfitting. In addition, we experimentally show results for $p = 5, 10$, which we liken to the classical linear regression regime in which $k = p < n$, meaning all of the signal is captured in the $p$ components. In this setting, $\alpha$ shifts are all that influence the distribution shift behavior. We show these behaviors in Figure 7.

[Figure 7 panels: ln excess risk vs. ln dimension, with curves for OOD ($\alpha = 2.00, \beta = 0.10$) and OOD ($\alpha = 0.10, \beta = 2.00$).]

Figure 7. Comparing covariate shift in underparameterized vs. overparameterized linear regression when $\lambda_i = i^{-1}\ln^{-2}(i+1)$. We implement simple multiplicative shifts with $\alpha, \beta$ as defined in Section 3.1, where we take $k = 10$ and vary $n$. Ground-truth models are sampled uniformly from $\mathbb{S}^{p-1}$ and training label noise is sampled from $\mathcal{N}(0, 2)$. Every curve is averaged over 100 independent runs.

J.2. MLP on Synthetic Data

We now show additional results for MLPs trained to interpolation on synthetic datasets. This experiment is analogous to that of Figure 2, except that we implement shifts in the style of Theorem 3.4.

[Figure 8 panels: test excess MSE vs. train label noise variance, with ID, OOD ($\alpha = 2.0, \beta = 0.1$), and OOD ($\alpha = 0.1, \beta = 2.0$) curves, for (a) $n = 400$, $p = 200$, $h = 2048$ and (b) $n = 400$, $p = 4$k, $h = 2048$.]

Figure 8. We implement multiplicative shifts for interpolating 3-layer ReLU MLPs in the style of Theorem 3.4. Source data, $X$, is sampled from a mean-centered Gaussian with diagonal covariance given by $\lambda_i = i^{-1}\ln^{-1.5}(i+1)$. Ground-truth models are sampled as $\theta_s^* \sim \mathbb{S}^{p-1}$ and training label noise is sampled as $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. Noisy training labels are obtained as $y = X\theta_s^* + \varepsilon$. The target covariances are obtained by multiplying the top $k = 10$ source eigenvalues by $\alpha$ and the bottom $p - k$ source eigenvalues by $\beta$. From our theory, we expect that $\alpha = 2, \beta = 0.1$ leads to beneficial shifts while $\alpha = 0.1, \beta = 2$ leads to malignant shifts. We see this holds up when $p > n$, and that $h > n$ does not change this relationship. All curves are averaged over 20 independent runs and each training run reaches MSE loss $\le 5 \times 10^{-6}$.
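The sketch below mirrors the MLP setup of Figure 8 and Appendix I.1 at a high level. It is a simplified, illustrative re-implementation of ours rather than the authors' code: it uses full-batch SGD with momentum and no learning-rate schedule in place of the mini-batch, stepped-cosine recipe described in Appendix I, and all function and variable names are assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

def make_data(n, p, k=10, alpha=2.0, beta=0.1, noise_std=1.0, seed=0):
    """Synthetic data in the style of Figure 8: Gaussian inputs with eigendecay
    lambda_i = i^{-1} ln^{-1.5}(i+1) and a multiplicative (alpha, beta) target shift."""
    rng = np.random.default_rng(seed)
    i = np.arange(1, p + 1)
    lam_s = 1.0 / (i * np.log(i + 1) ** 1.5)
    lam_t = lam_s * np.where(i <= k, alpha, beta)
    theta = rng.standard_normal(p); theta /= np.linalg.norm(theta)
    X = rng.standard_normal((n, p)) * np.sqrt(lam_s)
    y = X @ theta + noise_std * rng.standard_normal(n)
    X_id = rng.standard_normal((5000, p)) * np.sqrt(lam_s)    # clean ID test set
    X_ood = rng.standard_normal((5000, p)) * np.sqrt(lam_t)   # clean OOD test set
    return map(torch.as_tensor, (X, y, X_id, X_id @ theta, X_ood, X_ood @ theta))

def mlp(p, width=2048):
    return nn.Sequential(nn.Linear(p, width), nn.ReLU(),
                         nn.Linear(width, width), nn.ReLU(),
                         nn.Linear(width, width), nn.ReLU(),
                         nn.Linear(width, 1))

X, y, X_id, y_id, X_ood, y_ood = [t.float() for t in make_data(n=400, p=4000)]
model = mlp(p=4000)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)   # no weight decay
loss_fn = nn.MSELoss()
for epoch in range(1500):                         # train (approximately) to interpolation
    opt.zero_grad()
    loss = loss_fn(model(X).squeeze(-1), y)
    loss.backward(); opt.step()
    if loss.item() < 5e-6:
        break
with torch.no_grad():
    id_mse = loss_fn(model(X_id).squeeze(-1), y_id).item()
    ood_mse = loss_fn(model(X_ood).squeeze(-1), y_ood).item()
print(f"train MSE {loss.item():.2e}  ID test MSE {id_mse:.4f}  OOD test MSE {ood_mse:.4f}")
```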
In Figure 2 we sampled the ID dataset from a mean-centered Gaussian with diagonal covariance that has eigenvalues $\lambda_i = i^{-1}\ln^{-3}(i+1)$, and we examined the behavior for OOD datasets under covariate shift where the eigenvalues of the OOD covariance are given by $\lambda_i = i^{-1}\ln^{-2}(i+1)$ and $\lambda_i = i^{-1}\ln^{-4}(i+1)$. In Figure 8 we take the ID data to be sampled from a mean-centered Gaussian with diagonal covariance that has eigenvalues $\lambda_i = i^{-1}\ln^{-1.5}(i+1)$. For the covariance of the OOD datasets, we shift the top $k = 10$ eigenvalues by a factor of $\alpha$ and the bottom $p - k$ eigenvalues by a factor of $\beta$, as in the setting of Theorem 3.4. We experiment here with $\alpha = 2, \beta = 0.1$ and $\alpha = 0.1, \beta = 2$. Each model achieves training MSE $\le 5 \times 10^{-6}$. We see the same trends as in Figure 2 with respect to $p > n$ versus $h > n$. In the $p < n$ case, even though $h > n$, we do not clearly observe a beneficial shift as predicted by our high-dimensional linear theory. However, when $p > n$ we do observe beneficial shifts for $\alpha = 2, \beta = 0.1$, as suggested by our theorem for the mildly overparameterized case.

J.3. MNI on CIFAR-10C Experiments

J.3.1. ADDITIONAL BLUR AND NOISE FILTER EXPERIMENTS

[Figure 9 panels, with curves for in-distribution and severities 1-5: (a) Blur Covariance and (b) Noise Covariance plot log eigenvalue $\ln(\lambda_i)$ vs. eigenvalue index $i$; (c) MNI tested on Blur and (d) MNI tested on Noise plot test excess classification error vs. train label noise.]

Figure 9. We fit the MNI to binary CIFAR-10 (dog vs. truck) and test on binary CIFAR-10C under Gaussian blur and noise corruptions. In (a), (b) we plot the eigenvalues of the covariance matrices for the ID test data and for the test sets at each severity. To ensure $p > n$ we subsample the training set to $n = 1$k and average curves over 50 independent runs. We evaluate the MNI against all 5 corruption severities and plot excess classification error vs. training label noise, which is the class label flip probability. We see that the eigenspectra of the OOD datasets are directly correlated with the OOD performance of the MNI.

[Figure 10 panels, with curves for in-distribution and severities 1-5: (a) Blur and Noise Covariance plots log eigenvalue vs. eigenvalue index; (b) $n = 500$ and (c) $n = 2000$ plot test excess classification error vs. train label noise.]

Figure 10. We consider an experiment using a custom variant of the CIFAR-10C out-of-distribution (OOD) test sets while continuing to train on the original CIFAR-10 dataset with $n$ training samples at varying amounts of training label noise. Our construction injects Gaussian noise at varying severity levels into the top 200 high-variance directions of the Gaussian blur test sets at each severity level. In (a) we plot the log of the spectrum of the covariance matrices of each test set. This results in a covariance spectrum in which the top eigenvalues of the OOD data are larger than those of the in-distribution (ID) data, and the bottom eigenvalues of the OOD data are smaller than those of the ID data. This corresponds to the $\alpha > 1$, $\beta < 1$ setting in our taxonomy. In (b) and (c) we plot test excess classification error vs. train label noise.
In (b) we show the severely overparameterized setting, which results in malignant shifts, and in (c) we show the mildly overparameterized setting, which results in beneficial shifts, in keeping with the intuitions from our taxonomy. Furthermore, the traces of the OOD covariances are larger than that of the ID covariance, and yet in (c) we observe improved OOD performance, in contrast to intuitions from prior work.

J.3.2. VARYING LEVELS OF OVERPARAMETERIZATION

[Figure 11 panels: test excess classification error vs. train label noise. Top row ($p > n$; curves for $n = 500, 1000, 2000$): (a) In-distribution, (b) Blur, (c) Noise. Bottom row ($p < n$; curves for $n = 5000, 7500, 10000$): (d) In-distribution, (e) Blur, (f) Noise.]

Figure 11. We fit the ridgeless OLS solution to binary CIFAR-10 (dog vs. truck) and test on binary CIFAR-10C under Gaussian blur and noise corruptions. In the top row, we vary the level of overparameterization as $n = 500$, 1k, 2k and average each curve over 50 independent runs. In this $p > n$ setting the ridgeless OLS solution results in the MNI. In the bottom row we obtain a non-interpolating, ridgeless linear solution. Evaluations are done on severity 3 of CIFAR-10C; however, the results hold up across all severities. We plot excess classification error vs. training label noise, which is the class label flip probability. We see that overparameterization improves robustness of the MNI at all noise levels.

In Figure 11 (a-c) we show that overparameterization improves OOD excess classification error for the MNI fit to binary CIFAR-10 and evaluated on binary CIFAR-10C under Gaussian blur and noise corruptions. The details of these datasets and setups are given in Appendix I. We note that all of the curves in the top row of this figure are in the overparameterized regime, meaning they are on the right side of the double descent curve. Flattened CIFAR images have $p = 3{,}072$, and so we vary the training subsample size over $n = 500, 1000, 2000$ in order to remain in an overparameterized setting. We find that when we are overparameterized, as we reduce $n$ we obtain improved performance. We average over 50 independent runs in each setting and provide standard error bars to show that this observation is not due to specific random samples. We also see that at higher levels of overparameterization, the relative difference in excess classification error between the ID, blur, and noise test sets lessens. For example, at 0.0 label noise and $n = 2$k the average excess error varies from 0.3028 on the blur set to 0.4668 on the noise set, for an absolute difference of 0.164, whereas at $n = 500$ the average excess error only varies from 0.252 on the blur set to 0.3131 on the noise set, for an absolute difference of 0.0611. For completeness, in Figure 11 (d-f) we show the above setting in the underparameterized regime, where we obtain the linear solution via the ridgeless OLS solution. As these plots are on the left side of the double descent peak, we see that adding more data improves OOD excess classification error.
While these models are not interpolating, we observe that noise corruptions lead to nearly catastrophic performance, meaning random guessing, on the OOD test sets, whereas blur corruptions lead to more benign performance. Finally, the ID performance appears to be tempered, showing a nearly linear relationship between train label noise and test excess classification error.

J.4. ResNet on CIFAR-10C Experiments

[Figure 12 panels: test classification error vs. train label noise, with curves for in-distribution and severities 1-5: (a) Interpolating ResNet tested on Blur, (b) Interpolating ResNet tested on Noise.]

Figure 12. We train ResNet18 on clean CIFAR-10 and evaluate on test sets that have been corrupted by Gaussian blur and Gaussian noise, which correspond to beneficial and malignant shifts, respectively. Labels are flipped with probability 0.1 through 0.9, seen on the x-axis. The setting is not high-dimensional because the training data contains 50,000 images, each of which is 3072-dimensional. ResNet18 contains around 11.7 million parameters, so the setting is very overparameterized. We observe that while both shifts negatively affect generalization, the beneficial shift is not as bad as the malignant shift. This result is similar to those seen in subfigures (a) and (b) of Figure 2, where the data is not high-dimensional but the MLP is overparameterized.

Figure 12 shows the behavior of interpolating ResNets trained on the full CIFAR-10 dataset and evaluated on CIFAR-10C blur and noise corruptions. While these numbers are suboptimal with respect to CNNs on CIFAR-10, we note that they are justified in our setting, as our goal is to study interpolating models. At 90% label noise it takes a lot of compute to interpolate the entire CIFAR-10 dataset, especially if using data augmentations, weight decay, or other regularizations. As such, we turn off weight decay and data augmentations for these models in order to tractably interpolate CIFAR-10 at high noise levels.
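For completeness, the following is a rough sketch of this ResNet-18 training recipe (an illustrative approximation of ours, not the authors' code). It assumes torchvision is available, uses torchvision's ResNet-18 with a common CIFAR-style stem modification, plain cosine annealing in place of the stepped cosine schedule, and random relabeling for label noise (which may occasionally resample the original class); the CIFAR-10C evaluation loop is omitted.

```python
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

def noisy_cifar10_loader(flip_prob=0.1, batch_size=128, seed=0):
    """CIFAR-10 train set with random label flips and no augmentation, loosely following
    the recipe described above (an illustrative sketch, not the authors' exact code)."""
    rng = np.random.default_rng(seed)
    ds = datasets.CIFAR10(root="./data", train=True, download=True,
                          transform=transforms.ToTensor())
    labels = np.array(ds.targets)
    flip = rng.random(len(labels)) < flip_prob
    labels[flip] = rng.integers(0, 10, size=flip.sum())   # assign a random label to flipped points
    ds.targets = labels.tolist()
    return DataLoader(ds, batch_size=batch_size, shuffle=True, num_workers=2)

def cifar_resnet18():
    model = models.resnet18(num_classes=10)
    # Common CIFAR adaptation (an assumption): 3x3 stem convolution and no initial max-pool.
    model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
    model.maxpool = nn.Identity()
    return model

device = "cuda" if torch.cuda.is_available() else "cpu"
model = cifar_resnet18().to(device)
loader = noisy_cifar10_loader(flip_prob=0.1)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=0.0)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=60)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(60):
    model.train()
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()
    sched.step()
# Evaluation on the CIFAR-10C blur/noise corruption arrays would follow here.
```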