# optimal_ridge_regularization_for_outofdistribution_prediction__5b07b14b.pdf Optimal Ridge Regularization for Out-of-Distribution Prediction Pratik Patil 1 Jin-Hong Du 2 3 Ryan J. Tibshirani 1 We study the behavior of optimal ridge regularization and optimal ridge risk for out-of-distribution prediction, where the test distribution deviates arbitrarily from the train distribution. We establish general conditions that determine the sign of the optimal regularization level under covariate and regression shifts. These conditions capture the alignment between the covariance and signal structures in the train and test data and reveal stark differences compared to the in-distribution setting. For example, a negative regularization level can be optimal under covariate shift or regression shift, even when the training features are isotropic or the design is underparameterized. Furthermore, we prove that the optimally tuned risk is monotonic in the data aspect ratio, even in the out-of-distribution setting and when optimizing over negative regularization levels. In general, our results do not make any modeling assumptions for the train or the test distributions, except for moment bounds, and allow for arbitrary shifts and the widest possible range of (negative) regularization levels. 1. Introduction Regularization plays a crucial role in statistical modeling and is commonly incorporated into optimization-based models through a regularization term. Its effectiveness relies on properly scaling the regularization term, which is controlled by a penalty parameter that the data scientist needs to tune. Recent work in machine learning (precise references given shortly) has shed light on some rather surprising behavior exhibited by the optimal regularization level in overparameter- 1Department of Statistics, University of California, Berkeley, CA 94720, USA. 2Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA. 3Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA.. Correspondence to: Pratik Patil . Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s). ized prediction models, which can be zero or even negative in certain problems with moderate signal-to-noise ratio and high dimensionality. This stands in contrast to the typical behavior from classical low-dimensional learning theory. With this motivation, our paper focuses on two key questions for high-dimensional ridge regression: (Q1) What is the behavior of the optimal ridge penalty, as a function of parameters such as signal-to-noise ratio, data aspect ratio, feature correlations, and signal structure? (Q2) What is the behavior of the optimally tuned ridge risk, as a function of these same problem parameters? To set the notation, let (xi, yi) for i [n] be i.i.d. observations in Rp R representing the feature vector and response. Denote the data matrix as X = [x1, . . . , xn] Rn p and the associated response vector as y = [y1, . . . , yn] Rn. Given a ridge penalty λ > 0, recall the ridge regression fits: bβλ = argmin b Rp y Xb 2 2/n + λ b 2 2. 
(1) Ridge regression (Hoerl & Kennard, 1970a;b) has received considerable recent attention, particularly in settings involving overparameterization, such as double descent (see, e.g., Belkin et al., 2020; Hastie et al., 2022; Muthukumar et al., 2020, and references therein) and benign overfitting (Bartlett et al., 2020; Koehler et al., 2021; Mallinar et al., 2022). This interest in ridge regression, especially its ridgeless limit, where λ 0+, owes to its peculiar double/multiple descent risk behavior in overparameterized regimes, which (on the surface) challenges the conventional understanding of the role of regularization (Hastie, 2020). By defining the ridge estimator as: bβλ = (X X/n + λIp) X y/n, (2) where A denotes the Moore-Penrose pseudoinverse of A, we simultaneously accommodate the case of λ > 0, in which case (2) reduces to the unique ridge predictor obtained using (1), and the case of λ = 0, in which case (2) becomes the minimum ℓ2-norm interpolator among many solutions to problem (1) when p rank(X) = n. Note that the above definition (2) is well defined even when λ < 0. For more background on the formulation of ridge regression with λ < 0, see Appendix B. Optimal Ridge Regularization for Out-of-Distribution Prediction Partial answers to questions (Q1) and (Q2) are known for the in-distribution squared prediction risk, defined as: R(bβλ) = Ex0,y0[(y0 x 0 bβλ)2 | X, y], (3) where (x0, y0) is a test point sampled independently from the same distribution Px,y as the training data. Note that the prediction risk (3) is conditional on the training data and is a function of (X, y) and the properties of Px,y. Arguably, the cleanest answers to (Q1) and (Q2) are obtained under proportional asymptotics where the sample size n and the feature size p diverge proportionally to an aspect ratio ϕ (0, ). Under certain additional assumptions, the risk (3) then almost surely converges to a limit R(λ, ϕ) that depends only on coarse properties of Px,y. Analyzing the behavior of the optimal ridge penalty λ (which minimizes R(λ, ϕ) over λ) and the optimal risk R(λ , ϕ) then consequently allows us to answer questions (Q1) and (Q2), respectively. For (Q1), consider a well-specified linear model y = Xβ + ε, where the noise ε (0n, σ2I) is independent of X, and the signal is random with β (0p, (α2/p)I). Remarkably, despite the lack of a closed-form expression for R(λ , ϕ), Dobriban & Wager (2018) show that λ = ϕ/SNR > 0, where SNR = α2/σ2 is the signal-to-noise ratio. However, in real-data experiments, it has been observed that a negative ridge penalty can be optimal (Kobak et al., 2020). Motivated by this, Wu & Xu (2020); Richards et al. (2021) analyze the sign behavior of λ beyond random isotropic signals and establish sufficient conditions for when λ < 0 or λ = 0. For (Q2), again remarkably, it follows from the results of Dobriban & Wager (2018) that for isotropic features and random isotropic signals, the risk R(λ , ϕ) increases monotonically with the data aspect ratio ϕ. Recent work by Patil & Du (2023, Theorem 6) extends this result to anisotropic features and deterministic signals (with arbitrary response distributions of bounded moments), demonstrating that optimal ridge regression exhibits a monotonic risk profile and avoids double and multiple descents. In a certain sense, these results imply the sample-wise monotonicity of the optimal expected conditional risk (over the training data set). 
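To make the definition in (2) concrete, here is a minimal numerical sketch of the pseudoinverse-based ridge estimator; the function name, toy dimensions, and random seed are illustrative choices rather than anything from the paper, and the admissible range of negative penalties is characterized later in Definition 2.3.

```python
import numpy as np

def ridge_estimator(X, y, lam):
    """Ridge estimator in (2): (X'X/n + lam I)^+ X'y / n.

    The Moore-Penrose pseudoinverse covers lam > 0 (usual ridge), lam = 0
    (the minimum l2-norm interpolator when p > rank(X) = n), and lam < 0.
    """
    n, p = X.shape
    return np.linalg.pinv(X.T @ X / n + lam * np.eye(p)) @ (X.T @ y / n)

rng = np.random.default_rng(0)
n, p = 50, 100                              # overparameterized toy design
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta + 0.1 * rng.standard_normal(n)

b_pos = ridge_estimator(X, y, lam=0.5)      # ordinary ridge
b_zero = ridge_estimator(X, y, lam=0.0)     # ridgeless limit
b_neg = ridge_estimator(X, y, lam=-0.05)    # negative penalties are also well defined
print(np.allclose(X @ b_zero, y))           # True: min-norm interpolation when p > n
```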
We work to answer (Q1) and (Q2) more comprehensively across essentially all possible in-distribution (IND) ridge prediction problems. Furthermore, we will generalize this by also considering the out-of-distribution (OOD) setting, where the test point (x0, y0) in (3) has a different distribution Px0,y0 than the train distribution Px,y = Px Py|x. Distribution shift occurs in many practical machine learning applications and has gained increasing attention in the learning community. We focus on two common types of distribution shifts: (i) Covariate shift: where Px0 = Px but Py0|x0 = Py|x. (ii) Regression shift: where Py0|x0 = Py|x but Px0 = Px. Thus, we also study following generalizations of (Q1) (Q2): (Q1 ) How does distribution shift alter optimal regularization? Specifically, under what types of shift is the optimal penalty λ in the OOD setting notably different (typically smaller) compared to the IND setting? (Q2 ) How does distribution shift alter optimal risk behavior? Specifically, is R(λ , ϕ) still monotonic in ϕ when λ changes due to the distribution shift? Conversely, is optimal regularization necessary to ensure monotonic risk? 1.1. Summary of Results and Paper Outline Extended risk characterization. In Section 2, we extend the scope of risk characterization for ridge regression for the out-of-distribution setting (Proposition 2.4) that: (i) does not assume any model for the train or the test distribution, apart from certain moment bounds on the train and test response distributions, (ii) does not assume that the spectrums of feature covariance or signal projections converge to fixed distributions, and (iii) allows for the widest possible range of (negative) regularization level λmin (see Definition 2.3). Properties of optimal regularization. In Section 3, we characterize the conditions that determine the sign of λ under covariate shift (Theorem 3.3) and regression shift (Theorem 3.4). These conditions capture the general alignment between the signal and the covariance spectrum, which isolates the cases where the sign of λ under the OOD setting changes from the IND setting. Our work subsumes and extends previously known results on optimal ridge regularization to the best of our knowledge (see Table 1 for precise comparisons). Properties of optimal risk. In Section 4, we show that the OOD risk of the optimally tuned ridge R(λ , ϕ) is monotonic in the data aspect ratio ϕ and SNR (Theorem 4.2). Furthermore, we establish a partial converse (Theorem 4.3) that shows risk non-monotonicity under suboptimal regularization. To prove our results, we exploit the equivalences between subsampling and ridge regression from (Patil & Du, 2023) to the OOD setting and also allow for tuning over negative regularization (Theorem 4.4). 1.2. Related Work and Comparisons Ridge risk characterization. The asymptotic risk of ridge regression has been extensively studied in the literature under proportional asymptotics when p/n ϕ (0, ), as n, p using tools from random matrix theory and statistical physics. For well-specified linear models, expressions of risk asymptotics for the IND setting are obtained by Dobriban & Wager (2018); Hastie et al. (2022), among others. Historically, heuristic derivations of these expressions have also been derived for Gaussian process regression by Sollich (2001). Additionally, several works have explored Optimal Ridge Regularization for Out-of-Distribution Prediction Table 1: Optimal regularization landscape in ridge regression. 
Here, # indicates either an isotropic feature or signal covariance, and indicates anisotropic features or signal covariance. For the data aspect ratio ϕ, all indicates ϕ (0, ), under indicates ϕ (0, 1) for the underparameterized regime, and over indicates ϕ (1, ) for the overparameterized regime. For the minimum penalty λmin, neg and more neg respectively indicate the naive (loose) and exact lower bound on the negative values (Definition 2.3). For the optimal penalty λ , green and red contrast the cases when the sign changes. Arb. Mod. , Arb. SNR. , and Arb. Spec. indicate allowing for arbitrary response model, signal-to-noise ratio, and feature covariance spectrum, respectively. Please see Table 6 for the reference key. Σ β Σ0 β0 ϕ 1 λmin Arb. Mod. Arb. SNR Arb. Spec. Additional Specific Data Geometry Conditions λ Reference In-distribution # Σ β all zero + [DW, Thm. 2.1] # Σ β all zero + [HMRT, Cor. 5] under neg + [WX, Prop. 6] over neg Strict misalignment of (Σ, β) + [WX, Thm. 4] over neg Strict alignment of (Σ, β) and/or special feature model [WX, Thm. 4, Prop. 7] over zero 0 [RMR, Cor. 2] under more neg + Theorem 3.1 (1) over more neg General alignment of (Σ, β, σ2) Theorem 3.1 (2) Out-of-distribution # Σ0 β all more neg + Proposition 3.2 Σ0 β under more neg + Theorem 3.3 (1) I β over more neg + Theorem 3.3 (2) # Σ0 β over more neg General alignment of (Σ0, β, σ2) Theorem 3.3 (3) under more neg General alignment of (Σ, β, β0) Theorem 3.4 (1), (39) under more neg General misalignment of (Σ, β, β0) + Theorem 3.4 (1), (39) over more neg General alignment of (Σ, β, β0, σ2) Theorem 3.4 (2) risk asymptotics and its implications in different variants of ridge regression (Wei et al., 2022; Mel & Ganguli, 2021; Loureiro et al., 2021; Jacot et al., 2020; Simon et al., 2021; Zhou et al., 2023; Bach, 2024; Pesce et al., 2023). The risk asymptotics for the OOD setting are obtained by Canatar et al. (2021), D Amour et al. (2022, Section E.5), Patil et al. (2022, Section S.6.5), Tripuraneni et al. (2021). However, these works assume either random Gaussian features or a well-specified linear model or restrict to only the positive range of regularization. Our work extends this literature by allowing for general response models and the widest possible range of negative regularization. Behavior of optimal regularization. Under random signals with isotropic covariance, Dobriban & Wager (2018) show that the asymptotic risk over the positive range of regularization λ > 0 is minimized at λ = ϕ/SNR. Here SNR = α2/σ2, where σ2 is the noise energy and α2 is the signal energy. Remarkably, this result is invariant of the feature covariance. Similar results under Gaussian assumptions are derived by Dicker (2016), and Han & Xu (2023) extend the result to most signals in the unit ball with high probability. However, Kobak et al. (2020) demonstrate that optimal regularization can be negative for certain signal and covariance structures in real datasets. Motivated by these curious experiments, Wu & Xu (2020) provide sufficient conditions for optimal regularization of the Bayes risk under anisotropic feature covariance and random signal, assuming a limiting spectrum distribution of the covariance matrix Σ and alignment conditions between the eigenvalues of Σ and the projections of the signal β onto the eigenspace of Σ. Furthermore, Richards et al. (2021) consider strict alignment conditions for a special feature model but do not explicitly consider negative regularization. 
We refer to these conditions as strict (mis)alignment conditions in this paper. Our paper extends the scope of the aforementioned results for the OOD setting with general response models and allows for the widest possible range of negative regularization. The main differences to Wu & Xu (2020) are the assumptions (linear models and limiting spectrum distribution) and the hypothesis of the theorem (aligned/misaligned). Their analysis only considers high SNR regimes and analyzes the behavior of optimal regularization for bias when ϕ > 1, although they also provide an upper bound for the noise level under special cases. We provide generic sufficient conditions. We call these general (mis)alignment conditions in the paper. Table 1 provides a detailed comparison summary. Behavior of optimal risk. Under a random isotropic signal, Dobriban & Wager (2018) obtain the expression for limiting optimal risk, which can be shown to be monotonic in ϕ. Patil & Du (2023) extends these theoretical guarantees to features with an arbitrary covariance matrix and a general momentbounded response. However, their analysis is limited to the IND setting and positive regularization. Recent works have also explored other aspects of the monotonicity of optimal risk; see, for example, Nakkiran et al. (2021); Simon et al. Optimal Ridge Regularization for Out-of-Distribution Prediction (2023); Yang & Suzuki (2023). We extend the monotonicity result in Patil & Du (2023) to the OOD setting and allow for negative regularization. In addition, we also show the monotonicity of the optimal risk in SNR. The proof technique for risk monotonicity leverages the equivalences between subsampling and ridge regularization established by Du et al. (2023); Patil & Du (2023). However, the equivalences in these works only consider cases where the ridge penalty is non-negative. We extend their analyses to accommodate negative regularization, which requires extending the properties of the parameters that appear in certain fixed-point equations under negative regularization (see Appendix F). 2. Out-of-Distribution Risk Asymptotics Before describing the properties of λ in Section 3 and the behavior of optimal risk R(bβλ ) in Section 4, we provide the risk asymptotics in this section. For the reader s convenience, we summarize all our notation in Appendix A. We state assumptions on the train and test distributions below. 2.1. Data Assumptions We first define a general feature and response distribution, which we will use in our subsequent assumption shortly. Definition 2.1 (General feature and response distribution). For x Px, it can be decomposed as x = Σ1/2z, where z Rp contains i.i.d. entries with mean 0, variance 1, and (4+µ)-th moment uniformly bounded for some µ > 0. Here Σ Rp p is deterministic and symmetric with eigenvalues uniformly bounded away from 0 and + . For y Py, it has mean 0 and (4 + ν)-th moment uniformly bounded for some ν > 0. The L2 linear projection parameter of y onto x is denoted by β = E[xx ] 1E[xy], and the variance of the conditional distribution Py|x is denoted by σ2. The joint distribution Px,y is parameterized by (Σ, β, σ2). Definition 2.1 imposes weak moment assumptions on covariates and responses, which are commonly used in random matrix theory and overparameterized risk analysis (Hastie et al., 2022; Bartlett et al., 2021). These assumptions encode a wide class of distributions over Rp+1. By decomposing d Py = d Px d Py|x, we can express the response as y = x β+ε, where ε is uncorrelated with x and E[ε2] = σ2. 
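As a small illustration of Definition 2.1, the sketch below (hypothetical code, with Gaussian z and an AR(1) covariance chosen only for concreteness) draws features x = Σ^{1/2}z, builds a deliberately non-linear response, and computes the empirical analogue of the L2 projection parameter β = E[xx⊤]⁻¹E[xy]; the residual ε = y − x⊤β is uncorrelated with x even though no linear model holds.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20000, 5

# Feature model of Definition 2.1: x = Sigma^{1/2} z with i.i.d. standardized entries z.
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
w, V = np.linalg.eigh(Sigma)
X = rng.standard_normal((n, p)) @ (V * np.sqrt(w)) @ V.T   # rows have covariance Sigma

# A deliberately non-linear response: Definition 2.1 does not assume a linear model.
y = np.tanh(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)
y = y - y.mean()                                           # match the mean-zero convention

# Empirical analogue of the L2 projection parameter beta = E[xx']^{-1} E[xy].
beta = np.linalg.solve(X.T @ X / n, X.T @ y / n)
eps = y - X @ beta
print(np.abs(X.T @ eps / n).max())  # ~0: eps is uncorrelated with x, though not independent of it
```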
Note that the L2 projection parameter β minimizes the linear regression error (Györfi et al., 2006; Buja et al., 2019a;b).¹ Also note that this formulation does not impose any specific structure on the conditional distribution Py|x and does not imply that (x, y) follows a linear model, as ε is also a function of x. (¹Technically, our results can accommodate the conditional variance σ² depending on x with suitable regularity conditions, but for simplicity, we do not consider this variation in the current paper.) It is possible to further relax the assumption on the feature vector x to only require an appropriate convex concentration (that proves versions of the Marchenko-Pastur law) (Louart, 2022; Cheng & Montanari, 2022) or even certain infinitesimal asymptotic freeness between the population covariance matrix Σ and the sample covariance matrix X⊤X/n (Le Jeune et al., 2024; Patil & Le Jeune, 2024). We do not consider such relaxations here. Under Definition 2.1, the joint distribution Px,y = Px,y(Σ, β, σ²) is parameterized by (Σ, β, σ²). We next state assumptions on the train and test distributions in terms of these distributions, allowing for different sets of parameters. Assumption 2.2 (Train and test distributions). Assume that Px,y and Px0,y0 are distributed according to Definition 2.1, parameterized by (Σ, β, σ²) and (Σ0, β0, σ0²), respectively. In this paper, we consider the following types of shifts: (i) Covariate shift: where Σ ≠ Σ0 but (β, σ) = (β0, σ0). (ii) Regression shift: where Σ = Σ0 but (β, σ) ≠ (β0, σ0). (iii) Joint shift: where Σ ≠ Σ0 and (β, σ) ≠ (β0, σ0). Observe that this framework also encompasses various risk notions (even for the IND setting), including the estimation risk, which arises when (Σ0, β0, σ0²) = (I, β, 0). 2.2. Out-of-Distribution Risk Asymptotics In this section, we obtain the asymptotic risk of ridge regression in the OOD setting. Most of the papers on ridge regression consider the range of regularization λ ≥ 0. Motivated by empirical findings in Kobak et al. (2020) that negative regularization can be optimal in real datasets, some recent works consider negative regularization; see, e.g., Wu & Xu (2020); Patil et al. (2021; 2022), using a naive lower bound of −rmin(1 − √ϕ)² for λ. A tighter lower bound can be obtained from Theorem 3.1 of Le Jeune et al. (2024), which provides an explicit characterization of the smallest nonzero eigenvalue of Wishart-type matrices. This bound is derived by explicitly identifying the analytic continuation to the real line of a unique solution to a certain fixed-point equation over the (upper) complex half-plane (Silverstein & Choi, 1995; Dobriban, 2015). The new bound can significantly outperform the previous naive lower bound (see Figure 1 of Le Jeune et al., 2024). Definition 2.3 (Lower bound on negative regularization). Let µmin ∈ R be the unique solution, satisfying µmin > −rmin, to the equation: 1 = ϕ tr̄[Σ²(Σ + µmin I)⁻²], (4) and let λmin(ϕ) be given by: λmin(ϕ) = µmin − ϕ tr̄[µmin Σ(Σ + µmin I)⁻¹]. (5) This enables feasible risk estimation over λ ∈ (λmin(ϕ), ∞). Here tr̄[A] denotes the average trace tr[A]/p of a matrix A ∈ Rp×p. To reiterate, the difference between the bound (5) and the naive bound used in Wu & Xu (2020); Patil et al. (2021) can be significant, as seen in Figure 1.
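The lower bound in Definition 2.3 can be computed numerically from the spectrum of Σ. The sketch below (the function name and the root-bracketing strategy are our own; only the fixed-point equations (4)–(5) come from the text) solves (4) for µmin, returns λmin(ϕ), and checks the isotropic case, where the bound reduces to −(1 − √ϕ)² and coincides with the naive bound.

```python
import numpy as np
from scipy.optimize import brentq

def lambda_min(eigs, phi):
    """Lower bound of Definition 2.3, computed from the eigenvalues of Sigma.

    Solves (4), 1 = phi * mean(d^2 / (d + mu)^2), for the unique root mu_min > -min(d),
    then returns (5), lambda_min = mu_min - phi * mean(mu_min * d / (d + mu_min)).
    """
    d = np.asarray(eigs, dtype=float)
    gap = lambda mu: phi * np.mean(d**2 / (d + mu) ** 2) - 1.0
    lo, hi = -d.min() + 1e-10, 1.0
    while gap(hi) > 0:          # gap is decreasing in mu, so bracket its unique root
        hi *= 2.0
    mu_min = brentq(gap, lo, hi)
    return mu_min - phi * np.mean(mu_min * d / (d + mu_min))

phi = 10.0
# Isotropic check: both quantities equal -(1 - sqrt(phi))^2.
print(lambda_min(np.ones(500), phi), -(1 - np.sqrt(phi)) ** 2)

# Anisotropic example (AR(1) spectrum): for this spectrum the exact bound is more
# negative than the naive bound -r_min * (1 - sqrt(phi))^2, admitting a wider range of lambda.
idx = np.arange(500)
d_ar1 = np.linalg.eigvalsh(0.5 ** np.abs(np.subtract.outer(idx, idx)))
print(lambda_min(d_ar1, phi), -d_ar1.min() * (1 - np.sqrt(phi)) ** 2)
```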
To characterize the asymptotic out-of-distribution (OOD) risk in Proposition 2.4, we first define the non-negative constants µ = µ(λ, ϕ) and ṽ = ṽ(λ, ϕ; Σ0) as solutions of the following fixed-point equations: µ = λ + ϕ tr̄[µΣ(Σ + µI)⁻¹], ṽ = ϕ tr̄[Σ0Σ(Σ + µI)⁻²] / (1 − ϕ tr̄[Σ²(Σ + µI)⁻²]). One can interpret µ as the level of implicit regularization self-induced by the data (Bartlett et al., 2021; Misiakiewicz & Montanari, 2023). Alternatively, it is also common to parameterize the equations using its inverse v(λ, ϕ) = µ(λ, ϕ)⁻¹, which corresponds to the Stieltjes transform of the spectrum of the sample Gram matrix in the limit. With this notation in place, we can now extend the result in Eq. (11) of Patil & Du (2023) to the OOD setting as formalized below. Proposition 2.4 (Deterministic equivalents for OOD risk). Under Assumption 2.2, as n, p → ∞ such that p/n → ϕ ∈ (0, ∞) and λ ∈ (λmin(ϕ), ∞), the prediction risk R(β̂λ) defined in (3) admits a deterministic equivalent R(β̂λ) ≃ R(λ, ϕ), where the equivalent additively decomposes into: R(λ, ϕ) := B(λ, ϕ) + V(λ, ϕ) + S(λ, ϕ) + κ², (6) with the following deterministic equivalents for the bias, variance, regression shift bias, and irreducible error: B = µ² β⊤(Σ + µI)⁻¹(ṽΣ + Σ0)(Σ + µI)⁻¹β, V = σ² ṽ, S = 2µ β⊤(Σ + µI)⁻¹Σ0(β0 − β), κ² = (β0 − β)⊤Σ0(β0 − β) + σ0². Note that the deterministic equivalents presented in Proposition 2.4 depend not only on the regularization parameters (λ, ϕ), but also on the problem parameters (Σ, β, σ²) and (Σ0, β0, σ0²), which we have omitted for notational brevity. Since the risk depends additively on σ0², we focus mainly on the effect of (Σ0, β0) in our analysis. Extending this result to finite samples is possible by imposing additional distributional assumptions on the features and response. Techniques in Knowles & Yin (2017); Cheng & Montanari (2022); Louart (2022), among others, can be used to obtain nonasymptotic analogs of Proposition 2.4. In this paper, we will focus only on the deterministic equivalents, which capture the first-order information (akin to expectation) of interest for our goals. 3. Properties of Optimal Regularization In this section, we focus on the optimal ridge penalty λ* for the asymptotic out-of-distribution (OOD) risk, defined as²: λ* ∈ argmin{ R(λ, ϕ) : λ ≥ λmin(ϕ) }. (7) As discussed in Section 1.2, previous studies have explored the properties of λ* for ridge regression, as summarized in Table 1. However, these studies predominantly focus on specific scenarios, such as isotropic signals or features, and do not consider the full range of negative penalty values. Furthermore, their investigations are restricted mainly to the IND setting when (Σ0, β0) = (Σ, β). We broaden the scope of these results, considering more general scenarios, including anisotropic signals, the full range of (negative) regularization, and both IND and OOD settings. 3.1. In-Distribution Optimal Regularization We present our initial result for the IND setting, which encompasses and extends the scope of previous works. Based on Proposition 2.4, we can characterize the properties of the optimal ridge penalty λ* defined in (7) as follows. Theorem 3.1 (Optimal regularization sign for IND risk). Assume the setup of Proposition 2.4 with (Σ0, β0) = (Σ, β). 1. (Underparameterized) When ϕ < 1, we have λ* ≥ 0. 2. (Overparameterized) When ϕ > 1, if for all v < 1/µ(0, ϕ), the following general alignment condition holds: (tr̄[BΣ(vΣ + I)⁻²] + σ²) / (tr̄[BΣ(vΣ + I)⁻³] + σ²) > tr̄[Σ(vΣ + I)⁻²] / tr̄[Σ(vΣ + I)⁻³], (8) where B = ββ⊤, then we have λ* < 0.
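For intuition, the following sketch evaluates the deterministic equivalents of Proposition 2.4 in the simplified case where Σ and Σ0 are simultaneously diagonalizable, so that all average traces reduce to means over eigenvalues; the helper names, the penalty grid, and the AR(1) example (mirroring the Figure 1 data model) are illustrative assumptions, and the variance term is taken in the form V = σ²ṽ stated above.

```python
import numpy as np
from scipy.optimize import brentq

def tr_bar(vals):
    """Average trace of a diagonal matrix, given its eigenvalues."""
    return float(np.mean(vals))

def solve_mu(lam, phi, d):
    """Invert lam(mu) = mu * (1 - phi * tr_bar(d / (d + mu))) on mu > mu_min."""
    gap = lambda mu: phi * tr_bar(d**2 / (d + mu) ** 2) - 1.0
    lo, hi = -d.min() + 1e-10, 1.0
    while gap(hi) > 0:
        hi *= 2.0
    mu_min = brentq(gap, lo, hi)                      # as in Definition 2.3
    lam_of_mu = lambda mu: mu * (1.0 - phi * tr_bar(d / (d + mu)))
    hi = max(abs(lam), 1.0)
    while lam_of_mu(hi) < lam:
        hi *= 2.0
    return brentq(lambda mu: lam_of_mu(mu) - lam, mu_min + 1e-10, hi)

def risk_equiv(lam, phi, d, d0, b, b0, sigma2, sigma2_0):
    """Deterministic OOD risk equivalent (6), assuming Sigma and Sigma0 share an
    eigenbasis, with eigenvalues d, d0 and signal coordinates b, b0 in that basis."""
    mu = solve_mu(lam, phi, d)
    denom = 1.0 - phi * tr_bar(d**2 / (d + mu) ** 2)
    v_tilde = phi * tr_bar(d0 * d / (d + mu) ** 2) / denom
    B = mu**2 * np.sum(b**2 * (v_tilde * d + d0) / (d + mu) ** 2)   # bias
    V = sigma2 * v_tilde                                            # variance
    S = 2 * mu * np.sum(b * d0 * (b0 - b) / (d + mu))               # regression-shift bias
    kappa2 = np.sum(d0 * (b0 - b) ** 2) + sigma2_0                  # irreducible error
    return B + V + S + kappa2

# Example mirroring the Figure 1 data model: AR(1) features, signal split between
# the extreme eigenvectors, in-distribution test point, phi = 10, high SNR.
p, phi, sigma2 = 400, 10.0, 0.01
idx = np.arange(p)
d = np.linalg.eigvalsh(0.5 ** np.abs(np.subtract.outer(idx, idx)))
b = np.zeros(p); b[0] = b[-1] = 1 / np.sqrt(2)
lams = np.linspace(-2.0, 3.0, 101)                   # kept inside (lambda_min(phi), inf)
risks = [risk_equiv(l, phi, d, d, b, b, sigma2, sigma2) for l in lams]
print("argmin over the grid:", lams[int(np.argmin(risks))])   # comes out negative here
```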
It is worth mentioning that although we state our results for general deterministic signals, our analysis can also incorporate random signals. In such cases, when β is random, one can simply replace B in the conclusion with its expectation E[B]. Next, we highlight some special cases of Theorem 3.1 and compare them with previously known results. When Σ = I or E[B] = (α2/p)I, it is easy to verify that the general alignment condition (8) does not hold (see Remark D.1). This corresponds to the special cases studied by Dicker (2016); Dobriban & Wager (2018), where λ 0. The general alignment condition (8) in Theorem 3.1 encompasses the strict alignment conditions in Wu & Xu (2020); Richards et al. (2021). Under strict alignment conditions, Wu & Xu (2020) demonstrate that the optimal ridge penalty is negative in the overparameterized and noiseless setting. 2Over the extended reals, there is at least one solution to (7). In case there are multiple solutions λ to the problem (7), the subsequent guarantees stated in the paper hold for any solution λ . Optimal Ridge Regularization for Out-of-Distribution Prediction 4 2 0 2 4 Ridge penalty Prediction risk 4 2 0 2 4 Ridge penalty Aspect ratio 2 5 10 Lower bound naive min Figure 1: Illustration of negative or positive optimal regularization under general alignment. We plot the in-distribution risk of ridge regression against the penalty λ for varying data aspect ratios ϕ in the overparameterized regime. The left and right panels correspond to scenarios when SNR is high (σ2 = 0.01) and low (σ2 = 1), respectively. The data model has a covariance matrix (Σar1)ij := ρ|i j| ar1 with parameter ρar1 = 0.5, and a coefficient β := 1 2(w(1) + w(p)), where w(j) is the jth eigenvector of Σar1. They assume perfect alignment or misalignment between the signal distribution and the spectrum distribution of the covariance. This also includes the strong and weak features models considered in Richards et al. (2021). When the signal is strictly aligned with the spectrum of Σ and σ2 = 0, it can be shown that (8) holds for all µ > 0 (see Proposition D.2). The general alignment condition (8) allows for a broader range of signal and covariance structures. For example, in scenarios where the signal is the average of the largest and smallest eigenvectors of Σ, the strict alignment condition does not hold. However, these scenarios can still satisfy the general alignment conditions, as we will illustrate shortly. In the noiseless setting, when σ2 = 0, the alignment condition (8) can be expressed succinctly as: µ < h(µ, Σ) by defining the function h( , Σβ): µ 7 log tr[BΣ(Σ + µI) 2] and Σβ = Σββ . At a high level, these alignment conditions capture how aligned the signal vector is with the feature covariance matrix. When the alignment is strong, it indicates that the problem is effectively low-dimensional. In such cases, less regularization is needed if the signal energy in this effective direction is sufficiently large. In the noisy setting, when σ2 = 0, the condition (8) explicitly trades off the alignment of the signal and the noise level. While Wu & Xu (2020, Proposition 5) and Richards et al. (2021, Corollary 2) provide upper bounds on σ2 for optimal negative regularization under restricted data models, (8) applies to a wider class of data models. In this sense, Theorem 3.1 extends the previous results on the occurrence of optimal negative regularization to a more general setting. In Figure 1, we illustrate our theoretical result of Theorem 3.1. 
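The same sign behavior can also be checked directly in finite samples. The Monte Carlo sketch below mimics the Figure 1 data model (AR(1) features with ρ = 0.5 and a signal split between the extreme eigenvectors); the sample sizes, seed, and penalty grid are arbitrary illustrative choices, and the conditional risk is computed exactly because the simulation knows (Σ, β, σ²).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 500                      # phi = p/n = 10 (overparameterized)
sigma = 0.1                         # sigma^2 = 0.01: high SNR

# AR(1) covariance with rho = 0.5; signal split between the extreme eigenvectors.
idx = np.arange(p)
Sigma = 0.5 ** np.abs(np.subtract.outer(idx, idx))
evals, evecs = np.linalg.eigh(Sigma)
beta = (evecs[:, 0] + evecs[:, -1]) / np.sqrt(2)

X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T
y = X @ beta + sigma * rng.standard_normal(n)

def cond_risk(bhat):
    """Exact in-distribution conditional prediction risk (3) for this linear model."""
    diff = bhat - beta
    return diff @ Sigma @ diff + sigma**2

# Ridge path via the n x n system: for lam on this grid, X X'/n + lam I_n is
# invertible and (X'X/n + lam I_p)^+ X'y/n = X'(X X'/n + lam I_n)^{-1} y/n.
lams = np.linspace(-1.5, 3.0, 91)
risks = [cond_risk(X.T @ np.linalg.solve(X @ X.T / n + lam * np.eye(n), y) / n)
         for lam in lams]
print("best penalty on the grid:", lams[int(np.argmin(risks))])  # typically negative here
```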
When SNR is high, we observe that the optimal ridge penalty can be negative in the overparameterized regime. In particular, for ϕ = 10, the negative optimal ridge penalty exceeds the bound used in Wu & Xu (2020). It is worth noting that the scenario depicted in Figure 1 does not satisfy the strict alignment condition, but is incorporated by our characterization in Theorem 3.1. 3.2. Out-of-Distribution Optimal Regularization We now investigate the behavior of the optimal ridge penalty under covariate shift or regression shift. In the IND setting, when Σ0 = Σ and the signals are isotropic (ββ (α2/p)I), previous works (Dobriban & Wager, 2018; Han & Xu, 2023) show that the optimal ridge penalty is λ = ϕ/SNR 0. Interestingly, even when allowing for negative regularization and covariate shift (Σ0 = Σ), it is easy to check that the optimal λ remains positive in data models with isotropic signals for any ϕ (0, ). Proposition 3.2 (Optimal regularization under covariate shift and random signal). When Σ0 = Σ and β0 = β, assuming isotropic signals E[ββ ] = (α2/p)I, we have λ (ϕ) = ϕ/SNR = argminλ Eβ[R(λ, ϕ)]. Furthermore, for λmin < 0 such that X X/n λmin I, even the nonasymptotic OOD risk (3) is minimized at λ p = ϕp/SNR, where ϕp = p/n. We remark that Proposition 3.2 can also be seen as a result of a Bayes optimality argument. However, we provide a more direct argument in Appendix D.3. A result of this flavor for random features is also obtained by Tripuraneni et al. (2021, Proposition 6.1). An interesting observation from Proposition 3.2 is that the optimal ridge penalty does not depend on Σ0. This implies that when the signal is isotropic, one does not need to worry about the covariate shift when tuning the penalty. Therefore, generalized cross-validation for IND risks (Patil et al., 2021) still yields optimal penalties even for OOD risks under isotropic signals! However, it is important to note that this result holds specifically for random isotropic signals, which may not be realistic in practice. In scenarios where (near) random isotropic signals are not present, the question of data-dependent regularization tuning in the OOD setting is not as straightforward. We do not consider data-dependent tuning in the current paper and instead focus on the theoretical properties of the (oracle) optimal regularization (and the corresponding risk). Recently, Wang (2023) propose a method for data-dependent tuning of ridge regression under covariate shift by generating pseudo-labels with an undersmoothed ridge regression on the training data. While the method has adaptivity guarantees in the low-dimensional regime, its consistency in the proportional asymptotics regime is not clear yet, and is an interesting direction for future investigation. Proposition 3.2 suggests that the optimal penalty λ remains invariant for OOD risks under isotropic signals. Although random isotropic signals make the theory more tractable, they generally are not realistic in practice. The following Optimal Ridge Regularization for Out-of-Distribution Prediction 0.04 0.02 0.00 0.02 0.04 Ridge penalty Prediction risk Covariate shift 0.04 0.02 0.00 0.02 0.04 Ridge penalty Regression shift Type in-distribution out-of-distribution Figure 2: Covariate and regression shift can lead to negative optimal regularization in both underparameterized and overparameterized regimes. The plot shows the IND and OOD risks against λ in the high SNR setting (σ2 = 0.01 and σ2 0 = 0). 
The left panel shows the overparameterized regime (ϕ = 1.5) where the optimal ridge penalty λ is negative under covariate shift, when Σ = I, Σ0 = Σar1, and β = β0 = 1 2(w(1) + w(p)). The right panel shows the underparameterized regime (ϕ = 0.5) where the optimal ridge penalty λ is negative under regression shift, when Σ = Σ0 = Σar1, β = 1 2(w(1) + w(p)), and β0 = 2β. result examines the optimal ridge penalty under deterministic signals, with similar conclusions holding for random anisotropic signals. Theorem 3.3 (Optimal regularization under covariate shift and deterministic signal). Assume the setting of Proposition 2.4 with Σ0 = Σ, β0 = β. 1. (Underparameterized) When ϕ < 1, we have λ 0. 2. (Overparameterized) When ϕ > 1, if Σ0 = I (corresponding to the estimation risk), then we have λ 0. 3. (Overparameterized) When ϕ > 1, if Σ = I and tr[Σ0B] > tr[Σ0] tr[B] + (1 + µ(0, ϕ))3 µ(0, ϕ)3 σ2 , (10) where B = ββ , then we have λ < 0. When ϕ < 1, Part (1) of Theorem 3.3 suggests that the optimal ridge penalty λ remains non-negative even under covariate shift. Similarly, when ϕ > 1, Part (2) of Theorem 3.3 (for the estimation risk) also guarantees a non-negative λ . However, it is quite surprising that in the overparameterized regime, even with deterministic features, the optimal ridge penalty λ could be negative in Part (3). In particular, in noiseless setting (σ2 = 0), the condition (10) reduces to the strict alignment condition on (Σ0, β); see Proposition D.2. While Theorem 3.3 restricts Σ = I to simplify the condition when ϕ > 1, we provide the condition with general Σ in Equation (39). However, taking Σ = I suffices to highlight this rather surprising sign-reversal phenomenon. Theorem 3.4 (Optimal regularization under regression shift). Assume the setup of Proposition 2.4 with Σ0 = Σ, β0 = β. 1. (Underparameterized) When ϕ < 1, if σ2 = o(1) and for all µ 0, the following general alignment holds: tr[B0Σ2(Σ + µI) 2] > tr[BΣ2(Σ + µI) 2], (11) Table 2: Optimal ridge penalty for OOD risks on MNIST gradually becomes more negative with increasing distribution shift. We gradually shift the test distribution by excluding samples with specific labels. For more details, please refer to Appendix G.1.2. Excl. labels: {4} {3, 4} {2, 3, 4} {1, 2, 3, 4} λ 1.03 0.48 0.00 -0.48 1.44 where B = ββ and B0 = β0β , then we have λ < 0. 2. (Overparameterized) When ϕ > 1, if the general alignment conditions (8) and (11) hold, then we have λ < 0. Similarly to Part (3) of Theorem 3.3, Part (1) of Theorem 3.4 is rather surprising. As shown in Table 1, λ is always positive for the IND setting when ϕ < 1. However, Theorem 3.4 suggests that λ can be negative, even for isotropic signals, when there is some alignment or misalignment between β Σ and (β β0) Σ. In fact, when Σ = Σ0 = I and β, β0 β 2 2, the alignment condition β Σ2(Σ+µI) 2(β0 β) 0 in Theorem 3.4 always holds for all µ > 0. It is worth noting that we assume σ2 = o(1) in Theorem 3.4 for simplicity, but a more general balance condition that holds for any σ2 > 0 is provided in (39). The numerical illustrations in Figure 2 demonstrate the results of Theorems 3.3 and 3.4. As shown, while the optimal ridge penalties λ for the IND prediction risks are positive, the OOD prediction risk can be negative and approach its lower limit. Similar observations also occur in real-world MNIST datasets (see Table 2 and Appendix G.1.2 for the experimental details). This phenomenon arises due to distribution shift, which effectively aligns β0 and Σ0. 
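A finite-sample version of the regression-shift scenario in Figure 2 (right panel) can be simulated along the following lines; the specific sample sizes, seed, and penalty grid are our own illustrative choices, and the realized argmin is random, so it only tends to reproduce the negative sign predicted by Theorem 3.4.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 400, 200                     # phi = 0.5 (underparameterized)
sigma, sigma0 = 0.1, 0.0            # sigma^2 = 0.01 on train, noiseless test responses

idx = np.arange(p)
Sigma = 0.5 ** np.abs(np.subtract.outer(idx, idx))   # Sigma = Sigma0: no covariate shift
evals, evecs = np.linalg.eigh(Sigma)
beta = (evecs[:, 0] + evecs[:, -1]) / np.sqrt(2)
beta0 = 2.0 * beta                                   # regression shift: beta0 = 2 * beta

X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T
y = X @ beta + sigma * rng.standard_normal(n)

def ood_risk(bhat):
    """Exact conditional OOD risk (3) when the test point follows (Sigma, beta0, sigma0)."""
    diff = beta0 - bhat
    return diff @ Sigma @ diff + sigma0**2

lams = np.linspace(-0.03, 0.03, 61)
risks = [ood_risk(np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n))
         for lam in lams]
print("best penalty for the shifted test distribution:", lams[int(np.argmin(risks))])
# Tends to be negative even though phi < 1, mirroring Figure 2 (right).
```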
In some cases, it provides a possible explanation for the success of interpolators in practice, as the optimal ridge penalty λ can become negative under distribution shift, e.g., with random features regression (Tripuraneni et al., 2021). Intuitively, negative regularization may be optimal in certain cases due to the implicit bias of the overparameterized ridge estimator, even when λ = 0. When the signal energy is sufficiently high, it can be beneficial to subtract some of this bias at the expense of increased variance. Negative regularization effectively reduces this bias and can, therefore, be the optimal choice. In a broader context, when there is implicit regularization, such as self-regularization resulting from the data structure, and we desire the overall regularization to be smaller than this inherent amount, negative external regularization can help counterbalance it. 4. Properties of Optimal Risk Under the general covariance structure Σ, Theorem 6 in Patil & Du (2023) shows that optimal ridge regression exhibits a monotonic risk profile in the IND setting (Σ0 = Σ) and effectively avoids the phenomena of double and multiple descents. This observation motivates a broader investigation Optimal Ridge Regularization for Out-of-Distribution Prediction into the monotonic behavior of regularization optimization in the out-of-distribution setting. In this section, we investigate the monotonicity of the optimal OOD risk and converse for the suboptimal OOD risk. 4.1. Optimal Risk Monotonicity To begin with, we examine the case of isotropic signals under covariate shift. A direct consequence of Proposition 3.2 is the monotonicity property of the risk in this scenario. Proposition 4.1 (Optimal risk under isotropic signals). When Σ0 = Σ and β = β0, assuming isotropic signals E[ββ ] = (α2/p)I the optimal risk obtained at λ (ϕ) = ϕ/SNR is given by: Eβ[R(λ , ϕ)] = α2µ tr[Σ0(Σ + µ I) 1] + σ2 0, (12) where µ = µ(λ , ϕ). Furthermore, the left side of (12) is strictly increasing in ϕ if SNR (0, ) and σ2 0 are fixed and strictly increasing in SNR if ϕ, σ2, and σ2 0 are fixed. Proposition 4.1 shows that the optimal OOD risk is a monotonic function of ϕ and SNR. This is intuitive because one would expect that having more data (smaller ϕ) or larger SNR would result in a lower prediction risk. In contrast, the ridge or ridgeless predictor computed on the full data does not exhibit this property (Hastie et al., 2022, Figure 2). The optimal penalty λ is also monotonically increasing in the data aspect ratio ϕ when SNR is kept fixed. However, under anisotropic signals, the optimal regularization penalty generally depends on the specific OOD risk being considered. In such cases, it is difficult to obtain analytical formulas for the optimal ridge penalty and the optimal risk. Nonetheless, we can still show that the optimal ODD risk monotonically increases in ϕ and SNR. We formalize this in the following result, which generalizes the aforementioned IND result in Patil & Du (2023, Theorem 6) to the OOD setting, allowing risk optimization over the possible range of negative regularization (with the lower limit as given in Definition 2.3). Theorem 4.2 (Monotonicity of optimally tuned OOD risk). For λ λmin(ϕ) where λmin(ϕ) is as in (5), for all ϵ > 0 small enough, the risk of optimal ridge predictor satisfies: min λ λmin(ϕ)+ϵ R(bβλ) min λ λmin(ϕ) R(λ, ϕ), (13) and right side of (13) is monotonically increasing in ϕ if SNR and σ2 0 are fixed. 
In addition, when β = β0 it is monotonically increasing in SNR if ϕ, σ2, and σ2 0 are fixed. We find the monotonicity of the optimal risk in ϕ remarkable because it even holds under arbitrary covariate and regression shift! The monotonicity in SNR under regression -3 0 0.1 1 10 Ridge penalty Data aspect ratio 0.1 1.0 10.0 Data aspect ratio = min = 0 = 1.0 Figure 3: Ridge regression optimized over λ ν for different thresholds ν has monotonic risk profile. We showcase the prediction risk of optimal ridge regression under the same data model as in Figure 1, with σ2 = 0.01. The left panel shows the heatmap of the risks R(λ, ϕ) of ridge regression for different ridge penalties λ and data aspect ratios ϕ. The lines indicate the optimized ridge risks minλ ν R(λ, ϕ) at different thresholds ν. The right panel shows the optimized risk minλ ν R(λ, ϕ) as a function of ϕ. shift can be similarly analyzed but requires fixing more parameters. When considering the IND prediction risk, Patil & Du (2023, Theorem 6) demonstrate that the optimally tuned ridge over λ (0, ) exhibits a monotonic risk profile in the data aspect ratio ϕ. It is somewhat surprising that optimizing over both (0, ) and (λmin(ϕ), ) yields monotonic behavior, as numerically verified in Figure 3. We also illustrate the optimal risk monotonicity behavior on MNIST in Figure G12. The minimum risk over (λmin(ϕ), ) can be significantly lower than the one over (0, ), particularly in the overparameterized regime. Although most software packages only consider positive regularization for tuning, Figure 3 suggests that allowing negative regularization can lead to significant improvements. Finally, we consider the converse question: is optimal regularization necessary for achieving monotonic risk? When considering the excess IND prediction risk under isotropic design (i.e., Σ = I), the following theorem demonstrates that the prediction risk of ridge regression is generally nonmonotonic in the data aspect ratio ϕ (0, ). Theorem 4.3 (Non-monotonicity of suboptimally tuned risk). When (Σ0, β0) = (Σ, β) and Σ = I, the risk component equivalents defined in (6) have the following properties: 1. (Bias component) For all λ > 0, B(λ, ϕ) is strictly increasing over ϕ (0, λ + 1) and strictly decreasing over ϕ (λ + 1, ). 2. (Variance component) For all λ > 0, V(λ, ϕ) is strictly increasing over ϕ (0, ). 3. (Risk) When β 2 2 > 0, for all λ > 0 and ϵ > 0, there exist σ2, ϕ (0, ), such that R(λ, ϕ)/ ϕ ϵ, i.e., max σ2,ϕ (0, ) min λ λmin(ϕ) R(λ, ϕ)/ ϕ ϵ. (14) When σ2 is small, Theorem 4.3 implies that the risk at any fixed λ is non-monotonic, even when Σ = I. It also explains the lack of risk monotonicity in ridgeless regression. This Optimal Ridge Regularization for Out-of-Distribution Prediction result extends known results on non-monotonic behavior of variance of ridgeless regression (Yang et al., 2020). 4.2. Connection to Subsampling and Ensembling We now briefly discuss the connection between subsampling and ridge regularization (Patil & Du, 2023), used to prove the OOD risk monotonicity results in Section 4. To incorporate negative regularization and OOD risks, we extend these equivalences using tools from Le Jeune et al. (2024). Before presenting the extended equivalence results, we introduce several quantities related to the subsampled ridge ensembles. For an index set I [n] of size k, let LI Rn n be a diagonal matrix with i-th diagonal 1 if i I and 0 otherwise. 
The feature matrix and the response vector associated with a subsampled dataset {(xi, yi) : i I} are LIX and LIy, respectively. Given a ridge penalty λ, let bβλ k (I) denote the ridge estimator fitted on the subsample (LIX, LIy), consisting of k samples. When we aggregate the estimators fitted on all subsampled datasets of size k, we obtain the so-called full-ensemble estimators bβλ k, , which is almost surely E[bβλ k (I) | X, y], if we draw I independently from the set of index sets of size k. As k, n, p such that p/n ϕ (0, ) and p/k ψ [ϕ, ], Lemma E.1 implies the OOD risk equivalence: R(bβλ k, ) R(λ, ϕ, ψ) as in (46). When ψ = ϕ, the equivalent R(λ, ϕ, ϕ) reduces to R(λ, ϕ) defined in (6) and analyzed in previous sections. We are now ready to present our main result in this section. Theorem 4.4 (Optimal ensemble versus ridge regression under negative regularization). Let R := minψ ϕ,λ λmin(ϕ) R(λ, ϕ, ψ). Then the following statements hold: 1. (Underparameterized) When ϕ < 1 and β0 = β, λ 0, R = min λ 0 R(λ, ϕ, ϕ) = min ψ ϕ R(0; ϕ, ψ). 2. (Overparameterized) When ϕ 1, λ λmin(ϕ), R = min λ λmin(ϕ) R(λ, ϕ, ϕ) = min ψ ϕ R(λmin(ϕ); ϕ, ψ). Theorem 4.4 establishes the OOD risk equivalences between ridge predictors and full-ensemble ridgeless predictors. It shows that in the underparameterized regime, ridgeless ensembles are sufficient to achieve optimal IND risk, which is also supported by Du et al. (2023) when considering optimization over λ 0. However, in the overparameterized regime, ridgeless ensembles alone are not enough to achieve optimal risk over λ λmin(ϕ). Explicit negative regularization is required to obtain the optimal risk. In other words, the implicit regularization provided by subsampling is always positive, and under certain data geometries, using λ < 0 can improve predictive performance. 0.5 1 10 Subsample aspect ratio Ridge penalty Subsample aspect ratio Figure 4: Negative regularization can help achieve optimal risk in both underparameterized and overparameterized regimes. The heatmap illustrates the prediction risks for ridge regression as a function of the ridge penalty λ and subsample aspect ratio ψ in the full ensemble. We use the same data model as Figure 3 with σ2 = 0.01. The left and right panels show the underparameterized (ϕ = 0.5) and overparameterized regimes (ϕ = 2), respectively. The red paths represent the optimal risks, while the blue and green stars indicate the optimal ridge predictor and the optimal fullensemble ridge with the largest subsample aspect ratio. The proof of Theorem 4.4 establishes the risk equivalence between the ridge predictor and the full-ensemble ridgeless predictor. As shown in Figure 4, the risk profile of the full-ensemble ridge predictors demonstrates that negative regularization can help achieve optimal risk, particularly in the overparameterized regime when the subsample aspect ratio ψ is greater than 1. 5. Discussion This paper investigates the optimal regularization and the optimal risk of ridge regression in the OOD setting. The analysis shows the differences between the IND and OOD settings. However, in both cases, the optimal risk is monotonic in the data aspect ratio. The main takeaway is that negative regularization can improve predictive performance in certain data geometries, even more so than the IND setting. We close the paper by mentioning two future directions. The scope of this paper is limited to standard ridge regression. 
Under proportional asymptotics, similar conclusions are expected to hold for kernel ridge regression, as the gram matrix linearizes in the sense of asymptotic equivalence (Liang & Rakhlin, 2020; Bartlett et al., 2021; Sahraee-Ardakan et al., 2022; Misiakiewicz & Montanari, 2023). Beyond ridge variants, it is of interest to perform a similar analysis for the lasso. Preliminary investigations in the literature suggest that, similar to optimal ridge regression, the optimal lasso also exhibits monotonic risk in the overparameterization ratio (Patil et al., 2022). For a comparative illustration, see Figures G13 and G14. It is also of interest to understand the monotonicity properties of the optimal risk for other convex penalized M-estimators in general (Thrampoulidis et al., 2018). Empirical investigations indicate that these estimators also exhibit monotonic risk in the data aspect ratio under optimal regularization (Patil & Du, 2023). Making these observations rigorous is a promising future direction. Optimal Ridge Regularization for Out-of-Distribution Prediction Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. Software and Data The source code for generating all of our figures is included in the supplementary material in the folder named code , along with details about the computational resources used and other relevant information. The file named README.md lists the organizational structure. Acknowledgments We thank Edgar Dobriban, Avi Feller, Qiyang Han, Dmitry Kobak, Arun Kuchibhotla, Daniel Le Jeune, Jaouad Mourtada, Alessandro Rinaldo, Alexander Wei, and Yuting Wei for helpful conversations surrounding this work. We thank the anonymous reviewers for their valuable feedback and suggestions that have improved the paper (in particular, the addition of Appendix B). We also thank the computing support provided by the ACCESS allocation MTH230020 for some of the experiments performed on the Bridges2 system at the Pittsburgh Supercomputing Center. PP and RJT were supported by ONR grant N00014-20-1-2787 during the course of this work. Bach, F. High-dimensional analysis of double descent for linear regression with random projections. SIAM Journal on Mathematics of Data Science, 6(1):26 50, 2024. Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063 30070, 2020. Bartlett, P. L., Montanari, A., and Rakhlin, A. Deep learning: A statistical viewpoint. Acta Numerica, 30:87 201, 2021. Belkin, M., Hsu, D., and Xu, J. Two models of double descent for weak features. SIAM Journal on Mathematics of Data Science, 2(4):1167 1180, 2020. Bellec, P., Du, J.-H., Koriyama, T., Patil, P., and Tan, K. Corrected generalized cross-validation for finite ensembles of penalized estimators. ar Xiv preprint ar Xiv:2310.01374, 2023. Bj orkstr om, A. and Sundberg, R. A generalized view on continuum regression. Scandinavian Journal of Statistics, 26(1):17 30, 1999. Buja, A., Brown, L., Berk, R., George, E., Pitkin, E., Traskin, M., Zhang, K., and Zhao, L. Models as approximations I. Statistical Science, 34(4):523 544, 2019a. Buja, A., Brown, L., Kuchibhotla, A. K., Berk, R., George, E., and Zhao, L. Models as approximations II. Statistical Science, 34(4):545 565, 2019b. Canatar, A., Bordelon, B., and Pehlevan, C. 
Out-ofdistribution generalization in kernel regression. Advances in Neural Information Processing Systems, 2021. Cheng, C. and Montanari, A. Dimension free ridge regression. ar Xiv preprint ar Xiv:2210.08571, 2022. D Amour, A., Heller, K., Moldovan, D., Adlam, B., Alipanahi, B., Beutel, A., Chen, C., Deaton, J., Eisenstein, J., Hoffman, M. D., et al. Underspecification presents challenges for credibility in modern machine learning. The Journal of Machine Learning Research, 23(1):10237 10297, 2022. Davidson, J. Stochastic Limit Theory: An Introduction for Econometricians. Oxford University Press, 1994. Dicker, L. H. Ridge regression and asymptotic minimax estimation over spheres of growing dimension. Bernoulli, 22(1):1 37, 2016. Dobriban, E. Efficient computation of limit spectra of sample covariance matrices. Random Matrices: Theory and Applications, 4(4), 2015. Dobriban, E. and Wager, S. High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247 279, 2018. Du, J.-H., Patil, P., and Kuchibhotla, A. K. Subsample ridge ensembles: Equivalences and generalized crossvalidation. In International Conference on Machine Learning, 2023. Fink, A. M. and Jodeit Jr., M. On Chebyshev s other inequality. Lecture Notes-Monograph Series, pp. 115 120, 1984. Gy orfi, L., Kohler, M., Krzyzak, A., and Walk, H. A Distribution-free Theory of Nonparametric Regression. Springer, 2006. Han, Q. and Xu, X. The distribution of ridgeless least squares interpolators. ar Xiv preprint ar Xiv:2307.02044, 2023. Optimal Ridge Regularization for Out-of-Distribution Prediction Hastie, T. Ridge regularization: An essential concept in data science. Technometrics, 62(4):426 433, 2020. Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2):949 986, 2022. Hoerl, A. E. and Kennard, R. W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55 67, 1970a. Hoerl, A. E. and Kennard, R. W. Ridge regression: Applications to nonorthogonal problems. Technometrics, 12(1): 69 82, 1970b. Hua, T. A. and Gunst, R. F. Generalized ridge regression: a note on negative ridge parameters. Communications in Statistics Theory and Methods, 12(1):37 45, 1983. Jacot, A., Simsek, B., Spadaro, F., Hongler, C., and Gabriel, F. Kernel alignment risk estimator: Risk prediction from training data. Advances in Neural Information Processing Systems, 2020. Knowles, A. and Yin, J. Anisotropic local laws for random matrices. Probability Theory and Related Fields, 169: 257 352, 2017. Kobak, D., Lomond, J., and Sanchez, B. The optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the implicit ridge regularization. Journal of Machine Learning Research, 21, 2020. Koehler, F., Zhou, L., Sutherland, D. J., and Srebro, N. Uniform convergence of interpolators: Gaussian width, norm bounds and benign overfitting. Advances in Neural Information Processing Systems, 2021. Le Jeune, D., Patil, P., Javadi, H., Baraniuk, R. G., and Tibshirani, R. J. Asymptotics of the sketched pseudoinverse. SIAM Journal on Mathematics of Data Science, 2024. Liang, T. and Rakhlin, A. Just interpolate: Kernel ridgeless regression can generalize. The Annals of Statistics, 2020. Louart, C. Sharp bounds for the concentration of the resolvent in convex concentration settings. ar Xiv preprint ar Xiv:2201.00284, 2022. 
Loureiro, B., Gerbelot, C., Cui, H., Goldt, S., Krzakala, F., Mezard, M., and Zdeborova, L. Learning curves of generic features maps for realistic datasets with a teacherstudent model. In Advances in Neural Information Processing Systems, 2021. Mallinar, N., Simon, J. B., Abedsoltan, A., Pandit, P., Belkin, M., and Nakkiran, P. Benign, tempered, or catastrophic: A taxonomy of overfitting. ar Xiv:2207.06569, 2022. Mel, G. and Ganguli, S. A theory of high dimensional regression with arbitrary correlations between input features and target functions: Sample complexity, multiple descent curves and a hierarchy of phase transitions. In International Conference on Machine Learning, 2021. Misiakiewicz, T. and Montanari, A. Six lectures on linearized neural networks. ar Xiv preprint ar Xiv:2308.13431, 2023. Muthukumar, V., Vodrahalli, K., Subramanian, V., and Sahai, A. Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory, 1 (1):67 83, 2020. Nakkiran, P., Venkat, P., Kakade, S. M., and Ma, T. Optimal regularization can mitigate double descent. In International Conference on Learning Representations, 2021. Patil, P. and Du, J.-H. Generalized equivalences between subsampling and ridge regularization. In Conference on Neural Information Processing Systems, 2023. Patil, P. and Le Jeune, D. Asymptotically free sketched ridge ensembles: Risks, cross-validation, and tuning. In International Conference on Learning Representations, 2024. Patil, P., Wei, Y., Rinaldo, A., and Tibshirani, R. J. Uniform consistency of cross-validation estimators for highdimensional ridge regression. In International Conference on Artificial Intelligence and Statistics, 2021. Patil, P., Kuchibhotla, A. K., Wei, Y., and Rinaldo, A. Mitigating multiple descents: A model-agnostic framework for risk monotonization. ar Xiv preprint ar Xiv:2205.12937, 2022. Patil, P., Du, J.-H., and Kuchibhotla, A. K. Bagging in overparameterized learning: Risk characterization and risk monotonization. Journal of Machine Learning Research, 24(319):1 113, 2023. Pesce, L., Krzakala, F., Loureiro, B., and Stephan, L. Are Gaussian data all you need? Extents and limits of universality in high-dimensional generalized linear estimation. ar Xiv:2302.08923, 2023. Richards, D., Mourtada, J., and Rosasco, L. Asymptotics of ridge (less) regression under general source condition. In International Conference on Artificial Intelligence and Statistics, 2021. Optimal Ridge Regularization for Out-of-Distribution Prediction Ross, S. M. Simulation. Academic Press, 2022. Sahraee-Ardakan, M., Emami, M., Pandit, P., Rangan, S., and Fletcher, A. K. Kernel methods and multi-layer perceptrons learn linear models in high dimensions. ar Xiv preprint ar Xiv:2201.08082, 2022. Silverstein, J. W. and Choi, S. I. Analysis of the limiting spectral distribution of large dimensional random matrices. Journal of Multivariate Analysis, 54(2):295 309, 1995. Simon, J. B., Dickens, M., Karkada, D., and Deweese, M. The eigenlearning framework: A conservation law perspective on kernel ridge regression and wide neural networks. arxiv:2110.03922, 2021. Simon, J. B., Karkada, D., Ghosh, N., and Belkin, M. More is better in modern machine learning: When infinite overparameterization is optimal and overfitting is obligatory, 2023. Sollich, P. Gaussian process regression with mismatched models. Advances in Neural Information Processing Systems, 2001. Thrampoulidis, C., Abbasi, E., and Hassibi, B. 
Precise error analysis of regularized M-estimators in high dimensions. IEEE Transactions on Information Theory, 64(8):5592 5628, 2018. Tripuraneni, N., Adlam, B., and Pennington, J. Covariate shift in high-dimensional random feature regression. ar Xiv preprint ar Xiv:2111.08234, 2021. Wang, K. Pseudo-labeling for kernel ridge regression under covariate shift. ar Xiv preprint ar Xiv:2302.10160, 2023. Wei, A., Hu, W., and Steinhardt, J. More than a toy: Random matrix models predict how real-world neural representations generalize. ar Xiv preprint ar Xiv:2203.06176, 2022. Wu, D. and Xu, J. On the optimal weighted ℓ2 regularization in overparameterized linear regression. Advances in Neural Information Processing Systems, 33:10112 10123, 2020. Yang, T.-L. and Suzuki, J. Dropout drops double descent. ar Xiv preprint ar Xiv:2305.16179, 2023. Yang, Z., Yu, Y., You, C., Steinhardt, J., and Ma, Y. Rethinking bias-variance trade-off for generalization of neural networks. In International Conference on Machine Learning, pp. 10767 10777, 2020. Zhou, L., Koehler, F., Sutherland, D. J., and Srebro, N. Optimistic rates: A unifying theory for interpolation learning and regularization in linear regression. Journal of Data Science, 1(1), 2023. Optimal Ridge Regularization for Out-of-Distribution Prediction This supplement serves as a companion to the paper titled Optimal Ridge Regularization for Out-of-Distribution Prediction . In the first section, we provide an outline of the supplement in Table 3. We also include a summary of the general and specific notation used in the main paper and the supplement in Tables 4 and 5, respectively. In addition, in Table 6, we provide an explanation of and pointers to the abbreviations used in the references mentioned in Table 1. A. Organization and Notation A.1. Organization Table 3: Outline of the supplement. 
Overview of supplement (Appendix A):
- Appendix A.1: Organization
- Appendix A.2: General notation
- Appendix A.3: Specific notation
- Appendix A.4: Reference key

Further details in Section 1 (Appendix B):
- Appendix B.1: Background details on ridge regression with negative regularization

Proofs in Section 2 (Appendix C):
- Appendix C.1: Proof of Proposition 2.4 (out-of-distribution risk asymptotics)

Proofs in Section 3 (Appendix D):
- Appendix D.1: Proof of Theorem 3.1 (sign of optimal regularization for IND)
- Appendix D.2: Special cases of the general alignment condition in Theorem 3.1
- Appendix D.3: Proof of Proposition 3.2 (optimal regularization with isotropic signals)
- Appendix D.4: Proof of Theorem 3.3 (sign of optimal regularization with covariate shift)
- Appendix D.5: Proof of Theorem 3.4 (sign of optimal regularization with regression shift)
- Appendix D.6: Helper lemmas (derivatives of components of risk deterministic approximation)

Proofs in Section 4 (Appendix E):
- Appendix E.1: Proof of Proposition 4.1 (optimal risk under isotropic signals)
- Appendix E.2: Proof of Theorem 4.2 (risk monotonicity with optimal regularization)
- Appendix E.3: Proof of Theorem 4.3 (risk non-monotonicity with suboptimal regularization)
- Appendix E.4: Proof of Theorem 4.4 (optimal subsample versus ridge with negative regularization)
- Appendix E.5: Helper lemmas (optimal risk monotonicities, full-ensemble OOD risk equivalences)

Some technical lemmas (Appendix F):
- Appendix F.1: Properties of minimum limiting negative ridge penalty
- Appendix F.2: Analytic properties of certain fixed-point solutions under negative regularization
- Appendix F.3: Contour of fixed-point solutions under negative regularization

Additional experiments (Appendix G):
- Appendix G.1: Additional illustrations for Section 3
- Appendix G.2: Additional illustrations for Section 4
- Appendix G.3: Additional illustrations for Section 5

A.2. General Notation

Table 4: A summary of general notation used in the paper and the supplement.

- Non-bold lower case: denotes vectors (e.g., a, b, c)
- Non-bold upper case: denotes matrices (e.g., A, B, C)
- Calligraphic font: denotes sets (e.g., A, B, C)
- Script font: denotes certain limiting functions (e.g., A, B, C)
- N: set of natural numbers
- (a, b, c): (ordered) tuple of elements a, b, c
- {a, b, c}: set of elements a, b, c
- [n]: set {1, . . . , n} for a natural number n
- R, R≥0: set of real and non-negative real numbers
- C, C+, C−: set of complex numbers and the upper and lower complex half-planes
- tr[A], tr̄[A], det(A): trace, average trace (tr[A]/p), and determinant of a square matrix A ∈ R^{p×p}
- B^{-1}: inverse of an invertible square matrix B ∈ R^{p×p}
- C†: Moore-Penrose inverse of a general rectangular matrix C ∈ R^{n×p}
- rank(C), null(C): rank and nullity of a general rectangular matrix C ∈ R^{n×p}
- D^{1/2}: principal square root of a positive semidefinite matrix D
- f(D): matrix obtained by applying a function f : R → R to a positive semidefinite matrix D
- I, 1, 0: the identity matrix, the all-ones vector, the all-zeros vector
- ‖u‖_q: the ℓq norm of a vector u for q ≥ 1
- ‖u‖_A: the induced ℓ2 norm of a vector u with respect to a positive semidefinite matrix A
- ‖f‖_{Lq}: the Lq norm of a function f for q ≥ 1
- ‖A‖_op: operator (or spectral) norm of a real matrix A
- ‖A‖_tr: trace (or nuclear) norm of a real matrix A
- ‖A‖_F: Frobenius norm of a real matrix A
- X = O_υ(Y): |X| ≤ C_υ Y for some constant C_υ that may depend on the ambient parameter υ
- X ≲_υ Y: X ≤ C_υ Y for some constant C_υ that may depend on the ambient parameter υ
- u ⪯ v: lexicographic ordering for vectors u and v
- A ⪯ B: Loewner ordering for symmetric matrices A and B

Asymptotics:
- O_P, o_P: probabilistic big-O and little-o notation
- C ≃ D: asymptotic equivalence of matrices C and D (see Section 2 for more details)
- →_d, →_p, →_a.s.: convergence in distribution, convergence in probability, and almost sure convergence

Some additional conventions used throughout the supplement:
- If a proof of a statement is separated from the statement, the statement is restated (while keeping the original numbering) along with the proof for convenience.
- If no subscript is present for the norm ‖u‖ of a vector u, it is assumed to be the ℓ2 norm of u.
- We use C, C′ to denote positive absolute constants.

A.3. Specific Notation

Table 5: A summary of specific notation used in the paper and the supplement.

In-distribution:
- P_x, P_{y|x}, P_{x,y}: distribution of the train features supported on R^p, conditional distribution of the train response supported on R, and the joint distribution
- Σ: covariance matrix in R^{p×p} of the train feature distribution
- r_min (r_max): minimum (maximum) eigenvalue of the train covariance matrix
- σ^2, α^2: conditional variance and linearized signal energy of the train response distribution
- SNR: in-distribution signal-to-noise ratio α^2/σ^2
- β: coefficients of the (population) linear projection of y onto x
- {(x_i, y_i)}_{i=1}^n: train samples of size n in R^p × R sampled i.i.d.
from P_{x,y}
- X, y: train feature matrix in R^{n×p} and the corresponding response vector in R^n
- β̂_λ: ridge estimator fitted on the train data (X, y) at regularization level λ
- β̂_λ^k(I): ridge estimator fitted on the subsampled data (L_I X, L_I y) at regularization level λ
- β̂_λ^{k,∞}: full-ensemble ridge estimator at regularization level λ

Out-of-distribution:
- P_{x_0}, P_{y_0|x_0}, P_{x_0,y_0}: distribution of the test features supported on R^p, conditional distribution of the test response supported on R, and the joint distribution
- (x_0, y_0): test sample in R^p × R sampled from P_{x_0,y_0} independently of the train data (X, y)
- Σ_0: covariance matrix in R^{p×p} of the test feature distribution
- σ_0^2: variance of the conditional test response distribution
- β_0: coefficients of the L2 linear projection of y_0 onto x_0, namely E[x_0 x_0^⊤]^{-1} E[x_0 y_0]
- R(β̂_λ): squared test prediction risk of the ridge estimator at level λ

Risk parameters:
- ϕ, ψ: limiting data and subsample aspect ratios
- λ*: optimal ridge penalty
- λ_min(ϕ), v_min: minimum value of ridge regularization allowed at ϕ and the associated fixed-point parameter
- v_p(λ, ϕ): a fixed-point parameter
- µ(λ, ϕ): induced amount of implicit regularization at ridge regularization λ and aspect ratio ϕ
- ṽ_p(λ, ϕ; Σ): function of a fixed-point parameter
- R(λ, ϕ), B(λ, ϕ), V(λ, ϕ), S(λ, ϕ): deterministic approximations to the squared risk, the bias, the variance, and the regression shift
- κ^2: inflated out-of-distribution irreducible error
- Σ_β: a certain matrix capturing the alignment of the feature covariance and the signal vector, Σ_β = Σββ^⊤
- R(λ, ϕ, ψ): deterministic approximation to the risk of the full-ensemble estimator

A.4. Reference Abbreviation Key

Table 6: Links to references mentioned in Table 1.
- HMRT: Hastie, Montanari, Rosset, and Tibshirani (2022)
- DW: Dobriban and Wager (2018)
- RMR: Richards, Mourtada, and Rosasco (2021)
- WX: Wu and Xu (2020)
- D: Dicker (2016)

B. Further Details in Section 1

B.1. Background Details on Ridge Regression with Negative Regularization

Recall that the ridge regression estimator β̂_λ ∈ R^p, based on the training data (X, y), is defined as a solution to the following regularized least squares problem:

    minimize over β ∈ R^p:  (1/n) ‖y − Xβ‖_2^2 + λ ‖β‖_2^2.    (P)

Here, λ is a regularization parameter. For ease of reference in the following, denote the least squares loss (1/n)‖y − Xβ‖_2^2 by L(β) and the ℓ2 regularizer ‖β‖_2^2 by R(β). This is referred to as the primal form of ridge regression, denoted by (P). One can write a Lagrangian dual problem of (P) that optimizes weights θ ∈ R^n over the data points as:

    minimize over θ ∈ R^n:  θ^⊤(XX^⊤/n + λI_n)θ/2 − λθ^⊤y/n.    (D)

This is referred to as the dual form of ridge regression, denoted by (D). Let us denote the solution of this problem by θ̂_λ. The estimator β̂_λ is linked to the dual weights as λβ̂_λ = X^⊤θ̂_λ. Now, there are three cases depending on whether λ is positive, zero, or negative.

When λ > 0, regardless of whether n ≥ p or p > n, both problems (P) and (D) are strictly convex and lead to the following unique ridge estimator (expressed in primal and dual forms, respectively):

    β̂_λ = (X^⊤X/n + λI_p)^{-1} X^⊤y/n = X^⊤(XX^⊤/n + λI_n)^{-1} y/n.    (15)

When λ = 0 (more precisely, defined through analytic continuation as λ → 0+) and X^⊤X is full rank (which is the case when n ≥ p and the design X has linearly independent columns), the optimization problem (P) is still strictly convex and leads to the following unique solution:

    β̂_0 = (X^⊤X/n)^{-1} X^⊤y/n.
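To make this case analysis concrete, here is a minimal NumPy sketch (not part of the paper's released code) that checks the primal and dual forms in (15) for λ > 0, the λ → 0+ least squares limit when X^⊤X is full rank, and the fact that the pseudoinverse definition remains well defined for negative λ above −s_min; the overparameterized case treated next works analogously through the dual form. The function names and the particular dimensions are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_primal(X, y, lam):
    """Primal form (X'X/n + lam I_p)^+ X'y/n; the pseudoinverse keeps it
    well defined for positive, zero, and suitably bounded negative lam."""
    n, p = X.shape
    return np.linalg.pinv(X.T @ X / n + lam * np.eye(p)) @ X.T @ y / n

def ridge_dual(X, y, lam):
    """Dual form X'(XX'/n + lam I_n)^+ y/n."""
    n, p = X.shape
    return X.T @ np.linalg.pinv(X @ X.T / n + lam * np.eye(n)) @ y / n

# Underparameterized example: n >= p with a full-column-rank design.
n, p = 200, 50
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# Primal and dual forms agree for lam > 0, as in (15).
assert np.allclose(ridge_primal(X, y, 0.1), ridge_dual(X, y, 0.1))

# lam -> 0+ recovers ordinary least squares when X'X is full rank.
ols = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(ridge_primal(X, y, 0.0), ols)

# Negative lam stays well defined as long as lam > -s_min, the smallest
# eigenvalue of X'X/n, since problem (P) is then still strictly convex.
s_min = np.min(np.linalg.eigvalsh(X.T @ X / n))
beta_neg = ridge_primal(X, y, -0.5 * s_min)
print("distance of negative-ridge fit from OLS:", np.linalg.norm(beta_neg - ols))
```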
When λ = 0 (again, defined using analytic continuation as λ 0+) and XX is full rank (p > n and the design X has independent rows), the optimization problem (D) is still strictly convex and leads to the following unique solution: bβ0 = X (XX /n) 1y/n. More generally, when n p and λ smin, where smin is the smallest eigenvalue of X X/n, the optimization problem (P) is still strictly convex and leads to the primal solution in (15). Similarly, when p > n and λ smin, where smin is the smallest eigenvalue of XX /n, the dual to the optimization problem (D) is strictly convex and leads to the dual solution in (15). By defining the ridge estimator as: bβλ = (X X/n + λIp) X y/n = X (XX /n + λIn) y/n, where A denotes the Moore-Penrose pseudoinverse of the matrix A, we can handle all the cases mentioned above simultaneously. To gain intuition, we visually compare the ridge solutions with positive and negative regularization in Figures B5 and B7 for the underand over-parameterized regimes, respectively. See (Hua & Gunst, 1983; Bj orkstr om & Sundberg, 1999) also for some related discussion. Optimal Ridge Regularization for Out-of-Distribution Prediction B.1.1. VISUALIZING RIDGE REGRESSION SOLUTIONS FOR POSITIVE VERSUS NEGATIVE REGULARIZATION (UNDERPARAMETERIZED REGIME) -10 -5 0 5 10 -10 -10 -5 0 5 10 -10 Figure B5: Comparing the geometry of ridge solutions for positive versus negative regularization levels λ in the underparameterized regime when p n. We consider a two-dimensional underparameterized problem with a design matrix X having the smallest eigenvalue of X X/n as smin = 4/3 and the largest eigenvalue as smax = 1/3. We plot the contours of the ℓ2 square loss L and the ℓ2 regularizer R in the problem (P) for both positive and negative values of λ. For positive values of λ [0, ], the loss contours touch the constraint contours from the outside, while for negative values of λ ( , smax) ( smin, 0), they touch from the inside. To better understand the trend, it helps to think of tying + to on the real line, making it a projective real line . The values of λ are plotted in the following order: 0.0 0.1 0.2 3.0 3.0 0.2 0.1 0.0 . 3.0 0.2 0.1 0.1 0.2 3.0 Figure B6: Illustration of the sequence of ridge regularization levels λ used in Figure B5 and the corresponding real projective line. (Note: The λ values are not necessarily drawn to scale.) Optimal Ridge Regularization for Out-of-Distribution Prediction B.1.2. VISUALIZING RIDGE REGRESSION SOLUTIONS FOR POSITIVE VERSUS NEGATIVE REGULARIZATION (OVERPARAMETERIZED REGIME) Figure B7: Comparing the geometry of ridge solutions for positive versus negative regularization levels λ in the overparameterized regime when p > n. We consider a two-dimensional overparameterized problem with the design matrix X having the smallest non-zero eigenvalue of X X/n equal to s+ min = 1. Similarly to Figure B5, we plot contours of the ℓ2 squares loss L and the ℓ2 regularizer R (from problem (P)) for both positive and negative values of λ. As before, note that for positive values of λ, the loss contours touch the constraint contours from the outside, while for negative values of λ, they touch from the inside. To reiterate, it helps to think of tying to on the real line to better understand the trend. The values of λ are plotted in the following order: 0.0 0.1 0.2 3.0 3.0 0.2 0.1 0.0 . smax s+ min 3.0 0.2 0.1 0.1 0.2 3.0 Figure B8: Illustration of the sequence of ridge regularization levels λ used in Figure B7 and the corresponding real projective line. 
(Note: Here s+ min is the smallest non-zero eigenvalue of X X/n.) Optimal Ridge Regularization for Out-of-Distribution Prediction C. Proofs in Section 2 C.1. Proof of Proposition 2.4 Proposition 2.4 (Deterministic equivalents for OOD risk). Under Assumption 2.2, as n, p such that p/n ϕ (0, ) and λ (λmin(ϕ), ), the prediction risk R(bβλ) defined in (3) admits a deterministic equivalent R(bβλ) R(λ, ϕ), where the equivalent additively decomposes into: R(λ, ϕ) := B(λ, ϕ) + V(λ, ϕ) + S(λ, ϕ) + κ2, (6) with the following deterministic equivalents for the bias, variance, regression shift bias, and irreducible error: B = µ2 β (Σ + µI) 1(evΣ + Σ0)(Σ + µI) 1β, S = 2µ β (Σ + µI) 1Σ0(β0 β), κ2 = (β0 β) Σ0(β0 β) + σ2 0. Proof. The main workhorse of the proof is Lemma E.1. Recall that R(bβλ) = (bβλ β0) Σ0(bβλ β0) + σ2 0 = (bβλ β) Σ0(bβλ β) + 2(bβλ β) Σ0(β β0) + {(β0 β) Σ0(β0 β) + σ2 0}. (16) Note that the last quadratic term of (16) is just a constant. From Lemma E.1, we know that the first term of (16) has the following deterministic equivalent: (bβλ β) Σ0(bβλ β) Qp(λ, ϕ, ϕ), (17) Qp(λ, ϕ, ψ) = ecp(λ, ϕ, ψ, Σ0) + f NL 2 L2evp(λ, ϕ, ψ, Σ0), and the non-negative constants ecp(λ, ϕ, ψ, Σ0) and evp(λ, ϕ, ψ, Σ0) are defined through the following equations: 1 vp(λ, ψ) = λ + ψ Z r 1 + vp(λ, ψ)r d Hp(r), (18) evp(λ, ϕ, ψ, Σ0) = ϕ tr[Σ0Σ(vp(λ, ψ)Σ + I) 2] vp(λ, ψ) 2 ϕ Z r2 (1 + vp(λ, ψ)r)2 d Hp(r) , (19) ecp(λ, ϕ, ψ, Σ0) = β (vp(λ, ψ)Σ + I) 1(evp(λ, ϕ, ψ, Σ0)Σ + Σ0)(vp(λ, ψ)Σ + I) 1β. For the second term of (16), we also have (bβλ β) Σ0(β β0) = f NL X n (bΣ + λI) 1Σ0(β β0) β λ(bΣ + λI) 1Σ0(β β0), where f NL is defined in the proof of Lemma E.1. The first term vanishes from Patil & Du (2023, Lemma D.2) because f NL and X are uncorrelated, and the remaining part has operator norm tending to zero. It follows that (bβλ β) Σ0(β β0) β (vp(λ, ϕ)Σ + I) 1Σ0(β β0). (20) Combining (16), (17), and (20) yields R(bβλ) Rp(λ, ϕ, ϕ) := Qp(λ, ϕ, ϕ) β (vp(λ, ϕ)Σ + I) 1Σ0(β β0) + {(β0 β) Σ0(β0 β) + σ2 0}, which completes the proof. Optimal Ridge Regularization for Out-of-Distribution Prediction D. Proofs in Section 3 Before presenting the proof of our main results, we introduce some notation to facilitate the upcoming presentation. Define qb(Σ0, B) = µ2 tr[(Σ + µI) 1Σ0(Σ + µI) 1B]/p (21) qv(Σ0, B) = ϕ tr[(Σ + µI) 1Σ0(Σ + µI) 1B]/p (22) l(Σ0, B) = tr[(Σ + µI) 1Σ0B]/p. (23) We denote the derivative with respect to µ by: q b(Σ0, B) := qb(Σ0, B) q v(Σ0, B) := qv(Σ0, B) We also let B = ββ and B0 = β0β . D.1. Proof of Theorem 3.1 Theorem 3.1 (Optimal regularization sign for IND risk). Assume the setup of Proposition 2.4 with (Σ0, β0) = (Σ, β). 1. (Underparameterized) When ϕ < 1, we have λ 0. 2. (Overparameterized) When ϕ > 1, if for all v < 1/µ(0, ϕ), the following general alignment holds: tr[BΣ(vΣ + I) 2] + σ2 tr[BΣ(vΣ + I) 3] + σ2 > tr[Σ(vΣ + I) 2] tr[Σ(vΣ + I) 3], (8) where B = ββ , then we have λ < 0. Proof. When (Σ0, β0) = (Σ, β), the excess bias term S(λ, ϕ) = 0. Therefore, only the bias and variance terms contribute to the risk. We next split the proof into two cases. Part (1) Underparameterized regime. When ϕ < 1, from Lemma F.2, µp(0, ϕ) = 0, and thus, we have that 0 = B(0, ϕ) min λ [λmin(ϕ),0] B(λ, ϕ). (24) On the other hand, when ϕ < 1, because V (λ, ϕ) < 0, the variance is strictly decreasing over λ (λmin(ϕ), 0). We have V(0, ϕ) < min λ (λmin(ϕ),0) V(λ, ϕ) (25) It follows that min λ (λmin(ϕ),0] R(λ, ϕ) R(0, ϕ) min λ λmin R(λ, ϕ) = min λ 0 R(λ, ϕ). Equivalently, we have λ 0. 
Optimal Ridge Regularization for Out-of-Distribution Prediction Part (2) Overparameterized regime. When ϕ > 1, we begin by deriving the formula of the derivative of the bias term. When (Σ0, β0) = (Σ, β), the bias term reduces to B(λ, ϕ) = qv(Σ, Σ) 1 qv(Σ, Σ)qb(Σ, B) + qb(Σ, B) = qb(Σ, B) 1 qv(Σ, Σ). From Lemma D.4, its derivative satisfies that (1 qv(Σ, Σ))2B (λ, ϕ) = (1 qv(Σ, Σ))2h 2(µ) = q b(Σ, B)(1 qv(Σ, Σ)) + qb(Σ, B)q v(Σ, Σ) (26) µq b(Σ, B) + µq b(Σ, B)qv(Σ, I) + qb(Σ, B)q v(Σ, Σ). where the last equality is from the identity 1 qv(Σ, Σ) = λ µ + µqv(Σ, I) (27) The in-distribution excess term vanishes, i.e., E(λ, ϕ) = 0. Therefore, the derivative of the risk with respect to µ satisfies that (1 qv(Σ, Σ))2B (λ, ϕ) (28) µq b(Σ, B) + µq b(Σ, B)qv(Σ, I) + qb(Σ, B)q v(Σ, Σ) µq b(Σ, B) + µq b(Σ, B)qv(Σ, I) + qb(Σ, B)q v(Σ, Σ) µ µ tr[Σβ(Σ + µI) 2] µ2 tr[Σβ(Σ + µI) 3] + 2 µ tr[Σβ(Σ + µI) 2] µ2 tr[Σβ(Σ + µI) 3] µϕ tr[Σ(Σ + µI) 2] 2µ tr[Σβ(Σ + µI) 2] µϕ tr[Σ2(Σ + µI) 3] tr[Σβ(Σ + µI) 3(Σ + µI)] µ tr[Σβ(Σ + µI) 3] + 2 µ tr[Σβ(Σ + µI) 2] µ2 tr[Σβ(Σ + µI) 3] µϕ tr[Σ(Σ + µI) 2] 2µ tr[Σβ(Σ + µI) 2] µϕ tr[Σ2(Σ + µI) 3] = 2λ tr[ΣβΣ(Σ + µI) 3] + 2µ2ϕ tr[Σβ(Σ + µI) 2] (tr[Σ(Σ + µI) 2] tr[Σ(Σ + µI) 3]) 2µ3ϕ tr[Σβ(Σ + µI) 3] tr[Σ2(Σ + µI) 2] = 2λ tr[ΣβΣ(Σ + µI) 3] + 2µ3ϕ tr[Σβ(Σ + µI) 2] tr[Σ(Σ + µI) 3] 2µ3ϕ tr[Σβ(Σ + µI) 3] tr[Σ(Σ + µI) 2]. (29) When λ = 0, it follows that = 2µ3ϕ (1 qv(Σ, Σ))2 tr[Σβ(Σ + µI) 2] tr[Σ(Σ + µI) 3] tr[Σβ(Σ + µI) 3] tr[Σ(Σ + µI) 2] , = 2µ3ϕ (1 qv(Σ, Σ))2 tr[Σβ(Σ + µI) 2] tr[Σ(Σ + µI) 3] tr[Σβ(Σ + µI) 3] tr[Σ(Σ + µI) 2] Optimal Ridge Regularization for Out-of-Distribution Prediction where µ(λ, ϕ)/ λ > 0 from Lemma F.2. From Lemma F.2 and (30), since µ(λ, ϕ)/ λ > 0, we further have that λ tr[Σβ(Σ + µI) 2] tr[Σ(Σ + µI) 3] tr[Σβ(Σ + µI) 3] tr[Σ(Σ + µI) 2], which finishes the proof of the first conclusion. To obtain the sign of the optimal ridge penalty, note that (1 qv(Σ, Σ))2B (λ, ϕ) = 2λ tr[ΣβΣ(Σ + µI) 3] + 2µ3ϕ(tr[Σβ(Σ + µI) 2] tr[Σ(Σ + µI) 3] tr[Σβ(Σ + µI) 3] tr[Σ(Σ + µI) 2]) When ϕ > 1, because µ 0, we have that T1 < 0 when λ < 0 and T1 > 0 when λ > 0. Also, under the assumption that tr[Σβ(Σ+µI) 2] tr[Σ(Σ+µI) 3] tr[Σβ(Σ+µI) 3] tr[Σ(Σ+µI) 2] > 0, we have that T2 > 0 for all λ > λmin(ϕ), with equality holds only when λ = 0. Thus, B(λ, ϕ) is minimized at λ < 0. For the variance term, from Lemma D.4, we have V (λ, ϕ) = σ2q v(Σ, Σ) (1 qv(Σ, Σ))2 = 2σ2 ϕ tr[Σ2(Σ + µI) 3] (1 qv(Σ, Σ))2 = 2ϕσ2 µ tr[Σ(Σ + µI) 3] tr[Σ(Σ + µI) 2] (1 qv(Σ, Σ))2 := T3 + T4 (1 qv(Σ, Σ))2 = 2σ2ϕ tr[Σ2(Σ + µI) 3] (1 qv(Σ, Σ))2 . (31) Note that 2σ2ϕ tr[Σ2(Σ + µI) 3] < 0 and strictly increasing over λ 0. Thus, (1 qv(Σ, Σ))2R (λ, ϕ) = (1 qv(Σ, Σ))2[B (λ, ϕ) + V (λ, ϕ)] = 2λ tr[ΣβΣ(Σ + µI) 3] tr[Σβ(Σ + µI) 2] tr[Σ(Σ + µI) 3] tr[Σβ(Σ + µI) 3] tr[Σ(Σ + µI) 2] + 2ϕσ2 µ tr[Σ(Σ + µI) 3] tr[Σ(Σ + µI) 2] = 2λ tr[ΣβΣ(Σ + µI) 3] + 2ϕ µ3 tr[Σβ(Σ + µI) 2] + µσ2 tr[Σ(Σ + µI) 3] 2ϕ µ3 tr[Σβ(Σ + µI) 3] + σ2 tr[Σ(Σ + µI) 2]. Under the condition that tr[Σ{µ3(Σ + µI) 3}] tr[Σ{µ2(Σ + µI) 2}] > tr[Σβ{µ3(Σ + µI) 3}] + σ2 tr[Σβ{µ2(Σ + µI) 2}] + σ2 , it follows that for all λ 0, (1 qv(Σ, Σ))2R (λ, ϕ) > 0. This implies that R(λ, ϕ) is minimized at λ < 0, which finishes the proof. Optimal Ridge Regularization for Out-of-Distribution Prediction D.2. General Alignment Condition of Theorem 3.1 under Special Cases Remark D.1 (Theorem 3.1 under isotropic features or signals). 
When ϕ > 1 and Σ = I, the condition above is never satisfied because tr[Σβ{µ3(Σ + µI) 3}] + σ2 tr[Σβ{µ2(Σ + µI) 2}] + σ2 = tr[B] + σ2 > µ 1+µ 2 = tr[Σ{µ3(Σ + µI) 3}] tr[Σ{µ2(Σ + µI) 2}]. Similarly, when β is isotropic, the condition above is never satisfied because tr[Σβ{µ3(Σ + µI) 3}] + σ2 tr[Σβ{µ2(Σ + µI) 2}] + σ2 = tr[Σ{µ3(Σ + µI) 3}] + σ2 tr[Σ{µ2(Σ + µI) 2}] + σ2 > tr[Σ{µ3(Σ + µI) 3}] tr[Σ{µ2(Σ + µI) 2}] since tr[Σ{µ3(Σ + µI) 3}] < tr[Σ{µ2(Σ + µI) 2}]. To see this, note that µ > 0 when ϕ > 1, from Lemma F.2. Thus, tr[Σ{µ3(Σ + µI) 3}] = tr[Σ{µ2(Σ + µI) 2}{µ(Σ + µI) 1}] tr[Σ{µ2(Σ + µI) 2}] µ(Σ + µI) 1 op tr[Σ{µ2(Σ + µI) 2}]. Proposition D.2 (Theorem 3.1 under strict alignment conditions). Assuming random signals, zero noise under the strict alignment conditions of Wu & Xu (2020), the general alignment condition (8) is satisfied. Proof. Let h be a random variable following the empirical measure of the eigenvalues of Σ = UΛU (i.e., Hp) and g be a random variable following the empirical measure diag(UE[β β]U). When the joint distribution of (h, g) exists, there exists a function f such that g has mass r 7 f(r) d Hp(r). The strict alignment condition from Wu & Xu (2020) imposes f to be either strictly increasing or decreasing. This also implies that tr[Σβ] = tr[ΣB] tr[Σ] tr[B]. (32) This holds because of the following Chebyshev s other inequality (Fink & Jodeit Jr., 1984).3 Fact D.3 (Positive/negative correlations of monotone functions; see, e.g., Appendix 9.9 of Ross (2022)). Let f and g be two real-valued functions of the same monotonicity. Let H be a probability measure on the real line. Then the following inequality holds: Z f(r)g(r) d H(r) Z f(r) d H(r) Z g(r) d H(r). (33) On the other hand, if f is non-decreasing and g is non-increasing, then the inequality in (33) is reversed. Now we will verify that this condition indeed implies (8) when σ2 = 0. Define f(r) = Pp i=1(β wi)2 1{r = ri}. Hp(r) = p 1 Pp i=1 1{r = ri} and the transformed measure e Hp(r) = r(r + µ) 2p 1 Pp i=1 1{r = ri}/ R r(r + µ) 2 d Hp(r). Because f(r) is increasing and 1/(r + µ) is decreasing in r, it follows that Z f(r) r r + µ d e Hp(r) Z f(r) d e Hp(r) Z r r + µ d e Hp(r). Transforming back to Hp(r) yields that Z f(r) r (r + µ)3 d Hp(r) Z r (r + µ)2 d Hp(r) Z f(r) r (r + µ)2 d Hp(r) Z r (r + µ)3 d Hp(r). Equivalently, tr[Σβ{µ2(Σ + µI) 2}] tr[Σβ{µ3(Σ + µI) 3}] tr[Σ{µ2(Σ + µI) 2}] tr[Σ{µ3(Σ + µI) 3}], with equality holds if and only if f(r) is Hp-almost surely constant. This finishes the proof. 3This is the second (less well-known) inequality, and not the first (more well-known) tail inequality that may come to the reader s mind! Optimal Ridge Regularization for Out-of-Distribution Prediction D.3. Proof of Proposition 3.2 Proposition 3.2 (Optimal regularization under covariate shift and random signal). When Σ0 = Σ and β0 = β, assuming isotropic signals E[ββ ] = (α2/p)I, we have λ (ϕ) = ϕ/SNR = argminλ Eβ[R(λ, ϕ)]. Furthermore, for λmin < 0 such that X X/n λmin I, even the non-asymptotic OOD risk (3) is minimized at λ p = ϕp/SNR, where ϕp = p/n. Proof. With slight abuse of notations, we use R, B, V, and S to denote Eβ[R], Eβ[B], Eβ[V], and Eβ[S] as well. We split the proof into two parts. Part (1) Asymptotic risk. When β0 = β, the extra bias term is zero, that is, S(λ, ϕ) = 0. So, only the bias and variance components contribute to the risk. We begin by deriving the formula for the derivative of the bias. 
For isotropic signals, from Lemma D.4 and (27), we have (1 qv(Σ, Σ))2B (λ, ϕ) = α2[q v(Σ0, Σ)qb(Σ, I)(1 qv(Σ, Σ)) + qv(Σ0, Σ)q b(Σ, I)(1 qv(Σ, Σ)) + qv(Σ0, Σ)qb(Σ, I)q v(Σ, Σ) + q b(Σ0, I)(1 qv(Σ, Σ))2] = α2 q v(Σ0, Σ)qb(Σ, I) λ µ + µqv(Σ, I) + qv(Σ0, Σ)q b(Σ, I) λ µ + µqv(Σ, I) + qv(Σ0, Σ)qb(Σ, I)q v(Σ, Σ) + q b(Σ0, I) λ µ + µqv(Σ, I) 2 q v(Σ0, Σ)qb(Σ, I) + qv(Σ0, Σ)q b(Σ, I) + λ µq b(Σ0, I) + 2µq b(Σ0, I)qv(Σ, I) + α2h µq v(Σ0, Σ)qb(Σ, I)qv(Σ, I) + µqv(Σ0, Σ)q b(Σ, I)qv(Σ, I) + qv(Σ0, Σ)qb(Σ, I)q v(Σ, Σ) + µ2q b(Σ0, I)qv(Σ, I)2i . Notice that qb(Σ0, I) = µ2 tr[Σ0(Σ + µI) 2] ϕ qv(Σ0, I) q b(Σ0, I) = 2µ tr[Σ0(Σ + µI) 2] 2µ2 tr[Σ0(Σ + µI) 3] = 2µ tr[Σ0Σ(Σ + µI) 3] ϕq v(Σ0, Σ), thus, it follows that q v(Σ0, Σ)qb(Σ, I) + µq b(Σ0, I)qv(Σ, I) = q v(Σ0, Σ) qb(Σ, I) µ2 ϕ qv(Σ, I) = 0 µq b(Σ, I)qv(Σ, I) + qb(Σ, I)q v(Σ, Σ) = µ2 ϕ [ q v(Σ, Σ)qv(Σ, I) + q v(Σ, Σ)qv(Σ, I)] = 0. Therefore, we have (1 qv(Σ, Σ))2B (λ, ϕ) Optimal Ridge Regularization for Out-of-Distribution Prediction qv(Σ0, Σ)q b(Σ, I) + λ µq b(Σ0, I) + µq b(Σ0, I)qv(Σ, I) ϕ qv(Σ0, Σ)q v(Σ, Σ) α2λ2 µϕ q v(Σ0, Σ) α2λµ ϕ q v(Σ0, Σ)qv(Σ, I) ϕ [qv(Σ0, Σ)q v(Σ, Σ) + µq v(Σ0, Σ)qv(Σ, I)] α2λ2 µϕ q v(Σ0, Σ). From (27), we further have that B (λ, ϕ) = α2λ ϕ(1 qv(Σ, Σ))2 [qv(Σ0, Σ)q v(Σ, Σ) q v(Σ0, Σ)qv(Σ, Σ) + q v(Σ0, Σ)] ϕ µ qv(Σ0, Σ) 1 qv(Σ, Σ). From Lemma D.4, we know that µ qv(Σ0, Σ) 1 qv(Σ, Σ) < 0 for all λ ( λmin(ϕ), + ). Furthermore, we have R (λ, ϕ) = B (λ, ϕ) + V (λ, ϕ) ϕ µ qv(Σ0, Σ) 1 qv(Σ, Σ) + σ2 µ qv(Σ0, Σ) 1 qv(Σ, Σ). Setting the above to zero gives λ = ϕσ2/α2. Also note that R (λ, ϕ) is negative when λ < λ and positive when λ > λ , as well as µ/ λ > 0 from Lemma F.2 for all λ λmin(ϕ). Thus, λ gives the optimal risk. Part (2) Finite-sample risk. Recall the finite-sample risk is given by (3). Under Assumption 2.2 and isotropic signals, it follows that R(bβλ) = λ2α2 tr[(bΣ + λI) 2Σ0] + σ2ϕ tr[(bΣ + λI) 2bΣΣ0]. Taking the derivative with respect to λ yields that λ = 2λα2 tr[(bΣ + λI) 2Σ0] 2λ2α2 tr[(bΣ + λI) 3Σ0] 2σ2ϕ tr[(bΣ + λI) 3bΣΣ0] = 2λα2 tr[(bΣ + λI) 3bΣΣ0] 2σ2ϕ tr[(bΣ + λI) 3bΣΣ0] = 2(λα2 σ2ϕ) tr[(bΣ + λI) 2bΣΣ0]. Setting the above to zero gives λ p = ϕn/SNR. D.4. Proof of Theorem 3.3 Theorem 3.3 (Optimal regularization under covariate shift and deterministic signal). Assume the setting of Proposition 2.4 with Σ0 = Σ, β0 = β. 1. (Underparameterized) When ϕ < 1, we have λ 0. 2. (Overparameterized) When ϕ > 1, if Σ0 = I (corresponding to the estimation risk), then we have λ 0. 3. (Overparameterized) When ϕ > 1, if Σ = I and tr[Σ0B] > tr[Σ0] tr[B] + (1 + µ(0, ϕ))3 µ(0, ϕ)3 σ2 , (10) where B = ββ , then we have λ < 0. Proof. When β0 = β, the extra bias term is zero, that is, S(λ, ϕ) = 0. So, only bias and variance contribute to the risk. We next split the proof into two cases. Optimal Ridge Regularization for Out-of-Distribution Prediction Part (1) Underparameterized regime. When ϕ < 1, from Lemma F.3 we have µ(0, ϕ) = 0 and the bias defined in Proposition 2.4 becomes zero, i.e., B(0, ϕ) = 0. From the fixed-point equation (53), we also have limλ 0+ λ/µ(λ, ϕ) = 0. 
From Lemma D.4 and (27), we have (1 qv(Σ, Σ))2B (λ, ϕ) = q v(Σ0, Σ)qb(Σ, B)(1 qv(Σ, Σ)) + qv(Σ0, Σ)q b(Σ, B)(1 qv(Σ, Σ)) + qv(Σ0, Σ)qb(Σ, B)q v(Σ, Σ) + q b(Σ0, B)(1 qv(Σ, Σ))2 = q v(Σ0, Σ)qb(Σ, B) λ µ + µqv(Σ, I) + qv(Σ0, Σ)q b(Σ, B) λ µ + µqv(Σ, I) + qv(Σ0, Σ)qb(Σ, B)q v(Σ, Σ) + q b(Σ0, B) λ µ + µqv(Σ, I) 2 q v(Σ0, Σ)qb(Σ, B) + qv(Σ0, Σ)q b(Σ, B) + λ µq b(Σ0, B) + 2µq b(Σ0, B)qv(Σ, I) + h µq v(Σ0, Σ)qb(Σ, B)qv(Σ, I) + µqv(Σ0, Σ)q b(Σ, B)qv(Σ, I) + qv(Σ0, Σ)qb(Σ, B)q v(Σ, Σ) + µ2q b(Σ0, B)qv(Σ, I)2i . Thus, the derivative of the bias term becomes zero, i.e., B (0, ϕ) = 0. Note that the variance term is strictly decreasing over (λmin(ϕ), + ) from Lemma D.4. Similarly to part (1) of the proof of Theorem 3.1, it follows that λ 0. Part (2) Overparameterized regime and Σ = I. When Σ = I, the above derivative in Part (1) becomes (1 qv(I, I))2B (λ, ϕ) q v(Σ0, I)qb(I, B) + qv(Σ0, I)q b(I, B) + λ µq b(Σ0, B) + 2µq b(Σ0, B)qv(I, I) + h µq v(Σ0, I)qb(I, B)qv(I, I) + µqv(Σ0, I)q b(I, B)qv(I, I) + qv(Σ0, I)qb(I, B)q v(I, I) + µ2q b(Σ0, B)qv(I, I)2i . Note from (21) and (22), we have that qb(Σ0, B) = µ2 (1 + µ)2 tr[Σ0B] q b(Σ0, B) = 2µ2 (1 + µ)3 + 2µ (1 + µ)2 tr[Σ0B] = 2µ (1 + µ)3 tr[Σ0B] = 2 µ(1 + µ)qb(Σ0, B), qv(I, I) = ϕ (1 + µ)2 q v(I, I) = 2ϕ (1 + µ)3 = 2 1 + µqv(I, I) qv(Σ0, I) = ϕ (1 + µ)2 tr[Σ0] q v(Σ0, I) = 2ϕ (1 + µ)3 tr[Σ0] = 2 1 + µqv(Σ0, I). We further have (1 qv(I, I))2B (λ, ϕ) Optimal Ridge Regularization for Out-of-Distribution Prediction 2 1 + µqv(Σ0, I)qb(I, B) + 2 µ(1 + µ)qv(Σ0, I)qb(I, B) + 2λ µ2(1 + µ)qb(Σ0, B) + 4 1 + µqb(Σ0, B)qv(I, I) 1 + µqv(Σ0, I)qb(I, B)qv(I, I) + 2 1 + µqv(Σ0, I)qb(I, B)qv(I, I) 2 1 + µqv(Σ0, I)qb(I, B)qv(I, I) + 2µ 1 + µqb(Σ0, B)qv(I, I)2] (1 + µ)5 tr[Σ0] tr[B] + 2λ (1 + µ)3 tr[Σ0B] + 4ϕµ2 (1 + µ)5 tr[Σ0B] 2µ 1 + µqv(Σ0, I)qb(I, B)qv(I, I) + 2µ 1 + µqb(Σ0, B)qv(I, I)2 (1 + µ)5 tr[Σ0] tr[B] + 2λ (1 + µ)3 tr[Σ0B] + 4ϕµ2 (1 + µ)5 tr[Σ0B] (1 + µ)7 (tr[Σ0B] tr[Σ0] tr[B]) (1 + µ)5 tr[Σ0] tr[B] + 2[µ(1 + µ)2 ϕµ(1 µ)] (1 + µ)5 tr[Σ0B] (1 + µ)7 (tr[Σ0B] tr[Σ0] tr[B]) (1 + µ)5 (tr[Σ0] tr[B] tr[Σ0B]) + 2µ (1 + µ)3 tr[Σ0B] (1 + µ)7 (tr[Σ0B] tr[Σ0] tr[B]). For the variance term, from Lemma D.4, we have (1 qv(I, I))2V (λ, ϕ) = σ2[q v(Σ0, I)(1 qv(I, I)) + qv(Σ0, I)q v(I, I)] = σ2 q v(Σ0, I) λ µ + µqv(I, I) + qv(Σ0, I)q v(I, I) = σ2 2λ µ(1 + µ)qv(Σ0, I) σ2 2 1 + µ [µqv(Σ0, I)qv(I, I) + qv(Σ0, I)qv(I, I)] = σ2 2λϕ µ(1 + µ)3 tr[Σ0] σ2 2ϕ2 (1 + µ)4 tr[Σ0]. From the fixed-point equation (53) we have µ(1 + µ) ϕµ = λ(1 + µ) and thus, (1 qv(I, I))2V (λ, ϕ) = σ2 2ϕ(µ(1 + µ) ϕµ) µ(1 + µ)4 tr[Σ0] σ2 2ϕ2 (1 + µ)4 tr[Σ0] = σ2 2ϕ (1 + µ)3 tr[Σ0], which is strictly increasing in λ 0. Then we have (1 qv(I, I))2R (λ, ϕ) = (1 qv(Σ, Σ))2[B (λ, ϕ) + V (λ, ϕ)] (1 qv(Σ, Σ))2|λ=0[B (0; ϕ) + V (0; ϕ)] = 2λ ϕ(1 µ) (1 + µ)5 (tr[Σ0] tr[B] tr[Σ0B]) + 1 (1 + µ)3 tr[Σ0B] Optimal Ridge Regularization for Out-of-Distribution Prediction (1 + µ)7 (tr[Σ0B] tr[Σ0] tr[B]) σ2 2ϕ (1 + µ)3 tr[Σ0] = 2λϕ (1 + µ)5 ϕ tr[Σ0B] + tr[Σ0] tr[B] (34) (1 + µ)5 2λϕµ2 tr[Σ0B] tr[Σ0] tr[B] (35) tr[Σ0B] tr[Σ0] tr[B] + (1 + µ)3 µ3 σ2 . (36) Note that when ϕ > 1, µ(λ, ϕ) > 0 from Lemma F.2. Under the alignment condition, we have tr[Σ0B] > tr[Σ0] tr[B] + (1 + µ(0, ϕ))3 µ(0, ϕ)3 σ2 tr[Σ0] tr[B]. Thus, for λ 0, the second term (35) is non-negative, and the third term (36) is strictly positive. When λ 0, from the fixed-point equation (53) we have ϕ = (1 + µ) λ(1 + µ)/µ 1 + µ and (1 + µ)2 ϕ ϕ(1 + µ) ϕ = ϕµ. Then, we know that the first term (34) is non-negative. Therefore, it follows that for all λ 0, (1 qv(Σ, Σ))2R (λ, ϕ) > 0. 
This implies that R(λ, ϕ) is minimized at λ < 0. Part (3) Overparameterized regime and Σ0 = I. When Σ0 = I, the above derivative in Part (1) becomes (1 qv(Σ, Σ))2B (λ, ϕ) q v(I, Σ)qb(Σ, B) + qv(I, Σ)q b(Σ, B) + λ µq b(I, B) + 2µq b(I, B)qv(Σ, I) + [µq v(I, Σ)qb(Σ, B)qv(Σ, I) + µqv(I, Σ)q b(Σ, B)qv(Σ, I) (37) + qv(I, Σ)qb(Σ, B)q v(Σ, Σ) + µ2q b(I, B)qv(Σ, I)2] (38) Recalling the definitions of qb( , ) and qv( , ) from (21) and (22), observe that qb(I, B) = µ2 tr[(Σ + µI) 2B] q b(I, B) = 2µ tr[(Σ + µI) 2B] 2µ2 tr[(Σ + µI) 3B] = 2 µqb(I, B) 2µ2 tr[(Σ + µI) 3B] = 2µ tr[(Σ + µI) 3ΣB], qv(I, Σ) = ϕ tr[(Σ + µI) 2Σ] q v(I, Σ) = 2ϕ tr[(Σ + µI) 3Σ]. Next, we work on terms (37) and (38) that do not involve a factor of λ/µ. µq v(I, Σ)qb(Σ, B)qv(Σ, I) + µqv(I, Σ)q b(Σ, B)qv(Σ, I) + qv(I, Σ)qb(Σ, B)q v(Σ, Σ) + µ2q b(I, B)qv(Σ, I)2 = 2µϕ tr[(Σ + µI) 3Σ]qb(Σ, B)qv(Σ, I) µqb(Σ, B) 2µ2 tr[(Σ + µI) 3ΣB] qv(I, Σ)qv(Σ, I) 2ϕ tr[(Σ + µI) 3Σ2]qv(I, Σ)qb(Σ, B) µqb(I, B) 2µ2 tr[(Σ + µI) 3B] qv(Σ, I)2 Optimal Ridge Regularization for Out-of-Distribution Prediction = 2µϕ tr[(Σ + µI) 3Σ] + 1 µ tr[(Σ + µI) 2Σ] tr[(Σ + µI) 3Σ2] qb(Σ, B)qv(Σ, Σ) µqb(I, B) µ2 tr[(Σ + µI) 3B] µ tr[(Σ + µI) 3ΣB] qv(Σ, I)2 This implies that B (0, ϕ) = 0 and (1 qv(Σ, Σ))2B (λ, ϕ) µ (q v(I, Σ)qb(Σ, B) + qv(I, Σ)q b(Σ, B) + 2µq b(I, B)qv(Σ, I)) + λ2 µ2 q b(I, B) µ (q v(I, Σ)qb(Σ, B) + qv(I, Σ)q b(Σ, B) + 2µq b(I, B)qv(Σ, I)) + λ2 µ2 q b(I, B) µ µϕ tr[(Σ + µI) 3Σ] + qv(Σ, I) qb(Σ, B) 2µ2 tr[(Σ + µI) 3ΣB]qv(Σ, I) + 4µ2 tr[(Σ + µI) 3ΣB]qv(Σ, I) + λ2 µ2 q b(I, B) µϕ tr[(Σ + µI) 3Σ2]qb(Σ, B) + 2µ2 tr[(Σ + µI) 3ΣB]qv(Σ, I) + λ2 µ2 q b(I, B) tr[(Σ + µI) 3Σ2] tr[(Σ + µI) 2ΣB] + µ tr[(Σ + µI) 3ΣB] tr[(Σ + µI) 2Σ] + 2λ 1 ϕ tr[(Σ + µI) 1Σ] tr[(Σ + µI) 3ΣB] tr[(Σ + µI) 3Σ2] tr[(Σ + µI) 2ΣB] tr[(Σ + µI) 3ΣB] tr[(Σ + µI) 2Σ2] + 2λ tr[(Σ + µI) 3ΣB] = 2ϕλ tr[(Σ + µI) 3Σ2] tr[(Σ + µI) 2ΣB] + 2λ tr[(Σ + µI) 3ΣB](1 ϕ tr[(Σ + µI) 2Σ2]). From Lemma F.3 (3), we know that 1 ϕ tr[(Σ + µI) 2Σ2] 0. When ϕ > 1, it then follows that B is strictly negative for all λ [λmin(ϕ), 0) and strictly positive for all λ > 0 because µ(λ, ϕ) > 0. For the variance term, from Lemma D.4, we have (1 qv(Σ, Σ))2V (λ, ϕ) = σ2[q v(I, Σ)(1 qv(Σ, Σ)) + qv(I, Σ)q v(Σ, Σ)] = σ2 q v(I, Σ) λ µ + µqv(Σ, I) + qv(I, Σ)q v(Σ, Σ) tr[(Σ + µI) 3Σ] λ µ + µqv(Σ, I) + tr[(Σ + µI) 3Σ2]qv(I, Σ) µ tr[(Σ + µI) 3Σ] + tr[(Σ + µI) 2Σ]qv(Σ, I) = 2ϕσ2 1 ϕ tr[(Σ + µI) 1Σ] tr[(Σ + µI) 3Σ] + tr[(Σ + µI) 2Σ]qv(Σ, I) tr[(Σ + µI) 3Σ] + ϕ(tr[(Σ + µI) 2Σ]2 tr[(Σ + µI) 1Σ] tr[(Σ + µI) 3Σ]) , which is strictly negative for all λ λmin(ϕ). Combining the above two derivatives, we conclude that λ > 0 in this case. D.5. Proof of Theorem 3.4 Theorem 3.4 (Optimal regularization under regression shift). Assume the setup of Proposition 2.4 with Σ0 = Σ, β0 = β. Optimal Ridge Regularization for Out-of-Distribution Prediction 1. (Underparameterized) When ϕ < 1, if σ2 = o(1) and for all µ 0, the following general alignment holds: tr[B0Σ2(Σ + µI) 2] > tr[BΣ2(Σ + µI) 2], (11) where B = ββ and B0 = β0β , then we have λ < 0. 2. (Overparameterized) When ϕ > 1, if the general alignment conditions (8) and (11) hold, then we have λ < 0. Proof. We split the proof into two parts. Part (1) Underparameterized regime. From the proof of Theorem 3.1, we know that when Σ0 = Σ, the bias term satisfies that min λ [λmin(ϕ),0] B(λ, ϕ) B(0, ϕ) = 0, from (24). From Lemma D.4, the excess bias term has the following derivative: S (λ, ϕ) = 2β Σ(Σ + µI) 2Σ(β0 β) = 2β Σ(Σ + µI) 2Σ(β β0), which is zero when λ = 0 because µ(0, ϕ) = 0 when ϕ < 1, from Lemma F.2. 
Under the condition that β Σ(Σ + µI) 2Σ(β β0) < 0 for all µ 0, we have that S(λ, ϕ) is strictly increasing in λ 0. From (31), we have V (λ, ϕ) = 2σ2ϕ tr[Σ2(Σ + µI) 3] (1 qv(Σ, Σ))2 . Also, since B 0 with equality holds if λ = 0. Then we have, if S (λ, ϕ) + V (λ, ϕ) = 2β Σ(Σ + µI) 2Σ(β β0) + 2σ2ϕ tr[Σ2(Σ + µI) 3] (1 qv(Σ, Σ))2 > 0 (39) for all µ µ(0, ϕ), then λ < 0. Note that when λ + , µ + and the denominator of the second term tends to one, so we have V (λ, ϕ) µ 3. On the other hand, the first term scales as S (λ, ϕ) µ 2. Eventually, the first term dominates. Thus, the condition (39) could hold when S (λ, ϕ) is positive and large enough. Especially, when σ2 = o(1) and the assumed alignment condition is met, it follows that R(λ, ϕ) = B(λ, ϕ) + V(λ, ϕ) + S(λ, ϕ) is minimized over λ < 0. Part (2) Overparameterized regime. From the proof of Theorem 3.1, we know that when (8), B(λ, ϕ) + V(λ, ϕ) is minimized at λ < 0. Under the condition that β Σ(Σ + µI) 2Σ(β β0) 0 for all µ > 0, we have S (λ, ϕ) 0 over λ (λmin(ϕ), + ). This implies that S(λ, ϕ) is increasing over λ (λmin(ϕ), + ). Combining the two results, we further see that the risk R(λ, ϕ) is minimized at λ < 0. D.6. Helper Lemmas Lemma D.4 (Out-of-distribution risk derivatives). Under the same conditions as in Proposition 2.4, we have λ = (B (λ, ϕ) + V (λ, ϕ) + S (λ, ϕ)) µ B (λ, ϕ) := B(λ, ϕ) µ = 1 (1 qv(Σ, Σ))2 h q v(Σ0, Σ)qb(Σ, B)(1 qv(Σ, Σ)) + qv(Σ0, Σ)q b(Σ, B)(1 qv(Σ, Σ)) + qv(Σ0, Σ)qb(Σ, B)q v(Σ, Σ) + q b(Σ0, B)(1 qv(Σ, Σ))2i V (λ, ϕ) := V(λ, ϕ) µ = σ2 q v(Σ0, Σ) q v(Σ0, Σ)qv(Σ, Σ) + q v(Σ, Σ)qv(Σ0, Σ) (1 qv(Σ, Σ))2 Optimal Ridge Regularization for Out-of-Distribution Prediction S (λ, ϕ) := S(λ, ϕ) µ2 qb(Σ0, (B B0)Σ). and µ = 1/vp(λ, ϕ). Furthermore, we have V (λ, ϕ) < 0 for all λ (λmin(ϕ), + ). Proof. We split the proof into different parts. Part (1) Bias term. Recall the expression for bias for out-of-distribution squared risk: B(λ, ϕ) = qv(Σ0, Σ) 1 qv(Σ, Σ)qb(Σ, B) + qb(Σ0, B) h1(µ) = qv(Σ0, Σ), h2(µ) = qb(Σ, B) 1 qv(Σ, Σ), h3(µ) = qb(Σ0, B). (41) Then we have B(λ, ϕ) = h1(µ)h2(µ) + h3(µ), B (λ, ϕ) = h 1(µ)h2(µ) + h1(µ)h 2(µ) + h 3(µ) = q v(Σ0, Σ) qb(Σ, B) 1 qv(Σ, Σ) + qv(Σ0, Σ)q b(Σ, B)(1 qv(Σ, Σ)) + qb(Σ, B)q v(Σ, Σ) (1 qv(Σ, Σ))2 + q b(Σ0, B) = 1 (1 qv(Σ, Σ))2 h q v(Σ0, Σ)qb(Σ, B)(1 qv(Σ, Σ)) + qv(Σ0, Σ)q b(Σ, B)(1 qv(Σ, Σ)) + qv(Σ0, Σ)qb(Σ, B)q v(Σ, Σ) + q b(Σ0, B)(1 qv(Σ, Σ))2i . Part (2) Variance term. Recall that the variance term is given by: V(λ, ϕ) = σ2 qv(Σ0, Σ) 1 qv(Σ, Σ). The derivative in µ is: V = σ2 1 (1 qv(Σ, Σ))2 {q v(Σ0, Σ)(1 qv(Σ, Σ)) + qv(Σ0, Σ)q v(Σ, Σ)}. q v(Σ, Σ) = 2ϕ tr[Σ2(Σ + µI) 2] < 0 q v(Σ0, Σ) = 2ϕ tr[Σ0Σ(Σ + µI) 2] = 2ϕ tr[(Σ + µI) 1Σ1/2Σ0Σ1/2(Σ + µI) 1] < 0. From Lemma F.3, we also have qv(Σ, Σ) > 0 and 1 qv(Σ, Σ) > 0 when λ (λmin(ϕ), + ). Therefore, it holds that V (λ, ϕ) < 0. Part (3) Extra bias term. Recall that the extra bias term is given by S(λ, ϕ) = 2β (vΣ + I) 1Σ0(β β0) = 2µβ (Σ + µI) 1Σ0(β β0) = l(Σ0, (B B0)Σ). Then, the derivative is given by: S (λ, ϕ) = 2β Σ(Σ + µI) 2Σ0(β β0) = 2 µ2 qb(Σ0, (B B0)Σ). Optimal Ridge Regularization for Out-of-Distribution Prediction E. Proofs in Section 4 E.1. Proof of Proposition 4.1 Proposition 4.1 (Optimal risk under isotropic signals). When Σ0 = Σ and β = β0, assuming isotropic signals E[ββ ] = (α2/p)I the optimal risk obtained at λ (ϕ) = ϕ/SNR is given by: Eβ[R(λ , ϕ)] = α2µ tr[Σ0(Σ + µ I) 1] + σ2 0, (12) where µ = µ(λ , ϕ). 
Furthermore, the left side of (12) is strictly increasing in ϕ if SNR (0, ) and σ2 0 are fixed and strictly increasing in SNR if ϕ, σ2, and σ2 0 are fixed. Proof. When β = β0 and ββ α2I, from (27), we have that B(λ, ϕ) = α2 qv(Σ0, Σ) 1 qv(Σ, Σ)qb(Σ, I) + α2qb(Σ0, I) 1 qv(Σ, Σ)qv(Σ, I) + qv(Σ0, I) 1 qv(Σ, Σ) λ + µqv(Σ0, I) ϕ qv(Σ0, Σ) + µqv(Σ0, I) α2 λ ϕ qv(Σ0, Σ) 1 qv(Σ, Σ) = α2µ tr[Σ0(Σ + µI) 1] α2 λ ϕ qv(Σ0, Σ) 1 qv(Σ, Σ). Therefore, from Proposition 2.4 and Theorem 3.4, the optimal risk is given by R(bβλ ) Bp(λ , ϕ) + Vp(λ , ϕ) + σ2 0 = α2µ tr[Σ0(Σ + µ I) 1] + σ2 α2λ qv(Σ0, Σ) 1 qv(Σ, Σ) + σ2 0 = α2µ tr[Σ0(Σ + µ I) 1] + σ2 0 = α2 tr[Σ0(v Σ + I) 1] + σ2 0, where µ = µ(λ , ϕ) and v = v(λ , ϕ). Note that when σ2 and σ2 0 are fixed, R(λ , ϕ) = ϕσ2 (µ /λ ) tr[Σ0(Σ + µ I) 1] + σ2 0 is simply a function of λ . From Lemma F.2 (4), we have that µ /λ is strictly decreasing in λ . Also note that tr[Σ0(Σ + µ I) 1] is strictly decreasing in λ . Thus, we know that R(λ , ϕ) is strictly decreasing in λ when ϕ, σ, σ0 are fixed. Because λ is strictly decreasing in SNR, we further find that R(λ , ϕ) is strictly increasing in SNR. E.2. Proof of Theorem 4.2 Theorem 4.2 (Monotonicity of optimally tuned OOD risk). For λ λmin(ϕ) where λmin(ϕ) is as in (5), for all ϵ > 0 small enough, the risk of optimal ridge predictor satisfies: min λ λmin(ϕ)+ϵ R(bβλ) min λ λmin(ϕ) R(λ, ϕ), (13) and right side of (13) is monotonically increasing in ϕ if SNR and σ2 0 are fixed. In addition, when β = β0 it is monotonically increasing in SNR if ϕ, σ2, and σ2 0 are fixed. Proof. We split the proof into different parts. Optimal Ridge Regularization for Out-of-Distribution Prediction Part (1) Risk characterization and equivalence. From the proof of Theorem 4.4, we have R(bβλ) Rp(λ, ϕ, ϕ) where Rp is defined in (50). Furthermore, for any ψ [ϕ, + ], there exists a unique λ λmin(ϕ) defined through (44). For any pair of (λ1, ψ1) and (λ2, ψ2) on the path P(λ; ϕ, ψ) as defined in (55), we have that: Rp(λ1; ϕ, ψ1) = Rp(λ2; ϕ, ψ2). Part (2) Risk monotonicity. From Lemma F.3 (3), we know that the denominator of evp defined in (19) is non-negative: vp(λ, ψ) 2 ϕ Z r2(1 + vp(λ, ψ)r) 2 d Hp(r) vp(λ, ψ) 2 ψ Z r2(1 + vp(λ, ψ)r) 2 d Hp(r) 0 for λ λmin(ϕ). Therefore, Rp(λ, ϕ, ψ) is increasing in ϕ for any fixed (λ, ψ). Furthermore, since Rp(λ, ϕ, ψ) is a continuous function of ϕ and v(λ; ψ), it follows that for 0 < ϕ1 ϕ2 < , min ψ ϕ1 Rp(λ, ϕ1, ψ) min ψ ϕ2 Rp(λ, ϕ1, ψ) min ψ ϕ2 Rp(λ, ϕ2, ψ), where the first inequality follows because {ψ : ψ ϕ1} {ψ : ψ ϕ2}, and the second inequality follows because Rp(λ, ϕ, ψ) is increasing in ϕ for a fixed ψ. Thus, minψ ϕ Rp(λ, ϕ, ψ) is a continuous and monotonically increasing function in ϕ. Part (3) Optimal subsampling and regularization. Similar to the proof of Part (3) in Lemma E.1, we have min ψ ϕ Rp(0; ϕ, ψ) min λ λmin(ϕ)+ϵ R(bβλ). From Part (3), we know that the former is continuous and monotonically increasing in ϕ, which finishes the proof. Part (4) Monotonicity in signal-to-noise ratio. When β0 = β, the extra bias term S is zero. When σ2 is fixed, note that B and κ are strictly increasing in α2 while V does not depend on α2. Thus, we know that Rp is strictly increasing in SNR. Consequently, minψ ϕ Rp(0; ϕ, ψ) is strictly increasing in SNR. E.3. Proof of Theorem 4.3 Theorem 4.3 (Non-monotonicity of suboptimally tuned risk). When (Σ0, β0) = (Σ, β) and Σ = I, the risk component equivalents defined in (6) have the following properties: 1. 
(Bias component) For all λ > 0, B(λ, ϕ) is strictly increasing over ϕ (0, λ + 1) and strictly decreasing over ϕ (λ + 1, ). 2. (Variance component) For all λ > 0, V(λ, ϕ) is strictly increasing over ϕ (0, ). 3. (Risk) When β 2 2 > 0, for all λ > 0 and ϵ > 0, there exist σ2, ϕ (0, ), such that R(λ, ϕ)/ ϕ ϵ, i.e., max σ2,ϕ (0, ) min λ λmin(ϕ) R(λ, ϕ)/ ϕ ϵ. (14) Proof. For isotropic features Σ = I, analogous to the proof of Theorem 4.2, the excess prediction risk is given by R bβλ Rp(λ, ϕ, ϕ) where Rp(λ, ϕ, ψ) is defined in (6). When Σ = I, the non-negative constants ecp(λ, ψ) and evp(λ, ϕ, ψ) are defined through the following equations: (ϕ + λ 1)2 + 4λ (ϕ + λ 1) Optimal Ridge Regularization for Out-of-Distribution Prediction evp(λ, ϕ, ψ) = ϕ 1 (1 + vp(λ, ψ))2 vp(λ, ψ) 2 ϕ 1 (1 + vp(λ, ψ))2 ecp(λ, ψ) = (vp(λ, ψ)Σ + I) 2α2. From Theorem 1 of Patil & Du (2023), for all (λ, ϕ) (0, )2, there exists ψ = ψ(λ, ϕ) such that the prediction risk (45) of the full-ensemble estimator are asymptotically equivalent: Rp(λ, ϕ, ϕ) Rp(0; ϕ, ψ). Note that the left-hand side is simply the risk of the ridge predictor with ridge penalty λ. Furthermore, from (44), it holds that λ + ϕ Z r 1 + vp(0; ψ)r d H(r) = ψ Z r 1 + vp(0; ψ)r d H(r). Taking the derivative with respect to ϕ on both sides yields that ψ ϕ = 1 λvp(0; ψ) 1 λvp(0; ψ)evp(0; ϕ, ψ). (42) We consider three cases below. (1) α2 = 0 and σ2 > 0. In this case, the excess risk equals the variance component: R p(λ, ϕ, ϕ) = σ2evp(λ, ϕ, ϕ) 1 ϕ Z vp(λ, ϕ)r 1 + vp(λ, ϕ)r 2 d Hp(r) 1 Let f(ϕ) := ϕ R (vp(λ, ϕ)r/(1 + vp(λ, ϕ)r))2 d Hp(r). Then, the monotonicity of the above display in ϕ is the same as that of f in ϕ. Note that (ϕ + λ 1)2 + 4λ (ϕ + λ 1) p (ϕ + λ 1)2 + 4λ (ϕ λ 1) Taking the derivative with respect to ϕ, we have f (ϕ) = ( ϕ + λ + 1) p (ϕ + λ 1)2 + 4λ ϕ λ + 1 2 (ϕ + λ 1)2 + 4λ p (ϕ + λ 1)2 + 4λ + λ ϕ + 1 2 . Since f (ϕ) > 0 when ϕ (0, λ + 1) and f (ϕ) < 0 when ϕ (λ + 1, + ), we know that Rp is strictly increasing over ϕ (0, λ + 1) and strictly decreasing over ϕ (λ + 1, + ). Thus, the monotonicity of Rp(λ, ϕ, ϕ) follows. (2) α2 > 0 and σ2 = 0. In this case, the excess risk equals the bias component: Rp(λ, ϕ, ϕ) = ecp( λ; ϕ)(evp(λ, ϕ, ϕ) + 1) 1 (1 + vp(λ, ϕ))2 1 ϕ 1 (1 + vp(λ, ϕ))2 Optimal Ridge Regularization for Out-of-Distribution Prediction (1 ϕ)vp(λ, ϕ)2 + 2vp(λ, ϕ) + 1. Let g(ϕ) = (1 ϕ)vp(λ, ϕ)2 + 2vp(λ, ϕ) + 1. Taking the derivative with respect to ϕ yields (ϕ + λ 1)2 + 4λ λ ϕ + 1 (λ + ϕ 1)2 + 4λ λ + 3(ϕ 1) p λ2 + 2λ(ϕ + 1) + ϕ2 2ϕ + 1 λ2 4λ(ϕ + 1) 3(ϕ 1)2 (ϕ + λ 1)2 + 4λ λ ϕ + 1 (λ + ϕ 1)2 + 4λ h(ϕ). By simple calculations, one can show that h (ϕ) < 0 and h(ϕ) h(0) < 2λ when λ > 0. Therefore, we have g (ϕ) < 0 and Rp is strictly increasing over ϕ (0, ). (3) General cases when α2 > 0. Note that Rp(λ, ϕ, ϕ) = σ2evp(λ, ϕ, ψ) + ecp( λ; ϕ)(evp(λ, ϕ, ϕ) + 1) =: f1(ϕ) + f2(ϕ), where f1 first increases and then decreases in ϕ, and f2 is a strictly increasing function. Note that only f1 depends on σ2. Because for any λ > 0 and ϕ (λ + 1, ), f 1(ϕ) < 0 and its scale is proportional to σ2, we have that for all ϵ > 0, there exists σ2 > 0 such that f 1(ϕ) > f 2(ϕ) + ϵ. This implies that max σ2,ϕ (0, ) min λ 0 R(λ, ϕ) which completes the proof. E.4. Proof of Theorem 4.4 Theorem 4.4 (Optimal ensemble versus ridge regression under negative regularization). Let R := minψ ϕ,λ λmin(ϕ) R(λ, ϕ, ψ). Then the following statements hold: 1. (Underparameterized) When ϕ < 1 and β0 = β, λ 0, R = min λ 0 R(λ, ϕ, ϕ) = min ψ ϕ R(0; ϕ, ψ). 2. 
(Overparameterized) When ϕ 1, λ λmin(ϕ), R = min λ λmin(ϕ) R(λ, ϕ, ϕ) = min ψ ϕ R(λmin(ϕ); ϕ, ψ). Proof. We split the proof into different parts. Part (1) Risk characterization. From Lemma E.2, we have R(bβλ k, ) Rp(λ, ϕ, ψ). Part (2) Risk equivalence. From Lemma F.4, for any ψ [ϕ, + ], there exists a unique λ λmin(ϕ) such that 1 v = λ + ϕ Z r 1 + vr d Hp(r), and 1 v = ψ Z r 1 + vr d Hp(r). (44) For any pair of (λ1, ψ1) and (λ2, ψ2) on the path P(λ; ϕ, ψ) as defined in (55), we have that: Rp(λ1; ϕ, ψ1) = Rp(λ2; ϕ, ψ2). Optimal Ridge Regularization for Out-of-Distribution Prediction Part (3) Optimal risk. From (49), we have that for any λ λmin(ϕ), there exists ψ 1 such that |Rp(λmin(ϕ); ϕ, ψ) Rp(λ, ϕ, ϕ)| a.s. 0. From Lemma F.1 and Lemma F.3, 1/vp( λ; ψ) [ rmin, ] and lim ψ ϕ Rp(λmin(ϕ); ϕ, ψ) = lim λ λmin(ϕ)+ Rp(λ, ϕ, ϕ) = + . Similar to the proof of Lemma 28 from Bellec et al. (2023), one can show that the sequence of functions {Rp(λmin(ϕ); ϕ, ψ(λ)) Rp(λ, ϕ, ϕ)}p N is uniformly equicontinuous on λ Λ = [λmin(ϕ) + ϵ, ] almost surely for some small ϵ > 0 such that Rp(λ, ϕ, ψ) is no larger than the null risk Rp(+ ; ϕ, ϕ) when λ [λmin(ϕ) + ϵ, + ]. From Theorem 21.8 of Davidson (1994), it further follows that the sequences converge to zero uniformly over Λ almost surely. This implies that 0 = lim sup p max λ λmin(ϕ)+ϵ [Rp(λmin(ϕ); ϕ, ψ(λ)) Rp(λ, ϕ, ϕ)] min ψ ϕ lim sup p max λ λmin(ϕ)+ϵ [Rp(λmin(ϕ); ϕ, ψ) Rp(λ, ϕ, ϕ)] lim sup p min ψ ϕ max λ λmin(ϕ)+ϵ [Rp(λmin(ϕ); ϕ, ψ) Rp(λ, ϕ, ϕ)] = lim sup p min ψ ϕ Rp(λmin(ϕ); ϕ, ψ) min λ λmin(ϕ)+ϵ Rp(λ, ϕ, ϕ) . Conversely, since for any ψ ψ(λmin(ϕ) + ϵ), there exists λ λmin(ϕ) + ϵ such that |Rp(λmin(ϕ); ϕ, ψ) Rp(λ(ψ); ϕ, ϕ)| a.s. 0. Similarly, we can show that {Rp(λmin(ϕ); ϕ, ψ) Rp(λ(ψ); ϕ, ϕ)}p N is uniformly equicontinuous on ψ Ψ = [ψ(λmin(ϕ) + ϵ), ] almost surely. This also implies that 0 = lim inf p min ψ ψ(λmin(ϕ)+ϵ)[Rp(λmin(ϕ); ϕ, ψ) Rp(λ(ψ); ϕ, ϕ)] max λ λmin(ϕ) lim inf p min ψ ψ(λmin(ϕ)+ϵ)[Rp(λmin(ϕ); ϕ, ψ) Rp(λ, ϕ, ϕ)] lim inf p max λ λmin(ϕ) min ψ ψ(λmin(ϕ)+ϵ)[Rp(λmin(ϕ); ϕ, ψ) Rp(λ, ϕ, ϕ)] = lim inf p min ψ ψ(λmin(ϕ)+ϵ) Rp(λmin(ϕ); ϕ, ψ) min λ λmin(ϕ) Rp(λ, ϕ, ϕ) . Combining the previous two inequalities implies that min λ λmin(ϕ)+ϵ Rp(λ, ϕ, ϕ) min ψ ψ(λmin(ϕ)+ϵ) Rp(λmin(ϕ); ϕ, ψ). min λ λmin(ϕ)+ϵ Rp(λ, ϕ, ϕ) = min λ λmin(ϕ) Rp(λ, ϕ, ϕ) min ψ ψ(λmin(ϕ)+ϵ) Rp(λmin(ϕ); ϕ, ψ) = min ψ ϕ Rp(λmin(ϕ); ϕ, ψ), we further have min λ λmin(ϕ) Rp(λ, ϕ, ϕ) = min ψ ϕ Rp(λmin(ϕ); ϕ, ψ), which holds for ϕ (0, ). This finishes the proof of the second conclusion. Part (4) Optimal risk when ϕ < 1. When β0 = β, the excess bias term S 0. From Lemma D.4, we have that the bias component ecp(λ, ϕ, ϕ) of the risk equivalent is minimized at λ = 0 when ϕ < 1. Since evp(λ, ϕ, ϕ) is a strictly increasing function in vp(λ, ϕ) and vp(λ, ϕ) is a strictly decreasing function in λ, we see that evp(λ, ϕ, ϕ) is a strictly decreasing function in λ. Thus, we have that min λ [λmin,0] Rp(λ, ϕ, ϕ) Rp(0; ϕ, ϕ). This implies that min λ λmin Rp(λ, ϕ, ϕ) min λ 0 Rp(0; ϕ, ϕ), which finishes the proof of the first conclusion. Optimal Ridge Regularization for Out-of-Distribution Prediction E.5. Helper Lemmas Lemma E.1 (Monotonicity of generalized prediction risk with optimal ridge regularization). Suppose Assumption 2.2 hold. Define the generalized mean squared risk for a estimator bβ as: R(bβ; A, b, β0) = LA,b(bβ β0) 2 2, (45) where LA,b(β) = Aβ + b is a linear functional and (A, b) is independent of (X, y) such that A op and b 2 are almost surely bounded. 
Then, as k, n, p such that p/n ϕ (0, ) and p/k ψ [ϕ, ], there exists a sequence of random variables {Qp(λ, ϕ, ψ)} p=1 that is asymptotically equivalent to the risk of the full-ensemble ridge predictor, R(bβλ k, ; A, b, β0) Qp(λ, ϕ, ψ) := ecp(λ, ϕ, ψ, A A) + f NL 2 L2evp(λ, ϕ, ψ, A A), (46) where the non-negative constants ecp(λ, ϕ, ψ, A A) and evp(λ, ϕ, ψ, A A) are defined through the following equations: 1 vp(λ, ψ) = λ + ψ Z r 1 + vp(λ, ψ)r d Hp(r), evp(λ, ϕ, ψ, A A) = ϕ tr[A AΣ(vp(λ, ψ)Σ + I) 2] vp(λ, ψ) 2 ϕ Z r2 (1 + vp(λ, ψ)r)2 d Hp(r) , ecp(λ, ϕ, ψ, A A) = β 0 (vp(λ, ψ)Σ + I) 1(evp(λ, ϕ, ψ, A A)Σ + A A)(vp(λ, ψ)Σ + I) 1β0. For the ridge predictor when k = n and λmin(ϕ) defined in (5), the optimal risk equivalence minλ λmin(ϕ) Qp(λ, ϕ, ϕ) is monotonically increasing in ϕ. Proof. Given an observation (x, y), recall the decomposition y = f LI(x) + f NL(x) explained in Section 2. For n i.i.d. samples from the same distribution as (x, y), we define analogously the vector decomposition: y = f LI + f NL, (47) where f LI = Xβ0 and f NL = [f NL(xi)]i [n]. R(bβλ k, ; A, b, β0) = R(bβλ k, ; A, 0, β0) + 2b A(bβλ k, β0) + b 2 2. By Theorem 3 of Patil & Du (2023), the cross term vanishes, that is, b A(bβλ k, β0) a.s. 0. We then have |R(bβλ1 k1, ; A, b, β0) R(bβλ2 k2, ; A, b, β0)| a.s. |R(bβλ1 k1, ; A, 0, β0) R(bβλ2 k2, ; A, 0, β0)|. Thus, it suffices to analyze R(bβλ k, ; A, 0, β0). To simplify the notation, we define Rp(λ, ϕ, ψ) := R bβλ1 p/ψ , ; A, 0p, β0 (48) to indicate the dependency solely on p and (λ, ϕ, ψ). We split the proof into different parts. Part (1) Risk characterization. Under Assumption 2.2, from Equation (11) of Patil & Du (2023), we have that for λ 0, R(bβλ k, ; A, b, β0) Qp(λ, ϕ, ψ) where Qp is defined in (46). Note that for λ [λmin, 0), the fixed-point solution vp(λ, ψ) satisfies the same fixed-point equation as the one for λ 0. Since the above deterministic equivalent depends on λ only through vp(λ, ψ), it also applies to λ [λmin, 0). Optimal Ridge Regularization for Out-of-Distribution Prediction Part (2) Risk equivalence. From Lemma F.4, we have that, for any ψ [ϕ, + ], there exists λ uniquely defined through (44). For any pair of (λ1, ψ1) and (λ2, ψ2) on the path P(λ; ϕ, ψ) as defined in (55), the generalized risk functionals (45) of the full-ensemble estimator are asymptotically equivalent: Rp(λ1; ϕ, ψ1) Qp(λ1; ϕ, ψ1) = Qp(λ2; ϕ, ψ2) Rp(λ2; ϕ, ψ2). (49) Part (3) Risk monotonicity. Note that from Lemma F.10 (3) and Lemma F.11 (3) Du et al. (2023), vp(λ, ψ) 2 ϕ R r2(1+ vp(λ, ψ)r) 2 d Hp(r) is non-negative. Then we have evp(λ, ϕ, ψ, A A) = tr[A AΣ(vp(λ, ψ)Σ + I) 2] ϕ 1vp(λ, ψ) 2 Z r2 (1 + vp(λ, ψ)r)2 d Hp(r) is increasing in ϕ (0, ψ] for any fixed ψ. Thus, Qp(λ, ϕ, ψ) is increasing in ϕ for any fixed (λ, ψ). Furthermore, since Qp(λ, ϕ, ψ) is a continuous function of ϕ and v(λ; ψ), it follows that for 0 < ϕ1 ϕ2 < , min ψ ϕ1 Qp(λ, ϕ1, ψ) min ψ ϕ2 Qp(λ, ϕ1, ψ) min ψ ϕ2 Qp(λ, ϕ2, ψ), where the first inequality follows because {ψ : ψ ϕ1} {ψ : ψ ϕ2}, and the second inequality follows because Qp(λ, ϕ, ψ) is increasing in ϕ for a fixed ψ. Thus, minψ ϕ Qp(λ, ϕ, ψ) is a continuous and monotonically increasing function in ϕ. Part (4) Optimal subsampling and regularization. From (49), we have that for any λ λmin(ϕ), there exists ψ 1 such that |Qp(λmin(ϕ); ϕ, ψ) Rp(λ, ϕ, ϕ)| a.s. 0. From Lemma F.1 and Lemma F.3, 1/vp( λ; ψ) [ rmin, ] and limψ ϕ Qp(λmin(ϕ); ϕ, ψ) = + . Similarly to the proof of Lemma 28 from Bellec et al. 
(2023), one can show that the sequence of functions {Qp(λmin(ϕ); ϕ, ψ(λ)) Rp(λ, ϕ, ϕ)}p N is uniformly equicontinuous on λ Λ = [λmin(ϕ)+ϵ, ] almost surely for some small ϵ > 0 such that |Qp(λmin(ϕ); ϕ, ψ(λ))| is not greater than the null risk Qp(+ ; ϕ, ϕ) when λ [λmin(ϕ) + ϵ, + ]. From Theorem 21.8 of Davidson (1994), it further follows that the sequences converge to zero uniformly over Λ almost surely. This implies that 0 = lim sup p max λ λmin(ϕ)+ϵ [Qp(λmin(ϕ); ϕ, ψ(λ)) Rp(λ, ϕ, ϕ)] min ψ ϕ lim sup p max λ λmin(ϕ)+ϵ [Qp(λmin(ϕ); ϕ, ψ) Rp(λ, ϕ, ϕ)] lim sup p min ψ ϕ max λ λmin(ϕ)+ϵ [Qp(λmin(ϕ); ϕ, ψ) Rp(λ, ϕ, ϕ)] = lim sup p min ψ ϕQp(λmin(ϕ); ϕ, ψ) min λ λmin(ϕ)+ϵ Rp(λ, ϕ, ϕ) . Conversely, since for any ψ ψ(λmin(ϕ) + ϵ), there exists λ λmin(ϕ) + ϵ such that |Qp(λmin(ϕ); ϕ, ψ) Rp(λ(ψ); ϕ, ϕ)| a.s. 0. Similarly, we can show that {Qp(λmin(ϕ); ϕ, ψ) Rp(λ(ψ); ϕ, ϕ)}p N is uniformly equicontinuous on ψ Ψ = [ψ(λmin(ϕ) + ϵ), ] almost surely. This also implies that 0 = lim inf p min ψ ψ(λmin(ϕ)+ϵ)[Qp(λmin(ϕ); ϕ, ψ) Rp(λ(ψ); ϕ, ϕ)] max λ λmin(ϕ) lim inf p min ψ ψ(λmin(ϕ)+ϵ)[Qp(λmin(ϕ); ϕ, ψ) Rp(λ, ϕ, ϕ)] lim inf p max λ λmin(ϕ) min ψ ψ(λmin(ϕ)+ϵ)[Qp(λmin(ϕ); ϕ, ψ) Rp(λ, ϕ, ϕ)] = lim inf p min ψ ψ(λmin(ϕ)+ϵ)Qp(λmin(ϕ); ϕ, ψ) min λ λmin(ϕ) Rp(λ, ϕ, ϕ) . Combining the previous two inequalities implies that min λ λmin(ϕ)+ϵ Rp(λ, ϕ, ϕ) min ψ ψ(λmin(ϕ)+ϵ)Qp(λmin(ϕ); ϕ, ψ). Optimal Ridge Regularization for Out-of-Distribution Prediction min ψ ψ(λmin(ϕ)+ϵ)Qp(λmin(ϕ); ϕ, ψ) = min ψ ψ(λmin(ϕ))Qp(λmin(ϕ); ϕ, ψ) = min ψ ϕQp(λmin(ϕ); ϕ, ψ), we further have min λ λmin(ϕ)+ϵ Rp(λ, ϕ, ϕ) min ψ ϕQp(λmin(ϕ); ϕ, ψ). From the second part, we know that the latter is continuous and monotonically increasing in ϕ, which finishes the proof. Lemma E.2 (Out-of-distribution full-ensemble risk asymptotics). Under Assumption 2.2, as k, n, p such that p/n ϕ (0, ) and p/k ψ [ϕ, ], for λ λmin(ϕ), it holds that R(bβλ k, ) Rp(λ, ϕ, ψ) := Qp(λ, ϕ, ψ) 2β (vp(λ, ϕ)Σ + I) 1Σ0(β β0) + [(β0 β) Σ0(β0 β) + σ2 0]. (50) Qp(λ, ϕ, ψ) := ecp(λ, ϕ, ψ, Σ0) + f NL 2 L2evp(λ, ϕ, ψ, Σ0), where the non-negative constants ecp(λ, ϕ, ψ, Σ0) and evp(λ, ϕ, ψ, Σ0) are defined through the following equations: 1 vp(λ, ψ) = λ + ψ Z r 1 + vp(λ, ψ)r d Hp(r), evp(λ, ϕ, ψ, Σ0) = ϕ tr[Σ0Σ(vp(λ, ψ)Σ + I) 2] vp(λ, ψ) 2 ϕ Z r2 (1 + vp(λ, ψ)r)2 d Hp(r) , ecp(λ, ϕ, ψ, Σ0) = β (vp(λ, ψ)Σ + I) 1(evp(λ, ϕ, ψ, Σ0)Σ + Σ0)(vp(λ, ψ)Σ + I) 1β. Proof. Similar to the proof of Proposition 2.4, we have the decomposition R(bβλ k, ) = (bβλ k, β0) Σ0(bβλ k, β0) + σ2 0 = (bβλ k, β) Σ0(bβλ k, β) + 2(bβλ k, β) Σ0(β β0) + [(β0 β) Σ0(β0 β) + σ2 0] Qp(λ, ϕ, ψ) + 2EI Ik[(bβ(I) β) Σ0(β β0) | Dn] + [(β0 β) Σ0(β0 β) + σ2 0] Qp(λ, ϕ, ψ) 2β (vp(λ, ψ)Σ + I) 1Σ0(β β0) + [(β0 β) Σ0(β0 β) + σ2 0], where the first asymptotic equivalent is from Lemma E.1 and the second is from Patil & Du (2023, Lemma D.1 (1)). This finishes the proof. F. Technical Lemmas F.1. Fixed-Point Equations for Minimum Ridge Penalty Recall that under Assumption 2.2, the minimum ridge penalty λmin = λmin(ϕ) can be determined by the following equations: 1 = ϕ Z v0r 1 + v0r 2 d P(r), rmin < v 1 0 0, 1 v0 = λmin + ϕ Z r 1 + v0r d P(r), or equivalently 1 = ϕ Z r µ0 + r 2 d P(r), rmin < µ0 0, Optimal Ridge Regularization for Out-of-Distribution Prediction µ0 = λmin + ϕ Z µ0r µ0 + r d P(r). We next analyze the properties of µ0 in ϕ. Lemma F.1 (Continuity properties with the minimum regularization parameter). Let a > 0 and b < be real numbers. Let P be a probability measure supported on [a, b]. 
Consider the function v0( ) : ϕ 7 µ0(ϕ), over (0, ), where a µ0(ϕ) 0 is the unique solution to the following fixed-point equation: 1 = ϕ Z r µ0(ϕ) + r 2 d P(r), (51) Then the following properties hold: (1) The range of µ0(ϕ) is [ a, ). (2) µ0(ϕ) is continuous and strictly increasing over ϕ (0, ). In addition, µ0(ϕ) has a root at ϕ = 1. (3) λmin(ϕ) = µ0(ϕ)(1 ϕ R r/(µ0(ϕ) + r) d P(r)) is non-positive over ϕ (0, ) with zero obtained when ϕ = 1. Furthermore, it is strictly increasing over ϕ (0, 1) and strictly decreasing over ϕ (1, ). Proof. The existence of the solution v0(ϕ) = 1/µ0(ϕ) to the fixed-point equation 1 ϕ = Z v0(ϕ)r 1 + v0(ϕ)r follows from Theorem 3.1 of Le Jeune et al. (2024). Next, we split the proof into different parts. Part (1). Define h(x) = R (xr)2/(1 + xr)2 d P(r). When ϕ < 1, h(v0(ϕ)) = 1/ϕ > 1, which implies that 1/v0(ϕ) [ a, 0). When ϕ = + , 1/v0(ϕ) = a if P(a) > 0. When ϕ 1, h(v0(ϕ)) = 1/ϕ 1, which implies that v0(ϕ) (0, ], with infinity obtained when ϕ = 1, or equivalently, 1/v0(ϕ) [0, ). Part (2). Since g(t) = h(t 1) 1 is positive, strictly increasing and continuous over t [ a, ), by the continuous inverse theorem, we have that 1/v0(ϕ) = g 1(ϕ) is strictly increasing and continuous over ϕ (0, ). From Part (1), we also have 1/v0(1) = 0, which finishes the proof. Part (3). Consider the function f(x) = 1 ϕ R r/(x + r) d P(r). When ϕ < 1, we know that f(µ0(ϕ)) > 0 because from (51), 1 = ϕ Z r µ0(ϕ) + r 2 d P(r) > ϕ Z r µ0(ϕ) + r d P(r), which holds because µ0(ϕ) < 0 from Part (2). Analogously, when ϕ > 1, f(µ0(ϕ)) < 0. Therefore, λmin(ϕ) = µ0(ϕ)f(µ0(ϕ)) 0 with equality obtained when ϕ = 1. Because f(x) is strictly decreasing in both ϕ and x, combining the sign properties, we further find that λmin(ϕ) = µ0(ϕ)f(µ0(ϕ)) is strictly increasing over ϕ (0, 1) and strictly decreasing over ϕ (1, ). F.2. Properties of Fixed-Point Equations under Negative Regularization Our analysis involves v(λ, ϕ) as a unique solution to the fixed-point equation in 1 v(λ, ϕ) = λ + ϕ Z r 1 + v(λ, ϕ)r d Hp(r). (52) Define µ(λ, ϕ) = 1/v(λ, ϕ). Equivalently, we have that µ(λ, ϕ) is a unique solution to the following fixed-point equation: µ(λ, ϕ) = λ + ϕ Z µ(λ, ϕ)r µ(λ, ϕ) + r d Hp(r). (53) The analytic properties of the function λ 7 v(λ, ϕ) on (λmin(ϕ), + ) for ϕ (1, ) are detailed in Lemma F.2. Optimal Ridge Regularization for Out-of-Distribution Prediction Lemma F.2 (Analytic properties in the regularization parameter). Let 0 < a b < be real numbers. Let P be a probability measure supported on [a, b]. Let ϕ > 0 be a real number. For λ > λmin(ϕ), let v(λ, ϕ) > 0 denote the solution to the fixed-point equation µ(λ, ϕ) = λ + ϕ Z µ(λ, ϕ)r r + µ(λ, ϕ) d P(r). Then the following properties hold: (1) (Monotonicity) For ϕ (0, ), the function λ 7 µ(λ, ϕ) is strictly increasing in λ (λmin(ϕ), ). (2) (Range) For ϕ (0, 1], limλ λmin(ϕ) µ(λ, ϕ) = and µ(0, ϕ) = 0. For ϕ (1, ), limλ λmin(ϕ) µ(λ, ϕ) (0, ). For ϕ (0, ), limλ + µ(λ, ϕ) = + . (3) (Differentiability) For ϕ (0, ), the function λ 7 µ(λ, ϕ) is differentiable over Λ. (4) The function λ 7 λ/µp(λ, ϕ) is strictly increasing in λ with limλ 0 λ/µp(λ, ϕ) = 0 and limλ 0 λ/µp(λ, ϕ) = 1. Proof. Note that λ = µ(λ, ϕ) ϕ Z µ(λ, ϕ)r r + µ(λ, ϕ) d P(r). Define a function f by f(x) = x ϕ Z xr x + r d P(r). Observe that µ(λ, ϕ) = f 1(λ). The claim of differentiability of the function λ 7 µ(λ, ϕ) follows from the differentiability and strict monotonicity of f, similar to Patil et al. (2022, Lemma S.6.14). 
For the last property, from the definition of the fixed-point equation, we have 1 = λ µp(λ, ϕ) + ϕ Z r r + µ(λ, ϕ) d P(r). Because ϕ R r r+µ(λ,ϕ) d P(r) is strictly decreasing in λ, we have that λ/µp(λ, ϕ) is strictly increasing in λ. Because limλ + µp(λ, ϕ) = + , we know that lim λ ϕ Z r r + µ(λ, ϕ) d P(r) = 0. lim λ λ µp(λ, ϕ) = 1. On the other hand, because limλ 0 µp(λ, ϕ) = 0 for ϕ (0, 1] and limλ 0 µp(λ, ϕ) < , it directly implies that lim λ 0 λ µp(λ, ϕ) = 0. Consequently, the conclusion follows. Lemma F.3 (Continuity properties in the aspect ratio for ridge regression, adapted from Patil et al. (2023)). Let a > 0, b < and λ R be real numbers. Let P be a probability measure supported on [a, b]. For λ R, let Φ(λ) = {ϕ (0, ) | λmin(ϕ) < λ}. Consider the function µ(λ, ) : ϕ 7 µ(λ, ϕ), over ϕ Φ(λ) such that λmin(ϕ) λ, where µ(λ, ϕ) a is the unique solution to the following fixed-point equation: µ(λ, ϕ) = λ + ϕ Z µ(λ, ϕ)r µ(λ, ϕ) + r d P(r). (54) Then the following properties hold: (1) The range of the function µ( λ; ) is a subset of [λ, ] when λ 0 and [ a, ] when λ < 0. (2) The function µ( λ; ) is continuous and strictly increasing over Φ(λ). Furthermore, limϕ µ(λ, ϕ) = + , limϕ 0+ µ(λ, ϕ) = λ when λ 0, and limϕ 0+ µ(λ, ϕ) = when λ < 0. Optimal Ridge Regularization for Out-of-Distribution Prediction (3) The function evv( λ; ) : ϕ 7 evv(λ, ϕ), where evv(λ, ϕ) = µ(λ, ϕ)2 ϕ Z (µ(λ, ϕ)r)2(µ(λ, ϕ) + r) 2 d P(r) 1 , is positive and continuous over Φ(λ). (4) The function evb( λ; ) : ϕ 7 evb(λ, ϕ), where evb(λ, ϕ) = evv(λ, ϕ) Z ϕ(µ(λ, ϕ)r)2(µ(λ, ϕ) + r) 2 d P(r), is positive and continuous over Φ(λ). Proof. Properties (1)-(4) follow similarly to Patil et al. (2022, Lemma S.6.15). F.3. Contours of Fixed-Point Solutions under Negative Regularization Lemma F.4 (Contours of fixed-point solutions). As n, p such that p/n ϕ (0, ), let λmin : ϕ 7 λmin(ϕ) as defined in (5). For any ψ [ϕ, + ], there exists a unique value λ λmin(ψ) (or conversely for λ [λmin(ϕ), ], there exists a unique value ψ [ϕ, ]) such that for all (λ, ψ) on the path (as in Figure F9) P = {(1 θ) (λ, ϕ) + θ (λmin(ψ), ψ) | θ [0, 1]}, (55) it holds that µ(λ, ψ) = µ(λ, ϕ) = µ(λmin(ψ), ψ), where µ(λ, ψ) is as defined in (53). 0.1 0.3 1.0 3 10 Subsample aspect ratio Ridge penalty Figure F9: Heatmap of µp(λ, ψ) for isotropic covariance matrix Σ = I in the symmetric log-log scale (where the logarithmic scale is applied symmetrically to both positive and negative values on the x-axis and y-axis). The 5 black dashed lines indicate 5 different equivalence paths. The boundary of negative ridge penalties is given by (1 ψ)2. Proof. Note that µ(λmin(ϕ), ϕ) = µ0(ϕ) which is the solution to the fixed-point equation (51). From Lemma F.1 (2), we see that the function ψ 7 µ(λmin(ψ), ψ) is strictly increasing over ψ [ϕ, ] with the range lim ψ ϕ µ(λmin(ψ), ψ) = µ0(ϕ), lim ψ + µ(λmin(ψ), ψ) = + . From Lemma F.3, the function λ 7 µ(λ, ϕ) is strictly increasing over λ [λmin, ] with range lim λ λmin µ(λ, ϕ) = µ0(ϕ), lim λ + µ(λ, ϕ) = + . Optimal Ridge Regularization for Out-of-Distribution Prediction For ψ [ϕ, ], by the intermediate value theorem, there exists a unique λ [λmin(ϕ), ] such that v(λ, ϕ) = v( λmin(ψ); ψ). Conversely, for λ [λmin(ϕ), ], there also exists a unique ψ [ϕ, ] such that v(λ, ϕ) = v( λmin(ψ); ψ). Based on the definition of fixed-point solutions, it follows that µ(λ, ϕ) = λ + ϕ Z µ(λ, ϕ)r µ(λ, ϕ) + r d Hp(r) = λmin(ϕ) + ψ Z µ( λmin; ψ)r µ( λmin; ψ) + r d Hp(r) = µ( λmin; ψ). 
G. Experimental Details and Additional Numerical Illustrations

The source code for reproducing the results of this paper can be found at the following location: https://github.com/jaydu1/ood-ridge.

G.1. Additional Illustrations for Section 3

G.1.1. NUMERICAL VERIFICATION OF (8) UNDER GENERIC ALIGNMENT SCENARIOS

Under Assumption 2.2, suppose the data model has covariance matrix Σ = Σar1 with (Σar1)ij := ρar1^|i−j| for a parameter ρar1 ∈ (0, 1), and coefficient β := (1/2)(w(1) + w(p)), where w(j) is the jth eigenvector of Σar1. Then,

βᵀΣβ / p = (1/(4p)) (w(1) + w(p))ᵀ Σ (w(1) + w(p)) = (1/(4p)) w(1)ᵀ Σ w(1) + (1/(4p)) w(p)ᵀ Σ w(p) = (1/(4p)) (r(1) + r(p)),

where r(j) denotes the jth eigenvalue of Σar1. On the other hand, we also have

tr[Σ] tr[B] / p² = p ‖β‖²₂ / p² = 1/(2p).

One can numerically verify that when p = 500 and ρar1 = 0.5, r(1) + r(p) ≈ 3.33 > 2, which contradicts the implication (32) of the strict alignment condition.

Figure G10: Numerical evaluation of b2/b3 − s2/s3 under the same data model as in Figure 1, where sk = tr[Σ{µᵏ(Σ + µI)⁻ᵏ}] and bk = tr[Σβ{µᵏ(Σ + µI)⁻ᵏ}] for k = 2, 3.

On the other hand, from Figure G10, we see that the general alignment condition

tr[Σβ{µ²(Σ + µI)⁻²}] / tr[Σβ{µ³(Σ + µI)⁻³}] = tr[Σ{µ²(Σ + µI)⁻²}] / tr[Σ{µ³(Σ + µI)⁻³}]

holds for various data aspect ratios in the noiseless setting.

G.1.2. REAL DATA ILLUSTRATION

Following the approach of Kobak et al. (2020), we consider a similar setup using random Fourier features on MNIST.

Feature generation. The pixel values are normalized to the range [−1, 1]. The features are then mapped from the original dimension of 28 × 28 to 1000 random Fourier features. This is achieved by multiplying the features with a 784 × 500 random matrix W, whose elements are independently drawn from a normal distribution with mean 0 and standard deviation 0.2. The real and imaginary parts of exp(−i XW) are taken as separate features, where X ∈ ℝ^(n×784). A minimal code sketch of this feature map appears at the end of this subsection.

Training details. For training, we randomly select n = 64 images, and for testing, we hold out 10,000 images. The response variable y represents the digit value, ranging from 0 to 9. Our model includes an intercept term, which is not subject to penalization. To estimate the expected out-of-distribution risk, we average the risks across 100 random draws of the training set.

Distribution shift. To generate distribution shift, we gradually exclude samples with the given labels from the test set:

- Type 1: none
- Type 2: {4}
- Type 3: {3, 4}
- Type 4: {2, 3, 4}
- Type 5: {1, 2, 3, 4}

In other words, Type 1 represents no covariate shift, while Type 5 represents the case with potentially the most severe covariate shift. The results are summarized in Figure G11. We observe a clear pattern where the optimal ridge penalty shifts toward negative values, suggesting that the behavior characterized in Theorem 3.3 may occur in real-world datasets. However, for Type 5, the optimal ridge penalty becomes positive again. This could be due to the removal of Class 1, which reduces the degree of alignment. Also, observe that the minimum risks increase from Type 1 to Type 5.

Figure G11: Effect of distribution shift on the optimal ridge penalty on MNIST. (a) The left panel illustrates the risk profile (against the regularization penalty) of ridge regression on the MNIST dataset when subjected to different types of distribution shifts. Different colors represent different types of shift, from less severe (Type 1) to more severe (Type 5). The y-axis represents the out-of-distribution prediction risk for the task of accurately predicting the digit value for unseen images. The figure shows a clear pattern where the optimal ridge penalty shifts towards negative values, in the spirit of Theorem 3.4. The only exception appears to be Type 5, for which the optimal ridge penalty becomes positive again. This is likely due to the removal of Class 1, which reduces the degree of alignment. (b) The right panel shows the relative OOD prediction risk computed only on the excluded class, compared to the IND prediction risk of the ridge predictor fitted only on the training data of the same class. We compare relative prediction risks to compensate for the differences in the conditional variances across the classes. Observe that Class 1 (Type 5) has the lowest relative prediction risk, which partially explains the increase in the optimal regularization level for Type 5 in the left panel.
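To make the pipeline above concrete, here is a minimal, self-contained sketch (ours, not the released code at the repository above): it builds the random Fourier feature map from the feature-generation paragraph and fits ridge with an unpenalized intercept for penalties that may be negative, using the pseudoinverse as in (2). Synthetic uniform "pixel" data stand in for MNIST so the snippet runs without any download; only the dimensions (784 inputs, 500 random frequencies, 1000 real features, n = 64) follow the text, and the function name ridge_fit is illustrative.

```python
# Illustrative sketch (not the paper's code): random Fourier features and a
# ridge fit with an unpenalized intercept that allows negative penalties.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 64, 784, 500                          # train size, pixel dim, random frequencies

X_pix = rng.uniform(-1.0, 1.0, size=(n, d))     # stand-in for normalized MNIST pixels
y = rng.integers(0, 10, size=n).astype(float)   # stand-in for digit labels 0-9

W = rng.normal(0.0, 0.2, size=(d, k))           # random frequency matrix
Z = np.exp(-1j * (X_pix @ W))
X = np.hstack([Z.real, Z.imag])                 # n x 1000 random Fourier features

def ridge_fit(X, y, lam):
    """Ridge with an unpenalized intercept; lam may be negative (pseudoinverse as in (2))."""
    x_bar, y_bar = X.mean(axis=0), y.mean()
    Xc, yc = X - x_bar, y - y_bar
    q = X.shape[1]
    beta = np.linalg.pinv(Xc.T @ Xc / len(y) + lam * np.eye(q)) @ (Xc.T @ yc / len(y))
    return beta, y_bar - x_bar @ beta

for lam in [-20.0, -5.0, 0.0, 5.0, 20.0]:
    beta, b0 = ridge_fit(X, y, lam)
    print(lam, np.mean((X @ beta + b0 - y) ** 2))   # in-sample error, only to confirm the fit runs
```

Replacing the synthetic arrays with actual MNIST pixels and held-out test digits, and averaging over repeated training draws, recovers the experiment summarized in Figure G11.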
G.2. Additional Illustrations for Section 4

Experiment details. Similar to the setup in Appendix G.1.2, we further vary the training sample size n from 25 to 200 and inspect the OOD prediction risk of the optimal ridge predictor. The results are summarized in Figure G12.

Figure G12: Effect of distribution shift on the risk monotonicity behavior of optimal ridge on MNIST. The figure illustrates the risk profile (against the training sample size) of optimally tuned ridge regression on the MNIST dataset when subjected to different types of distribution shifts. We follow the same setup as for Table 2 (see Appendix G.1.2 for more details), vary the training sample size n from 25 to 200, and inspect the OOD prediction risk of the optimal ridge predictor. Different colors represent different types of shift, from less severe (Type 1) to more severe (Type 5). The y-axis represents the out-of-distribution prediction risk for the task of accurately predicting the digit value for unseen images. The figure shows a clear pattern where optimal ridge exhibits a monotonically decreasing risk in the training sample size n.
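A synthetic analogue of this experiment can be sketched in a few lines. This is our own illustration with Gaussian training data and isotropic test features as a simple covariate shift; it is not the MNIST setup above, and the variable names and penalty grid are arbitrary choices. The idea is the same: tune the ridge penalty over a grid that includes negative values and record the tuned out-of-sample risk as n grows.

```python
# Illustrative sketch (not the paper's code): tuned ridge risk as n grows,
# with the penalty grid allowed to include negative values.
import numpy as np

rng = np.random.default_rng(1)
p, n_test, sigma = 200, 2000, 0.5
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1) train covariance
beta = np.linalg.eigh(Sigma)[1][:, -1]        # leading eigenvector as the signal
L = np.linalg.cholesky(Sigma)

def ridge(X, y, lam):
    q = X.shape[1]
    return np.linalg.pinv(X.T @ X / len(y) + lam * np.eye(q)) @ (X.T @ y / len(y))

lam_grid = np.concatenate([[-0.5, -0.1, -0.05, -0.01, 0.0], np.geomspace(1e-2, 10.0, 12)])
X_test = rng.standard_normal((n_test, p))     # isotropic features at test time (covariate shift)
y_test = X_test @ beta + sigma * rng.standard_normal(n_test)

for n in [50, 100, 200, 400]:
    X = rng.standard_normal((n, p)) @ L.T
    y = X @ beta + sigma * rng.standard_normal(n)
    risks = [np.mean((y_test - X_test @ ridge(X, y, lam)) ** 2) for lam in lam_grid]
    print(n, f"best lambda {lam_grid[int(np.argmin(risks))]:+.3f}", f"tuned risk {min(risks):.3f}")
```

A single draw per n is noisy; averaging the tuned risk over repeated draws mirrors the monotone decrease displayed in Figure G12.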
G.3. Additional Illustrations for Section 5 (Ridge versus Lasso Monotonicities)

G.3.1. SUBSAMPLED RIDGELESS VERSUS FULL RIDGE

Figure G13: Ridge optimal risk monotonicity and suboptimal risk non-monotonicity. Optimal ridge has a monotonic risk in the data aspect ratio, but the risk at a fixed ridge penalty λ may not be monotonic in the data aspect ratio. The data model has Σ = Σar1 with ρar1 = 0.25 (the covariance matrix of an autoregressive process of order 1 (AR(1)) is given by Σar1, where (Σar1)ij = ρar1^|i−j| for some parameter ρar1 ∈ (0, 1)), β being the leading eigenvector of Σ, and σ = 0.5. The leftmost panel shows the limiting risk of the full-ensemble ridgeless regression at various data and subsample aspect ratios (ϕ, ψ). The middle panel shows the limiting risk of the ridge predictor (on the full data) at various data aspect ratios and regularization penalties (ϕ, λ). The rightmost panel traces the risk of optimally tuned ridge against the data aspect ratio. We highlight the optimal risks at a given data aspect ratio in the leftmost and middle panels using slender red lines. Observe that the optimal risk in both cases is increasing as a function of ϕ.

A minimal Monte Carlo sketch of this ridge data model is provided at the end of this appendix.

G.3.2. SUBSAMPLED LASSOLESS VERSUS FULL LASSO

Figure G14: Lasso optimal risk monotonicity and suboptimal risk non-monotonicity. Similar to ridge regression, optimal lasso has a monotonic risk in the overparameterization ratio, but the risk at a fixed lasso penalty λ may not be monotonic in the data aspect ratio. The data model has Σ = I and βj drawn i.i.d. from ϵ P(1/√(ϕϵ)) + (1 − ϵ) P0, where the sparsity level is ϵ = 0.01 and σ² = 1, such that SNR = 1. The panels are analogous to those of Figure G13, with the lasso penalty in place of the ridge penalty; the rightmost panel traces the risk of optimally tuned lasso against the data aspect ratio. As for ridge regression in Figure G13, the optimal risks for each data aspect ratio are highlighted using slender red lines in the left and middle panels. Similarly to the ridge curves in Figure G13, observe that the optimal risk in both cases is increasing as a function of ϕ.
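For completeness, the following Monte Carlo sketch (ours; the curves in Figure G13 are instead limiting quantities) mimics the ridge data model of Appendix G.3.1: Σ = Σar1 with ρar1 = 0.25, β the leading eigenvector of Σ, and σ = 0.5. It contrasts the empirical risk at a fixed ridge penalty with the risk of the penalty tuned over a grid that includes negative values, as the data aspect ratio ϕ = p/n varies; all names and grids are illustrative, and a single draw per n is noisy, so averaging over repetitions is needed to approximate the limiting behavior.

```python
# Monte Carlo sketch of the ridge setup in Figure G13 (illustrative only):
# fixed-penalty risk versus grid-tuned risk as the data aspect ratio varies.
import numpy as np

rng = np.random.default_rng(2)
p, n_test, sigma, rho = 100, 4000, 0.5, 0.25
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1) covariance
beta = np.linalg.eigh(Sigma)[1][:, -1]                                # leading eigenvector
L = np.linalg.cholesky(Sigma)

def ridge(X, y, lam):
    q = X.shape[1]
    return np.linalg.pinv(X.T @ X / len(y) + lam * np.eye(q)) @ (X.T @ y / len(y))

X_test = rng.standard_normal((n_test, p)) @ L.T          # in-distribution test data
y_test = X_test @ beta + sigma * rng.standard_normal(n_test)

lam_fixed = 0.1
lam_grid = np.concatenate([[-0.05, -0.01, 0.0], np.geomspace(1e-2, 10.0, 15)])
for n in [400, 200, 100, 50, 25]:                        # phi = p/n from 0.25 up to 4
    X = rng.standard_normal((n, p)) @ L.T
    y = X @ beta + sigma * rng.standard_normal(n)
    risk = lambda lam: np.mean((y_test - X_test @ ridge(X, y, lam)) ** 2)
    print(f"phi={p / n:5.2f}  fixed-lambda risk={risk(lam_fixed):7.3f}  tuned risk={min(risk(l) for l in lam_grid):7.3f}")
```

With enough repetitions, the tuned-risk column tends to increase with ϕ while the fixed-penalty column need not, which is the qualitative message of Figure G13.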