# an_effective_theory_of_bias_amplification__b99ac7ed.pdf Published as a conference paper at ICLR 2025 AN EFFECTIVE THEORY OF BIAS AMPLIFICATION Arjun Subramonian1, , Samuel J. Bell2, Levent Sagun2, Elvis Dohmatob2,3,4 1UCLA 2Meta FAIR 3Concordia University 4Mila Machine learning models can capture and amplify biases present in data, leading to disparate test performance across social groups. To better understand, evaluate, and mitigate these biases, a deeper theoretical understanding of how model design choices and data distribution properties contribute to bias is needed. In this work, we contribute a precise analytical theory in the context of ridge regression, both with and without random projections, where the former models feedforward neural networks in a simplified regime. Our theory offers a unified and rigorous explanation of machine learning bias, providing insights into phenomena such as bias amplification and minority-group bias in various feature and parameter regimes. For example, we observe that there may be an optimal regularization penalty or training time to avoid bias amplification, and there can be differences in test error between groups that are not alleviated with increased parameterization. Importantly, our theoretical predictions align with empirical observations reported in the literature on machine learning bias. We extensively empirically validate our theory on synthetic and semi-synthetic datasets. 1 INTRODUCTION Machine learning (ML) datasets can encode a plethora of biases which, when said data is used to train models, can result in systems that can cause practical harm. Datasets that encode correlations that only hold for a subset of the data may cause disparate performance when models are used more broadly, such as an X-ray pneumonia classifier that only functions on images from certain hospitals (Zech et al., 2018). This issue is magnified when coupled with under-representation, whereby a dataset fails to adequately reflect parts of the underlying data distribution, often further marginalizing certain groups. Lack of representation results in systems that might work well on average, but fail for minoritized groups, including facial recognition systems that fail for darker-skinned women (Buolamwini & Gebru, 2018), large language models that consistently misgender transgender and nonbinary people (Ovalle et al., 2023), or image classification technology that only works in Western contexts (de Vries et al., 2019; Richards et al., 2024). Unfortunately, contemporary models may exhibit bias amplification, whereby dataset biases are not only replicated, but exacerbated (Zhao et al., 2017; Hendricks et al., 2018; Wang & Russakovsky, 2021). While previous research has shown that amplification is a function of both dataset properties and how we choose to construct our models (Hall et al., 2022; Sagawa et al., 2020; Bell & Sagun, 2023), it is not fully clear how bias amplification occurs mechanistically, nor do we precisely understand which settings lead to its emergence. Thus, in this work, we propose a novel theoretical framework that explains how model design choices (e.g., number of parameters, regularization penalty) and data distributional properties (e.g., number of features, group imbalance, label noises) interact to amplify bias. Moreover, our framework provides an account of prior work on bias amplification (Bell & Sagun, 2023) and minority-group bias (Sagawa et al., 2020). A theory of bias amplification is important for several reasons. 
First, as empirical research necessarily yields only sparse data points often focused on only the most common regimes theory allows us to interpolate between past findings, and reason about how bias emerges in under-explored settings. Second, a precise theory gives us the depth of understanding needed in order to intervene, potentially supporting the development of both novel evaluations and mitigations. Finally, beyond explaining Work done while interning at Meta. Corresponding author: Arjun Subramonian (arjunsub@cs.ucla.edu). Published as a conference paper at ICLR 2025 already-known phenomena, our theory makes new predictions, suggesting new avenues for future research. 1.1 MAIN CONTRIBUTIONS In this work, we develop a unifying and rigorous theory of ML bias in the settings of ridge regression with and without random projections. In particular, we precisely analyze test error disparities between groups (e.g., demographic groups or protected categories) with different data distributions when training on a mixture of data from these groups. We characterize these disparities in high dimensions using operator-valued free probability theory (OVFPT), thereby avoiding possibly loose bounds on critical quantities. Our theory encompasses different parameterization regimes, group sizes, label noises, and data covariance structures. Moreover, our theory has applications to important problems in ML bias that have recently been empirically investigated: Bias amplification. Even in the absence of group imbalance and spurious correlations, a single model that is trained on a combination of data from different groups can amplify bias beyond separate models that are trained on data from each group (Bell & Sagun, 2023). With our theory, we reproduce and analyze the bias amplification findings of Bell & Sagun (2023) in controlled settings. We further observe how stopping model training early or tuning the regularization hyperparameter can alleviate bias amplification. Minority-group bias. Overparamaterization can hurt test performance on minority groups due to spurious features (Sagawa et al., 2020; Khani & Liang, 2021). We theoretically analyze how model size and extraneous features affect minority-group bias. We extensively empirically validate our theory in controlled and semi-synthetic settings. Specifically, we show that our theory aligns with practice in the cases of: (1) bias amplification with synthetic data generated from isotropic covariance matrices and the semi-synthetic dataset Colored MNIST (Arjovsky et al., 2019), and (2) minority-group bias under different model sizes with synthetic data generated from diatomic covariance matrices. In these applications, we expose new, interesting phenomena in various regimes. For example, a larger number of features than samples can amplify bias under overparameterization, there may be an optimal regularization penalty or training time to avoid bias amplification, and there can be differences in test error between groups that are not alleviated with increased parameterization. Our observations of phenomena in Sections 4 and 5 are largely empirical but are supported by their agreement with our theory. Our theory of ML bias can inform strategies to evaluate and mitigate unfairness in ML, or be used to caution against the usage of ML in certain applications. 1.2 RELATED WORK Bias amplification. A long line of research has explored how ML exacerbates biases in data. 
For example, a single model that is trained on a combination of data from different groups can amplify bias (Zhao et al., 2017; Wang & Russakovsky, 2021), even beyond what would be expected when separate models are trained on data from each group (Bell & Sagun, 2023). Hall et al. (2022) conduct a systematic empirical study of bias amplification in the context of image classification, finding that amplification can vary greatly as a function of model size, training set size, and training time. Furthermore, overparameterization, despite reducing a model s overall test error, can disproportionately hurt test performance for minority groups (Sagawa et al., 2020; Khani & Liang, 2021). Models can also overestimate the importance of poorly-predictive, low-signal features for minority groups, thereby hurting performance on these groups (Leino et al., 2019). In this paper, we distill a holistic theory of how model design choices and data distributional properties affect disparate test performance across groups, which can encompass seemingly disparate bias phenomena. High-dimensional analysis of ML. A suite of works have analyzed the expected dynamics of ML in appropriate asymptotic scaling limits, e.g., the rate of features d to samples n converges to a finite values as d and n respectively scale towards infinity (Adlam & Pennington, 2020b; Tripuraneni et al., 2021; Lee et al., 2023). Notably, Bach (2023) theoretically analyzes the double descent phenomenon (Spigler et al., 2019; Belkin et al., 2019) in ridge regression with random projections by computing deterministic equivalents for relevant random matrix quantities in a proportionate scaling limit. Like Adlam & Pennington (2020b); Tripuraneni et al. (2021); Lee et al. (2023), we leverage the tools of Published as a conference paper at ICLR 2025 OVFPT (Mingo & Speicher, 2017), which is at the intersection of random matrix theory (RMT) and functional analysis. However, Adlam & Pennington (2020b) focus on training and testing a random features model on data from the same Gaussian distribution. Furthermore, Tripuraneni et al. (2021); Lee et al. (2023) focus on training a random features model on data from one Gaussian distribution and testing the model on a different Gaussian. In contrast, we study the random features model in the setting of training on a mixture of Gaussian distributions and testing on each component. Because a mixture is more expressive than a single Gaussian, our theoretical results cannot be derived as a special case of these other works. Furthermore, our theory non-trivially generalizes Bach (2023), which we recover in Corollary J.1 as a special case, and requires more powerful analytical techniques. Certain prior theoretical work precisely analyzes the bias of models trained on a mixture of data from different groups in a high-dimensional setting (Mannelli et al., 2024; Jain et al., 2024). Like Mannelli et al. (2024); Jain et al. (2024), we study linear models and consider bias as the disparity in test performance of a model between groups. We further consider some similar factors that give rise to bias amplification (e.g., group imbalance, group data variance, inter-group similarity, dataset size). We also share some theoretical conclusions, such as bias can occur even when the groups have the same ground-truth weights (see Section 5) and are balanced (Section 4.1). Additionally, we both discuss the paradigms of training a single model for both groups vs. separate models for each group. 
However, the main distinction between our work and Mannelli et al. (2024); Jain et al. (2024) is that we precisely characterize how models amplify bias in different parameterization regimes, that is, we examine the impact of model size on bias. This enables us to expose new, richer insights into the impact of overand underparameterization on bias amplification (see Figure 1, Section 4, and Section 5). See Appendix C for further comparison of our work to Mannelli et al. (2024); Jain et al. (2024). 2 PRELIMINARIES 2.1 DATA DISTRIBUTIONS We consider a ridge regression problem on a dataset from the following multivariate Gaussian mixture with two groups s = 1 and s = 2. These groups could represent different demographic groups or protected categories. (Group ID) Law(s) = Bernoulli(p), (1) (Features) Law(x | s) = N(0, Σs), (2) (Ground-truth weights) Law(w 1) = N(0, Θ/d), Law(w 2 w 1) = N(0, /d), (3) (Labels) Law(y | s, x) = N(f s (x), σ2 s), with f s (x) := x w s. (4) The scalar p (0, 1) controls for the relative size of the two groups (e.g., p = 1/2 in the balanced setting). For simplicity of notation, we define p1 = p and p2 = 1 p. The d d positive-definite matrices Σ1 and Σ2 are the covariance matrices for the different groups. The d-dimensional vectors w 1 and w 2 are the ground-truth weights vectors for each group. w 1 and w 2 w 1 are independently sampled from zero-mean Gaussian distributions with covariances Θ/d and /d, respectively. In particular, setting = 0 corresponds to the case that both groups have identical ground-truth weights. We define Θ1 = Θ, Θ2 = Θ + . Finally, σ2 s corresponds to the label noise for each group s. While we consider the case of two groups only for conciseness, our theoretical methods readily extend to any finite number of groups. 2.2 MODELS AND METRICS Learning. A learner is given an IID sample D = {(x1, y1), . . . , (xn, yn)} (X Rn d, Y Rn) of data from the above distribution and it learns a model for predicting the label y from the feature vector x. Thus, X is the total design matrix with ith row xi, and y the total response vector with ith component yi. Let Ds = (X Rns d, Y Rns) be the data pertaining only to group s, so that D = D1 D2 is a partitioning of the entire dataset. Two choices are available to the learner: (1) learn a model a bfs F on each dataset Ds, or (2) learn a single model bf F on the entire dataset D. In practice, a choice is made based on scaling vs. personalization considerations. We consider two solvable settings for linear models: classical ridge regression in the ambient input space, and ridge regression in a feature space given by random projections. The latter allows us to Published as a conference paper at ICLR 2025 study the role of model size in ML bias, by varying the output dimension of the random projection mapping. This output dimension m controls the size of a feedforward neural network in a simplified regime (Maloney et al., 2022; Bach, 2023). Classical Ridge Regression. We will first consider the function class F {Rd R} of linear ridge regression models without random projections. For any vector w Rd, the model f with parameters w is defined by f(x) = x w, for all x Rd, and is learned with ℓ2-regularization. We define the generalization error or risk of any model f with respect to group s as: Rs(f) = E [(f(x) f s (x))2 | s]. 
(5) We consider ridge regression because in addition to its analytical tractability, it can be viewed as the asymptotic limit of many learning problems (Dobriban & Wager, 2018; Richards et al., 2021; Hastie et al., 2022). We now formally define some metrics related to bias amplification. Definition 2.1 (Bias Amplification). We isolate the contribution of the model to bias when learning from data with different groups. This intuitive conceptualization of bias amplification allows us to quantify the phenomenon. Grounded in the literature (Bell & Sagun, 2023), we define the Expected Difficulty Disparity (EDD) as: EDD = |E R2( bf2) E R1( bf1)|, (6) where the expectations are w.r.t. randomness in the training data and any other sources of randomness in the models. The EDD captures the difference in test risk between models trained and evaluated on each group separately. In contrast, we define the Observed Difficulty Disparity (ODD) as: ODD = |E R2( bf) E R1( bf)|. (7) The ODD captures the bias (i.e., difference in test risk between groups) of a model trained on both groups. Finally, we define the Amplification of Difficulty Disparity (ADD) as ADD = ODD EDD. We say that bias amplification occurs when ADD > 1. See Appendix C for further motivation of ADD. Ridge Regression with Random Projections. We consider feedforward neural networks in a simplified regime which can be approximated via random projections, i.e., a one-hidden-layer neural network f(x) = v Sx with a linear activation function. In particular, we extend classical ridge regression by transforming our learned weights as bw = Sbη Rd, where S Rd m is a random projection with entries that are IID sampled from N(0, 1/d). Ridge regression with random projections offers analytical tractability while exposing bias amplification phenomena related to model size; such phenomena are not exposed by classical ridge regression (see Figure 1). Moreover, it has been shown that in high dimensions, training a one-hidden-layer neural network with gradient descent effectively learns a linear predictor over random features (Yehudai & Shamir, 2019). Furthermore, Adlam & Pennington (2020a); Bach (2023); inter alia are able to reproduce interesting phenomena like double descent using the random features model. Nevertheless, Yehudai & Shamir (2019) have shown that the model often cannot learn even a Re LU neuron, suggesting that some mechanisms of bias amplification could be different in nonlinear networks. 3 THEORETICAL ANALYSIS Assumptions. Some of our theorems will require standard technical assumptions that we detail here and in Appendix B. Assumption 3.1 describes the proportionate scaling limits, standard in RMT, in which we will work. These limits enable us to derive deterministic analytical formulae for the expected test risk of models. Our experiments (see Sections 4 and 5) validate our theory. Assumption 3.1. In the case of ridge regression with random projections, we will work in the following proportionate scaling limit: n, n1, n2, d , n1/n p1, n2/n p2, d/n ϕ, m/n ψ, m/d γ, (8) d/n1 ϕ1, m/n1 ψ1, d/n2 ϕ2, m/n2 ψ2, (9) for some constants ϕ1, ϕ2, ϕ, ψ1, ψ2, ψ (0, ). The scalar ϕ captures the rate of features to samples. Observe that ϕ = p1ϕ1 and ϕ = p2ϕ2. We note that ϕγ = ψ and ϕsγ = ψs. The scalar ψ captures the rate of parameters to samples. The setting ψ > 1 (resp. ψ < 1) corresponds to the overparameterized (resp. underparameterized) regime. 
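To make these quantities concrete before turning to the main results, the following is a minimal Monte Carlo sketch (in NumPy) of the data model of Section 2.1 and the difficulty-disparity metrics of Definition 2.1, instantiated for classical ridge regression with isotropic group covariances. It is illustrative rather than a reproduction of our experimental code: the function names and default constants are placeholders introduced here, and a single training draw is shown, whereas the expectations in Equations (6) and (7) would be approximated by averaging over many seeds.

```python
import numpy as np

def sample_group(n, d, w_star, sigma2, a, rng):
    """Draw n points with x ~ N(0, a*I_d) and y = x @ w_star + N(0, sigma2) noise."""
    X = rng.normal(0.0, np.sqrt(a), size=(n, d))
    y = X @ w_star + rng.normal(0.0, np.sqrt(sigma2), size=n)
    return X, y

def ridge_fit(X, y, lam):
    """Minimize (1/n)||Xw - y||^2 + lam*||w||^2 (classical ridge, no projections)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def group_risk(w_hat, w_star, a):
    """R_s(f) = E[(x @ (w_hat - w_star))^2] for x ~ N(0, a*I_d), cf. Equation (5)."""
    diff = w_hat - w_star
    return a * diff @ diff

def difficulty_disparities(n=2000, d=400, p1=0.5, a=(0.5, 1.0),
                           sigma2=(1.0, 1e-5), lam=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    # Ground-truth weights: w*_1 ~ N(0, Theta/d) with Theta = 2*I_d, and
    # w*_2 - w*_1 ~ N(0, Delta/d) with Delta = I_d (Equation (3)).
    w1 = rng.normal(0.0, np.sqrt(2.0 / d), size=d)
    w2 = w1 + rng.normal(0.0, np.sqrt(1.0 / d), size=d)
    n1 = int(p1 * n); n2 = n - n1
    X1, y1 = sample_group(n1, d, w1, sigma2[0], a[0], rng)
    X2, y2 = sample_group(n2, d, w2, sigma2[1], a[1], rng)
    # Separate model per group (EDD) vs. one model on the pooled data (ODD).
    f1, f2 = ridge_fit(X1, y1, lam), ridge_fit(X2, y2, lam)
    f = ridge_fit(np.vstack([X1, X2]), np.concatenate([y1, y2]), lam)
    EDD = abs(group_risk(f2, w2, a[1]) - group_risk(f1, w1, a[0]))
    ODD = abs(group_risk(f, w2, a[1]) - group_risk(f, w1, a[0]))
    return ODD, EDD, ODD / EDD  # ADD = ODD / EDD; bias is amplified when ADD > 1

if __name__ == "__main__":
    odd, edd, add = difficulty_disparities()
    print(f"ODD={odd:.4f}  EDD={edd:.4f}  ADD={add:.2f}")
```

Averaging the returned values over repeated seeds approximates the expectations in Definition 2.1.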
Published as a conference paper at ICLR 2025 3.1 MAIN RESULT: RIDGE REGRESSION WITH RANDOM PROJECTIONS To provide a mechanistic understanding of how ML models may amplify bias, our theory elucidates differences in the test error between groups when a single model is trained on a combination of data from both groups vs. when separate models are trained on data from each group. We first consider the classical ridge regression model in Appendix D before studying ridge regression with random projections below, which is a more realistic but still analytically solvable setup. Single Random Projections Model Learned for Both Groups. We first consider the ridge regression model bf with random projections, which is learned using empirical risk minimization and ℓ2regularization with penalty λ. The parameter bw of the linear model bf is given by the following optimization problem: bw = Sbη Rd, with bw = arg min η Rm L(η) = s=1 n 1 Xs Sη Ys 2 2 + λ η 2 2. (10) Explicitly, one can write bw = S(Z Z + nλIm) 1Z Y , where Z := XS. Before presenting our result for the random projections model, we provide some relevant definitions. Definition 3.1. Let tr A := (1/d) tr A be the normalized trace operator and (e1, e2, τ, u1, u2, ρ) be the unique positive solution to the following fixed-point equations: 1/τ = 1 + tr LK 1, 1/es = 1 + ψτ tr Σs K 1, for s {1, 2}, (11) ρ = τ 2 tr (γρL2 + λ2D)K 2, us = ψe2 s tr Σs(γτ 2D + ρId)K 2, for s {1, 2}, (12) where: L = p1e1Σ1 + p2e2Σ2, K = γτL + λId, D = p1u1Σ1 + p2u2Σ2 + B. (13) For deterministic d d PSD matrices A and B, we define the following auxiliary quantities: h(1) j (A) := pjγejτ tr AΣj K 1, (14) h(2) j (A, B) := pjγ tr AΣj(γejτ 2B + pj γτ 2Σj (ejuj ej uj) + ejρId λujτId)K 2, (15) h(3) j (A, B) := pj tr AΣj(γe2 jpjΣj(pj γτ 2uj Σj + γτ 2B + ρId) + uj(pj γej τΣj + λId)2)K 2, (16) h(4) j (A, B) := pjγpj tr ΣjΣj A(γτ 2(ejej B pje2 juj Σj pj Σj e2 j uj) λτ(ejuj + ej uj)Id + ejej ρId)K 2. (17) In Appendix H, we intuitively interpret the scalars es, τ, us, ρ in the setting where a separate model is learned for each group. In essence, our theory extends the scalars to the more general setting where a single model is trained on a mixture of data from different groups. Furthermore, each of the terms h(1) j , . . . , h(4) j capture the limiting values of different sources of covariance between the sample covariance matrices for the groups, the resolvent matrix, and the random projections matrix S. These sources of covariance are written explicitly in Appendix G, and naturally arise from expanding the solution to the ridge regression problem with random projections. We now present Theorem 3.1, which is our main contribution. Theorem 3.1 presents a novel biasvariance decomposition for the test error Rs( bf) for each group s {1, 2} in the context of ridge regression with random projections. It is a non-trivial generalization of theories in high-dimensional ML which requires the powerful machinery of OVFPT (see proof in Appendix G). Theorem 3.1. Under Assumptions B.2 and 3.1, it holds that Rs( bf) Bs( bf) + Vs( bf), with j=1 σ2 j ϕh(2) j (Id, Σs), (18) Bs( bf) = tr ΘsΣs + h(3) 1 (Θs, Σs) + h(3) 2 (Θs, Σs) + 2h(4) 1 (Θs, Σs) (19) 2h(1) 1 (ΘsΣs) 2h(1) 2 (ΘsΣs) + h(3) s ( , Σs) (20) ( 0, s = 1, h(3) 1 ( , Σ2) + h(4) 2 ( , Σ2) h(1) 1 ( Σ2), s = 2. (21) Published as a conference paper at ICLR 2025 Figure 1: ODD, EDD, and ADD phase diagrams for ridge regression with random projections. 
We plot the bias amplification phase diagrams with respect to ϕ (rate of features to samples) and ψ (rate of parameters to samples), as predicted by our theory for ridge regression with random projections (Theorems 3.1, 3.2). Red regions indicate theoretical predictions greater than 1 (i.e., bias amplification in the rightmost plot), while blue regions indicate theoretical predictions less than 1 (i.e., bias deamplification in the rightmost plot). Darkness indicates intensity. We consider isotropic covariance matrices: Σ1 = 2Id, Σ2 = Id, Θ = 2Id, = Id. Additionally, n = 1 104, σ2 1 = σ2 2 = 1. We further choose λ = λ1 = λ2 = 1 10 6 to approximate the minimum-norm interpolator. We show that bias amplification can occur even in the balanced data setting, i.e., when p1 = p2 = 1/2. The unregularized limit corresponds to the minimum-norm interpolator, and alternatively may be viewed as training a neural network until convergence (Ali et al., 2019). We discuss methods for, and the complexity of, solving the above fixed-point equations in Appendix I. Furthermore, in Appendix J, we directly express the bias and variance of the test risk of an unregularized model trained on just group s in terms of the second and first-order degrees of freedom of Σs and the parameterization rate ψs. Moreover, in Appendix M, we derive the approximate bias amplification profile of an unregularized model with respect to the ratio c = σ2 2/σ2 1 of label noises, in the setting where the eigenspectra of the covariance matrices have power-law decay. Separate Random Projections Model Learned for Each Group. We now consider the ridge regression models bf1 and bf2 with random projections, which are learned using empirical risk minimization and ℓ2-regularization with penalties λ1 and λ2, respectively. In particular, we have the following optimization problem for each group s: arg minη Rm L(w) = n 1 s Xs Sη Ys 2 2 + λs η 2 2. Alternatively, the reader can think of each bfs as the limit of bf when ps 1. In this setting, we deduce Theorem 3.2, which follows from Theorem 3.1. Theorem 3.2. Under Assumptions B.2 and 3.1, it holds that Rs( bfs) Bs( bfs) + Vs( bfs), where Vs( bfs) = limps 1 Vs( bf) and Bs( bfs) = limps 1 Bs( bf) (see Appendix H for explicit formulae). Phase Diagram. The phase diagram for the random projections model (Figure 1) offers rich insights into how the rate of parameters to samples (ψ), in interaction with the rate of features to samples (ϕ), affects bias amplification. In the ODD and EDD profiles, we observe phase transitions at ϕ = ψ (when ψ < 0.5) and ψ = 0.5 (i.e., ψ1 = ψ2 = 1), where these metrics begin decreasing significantly. ψs = 1 is a known interpolation threshold for random features models (Adlam & Pennington, 2020a; D Ascoli et al., 2020). In contrast, at ψ = 1 and ϕ = 1, the ODD drastically increases. Furthermore, at ϕ = ψ (when ψ < 0.5) and ϕ = 0.5 (for ψ > 0.5), the EDD greatly increases. Accordingly, in the ADD profile, we observe phase transitions at ϕ = ψ (when ψ < 0.5), ψ = 0.5, ψ = 1, and ϕ = 1, where bias amplification begins occurring (i.e., ADD > 1). However, bias seems to be deamplified (i.e., ADD < 1) at ϕ = ψ (when ψ < 0.5) and ϕ = 0.5 (when ψ > 0.5). Some observations are less visible due the granularity of the color thresholding in Figure 1. 4 BIAS AMPLIFICATION We empirically show how ridge regression models with random projections may amplify bias when a single model is trained on a combination of data from different groups vs. 
when separate models are trained on data from each group (Bell & Sagun, 2023). We further show how our theory: (1) predicts bias amplification, and (2) exposes new, interesting bias amplification phenomena in various regimes. Published as a conference paper at ICLR 2025 4.1 ISOTROPIC COVARIANCE Setup. To mirror the setting of Bell & Sagun (2023), we consider balanced data (p1 = p2 = 1/2) without spurious correlations (Σ1 = a1Id, Σ2 = a2Id, for a1, a2 > 0). The groups have different ground-truth weights (Θ = 2Id, = Id). Refer to App. K.1 for full details due to space limitations. Validation of Theory. Figure 2 and the figures in Appendix L reveal that Theorems 3.1 and 3.2 closely predict the ODD, EDD, and ADD of ridge regression models with random projections under diverse settings. Note that, as indicated by the error bars, some of our empirical estimates (especially those with larger magnitude) have higher variance and their variance is influenced by the choice of ψ, ϕ, a1, a2, σ2 1, σ2 2. Notably, our theory predicts the observation of Bell & Sagun (2023) that models can amplify bias even with balanced groups and without spurious correlations. We present new phenomena predicted by our theory below. Effect of Label Noise. In the ODD profile, when the label noise ratio c = σ2 2/σ2 1 is larger, the right tail is higher for ϕ (rate of features to samples) closer to 1 than other ϕ. This suggests that under overparameterization, a larger noise ratio and similar number of features and samples can increase disparities in test risk between groups when a single model is learned for both groups. We aim to explain this phenomenon analytically in Section M. Moreover, the EDD curve is generally higher for larger c, suggesting that a larger noise ratio increases disparities in test risk when a separate model is learned for each group. This finding is supported by our experiment with real data (see Figure 11). Effect of Model Size. We observe interesting divergent behavior as ψ (rate of parameters to samples) increases for different ϕ (rate of features to samples). When ϕ > 1, as ψ increases, the ODD increases and then decreases, peaking at the interpolation threshold at ψ = 1. Similarly, when ϕ > 0.5 (i.e., ϕ1 = ϕ2 > 1), as ψ increases, the EDD increases and then decreases, peaking at the interpolation threshold at ψ = 0.5 (i.e., ψ1 = ψ2 = 1). Accordingly, when ϕ > 0.5, bias is effectively deamplified (ADD < 1) at ψ = 0.5 and when ϕ > 1, bias amplification peaks (ADD > 1) at ψ = 1. In contrast, when ϕ < 1, the ODD decreases as ψ increases, plateauing at different finite values. Similarly, when ϕ < 0.5, the EDD generally decreases and plateaus as ψ increases; in some cases, the EDD dips and/or increases and plateaus. A notable exception to these trends occurs when ϕ 1, with the corresponding ODD and ADD curves increasing as ψ increases, plateauing at a significantly larger value (i.e., ADD 1) than the curves corresponding to other values of ϕ. We observe a similar phenomenon for the EDD curves when ϕ1 = ϕ2 1. Hence, overparameterization can greatly amplify bias when the number of features is close to the number of samples. Regardless of the regime of ϕ, the left tail of the ADD profile appears to plateau at 1. The right tail plateaus at different finite values, with the curves corresponding to ϕ > 1 consistently plateauing above 1. This suggests that when there are more features than samples, overparameterization amplifies bias. 
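The model-size trends discussed above can be probed directly with a small simulation. Below is a minimal sketch (in NumPy) of the kind of ψ-sweep underlying Figure 2 for the isotropic setup of Section 4.1. For each ψ, a random projection S with i.i.d. N(0, 1/d) entries is drawn and the estimator ŵ = S(ZᵀZ + nλI_m)⁻¹Zᵀy with Z = XS from Section 3.1 is computed, once on the pooled data and once per group. Only that closed form is taken from the text; the helper names, the grid of ψ values, and the single-draw estimates are illustrative assumptions (the reported curves average over many runs and include error bars).

```python
import numpy as np

def rp_ridge(X, y, m, lam, rng):
    """Ridge regression on random features Z = X S, with S_ij ~ N(0, 1/d).
    Returns w_hat = S (Z^T Z + n*lam*I_m)^{-1} Z^T y (Section 3.1)."""
    n, d = X.shape
    S = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, m))
    Z = X @ S
    return S @ np.linalg.solve(Z.T @ Z + n * lam * np.eye(m), Z.T @ y)

def group_risk(w_hat, w_star, a):
    diff = w_hat - w_star
    return a * diff @ diff  # test risk under x ~ N(0, a*I_d)

def sweep_model_size(n=800, phi=0.75, a=(0.5, 1.0), sigma2=(1.0, 1.0),
                     lam=1e-3, psis=(0.25, 0.5, 1.0, 2.0, 4.0), seed=0):
    rng = np.random.default_rng(seed)
    d = int(phi * n)                                     # rate of features to samples
    w1 = rng.normal(0.0, np.sqrt(2.0 / d), size=d)       # Theta = 2*I_d
    w2 = w1 + rng.normal(0.0, np.sqrt(1.0 / d), size=d)  # Delta = I_d
    n1 = n2 = n // 2                                     # balanced groups, p1 = p2 = 1/2
    X1 = rng.normal(0.0, np.sqrt(a[0]), (n1, d))
    y1 = X1 @ w1 + rng.normal(0.0, np.sqrt(sigma2[0]), n1)
    X2 = rng.normal(0.0, np.sqrt(a[1]), (n2, d))
    y2 = X2 @ w2 + rng.normal(0.0, np.sqrt(sigma2[1]), n2)
    X, y = np.vstack([X1, X2]), np.concatenate([y1, y2])
    for psi in psis:
        m = max(1, int(psi * n))                         # rate of parameters to samples
        f = rp_ridge(X, y, m, lam, rng)                  # single model for both groups
        f1, f2 = rp_ridge(X1, y1, m, lam, rng), rp_ridge(X2, y2, m, lam, rng)
        odd = abs(group_risk(f, w2, a[1]) - group_risk(f, w1, a[0]))
        edd = abs(group_risk(f2, w2, a[1]) - group_risk(f1, w1, a[0]))
        print(f"psi={psi:4.2f}  ODD={odd:.3f}  EDD={edd:.3f}  ADD={odd / edd:.2f}")

if __name__ == "__main__":
    sweep_model_size()
```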
Some of the peaks and valleys in Figure 2 can be attributed to double descent. However, double descent in high dimensions has primarily been studied in the setting where data are drawn from a single Gaussian distribution; this corresponds to the EDD setting, where a separate model is learned for each group. In Figure 1, we observe a double descent peak in the EDD at ψ₁ = ψ₂ = 1 (Adlam & Pennington, 2020a; D'Ascoli et al., 2020). Our work extends the theoretical treatment of double descent to the setting of training a model on a mixture of Gaussians. However, our theory of bias amplification cannot be reduced to double descent. For example, we note other interpolation thresholds in Figure 1; our use of a linear activation does not have a confounding effect here, as interpolation thresholds have also been observed in random features models with nonlinear activations (Adlam & Pennington, 2020b). In addition, much of Sections 4 and 5, and Appendix M, is devoted to studying the tails or limiting behavior of bias amplification with respect to ψ and ϕ.

Effect of Number of Features. In the ODD and ADD profiles, when the rate of features to samples ϕ > 1, the right tail generally plateaus at higher values (i.e., greater than 1) when ϕ is closer to 1. This suggests that with a similar number of features and samples, under overparameterization, bias amplification increases. In contrast, when ϕ < 1, the right tail of the ODD and EDD curves seems to plateau at higher values when ϕ is larger. Regardless of the regime of the rate of features to samples ϕ, the left tails of the ODD and EDD curves are generally higher for larger ϕ.

4.2 REGULARIZATION AND TRAINING DYNAMICS

We now explore how regularization and training dynamics affect bias amplification.

Figure 2: Our theory predicts that models can amplify bias even with balanced groups and without spurious correlations. We empirically validate our theory (Theorems 3.1 and 3.2) for ODD, EDD, and ADD under the setup described in Section 4.1, with a₁ = 0.5, a₂ = 1, σ₁² = 1, and σ₂² = 1 × 10⁻⁵. The solid lines capture empirical values while the corresponding lower-opacity dashed lines represent what our theory predicts. We plot ODD and EDD on the same scale for easy comparison, and include a black dashed line at ADD = 1 to contrast bias amplification vs. deamplification. We include all the plots with error bars in Appendix L.

Setup. We revisit the experimental setup of Section 4.1. We modulate a₁, a₂, ψ (rate of parameters to samples), as well as λ (regularization penalty) to understand the effects of regularization and early stopping on bias amplification. We fix σ₁² = σ₂² = 1, and the rate of features to samples ϕ = 0.75.

Effect of Regularization and Training Time. In simplistic settings, we can simulate model learning over training time t by setting λ = 1/t (Ali et al., 2019). In the figures in Appendix O, we observe that regardless of the regime of ψ, ADD ≈ 1 (i.e., there is neither bias amplification nor deamplification) with high regularization or a short training time. When ψ > 1 (i.e., in the overparameterized regime), the ADD is generally greater than 1 across values of λ (i.e., bias is amplified), while when ψ < 1 (i.e., in the underparameterized regime), the ADD is generally less than 1 (i.e., bias is deamplified). Moreover, when ψ > 1, as regularization decreases (or training time increases), bias amplification increases and plateaus.
In contrast, when ψ < 1, as regularization decreases (or training time increases), bias deamplification increases and plateaus. A notable exception to this trend occurs when ψ is close to 1, where bias is initially deamplified and then amplified as λ decreases (or t increases). This suggests that there may be an optimal regularization penalty or training time to avoid bias amplification and increase bias deamplification. Intuitively, as training progresses, overparameterized models may discover shortcut associations (Geirhos et al., 2020) that do not generalize equally well across groups, yielding bias amplification. In practice, an optimal λ or t can be selected by searching for values that strike a desired balance between overall validation error and empirical bias amplification. The search space can be reduced by using the above ADD trends w.r.t. λ and t that our theory predicts for overvs. underparameterized models (see Appendix R for more details). It is important for ML practitioners the consider the interplay between high vs. low featureto-sample regimes and overparameterization in inducing bias amplification vs. deamplification when selecting optimal hyperparameters (see Figure 1). In general, the calibration λ = 1/t may not yield a theoretically tight picture of how bias evolves with t. The use of discrete gradient descent in practice rather than continuous-time gradient flows might yield further discrepancies. However, the calibration λ = 1/t yields a ratio of gradient flow to ridge risk that is at most 1.69, with no assumptions on the features X (Ali et al., 2019). Moreover, in the controlled settings considered by Ali et al. (2019), this ratio empirically appears to be quite close to 1, and thus may be sufficient for extrapolating our results. Like us, Jain et al. (2024) and Hall et al. (2022) find that bias and bias amplification can vary substantially during training; future work can establish stronger connections between our observations and the results of Jain et al. (2024), who analytically identify phases in the evolution of bias and a crossing phenomenon in the test error curves of groups during training. However, Jain et al. (2024) do not consider the effect of overand underparameterization on bias evolution. While our analysis relies on the simplistic calibration λ = 1/t, it reveals divergent behavior in how bias evolves depending on model size. Corroboration on Real Data. We further investigate the effect of training time on bias amplification on a more realistic dataset. We train a convolutional neural network (CNN) on Colored MNIST (see Appendix K.2 for more details). Colored MNIST is a semi-synthetic dataset derived from MNIST where digits are randomly re-colored to be red or green (Arjovsky et al., 2019). We treat the color of each digit as its group, and we manipulate the groups to have different levels of label noise. In Published as a conference paper at ICLR 2025 our experimental protocol: (1) the color of each digit (in both train and test) is chosen uniformly at random (i.e., with probability 0.5) and independently of the label; (2) by default, in the training set, the labels of red digits are flipped with probability 0.05 while the labels of green digits are flipped with probability 0.25; (3) labels are binarized (i.e., digits 0-4 correspond to 0 while digits 5-9 correspond to 1); and (4) each training step constitutes a step of gradient descent based on a batch of 250 instances. 
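For concreteness, the following is a minimal sketch of the dataset construction in steps (1)-(3) above; the CNN and the batched gradient-descent loop of step (4) are omitted. The helper name `make_colored_mnist` is hypothetical, the grayscale digit arrays can come from any standard MNIST loader, and the two-channel red/green image encoding is an implementation assumption rather than a detail specified in the text.

```python
import numpy as np

def make_colored_mnist(images, digits, rng, p_flip=(0.05, 0.25)):
    """Build a Colored-MNIST-style training set from grayscale MNIST arrays.

    images : (N, 28, 28) float array of grayscale digits in [0, 1]
    digits : (N,) integer array of digit labels in {0, ..., 9}
    Returns (N, 2, 28, 28) images with a red and a green channel, binary labels,
    and the group of each example (0 = red, 1 = green).
    """
    n = len(digits)
    # (1) Color (= group) chosen uniformly at random, independently of the label.
    group = rng.integers(0, 2, size=n)
    # (3) Binarize the label: digits 0-4 -> 0, digits 5-9 -> 1.
    labels = (digits >= 5).astype(np.int64)
    # (2) Group-dependent label noise: flip red labels w.p. 0.05, green w.p. 0.25.
    flip = rng.random(n) < np.where(group == 0, p_flip[0], p_flip[1])
    labels = np.where(flip, 1 - labels, labels)
    # Place each digit in its color channel; the other channel stays zero.
    colored = np.zeros((n, 2, 28, 28), dtype=np.float32)
    colored[np.arange(n), group] = images
    return colored, labels, group

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in arrays; in practice, substitute real MNIST images and labels here.
    fake_images = rng.random((1000, 28, 28)).astype(np.float32)
    fake_digits = rng.integers(0, 10, size=1000)
    X, y, g = make_colored_mnist(fake_images, fake_digits, rng)
    print(X.shape, y.mean(), np.bincount(g))
```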
Although Colored MNIST is a classification task and we use a complex CNN architecture, our theory correctly predicts that as the training time t increases, the ODD of the CNN is relatively low while the EDD is much larger, producing bias deamplification.

Figure 3: Our theory predicts that disparate label noise between groups deamplifies bias on Colored MNIST. We plot the ODD and EDD of a CNN over training time t for Colored MNIST. As t increases, the ODD is relatively low while the EDD is noticeably higher. The error bars capture the standard deviation computed over 10 random seeds.

Taking t → ∞ corresponds to the setting λ → 0⁺ in our theory (Theorems 3.1, 3.2). Because we assign the colors at random, the only difference in image features between groups is color; therefore, we expect the covariance matrices Σ₁ and Σ₂ to roughly coincide and Δ = 0 (i.e., w*₁ = w*₂). Note that we do not make any assumptions about the structure of Σ₁, Σ₂. Furthermore, p₁ = p₂ = 1/2, and thus ϕ₁ = ϕ₂ and ψ₁ = ψ₂. Additionally, we analogize the probability of label flipping to label noise in ridge regression. Hence, e₁ = e₂ and u₁ = u₂. Accordingly, lim_{λ→0⁺} B₁(f̂) = lim_{λ→0⁺} B₁(f̂₁) ≈ lim_{λ→0⁺} B₂(f̂) = lim_{λ→0⁺} B₂(f̂₂). Simultaneously, lim_{λ→0⁺} V₁(f̂) ≈ lim_{λ→0⁺} V₂(f̂). However, lim_{λ→0⁺} V₁(f̂₁) ≈ (σ₁²/2)·V = (0.05/2)·V = 0.025V (where V = ϕ₁ h₁⁽²⁾(I_d, Σ)), while lim_{λ→0⁺} V₂(f̂₂) ≈ (σ₂²/2)·V = (0.25/2)·V = 0.125V. This results in ODD ≈ 0 while EDD ≈ 0.1|V|, which explains the divergence of ODD and EDD in Figure 3. Intuitively, the high label noise for the green digits prevents the separate model f̂₂ from achieving a low test risk compared to f̂₁; the single model f̂ achieves a comparable test risk on both groups, effectively deamplifying bias, because of the better learning signal from the red digits. This phenomenon is similar to positive transfer, wherein the EDD of a model generally tends to be higher than the ODD when the labeling rules of imbalanced groups are sufficiently similar (Mannelli et al., 2024). However, Mannelli et al. (2024) do not explore the impact of model size on positive transfer. We show that the ODD can be less than the EDD depending on ψ in Figure 2, where Δ = I_d (i.e., the groups have different labeling rules). Future work can study the ADD = ODD/EDD profile when Δ = 0. Refer to Appendix P for additional Colored MNIST experiments.

5 MINORITY-GROUP BIAS

Recent work has revealed that overparameterization may hurt test performance on minority groups due to spurious features (Sagawa et al., 2020; Khani & Liang, 2021). Our theory provides new insights into how model size and extraneous features affect minority-group bias.

Setup. To mirror the settings of Sagawa et al. (2020); Khani & Liang (2021), we consider diatomic covariance matrices of core and extraneous features. We define A ⊕ B as the block-diagonal matrix with diagonal blocks A and B, and choose Σ₁ = a₁I_{πd} ⊕ 0·I_{(1−π)d}, Σ₂ = a₂I_{πd} ⊕ b₂I_{(1−π)d}, for π ∈ (0, 1), a₁, b₂ > 0, a₁ = a₂. Refer to App. K.1 for full details and a discussion of extraneous vs. spurious features (due to space limitations).

Interpolation Thresholds. The together R₂ (i.e., the test risk for the minority group in the single-model setting) has different interpolation thresholds as ψ (rate of parameters to samples) increases, depending on ϕ (rate of features to samples) and π (fraction of core features). Notably, as ϕ increases, the interpolation thresholds occur at larger model sizes, culminating at ψ = 1.
This suggests that for a higher rate of features to samples, a larger model size can greatly increase the together test risk of the minority group. Furthermore, the interpolation thresholds all occur closer to ψ = 1 for larger π, collapsing to a single threshold at ψ = 1 when π → 1 (as in Appendix L). Therefore, a lower fraction of core features can yield more possible model sizes that increase the test risk of the minority group. In addition, the together R₂ exhibits a steeper rate of growth around the interpolation thresholds for larger b₂, suggesting that a higher variance in the extraneous features can also increase the test risk of the minority group in the single-model setting. The phenomenon of different interpolation thresholds is not visible for R₂ when a separate model is trained per group; however, we do observe the expected double descent peaks in the separate R₁ and R₂ curves at ψ₁ = 1 and ψ₂ = 1, respectively.

Figure 4: Minority-group test risk can peak with different model sizes depending on the rate of features to samples. We empirically demonstrate that minority-group bias is affected by extraneous features. We validate our theory (Theorems 3.1 and 3.2) for together R₁, R₂ (i.e., single model learned for both groups) and separate R₁, R₂ (i.e., separate model learned per group) under the setup described in Section 4.2, with a₁ = 2, b₂ = 0.2, and π = 0.5. The solid lines capture empirical values while the corresponding lower-opacity dashed lines represent what our theory predicts. We include a black dashed line at ADD = 1 to contrast bias amplification vs. deamplification. All y-axes are on the same scale for easy comparison. All the plots with error bars are in Appendix Q.

Overparameterization. The right tails of the together R₂ curves plateau at different finite values depending on ϕ. In particular, for ϕ closer to 1, the together R₂ curves generally plateau at a higher value, suggesting that a similar number of features and samples can exacerbate minority-group bias under overparameterization. Furthermore, for smaller π and certain values of ϕ < 1, the right tail of the together R₁ curve plateaus at a lower value than the together R₂ curve. This suggests that there can be differences in test error between groups that are not alleviated even with increased model size. The magnitude of this effect diminishes as the fraction of core features increases. This phenomenon supports the finding of Sagawa et al. (2020) that overparameterization with spurious features can increase test risk disparities between groups. We identify that the magnitude of this phenomenon may depend on both the rate of features to samples and the fraction of core features.

6 CONCLUSION

We present a unifying, rigorous, and effective theory of ML bias in the settings of ridge regression with and without random projections. Our theory predicts interesting insights into bias amplification and minority-group bias in different feature and parameter regimes. These findings can inform strategies to evaluate and mitigate unfairness in ML (see Appendix R for more details). However, there remain practical challenges to assessing whether a model is prone to bias amplification. These include robustly estimating the feature covariance matrices (Bickel & Levina, 2008) and label noises (Frénay & Kabán, 2014) for groups from sample data, especially for minority groups, which have limited data.
Even so, practitioners can use our theory and empirical observations to form intuition about when disparities in the variability of features and labels across groups can amplify bias. Our theoretical methods are easily extendable to the case of more than two groups and can accommodate label noise sampled from other distributions. However, our theory is not directly extendable to different proportionate scaling limits (e.g., d2/n has a finite limit instead of d/n). Additionally, our theory requires approximately normally-distributed data and thus does not currently account for missing features, which are common in the real world (Feng et al., 2024). Furthermore, our theory implicitly assumes that group information is known, which is not always true (Coston et al., 2019); however, because we work in an asymptotic scaling limit, having access to group information with o(min(n1, n2)) noise is sufficient. As future work, we can leverage Gaussian equivalents (Goldt et al., 2022) to extend our theory to wide, fully-trained networks in the NTK (Jacot et al., 2018) and lazy (Chizat et al., 2019) regimes; this will enable us to understand how, apart from model size, other design choices like nonlinear activation functions and learning rate may affect bias amplification. Published as a conference paper at ICLR 2025 Ben Adlam and Jeffrey Pennington. Understanding double descent requires a fine-grained biasvariance decomposition. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 11022 11032. Curran Associates, Inc., 2020a. Ben Adlam and Jeffrey Pennington. The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. In International Conference on Machine Learning, pp. 74 84. PMLR, 2020b. Alnur Ali, J Zico Kolter, and Ryan J Tibshirani. A continuous-time view of early stopping for least squares regression. In The 22nd international conference on artificial intelligence and statistics, pp. 1370 1378. PMLR, 2019. Martin Arjovsky, L eon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. ar Xiv preprint ar Xiv:1907.02893, 2019. Francis Bach. High-dimensional analysis of double descent for linear regression with random projections, 2023. Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849 15854, 2019. Samuel J. Bell and Skyler Wang. The multiple dimensions of spuriousness in machine learning, 2024. Samuel James Bell and Levent Sagun. Simplicity bias leads to amplified performance disparities. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, FAcc T 23, pp. 355 369, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701924. doi: 10.1145/3593013.3594003. Peter J. Bickel and Elizaveta Levina. Regularized estimation of large covariance matrices. Annals of Statistics, 36:199 227, 2008. Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Sorelle A. Friedler and Christo Wilson (eds.), Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81 of Proceedings of Machine Learning Research, pp. 77 91. PMLR, 23 24 Feb 2018. Andrea Caponnetto and Ernesto de Vito. Optimal rates for the regularized least-squares algorithm. 
Foundations of Computational Mathematics, 7:331 368, 2007. Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. Advances in neural information processing systems, 32, 2019. Amanda Coston, Karthikeyan Natesan Ramamurthy, Dennis Wei, Kush R. Varshney, Skyler Speakman, Zairah Mustahsan, and Supriyo Chakraborty. Fair transfer learning with missing protected attributes. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES 19, pp. 91 98, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450363242. doi: 10.1145/3306618.3314236. Hugo Cui, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborov a. Generalization error rates in kernel regression: the crossover from the noiseless to noisy regime. Journal of Statistical Mechanics: Theory and Experiment, 2022(11):114004, nov 2022. St ephane D Ascoli, Maria Refinetti, Giulio Biroli, and Florent Krzakala. Double trouble in double descent: Bias and variance(s) in the lazy regime. In Hal Daum e III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 2280 2290. PMLR, 13 18 Jul 2020. Terrance de Vries, Ishan Misra, Changhan Wang, and Laurens van der Maaten. Does object recognition work for everyone? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 52 59, 2019. Published as a conference paper at ICLR 2025 Edgar Dobriban and Stefan Wager. High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247 279, 2018. Elvis Dohmatob, Yunzhen Feng, and Julia Kempe. Model collapse demystified: The case of regression. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Advances in Neural Information Processing Systems, volume 37, pp. 46979 47013. Curran Associates, Inc., 2024a. Elvis Dohmatob, Yunzhen Feng, Arjun Subramonian, and Julia Kempe. Strong model collapse. ar Xiv preprint ar Xiv:2410.04840, 2024b. Reza Rashidi Far, Tamer Oraby, Wlodzimierz Bryc, and Roland Speicher. Spectra of large block matrices. ar Xiv preprint cs/0610045, 2006. Raymond Feng, Flavio Calmon, and Hao Wang. Adapting fairness interventions to missing values. Advances in Neural Information Processing Systems, 36, 2024. Benoˆıt Fr enay and Ata Kab an. A comprehensive introduction to label noise. In The European Symposium on Artificial Neural Networks, 2014. Robert Geirhos, J orn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665 673, 2020. Sebastian Goldt, Bruno Loureiro, Galen Reeves, Florent Krzakala, Marc M ezard, and Lenka Zdeborov a. The gaussian equivalence of generative models for learning with shallow neural networks. In Mathematical and Scientific Machine Learning, pp. 426 471. PMLR, 2022. Melissa Hall, Laurens van der Maaten, Laura Gustafson, Maxwell Jones, and Aaron Adcock. A systematic study of bias amplification, oct 2022. Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in highdimensional ridgeless least squares interpolation. Annals of statistics, 50(2):949, 2022. Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 
771 787, 2018. Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. Anchit Jain, Rozhin Nobahari, Aristide Baratin, and Stefano Sarao Mannelli. Bias in motion: Theoretical insights into the dynamics of bias in sgd training. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Advances in Neural Information Processing Systems, volume 37, pp. 24435 24471. Curran Associates, Inc., 2024. V. Kargin. Subordination for the sum of two random matrices. The Annals of Probability, 43(4):2119 2150, 2015. Fereshte Khani and Percy Liang. Removing spurious features can hurt accuracy and affect groups disproportionately. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAcc T 21, pp. 196 205, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383097. doi: 10.1145/3442188.3445883. Donghwan Lee, Behrad Moniri, Xinmeng Huang, Edgar Dobriban, and Hamed Hassani. Demystifying disagreement-on-the-line in high dimensions. In International Conference on Machine Learning, pp. 19053 19093. PMLR, 2023. Klas Leino, Matt Fredrikson, Emily Black, Shayak Sen, and Anupam Datta. Feature-wise bias amplification. In International Conference on Learning Representations, 2019. Published as a conference paper at ICLR 2025 Alexander Maloney, Daniel A Roberts, and James Sully. A solvable model of neural scaling laws. ar Xiv preprint ar Xiv:2210.16859, 2022. Stefano Sarao Mannelli, Federica Gerace, Negar Rostamzadeh, and Luca Saglietti. Bias-inducing geometries: exactly solvable data model with fairness implications. In ICML 2024 Workshop on Geometry-grounded Representation Learning and Generative Modeling, 2024. V.A. Marˇcenko and Leonid Pastur. Distribution of eigenvalues for some sets of random matrices. Math USSR Sb, 1:457 483, 1967. James A. Mingo and Roland Speicher. Free Probability and Random Matrices, volume 35 of Fields Institute Monographs. Springer, 2017. Anaelia Ovalle, Palash Goyal, Jwala Dhamala, Zachary Jaggers, Kai-Wei Chang, Aram Galstyan, Richard Zemel, and Rahul Gupta. i m fully who i am : Towards centering transgender and non-binary voices to measure biases in open language generation. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, FAcc T 23, pp. 1246 1266, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701924. doi: 10.1145/3593013.3594078. Dominic Richards, Jaouad Mourtada, and Lorenzo Rosasco. Asymptotics of ridge (less) regression under general source condition. In International Conference on Artificial Intelligence and Statistics, pp. 3889 3897. PMLR, 2021. Megan Richards, Polina Kirichenko, Diane Bouchacourt, and Mark Ibrahim. Does progress on object recognition benchmarks improve generalization on crowdsourced, global data? In The Twelfth International Conference on Learning Representations, 2024. Shiori Sagawa, Aditi Raghunathan, Pang Wei Koh, and Percy Liang. An investigation of why overparameterization exacerbates spurious correlations. In International Conference on Machine Learning, pp. 8346 8356. PMLR, 2020. Stefano Spigler, Mario Geiger, St ephane d Ascoli, Levent Sagun, Giulio Biroli, and Matthieu Wyart. 
A jamming transition from under-to over-parametrization affects generalization in deep learning. Journal of Physics A: Mathematical and Theoretical, 52(47):474001, 2019. Nilesh Tripuraneni, Ben Adlam, and Jeffrey Pennington. Overparameterization improves robustness to covariate shift in high dimensions. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 13883 13897. Curran Associates, Inc., 2021. Angelina Wang and Olga Russakovsky. Directional bias amplification. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 10882 10893. PMLR, 18 24 Jul 2021. Sierra Wyllie, Ilia Shumailov, and Nicolas Papernot. Fairness feedback loops: training on synthetic data amplifies bias. In The 2024 ACM Conference on Fairness, Accountability, and Transparency, FAcc T 24, pp. 2113 2147, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400704505. doi: 10.1145/3630106.3659029. Gilad Yehudai and Ohad Shamir. On the power and limitations of random features for understanding neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch e-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. John R. Zech, Marcus A. Badgeley, Manway Liu, Anthony B. Costa, Joseph J. Titano, and Eric Karl Oermann. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLOS Medicine, 15(11):e1002683, nov 2018. ISSN 1549-1676. doi: 10.1371/journal.pmed.1002683. Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Martha Palmer, Rebecca Hwa, and Sebastian Riedel (eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2979 2989, Copenhagen, Denmark, sep 2017. Association for Computational Linguistics. doi: 10.18653/v1/D17-1323. Published as a conference paper at ICLR 2025 Table of Contents A Warm-up: Deriving Marchenko-Pastur Law via Operator-Valued Free Probability Theory 16 A.1 Step 1: Constructing a Linear Pencil . . . . . . . . . . . . . . . . . . . . . . . 16 A.2 Step 2: Constructing the Fundamental Equation via Freeness . . . . . . . . . . . 16 A.3 Step 3: The Final Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 B Technical Assumptions 18 C Related Work (Continued) 19 High-dimensional analysis of bias. . . . . . . . . . . . . . . . . . . . . 19 Bias amplification metrics. . . . . . . . . . . . . . . . . . . . . . . . . 19 D Warm-Up: Classical Linear Model 20 D.1 Single Model Learned for Both Groups . . . . . . . . . . . . . . . . . . . . . . 20 D.2 Separate Model Learned Per Group . . . . . . . . . . . . . . . . . . . . . . . . 20 D.3 Phase Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 E Proof of Theorem D.2 22 E.1 Variance Term . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 E.2 Bias Term . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 F Proof of Theorem D.1 26 F.1 Variance Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 F.2 Bias Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
29 G Proof of Theorem 3.1 36 G.1 Computing E tr r(1) j . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 G.2 Computing E tr r(2) j . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 G.3 Computing E tr r(3) j . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 G.4 Computing E tr r(4) j . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 H Theorem 3.2 44 I Solving Fixed-Point Equations for Theorem D.1 45 I.1 Proportional Covariance Matrices . . . . . . . . . . . . . . . . . . . . . . . . . 45 I.2 The General Regularized Case . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 I.3 Unregularized Limit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 J Corollary J.1 49 J.1 Case 1: θ0 = 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 J.2 Case 2: θ0 > 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 K Experimental Details 52 K.1 Synthetic Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Setup for Section 4.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Setup for Section 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Extraneous vs. Spurious Features. . . . . . . . . . . . . . . . . . . . . 52 Published as a conference paper at ICLR 2025 K.2 Colored MNIST Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 L Bias Amplification Plots 54 M Power-Law Covariance 57 N Proof of Corollary M.1 59 O Bias Amplification During Training 60 P Colored MNIST Plots 61 Q Minority-Group Bias Plots 62 R Actionable Insights from Theory 65 Searching for optimal hyperparameters. . . . . . . . . . . . . . . . . . . 65 Informing evaluation and mitigation strategies. . . . . . . . . . . . . . . 65 Published as a conference paper at ICLR 2025 A WARM-UP: DERIVING MARCHENKO-PASTUR LAW VIA OPERATOR-VALUED FREE PROBABILITY THEORY We provide a detailed example of how to apply linear pencils and operator-valued free probability theory (OVFPT) to derive the classical Marchenko-Pastur (MP) law (Marˇcenko & Pastur, 1967). Let S = (1/n)X X Rd d be the empirical covariance matrix for an n d random matrix X with IID entries from N(0, 1). If n tends to infinity while d is held fixed, then S converges to the population covariance matrix, here Σ = Id. If d also tends to infinity, then the limit seizes to exist. It turns out that one can still make sense of the limiting distribution of eigenvalues of S in the case d/n stays constant, i.e., n, d , d/n γ (0, ). (22) In particular, we seek to understand the behavior of the random histogram: i=1 bλi, (23) where bλ1, . . . , bλd are the eigenvalues of S. In the aforementioned limit, bµn converges to a deterministic law µMP on R called the MP law. This is central to the field of random matrix theory (RMT), a primary tool in probability theory, statistical analysis of neural networks, finance, etc. We are interested in an even more powerful tool free probability theory (FPT) which is powerful enough to give a precise picture of deep learning in certain linearized regimes (e.g., random features, NTK) and interesting phenomena (e.g., triple descent) via analytic calculation. A.1 STEP 1: CONSTRUCTING A LINEAR PENCIL For any positive λ, consider the 2 2 block matrix Q defined by: Let tr be the normalized trace operator on square matrices and set φ = E tr . This gives random (n + d) (n + d) matrices the structure of a von Neumann algebra A. 
Define a 2 2 matrix G = G(Q) by: G = (I2 φ)Q 1, i.e gi,j = φ([Q 1]i,j) = [φ(Q 1)]i,j for all i, j {1, 2}. (25) Thus, the operator (I2 φ)Q 1 extracts the expectation of the normalized trace of the blocks of the inverse of the a 2 2 block matrix Q. Observe that: E tr (S + λId) 1 = g2,2 This is a direct consequence of inverting a 2 2 block matrix (namely Schur s complement). The mechanical advantage of Equation 26 is that the resolvent (S + λId) 1 depends quadratically on X while g2,2 is defined via Q, which is linear in X. For this reason, Q is called a linear pencil for (S + λId) 1. The construction of appropriate linear pencils for rational functions of random matrices is a crucial step in leveraging FPT. A.2 STEP 2: CONSTRUCTING THE FUNDAMENTAL EQUATION VIA FREENESS For any B Mb(C)+, define a block matrix B 1A by: [B 1A]ij = bij Idi, if di = dj 0, else . (27) Published as a conference paper at ICLR 2025 Here, b b is the number of blocks in the linear pencil QX, that is, b = 2. Now, observe that we can write Q = F QX, where: F = Id 0 0 In = I2 1A and QX = One can then express G = (Ib φ)Q 1 = (Ib φ)(F QX) 1. From operator-valued FPT, we know that in the proportionate scaling limit given by Equation 22, the following fixed-point equation (due to the asymptotic freeness of QX and F) is satisfied by G: G = (Ib φ)(F R 1A) 1, (29) where R = RQX(G), and RQX is the R-transform of QX which maps Mb(C)+ to itself like so: RQX(B)ij = X k,ℓ σ(i, k; ℓ, j)αkbkℓ. (30) Here, σ(i, k; ℓ, j) is the covariance between the entries of block (i, k) and block (ℓ, j) of QX, while αk is the dimension of the block (k, ℓ). A.3 STEP 3: THE FINAL CALCULATION By the structure of QX, one can compute from Equation 30: nλ g2,2 = γ λg2,2, (31) r1,2 = 0, (32) r2,1 = 0, (33) nλ g1,1 = 1 λg1,1. (34) Combining this with Equation 29, one has: G = (I2 φ)(Z R 1A) 1 = (I2 R) 1 = 1 + (γ/λ)g2,2 0 0 1 + g2,2/λ = λ/(λ + γg2,2) 0 0 λ/(λ + g1,1) Comparing the matrix entries, this translates to the following scalar equations: g1,1 = λ λ + γg2,2 , (36) g2,2 = λ λ + g1,1 , (37) g2,1 = g1,2 = 0. (38) Plugging the second equation into the first (to eliminate g1,1) gives: g2,2 = λ λ + λ/(λ + γg2,2). Setting m = g2,2/λ then gives m = (λ + 1/(1 + γm)) 1, i.e., 1 m = λ + 1 1 + γm, (39) which is precisely the functional equation characterizing the Stieltjes transform (evaluated at λ = z) of the MP law with shape parameter γ. By treating λ as a complex number and applying the Cauchy-inversion formula, we can recover µMP. Published as a conference paper at ICLR 2025 B TECHNICAL ASSUMPTIONS Assumption B.1. In the case of classical ridge regression, we will work in the following proportionate scaling limit: n, n1, n2, d , n1/n p1, n2/n p2, d/n1 ϕ1, d/n2 ϕ2, d/n ϕ, (40) for some constants ϕ1, ϕ2, ϕ (0, ). The scalar ϕ captures the rate of features to samples. Observe that ϕ = p1ϕ1 and ϕ = p2ϕ2. Assumption B.2. The per-group covariance matrices Σ1 and Σ2 and ground-truth weight covariance matrices Θ and are all simultaneously diagonalizable; hence, all these matrices commute. While Assumption B.2 may appear reductive, our goal is to analyze the bias amplification phenomenon in a sufficient setting that does not introduce complexities due to non-commutativity. Notably, our main theoretical result does not assume isotropic covariance. For example, our theory accommodates diatomic covariance (see Section 5) and power-law covariance (see Appendix M). Assumption B.3. 
In Corollary M.1, we assume the following spectral densities exist when d → ∞: ν ∈ P(R+) is the limiting spectral density of Σ2Σ1⁻¹, i.e., of the ratios λ(2)j / λ(1)j of the eigenvalues of the respective covariance matrices; µ ∈ P(R+ × R+) is the joint limiting density of the spectra of Σ2Σ1⁻¹ and Σ1; and π ∈ P(R+) is the limiting density of the spectrum of Δ.

C RELATED WORK (CONTINUED)

High-dimensional analysis of bias. Mannelli et al. (2024) employ the replica method, which is non-rigorous, while we use OVFPT, which is entirely rigorous. Moreover, Mannelli et al. (2024); Jain et al. (2024) study the application of linear classification to Gaussian data with isotropic covariance; in contrast, we study the application of regression with random projections (a simplified model of feedforward neural networks) to Gaussian data with a more general covariance structure (i.e., covariance matrices that are simultaneously diagonalizable) and noisy labels. This allows us to analyze the effects of these additional factors on bias. We make additional connections between our work and Mannelli et al. (2024); Jain et al. (2024) in Section 4.2.

Bias amplification metrics. Our definition of ADD is consistent with the conceptualization of bias of Bell & Sagun (2023). At a high level, our definition quantifies how many times worse model bias would be if an ML practitioner opted to train a single model on a mixture of data from two groups (i.e., the setting in which bias is observed in practice) vs. separate models for the data from each group (i.e., the setting which corresponds to the bias in the data alone, and thus the a priori amount of bias we would expect in the case of a single model). In sum, we seek to isolate the contribution of the model to bias when learning from data with different groups.

D WARM-UP: CLASSICAL LINEAR MODEL

Technical Difficulty. The analysis of the test errors (e.g., Rs(f̂)) amounts to the analysis of the trace of rational functions of sums of random matrices. Although the limiting spectral density of sums of random matrices is a classical computation using subordination techniques (Marčenko & Pastur, 1967; Kargin, 2015), a more involved analysis is required in our case. This difficulty is even greater in the setting of random projections (see Section 3.1). Thus, we employ OVFPT to compute the exact high-dimensional limits of such quantities. We derive Theorems D.2 and D.1 using OVFPT (in Appendices E and F, respectively). Theorem D.1 is a non-trivial generalization of Proposition 3 from (Bach, 2023), which can be recovered by taking ps → 1 (i.e., ps̄ → 0).

D.1 SINGLE MODEL LEARNED FOR BOTH GROUPS

We first consider the classical ridge regression model f̂, which is learned using empirical risk minimization and ℓ2-regularization with penalty λ. The parameter vector ŵ ∈ R^d of the linear model f̂ is given by the following problem:

ŵ = arg min_{w ∈ R^d} L(w) = ∑_{s=1}^{2} (1/n) ‖Xs w − Ys‖²₂ + λ ‖w‖²₂. (41)

The unregularized limit λ → 0+ corresponds to ordinary least-squares (OLS). We provide in Theorem D.1 a novel bias-variance decomposition for the test error Rs(f̂) for each group s ∈ {1, 2}. We first present some relevant definitions. Definition D.1.
For any group index s {1, 2}, we define (e1, e2, u(s) 1 , u(s) 2 ) to be the unique positive solution to the following system of fixed-point equations: 1/es = 1 + ϕ tr Σs K 1, u(s) k = ϕe2 k tr Σk(p1u(s) 1 Σ1 + p2u(s) 2 Σ2 + Σs)K 2, k {1, 2}, (42) where K = p1e1Σ1 + p2e2Σ2 + λId and tr A := (1/d) tr A is the normalized trace operator. The fixed-point equations for es are non-linear and often not analytically solvable for general Σ1, Σ2. This is typical in RMT. Theorem D.1. Under Assumptions B.2 and B.1, it holds that: Rs( bf) Bs( bf) + Vs( bf), with Vs( bf) = V (1) s ( bf) + V (2) s ( bf), (43) V (k) s ( bf) = pkσ2 kϕ tr Σk ekΣs λu(s) k Id + pk Σk (eku(s) k ek u(s) k ) K 2, (44) Bs( bf) = B(1) s ( bf) + B(3) s ( bf) + ( 0, s = 1, 2B(2) 2 ( bf), s = 2, (45) B(1) s ( bf) = ps tr Σs (ps (1 + psu(s) s )e2 s Σs Σs + u(s) s (psesΣs + λId)2)K 2, (46) B(2) 2 ( bf) = p1λ tr Σ1((1 + p2u(2) 2 )e1Σ2 u(2) 1 (p2e2Σ2 + λId))K 2, (47) B(3) s ( bf) = λ2 tr Θs(p1u(s) 1 Σ1 + p2u(s) 2 Σ2 + Σs)K 2, (48) where 1 = 2 and 2 = 1. D.2 SEPARATE MODEL LEARNED PER GROUP We now treat the case of fitting a separate model bfs per group. Suppose that the classical ridge regression models bf1 and bf2 are learned using empirical risk minimization and ℓ2-regularization with penalties λ1 and λ2, respectively. In particular, we have the following optimization problem for each group s: arg min w Rd L(w) = 1 (xi,yi) Ds (x i w yi)2 + λs w 2 2 = Xsw Ys 2 2 ns + λs w 2 2. (49) We first present some relevant definitions. Published as a conference paper at ICLR 2025 Definition D.2. Let df (s) m (t) = tr Σm s (Σs + t Id) m, and κs be the unique positive solution to the equation κs λs = κsϕs df (s) 1 (κs). In this setting, we deduce Theorem D.2. Theorem D.2. Under Assumptions B.2 and B.1, it holds that: Rs( bfs) Bs( bfs) + Vs( bfs), with (50) Vs( bfs) = σ2 sϕs df (s) 2 (κs) 1 ϕs df (s) 2 (κs) , Bs( bfs) = κ2 s tr ΘsΣs (Σs + κs Id) 2 1 ϕs df (s) 2 (κs) . (51) D.3 PHASE DIAGRAM We present the bias amplification phase diagram (Figure 5) predicted by Theorems D.1 and D.2 for the classical ridge regression model. The phase diagram offers insights into how ϕ (rate of features to samples) affects bias amplification. To obtain the precise phase diagram, we solve the scalar equations numerically. In the ODD profile, we observe an interpolation threshold at ϕ = 1. To the right of the threshold, we observe a tail that descends towards 1. To the left of the threshold, the ODD descends below 1 with a local minimum at ϕ 0.25 before increasing. In contrast, we observe that the EDD continually grows as ϕ increases, ascending from a small value, exhibiting an inflection point at ϕ = 0.5, and plateauing after ϕ = 1. Accordingly, the ADD increases significantly as ϕ decreases (with an intermediate inflection point at ϕ = 0.5), peaks at ϕ = 1, and descends towards 1 as ϕ increases (i.e., bias remains amplified in this phase). In sum, bias is most amplified when the rate of features to samples ϕ 1 and ϕ = 1. Interestingly, bias amplification consistently occurs (i.e., ADD > 1) across all observed values of ϕ. 0 1 2 3 4 10 3 Figure 5: ODD, EDD, and ADD phase diagrams for classical ridge regression. We plot the bias amplification phase diagrams with respect to ϕ (rate of features to samples), as predicted by our theory for ridge regression without random projections (Theorems D.1, D.2). Dashed black lines indicate theoretical predictions. We consider isotropic covariance matrices: Σ1 = 2Id, Σ2 = Id, Θ = 2Id, = Id. Additionally, n = 1 104, σ2 1 = σ2 2 = 1. 
We further choose λ = λ1 = λ2 = 1 10 6 to approximate the minimum-norm interpolator. We observe that bias amplification can occur even in the balanced data setting, i.e., when p1 = p2 = 1/2, without spurious correlations. Published as a conference paper at ICLR 2025 E PROOF OF THEOREM D.2 Proof. We define Ms = X s Xs and Es = Ys Xsw s. Note that bws = (X s Xs + nsλs Id) 1X s (Xsw s + Es) = (Ms + nsλs Id) 1Msw s + (Ms + nsλs Id) 1X s Es. We deduce that Rs( bfs) = Bs( bfs) + Vs( bfs), where: Bs( bfs) = E (Ms + nsλs Id) 1Msw s w s 2 Σs, (52) Vs( bfs) = E (Ms + nsλs Id) 1X s Es 2 Σs. (53) E.1 VARIANCE TERM Note that the variance term Vs( bf) of the test error of bfs evaluated on group s is given by: Vs( bfs) = σ2 s E tr Xs(Ms + nsλs Id) 1Σs(Ms + nsλs Id) 1X s (54) = σ2 s E tr (Ms + nsλs Id) 1Ms(Ms + nsλs Id) 1Σs. (55) We can re-express this as: ns Vs( bfs) = σ2 s E tr (Hs + λs Id) 1Hs(Hs + λs Id) 1Σs (56) = σ2 s λs E tr (Hs/λs + Id) 1(Hs/λs)(Hs/λs + Id) 1Σs, (57) where Hs = X s Xs/ns and Xs = ZsΣ1/2 s , with Z1 Rn1 d and Z2 Rn2 d being independent random matrices with IID entries from N(0, 1). Thus, the variance term is proportional to: tr (Hs + λs Id) 1Hs(Hs + λs Id) 1Σs. (58) WLOG, we consider the case where s = 1. The matrix of interest has a linear pencil representation given by (with zero-based indexing): (H1/λ1 + Id) 1(H1/λ1)(H1/λ1 + Id) 1Σ1 = Q 1 0,8, (59) where the linear pencil Q is defined as follows: 1 2 1 0 0 Σ 1 2 1 0 0 0 0 0 Id 1 λ1 n1 Z 1 0 0 0 0 0 0 0 0 In1 1 λ1 n1 Z1 0 0 0 0 0 1 2 1 0 0 Id 0 0 0 0 0 0 0 0 0 Id 1 λ1 n1 Z 1 0 0 0 0 0 0 0 0 In1 1 λ1 n1 Z1 0 0 0 0 0 0 0 0 Id Σ 1 2 1 0 0 Id Σ1 0 0 0 0 0 0 0 0 Id We compute Q using the NCMinimal Descriptor Realization function of the NCAlgebra library1. We further symmetrize Q by constructing the self-adjoint matrix Q: Q = 0 Q Q 0 This enables us to apply known formulae for the R-transform of Gaussian block matrices (Far et al., 2006). We note that Q 1 0,17 = Q 1 0,8. Taking similar steps as Lee et al. (2023), we use OVFPT on Q. Let G = (I18 E tr )Q 1 R18 18 be the matrix whose entries are normalized traces of blocks2 of Q 1. We provide a detailed example of how to apply OVFPT to derive the MP law in Appendix A. One can arrive at that, in the asymptotic limit given by Equation 40, the following holds: E tr (H1 + λ1Id) 1H1(H1 + λ1Id) 1Σ1 = G0,17 λ1 = (G5,14 G2,14) tr (Σ1G2,11 + λ1Id) 1 Σ1 (Σ1G5,14 + λ1Id) 1 Σ1. (62) 1https://github.com/NCAlgebra/NC 2By convention, the trace of a non-square block is zero. Published as a conference paper at ICLR 2025 We will now obtain the fixed-point equations satisfied by G2,11 and G5,14. We observe that: G2,11 = λ1 λ1 + ϕ1G3,10 , G3,10 = λ1 tr Σ1(Σ1G2,11 + λ1Id) 1 (63) = G2,11 = 1 1 + ϕ1 tr Σ1(Σ1G2,11 + λ1Id) 1 , (64) G5,14 = λ1 λ1 + ϕ1G6,13 , G6,13 = λ1 tr Σ1 (Σ1G5,14 + λ1Id) 1 (65) = G5,14 = 1 1 + ϕ1 tr Σ1 (Σ1G5,14 + λ1Id) 1 . (66) We recognize that we must have the identification e1 = G2,11 = G5,14, where e1 0. Therefore: e1 + ϕ1 df (1) 1 (λ1/e1) (67) i.e., 1 = e1 + ϕ1 df (1) 1 (λ1/e1) = λ1/κ1 + ϕ1 df (1) 1 (κ1) (68) κ1 = λ1 + κ1ϕ1 df (1) 1 (κ1), (69) where df (s) m (t) = tr Σm s (Σs + t Id) m and κ1 = λ1/e1. Additionally: G2,14 = λ1ϕ1G3,13 ( λ1 + ϕ1G3,10)( λ1 + ϕ1G6,13) = ϕ1e2 1 G3,13 λ1 = tr (Σ1G2,11 + λ1Id) 2(Σ1G2,14 + λ1Id)Σ1 (71) e2 1 df (1) 2 (κ1) + λ1 tr (Σ1e1 + λ1Id) 2Σ1, (72) λ1 = tr (Σ1e1 + λ1Id) 1Σ1. 
(73) G5,14 G2,14 = e2 1 1 ϕ1 G3,10 + G3,13 G3,10 + G3,13 e2 1 df (1) 2 (κ1) + λ1 tr (Σ1e1 + λ1Id) 2Σ1 (75) tr (Σ1e1 + λ1Id) 2(Σ1e1 + λ1Id)Σ1 (76) e2 1 df (1) 2 (κ1) e1 e2 1 df (1) 2 (κ1) (77) = G5,14 G2,14 e2 1 df (1) 2 (κ1). (78) c1 1, c1 = G5,14 G2,14 e2 1 = 1 + ϕ1c1 df (1) 2 (κ1), (79) i.e., c1 = 1 1 ϕ1 df (1) 2 (κ1) . (80) λ1 = c1 df (1) 2 (κ1) = df (1) 2 (κ1) 1 ϕ1 df (1) 2 (κ1) . (81) In conclusion: κ1 = λ1 + κ1ϕ1 df (1) 1 (κ1), (82) V1( bf1) = σ2 1ϕ1 df (1) 2 (κ1) 1 ϕ1 df (1) 2 (κ1) . (83) Published as a conference paper at ICLR 2025 Following similar steps for V2( bf2), we get: κ2 = λ2 + κ2ϕ2 df (2) 1 (κ2), (84) V2( bf2) = σ2 2ϕ2 df (2) 2 (κ2) 1 ϕ2 df (2) 2 (κ2) . (85) To further substantiate our result, let us consider the unregularized case where λs = 0 and ϕs < 1: κs = 0, Vs( bfs) = σ2 sϕs 1 ϕs . (86) From an alternate angle, we know that: Rs( bfs) = E bws w s 2 Σs = E (X s Xs) 1X s Es 2 Σs (87) = σ2 s E tr Xs(X s Xs) 1Σs(X s Xs) 1X s (88) = σ2 s E tr (X s Xs) 1Σs = σ2 s ns d 1 tr Id = σ2 s d ns d 1 σ2 sϕs 1 ϕs , (89) where we have used Lemma E.1 below. Lemma E.1. Let n and d be positive integers with n d + 2. If Z is an n d random matrix with IID rows from N(0, Σ), then: E(Z Z) 1 = 1 n d 1Σ 1. (90) E.2 BIAS TERM We can compute the bias term Bs( bfs) of the test error of bfs evaluated on group s as: Bs( bfs) = E (Ms + nsλs Id) 1Msw s w s 2 Σs (91) = E (Ms + nsλs Id) 1Msw s (Ms + nsλs Id) 1(Ms + nsλs Id)w s 2 Σs (92) = E (Ms + nsλs Id) 1nsλsw s 2 Σs (93) = n2 sλ2 s E tr (Ms + nsλs Id) 1w s(w s) (Ms + nsλs Id) 1Σs. (94) We can re-express this as: 1 λ2s Bs( bfs) = E tr (Hs + λs Id) 1Θs(Hs + λs Id) 1Σs (95) Bs( bfs) = E tr (Hs/λs + Id) 1Θs(Hs/λs + Id) 1Σs, (96) where Θs = Θ, s = 1 Θ + , s = 2. WLOG, we consider the case where s = 1. The matrix of interest has a linear pencil representation given by (with zero-based indexing): (H1/λ1 + Id) 1Θ(H1/λ1 + Id) 1Σ1 = Q 1 0,8, (97) where the linear pencil Q is defined as follows: 1 2 1 0 0 Θ 0 0 0 0 0 Id 1 λ n Z 1 0 0 0 0 0 0 0 0 In1 1 λ n Z1 0 0 0 0 0 1 2 1 0 0 Id 0 0 0 0 0 0 0 0 0 Id Σ 1 2 1 0 0 Σ1 0 0 0 0 0 Id 1 λ n Z 1 0 0 0 0 0 0 0 0 In1 1 1 2 1 0 0 Id 0 0 0 0 0 0 0 0 0 Id Published as a conference paper at ICLR 2025 We note that Q 1 0,17 = Q 1 0,8. Using OVFPT, we deduce that, in the limit given by Equation 40, the following holds: E tr (H1/λ1 + Id) 1Θ(H1/λ1 + Id) 1Σ1 = G0,17, (99) with G0,17 = λ1 tr (Σ1G2,11 + λ1Id) 1 (λ1Θ + Σ1G2,15) (Σ1G6,15 + λ1Id) 1 Σ1. (100) We will now obtain the fixed-point equations satisfied by G2,11 and G6,15. We observe that: G2,11 = λ1 λ1 + ϕ1G3,10 , G3,10 = λ1 tr Σ1(Σ1G2,11 + λ1Id) 1 (101) = G2,11 = 1 1 + ϕ1 tr Σ1(Σ1G2,11 + λ1Id) 1 , (102) G6,15 = λ1 λ1 + ϕ1G7,14 , G7,14 = λ1 tr Σ1 (Σ1G6,15 + λ1Id) 1 (103) = G6,15 = 1 1 + ϕ1 tr Σ1 (Σ1G6,15 + λ1Id) 1 . (104) We recognize that we must have the identification e1 = G2,11 = G6,15, where e1 0. Therefore: 1 + ϕ1 tr Σ1 (Σ1e1 + λ1Id) 1 , (105) i.e., κ1 = λ1 + κ1ϕ1 df (1) 1 (κ1). (106) Additionally: G2,15 = λ1ϕ1G3,14 ( λ1 + ϕ1G3,10)( λ1 + ϕ1G7,14) = ϕ1e2 1 G3,14 λ1 = tr (Σ1G2,11 + λ1Id) 2(Σ1G2,15 + λ1Θ)Σ1 (108) e2 1 df (1) 2 (κ1) + λ1 e2 1 tr (Σ1 + κ1Id) 2ΘΣ1, (109) = G2,15 = ϕ1G2,15 df (1) 2 (κ1) + λ1ϕ1 tr (Σ1 + κ1Id) 2ΘΣ1, (110) i.e., G2,15 = λ1ϕ1 1 ϕ1 df (1) 2 (κ1) tr (Σ1 + κ1Id) 2ΘΣ1. (111) G0,17 = κ2 1 tr (Σ1 + κ1Id) 2 ΘΣ1 + κ2 1 df (1) 2 (κ1)G2,15 = κ2 1 tr (Σ1 + κ1Id) 2 ΘΣ1 + κ2 1 ϕ1 df (1) 2 (κ1) 1 ϕ1 df (1) 2 (κ1) tr (Σ1 + κ1Id) 2ΘΣ1 (113) 1 + ϕ1 df (1) 2 (κ1) 1 ϕ1 df (1) 2 (κ1) κ2 1 tr (Σ1 + κ1Id) 2 ΘΣ1. 
(114) In conclusion: B1( bf1) = κ2 1 tr (Σ1 + κ1Id) 2 ΘΣ1 1 ϕ1 df (1) 2 (κ1) . (115) Following similar steps for B2( bf2), we get: B2( bf2) = κ2 2 tr (Σ2 + κ2Id) 2 (Θ + )Σ2 1 ϕ2 df (2) 2 (κ2) . (116) We observe that in the unregularized case (i.e., λs = 0), κs = 0. In this setting, Bs( bfs) = 0 as expected. Published as a conference paper at ICLR 2025 F PROOF OF THEOREM D.1 Proof. We define M = X X + nλId. Note that one has: bw = M 1(M1w 1 + X 1 E1 + M2w 2 + X 2 E2). (117) We deduce that Rs( bf) = Bs( bf) + Vs( bf), where: Bs( bf) = E M 1Ms w s + M 1Msw s w s 2 Σs, (118) Vs( bf) = E M 1(X 1 E1 + X 2 E2) 2 Σs (119) = E M 1X 1 E1 2 Σs + E M 1X 2 E2 2 Σs, (120) with s = 2, s = 1 1, s = 2. F.1 VARIANCE TERMS Note that Vs( bf) of the test error of bf evaluated on group s is given by: Vs( bf) = σ2 1E tr X1M 1Σs M 1X 1 + σ2 2E tr X2M 1Σs M 1X 2 (121) = σ2 1E tr M 1M1M 1Σs + σ2 2E tr M 1M2M 1Σs. (122) We can re-express this as: n Vs( bf) = σ2 1E tr (H + λId) 1H1(H + λId) 1Σs + σ2 2E tr (H + λId) 1H2(H + λId) 1Σs,(123) where H = H1 + H2, Hs = X s Xs/n, and Xs = ZsΣ1/2 s with Z1 Rn1 d and Z2 Rn2 d being independent random matrices with IID entries from N(0, 1). WLOG, we focus on tr (H + λId) 1H2(H + λId) 1Σs. The matrix of interest has a linear pencil representation given by (with zero-based indexing): (H1/λ + H2/λ + Id) 1(H2/λ)(H1/λ + H2/λ + Id) 1Σs = Q 1 1,8, (124) Published as a conference paper at ICLR 2025 where the linear pencil Q is defined as follows: Id 0 1 λ n Z 2 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 Id 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In2 1 λ n Z2 0 0 0 0 0 0 0 0 0 0 0 1 2 2 0 0 0 0 0 0 0 0 0 0 1 2 2 0 0 0 Id Σ 1 2 1 0 0 Σs 0 0 0 0 0 0 0 0 0 0 0 Id 1 λ n Z 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In1 1 λ n Z1 0 0 0 0 0 0 0 1 2 1 0 0 Id 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id Σ 1 2 1 0 0 Σ 1 2 2 0 0 0 0 0 0 0 0 0 0 0 Id 1 λ n Z 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In1 1 λ n Z1 0 0 0 0 0 0 0 0 0 0 0 Σ 1 2 1 0 0 Id 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λ n Z 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In2 1 λ n Z2 0 0 0 0 0 0 0 0 Σ 1 2 2 0 0 0 0 0 Id Using OVFPT, we deduce that, in the limit given by Equation 40, the following holds: E tr (H1 + H2 + λId) 1H2(H1 + H2 + λId) 1Σs = G1,23 λ = λ 1 tr p2Σ2(λΣs G0,15 + λG0,27Id p1Σ1G0,15G5,24 + p1Σ1G0,27G5,20) (127) (p1Σ1G5,20 + p2Σ2G0,15 + λId) 2. (128) Published as a conference paper at ICLR 2025 By identifying identical entries of Q 1, we must have that G5,20 λ . For G6,21 and G2,17, we observe that: G6,21 = λ λ + ϕG7,20 , G7,20 = λ tr Σ1 (p1Σ1G6,21 + p2Σ2G2,17 + λId) 1 (129) = G6,21 = 1 1 + ϕ tr Σ1 (p1Σ1G6,21 + p2Σ2G2,17 + λId) 1 , (130) G2,17 = λ λ + ϕG3,15 , G3,15 = λ tr Σ2 (p1Σ1G6,21 + p2Σ2G2,17 + λId) 1 (131) = G2,17 = 1 1 + ϕ tr Σ2 (p1Σ1G6,21 + p2Σ2G2,17 + λId) 1 . (132) We define η1 = G6,21 λ , η2 = G2,17 λ , with η1 0, η2 0. Therefore: ηs = 1 λ + ϕ tr Σs K 1 , (133) where K = η1p1Σ1 + η2p2Σ2 + Id. Additionally, by identifying identical entries of Q 1, we must have that G5,24 = G6,25, G0,27 = G2,28. We observe that: G10,25 = λ λ + ϕG11,24 , (134) G6,25 = λϕG7,24 ( λ + ϕG7,20)( λ + ϕG11,24) = ϕλ2η2 1 G7,24 λ = λ 2 tr K 2(p1Σ1G6,25 + p2Σ2G2,28 λΣs)Σ1, (136) = G6,25 = ϕη2 1 tr K 2(p1Σ1G6,25 + p2Σ2G2,28 λΣs)Σ1, (137) G13,28 = λ λ + ϕG14,27 , (138) G2,28 = λϕG3,27 ( λ + ϕG3,15)( λ + ϕG14,27) = ϕλ2η2 2 G3,27 λ = λ 2 tr K 2(p1Σ1G6,25 + p2Σ2G2,28 λΣs)Σ2, (140) = G2,28 = ϕη2 2 tr K 2(p1Σ1G6,25 + p2Σ2G2,28 λΣs)Σ2. (141) We now define v(s) 1 = G6,25, v(s) 2 = G2,28, with v(s) 1 0, v(s) 2 0. 
Therefore, v(s) 1 , v(s) 2 obey the following system of equations: v(s) k = ϕη2 k tr K 2(v(s) 1 p1Σ1 + v(s) 2 p2Σ2 + λΣs)Σk. (142) We further define u(s) k = v(s) k λ . Putting all the pieces together: λ = λ 1 tr p2Σ2 η2Σs u(s) 2 Id + p1Σ1(η2u(s) 1 η1u(s) 2 ) K 2. (143) By symmetry, in conclusion: Vs( bf) = V (1) s ( bf) + V (2) s ( bf), (144) V (k) s ( bf) = λ 1ϕσ2 k tr pkΣk ηkΣs u(s) k Id + pk Σk (ηku(s) k ηk u(s) k ) K 2, (145) with k = 2, k = 1 1, k = 2. Published as a conference paper at ICLR 2025 We now corroborate our result in the limit p2 1 (i.e., p1 0) and s = 2. We observe that: ϕ ϕ2, λ λ2, (146) V (1) 2 ( bf) = 0, (147) V (2) 2 ( bf) λ 1ϕ2σ2 2 = tr Σ2(η2Σ2 u(2) 2 Id)K 2 (148) v(2) 2 = ϕ2η2 2 tr K 2(v(2) 2 Σ2 + λ2Σ2)Σ2 (149) = ϕ2(v(2) 2 + λ2) df (2) 2 (κ2), (150) u(2) 2 = ϕ2 df (2) 2 (κ2) 1 ϕ2 df (2) 2 (κ2) , (151) V (2) 2 ( bf) λ 1ϕ2σ2 2 = κ2 df (2) 2 (κ2) u(2) 2 tr Σ2(η2Σ2 + Id) 2 (152) = κ2 df (2) 2 (κ2) κ2 2u(2) 2 tr Σ2(Σ2 + κ2Id) 2 (153) = κ2 df (2) 2 (κ2) κ2u(2) 2 ( df (2) 1 (κ2) df (2) 2 (κ2)) (154) = κ2(1 + u(2) 2 ) df (2) 2 (κ2) κ2u(2) 2 df (2) 1 (κ2) (155) = κ2 κ2ϕ2 df (2) 1 (κ2) 1 ϕ2 df (2) 2 (κ2) df (2) 2 (κ2) (156) = λ df (2) 2 (κ2) 1 ϕ2 df (2) 2 (κ2) , (157) V (2) 2 ( bf) = σ2 2ϕ2 df (2) 2 (κ2) 1 ϕ2 df (2) 2 (κ2) , (158) which exactly recovers the result for V2( bf2) as expected. F.2 BIAS TERMS Recall that: Bs( bf) = E M 1Ms w s + M 1Msw s w s 2 Σs. (159) Now, observe that M 1M1w 1 w 1 = M 1M1w 1 M 1Mw 1 = M 1M2w 1 nλM 1w 1. Let δ = w 2 w 1. Then: Bs( bf) = E M 1Ms ( 1)s 1δ nλM 1w s 2 Σs (160) = E tr δ Ms M 1Σs M 1Ms δ (161) 2( 1)s 1nλE tr δ Ms M 1Σs M 1w s (162) + n2λ2E tr (w s) M 1Σs M 1w s (163) = B(1) s ( bf) 2( 1)s 1B(2) s ( bf) + B(3) s ( bf), (164) B(1) s ( bf) = E tr (H1/λ + H2/λ + Id) 1(Hs /λ) (Hs /λ)(H1/λ + H2/λ + Id) 1Σs, (165) B(2) s ( bf) = E tr δ (Hs /λ)(H1/λ + H2/λ + Id) 1Σs(H1/λ + H2/λ + Id) 1w s, (166) B(3) s ( bf) = E tr (H1/λ + H2/λ + Id) 1Θs(H1/λ + H2/λ + Id) 1Σs. (167) Because δ and w 1 are independent and sampled from zero-centered distributions: B(2) 1 ( bf) = 0, (168) B(2) 2 ( bf) = E tr (H1/λ + H2/λ + Id) 1 (H1/λ)(H1/λ + H2/λ + Id) 1Σ2. (169) Published as a conference paper at ICLR 2025 WLOG, for B(1) s , we focus on the case s = 1. The matrix of interest has a linear pencil representation given by (with zero-based indexing): (H1/λ + H2/λ + Id) 1(H2/λ) (H2/λ)(H1/λ + H2/λ + Id) 1Σ1 = Q 1 1,16, (170) where the linear pencil Q is defined as follows: 1 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λ n Z 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In2 1 λ n Z2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id Σ 1 2 2 0 0 0 0 0 0 0 0 0 0 0 1 2 2 0 0 Id Σ 1 2 1 0 0 Σ1 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λ n Z 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In1 1 λ n Z1 0 0 0 0 0 0 0 0 0 0 0 0 0 Σ 1 2 1 0 0 Id 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id Σ 1 2 1 0 0 Σ 1 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λ n Z 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In1 1 λ n Z1 0 0 0 0 0 0 0 0 0 0 0 0 0 Σ 1 2 1 0 0 Id 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λ n Z 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In2 1 λ n Z2 0 0 0 0 0 0 0 0 0 0 Σ 1 2 2 0 0 0 0 0 Id Σ 1 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id Using OVFPT, we deduce that, in the limit given by Equation 40, the following holds: E tr (H1/λ + H2/λ + Id) 1(H2/λ) (H2/λ)(H1/λ + H2/λ + Id) 1Σ1 = G1,33, (172) G1,33 = λ 1 tr p2Σ2 (p2Σ2G2 2,19(λ p1G6,27)Σ1 G2,30(p1Σ1G6,23 + λId)2) (p1Σ1G6,23 + p2Σ2G2,19 + λId) 2. Published as a conference paper at ICLR 2025 By identifying identical entries of Q 1, we must have that η1 = G6,23 λ , η2 = G2,19 λ . 
For G7,24 and G3,20, we observe that: G7,24 = λ λ + ϕG8,23 , G8,23 = λ tr Σ1 (p1Σ1G7,24 + p2Σ2G3,20 + λId) 1 (174) = G7,24 = 1 1 + ϕ tr Σ1 (p1Σ1G7,24 + p2Σ2G3,20 + λId) 1 , (175) G3,20 = λ λ + ϕG4,19 , G4,19 = λ tr Σ2 (p1Σ1G7,24 + p2Σ2G3,20 + λId) 1 (176) = G3,20 = 1 1 + ϕ tr Σ2 (p1Σ1G7,24 + p2Σ2G3,20 + λId) 1 . (177) By again identifying identical entries of Q 1, we further have that v(1) 1 = G6,27 = G7,28, v(1) 2 = G2,30 = G3,31. We observe that: G7,28 = ϕλ2η2 1 G8,27 λ = λ 2 tr K 2(p1Σ1G7,28 + p2Σ2G3,31 λΣ1)Σ1 (179) = v(1) 1 = ϕη2 1 tr K 2(v(s) 1 p1Σ1 + v(s) 2 p2Σ2 + λΣ1)Σ1, (180) G3,31 = ϕλ2η2 2 G4,30 λ = λ 2 tr K 2(p1Σ1G7,28 + p2Σ2G3,31 λΣ1)Σ2, (182) = v(1) 2 = ϕη2 2 tr K 2(v(s) 1 p1Σ1 + v(s) 2 p2Σ2 + λΣ1)Σ2. (183) Putting all the pieces together: B(1) 1 ( bf) = tr p2Σ2 (p2η2 2Σ2(1 + p1u(s) 1 )Σ1 + u(s) 2 (p1η1Σ1 + Id)2)K 2. (184) In conclusion: B(1) s ( bf) = tr ps Σs (ps η2 s Σs (1 + psu(s) s )Σs + u(s) s (psηsΣs + Id)2)K 2. (185) Now, switching our focus to B(2) 2 ( bf), the matrix of interest has a linear pencil representation given by (with zero-based indexing): (H1/λ + H2/λ + Id) 1 (H1/λ)(H1/λ + H2/λ + Id) 1Σ2 = Q 1 0,15, (186) Published as a conference paper at ICLR 2025 where the linear pencil Q is defined as follows: 1 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λ n Z 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In1 1 λ n Z1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id Σ 1 2 1 0 0 0 0 0 0 0 0 0 0 1 2 1 0 0 Id Σ 1 2 2 0 0 Σ2 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λ n Z 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In2 1 λ n Z2 0 0 0 0 0 0 0 0 0 0 0 0 Σ 1 2 2 0 0 Id 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id Σ 1 2 1 0 0 Σ 1 2 2 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λ n Z 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In1 1 λ n Z1 0 0 0 0 0 0 0 0 0 0 0 0 Σ 1 2 1 0 0 Id 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λ n Z 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In2 1 λ n Z2 0 0 0 0 0 0 0 0 0 Σ 1 2 2 0 0 0 0 0 Id Like before, the following holds: E tr (H1/λ + H2/λ + Id) 1 (H1/λ)(H1/λ + H2/λ + Id) 1Σ2 = G1,25, (188) G1,25 = tr p1Σ1 (λΣ2G2,18 + λG2,26Id p2Σ2G2,18G6,29 + p2Σ2G2,26G6,22) (p1Σ1G2,18 + p2Σ2G6,22 + λId) 2 (189) Published as a conference paper at ICLR 2025 By identifying identical entries of Q 1, we must have that η1 = G2,18 λ , η2 = G6,22 λ . For G3,19 and G7,23, we observe that: G3,19 = λ λ + ϕG4,18 , G4,18 = λ tr Σ1 (p1Σ1G3,19 + p2Σ2G7,23 + λId) 1 (190) = G3,19 = 1 1 + ϕ tr Σ1 (p1Σ1G3,19 + p2Σ2G7,23 + λId) 1 , (191) G7,23 = λ λ + ϕG8,22 , G8,22 = λ tr Σ2 (p1Σ1G3,19 + p2Σ2G7,23 + λId) 1 (192) = G7,23 = 1 1 + ϕ tr Σ2 (p1Σ1G3,19 + p2Σ2G7,23 + λId) 1 . (193) By again identifying identical entries of Q 1, we further have that v(2) 1 = G2,26 = G3,27, v(2) 2 = G6,29 = G7,30. We observe that: G3,27 = ϕλ2η2 1 G4,26 λ = λ 2 tr K 2(p1Σ1G3,27 + p2Σ2G7,30 λΣ2)Σ1, (195) = v(2) 1 = ϕη2 1 tr K 2(v(2) 1 p1Σ1 + v(2) 2 p2Σ2 + λΣ2)Σ1, (196) G7,30 = ϕλ2η2 2 G8,29 λ = λ 2 tr K 2(p1Σ1G3,27 + p2Σ2G7,30 λΣ2)Σ2, (198) = v(2) 2 = ϕη2 2 tr K 2(v(2) 1 p1Σ1 + v(2) 2 p2Σ2 + λΣ2)Σ2. (199) Putting all the pieces together: B(1) 2 ( bf) = 0, (200) B(2) 2 ( bf) = tr p1Σ1 η1Σ2 u(2) 1 Id + p2Σ2(η1u(2) 2 η2u(2) 1 ) K 2. 
(201) Finally, switching our focus to B(3) 1 ( bf), the matrix of interest has a linear pencil representation given by (with zero-based indexing): (H1/λ + H2/λ + Id) 1Θ(H1/λ + H2/λ + Id) 1Σ1 = Q 1 1,8, (202) Published as a conference paper at ICLR 2025 where the linear pencil Q is defined as follows: 1 2 1 0 0 Σ 1 2 2 0 0 Σ1 0 0 0 0 0 0 Θ Id 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λ n Z 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In1 1 λ n Z1 0 0 0 0 0 0 0 0 0 0 1 2 1 0 0 0 Id 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λ n Z 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In2 1 λ n Z2 0 0 0 0 0 0 0 1 2 2 0 0 0 0 0 0 Id 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id Σ 1 2 1 0 0 Σ 1 2 2 0 0 0 0 0 0 0 0 0 0 0 Id 1 λ n Z 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In1 1 λ n Z1 0 0 0 0 0 0 0 0 0 0 0 Σ 1 2 1 0 0 Id 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λ n Z 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In2 1 λ n Z2 0 0 0 0 0 0 0 0 Σ 1 2 2 0 0 0 0 0 Id The following holds: (H1/λ + H2/λ + Id) 1Θ(H1/λ + H2/λ + Id) 1Σ1 = G1,23, (204) with G1,23 = λ tr Θ( p1Σ1G2,24 p2Σ2G5,27 + λΣ1)(p1Σ1G2,17 + p2Σ2G5,20 + λId) 2. By identifying identical entries of Q 1 and following similar steps as before, we must have the identification η1 = G2,17 λ , η2 = G5,20 λ , as well as v(1) 1 = G2,24, v(1) 2 = G5,27. Therefore, in conclusion: B(3) s ( bf) = tr Θs(p1u(s) 1 Σ1 + p2u(s) 2 Σ2 + Σs)K 2. (205) Published as a conference paper at ICLR 2025 In the limit ps 1 (i.e., ps 0), we observe that: ϕ ϕs, λ λs, (206) B(1) s ( bf) 0, (207) B(2) s ( bf) 0, (208) B(3) s ( bf) = tr Θs(u(s) s + 1)Σs K 2, (209) v(s) s = ϕsη2 s tr K 2(v(s) s + λs)Σ2 s (210) = ϕs(v(s) s + λs) df (s) 2 (κs) (211) u(s) s = ϕs df (s) 2 (κs) 1 ϕs df (s) 2 (κs) , (212) B(3) s ( bf) = κ2 s tr ΘsΣs(Σs + κs Id) 2 1 ϕs df (s) 2 (κs) , (213) Bs( bf) B(3) s ( bf), (214) which matches up exactly with Bs( bfs) as expected. Published as a conference paper at ICLR 2025 G PROOF OF THEOREM 3.1 Proof. The gradient of the loss L is given by: s S X s (Xs Sη Ys)/n + λη = X s S Ms Sη X s S X s Ys/n + λη s S X s Ys/n, where H = S MS + λIm Rm m, with M = M1 + M2 and Ms = X s Xs/n. Thus, setting R = H 1, we may write: bw = Sbη = SRS (X 1 Y1 + X 2 Y2)/n = SRS (M1w 1 + M2w 2) + SRS X 1 E1/n + SRS X 2 E2/n. We deduce the following bias-variance decomposition: E bw w s 2 Σs = Bs( bf) + Vs( bf), where Vs( bf) = V (1) s ( bf) + V (2) s ( bf), with V (j) s ( bf) = σ2 j ϕE tr Mj SRS Σs SRS , Bs( bf) = E SRS (M1w 1 + M2w 2) w s 2 Σs. We can further decompose Bs( bf), first considering the case s = 1. We define δ = w 2 w 1. E SRS (M1w 1 + M2w 2) w 1 2 Σ1 = E (SRS (M1 + M2) Id)w 1 + SRS M2δ 2 Σ1 = E (SRS M Id)w 1 2 Σ1 + E SRS M2δ 2 Σ1 = E tr Θ(MSRS Id)Σ1(SRS M Id) + E tr M2SRS Σ1SRS M2 = E tr ΘΣ1 + E tr ΘMSRS Σ1SRS M 2E tr ΘΣ1SRS M + E tr M2SRS Σ1SRS M2. We can similarly decompose B2: E SRS (M1w 1 + M2w 2) w 2 2 Σ2 = E SRS (M1w 1 + M2w 2) w 2 2 Σ2 = E (SRS (M1 + M2) Id)w 2 SRS M1δ 2 Σ2 = E (SRS M Id)w 2 2 Σ2 + E SRS M1δ 2 Σ2 2E tr (w 2) (MSRS Id)Σ2SRS M1δ = E tr Θ2(MSRS Id)Σ2(SRS M Id) + E tr M1SRS Σ2SRS M1 2E tr (MSRS Id)Σ2SRS M1 = E tr Θ2Σ2 + E tr Θ2MSRS Σ2SRS M 2E tr Θ2Σ2SRS M + E tr M1SRS Σ2SRS M1 2E tr MSRS Σ2SRS M1 + 2E tr Σ2SRS M1. Furthermore, we observe that: E tr AMSRS BSRS M (215) = E tr AM1SRS BSRS M1 + E tr AM2SRS BSRS M2 + 2E tr AM1SRS BSRS M2, (216) E tr ASRS M = E tr ASRS M1 + E tr ASRS M2. 
(217) Hence, we desire deterministic equivalents for the following expressions: r(1) j (A) = ASRS M j, (218) r(2) j (A, B) = AM j SRS BSRS , (219) r(3) j (A, B) = AM j SRS BSRS M j, (220) r(4) j (A, B) = AM j SRS BSRS M j , (221) Published as a conference paper at ICLR 2025 M j = Σ1/2 j Z j ZjΣ1/2 j , R = (S MS + Im) 1, M = M 1 + M 2, (222) M j = Mj/λ, R = λR, M = M/λ. (223) In summary: V (j) s ( bf) = σ2 j ϕλ 1E tr r(2) j (Id, Σs), (224) Bs( bf) = tr ΘsΣs (225) + E tr r(3) 1 (Θs, Σs) + E tr r(3) 2 (Θs, Σs) + 2E tr r(4) 1 (Θs, Σs) (226) 2E tr r(1) 1 (ΘsΣs) 2E tr r(1) 2 (ΘsΣs) (227) + E tr r(3) s ( , Σs) (228) ( 0, s = 1, E tr r(3) 1 ( , Σ2) + E tr r(4) 2 ( , Σ2) E tr r(1) 1 ( Σ2), s = 2 . (229) G.1 COMPUTING E tr r(1) j WLOG, we focus on r(1) 1 . The matrix of interest has a linear pencil representation given by (with zero-based indexing): r(1) 1 = Q 1 1,10, (230) where the linear pencil Q is defined as follows: Id 0 S 0 0 0 0 0 0 0 0 A Id 0 0 0 0 0 0 0 0 0 0 0 Im S 0 0 0 0 0 0 0 1 2 1 0 0 Σ 1 2 2 0 0 0 0 0 0 0 Id 1 λZ 1 0 0 0 0 0 0 0 0 0 0 In1 1 λZ1 0 0 0 0 1 2 1 0 0 0 0 0 Id 0 0 0 Σ 1 2 1 0 0 0 0 0 0 0 Id 1 λZ 2 0 0 0 0 0 0 0 0 0 0 In2 1 1 2 2 0 0 0 0 0 0 0 0 Id 0 0 0 0 0 0 0 0 0 0 0 Id Using the tools of OVFPT, the following holds: E tr r(1) 1 = G1,21, (232) G1,21 = tr γp1Σ1AG2,13G5,16(γG2,13(p1Σ1G5,16 + p2G8,19) + λId) 1. (233) For G5,16 and G8,19, we observe that: G5,16 = λ λ + ϕG6,15 , G6,15 = λγG2,13 tr Σ1(γG2,13(p1Σ1G5,16 + p2Σ2G8,19) + λId) 1, = G5,16 = 1 1 + ψG2,13 tr Σ1(γG2,13(p1Σ1G5,16 + p2Σ2G8,19) + λId) 1 , (235) G8,19 = λ λ + ϕG9,18 , G9,18 = λγG2,13 tr Σ2(γG2,13(p1Σ1G5,16 + p2Σ2G8,19) + λId) 1, G8,19 = 1 1 + ψG2,13 tr Σ2(γG2,13(p1Σ1G5,16 + p2Σ2G8,19) + λId) 1 . (237) Published as a conference paper at ICLR 2025 We define e1 = G5,16, e2 = G8,19, with e1 0, e2 0. We further observe that: G2,13 = 1 1 + G3,11 , (238) G3,11 = tr (p1Σ1G5,16 + p2Σ2G8,19)(γG2,13(p1Σ1G5,16 + p2Σ2G8,19) + λId) 1. (239) We define τ = G2,13 0. We further define L = p1e1Σ1 + p2e2Σ2, K = γτL + λId. Therefore, we have the following system of equations: es = 1 1 + ψτ tr Σs K 1 , τ = 1 1 + tr LK 1 . (240) In conclusion: E tr r(1) j = pjγejτ tr AΣj K 1. (241) G.2 COMPUTING E tr r(2) j WLOG, we focus on r(2) 1 . 
The matrix of interest has a linear pencil representation given by (with zero-based indexing): r(2) 1 = Q 1 1,13, (242) Published as a conference paper at ICLR 2025 where the linear pencil Q is defined as follows: 1 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 A Id 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λZ 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In1 1 λZ1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id Σ 1 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Im S 0 0 0 0 0 0 0 0 0 0 0 0 1 2 1 0 0 0 0 Id Σ 1 2 2 0 0 B 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λZ 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In2 1 λZ2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Σ 1 2 2 0 0 0 0 Id 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Im S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id Σ 1 2 1 0 0 Σ 1 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λZ 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In1 1 λZ1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Σ 1 2 1 0 0 0 0 Id 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λZ 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In2 1 λZ2 0 0 0 0 0 0 0 0 0 0 0 Σ 1 2 2 0 0 0 0 0 0 0 Id The following holds: E tr r(2) 1 = G1,33, (244) G1,33 = p1 tr AΣ1P1P 1 2 , (245) P1 = γλBG3,23G6,26G12,32 γp2Σ2G3,23G6,26G9,38G12,32 (246) + γp2Σ2G3,35G6,26G9,29G12,32 + λG3,23G6,32Id + λG3,15G12,32Id, (247) P2 = (γG6,26(p1Σ1G3,23 + γp2Σ2G9,29) + λId) (248) (γG12,32(p1Σ1G15,35 + p2Σ2G18,38) + λId). (249) Following similar steps as before and recognizing identifications, we arrive at that: e1 = G3,23 = G15,35, (250) e2 = G9,29 = G18,38, (251) τ = G6,26 = G12,32. (252) Published as a conference paper at ICLR 2025 We now focus on the remaining terms. We observe that: G3,35 = ϕe2 1 G4,14 λ = γ tr Σ1(γτ 2(p1Σ1G3,35 + p2Σ2G9,38 λB) λG6,32Id)K 2, (254) G9,38 = ϕe2 2 G10,37 λ = γ tr Σ2(γτ 2(p1Σ1G3,35 + p2Σ2G9,38 λB) λG6,32Id)K 2. (256) We define u1 = G3,35 λ , u2 = G9,38 λ , with u1 0, u2 0. We further define D = p1u1Σ1 + p2u2Σ2 + B. We now observe that: G6,32 = G7,31 (G7,25 + 1)(G13,31 + 1) = τ 2G7,31, (257) G7,31 = tr (γG6,32L2 + λ2D)K 2. (258) Defining ρ = G6,32, we must have the following system of equations: us = ψe2 s tr Σs(γτ 2D + ρId)K 2, (259) ρ = τ 2 tr (γρL2 + λ2D)K 2. (260) In conclusion: P2 = K2, (261) P1 = λγe1τ 2B + λγτ 2p2Σ2(e1u2 e2u1) + λe1ρId λ2u1τId, (262) E tr r(2) j = λpjγ tr AΣj(γejτ 2B + γτ 2pj Σj (ejuj ej uj) + ejρId λujτId)K 2. (263) G.3 COMPUTING E tr r(3) j WLOG, we focus on r(3) 1 . 
The matrix of interest has a linear pencil representation given by (with zero-based indexing): r(3) 1 = Q 1 1,20, (264) Published as a conference paper at ICLR 2025 where the linear pencil Q is defined as follows: 1 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 A Id 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λZ 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In1 1 λZ1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id Σ 1 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Im S 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 1 0 0 0 0 Id Σ 1 2 2 0 0 B 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λZ 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In2 1 λZ2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Σ 1 2 2 0 0 0 0 Id 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Im S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id Σ 1 2 1 0 0 Σ 1 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λZ 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In1 1 λZ1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Σ 1 2 1 0 0 0 0 Id 0 0 0 Σ 1 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λZ 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In2 1 λZ2 0 0 0 0 0 0 0 0 0 0 0 0 Σ 1 2 2 0 0 0 0 0 0 0 Id 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id It holds that E tr r(3) 1 = G1,41. We immediately observe that: e1 = G3,24, G15,36, (266) e2 = G9,30, G18,39, (267) τ = G6,27, G12,33, (268) ρ = G6,33. (271) In conclusion: E tr r(3) j = pj tr AΣj(γe2 jpjΣj(γτ 2uj pj Σj + γτ 2B + ρId) + uj(γej τpj Σj + λId)2)K 2. (272) Published as a conference paper at ICLR 2025 G.4 COMPUTING E tr r(4) j WLOG, we focus on r(4) 1 . The matrix of interest has a linear pencil representation given by (with zero-based indexing): r(4) 1 = Q 1 1,20, (273) where the linear pencil Q is defined as follows: 1 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 A Id 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λZ 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In1 1 λZ1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id Σ 1 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Im S 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 1 0 0 0 0 Id Σ 1 2 2 0 0 B 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λZ 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In2 1 λZ2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Σ 1 2 2 0 0 0 0 Id 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Im S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id Σ 1 2 1 0 0 Σ 1 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λZ 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In1 1 λZ1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Σ 1 2 1 0 0 0 0 Id 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id 1 λZ 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 In2 1 λZ2 0 0 0 0 0 0 0 0 0 0 0 0 Σ 1 2 2 0 0 0 0 0 0 0 Id Σ 1 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Id It holds that E tr r(4) 1 = G1,41. We immediately observe that: e1 = G3,24, G15,36, (275) e2 = G9,30, G18,39, (276) τ = G6,27, G12,33, (277) ρ = G6,33. (280) Published as a conference paper at ICLR 2025 In conclusion: E tr r(4) j = pjγpj tr ΣjΣj A(γτ 2(Bejej pjΣje2 juj pj Σj e2 j uj) (281) λτ(ejuj + ej uj)Id + ejej ρId)K 2. (282) Published as a conference paper at ICLR 2025 H THEOREM 3.2 Definition H.1. Let (e1, e2, τ1, τ2, u1, u2, ρ1, ρ2) is be unique positive solution to the following system of fixed-point equations: es = 1 1 + ψsτs tr Σs(γτsesΣs + λs Id) 1 , for s {1, 2} (283) τs = 1 1 + tr esΣs(γτsesΣs + λs Id) 1 , for s {1, 2} (284) us = ψse2 s tr Σs(γτ 2 s (us + 1)Σs + ρs Id)(γτsesΣs + λs Id) 2, for s {1, 2} (285) ρs = τ 2 s tr (γρs(esΣs)2 + λ2 s(us + 1)Σs)(γτsesΣs + λs Id) 2, for s {1, 2}. 
(286) For deterministic d d PSD matrices A and B, we define the following auxiliary quantities: h(1) j (A) := γejτj tr AΣj(γτjejΣj + λj Id) 1, (287) h(2) j (A) := γ tr AΣj(γejτ 2 j Σj + ejρj Id λjujτj Id)(γτjejΣj + λj Id) 2, (288) h(3) j (A) := tr AΣj(γe2 jΣj(γτ 2 j Σj + ρj Id) + λ2 juj Id)(γτjejΣj + λj Id) 2. (289) Under Assumptions B.2 and 3.1, it holds that: Rs( bfs) Bs( bfs) + Vs( bfs), with Vs( bfs) = lim ps 1 Vs( bf), Bs( bfs) = lim ps 1 Bs( bf). (290) More explicitly: Vs( bfs) = σ2 sϕsh(2) s (Id), Bs( bfs) = tr ΘsΣs + h(3) s (Θs) 2h(1) s (ΘsΣs). (291) Proof. Theorem 3.2 follows from Theorem 3.1 in the limit ps 1 (i.e., ps 0). The scalars es, τs, us, ρs can be intuitively interpreted in the setting where a separate model is learned for each group. For ridge regression with random projections and λ 0+, we show in Equations 337 and 338 that es, τs are related to the normalized first-order degrees of freedom I1,1 of the population covariance matrix Σs. es captures the effect of the feature rate ϕs while τ captures the effect of the parameterization rate γ. Similarly, for classical ridge regression, we show in Equation 69 that es is related to the normalized first-order degrees of freedom df (s) 1 . On the other hand, us and ρs can be understood as pseudo-variances. Indeed, for ridge regression with random projections and λ 0+, Equations 341 and 349 show that us, ρs, Vs are all related to the normalized second-order degrees of freedom I2,2 of Σs. Published as a conference paper at ICLR 2025 I SOLVING FIXED-POINT EQUATIONS FOR THEOREM D.1 I.1 PROPORTIONAL COVARIANCE MATRICES When λ 0+, it is not possible to analytically solve the fixed-point equations for the constants in Definition 3.1 for general Σ1, Σ2. As such, we consider a more tractable case where the covariance matrices are proportional, i.e., Σ1 = a1Σ and Σ2 = a2Σ, for some Σ Rd d. We define θ = λ γτ(a1p1e1+a2p2e2) and η = tr Σ(Σ + θId) 1. Then, we have that: 1/es = e s = 1 + ψτ tr Σs K 1 = 1 + ϕasη a1p1e1 + a2p2e2 , (292) 1/τ = τ = 1 + tr LK 1 = 1 + (η/γ)τ = 1 1 η/γ . (293) If θ0 = 0, then η0 = 1. Therefore, e s 1+ ϕas a1p1e1+a2p2e2 , which is a quadratic fixed-point equation. Accounting for the constraint that es > 0, the fixed-point equation requires that ϕ < 1. Moreover, τ 1 1/γ, which requires that γ > 1. We further observe that ρ (τ 2 tr γL2K 2)ρ, which implies that ρ 0. We can then see that, for c {a1, a2}: us ϕγ2τ 2e2 sas(a1p1u1 + a2p2u2 + c) tr Σ2K 2 (294) = ϕe2 sas(a1p1u1 + a2p2u2 + c) (a1p1e1 + a2p2e2)2 , (295) which is a linear fixed-point equation in us. In contrast, if θ0 > 1, we have e s = 1 + ψτasηθ λ and the equation: (1 η/γ) a1p1 1+ ψ(1 η/γ)a1ηθ λ + a2p2 1+ ψ(1 η/γ)a2ηθ which is a quartic equation in η. This highlights the difficulties of rigorously isolating the effects of different components on bias amplification. We empirically investigate how different components (e.g., covariance structures, group sizes) affect bias amplification and minority-group bias in Sections 4 and 5, and extensively validate that our theory predicts these implications. I.2 THE GENERAL REGULARIZED CASE We now consider the case where the covariance structure is the same for both groups, i.e., Σ1 = Σ2 = Σ. The calculations presented is this subsection are shared with Appendix H.2 of Dohmatob et al. (2024b), of which two authors of this work are also authors. 
In this setting, it is clear that e1 = e2 = e and u1 = u2 = u, where (τ, e, u, ρ) now satisfy: 1/e = 1 + ψτ tr ΣK 1, 1/τ = 1 + tr K0K 1, where K0 := eΣ, K := γτK0 + λId, (297) u = ψe2 tr Σ1(γτ 2L + ρId)K 2, ρ = τ 2 tr (γρK2 0 + λ2L )K 2, L := (1 + u)Σ. (298) We first introduce some notation related to the degrees of freedom of Σ before proceeding with our theoretical result. Definition I.1. Let dfm(t) = tr Σm (Σ + t Id) m for any positive integer m. Furthermore, define Ia,b(t) = tr Σa(Σ + t Id) b for any positive integers a, b. Lemma I.1. The scalars u and ρ = ρ/(γτ 2) solve the following pair of linear equations: u = ϕI2,2(θ)(1 + u) + ϕI1,2(θ)ρ , γρ = I2,2(θ)ρ + θ2I1,2(θ)(1 + u). (299) Furthermore, the solutions can be explicitly represented as: u = ϕz γ ϕz I2,2(θ), ρ = θ2I2,2(θ) γ ϕz I2,2(θ), (300) Published as a conference paper at ICLR 2025 where z = I2,2(θ)(γ I2,2(θ)) + θ2I1,2(θ)2. In particular, in the limit γ , it holds that: θ κ, ρ 0, u ϕI2,2(κ) 1 ϕI2,2(κ) df2(κ)/n 1 df2(κ)/n, (301) where κ > 0 is uniquely satisfies the fixed-point equation κ λ = κ tr Σ(Σ + κId) 1/n. Proof. The equations defining these scalars are: u = ψe2 tr Σ(γτ 2L + ρId)K 2, (302) ρ = τ 2 tr (γρK2 0 + λ2L )K 2, (303) where K0 = eΣ, K = γτK0 + λId, and L := uΣ + B. Further, since B = Σ, we have L = (1 + u)Σ. Now, we can rewrite the previous equations like so u = ψe2 tr Σ(γτ 2(1 + u)Σ + ρId)K 2 = ϕγ2τ 2e2(1 + u) tr Σ2K 2 + ϕγe2ρ tr ΣK 2, ρ = τ 2 tr (γρe2Σ2 + λ2(1 + u)Σ)K 2 = γτ 2e2ρ tr Σ2K 2 + λ2τ 2(1 + u) tr ΣK 2. This can be equivalently written as: u = ϕ(1 + u)γ2τ 2e2 tr Σ2K 2 + ϕρ γ2τ 2e2 tr ΣK 2, (304) γρ = ρ γ2τ 2e2 tr Σ2K 2 + (1 + u)λ2 tr ΣK 2. (305) Now, observe that: τ 2e2 tr Σ2K 2 = tr Σ2(Σ + θId) 2/γ2 = I2,2(θ)/γ2, (306) τ 2e2 tr ΣK 2 = tr Σ(Σ + θId) 2/γ2 = I1,2(θ)/γ2, (307) λ2 tr ΣK 2 = θ2 tr Σ(Σ + θId) 2 = θ2I1,2(θ), (308) e2 tr ΣK 2 = tr Σ(Σ + θId) 2/(γτ)2 = I1,2(θ)/(γτ)2, (309) τ 2 tr ΣK 2 = tr Σ(Σ + θId) 2/(γe)2 = I1,2(θ)/(γe)2, (310) where we have used the definition θ = λ/(γτe). Thus, u and ρ have limiting values which solve the system of linear equations: u = ψγ γ 2I2,2(θ)(1 + u) + ψγ γ 2I1,2ρ = ϕI2,2(θ)(1 + u) + ϕI1,2(θ)ρ , γρ = I2,2(θ)ρ + θ2I1,2(θ)(1 + u) = I2,2(θ)ρ + θ2I1,2(θ)(1 + u), where we have used the identity ϕγ = ψ. These correspond exactly to the equations given in the lemma. This proves the first part. For the second part, indeed, τ = 1 η0/γ 1 in the limit γ , and so θ λ/(γe) which verifies the equation: θ λ + λψ tr Σ(γeΣ + λ) 1 = λ + ϕ λ γe tr Σ(Σ + λ γe Id) 1 λ + θ tr Σ(Σ + θId) 1/n, i.e., θ λ + θ df1(θ)/n and θ > 0. By comparing with the equation κ λ = κ df1(κ)/n satisfied by κ > 0 in Definition D.2, we conclude θ κ. Now, Equation 299 becomes ρ = 0, and u = ϕI2,2(κ)(1 + u), i.e., u = ϕI2,2(κ) 1 ϕI2,2(κ) df2(κ)/n 1 df2(κ)/n, as claimed. Published as a conference paper at ICLR 2025 I.3 UNREGULARIZED LIMIT The calculations presented in this subsection are shared with Appendix H.2 of Dohmatob et al. (2024b), of which two authors of this work are also authors. Define the following auxiliary quantities: θ := λ γτe, χ := λ where τ, e, u, and ρ are as previously defined. Lemma I.2. In the limit λ 0+, we have the following analytic formulae: χ χ0 = (1 ψ)+ γθ0, (312) κ κ0 = (ψ 1)+ θ0/ϕ, (313) τ τ0 = 1 η0/γ, (314) e e0 = 1 ϕη0, (315) where θ0 is the unique positive solution of the fixed-point equation η0 = I1,1(θ0). Proof. Observe that K0 = eΣ and K = γτK0 + λId = γτe (Σ + θId). 
Defining η := I1,1(θ), one can then rewrite the equations defining e and τ as follows: e = λ + ψτλ tr ΣK 1 = λ + ψτλ γτe tr Σ(Σ + θId) 1 = λ + ϕηe , (316) τ = λ + λ tr K0K 1 = λ + λe γτe tr Σ(Σ + θId) 1 = λ + (η/γ)τ . (317) We deduce that: e = λ 1 ϕη , τ = λ 1 η/γ , τ e = λγθ. (318) In particular, the above means that η min(γ, 1/ϕ). The last part of equations Equation 318 can be rewritten as follows: λ (1 ϕη)(1 η/γ) = γθ, i.e., ϕη2 (ϕγ + 1)η + γ λ θ = 0. (319) This is a quadratic equation for η as a function of λ and θ, with roots η = ϕγ + 1 p (ϕγ + 1)2 4(ϕγ (ϕ/θ)λ) 2ϕ = ψ + 1 p (ψ + 1)2 4(ψ ϕ/θ ) Now, for small λ > 0 and ψ = 1, we can do a Taylor expansion to get: η ψ + 1 |ψ 1| 2ϕ 1 θ|ψ 1|λ + O(λ2). More explicitly: η+ O(λ2) 1/ϕ + λ/((1 ψ)θ), if ψ < 1 γ + λ/((ψ 1)θ), if ψ > 1 , η O(λ2) + γ λ/((1 ψ)θ), if ψ < 1 1/ϕ λ/((ψ 1)θ), if ψ > 1 . Because η min(1, 1/ϕ, γ), we must have the expansion: η O(λ2) + γ λ/((1 ψ)θ), if ψ < 1 1/ϕ + λ/((ψ 1)θ), if ψ > 1 , = η0 1 (1 ψ)θ0 λ + O(λ2), (321) Published as a conference paper at ICLR 2025 provided θ0 > 0, i.e η0 = 1. in this regime, we obtain: τ = λ 1 η/γ λ/(1 1 + λ/((1 ψ)γθ0)) = (1 ψ)γθ0, if ψ 1 λ/(1 1/ψ + o(1)) 0, if ψ > 1 , e = λ 1 ϕη λ/(1 ψ + o(1)) 0, if ψ 1 λ/(1 1 + λϕ/((ψ 1)θ0) (ψ 1)θ0/ϕ, if ψ > 1 , τ = 1 η/γ 1 η0/γ = (1 1/ψ)+, e = 1 ϕη 1 ϕη0 = (1 ψ)+. On the other hand, if θ0 = 0 (which only happens if ψ < 1 and γ > 1, or ψ 1 and ϕ 1), it is easy to see from Equation 318 that we must have τ 0, e 0, τ 1 1/γ, e 1 ϕ 0. Published as a conference paper at ICLR 2025 J COROLLARY J.1 As a special case of Theorem 3.1, we recover Corollary J.1, which aligns with Proposition 4 from (Bach, 2023). Theorem 3.1 is a non-trivial generalization of Proposition 4. Corollary J.1 captures how the covariance matrix affects the test risk of a model through the normalized second and first-order degrees of freedom of Σs. Corollary J.1 also reveals that in the underparameterized regime (ψs < 1), the bias and variance of the test risk of the model strictly increase as a function of ψs (rate of parameters to samples); the test risk of the model explodes (i.e., there is catastrophic overfitting (Bach, 2023)) when ψs gets close to 1. In the overparameterized regime (ψs > 1), the bias and variance of the test risk decrease as ψs increases. Corollary J.1. Under Assumptions B.2 and 3.1, it holds in the unregularized setting λs 0+ that θ0 tr ΘsΣs(Σs+θ0Id) 1 1 ψs , γ, ψs < 1 0, ψs < 1, γ 1 or 1 ψs γ θ2 0 tr ΘsΣs(Σs+θ0Id) 2 1 ϕs I2,2(θ0) + θ0 tr ΘsΣs(Σs+θ0Id) 1 ψs 1 , ψs 1, ψs γ σ2 sψs 1 ψs , γ, ψs < 1 σ2 sϕs 1 ϕs , ψs < 1, γ 1 or 1 ψs γ σ2 sϕs I2,2(θ0) 1 ϕs I2,2(θ0) + σ2 s ψs 1, ψs 1, ψs γ where Ia,b(t) = tr Σa(Σ + t Id) b for any positive integers a, b; and θ0 is the unique solution to the following non-linear equation: γ, γ, ψs < 1 1, ψs < 1, γ 1 or 1 ψs γ 1/ϕs, ψs 1, ψs γ . (324) Proof. Define e = 1/es 0, τ = 1/τs 0, θ = λsτ e /γ, and η = I1,1(θ) [0, 1]. One can then express e and τ as: e = 1 + ψτs tr Σ(γτsesΣ + λs Id) 1 = 1 + ϕsηe , (325) τ = 1 + tr esΣ(γτsesΣ + λs Id) 1 = 1 + (η/γ)τ . (326) We deduce that: e = 1 1 ϕsη , (327) τ = 1 1 η/γ , (328) λτ e = γθ. (329) We define the following limiting values: lim λs 0+ θ θ0, lim λs 0+ η η0, (330) lim λs 0+ es e0, lim λs 0+ τs τ0, (331) lim λs 0+ us u0, lim λs 0+ ρs ρ0. (332) There are now two cases to consider. J.1 CASE 1: θ0 = 0 This implies η0 = 1. Therefore, by simple computation, e0 = 1/e 0 = 1 ϕsη0 = 1 ϕs and τ0 = 1/τ 0 = 1 1/γ. This requires ϕs 1 and γ 1. 
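Before turning to the second case, the limiting values just derived are easy to sanity-check numerically. The sketch below (our own illustration, not code from the paper) iterates what we read as the isotropic specialization of the per-group fixed-point equations (Definition H.1, or equivalently Equations 325–326 with Σs = Id), and checks that for small λ the solution approaches e → 1 − ϕs and τ → 1 − 1/γ, as Case 1 predicts when θ0 = 0 (i.e., ϕs ≤ 1 and γ ≥ 1).

```python
import numpy as np

def per_group_fixed_point(phi, gamma, lam, iters=5000, damping=0.5):
    """Iterate the isotropic per-group fixed-point system, as reconstructed here.

    With Sigma_s = I_d the normalized traces collapse to scalars and the
    equations of Definition H.1 read
        e   = 1 / (1 + psi * tau / (gamma * tau * e + lam))
        tau = 1 / (1 + e         / (gamma * tau * e + lam)),
    where psi = phi * gamma is the rate of parameters to samples.
    """
    psi = phi * gamma
    e, tau = 0.5, 0.5
    for _ in range(iters):
        denom = gamma * tau * e + lam
        e_new = 1.0 / (1.0 + psi * tau / denom)
        tau_new = 1.0 / (1.0 + e / denom)
        e = (1 - damping) * e + damping * e_new
        tau = (1 - damping) * tau + damping * tau_new
    return e, tau

# Case 1 (theta_0 = 0) predicts e -> 1 - phi and tau -> 1 - 1/gamma as lambda -> 0+.
for phi, gamma in [(0.2, 2.0), (0.5, 2.0), (0.3, 4.0)]:
    e, tau = per_group_fixed_point(phi, gamma, lam=1e-9)
    print(f"phi={phi}, gamma={gamma}: e={e:.4f} (pred {1 - phi:.4f}), "
          f"tau={tau:.4f} (pred {1 - 1 / gamma:.4f})")
```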
Published as a conference paper at ICLR 2025 J.2 CASE 2: θ0 > 0 Equation 329 can be re-written as: λs (1 ϕsη)(1 η/γ) = γθ, i.e., ϕsη2 (ψs + 1)η + γ λs θ = 0. (333) We solve this quadratic equation for η, arriving at the solutions: η = ψs + 1 p (ψs + 1)2 4(ψs (ϕs/θ)λs) 2ϕs = ψs + 1 p (ψs + 1)2 4(ψs (ϕs/θ)λs) Taking the limit of η as λs 0+ gives: η+ ψs + 1 + |ψs 1| 2ϕs = ψs/ϕs = γ, if ψs 1, 1/ϕs, if ψs < 1, η ψs + 1 |ψs 1| 2ϕs = 1/ϕs, if ψs 1, ψs/ϕs = γ, if ψs < 1. Recall that we have the following constraints: We can show that η0 = 1/ϕs is incompatible with ψs < 1. Indeed, otherwise we would have τ 0 = 1/(1 η0/γ) = 1/(1 1/ψs) < 0. Similarly, if ψs > 1, we would have e0 = 1 ϕsγ = 1 ψs < 0. Therefore, η0 = η . Furthermore, if ψs, γ < 1, it must be that θ0 > 0 and η0 = γ. Instead, if ψs < 1, γ 1, we must have that ϕs 1, and therefore, θ0 = 0 and η0 = 1. Similarly, if ψs 1, γ 1, and ϕs 1 (i.e., 1 ψs γ), we must have that θ0 = 0 and η0 = 1. In all other cases where ψs 1, it must be that η0 = 1/ϕs (which additionally requires ϕs 1 or ψs γ). Succinctly: γ, γ, ψs < 1 1, ψs < 1, γ 1 or 1 ψs γ 1/ϕs, ψs 1, ψs γ . (336) Plugging this into Equation 327 and Equation 328 gives: e0 = 1 ϕsη0 = 1 ϕs I1,1(θ0), (337) τ0 = 1 η0/γ = 1 I1,1(θ0)/γ. (338) We will now solve for u0 and ρ0/τ 2 0 . We can re-write us and ρs/τ 2 s as: ρs/τ 2 s = γ 1(ρs/τ 2 s )I2,2(θ) + θ2(us + 1)I1,2(θ), (339) τ 2 s us = τ 2 s ϕs(us + 1)I2,2(θ) + ϕsγ 1ρs I1,2(θ). (340) Solving for u0 and ρ0/τ 2 0 yields: u0 = ϕζ γ ϕζ I2,2(θ0), ρ0/τ 2 0 = γθ2 0I2,2(θ0) γ ϕζ I1,2(θ0), (341) where ζ = I2,2(θ0)(γ I2,2(θ0)) + θ2 0I1,2(θ0)2. (342) Published as a conference paper at ICLR 2025 We can then see for the variance term that: Vs( bfs) = σ2 sϕsγ tr Σs(γesτ 2 s Σs + esρs Id λsusτs Id)(γτses) 2(Σs + θId) 2 (343) = σ2 sϕs(1/es) tr Σ2 s(Σs + θId) 2 + (σ2 sϕs/γ)(1/es)(ρs/τ 2 s ) tr Σs(Σs + θId) 2 (344) σ2 sϕs(us)(1/es)θ tr Σs(Σs + θId) 2 (345) = σ2 sϕs I2,2(θ)/es + σ2 sϕs(ρs/τ 2 s )I1,2(θ)/(γes) σ2 sϕsusθI1,2(θ)/es (346) σ2 sϕs I2,2(θ0) σ2 sϕsu0θ0I1,2(θ0) 1 ϕs I1,1(θ0) + σ2 sϕsρ0/τ 2 0 γ(1 ϕs I1,1(θ0)) (347) = σ2 sϕsξ ϕsξ + I2,2(θ0) γ , (348) where ξ = I2 1,1(θ0) 2I1,1(θ0)I2,2(θ0) + I2,2(θ0)γ and we have used the fact that I1,2(θ) = (I1,1(θ) I2,2(θ))/θ. Plugging in I1,1(θ0) = η0, we have that: σ2 sψs 1 ψs , γ, ψs < 1 σ2 sϕs 1 ϕs , ψs < 1, γ 1 or 1 ψs γ σ2 sϕs I2,2(θ0) 1 ϕs I2,2(θ0) + σ2 s ψs 1, ψs 1, ψs γ where we have used that I2,2(θ0) = I2,2(0) = 1 in the second case. Likewise, for the bias term, we obtain: Bs( bfs) = tr ΘsΣs + tr ΘsΣs(γe2 sΣs(γτ 2 s Σs + ρs Id) + λ2 sus Id)(γτsesΣs + λs Id) 2 (350) 2γesτs tr ΘsΣ2 s(γτsesΣs + λs Id) 1 (351) tr ΘsΣs(Σ2 s + 2θ0Σs + θ2 0Id)(Σs + θ0Id) 2 (352) + tr ΘsΣs(Σ2 s)(Σs + θ0Id) 2 (353) + tr ΘsΣs((ρ0/τ 2 0 )Σs/γ)(Σs + θ0Id) 2 (354) + tr ΘsΣs(θ2 0u0Id)(Σs + θ0Id) 2 (355) + tr ΘsΣs( 2Σ2 s 2θ0Σs)(Σs + θ0Id) 2 (356) = θ2 0(u0 + 1) tr ΘsΣs(Σs + θ0Id) 2 + (1/γ)(ρ0/τ 2 0 ) tr ΘsΣ2 s(Σs + θ0Id) 2. (357) Again, plugging in I1,1(θ0) = η0, we have that: θ0 tr ΘsΣs(Σs+θ0Id) 1 1 ψs , γ, ψs < 1 θ2 0 tr ΘsΣs(Σs+θ0Id) 2 1 ϕs = 0, ψs < 1, γ 1 or 1 ψs γ θ2 0 tr ΘsΣs(Σs+θ0Id) 2 1 ϕs I2,2(θ0) + θ0 tr ΘsΣs(Σs+θ0Id) 1 ψs 1 , ψs 1, ψs γ where we have used that tr ΘsΣ2 s(Σs+θ0Id) 2 = tr ΘsΣs(Σs+θ0Id) 1 θ0 tr ΘsΣs(Σs+θ0Id) 2 and in the second case, θ0 = 0 and I2,2(θ0) = 1. Published as a conference paper at ICLR 2025 K EXPERIMENTAL DETAILS K.1 SYNTHETIC EXPERIMENTS Across all experiments on synthetic data, we choose n = 400. 
We further use 5 runs to estimate test risks (e.g., ERs( bf), ERs( bfs)), and 5 runs to capture the variance of the estimators, for a total of 25 runs. We use 10,000 samples to estimate test risks. Our experiments validate that bias amplification occurs even in low-dimensional regimes. In Sections 4 and 5, and Appendices L, O, and Q, we show that our theory predicts bias amplification for models trained on only n = 400 samples. The high-dimensional regime is commonly studied in ML theory and statistical physics (as we mention in Section 1.2), as it makes precise analysis more tractable. Setup for Section 4.1. We further choose λ = 1 10 6 to approximate the minimum-norm interpolator; we henceforth set λ = λ1 = λ2 for simplicity. We modulate a1, a2, σ2 1, σ2 2, as well as ψ (rate of parameters to samples) and ϕ (rate of features to samples) to understand the effects of model size, number of features, and sample size on bias amplification. We consider diverse and dense values of these variables to obtain a clear picture of when and how models amplify bias. Setup for Section 5. The first πd features represent common core features of groups 1 and 2 while the latter (1 π)d features capture unshared extraneous features for group 2 (e.g., spurious features). Intuitively, this setting can model: (1) learning from data from two groups where one group suffers from spurious features (Sagawa et al., 2020), or (2) learning from a mixture of raw data (i.e., with spurious features) and clean data (i.e., without spurious features) for a single population (Khani & Liang, 2021). We ask: Does our theory predict how the inclusion of different amounts of extraneous features affect the test risk of a minority group (compared to the majority group) when a single model is trained on data from both groups vs. a separate model is trained per group? Although Sagawa et al. (2020) consider classification instead of regression, to mirror their experimental setting, we pick p1 = 0.9 (i.e., group 1 is much larger than group 2) and Θ = Id, = 0 (i.e., w 1 = w 2). We additionally choose λ = 1 10 6 and σ2 1 = σ2 2 = 1. We modulate a1, b2, as well as ψ (rate of parameters to samples) and ϕ (rate of features to samples). Notably, this setting also captures learning problems with o(d) overlapping core and extraneous features in our asymptotic scaling limit. An extremization of this setting is choosing Σ1 = a1Iπd 0I(1 π)d, Σ2 = 0Iπd b2I(1 π)d, where groups 1 and 2 have no overlapping features. The experiments in Section 5 validate that our analysis does not rely on conditional dependence heterogeneity. That is, we empirically verify that our theory still holds and predicts bias amplification occurs even when w 1 = w 2 (see Figure 4). In essence, the structure and eigenspectra of the covariance matrices of the two groups still contribute to bias amplification even when the ground-truth weights for the groups are the same. In our theory, we only allow the possibility of w 1 = w 2 to be as general as possible. In practice, labeling rule heterogeneity may be leveraged, for example, to train a mixture of experts that is regularized to deamplify bias. Extraneous vs. Spurious Features. Our usage of extraneous features (i.e., features that are different across groups and correlated with labels) differs from classical definitions of spuriousness (i.e., noncausal correlations between features and labels) (Bell & Wang, 2024); indeed, the extraneous features are used to generate the labels of the minority group. 
For example, Khani & Liang (2021) model both the labels and spurious features in linear regression as being separately generated by the core features, such that the labels and spurious features are associated but not causally related. However, this setup is not encompassed by our modeling assumptions, as it entails that the ground-truth parameter and feature covariance matrices are not jointly diagonalizable. In contrast, Sagawa et al. (2020) study spurious correlations in classification. At a high level, Sagawa et al. (2020) create four subgroups of a population with different combinations of class labels y { 1, 1} and group labels a { 1, 1}. The core and spurious features are then sampled from normal distributions parameterized by y, a (respectively) and different variance levels. By setting the spurious features to have a significantly lower variance than the core features and making y and a highly associated (i.e., imbalanced groups), the authors coerce models to perform classification as a function of primarily the spurious features of the majority group, which does not generalize to the minority group. To capture this spirit, our setup uses imbalanced groups, and the data for the majority group provides no learning signal for the Published as a conference paper at ICLR 2025 extraneous features; this coerces models to perform regression as a function of primarily the core features, without learning appropriate parameters for the extraneous features, and thus generalize poorly to the minority group. K.2 COLORED MNIST EXPERIMENTS Train-test split. Colored MNIST has a total of 60k instances. Each image is 28 28 3 pixels. We use the prescribed 0.67-0.33 train-test split. We do not perform validation of hyperparameters, which we mostly adopt3. Model architecture. By default, our CNN architecture consists of: (1) a convolutional layer (3 in-channels, 20 out-channels, kernel size of 5, stride of 1); (2) a max pooling layer (kernel size of 2, stride of 2); (3) a second convolutional layer (20 in-channels, 50 out-channels, kernel size of 5, stride of 1); (4) a second max-pooling layer (kernel size of 2, stride of 2); (5) a fully-connected layer (R800 R500); and (6) a second fully-connected layer (R500 R1). Model training. We train each model with a batch size of 250 for a single epoch with respect to groups (i.e., 80 training steps given there are two groups). We use a cross-entropy loss and the Adam optimizer with learning rate 0.01. We run all experiments on a single NVIDIA L40S. We report our results over 10 random seeds. 3https://colab.research.google.com/github/reiinakano/ invariant-risk-minimization/blob/master/invariant_risk_minimization_ colored_mnist.ipynb Published as a conference paper at ICLR 2025 L BIAS AMPLIFICATION PLOTS 2 2 20 22 1 2 log a1 = 0.5, a2 = 1.0, 2 1 = 1.0, 2 2 = 1.0e 0.5 2 2 20 22 0 2 2 20 22 0 2 log a1 = 0.5, a2 = 1.0, 2 1 = 1.0, 2 2 = 1.0e0 2 2 20 22 0 2 2 20 22 0 2 2 20 22 0 2 log a1 = 0.5, a2 = 1.0, 2 1 = 1.0, 2 2 = 1.0e1.0 Figure 6: We empirically demonstrate that bias amplification occurs and validate our theory (Theorems 3.1 and 3.2) for ODD, EDD, and ADD under the setup described in Section 4.1. The solid lines capture empirical values while the corresponding lower-opacity dashed lines represent what our theory predicts. We plot ODD and EDD on the same scale for easy comparison, and include a black dashed line at ADD = 1 to contrast bias amplification vs. deamplification. The error bars capture the range of the estimators over 25 random seeds. 
L BIAS AMPLIFICATION PLOTS

[Figure 6: ODD, EDD, and ADD plotted over a log-scale x-axis (2^−2 to 2^2); panels correspond to a_1 = 0.5, a_2 = 1.0, σ_1^2 = 1.0, and σ_2^2 ∈ {1.0e−0.5, 1.0e0, 1.0e1.0}.]

Figure 6: We empirically demonstrate that bias amplification occurs and validate our theory (Theorems 3.1 and 3.2) for ODD, EDD, and ADD under the setup described in Section 4.1. The solid lines capture empirical values while the corresponding lower-opacity dashed lines represent what our theory predicts. We plot ODD and EDD on the same scale for easy comparison, and include a black dashed line at ADD = 1 to contrast bias amplification vs. deamplification. The error bars capture the range of the estimators over 25 random seeds.

[Figure 7: same layout as Figure 6, with a_1 = 2.0, a_2 = 1.0, σ_1^2 = 1.0, and σ_2^2 ∈ {1.0e−0.5, 1.0e0, 1.0e1.0}.]

Figure 7: We empirically demonstrate that bias amplification occurs and validate our theory (Theorems 3.1 and 3.2) for ODD, EDD, and ADD under the setup described in Section 4.1. The solid lines capture empirical values while the corresponding lower-opacity dashed lines represent what our theory predicts. We plot ODD and EDD on the same scale for easy comparison, and include a black dashed line at ADD = 1 to contrast bias amplification vs. deamplification. The error bars capture the range of the estimators over 25 random seeds.

[Figure 8: same layout as Figure 6, with a_1 = 8.0, a_2 = 1.0, σ_1^2 = 1.0, and σ_2^2 ∈ {1.0e−0.5, 1.0e0, 1.0e1.0}.]

Figure 8: We empirically demonstrate that bias amplification occurs and validate our theory (Theorems 3.1 and 3.2) for ODD, EDD, and ADD under the setup described in Section 4.1. The solid lines capture empirical values while the corresponding lower-opacity dashed lines represent what our theory predicts. We plot ODD and EDD on the same scale for easy comparison, and include a black dashed line at ADD = 1 to contrast bias amplification vs. deamplification. The error bars capture the range of the estimators over 25 random seeds.

M POWER-LAW COVARIANCE

To better understand how ϕ (rate of features to samples) and the label noise ratio c affect bias amplification, we derive explicit phase transitions in the bias amplification profile of unregularized ridge regression with random projections in terms of these quantities. We consider the setting of power-law covariance, as it is analytically tractable and can be translated to the case of wide neural networks (Caponnetto & de Vito, 2007; Cui et al., 2022; Maloney et al., 2022), where the exponents can be empirically gauged. Let the eigenvalues λ_k^{(s)} of Σ_s have power-law decay, i.e., λ_k^{(s)} = k^{−β_s} for all k and some positive constants β_1 and β_2. WLOG, we will assume β_1 > β_2. Note that β_s controls the effective dimension and ultimately the difficulty of fitting the noiseless part of the signal from group s: if β_s is large, then all the information is concentrated in a few features, and so the learning problem is easier. We similarly assume that the eigenvalues μ_k of Θ have power-law decay μ_k = k^{−α} for all k and some constant α > 0. Finally, we consider balanced groups (i.e., p_1 = p_2 = 1/2).
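As an illustration, the fixed-point quantities appearing in Corollary M.1 below can be computed numerically at finite d. The sketch below is our own: it treats tr as the normalized trace (consistent with the integral form used in the Appendix N proof), uses a plain fixed-point iteration, and is not intended as the exact procedure used for the figures. For a large exponent gap β_1 − β_2, it recovers e_1 ≈ 1 and e_2 ≈ 1 − ϕ/p_2 = 1 − 2ϕ, the limiting values used in Appendix N.

```python
import numpy as np

def solve_fixed_point(lam1, lam2, phi, p1=0.5, p2=0.5, n_iter=2000):
    """Iterate 1/e_s = 1 + phi * (1/d) tr(Sigma_s L^{-1}), L = p1 e1 Sigma1 + p2 e2 Sigma2.
    The covariances are diagonal here, so the traces reduce to means over eigenvalues."""
    e1 = e2 = 1.0
    for _ in range(n_iter):
        L = p1 * e1 * lam1 + p2 * e2 * lam2          # eigenvalues of L
        e1 = 1.0 / (1.0 + phi * np.mean(lam1 / L))   # normalized trace = mean over eigenvalues
        e2 = 1.0 / (1.0 + phi * np.mean(lam2 / L))
    return e1, e2

d, phi = 5000, 0.2
k = np.arange(1, d + 1)
beta1, beta2 = 8.0, 4.0                              # power-law exponents, beta1 > beta2
lam1, lam2 = k ** (-beta1), k ** (-beta2)            # lambda^(s)_k = k^{-beta_s}

e1, e2 = solve_fixed_point(lam1, lam2, phi)
print(e1, e2, 1.0 - 2.0 * phi)                       # e1 ~ 1, e2 ~ 1 - phi/p2 = 1 - 2*phi
```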
Under this setup, we have the following corollary.

Corollary M.1. Suppose that in the single model setting, as λ → 0^+, (e_1, e_2, τ, u_1, u_2, ρ) is the unique positive solution to the following fixed-point equations:

\frac{1}{\tau} = 1 + \frac{1}{\gamma\tau}, \qquad \frac{1}{e_s} = 1 + \phi \,\mathrm{tr}\, \Sigma_s L^{-1} \quad \text{for } s \in \{1, 2\},   (359)

\rho = 0, \qquad u_s = \phi e_s^2 \,\mathrm{tr}\, \Sigma_s D L^{-2} \quad \text{for } s \in \{1, 2\},   (360)

where

L = p_1 e_1 \Sigma_1 + p_2 e_2 \Sigma_2, \qquad D = p_1 u_1 \Sigma_1 + p_2 u_2 \Sigma_2 + B.   (361)

Furthermore, suppose ψ_s < 1 ≤ γ or 1 ≤ ψ_s ≤ γ. Under the assumptions of Theorem 3.1 and Assumption B.3, as λ → 0^+, we have the following approximate analytical phase transitions in the bias amplification profile of ridge regression with random projections:

\lim_{d, n_1, n_2 \to \infty,\; \phi_{1,2} \to 2\phi} \mathrm{ADD} \approx \frac{c}{|c - 1|}, \qquad \lim_{c \to 0^+} \; \lim_{d, n_1, n_2 \to \infty,\; \phi_{1,2} \to 2\phi} \mathrm{ADD} \approx 0,   (362)

\lim_{c \to \infty} \; \lim_{d, n_1, n_2 \to \infty,\; \phi_{1,2} \to 2\phi} \mathrm{ADD} \approx 1, \qquad \lim_{c \to 1} \; \lim_{d, n_1, n_2 \to \infty,\; \phi_{1,2} \to 2\phi} \mathrm{ADD} \approx \infty.   (363)

We relegate the proof to Appendix N and empirically assess the validity of this result in Figure 9. The phase transitions reveal that bias amplification peaks near c = 1, bias deamplification peaks as c → 0^+, and bias is roughly neither amplified nor deamplified when c → ∞. Furthermore, the right tail of the ODD profile (which Corollary M.1 predicts to be proportional to c) is higher than the left tail (i.e., 0) for larger c. However, the left tail of the EDD profile (which Corollary M.1 predicts to be proportional to |c − 1|) does not increase steeply as c → 0^+. Interestingly, in the proof of Corollary M.1, we observe that the bias term depends on tr Σ_s; therefore, the setting λ_k^{(s)} ∝ 1/μ_k for all k (e.g., common in learning from synthetic data (Dohmatob et al., 2024a)) can prevent the bias term from vanishing or even cause it to explode. This may explain why training models on synthetic data (i.e., data previously generated by the model) may amplify unfairness (Wyllie et al., 2024).

[Figure 9: three panels over log c ∈ [−2, 2], with β_1 = 8.0 and β_2 = 4.0.]

Figure 9: Our theory predicts that bias amplification is larger for higher noise ratios than lower noise ratios. We observe that Corollary M.1 generally predicts the ADD profile with respect to the noise ratio c. The solid lines capture empirical values while the corresponding lower-opacity dashed lines represent what Theorem 3.1 predicts. We plot ODD and EDD on the same scale for easy comparison, and include a black dashed line at ADD = 1 to contrast bias amplification vs. deamplification. The error bars capture the range of the estimators over 25 random seeds. We consider the setup described in Appendix M with ψ = 0.5, ϕ = 0.2, and λ = 1 × 10⁻⁶.

N PROOF OF COROLLARY M.1

Proof. We begin by computing the ODD in the limit λ → 0^+. We define u_j^{(s)} = u_j for B = Σ_s. By Assumption B.3, we can re-express the constants in Definition 3.1 in terms of the limiting spectral densities of the covariance matrices:

e_1 = \frac{1}{1 + \phi \int_0^\infty \frac{1}{p_1 e_1 + p_2 e_2 r} \, d\nu(r)}, \qquad e_2 = \frac{1}{1 + \phi \int_0^\infty \frac{r}{p_1 e_1 + p_2 e_2 r} \, d\nu(r)},   (364)

\tau = \frac{1}{1 + \frac{1}{\gamma\tau}} = 1 - 1/\gamma, \qquad \rho = 0,   (365)

u_1^{(1)} = \phi e_1^2 \int \frac{u_1^{(1)} p_1 + u_2^{(1)} p_2 r + 1}{(p_1 e_1 + p_2 e_2 r)^2} \, d\nu(r), \qquad u_2^{(1)} = \phi e_2^2 \int \frac{u_1^{(1)} p_1 r + u_2^{(1)} p_2 r^2 + r}{(p_1 e_1 + p_2 e_2 r)^2} \, d\nu(r),

u_1^{(2)} = \phi e_1^2 \int \frac{u_1^{(2)} p_1 + u_2^{(2)} p_2 r + r}{(p_1 e_1 + p_2 e_2 r)^2} \, d\nu(r), \qquad u_2^{(2)} = \phi e_2^2 \int \frac{u_1^{(2)} p_1 r + u_2^{(2)} p_2 r^2 + r^2}{(p_1 e_1 + p_2 e_2 r)^2} \, d\nu(r).

Since β_1 > β_2, we have β_1 − β_2 > 0. As such, for d → ∞, the ratios r_k = λ_k^{(2)} / λ_k^{(1)} have the approximate limiting distribution ν = δ_{r = ∞}, i.e., a Dirac atom at infinity. Thus:

e_1 = 1, \quad e_2 = 1 - \phi/p_2 = 1 - 2\phi, \quad \tau = 1 - 1/\gamma, \quad \rho = 0,   (368)

u_1^{(1)} = 0, \quad u_2^{(1)} = 0, \quad u_1^{(2)} = 0, \quad u_2^{(2)} = \frac{\phi}{p_2(p_2 - \phi)}.   (369)

Now, we can re-express the variance terms as:

V_1(\hat{f}) = \phi\sigma_1^2 \int \frac{p_1}{(p_1 + p_2 e_2 r)^2} \, d\nu(r) + \phi\sigma_2^2 \int \frac{p_2 e_2 r}{(p_1 + p_2 e_2 r)^2} \, d\nu(r) = 0,   (370)

V_2(\hat{f}) = \phi\sigma_1^2 \int \frac{p_1 r + p_1 p_2 u_2^{(2)} r}{(p_1 + p_2 e_2 r)^2} \, d\nu(r) + \phi\sigma_2^2 \int \frac{\cdots}{(p_1 e_1 + p_2 e_2 r)^2} \, d\nu(r) = \frac{\sigma_2^2 \phi}{p_2 - \phi}.   (371)

Likewise, we can re-express the bias terms as:

B_1(\hat{f}) = \int \frac{a\delta \, e_2^2 p_2^2 r^2}{(e_1 p_1 + e_2 p_2 r)^2} \, d\mu(r, a) \, d\pi(\delta) = \int 0 \cdot a\delta \, d\mu(a) \, d\pi(\delta) = 0,   (372)

B_2(\hat{f}) = 0.   (373)
In this calculation, we observe that the adversarial setting λ_k^{(1)} ∝ 1/μ_k for all k can prevent the bias term from vanishing. Putting these pieces together and recalling that p_2 = 1/2:

\mathrm{ODD} \approx \left| V_1(\hat{f}) - V_2(\hat{f}) \right| = \frac{2\phi\sigma_1^2}{1 - 2\phi} \, c.   (374)

We now compute the EDD. We can once again re-express the constants in Definition H.1 in terms of the limiting spectral densities of the covariance matrices:

e_s = \frac{1}{1 + \phi_s / e_s} = 1 - \phi_s, \qquad \tau_s = 1 - 1/\gamma.   (375)

By Corollary J.1, because ψ_s < 1 ≤ γ or 1 ≤ ψ_s ≤ γ, we have B_s(\hat{f}_s) = 0 and V_s(\hat{f}_s) = \frac{\sigma_s^2 \phi_s}{1 - \phi_s}. Therefore, because ϕ = p_s ϕ_s:

\mathrm{EDD} \approx \left| V_1(\hat{f}_1) - V_2(\hat{f}_2) \right| = \frac{2\phi}{1 - 2\phi} \left| \sigma_1^2 - \sigma_2^2 \right| = \frac{2\phi\sigma_1^2}{1 - 2\phi} \, |c - 1|,   (376)

\mathrm{ADD} \approx \frac{c}{|c - 1|}.   (377)

O BIAS AMPLIFICATION DURING TRAINING

[Figure 10: panels over a log-scale x-axis (10^−5 to 10^−1), with a_1 ∈ {0.5, 2.0, 8.0} and a_2 = 1.0.]

Figure 10: Our theory reveals that there may be an optimal regularization penalty to deamplify bias. We empirically demonstrate that bias amplification can be heavily affected by λ and validate our theory (Theorems 3.1 and 3.2) for ODD, EDD, and ADD under the setup described in Section 4.2. The solid lines capture empirical values while the corresponding lower-opacity dashed lines represent what our theory predicts. We include a black dashed line at ADD = 1 to contrast bias amplification vs. deamplification. The error bars capture the range of the estimators over 25 random seeds.

P COLORED MNIST PLOTS

We further assess the applicability of our conclusions about the effects of label noise (Figures 3, 11) and model size (Figure 12) on bias amplification for Colored MNIST. Please see Section 4.2 for a discussion of Figure 3. We observe in Figure 11 that as we increase the label noise ratio, the EDD generally increases, while the ODD remains relatively low, which is suggested by our theoretical reasoning in Section 4.2. Furthermore, in Figure 12, as the hidden dimension m of the penultimate layer of the CNN increases, the ODD appears to decrease and plateau, which is predicted by our theoretical results (see Section 4.1) in the Colored MNIST regime where ϕ < 1. However, the EDD does not appear to decrease; while this is plausibly predicted by our theory, it requires going beyond our simplistic assumption that Σ_1 roughly coincides with Σ_2 and studying the interplay between ϕ_s, ψ_s, and Σ_s for each group s (as suggested by Appendix L).
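A minimal sketch of how such disparity metrics can be estimated from per-group test risks is given below. It is our own operationalization, under the reading that ODD and EDD are the absolute between-group differences in test risk for the single model and for the separate per-group models, respectively (consistent with the Appendix N derivation), with ADD their ratio; the per-group risk estimator (e.g., mean binary cross-entropy for the CNN) is likewise an assumption.

```python
import numpy as np

def disparity_metrics(together_risks, separate_risks):
    """ODD/EDD/ADD from per-group test risks.
    together_risks: (R1, R2) of the single model trained on both groups.
    separate_risks: (R1, R2) where R_s comes from the model trained only on group s."""
    odd = abs(together_risks[0] - together_risks[1])   # disparity of the single model
    edd = abs(separate_risks[0] - separate_risks[1])   # disparity of the per-group models
    add = odd / edd if edd > 0 else np.inf             # ADD > 1 indicates amplification
    return odd, edd, add

# e.g., with per-group test risks of the single model and the two separate models:
# odd, edd, add = disparity_metrics((0.21, 0.35), (0.20, 0.24))
```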
[Figure 11: difficulty disparity vs. label noise ratio σ_2^2/σ_1^2 (x-ticks 2, 4, 6, 8).]

Figure 11: Our theory predicts that more disparate label noise between groups deamplifies bias on Colored MNIST. We plot the ODD and EDD of a CNN for different label noise ratios c = σ_2^2/σ_1^2 for Colored MNIST. As c increases, the EDD generally increases while the ODD remains relatively low, which is predicted by our theory (see reasoning in Section 4.2). In our experiments, σ_1^2 = 0.05 stays fixed while σ_2^2 varies. For each value of c, the model is evaluated after t = 80 training steps and has a penultimate layer with dimension m = 500. The error bars capture the standard deviation computed over 10 random seeds.

[Figure 12: difficulty disparity vs. hidden dimension m (x-ticks 150, 300, 450, 600).]

Figure 12: Our theory predicts that a larger model size reduces bias on Colored MNIST in the single model setting. We plot the ODD and EDD of a CNN for different model sizes m (where m is the dimension of the penultimate CNN layer) for Colored MNIST. As m increases, the ODD appears to decrease and plateau, which is in line with what our theory predicts in the regime where ϕ < 1 (see analysis in Section 4.1). The EDD does not tend towards 0. In our experiments, σ_1^2 = σ_2^2 = 0.05. For each value of m, the model is evaluated after t = 80 training steps. The error bars capture the standard deviation computed over 10 random seeds.

Q MINORITY-GROUP BIAS PLOTS

[Figure 13: together and separate R_1, R_2 plotted over a log-scale x-axis (2^−2 to 2^2); panels correspond to a_1 = 2.0, b_1 = 0.0, a_2 = 2.0, b_2 = 0.2, and π ∈ {0.5, 0.75, 0.9}.]

Figure 13: We empirically demonstrate that minority-group bias is affected by extraneous features. We validate our theory (Theorems 3.1 and 3.2) for together R_1, R_2 (i.e., single model learned for both groups) and separate R_1, R_2 (i.e., separate model learned per group) under the setup described in Section 4.2. The solid lines capture empirical values while the corresponding lower-opacity dashed lines represent what our theory predicts. We include a black dashed line at ADD = 1 to contrast bias amplification vs. deamplification. All y-axes are on the same scale for easy comparison. The error bars capture the range of the estimators over 25 random seeds.

[Figure 14: same layout as Figure 13, with a_1 = 2.0, b_1 = 0.0, a_2 = 2.0, b_2 = 2.0, and π ∈ {0.5, 0.75, 0.9}.]

Figure 14: We empirically demonstrate that minority-group bias is affected by extraneous features. We validate our theory (Theorems 3.1 and 3.2) for together R_1, R_2 (i.e., single model learned for both groups) and separate R_1, R_2 (i.e., separate model learned per group) under the setup described in Section 4.2. The solid lines capture empirical values while the corresponding lower-opacity dashed lines represent what our theory predicts. We include a black dashed line at ADD = 1 to contrast bias amplification vs. deamplification. All y-axes are on the same scale for easy comparison. The error bars capture the range of the estimators over 25 random seeds.

[Figure 15: same layout as Figure 13, with a_1 = 2.0, b_1 = 0.0, a_2 = 2.0, b_2 = 4.0, and π ∈ {0.5, 0.75, 0.9}.]

Figure 15: We empirically demonstrate that minority-group bias is affected by extraneous features. We validate our theory (Theorems 3.1 and 3.2) for together R_1, R_2 (i.e., single model learned for both groups) and separate R_1, R_2 (i.e., separate model learned per group) under the setup described in Section 4.2. The solid lines capture empirical values while the corresponding lower-opacity dashed lines represent what our theory predicts. We include a black dashed line at ADD = 1 to contrast bias amplification vs. deamplification. All y-axes are on the same scale for easy comparison. The error bars capture the range of the estimators over 25 random seeds.
R ACTIONABLE INSIGHTS FROM THEORY

Searching for optimal hyperparameters. In practice, an optimal regularization penalty λ or training time t can be selected by searching for values that strike a desired balance between overall validation error (that is not too high) and bias amplification (that is not too high). Just as we would estimate the test error using the empirical validation error, we can estimate bias amplification using the validation set. Moreover, we would need to train: (1) the main model on a mixture of data from the groups, and (2) auxiliary separate models on the data for each group. However, it may be expensive to train auxiliary models for each candidate value of λ and t. The search space can be reduced by using insights from our theory. For instance, with overparameterization, as λ decreases (or t increases), bias amplification increases and plateaus, and with underparameterization, as λ decreases (or t increases), bias deamplification increases and plateaus. When the curves are monotone with respect to λ, the optimal λ is either at the left tail of the curve (e.g., λ = 0) or the right tail (i.e., the largest λ among the reasonable options). In contrast, Figure 10 shows that when ψ is close to the interpolation threshold of 1, bias amplification is often not monotone with respect to λ.

Informing evaluation and mitigation strategies. Our theory offers avenues to assess whether an ML model trained on certain real-world data is prone to bias amplification and to mitigate this amplification, even though we may lack direct access to population parameters like Σ. We can estimate such parameters using samples (e.g., the empirical covariance Σ̂ = X^⊤X / n). However, even if we are unable to robustly estimate these parameters, our theory still provides valuable insights. For example, we observe that it is often the ratios of parameters between the groups that matter, e.g., the label noise ratio σ_2^2/σ_1^2 (see Section 4.1) and the ratio of covariance eigenvalues (see Appendix M). Thus, practitioners can use our theory to build intuition about when disparities in the variability of labels and features across groups can amplify bias. Moreover, our findings warn against the conventional wisdom that increased model overparameterization or data balancing can alleviate bias issues. In addition, our theory informs criteria for feature selection (e.g., discarding features with disparate variance across groups) and warns ML practitioners about the interplay between high vs. low feature-to-sample regimes and overparameterization in inducing bias amplification. Nevertheless, additional work is required to make rigorous connections between our theoretical findings and better strategies for evaluating and mitigating the bias of models.
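To make the hyperparameter search procedure described above concrete, the following sketch (our own; it uses scikit-learn's ridge regression, a squared-error validation risk, and illustrative thresholds, none of which are prescribed by the paper) fits the main model and the per-group auxiliary models over a grid of λ, estimates ODD, EDD, and ADD on a held-out validation set, and keeps only candidates whose overall risk and amplification stay below chosen thresholds.

```python
import numpy as np
from sklearn.linear_model import Ridge

def group_risk(model, X, y):
    """Mean squared error of `model` on one group's validation data."""
    return float(np.mean((model.predict(X) - y) ** 2))

def amplification_profile(train, val, lambdas):
    """For each candidate penalty, estimate ODD (single model), EDD (separate models),
    ADD = ODD / EDD, and the overall validation risk of the single model."""
    (X1, y1), (X2, y2) = train
    (Xv1, yv1), (Xv2, yv2) = val
    X, y = np.vstack([X1, X2]), np.concatenate([y1, y2])
    results = []
    for lam in lambdas:
        together = Ridge(alpha=lam).fit(X, y)    # main model on the pooled data
        sep1 = Ridge(alpha=lam).fit(X1, y1)      # auxiliary model for group 1
        sep2 = Ridge(alpha=lam).fit(X2, y2)      # auxiliary model for group 2
        r1, r2 = group_risk(together, Xv1, yv1), group_risk(together, Xv2, yv2)
        odd = abs(r1 - r2)
        edd = abs(group_risk(sep1, Xv1, yv1) - group_risk(sep2, Xv2, yv2))
        add = odd / edd if edd > 0 else float("inf")
        overall = group_risk(together, np.vstack([Xv1, Xv2]), np.concatenate([yv1, yv2]))
        results.append({"lambda": lam, "ODD": odd, "EDD": edd, "ADD": add, "overall": overall})
    return results

# Example usage (thresholds are illustrative): keep the largest lambda whose overall
# validation risk is acceptable and whose estimated ADD indicates no amplification.
# profile = amplification_profile(train, val, lambdas=np.logspace(-6, 1, 15))
# ok = [r for r in profile if r["overall"] < 1.5 and r["ADD"] < 1.0]
```

In line with the discussion above, the grid can be kept coarse away from the interpolation threshold, where the ADD profile is typically monotone in λ, and refined near ψ ≈ 1, where it need not be.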