# Demystifying Disagreement-on-the-Line in High Dimensions

Donghwan Lee*1, Behrad Moniri*2, Xinmeng Huang1, Edgar Dobriban3, Hamed Hassani2

*Equal contribution. 1Graduate Group in Applied Mathematics and Computational Science, University of Pennsylvania, PA, USA. 2Department of Electrical and Systems Engineering, University of Pennsylvania, PA, USA. 3Department of Statistics and Data Science, University of Pennsylvania, PA, USA. Correspondence to: Donghwan Lee, Behrad Moniri. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

Evaluating the performance of machine learning models under distribution shifts is challenging, especially when we only have unlabeled data from the shifted (target) domain, along with labeled data from the original (source) domain. Recent work suggests that the notion of disagreement, the degree to which two models trained with different randomness differ on the same input, is a key to tackling this problem. Experimentally, disagreement and prediction error have been shown to be strongly connected, which has been used to estimate model performance. Experiments have led to the discovery of the disagreement-on-the-line phenomenon, whereby the classification error under the target domain is often a linear function of the classification error under the source domain; and whenever this property holds, disagreement under the source and target domains follows the same linear relation. In this work, we develop a theoretical foundation for analyzing disagreement in high-dimensional random features regression, and study under what conditions the disagreement-on-the-line phenomenon occurs in our setting. Experiments on CIFAR-10-C, Tiny ImageNet-C, and Camelyon17 are consistent with our theory and support the universality of the theoretical findings.

1. Introduction

Modern machine learning methods such as deep neural networks are effective at prediction tasks when the input test data is similar to the data used during training. However, they can be extremely sensitive to changes in the input data distribution (e.g., Biggio et al. (2013); Szegedy et al. (2014); Hendrycks et al. (2020), etc.). This is a significant concern in safety-critical applications where errors are costly (e.g., Oakden-Rayner et al. (2020), etc.). In such scenarios, it is important to estimate how well the predictive model performs on out-of-distribution (OOD) data. Collecting labeled data from new distributions can be costly, but unlabeled data is often readily available. As such, recent research efforts have focused on developing methods that can estimate a predictive model's OOD performance using only unlabeled data (e.g., Garg et al. (2021); Deng & Zheng (2021); Chen et al. (2021); Guillory et al. (2021), etc.).

In particular, works dating back at least to Recht et al. (2019) suggest that the out-of-distribution (OOD) and in-distribution (ID) errors of predictive models of different complexities are highly correlated. This was rigorously proved in Tripuraneni et al. (2021) for the random features model under covariate shift. However, determining the correlation requires labeled OOD data. To sidestep this requirement, Baek et al. (2022) proposed an alternative approach that looks at the disagreement on an unlabeled set of data points between pairs of neural networks with the same architecture trained with different sources of randomness.
They observed a linear trend between ID and OOD disagreement, just as for ID and OOD error. Surprisingly, the linear trend had the same empirical slope and intercept as the linear trend between ID and OOD accuracy. This phenomenon, termed disagreement-on-the-line, allows estimating the linear relationship between OOD and ID error using only unlabeled data, and finally allows estimating the OOD error.

At the moment, the theoretical basis for disagreement-on-the-line remains unclear. It is unknown how generally it occurs, and what factors (such as the type of models or data used) may influence it. To better understand or even demystify these empirical findings, in this paper, we develop a theoretical foundation for studying disagreement. We focus on the following key questions: Is disagreement-on-the-line a universal phenomenon? Under what conditions is it guaranteed to happen, and what happens if those conditions fail?

Figure 1. Target vs. source risk and shared-sample disagreement of the random features model trained on CIFAR-10. Solid lines are derived from Theorem 4.1. The target domain is CIFAR-10-C-Fog (Hendrycks & Dietterich, 2018). See Section 5 for details.

To work towards answering these questions, we study disagreement in a widely used theoretical framework for high-dimensional learning, random features models. We consider a setting where the input data is from a Gaussian distribution, but possibly with a different covariance structure at training and test time, and study disagreement in the high-dimensional/proportional limit setting. We define various types of disagreement depending on what randomness the two models share. We rigorously prove that, depending on the type of shared randomness and the regime of parameterization, the disagreement-on-the-line phenomenon may or may not occur in random features models trained using ridgeless least squares. Moreover, in contrast to prior observations, the line for disagreement and the line for risk may have different intercepts, even if they share the same slope. Additionally, we prove that adding ridge regularization breaks the exact linear relation, but an approximate linear relation still exists. Thus, we find that even in a simple theoretical setting, disagreement-on-the-line is a nuanced phenomenon that can depend on the type of randomness shared, regularization, and the level of overparametrization. Experiments we performed on CIFAR-10-C and other datasets are consistent with our theory, even though the assumptions of Gaussianity of inputs and linearity of the data generation are not met (Figures 1 and 4). This suggests that our theory is relevant beyond our theoretical setting.

1.1. Main Contributions

We provide an overview of the paper and our results. We propose a framework for the theoretical study of disagreement. We introduce a comprehensive and unifying set of notions of disagreement (Definition 2.1). Then, we find a limiting formula for disagreement in the high-dimensional limit where the sample size, input dimension, and feature dimension grow proportionally (Theorem 3.1). Based on this characterization, we study how disagreement under the source and target domains are related. We identify under what conditions and for which types of disagreement the disagreement-on-the-line phenomenon holds (Section 4). Theorem 4.3 and Corollary 4.4 show an approximate linear relation when the conditions are not met.
When the disagreement-on-the-line phenomenon holds in our model, our results imply that the target vs. source line for risk and the target vs. source line for disagreement have the same slope. This is consistent with the findings of Baek et al. (2022), that whenever OOD vs. ID accuracy is on a line, OOD vs. ID agreement is also on the same line. However, unlike their finding, in our problem, the intercepts of the lines can be different (Remark 4.2).

In Section 5, we conduct experiments on several datasets including CIFAR-10-C, Tiny ImageNet-C, and Camelyon17. The experimental results are generally consistent with our theoretical findings, even though the theoretical conditions we use (e.g., Gaussian input, linear generative model, etc.) may not hold. This suggests a possible universality of the theoretical predictions.

Our work shows that disagreement-on-the-line is a subtle phenomenon that depends on the shared randomness, regularization, and regime of parameterization. We also identify a difference between the intercept of the line for risk and the line for disagreement. If these factors are not properly considered, the disagreement-on-the-line principle can lead to inaccurate OOD performance estimation.

1.2. Related Work

Random Features Model. Random features models were introduced by Rahimi & Recht (2007) as an approach for scaling kernel methods to massive datasets. Recently, they have been used as a standard model for the theoretical study of deep neural networks. Despite their simplicity, they are rich enough to capture various phenomena of deep learning including double descent (Mei & Montanari, 2022; Adlam et al., 2022; Lin & Dobriban, 2021), adversarial training (Hassani & Javanmard, 2022), feature learning (Ba et al., 2022), and transfer learning (Tripuraneni et al., 2021). In particular, in this model, the number of parameters and the ambient dimension are disentangled, hence the effect of overparameterization can be studied on its own.

Linear Relation Under Distribution Shift. Several intriguing phenomena have been observed in empirical studies of distribution shifts. Recht et al. (2019); Hendrycks et al. (2021); Koh et al. (2021); Taori et al. (2020); Miller et al. (2021) observed linear trends between OOD and ID test error. Tripuraneni et al. (2021) proved this phenomenon in random features models under covariate shift. Recently, the notion of disagreement has been gaining a lot of attention (e.g., Hacohen et al. (2020); Chen et al. (2021); Jiang et al. (2021); Nakkiran & Bansal (2020); Baek et al. (2022); Atanov et al. (2022); Pliushch et al. (2022), etc.). In particular, Baek et al. (2022) empirically showed that the OOD agreement between the predictions of pairs of neural networks also has a strong linear correlation with their ID agreement. They further observed that the slope and intercept of the OOD vs. ID agreement line closely match those of the accuracy line. This can be used to predict the OOD performance of predictive models using only unlabeled data.

High-dimensional Asymptotics. Work on high-dimensional asymptotics dates back at least to the 1960s (Raudys, 1967; Deev, 1970; Raudys, 1972) and has more recently been studied in a wide range of areas, such as high-dimensional statistics (e.g., Raudys & Young (2004); Serdobolskii (2007); Paul & Aue (2014); Yao et al.
(2015); Dobriban & Wager (2018), etc.), wireless communications (e.g., Tulino & Verdú (2004); Couillet & Debbah (2011), etc.), and machine learning (e.g., Györgyi & Tishby (1990); Opper (1995); Opper & Kinzel (1996); Couillet & Liao (2022); Engel & Van den Broeck (2001), etc.).

Technical Tools. The results derived in this paper rely on the Gaussian equivalence conjecture studied and used extensively for random features models (e.g., Goldt et al. (2022); Hu & Lu (2022); Montanari & Saeed (2022); Mei & Montanari (2022); Hassani & Javanmard (2022); Tripuraneni et al. (2021); Loureiro et al. (2021); d'Ascoli et al. (2021), etc.). Our analytical results build upon a series of recent works (Mel & Pennington, 2021; Adlam & Pennington, 2020a; Tripuraneni et al., 2021) using random matrix theory and operator-valued free probability (Far et al., 2008; Mingo & Speicher, 2017).

2. Preliminaries

2.1. Problem Setting

We study a supervised learning setting where the training data $(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$, $i \in [n]$, of dimension $d$ and sample size $n$, is generated according to
$$x_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \Sigma_s), \qquad y_i = \frac{1}{\sqrt{d}}\,\beta^\top x_i + \varepsilon_i, \tag{1}$$
where $\varepsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_\varepsilon^2)$. Additionally, the true coefficient $\beta \in \mathbb{R}^d$ is assumed to be randomly drawn from $\mathcal{N}(0, I_d)$. The linear relationship between $(x_i, y_i)$ is not known. We fit a model to the data, which can then be used to predict labels for unlabeled examples at test time.

We consider two-layer neural networks with fixed, randomly generated weights in the first layer (a random features model) as the learner. We let the width of the internal layer be $N \in \mathbb{N}$. For a weight matrix $W \in \mathbb{R}^{N \times d}$ with i.i.d. random entries sampled from $\mathcal{N}(0, 1)$, an activation function $\sigma : \mathbb{R} \to \mathbb{R}$ applied elementwise, and the weights $a \in \mathbb{R}^N$ of a linear layer, the random features model is defined by
$$f_{W,a}(x) = \frac{1}{\sqrt{N}}\, a^\top \sigma\!\left(\frac{Wx}{\sqrt{d}}\right).$$
The trainable parameters $a \in \mathbb{R}^N$ are fit via ridge regression to the training data $X = (x_1, \ldots, x_n) \in \mathbb{R}^{d \times n}$ and $Y = (y_1, \ldots, y_n) \in \mathbb{R}^n$. Specifically, for a regularization parameter $\gamma > 0$, we solve
$$\hat{a} = \underset{a \in \mathbb{R}^N}{\arg\min}\ \Big\| Y - \tfrac{1}{\sqrt{N}}\,\sigma(WX/\sqrt{d})^\top a \Big\|_2^2 + \gamma \|a\|_2^2,$$
and use $\hat{y}(x) = \frac{1}{\sqrt{N}}\,\hat{a}^\top \sigma(Wx/\sqrt{d})$ as the model prediction for a data point $x \in \mathbb{R}^d$. Defining $F = \sigma(WX/\sqrt{d})$ and $f = \sigma(Wx/\sqrt{d})$, we can write
$$\hat{y}(x) = Y^\top \Big(\frac{1}{N} F^\top F + \gamma I_n\Big)^{-1} \frac{1}{N} F^\top f. \tag{2}$$
To emphasize the dependence on $W, X, Y$, we also use the notation $\hat{y}_{W,X,Y}$.

It has been recognized in, e.g., Adlam & Pennington (2020a); Ghorbani et al. (2021); Mei & Montanari (2022) that only linear data generative models can be learned in the proportional-limit high-dimensional regime by random features models, and the non-linear part behaves like additive noise. Thus, we consider linear generative models as in (1). Results for non-linear models can be obtained via linearization, as is standard in the above work. We also highlight that our theoretical findings are validated by simulations on standard datasets (such as CIFAR-10-C) where the input distribution is non-Gaussian and the data generation model is non-linear.

2.2. Distribution Shift

At training time (1), the inputs $x_i$ are sampled from the source domain, $\mathcal{D}_s = \mathcal{N}(0, \Sigma_s)$. At test time, we assume the input distribution shifts to the target domain, $\mathcal{D}_t = \mathcal{N}(0, \Sigma_t)$. We do not restrict the change in $P(y|x)$ since disagreement is independent of the label $y$. Previous work (Lei et al., 2021; Tripuraneni et al., 2021; Wu et al., 2022) found that the learning problem under covariate shift is fully characterized by the input covariance matrices. For this reason, we do not consider shifts in the mean of the input distribution.
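To make the estimator in (2) concrete, the following sketch implements the random features ridge predictor on synthetic data generated as in (1). It is a minimal illustration under simplifying assumptions (we take $\Sigma_s = I_d$), not the code released with the paper; the function and variable names are our own.

```python
import numpy as np

def rf_ridge_predictor(X, Y, W, gamma, activation=lambda z: np.maximum(z, 0)):
    """Random features ridge regression following Eq. (2).

    X : (d, n) training inputs, Y : (n,) training labels,
    W : (N, d) fixed random first-layer weights, gamma : ridge parameter.
    Returns a function mapping a test point x of shape (d,) to its prediction.
    """
    d, n = X.shape
    N = W.shape[0]
    F = activation(W @ X / np.sqrt(d))          # (N, n) training features
    K = F.T @ F / N + gamma * np.eye(n)         # (n, n) regularized feature Gram matrix
    alpha = np.linalg.solve(K, Y)               # Y^T (F^T F / N + gamma I_n)^{-1}

    def predict(x):
        f = activation(W @ x / np.sqrt(d))      # (N,) test features
        return alpha @ (F.T @ f) / N            # Eq. (2)

    return predict

# Toy usage on data from the linear model (1), with Sigma_s = I_d for simplicity.
rng = np.random.default_rng(0)
d, n, N, gamma, sigma_eps = 100, 200, 400, 1e-2, 0.5
beta = rng.standard_normal(d)
X = rng.standard_normal((d, n))
Y = X.T @ beta / np.sqrt(d) + sigma_eps * rng.standard_normal(n)
W = rng.standard_normal((N, d))
yhat = rf_ridge_predictor(X, Y, W, gamma)
print(yhat(rng.standard_normal(d)))
```

The same routine, run twice with independent draws of W on a shared (X, Y), can be reused to simulate the notions of disagreement introduced next.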
2.3. Definition of Disagreement

Hacohen et al. (2020); Chen et al. (2021); Jiang et al. (2021); Nakkiran & Bansal (2020); Baek et al. (2022) define notions of disagreement (or agreement) to quantify the difference (or similarity) between the predictions of two randomly trained predictive models in classification tasks. Prior work on disagreement considers three sources of randomness that lead to different predictive models: (i) random initialization, (ii) sampling of the training set, and (iii) sampling/ordering of mini-batches. Motivated by these results, we propose analogous notions of disagreement in random features regression. We consider (i), (ii), and their combination, as (iii) is not present in our problem.

The independent disagreement measures how much the predictions of two models with independent random weights, trained on two independent training sets, disagree on average. Similar notions were used in (Nakkiran & Bansal, 2020; Pliushch et al., 2022; Jiang et al., 2021; Baek et al., 2022). The shared-sample disagreement measures the average difference of the predictions of two models with independent random weights, but trained on a shared training set. Similar notions were used in (Pliushch et al., 2022; Jiang et al., 2021; Baek et al., 2022; Atanov et al., 2022). The shared-weight disagreement measures the average difference of the predictions of two models with shared random weights, but trained on two independent training samples. Similar notions were used in (Jiang et al., 2021; Baek et al., 2022). While the prior work typically used the 0-1 loss to define agreement/disagreement in classification, we use the squared loss to measure disagreement for real-valued outputs.

Definition 2.1 (Disagreement). Consider two random features models trained on the data $(X_1, Y_1), (X_2, Y_2) \in \mathbb{R}^{d \times n} \times \mathbb{R}^n$ with random weight matrices $W_1, W_2 \in \mathbb{R}^{N \times d}$, respectively. We measure the disagreement of the two models by their mean squared difference
$$\mathrm{Dis}^j_i(n, d, N, \gamma) = \mathbb{E}\Big[\big(\hat{y}_{W_1,X_1,Y_1}(x) - \hat{y}_{W_2,X_2,Y_2}(x)\big)^2\Big],$$
where the expectation is over $\beta, W_1, W_2, X_1, Y_1, X_2, Y_2$, $j \in \{s, t\}$ is the domain that $x \sim \mathcal{D}_j$ is from, and the index $i \in \{\mathrm{I}, \mathrm{SS}, \mathrm{SW}\}$ corresponds to one of the following cases.

Independent disagreement (i = I): the training data $(X_1, Y_1), (X_2, Y_2)$ are independently generated from (1), with the same $\beta$. The weights $W_1, W_2 \in \mathbb{R}^{N \times d}$ are independent matrices with i.i.d. $\mathcal{N}(0, 1)$ entries.

Shared-Sample disagreement (i = SS): the training samples are shared, i.e., $(X_1, Y_1) = (X_2, Y_2) = (X, Y)$, where $(X, Y)$ is generated from (1). The weights $W_1, W_2 \in \mathbb{R}^{N \times d}$ are independent matrices with i.i.d. $\mathcal{N}(0, 1)$ entries.

Shared-Weight disagreement (i = SW): the training data $(X_1, Y_1), (X_2, Y_2)$ are independently generated from (1), with the same $\beta$. The weights are shared, i.e., $W_1 = W_2 = W$, where $W \in \mathbb{R}^{N \times d}$ is a matrix with i.i.d. $\mathcal{N}(0, 1)$ entries.

2.4. Conditions

We characterize the asymptotics of disagreement in the proportional limit asymptotic regime defined as follows.

Condition 2.2 (Asymptotic setting). We assume that $n, d, N \to \infty$ with $d/n \to \varphi > 0$ and $d/N \to \psi > 0$.

To characterize the limit of disagreement, we need conditions on the spectral properties of $\Sigma_s$ and $\Sigma_t$ as their dimension $d$ grows. When multiple growing matrices are involved, it is not sufficient to make assumptions on the individual spectra of the matrices; rather, they have to be considered jointly (Wu & Xu, 2020; Tripuraneni et al., 2021; Mel & Pennington, 2021).
We assume that the joint spectral distribution of $\Sigma_s$ and $\Sigma_t$ converges to a limiting distribution $\mu$ on $\mathbb{R}^2_+$ as $d \to \infty$.

Condition 2.3. Let $\lambda^s_1, \ldots, \lambda^s_d \ge 0$ be the eigenvalues of $\Sigma_s$ and $v_1, \ldots, v_d$ be the corresponding eigenvectors. Define $\lambda^t_i = v_i^\top \Sigma_t v_i$ for $i \in [d]$. We assume the joint empirical spectral distribution of $(\lambda^s_i, \lambda^t_i)$, $i \in [d]$, converges in distribution to a limiting distribution $\mu$ on $\mathbb{R}^2_+$. That is, $\frac{1}{d}\sum_{i=1}^d \delta_{(\lambda^s_i, \lambda^t_i)} \to \mu$, where $\delta$ is the Dirac delta measure. We additionally assume that $\mu$ has a compact support. We denote random variables drawn from $\mu$ by $(\lambda^s, \lambda^t)$, and write $m_s = \mathbb{E}_\mu[\lambda^s]$ and $m_t = \mathbb{E}_\mu[\lambda^t]$.

For the existence of certain derivatives and expectations, we assume the following mild condition on the activation function $\sigma : \mathbb{R} \to \mathbb{R}$.

Condition 2.4. The activation function $\sigma : \mathbb{R} \to \mathbb{R}$ is differentiable almost everywhere. There are constants $c_0$ and $c_1$ such that $|\sigma(x)|, |\sigma'(x)| \le c_0 e^{c_1 |x|}$, whenever $\sigma'(x)$ exists.

For $j \in \{s, t\}$ and a standard Gaussian random variable $Z \sim \mathcal{N}(0, 1)$, define
$$\rho_j = \frac{\mathbb{E}\big[Z\,\sigma(\sqrt{m_j}\, Z)\big]^2}{m_j}, \qquad \omega_j = \frac{\mathbb{V}\big[\sigma(\sqrt{m_j}\, Z)\big]}{\rho_j} - m_j. \tag{3}$$
These constants characterize the non-linearity of the activation $\sigma$ and will appear in the asymptotics of disagreement. Note that when $\sigma$ is the ReLU activation $\sigma(x) = \max(x, 0)$, we have $\rho_j = 1/4$ and $\omega_j = m_j(1 - 2/\pi)$ for $j \in \{s, t\}$.

3. Asymptotics of Disagreement

In this section, we present our results on characterizing the limits of disagreement defined in Definition 2.1 for random features models. We introduce results for general ridge regression and also study the ridgeless limit $\gamma \to 0$.

3.1. Ridge Setting

For $i \in \{\mathrm{I}, \mathrm{SS}, \mathrm{SW}\}$ and $j \in \{s, t\}$, define the asymptotic disagreement $\mathrm{Dis}^j_i(\varphi, \psi, \gamma) = \lim_{n,d,N \to \infty} \mathrm{Dis}^j_i(n, d, N, \gamma)$, where the limit is in the regime considered in Condition 2.2. Asymptotics in random features models and linear models with general covariance (e.g., training/test error, bias, variance, etc.) typically do not have a closed form, and can only be implicitly described through self-consistent equations (Tulino & Verdú, 2004; Dobriban & Wager, 2018; Adlam et al., 2022; Mei & Montanari, 2022; Hastie et al., 2022). To facilitate analysis of these implicit quantities, previous work (e.g., Dobriban & Sheng (2021; 2020); Tripuraneni et al. (2021); Mel & Pennington (2021), etc.) proposed using expressions containing only one implicit scalar. We show that, similar to the asymptotic risk derived in Tripuraneni et al. (2021), the asymptotic disagreements can be expressed using a scalar $\kappa$ which is the unique non-negative solution of the self-consistent equation
$$\kappa = \frac{\psi + \varphi - \sqrt{(\psi - \varphi)^2 + 4\kappa\psi\varphi\gamma/\rho_s}}{2\psi\big(\omega_s + I^s_{1,1}(\kappa)\big)}, \tag{4}$$
where $I^j_{a,b}$ is the integral functional of $\mu$ defined by
$$I^j_{a,b}(\kappa) = \varphi\,\mathbb{E}_\mu[\,\cdots\,], \qquad j \in \{s, t\}. \tag{5}$$
We omit $\kappa$ and simply write $I^j_{a,b}$ whenever the argument is clear from the context. Recall from Condition 2.3 that $\mu$ describes the joint spectral properties of the source and target covariance matrices, so $I^j_{a,b}$ can be viewed as a summary of the joint spectral properties. The following theorem, our first main result, shows that $\mathrm{Dis}^j_{\mathrm{I}}(\varphi, \psi, \gamma)$, $\mathrm{Dis}^j_{\mathrm{SS}}(\varphi, \psi, \gamma)$, and $\mathrm{Dis}^j_{\mathrm{SW}}(\varphi, \psi, \gamma)$ are well defined, and characterizes them.

Theorem 3.1 (Disagreement in general ridge regression).
For $j \in \{s, t\}$, the asymptotic independent disagreement is

Disj I (ϕ, ψ, γ) = 2ρjψκ ϕγ + ρsγ(τψ + τϕ)(ωs + ϕIs 1,2) h γτ(ωj + ϕIj 1,2)Is 2,2 + (σ2 ε + Is 1,1)(ωs + ϕIs 1,2)(ωj + Ij 1,1) ψ γ τ(σ2 ε + ϕIs 1,2)Ij 2,2 i ,

and the asymptotic shared-sample disagreement is
$$\mathrm{Dis}^j_{\mathrm{SS}}(\varphi, \psi, \gamma) = \mathrm{Dis}^j_{\mathrm{I}}(\varphi, \psi, \gamma) - \frac{2\rho_j\kappa^2(\sigma_\varepsilon^2 + \varphi I^s_{1,2})\, I^j_{2,2}}{\rho_s(1 - \kappa^2 I^s_{2,2})},$$
and the asymptotic shared-weight disagreement is
$$\mathrm{Dis}^j_{\mathrm{SW}}(\varphi, \psi, \gamma) = \mathrm{Dis}^j_{\mathrm{I}}(\varphi, \psi, \gamma) - \frac{2\rho_j\psi\kappa^2(\omega_j + \varphi I^j_{1,2})\, I^s_{2,2}}{\rho_s(\varphi - \psi\kappa^2 I^s_{2,2})},$$
where $\tau$ and $\bar\tau$ are the limiting normalized traces of $(F^\top F/N + \gamma I_n)^{-1}$ and $(FF^\top/N + \gamma I_N)^{-1}$, respectively. They can be expressed as algebraic functions of $\kappa$ involving $\sqrt{(\psi - \varphi)^2 + 4\kappa\psi\varphi\gamma/\rho_s}$ and $\psi - \varphi$.

The expressions in Theorem 3.1 are written in terms of the non-linearity constants $\rho_s, \rho_t, \omega_s, \omega_t$, the dimension parameters $\psi, \varphi$, the regularization $\gamma$, the noise level $\sigma_\varepsilon^2$, the summary statistics $I^s_{a,b}, I^t_{a,b}$ of $\mu$, and $\tau, \bar\tau, \kappa$. Since $\tau, \bar\tau$ are algebraic functions of $\kappa$, the expressions are functions of one implicit variable $\kappa$.

This theorem can be used to make numerical predictions for disagreement. To do so, we first solve the self-consistent equation (4) using a fixed-point iteration and find $\kappa$. Then, we plug $\kappa$ into the terms appearing in the theorem. Figure 2 shows an example, supporting that the theoretical predictions of Theorem 3.1 match very well with simulations even for moderately large $d, n, N$.

Figure 2. Independent, shared-sample, and shared-weight disagreement under the target domain in random features regression with ReLU activation function, ϕ = lim d/n = 0.5, versus ϕ/ψ = lim N/n. We set γ = 0.01, σ²ε = 0.25, and µ = 0.5δ(1.5,5) + 0.5δ(1,1). Simulations are done with d = 512, n = 1024, and averaged over 300 trials. The continuous lines are theoretical predictions from Theorem 3.1, and the dots are simulation results.

Theoretical Innovations. To prove this theorem, we first rely on Gaussian equivalence (Sections A.3, A.4) to express disagreement as a combination of traces of rational functions of i.i.d. Gaussian matrices. Then, we construct linear pencils (Section A.5) and use the theory of operator-valued free probability (Sections A.1, A.2) to derive the limit of these trace objects. This general strategy has been used previously in Adlam et al. (2022); Adlam & Pennington (2020b); Tripuraneni et al. (2021); Mel & Pennington (2021). However, in the expressions of disagreement, new traces appear that did not exist in prior work. We construct new suitable linear pencils to derive the limit of these traces. While this leads to a coupled system of self-consistent equations in many variables, it turns out that they can be factored into a single scalar variable $\kappa$ defined through the self-consistent equation (4), and every term appearing in the limiting disagreements can be written as an algebraic function of $\kappa$. These results might also be of independent interest. Since the limiting disagreements only rely on the same implicit variable as the variable appearing in the limiting risk, we can derive the results in Section 4.

3.2. Ridgeless Limit

In the ridgeless limit $\gamma \to 0$, the self-consistent equation (4) for $\kappa$ becomes
$$\kappa = \frac{\min(1, \varphi/\psi)}{\omega_s + I^s_{1,1}(\kappa)}. \tag{7}$$
Further, the asymptotic limits in Theorem 3.1 can be simplified as follows.

Corollary 3.2 (Ridgeless limit).
For $j \in \{s, t\}$ and in the ridgeless limit $\gamma \to 0$, the asymptotic independent disagreement is

lim γ→0 Disj I (ϕ, ψ, γ) = 2ρjψκ ρs|ϕ ψ|(σ2 ε + Is 1,1)(ωj + Ij 1,1) 2ρjκ(σ2 ε+ϕIs 1,2)Ij 2,2 ρs(ωs+ϕIs 1,2) ϕ > ψ, 2ρjκ(ωj+ϕIj 1,2)Is 2,2 ρs(ωs+ϕIs 1,2) ϕ < ψ,

and the asymptotic shared-sample disagreement is

lim γ→0 Disj SS(ϕ, ψ, γ) = 2ρjψκ ρs|ϕ ψ|(σ2 ε + Is 1,1)(ωj + Ij 1,1) (ωj+ϕIj 1,2)Is 2,2 ωs+ϕIs 1,2 κ(σ2 ε+ϕIs 1,2)Ij 2,2 1 κ2Is 2,2

and the asymptotic shared-weight disagreement is

lim γ→0 Disj SW(ϕ, ψ, γ) = 2ρjψκ ρs|ϕ ψ|(σ2 ε + Is 1,1)(ωj + Ij 1,1) (σ2 ε+ϕIs 1,2)Ij 2,2 ωs+ϕIs 1,2 ψκ(ωj+ϕIj 1,2)Is 2,2 ϕ ψκ2Is 2,2

where $\kappa$ is defined in (7).

In the ridgeless limit, I and SS disagreement have a single term that depends on $\psi$, which motivates the analysis in Section 4 that examines the disagreement-on-the-line phenomenon. In contrast, SW disagreement has two linearly independent terms that are functions of $\psi$, leading to a distinct behavior compared to I and SS disagreement.

The asymptotics in Corollary 3.2 reveal another interesting phenomenon regarding disagreements of the random features model in the ridgeless limit. For example, it follows from Corollary 3.2 that SS disagreement tends to zero in the infinite overparameterization limit where the width $N$ of the internal layer is much larger than the data dimension $d$, so that $\psi = \lim d/N \to 0$. However, the same is not true for the I and SW disagreement. This indicates that, in the infinite overparameterization limit, the randomness caused by the random weights disappears, and the model is solely determined by the training sample.

4. When Does Disagreement-on-the-Line Hold?

In this section, based on the characterizations of disagreement derived in the previous section, we study for which types of disagreement and under what conditions the linear relationship between disagreement under the source and target domains of models of varying complexity holds. Table 1 summarizes the results.

Table 1. Existence of disagreement-on-the-line in the overparametrized regime for different regularization and types of disagreement (exact, approximate, or no linear relation).

|          | Dis_I and Dis_SS          | Dis_SW                           |
|----------|---------------------------|----------------------------------|
| γ → 0    | exact (Theorem 4.1)       | no linear relation (Section 4.2) |
| γ > 0    | approximate (Theorem 4.3) | no linear relation               |

4.1. I and SS disagreement

Ridgeless. In the overparametrized regime $\varphi > \psi$, the self-consistent equation (7) is independent of $\psi = \lim d/N$, and so is $\kappa$. This implies the following linear trend of I and SS disagreement in the ridgeless limit.

Figure 3. (a) Target vs. source I, SS, SW disagreement in the ridgeless and underparametrized regime (ϕ < ψ). There is no linear trend in this regime. (b) Deviation from the line, $\mathrm{Dis}^t_{\mathrm{SS}}(\varphi, \psi, \gamma) - a\,\mathrm{Dis}^s_{\mathrm{SS}}(\varphi, \psi, \gamma)$, as a function of ψ for non-zero γ. The deviation becomes larger as γ increases. See Section D.2 for figures for I disagreement and risk. (c) Target vs. source lines for I, SS disagreement and risk, in the overparametrized regime ψ/ϕ ∈ (0, 1). The lines have identical slopes but different intercepts. (d) Deviation from the line, $\lim_{\gamma \to 0}\mathrm{Dis}^t_{\mathrm{SW}}(\varphi, \psi, \gamma) - a\,\mathrm{Dis}^s_{\mathrm{SW}}(\varphi, \psi, \gamma)$, vs. $\lim_{\gamma \to 0}\mathrm{Dis}^s_{\mathrm{SW}}(\varphi, \psi, \gamma)$, in the overparametrized regime (ϕ > ψ). This shows that disagreement-on-the-line does not happen for SW disagreement. We use ϕ = 0.5, σ²ε = 10⁻⁴, and ReLU activation σ. We set µ = 0.4δ(0.1,1) + 0.6δ(1,0.1) in (a), (b), (d) and µ = 0.5δ(4,1) + 0.5δ(1,4) in (c).

Theorem 4.1 (Exact linear relation).
Define
$$a = \frac{\rho_t(\omega_t + I^t_{1,1})}{\rho_s(\omega_s + I^s_{1,1})}, \qquad b_{\mathrm{SS}} = 0, \qquad b_{\mathrm{I}} = \frac{2\kappa^2(\sigma_\varepsilon^2 + \varphi I^s_{1,2})\big(\rho_t I^t_{2,2} - a\,\rho_s I^s_{2,2}\big)}{\rho_s(1 - \kappa^2 I^s_{2,2})}, \tag{8}$$
for $\kappa$ satisfying (7). We fix $\varphi$ and regard the disagreement $\mathrm{Dis}^j_i(\varphi, \psi, \gamma)$, $i \in \{\mathrm{I}, \mathrm{SS}\}$, $j \in \{s, t\}$, as a function of $\psi$. In the overparametrized regime $\varphi > \psi$ and for $i \in \{\mathrm{I}, \mathrm{SS}\}$,
$$\lim_{\gamma \to 0} \mathrm{Dis}^t_i(\varphi, \psi, \gamma) = a \lim_{\gamma \to 0} \mathrm{Dis}^s_i(\varphi, \psi, \gamma) + b_i, \tag{9}$$
where the slope $a$ and the intercept $b_{\mathrm{I}}$ are independent of $\psi$.

Recall from (3) and (5) that $\rho_s, \rho_t, \omega_s, \omega_t$ are constants describing the non-linearity of the activation $\sigma$, and $I^s_{a,b}, I^t_{a,b}$ are statistics summarizing the spectra of $\Sigma_s, \Sigma_t$. Therefore, the slope $a$ is determined by the properties of $\sigma$, $\Sigma_s$, and $\Sigma_t$. By plugging in sample covariances, we can build an estimate of the slope in finite-sample settings (see the sketch following Corollary 4.4 below). Also, as a sanity check, if we set $\Sigma_s = \Sigma_t$, then we recover $a = 1$ and $b_{\mathrm{I}} = 0$, as there will be no difference between the source and target domains.

Remark 4.2. The slope $a = \rho_t(\omega_t + I^t_{1,1})/\big(\rho_s(\omega_s + I^s_{1,1})\big)$ is the same as the slope from Proposition C.3. This is consistent with the empirical observations from Baek et al. (2022) that the linear trend between ID disagreement and OOD disagreement has the same slope as the linear trend between ID risk and OOD risk. However, unlike in Baek et al. (2022), in our case, the intercepts can be different. This can be seen in Figure 1 and Figure 3 (c), and also from (51). Our analysis provides an explicit formula for the intercepts. Specifically, the intercepts can be numerically computed using equations (8), (51), and Theorem C.1 if $\sigma_\varepsilon^2$ is known. Note that in the general case of non-linear generative models, $\sigma_\varepsilon^2$ corresponds to the sum of the noise level and the non-linear component of the data-generating function. By estimating $\sigma_\varepsilon^2$, we can obtain estimates of the intercepts, which can potentially be used for OOD performance estimation.

Ridge. When $\gamma > 0$, the exact linear relation between source disagreement and target disagreement no longer holds in our model. However, it turns out that there is still an approximate linear relation, as we show next.

Theorem 4.3 (Approximate linear relation of disagreement). Let $a, b_{\mathrm{SS}}, b_{\mathrm{I}}$ be defined as in (8). Given $\varphi > \psi$, the deviation from the line, for I and SS disagreement, is bounded by
$$\big|\mathrm{Dis}^t_{\mathrm{I}}(\varphi, \psi, \gamma) - a\,\mathrm{Dis}^s_{\mathrm{I}}(\varphi, \psi, \gamma) - b_{\mathrm{I}}\big| \le \frac{C\big(\sqrt{\psi\gamma} + \psi\gamma + \gamma\sqrt{\psi\gamma}\big)}{1 - \psi/\varphi + \sqrt{\psi\gamma}},$$
$$\big|\mathrm{Dis}^t_{\mathrm{SS}}(\varphi, \psi, \gamma) - a\,\mathrm{Dis}^s_{\mathrm{SS}}(\varphi, \psi, \gamma)\big| \le \frac{C\big(\sqrt{\psi\gamma} + \psi\gamma + \gamma\sqrt{\psi\gamma}\big)}{1 - \psi/\varphi + \sqrt{\psi\gamma}},$$
where $C > 0$ depends on $\varphi$, $\mu$, $\sigma_\varepsilon^2$, and $\sigma$.

We see that the upper bounds vanish as $\gamma \to 0$, consistent with Theorem 4.1. Also, the upper bound for SS disagreement vanishes as $\psi \to 0$, which is confirmed in Figure 3 (b).

We now present an analog of Theorem 4.3 for the prediction error of the random features model. This is a generalization of Proposition C.3, which shows an exact linear relation between risks in the ridgeless and overparametrized regime.

Figure 4. (a) CIFAR-10-C-Snow (severity 3), (b) Tiny ImageNet-C-Fog (severity 3), (c) Camelyon17. For more results, see Section D.3.

Corollary 4.4 (Approximate linear relation of risk). Denote the prediction risk in the source and target domains by $E_s$, $E_t$, respectively (see Section C for definitions). Let $a, b_{\mathrm{risk}}$ be defined as in (8) and (51). Given $\varphi > \psi$, the deviation from the line, for risk, is bounded by
$$\big|E_t - a E_s - b_{\mathrm{risk}}\big| \le \frac{C\big(\gamma + \sqrt{\psi\gamma} + \psi\gamma + \gamma\sqrt{\psi\gamma} + \psi\gamma^2\big)}{\big(1 - \psi/\varphi + \sqrt{\psi\gamma}\big)^2},$$
where $C > 0$ depends on $\varphi$, $\mu$, $\sigma_\varepsilon^2$, and $\sigma$.
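To illustrate how the slope in Theorem 4.1 can be evaluated in practice, the following sketch estimates the non-linearity constants from (3) by Monte Carlo, solves the ridgeless fixed-point equation (7), and evaluates $a = \rho_t(\omega_t + I^t_{1,1})/\big(\rho_s(\omega_s + I^s_{1,1})\big)$. It is our own minimal illustration, not the released experiment code: the integral functionals $I^s_{1,1}$ and $I^t_{1,1}$ of (5) are assumed to be supplied by the user as callables, and all function names and the placeholder functionals in the usage line are hypothetical.

```python
import numpy as np

def nonlinearity_constants(sigma_fn, m, n_mc=1_000_000, seed=0):
    """Monte Carlo estimate of (rho, omega) from Eq. (3), given m = E_mu[lambda]."""
    z = np.random.default_rng(seed).standard_normal(n_mc)
    s = sigma_fn(np.sqrt(m) * z)
    rho = np.mean(z * s) ** 2 / m
    omega = np.var(s) / rho - m
    return rho, omega

def solve_kappa_ridgeless(I_s_11, phi, psi, omega_s, n_iter=200):
    """Fixed-point iteration for Eq. (7): kappa = min(1, phi/psi) / (omega_s + I^s_{1,1}(kappa)).
    I_s_11 must be a user-supplied callable implementing the functional in Eq. (5)."""
    kappa = 1.0
    for _ in range(n_iter):
        kappa = min(1.0, phi / psi) / (omega_s + I_s_11(kappa))
    return kappa

def theoretical_slope(Sigma_s, Sigma_t, sigma_fn, I_s_11, I_t_11, phi, psi):
    """Slope a from Theorem 4.1, using m_j = tr(Sigma_j)/d as in Condition 2.3."""
    d = Sigma_s.shape[0]
    m_s, m_t = np.trace(Sigma_s) / d, np.trace(Sigma_t) / d
    rho_s, omega_s = nonlinearity_constants(sigma_fn, m_s)
    rho_t, omega_t = nonlinearity_constants(sigma_fn, m_t)
    kappa = solve_kappa_ridgeless(I_s_11, phi, psi, omega_s)
    return (rho_t * (omega_t + I_t_11(kappa))) / (rho_s * (omega_s + I_s_11(kappa)))

# Illustration only: constant functionals stand in for the true I^j_{1,1} of Eq. (5).
a = theoretical_slope(np.eye(100), 2 * np.eye(100), lambda z: np.maximum(z, 0),
                      I_s_11=lambda k: 1.0, I_t_11=lambda k: 2.0, phi=0.5, psi=0.1)
print(a)
```

For ReLU, the Monte Carlo estimates should reproduce the closed forms $\rho_j = 1/4$ and $\omega_j = m_j(1 - 2/\pi)$ stated after Condition 2.4, which provides a simple check of the routine.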
Theorem 4.3 and Corollary 4.4 together show that the phenomenon we discussed in Remark 4.2 occurs, at least approximately, even when applying ridge regularization.

In the underparametrized case $\psi > \varphi$, the self-consistent equation (7) depends on $\psi$, and so does $\kappa$. Hence, there is no analog of the linear relation we find in Theorem 4.1 in this regime. Figure 3 (a) displays this phenomenon.

4.2. SW disagreement

In Corollary 3.2, unlike I and SS disagreement, SW disagreement contains two linearly independent functions of $\psi$. Hence, the disagreement-on-the-line phenomenon (9) cannot occur for any choice of slope and intercept independent of $\psi$. Figure 3 (a) and (d) confirm the non-linear relation between target vs. source SW disagreement in the underparametrized and overparametrized regimes, respectively.

5. Experiments

5.1. Experiments Setup

We conduct experiments on the following datasets. The associated code can be found at https://github.com/dh7401/RF-disagreement.

CIFAR-10-C. Hendrycks & Dietterich (2018) introduced a corrupted version of CIFAR-10 (Krizhevsky et al., 2009). We choose two classes and assign the label $y \in \{0, 1\}$ to each. We use CIFAR-10 as the source domain and CIFAR-10-C as the target domain.

Tiny ImageNet-C. Tiny ImageNet (Wu et al., 2017), a smaller version of ImageNet (Deng et al., 2009), consists of natural images of size 64 × 64 in 200 classes. Tiny ImageNet-C (Hendrycks & Dietterich, 2019) is a corrupted version of Tiny ImageNet. We down-sample images to 32 × 32 and create two super-classes, each consisting of 10 of the original classes. We consider Tiny ImageNet as the source domain and Tiny ImageNet-C as the target domain.

Camelyon17. Camelyon17 (Bandi et al., 2018) consists of tissue slide images collected from five different hospitals, and the task is to identify tumor cells in the images. Koh et al. (2021) proposed a patch-based variant of the task, where the input $x$ is a 96 × 96 image and the label $y \in \{0, 1\}$ indicates whether the central 32 × 32 region contains any tumor tissue. We crop the central 32 × 32 region and use it as the input in our problem. We use Hospital 0 as the source domain and Hospital 2 as the target domain.

We run random features regression with ReLU activation on these datasets. We use training sample size $n = 1000$, random features dimension $N \in \{3000, 4000, \ldots, 49000\}$, input dimension $d = 3072$, and regularization $\gamma = 0$. We test the trained model on the rest of the sample and plot target vs. source SS disagreement and risk. Plots for I and SW disagreement can be found in Section D.4. We estimate the covariances $\Sigma_s$ and $\Sigma_t$ using the test sample and derive the theoretical slope of the target vs. source line predicted by Theorem 4.1 (see Section D.1). Since the limiting spectral distribution of the sample covariance is generally different from that of the population covariance, we remark that this may lead to a biased estimate of the slope. As the intercept $b_{\mathrm{risk}}$ involves the unknown noise level $\sigma_\varepsilon^2$, it is difficult to make a theoretical prediction of its value. For this reason, we fit the intercept instead of using its theoretical value.

5.2. Results

While Theorem 4.1 is proved only for Gaussian inputs and a linear generative model, we observe the disagreement-on-the-line phenomenon on all three datasets (Figure 4), in which these assumptions are violated. In this regard, a flurry of recent research (see e.g., Hastie et al. (2022); Hu & Lu (2022); Loureiro et al. (2021); Goldt et al. (2022); Wang et al.
(2022); Dudeja et al. (2022); Montanari & Saeed (2022); Pesce et al. (2023)) has proved that findings assuming Gaussian inputs often hold in a much wider range of models. While none of the existing work exactly fits the setting considered in this paper, this gives yet another indication that our theory should remain true more generally. The rigorous characterization of this universality is left for future work. Also, we find that target vs. source risk does not exhibit a clear linear trend, especially in Tiny Image Net and Camelyon17. This is because Proposition C.3 does not hold in the case of concept shift, i.e., the shift in P(y|x). However, since disagreement is oblivious to the change of P(y|x), the disagreement-on-the-line is a general phenomenon happening regardless of the type of distribution shift. 6. Conclusion In this paper, we propose a framework to study various types of disagreement in the random features model. We precisely characterize disagreement in high dimensions and study how disagreement under the source and target domains relate to each other. Our results show that the occurrence of disagreement-on-the-line in the random features model can vary depending on the type of disagreement, regularization, and regime of parameterization. We show that, contrary to the prior observation, the line for disagreement and the line for risk can differ in their intercepts. We run experiments on several real-world datasets and show that the results hold in settings more general than the theoretical setting that we consider. When the above factors are not properly considered, OOD performance estimation using the disagreement-on-the-line phenomenon can be inaccurate and unreliable. Our findings indicate a potential for further examination of the disagreement-on-the-line principle. Acknowledgements The work of Behrad Moniri is supported by The Institute for Learning-enabled Optimization at Scale (TILOS), under award number NSF-CCF-2112665. Donghwan Lee was supported in part by ARO W911NF-20-1-0080, DCIST, Air Force Office of Scientific Research Young Investigator Program (AFOSR-YIP) #FA9550-20-1-0111 award; Xinmeng Huang was supported in part by the NSF DMS 2046874 (CAREER), NSF CAREER award CIF-1943064. Adlam, B. and Pennington, J. The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. In International Conference on Machine Learning, 2020a. Adlam, B. and Pennington, J. Understanding double descent requires a fine-grained bias-variance decomposition. In Advances in Neural Information Processing Systems, 2020b. Adlam, B., Levinson, J. A., and Pennington, J. A random matrix perspective on mixtures of nonlinearities in high dimensions. In International Conference on Artificial Intelligence and Statistics, 2022. Anderson, G. W. Convergence of the largest singular value of a polynomial in independent Wigner matrices. The Annals of Probability, 41(3B):2103 2181, 2013. Anderson, G. W. and Zeitouni, O. A CLT for a band matrix model. Probability Theory and Related Fields, 134(2): 283 338, 2006. Atanov, A., Filatov, A., Yeo, T., Sohmshetty, A., and Zamir, A. Task discovery: Finding the tasks that neural networks generalize on. In Advances in Neural Information Processing Systems, 2022. Ba, J., Erdogdu, M. A., Suzuki, T., Wang, Z., Wu, D., and Yang, G. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. In Advances in Neural Information Processing Systems, 2022. 
Baek, C., Jiang, Y., Raghunathan, A., and Kolter, J. Z. Agreement-on-the-line: Predicting the performance of neural networks under distribution shift. In Advances in Neural Information Processing Systems, 2022. Bai, Z. and Silverstein, J. W. Spectral Analysis of Large Dimensional Random Matrices, volume 20. Springer, 2010. Bandi, P., Geessink, O., Manson, Q., Van Dijk, M., Balkenhol, M., Hermsen, M., Bejnordi, B. E., Lee, B., Paeng, K., Zhong, A., et al. From detection of individual metastases to classification of lymph node status at the patient level: the Camelyon17 challenge. IEEE Transactions on Medical Imaging, 38(2):550 560, 2018. Demystifying Disagreement-on-the-Line in High Dimensions Banna, M., Merlev ede, F., and Peligrad, M. On the limiting spectral distribution for a large class of symmetric random matrices with correlated entries. Stochastic Processes and their Applications, 125(7):2700 2726, 2015. Banna, M., Najim, J., and Yao, J. A CLT for linear spectral statistics of large random information-plus-noise matrices. Stochastic Processes and their Applications, 130(4):2250 2281, 2020. Benaych-Georges, F. Rectangular random matrices, related convolution. Probability Theory and Related Fields, 144 (3):471 515, 2009. Biggio, B., Corona, I., Maiorca, D., Nelson, B., ˇSrndi c, N., Laskov, P., Giacinto, G., and Roli, F. Evasion attacks against machine learning at test time. In Proc. Joint European Conf. Mach. Learning and Knowledge Discovery in Databases, pp. 387 402, 2013. Chen, J., Liu, F., Avci, B., Wu, X., Liang, Y., and Jha, S. Detecting errors and estimating accuracy on unlabeled data with self-training ensembles. In Advances in Neural Information Processing Systems, 2021. Cheng, X. and Singer, A. The spectrum of random innerproduct kernel matrices. Random Matrices: Theory and Applications, 2(04):1350010, 2013. Couillet, R. and Debbah, M. Random Matrix Methods for Wireless Communications. Cambridge University Press, 2011. Couillet, R. and Liao, Z. Random Matrix Methods for Machine Learning. Cambridge University Press, 2022. d Ascoli, S., Gabri e, M., Sagun, L., and Biroli, G. On the interplay between data structure and loss function in classification problems. In Advances in Neural Information Processing Systems, 2021. Deev, A. Representation of statistics of discriminant analysis and asymptotic expansion when space dimensions are comparable with sample size. In Sov. Math. Dokl., volume 11, pp. 1547 1550, 1970. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Image Net: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, 2009. Deng, W. and Zheng, L. Are labels always necessary for classifier accuracy evaluation? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021. Dobriban, E. and Sheng, Y. Wonder: Weighted one-shot distributed ridge regression in high dimensions. Journal of Machine Learning Research, 21(66):1 52, 2020. Dobriban, E. and Sheng, Y. Distributed linear regression by averaging. The Annals of Statistics, 49(2):918 943, 2021. Dobriban, E. and Wager, S. High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247 279, 2018. Dudeja, R., Sen, S., and Lu, Y. M. Spectral universality of regularized linear regression with nearly deterministic sensing matrices. ar Xiv preprint ar Xiv:2208.02753, 2022. Dyson, F. J. A Brownian-motion model for the eigenvalues of a random matrix. 
Journal of Mathematical Physics, 3 (6):1191 1198, 1962. El Karoui, N. The spectrum of kernel random matrices. The Annals of Statistics, 38(1):1 50, 2010. Engel, A. and Van den Broeck, C. Statistical mechanics of learning. Cambridge University Press, 2001. Erd os, L. The matrix Dyson equation and its applications for random matrices. ar Xiv preprint ar Xiv:1903.10060, 2019. Erd os, L., P ech e, S., Ram ırez, J. A., Schlein, B., and Yau, H.-T. Bulk universality for Wigner matrices. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 63(7):895 925, 2010. Erd os, L., Yau, H.-T., and Yin, J. Bulk universality for generalized Wigner matrices. Probability Theory and Related Fields, 154(1):341 407, 2012. Fan, Z. and Montanari, A. The spectral norm of random inner-product kernel matrices. Probability Theory and Related Fields, 173(1):27 85, 2019. Far, R. R., Oraby, T., Bryc, W., and Speicher, R. Spectra of large block matrices. ar Xiv preprint cs/0610045, 2006. Far, R. R., Oraby, T., Bryc, W., and Speicher, R. On slowfading MIMO systems with nonseparable correlation. IEEE Transactions on Information Theory, 54(2):544 553, 2008. Garg, S., Balakrishnan, S., Lipton, Z. C., Neyshabur, B., and Sedghi, H. Leveraging unlabeled data to predict outof-distribution performance. In International Conference on Learning Representations, 2021. Gaudin, M. Sur la loi limite de l espacement des valeurs propres d une matrice ale atoire. Nuclear Physics, 25: 447 458, 1961. Ghorbani, B., Mei, S., Misiakiewicz, T., and Montanari, A. Linearized two-layers neural networks in high dimension. The Annals of Statistics, 49(2):1029 1054, 2021. Demystifying Disagreement-on-the-Line in High Dimensions Goldt, S., Loureiro, B., Reeves, G., Krzakala, F., M ezard, M., and Zdeborov a, L. The Gaussian equivalence of generative models for learning with shallow neural networks. In Mathematical and Scientific Machine Learning, pp. 426 471, 2022. Guillory, D., Shankar, V., Ebrahimi, S., Darrell, T., and Schmidt, L. Predicting with confidence on unseen distributions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021. Gy orgyi, G. and Tishby, N. Statistical theory of learning a rule. Neural networks and spin glasses, pp. 3 36, 1990. Haagerup, U. and Thorbjørnsen, S. A new application of random matrices: Ext(C red(F2)) is not a group. Annals of Mathematics, 162(2):711 775, 2005. Haagerup, U., Schultz, H., and Thorbjørnsen, S. A random matrix approach to the lack of projections in C red(F2). Advances in Mathematics, 204(1):1 83, 2006. Hacohen, G., Choshen, L., and Weinshall, D. Let s agree to agree: Neural networks share classification order on real datasets. In International Conference on Machine Learning, 2020. Hassani, H. and Javanmard, A. The curse of overparametrization in adversarial training: Precise analysis of robust generalization for random features regression. ar Xiv preprint ar Xiv:2201.05149, 2022. Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2):949 986, 2022. Helton, J. W., Far, R. R., and Speicher, R. Operator-valued semicircular elements: solving a quadratic matrix equation with positivity constraints. International Mathematics Research Notices, 2007(9), 2007. Helton, J. W., Mai, T., and Speicher, R. Applications of realizations (aka linearizations) to free probability. Journal of Functional Analysis, 274(1):1 79, 2018. 
Hendrycks, D. and Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2018. Hendrycks, D. and Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2019. Hendrycks, D., Mu, N., Cubuk, E. D., Zoph, B., Gilmer, J., and Lakshminarayanan, B. Augmix: A simple data processing method to improve robustness and uncertainty. In International Conference on Learning Representations, 2020. Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., Dorundo, E., Desai, R., Zhu, T., Parajuli, S., Guo, M., et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021. Hu, H. and Lu, Y. M. Universality laws for high-dimensional learning with random features. IEEE Transactions on Information Theory, 2022. Jiang, Y., Nagarajan, V., Baek, C., and Kolter, J. Z. Assessing generalization of SGD via disagreement. In International Conference on Learning Representations, 2021. Kirsch, A. and Gal, Y. A note on assessing generalization of SGD via disagreement . Transactions on Machine Learning Research, 2022. Koh, P. W., Sagawa, S., Marklund, H., Xie, S. M., Zhang, M., Balsubramani, A., Hu, W., Yasunaga, M., Phillips, R. L., Gao, I., et al. WILDS: A benchmark of in-thewild distribution shifts. In International Conference on Machine Learning, 2021. Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009. Lei, Q., Hu, W., and Lee, J. Near-optimal linear regression under distribution shift. In International Conference on Machine Learning, 2021. Lin, L. and Dobriban, E. What causes the test error? going beyond bias-variance via ANOVA. Journal of Machine Learning Research, 22:155 1, 2021. Loureiro, B., Gerbelot, C., Cui, H., Goldt, S., Krzakala, F., Mezard, M., and Zdeborov a, L. Learning curves of generic features maps for realistic datasets with a teacherstudent model. In Advances in Neural Information Processing Systems, 2021. Mehta, M. L. Random matrices. Elsevier, 2004. Mei, S. and Montanari, A. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667 766, 2022. Mel, G. and Pennington, J. Anisotropic random feature regression in high dimensions. In International Conference on Learning Representations, 2021. Demystifying Disagreement-on-the-Line in High Dimensions Miller, J. P., Taori, R., Raghunathan, A., Sagawa, S., Koh, P. W., Shankar, V., Liang, P., Carmon, Y., and Schmidt, L. Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization. In International Conference on Machine Learning, 2021. Mingo, J. A. and Speicher, R. Free probability and random matrices, volume 35. Springer, 2017. Montanari, A. and Saeed, B. N. Universality of empirical risk minimization. In Conference on Learning Theory, 2022. Montanari, A., Ruan, F., Sohn, Y., and Yan, J. The generalization error of max-margin linear classifiers: Highdimensional asymptotics in the overparametrized regime. ar Xiv preprint ar Xiv:1911.01544, 2019. Nakkiran, P. and Bansal, Y. Distributional generalization: A new kind of generalization. ar Xiv preprint ar Xiv:2009.08092, 2020. Oakden-Rayner, L., Dunnmon, J., Carneiro, G., and R e, C. 
Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In Proceedings of the ACM conference on health, inference, and learning, pp. 151 159, 2020. Opper, M. Statistical mechanics of learning: Generalization. The Handbook of Brain Theory and Neural Networks,, pp. 922 925, 1995. Opper, M. and Kinzel, W. Statistical mechanics of generalization. In Models of Neural Networks III, pp. 151 209. Springer, 1996. Paul, D. and Aue, A. Random matrix theory in statistics: A review. Journal of Statistical Planning and Inference, 150:1 29, 2014. Pesce, L., Krzakala, F., Loureiro, B., and Stephan, L. Are gaussian data all you need? extents and limits of universality in high-dimensional generalized linear estimation. ar Xiv preprint ar Xiv:2302.08923, 2023. Pliushch, I., Mundt, M., Lupp, N., and Ramesh, V. When deep classifiers agree: Analyzing correlations between learning order and image statistics. In European Conference on Computer Vision, 2022. Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 2007. Raudys, ˇS. On determining training sample size of linear classifier. Computing Systems (in Russian), 28:79 87, 1967. Raudys, ˇS. On the amount of a priori information in designing the classification algorithm. Technical Cybernetics (in Russian), 4:168 174, 1972. Raudys, ˇS. and Young, D. M. Results in statistical discriminant analysis: A review of the former Soviet Union literature. Journal of Multivariate Analysis, 89(1):1 35, 2004. Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do Image Net classifiers generalize to Image Net? In International Conference on Machine Learning, 2019. Serdobolskii, V. I. Multiparametric Statistics. Elsevier, 2007. Shlyakhtenko, D. Random Gaussian band matrices and freeness with amalgamation. International Mathematics Research Notices, 1996(20):1013 1025, 1996. Shlyakhtenko, D. Gaussian random band matrices and operator-valued free probability theory. Banach Center Publications, 43(1):359 368, 1998. Speicher, R. Combinatorial theory of the free product with amalgamation and operator-valued free probability theory, volume 627. American Mathematical Society, 1998. Speicher, R. and Vargas, C. Free deterministic equivalents, rectangular random matrix models, and operator-valued free probability theory. Random Matrices: Theory and Applications, 1(2), 2012. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014. Tao, T. and Vu, V. Random matrices: universality of local eigenvalue statistics. Acta mathematica, 206(1):127 204, 2011. Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B., and Schmidt, L. Measuring robustness to natural distribution shifts in image classification. In Advances in Neural Information Processing Systems, 2020. Tripuraneni, N., Adlam, B., and Pennington, J. Overparameterization improves robustness to covariate shift in high dimensions. In Advances in Neural Information Processing Systems, 2021. Tulino, A. M. and Verd u, S. Random matrix theory and wireless communications. Communications and Information Theory, 1(1):1 182, 2004. Voiculescu, D. Addition of certain non-commuting random variables. Journal of Functional Analysis, 66(3):323 346, 1986. Demystifying Disagreement-on-the-Line in High Dimensions Voiculescu, D. Symmetries of some reduced free product c -algebras. 
In Operator Algebras and their Connections with Topology and Ergodic Theory: Proceedings of the OATE Conference held in Bus teni, Romania, Aug. 29 Sept. 9, 1983, pp. 556 588. Springer, 2006. Volˇciˇc, J. Matrix coefficient realization theory of noncommutative rational functions. Journal of Algebra, 499: 397 437, 2018. Wang, T., Zhong, X., and Fan, Z. Universality of approximate message passing algorithms and tensor networks. ar Xiv preprint ar Xiv:2206.13037, 2022. Wigner, E. P. Characteristic vectors of bordered matrices with infinite dimensions. Annals of Mathematics, pp. 548 564, 1955. Wu, D. and Xu, J. On the optimal weighted ℓ2 regularization in overparameterized linear regression. In Advances in Neural Information Processing Systems, 2020. Wu, J., Zhang, Q., and Xu, G. Tiny Image Net challenge. Technical report, 2017. Wu, J., Zou, D., Braverman, V., Gu, Q., and Kakade, S. M. The power and limitation of pretraining-finetuning for linear regression under covariate shift. In Advances in Neural Information Processing Systems, 2022. Yao, J., Bai, Z., and Zheng, S. Large Sample Covariance Matrices and High-Dimensional Data Analysis. Cambridge University Press, 2015. Demystifying Disagreement-on-the-Line in High Dimensions A. Technical Tools A.1. Operator-valued Free Probability Operator-valued free probability (e.g., Speicher (1998); Mingo & Speicher (2017); Helton et al. (2007)) has appeared in various studies of random features models including Adlam et al. (2022); Adlam & Pennington (2020a;b); Mel & Pennington (2021); Ba et al. (2022). Here, we briefly outline the most relevant concepts, which are used in our computation. Recall that a set A is an algebra (over the field C of complex numbers) if it is a vector space over C and is endowed with a bilinear multiplication operation denoted by . Thus, for all a, b, c A we have the distributivity relations a (b + c) = a b + a c and (b + c) a = b a + c a; and the relation indicating that multiplication in the algebra is compatible with the usual multiplication over C, namely that for and x, y C, (x y) (a b) = (x a) (y b). All algebras we consider will be associative, so that the multiplication operation over the algebra is associative. Further, an algebra is called unital if it contains a multiplicative identity element; this is denoted as 1 . Often, we drop the symbol to denote multiplication (both over the algebra and by scalars), and no confusion may arise. Definition A.1 (Non-commutative probability space). Let C be a unital algebra and φ : C C be a linear map such that φ(1) = 1. We call the pair (C, φ) a non-commutative probability space. Example A.2 (Deterministic matrices). For a matrix A Cm m, we denote its normalized trace by tr(A) = 1 m Pm i=1 Aii. The pair (Cm m, tr) is a non-commutative probability space. Example A.3 (Random matrices). Let (Ω, F, P) be a (classical) probability space and L (Ω) be the set of scalar random variables with all moments finite. The pair (L (Ω)m m, Etr) is a non-commutative probability space. Definition A.4 (Operator-valued probability space). Let A be a unital algebra and consider a unital sub-algebra B A. A linear map E : A B is a conditional expectation if E(b) = b for all b B and E(b1ab2) = b1E(a)b2 for all a A and b1, b2 B. The triple (A, E, B) is called an operator-valued probability space. The name conditional expectation can be understood from the following example. Example A.5 (Classical conditional expectation). Let (Ω, F, P) be a probability space and G be a sub-σ-algebra of F. 
Then, considering E = E[ |G], any unital algebra A L1(Ω, F, P) and its unital sub-algebra B L1(Ω, G, P), such that all required integrals in the definition of E(b1ab2) = b1E(a)b2 exist for all a A and b1, b2 B, form an operator-valued probability space (A, E, B). Example A.6 (block random matrices). Let (C, φ) = (L (Ω)m m, Etr) be the non-commutative probability space of random matrices defined in Example A.3. Define A = CM M C and B = CM M. In words, A is the space of M M block matrices with entries in C, and B is the space of M M scalar matrices. Note that B can be viewed as a unital sub-algebra of A by the canonical inclusion ι : A , B defined by ι(B) = B 1C, (10) where 1C is the unity of C (in this example 1C = Im). We also define the block-wise normalized expected trace E = id Etr : A B by E(A) = (Etr Aij)1 i,j M, A = (Aij)1 i,j M A. (11) Remark A.7. While we have only discussed squared blocks with identical sizes in Example A.6, it is possible to extend the definition to block matrices with rectangular blocks (Far et al., 2006; 2008; Benaych-Georges, 2009; Speicher & Vargas, 2012). The idea of Benaych-Georges (2009) is to embed each rectangular matrix into a block of a common larger square matrix. For example, if we have rectangular blocks whose dimensions are one of q1, . . . , q K N, we consider the space of (q1 + + q K) (q1 + + q K) square matrices with a block structure q1 q1 q1 q K ... ... ... q K q1 q K q K Then, we identify a rectangular matrix C Cqi qj with a square matrix e C C(q1+ +q K) (q1+ +q K), having the aforementioned block structure, whose (i, j)-block is C and all other blocks are zero. This identification preserves scalar Demystifying Disagreement-on-the-Line in High Dimensions multiplication, addition, multiplication, transpose, and trace, in the sense that, for rectangular matrices C, D and a scalar c C, c e C = f c C, e C + e D = C + D if C and D have same shape, ( e C) = g C , (g CD if C and D are conformable, 0 otherwise, tr( e C) = ( tr(C) if C is a square matrix, 0 otherwise. Through this identification, the space of rectangular matrices (with finitely many different dimension types) can be also understood as an algebra over C. Further, by replacing C in Example A.6 with the space of rectangular random matrices, we can define the space of block random matrices with rectangular blocks. The space of block random matrices with rectangular blocks, equipped with the block-wised expected trace, will be the operator-valued probability space we consider in our proof. Definition A.8 (Operator-valued Cauchy transform). Let (A, E, B) be an operator-valued probability space. For a A, define its operator-valued Cauchy transform Ga : B \ {a} B by Ga(b) = E[(b a) 1]. Definition A.9 (Operator-valued freeness). Let (A, E, B) be an operator-valued probability space and (Ai)i I be a family of sub-algebras of A which contain B. The sub-algebras Ai are freely independent over B, if E[a1 an] = 0 whenever E[a1] = = E[an] = 0 and ai Aj(i) for all i [n] with j(1) = = j(n). Variables a1, . . . , an A are freely independent over B if the sub-algebras generated by ai and B are freely independent over B. Another important transform, introduced in Voiculescu (1986; 2006), is the R-transform. It enables the characterization of the spectrum of a sum of asymptotically freely independent random matrices. It was generalized to operator-valued probability spaces in Shlyakhtenko (1996); Mingo & Speicher (2017). 
The definition of operator-valued R-transform can be found in Definition 10, Chapter 9 of Mingo & Speicher (2017). Our work does not directly require the definition of R-transforms, and instead uses the following property. Proposition A.10 (Subordination property, (9.21) of Mingo & Speicher (2017)). Let (A, E, B) be an operator-valued probability space. If x, y A are freely independent over B, then Gx+y(b) = Gx[b Ry(Gx+y(b))] (12) for all b B, where Ry is the operator-valued R-transform of y. A.2. Limiting R-transform of Gaussian Block Matrices Shlyakhtenko (1996; 1998) proposed using operator-valued free probability to study spectra of Gaussian block matrices. Their insight was that operator-valued free independence among Gaussian block matrices is guaranteed for general covariance structure, whereas scalar-valued freeness among them only holds in special cases. Later Far et al. (2006; 2008); Anderson & Zeitouni (2006) revisited this idea. We present a theorem of Far et al. (2008), which characterizes limiting R-transform of Gaussian block matrices with rectangular blocks. Theorem A.11 (Theorem 5 of Far et al. (2008)). For m = m1 + + m M, let A = (Aij)1 i,j M Rm m be an M M block random matrix whose block Aij is a mi mj random matrix with i.i.d. N(0, c2 ij/m) entries. Define the covariance function σ(i, j; k, l) to be cijckl if Aij/cij = A kl/ckl and 0 otherwise. We assume the proportional limit where m1, . . . , m M with mi/m αi (0, ), i = 1, . . . , M. Then, the limiting R-transform of A can be expressed as [RA(D)]ij = X 1 k,l M σ(i, k; l, j)αk Dkl, (13) for any D RM M. We remark the above statement should be understood in the space of block random matrices with rectangular blocks we discussed in Remark A.7. Also, the original statement used a different terminology covariance mapping , but it is identical to the R-transform of A (see discussion in Mingo & Speicher (2017) p.242 and Far et al. (2006) p.24) Demystifying Disagreement-on-the-Line in High Dimensions A.3. Centering Random Features We first argue that the random features F, f can be centered without changing the asymptotics of disagreement. This centering argument became a standard technique after it was introduced in Mei & Montanari (2022) (Section 10.4). More generally, centering arguments are standard in random matrix theory (see e.g., Bai & Silverstein (2010)). For a standard Gaussian random variable Z N(0, 1), define centered random features by F = F Eσ( ms Z), f = f Eσ( mj Z), where j {s, t} is the domain that input x comes from. Subtracting a scalar from a matrix/vector should be understood entry-wise. The following lemma states that model prediction obtained from these centered random features is close to the original prediction ˆy(x) with high probability. Lemma A.12. Define centered model prediction by ˆy(x) = Y 1 N F F + γIn There exist constants c1, c2, c3, c4 > 0 such that | ˆy(x) ˆy(x)| c1d c2 with probability at least 1 c3d c4. This lemma is a consequence of Lemma I.7 and Lemma I.8 of Tripuraneni et al. (2021). Since we consider the limit n, d, N , disagreement Disi(ϕ, ψ, γ), i {I, SS, SW} are invariant to the centering. We also remark that the nonlinearity constants defined in (3) are also unchanged after this centering. For these reasons, perhaps with a slight abuse of notation, we assume F and f are centered from now on. A.4. 
Gaussian Equivalence For domain j {s, t} that input x is drawn from, we consider the following noisy linear random features d WX + ρsωsΘ, f = rρj d Wx + ρjωjθ, (14) where Θ RN n and θ RN have i.i.d. standard Gaussian entries independent from all other Gaussian matrices. The coefficients above are chosen so that the first and second moment of F and f match those of F and f, respectively. We call F, f the Gaussian equivalent of F, f as we claim the following. Claim A.13 (Gaussian equivalence). The asymptotic limit (Condition 2.2) of the disagreement (Definition 2.1) of the random features model (2) is invariant to the substitution F, f F, f. This idea was introduced in the context of random kernel matrices (El Karoui, 2010; Cheng & Singer, 2013; Fan & Montanari, 2019) and has been repeatedly used in recent studies of random feature models. Mei & Montanari (2022) proved the Gaussian equivalence for random weights uniformly distributed on a sphere. Montanari et al. (2019) conjectured that the same holds for classification. Adlam & Pennington (2020a;b); Tripuraneni et al. (2021) derived several asymptotic properties of random features models building on the Gaussian equivalence conjecture. Goldt et al. (2022) provided theoretical and numerical evidence suggesting that the Gaussian equivalence holds for a wide class of models including random features models. Mel & Pennington (2021); d Ascoli et al. (2021); Loureiro et al. (2021) conjectured the Gaussian equivalence for anisotropic inputs. Hassani & Javanmard (2022) showed the Gaussian equivalence holds for the adversarial risk of adversarially trained random features models. Hu & Lu (2022) showed the conjecture for isotropic Gaussian inputs, under mild technical conditions. Montanari & Saeed (2022) generalized this by removing the isotropic condition and relaxing the Gaussian input assumption. More generally, the phenomenon that eigenvalue statistics in the bulk spectrum of a random matrix do not depend on the specific law of the matrix entries is referred to as bulk universality (Wigner, 1955; Gaudin, 1961; Mehta, 2004; Dyson, 1962) and has been a central subject in the random matrix theory literature (Erd os et al., 2010; 2012; El Karoui, 2010; Tao & Vu, 2011). Demystifying Disagreement-on-the-Line in High Dimensions It is known that local spectral laws of correlated random hermitian matrices can be fully determined by their first and second moments, through the matrix Dyson equation (Erd os, 2019). Also, Banna et al. (2015; 2020) showed that spectral distributions of correlated symmetric random matrices and sample covariance matrices can be characterized by Gaussian matrices with identical correlation structures. However, these results do not directly imply Claim A.13 since we do not study the spectral properties of F, f on their own. A.5. Linear Pencils After applying the Gaussian equivalence (14), each of the quantities that we study becomes an expected trace of a rational function of random matrices. To analyze this, we use the linear pencil method (Haagerup & Thorbjørnsen, 2005; Haagerup et al., 2006; Anderson, 2013; Helton et al., 2018), in which we build a large block matrix whose blocks are linear functions of variables and one of the blocks of its inverse is the desired rational function. Then, operator-valued free probability can be used to extract block-wise spectral properties of the inverse. For example, if we want to compute E tr[( X X d + γIn) 1] for X Rd n, we consider " In X inverse has as its (1, 1)-block γ( X X d + γIn) 1. 
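As a quick numerical sanity check of this first pencil: the display above lost its block structure in formatting, so the sign placement below is our own convention (any choice whose off-diagonal product equals -X^T X/(γd) yields the same (1,1)-block).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, gamma = 50, 80, 0.3
X = rng.normal(size=(d, n))

# One possible form of the 2x2-block linear pencil described above
Q = np.block([[np.eye(n),                  X.T / np.sqrt(gamma * d)],
              [-X / np.sqrt(gamma * d),    np.eye(d)               ]])

Qinv_11 = np.linalg.inv(Q)[:n, :n]
target = gamma * np.linalg.inv(X.T @ X / d + gamma * np.eye(n))
assert np.allclose(Qinv_11, target)                 # (1,1)-block of Q^{-1} is gamma*(X^T X/d + gamma I)^{-1}
print(np.trace(target) / (gamma * n))               # single-draw normalized trace of (X^T X/d + gamma I)^{-1}
```

The larger pencils used below are handled in exactly the same way, only with more blocks.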
Block matrices for more complicated rational functions can be constructed using the following proposition. Proposition A.14 (Algorithm 4.3 of Helton et al. (2018)). Let x1, . . . , xg be elements of an algebra A over a field K. For an m m matrix Q and vectors u, v Km, a triple (u, Q, v) is called a linear pencil of a rational function r K(x1, . . . , xg) if each entry of Q is a K-affine function of x1, . . . , xg and r = u Q 1v. The following holds. 1. (Addition) If (u1, Q1, v1) and (u2, Q2, v2) are linear pencils of r1 and r2, respectively, then u1 u2 , Q1 0m m 0m m Q2 is a linear pencil of r1 + r2. 2. (Multiplication) If (u1, Q1, v1) and (u2, Q2, v2) are linear pencils of r1 and r2, respectively, then 0m u1 , xgv1u 2 Q1 Q2 0m m is a linear pencil of r1xgr2. 3. (Inverse) If (u, Q, v) is a linear pencil of r, then 1 0m is a linear pencil of r 1. In this language, the example before the algorithm can be interpreted in the space we consider in Remark A.7 as r = γ( X X d + γIn) 1 being a rational function of X and X , and being a linear pencil of r. In principle, repeated application of the above rules to basic building blocks such as (15) can produce a linear pencil for any rational function of given random matrices. For example, consider X1, X2 Rd n, Σ Rd d and their transpose as Demystifying Disagreement-on-the-Line in High Dimensions elements of the algebra over R we discussed in Remark A.7. Then, In X 1 γd Σ γ2 X1 γd Id In X 2 γd X2 γd Id is a linear pencil of r = ( X 1 X1 d + γIn) 1Σ( X 2 X2 d + γIn) 1. Here, we denote zero blocks by dots. This can be seen by applying the multiplication rule to two copies of (15) and xg = Σ, and then switching the first and the second pairs of columns. However, constructing a suitably small linear pencil is a non-trivial problem of independent interest (see discussions on reductions of linear pencils in e.g., Volˇciˇc (2018); Helton et al. (2018) and references therein). This is one of the challenges we need to overcome in our proofs. B.1. Proof of Theorem 3.1 Starting from this section, we omit the high-dimensional limit signs limn,d,N (Condition 2.2) for a simpler presentation. However, every expectation appearing in the derivation should be understood as its high-dimensional limit. For j {s, t}, independent disagreement satisfies Disj I (ϕ, ψ, γ) = E[(ˆy W1,X1,Y1(x) ˆy W2,X2,Y2(x))2] = E[(ˆy W1,X1,Y1(x) EW1,X1,Y1[ˆy W1,X1,Y1(x)] + EW2,X2,Y2[ˆy W2,X2,Y2(x)] ˆy W2,X2,Y2(x))2] = Eβ,x Dj[(ˆy W1,X1,Y1(x) EW1,X1,Y1[ˆy W1,X1,Y1(x)])2] + Eβ,x Dj[(ˆy W2,X2,Y2(x) EW2,X2,Y2[ˆy W2,X2,Y2(x)])2] = Eβ,x Dj VW1,X1,Y1(ˆy W1,X1,Y1(x)) + Eβ,x Dj VW2,X2,Y2(ˆy W2,X2,Y2(x)) = 2Vj. Plugging in the variance Vj given in Theorem C.1, we obtain the formula for Disj I (ϕ, ψ, γ). B.1.1. DECOMPOSITION OF Disj SS(ϕ, ψ, γ) Writing Fi = σ(Wi X/ d), fi = σ(Wix/ d), Ki = 1 N F i Fi + γIn for i {1, 2}, we can write shared-sample disagreement as Disj SS(ϕ, ψ, γ) = 1 N 2 E[(Y K 1 1 F 1 f1 Y K 1 2 F 2 f2)2] = 2 N 2 E[f 1 F1K 1 1 Y Y K 1 1 F 1 f1] 2 N 2 E[f 2 F2K 1 2 Y Y K 1 1 F 1 f1] = D1 D2. (16) The term D1 was computed in (A268), (A279), (A462), (A546) of Tripuraneni et al. (2021) as D1 = 2Vj + 2ρjκ2 ρsϕ Ij 3,2. (17) Plugging in Y = X β/ d + ε, where ε = (ε1, . . . , εn) Rn, the term D2 becomes D2 = 2 d N 2 EWi,X tr[K 1 2 X Eβ[ββ ]XK 1 1 F 1 Ex Dj,θ[f1f 2 ]F2] d N 2 EWi,X[K 1 2 X Eβ,ε[βε ]K 1 1 F 1 Ex Dj,θ[f1f 2 ]F2] N 2 EWi,X tr[K 1 2 Eε[εε ]K 1 1 F 1 Ex Dj,θ[f1f 2 ]F2] = 2 d N 2 EWi,X tr[K 1 2 X XK 1 1 F 1 Ex Dj,θ[f1f 2 ]F2] + 2σ2 ε N 2 EWi,X tr[K 1 2 K 1 1 F 1 Ex Dj,θ[f1f 2 ]F2]. 
Demystifying Disagreement-on-the-Line in High Dimensions From the Gaussian equivalence (14), we have Ex Dj,θ[f1f 2 ] = ρj d W1Σj W 2 . D2 = 2ρj d2N 2 EWi,X tr[W1Σj W 2 F2K 1 2 X XK 1 1 F 1 ] + 2σ2 ερj d N 2 EWi,X tr[K 1 1 F 1 W1Σj W 2 F2K 1 2 ] = D21 + D22. (18) We can write X = Σ 1 2s Z for Z Rd n with i.i.d. standard Gaussian entries. Thus, D21 = 2ρj d2N 2 EWi,Z tr[W1Σj W 2 F2K 1 2 Z Σs ZK 1 1 F 1 ], D22 = 2σ2 ερj d N 2 EWi,Z tr[K 1 1 F 1 W1Σj W 2 F2K 1 2 ]. Now, we use the linear pencil method (Helton et al., 2018) to build a block matrix such that (1) each block is either deterministic or a constant multiple of Z, Wi, Θi and (2) D21 or D22 appears as a trace of a block of its inverse. Then, we compute the operator-valued Cauchy transform of the block matrix and extract D21 and D22 from the result. B.1.2. PRELIMINARY COMPUTATIONS We present some preliminary computations that will be used in later sections. We will also use the linear pencil Q0 as a building block when constructing other linear pencils. Most of the computations here are adopted from Section A.9.6.1 of Tripuraneni et al. (2021). For clarity and to be self-contained, we provide our own version of the same result updated in some minor ways. Using W, Z and other notations from Section 2 and Θ from (14), let Recall from Example A.2 that we denote the normalized trace of a matrix A by tr(A). Define the block-wise normalized expected trace of (Q0) 1 by G0 = (id Etr)((Q0) 1). From block matrix inversion, we see G0 1,1 = γ Etr(K 1), G0 3,6 = γ ρs Etr[Σs W ˆK 1W] N , G0 5,4 = ρs Etr[Σs ZK 1Z ] in which ˆK = 1 N FF + γIN. We augment the matrix Q0 to form the symmetric matrix Q0 as This matrix can be written as Q0 = Z0 Q0 W,Z,Θ Q0 Σ = In+4d+N In+4d+N Demystifying Disagreement-on-the-Line in High Dimensions Defining G0 as below, we have = (id Etr)((Q0) 1) (id Etr)(((Q0) ) 1) = (id Etr) (Q0) 1 = (id Etr)(( Q0) 1). Thus, G0 can be viewed as the operator-valued Cauchy transform of Q0 W,Z,Θ + Q0 Σ (in the space we consider in Remark A.7), G0 = (id Etr)( Z0 Q0 W,Z,Θ Q0 Σ) 1 = G Q0 W,Z,Θ+ Q0 Σ( Z0). Here, we implicitly used the canonical inclusion defined in (10) to write Since Q0 Σ is deterministic, the matrices Q0 W,Z,Θ and Q0 Σ are asymptotically freely independent according to Definition A.9. Hence by the subordination formula (12), G0 = G Q0 Σ( Z0 R Q0 W,Z,Θ( G0)) = (id Etr)( Z0 R Q0 W,Z,Θ( G0) Q0 Σ) 1. (20) Since Q0 W,Z,Θ consists of i.i.d. Gaussian blocks, we use (13) to find the R-transform R Q0 W,Z,Θ( G0) of the form R Q0 W,Z,Θ( G0) = (R0) For example, to find R0 1,1, we look for a block in the first row of Q0 W,Z,Θ and a block in the first column of Q0 W,Z,Θ such that they are transpose to each other up to a constant factor. There are two such pairs, ((1, 2)-block, (2, 1)-block) and ((1, 3)-block, (6, 1)-block). Therefore, the equation (13) gives R0 1,1 = ρsωs γ G0 2,2 ρs Repeating the same procedure, the non-zero blocks of R0 are R0 1,1 = ρsωs γ G0 2,2 ρs γ G0 3,6, R0 2,2 = ρsωsψ γϕ G0 1,1 + ρsψG0 5,4, R0 4,5 = ρs G0 2,2, R0 6,3 = ρs G0 1,1 γϕ . Plugging this into equation (20), we obtain self-consistent equations for G1. 
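Such systems of self-consistent equations are solved by damped fixed-point iteration. As a minimal, self-contained analogue (a scalar toy problem, not the actual system for G0), the sketch below iterates the Marchenko-Pastur self-consistent equation and compares the fixed point with an empirical resolvent trace; the dimensions and the evaluation point are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(2)

def mp_stieltjes(z, y, iters=500, damping=0.5):
    """Damped fixed-point iteration for the scalar self-consistent equation
    m = 1 / (1 - y - z - y*z*m)  (Marchenko-Pastur law with aspect ratio y = d/n)."""
    m = 0.0
    for _ in range(iters):
        m = (1 - damping) * m + damping / (1.0 - y - z - y * z * m)
    return m

d, n = 1000, 2000
y, z = d / n, -0.5                 # evaluate the resolvent at a point outside the spectrum
X = rng.normal(size=(d, n))
S = X @ X.T / n                    # sample covariance; its spectrum follows Marchenko-Pastur
empirical = np.trace(np.linalg.inv(S - z * np.eye(d))) / d
print(empirical, mp_stieltjes(z, y))    # the two agree up to O(1/d) fluctuations
```

The operator-valued computations below have the same structure, with the scalar unknown replaced by the matrix of block-wise expected traces.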
For example, G0 3,6 = Etr[(In+4d+N R0 Q0 Σ)]3,6 = Etr γ ρsϕG0 2,2Σs(γϕId + ρs G0 1,1G0 2,2Σs) 1 " λsγ ρsϕG0 2,2 γϕ + λsρs G0 1,1G0 2,2 Demystifying Disagreement-on-the-Line in High Dimensions G0 1,1 = γ γ + ρsωs G0 2,2 + ρs G0 3,6 , G0 2,2 = γϕ γϕ + ρsωsψG0 1,1 γ ρsψϕG0 5,4 G0 3,6 = Eµ " λsγ ρsϕG0 2,2 γϕ + λsρs G0 1,1G0 2,2 , G0 5,4 = Eµ " λs ρs G0 1,1 γϕ + λsρs G0 1,1G0 2,2 Now, by eliminating G0 3,6, G0 5,4 and expressing in terms of κ, τ, and τ defined in (4) and (6), we can show that Etr(K 1) = G0 1,1 γ = τ, and Etr( ˆK 1) = G0 2,2 γ = τ. Thus, using equation (5) we have G0 1,1 = γτ, G0 2,2 = γ τ, G0 3,6 = γ ρs τIs 1,1, G0 5,4 = ρsτIs 1,1 ϕ . (21) B.1.3. COMPUTATION OF D21 Define Q1 by N Id Σ 1 2 s ρs 1 2s Σj ρs Z Define the block-wise normalized expected trace of (Q1) 1 by G1 = (id Etr)((Q1) 1). Then, by block matrix inversion we have G1 2,14 = ψ d2N 2 E tr[W1Σj W 2 F2K 1 2 Z Σs ZK 1 1 F 1 ] = ψ We augment Q1 to the symmetric matrix Q1 as Q1 = Z1 Q1 W,Z,Θ Q1 Σ = I2n+9d+3N I2n+9d+3N Demystifying Disagreement-on-the-Line in High Dimensions Then defining G1 below, = (id Etr)((Q1) 1) (id Etr)(((Q1) ) 1) = (id Etr) (Q1) 1 = (id Etr)(( Q1) 1) can be viewed as the operator-valued Cauchy transform of Q1 W,Z,Θ + Q1 Σ (in the space we consider in Remark A.7), i.e., G1 = (id Etr)( Z1 Q1 W,Z,Θ Q1 Σ) 1 = G Q1 W,Z,Θ+ Q1 Σ( Z1). Further by the subordination formula (12), G1 = G Q1 Σ( Z1 R Q1 W,Z,Θ( G1)) = (id Etr)( Z1 R Q1 W,Z,Θ( G1) Q1 Σ) 1. (22) Since Q1 W,Z,Θ consists of i.i.d. Gaussian blocks, by (13), its limiting R-transform has a form R Q1 W,Z,Θ( G1) = (R1) Demystifying Disagreement-on-the-Line in High Dimensions where the non-zero blocks of R1 are R1 1,1 = ρsωs γ G1 2,2 ρs γ G1 3,6, R1 1,7 = ρs γ G1 3,12, R1 2,2 = ψρsωs γϕ G1 1,1 + ρsψG1 5,4, R1 2,14 = ρsψG1 5,13, R1 4,5 = ρs G1 2,2, R1 6,3 = ρs γϕ G1 1,1, R1 6,9 = ρs R1 7,1 = ρs γ G1 9,6 = 0, R1 7,7 = ρsωs γ G1 8,8 ρs γ G1 9,12, R1 8,8 = ψρsωs γϕ G1 7,7 + ρsψG1 11,10, R1 10,11 = ρs G1 8,8, R1 12,3 = ρs γϕ G1 7,1 = 0, R1 12,9 = ρs γϕ G1 7,7, R1 13,5 = ρs G1 14,2 = 0. We used the fact that G1 9,6 = G1 7,1 = G1 14,2 = 0, which we obtain from block matrix inversion of Q1. Computing the block-matrix inverse of Q1 and from equations (19), (21), we see G1 1,1 = G1 7,7 = γEtr(K 1) = G0 1,1 = γτ, G1 2,2 = G1 8,8 = γEtr( ˆK 1) = G0 2,2 = γ τ, G1 3,6 = G1 9,12 = γ ρs Etr[Σs W ˆK 1W] N = G0 3,6 = γ ρs τIs 1,1, G1 5,4 = G1 11,10 = ρs Etr[Σs ZK 1Z ] d = G0 5,4 = ρsτIs 1,1 ϕ . Plugging these into (22), we obtain self-consistent equations. For example, G1 2,14 = Etr[(I2n+9d+3N R1 Q1 Σ) 1]2,14 = γ ρsψϕG1 5,13 γϕ( 1 + ρsψG1 5,4) ψρsωs G1 1,1 = γ ρs τψG1 5,13. G1 5,13 = Eµ " λsλjγ ρsϕG1 1,7G1 2,2 + (λs)2λj ρs(G1 1,1)2G1 2,2 (γϕ + λsρs G1 1,1G1 2,2)2 = ρs τIj 2,2G1 1,7 + γ ρsτ 2 τ G1 1,7 = γ ρs G1 3,12 (γ + ρs G1 3,6 + ρsωs G1 2,2)2 = γ ρsτ 2G1 3,12, G1 3,12 = Eµ " λsγ2ϕ2 + (λs)2γρ2 sϕG1 1,7(G1 2,2)2 ρs(γϕ + λsρs G1 1,1G1 2,2)2 = ϕ ρs Is 1,2 γρ 3 2s τ 2Is 2,2G1 1,7. Eliminating G1 3,12 and using κ = γρsτ τ, G1 1,7 = γτ 2ϕIs 1,2 + κ2Is 2,2G1 1,7 G1 1,7 = γτ 2ϕIs 1,2 1 κ2Is 2,2 . G1 2,14 = γ ρs τψG1 5,13 = γρs τ 2ψIj 2,2G1 1,7 + γ2ρsτ 2 τ 2ψ = γ2ρsτ 2 τ 2ψϕIs 1,2Ij 2,2 1 κ2Is 2,2 + γ2ρsτ 2 τ 2ψ ϕ Ij 3,2 = κ2 ψϕIs 1,2Ij 2,2 ρs(1 κ2Is 2,2) + ψIj 3,2 ρsϕ ψ G1 2,14 = 2ρjκ2 ϕIs 1,2Ij 2,2 1 κ2Is 2,2 + Ij 3,2 ϕ Demystifying Disagreement-on-the-Line in High Dimensions B.1.4. COMPUTATION OF D22 and G2 = (id Etr)((Q2) 1). Then, G2 7,1 = γ ρsϕ d N 2 E tr[K 1 1 F 1 W1Σj W 2 F2K 1 2 ] = γ ρsϕ 2σ2ερj D22. 
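An aside before completing the computation of D22: identifying the random normalized traces entering the G-matrices (for instance tr(K^{-1})) with deterministic scalars such as γτ, as in (19) and (21), implicitly uses the fact that these traces self-average in the proportional limit. The following toy check is our own illustration (the dimensions, ReLU activation, and isotropic inputs are arbitrary choices) and is not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, N, gamma = 400, 200, 600, 0.1      # phi = d/n = 0.5, psi = d/N = 1/3 (toy sizes)
relu = lambda u: np.maximum(u, 0.0)

def normalized_trace_Kinv():
    """(1/n) tr(K^{-1}) for one fresh draw of (W, X), with
    F = sigma(W X / sqrt(d)) and K = F^T F / N + gamma * I_n."""
    W = rng.normal(size=(N, d))
    X = rng.normal(size=(d, n))          # isotropic inputs, for illustration only
    F = relu(W @ X / np.sqrt(d))
    K = F.T @ F / N + gamma * np.eye(n)
    return np.trace(np.linalg.inv(K)) / n

vals = np.array([normalized_trace_Kinv() for _ in range(20)])
print(vals.mean(), vals.std())   # the standard deviation is tiny relative to the mean:
                                 # tr(K^{-1}) concentrates around its deterministic limit
```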
We augment Q2 to the symmetric matrix Q2 as Q2 = Z2 Q2 W,Z,Θ Q2 Σ = I2n+8d+2N I2n+8d+2N Demystifying Disagreement-on-the-Line in High Dimensions Defining G2 below, = (id Etr)((Q2) 1) (id Etr)(((Q2) ) 1) = (id Etr) (Q2) 1 = (id Etr)(( Q2) 1). It can be viewed as the operator-valued Cauchy transform of Q2 W,Z,Θ + Q2 Σ (in the space we consider in Remark A.7), i.e., G2 = (id Etr)( Z2 Q2 W,Z,Θ Q2 Σ) 1 = G Q2 W,Z,Θ+ Q2 Σ( Z2). Further by the subordination formula (12), G2 = G Q2 Σ( Z2 R Q2 W,Z,Θ( G2)) = (id Etr)( Z2 R Q2 W,Z,Θ( G2) Q2 Σ) 1. (24) Since Q2 W,Z,Θ consists of i.i.d. Gaussian blocks, by (13), its limiting R-transform has a form R Q2 W,Z,Θ( G2) = (R2) where the non-zero blocks of R2 are R2 1,1 = ρsωs γ G2 2,2 ρs γ G2 3,6, R2 1,7 = ρs γ G2 3,12 = 0, R2 2,2 = ψρsωs γϕ G2 1,1 + ρsψG2 5,4, R2 4,5 = ρs G2 2,2, R2 6,3 = ρs γϕ G2 1,1, R2 6,9 = ρs γϕ G2 1,7 = 0, R2 7,1 = ρs R2 7,7 = ρsωs γ G2 8,8 ρs γ G2 9,12, R2 8,8 = ψρsωs γϕ G2 7,7 + ρsψG2 11,10, R2 10,11 = ρs G2 8,8, R2 12,3 = ρs γϕ G2 7,1, R2 12,9 = ρs We used the fact that G2 3,12 = G2 1,7 = 0, which we obtain from block matrix inversion of Q2. From block matrix inversion of Q2 and equations (19), (21), we have G2 1,1 = G2 7,7 = γEtr(K 1) = G0 1,1 = γτ, G2 2,2 = G2 8,8 = γEtr( ˆK 1) = G0 2,2 = γ τ, G2 3,6 = G2 9,12 = γ ρs Etr[Σs W ˆK 1W] N = G0 3,6 = γ ρs τIs 1,1, G2 5,4 = G2 11,10 = ρs Etr[Σs ZK 1Z ] d = G0 5,4 = ρsτIs 1,1 ϕ . Demystifying Disagreement-on-the-Line in High Dimensions Plugging these into (24), we have the following self-consistent equations G2 7,1 = γ ρs G2 9,6 (γ + ρs G2 3,6 + ρsωs G2 2,2)2 = γ ρsτ 2G2 9,6, G2 9,6 = Eµ 3 2s ϕ(G2 2,2)2G2 7,1 + λsλjγ2ρsϕ2(G2 2,2)2 (γϕ + λsρs G2 1,1G2 2,2)2 3 2s τ 2Is 2,2G2 7,1 γ2ρs τ 2ϕIj 2,2. Solving for G2 7,1, G2 7,1 = κ2γϕIj 2,2 ρs(1 κ2Is 2,2). D22 = 2σ2 ερj γ ρsϕG2 7,1 = 2ρjκ2σ2 εIj 2,2 ρs(1 κ2Is 2,2). (25) B.1.5. COMPUTATION OF Disj SS(ϕ, ψ, γ) Combining equations (16), (17), (18), (23), (25), we get Disj SS(ϕ, ψ, γ) = Disj I (ϕ, ψ, γ) 2ρjκ2(σ2 ε + ϕIs 1,2)Ij 2,2 ρs(1 κ2Is 2,2) . B.1.6. DECOMPOSITION OF Disj SW(ϕ, ψ, γ) Writing Fi = σ(WXi/ d), f = σ(Wx/ N F i Fi + γIn for i {1, 2}, we can write SW disagreement as Disj SW(ϕ, ψ, γ) = 1 N 2 E[(Y 1 K 1 1 F 1 f Y 2 K 1 2 F 2 f)2] = 2 N 2 E[f F1K 1 1 Y1Y 1 K 1 1 F 1 f] 2 N 2 E[f F2K 1 2 Y2Y 1 K 1 1 F 1 f] = D1 D3. (26) The term D1 is given in (17). Plugging in Yi = X i β/ d + εi, where εi = (εi1, . . . , εin) Rn, the term D3 becomes D3 = 2 d N 2 EW,Xi tr[F2K 1 2 X 2 Eβ[ββ ]X1K 1 1 F 1 Ex Dj,θ[ff ]] d N 2 EW,Xi[F2K 1 2 X 2 Eβ,ε1[βε 1 ]K 1 1 F 1 Ex Dj,θ[ff ]] N 2 EW,Xi tr[F2K 1 2 Eεi[ε2ε 1 ]K 1 1 F 1 Ex Dj,θ[ff ]] = 2 d N 2 EW,Xi tr[F2K 1 2 X 2 X1K 1 1 F 1 Ex Dj,θ[ff ]]. From the Gaussian equivalence (14), we have Ex Dj,θ[ff ] = ρj d WΣj W + ρjωj IN. D3 = 2ρj d2N 2 EW,Xi tr[WΣj W F2K 1 2 X 2 X1K 1 1 F 1 ] + 2ρjωj d N 2 EW,Xi tr F2K 1 2 X 2 X1K 1 1 F 1 = D31 + D32. (27) Demystifying Disagreement-on-the-Line in High Dimensions We can write Xi = Σ 1 2s Zi for Zi Rd n with i.i.d. standard Gaussian entries. Thus, D31 = 2ρj d2N 2 EW,Zi tr[WΣj W F2K 1 2 Z 2 Σs Z1K 1 1 F 1 ], D32 = 2ρjωj d N 2 EW,Zi tr F2K 1 2 Z 2 Σs Z1K 1 1 F 1 . Now, we use the linear pencil method to compute D31 and D32. B.1.7. COMPUTATION OF D31 N Id Σ 1 2 s ρs 1 2s Σj ρs Z1 and G3 = (id Etr)((Q3) 1). 
Then, G3 2,14 = ψ d2N 2 EW,Zi tr[WΣj W F2K 1 2 Z 2 Σs Z1K 1 1 F 1 ] = ψ We augment Q3 to the symmetric matrix Q3 as Q3 = 0 (Q3) Q3 = Z3 Q3 W,Z,Θ Q3 Σ = 0 I2n+9d+3N I2n+9d+3N 0 0 (Q3 W,Z,Θ) Demystifying Disagreement-on-the-Line in High Dimensions Defining G3 below, = 0 (id Etr)((Q3) 1) (id Etr)(((Q3) ) 1) 0 = (id Etr) 0 (Q3) 1 ((Q3) ) 1 0 = (id Etr)(( Q3) 1). It can be viewed as the operator-valued Cauchy transform of Q3 W,Z,Θ + Q3 Σ (in the space we consider in Remark A.7), i.e., G3 = (id Etr)( Z3 Q3 W,Z,Θ Q3 Σ) 1 = G Q3 W,Z,Θ+ Q3 Σ( Z3). Further by the subordination formula (12), G3 = G Q3 Σ( Z R Q3 W,Z,Θ( G3)) = (id Etr)( Z3 R Q3 W,Z,Θ( G3) Q3 Σ) 1. (28) Since Q3 W,Z,Θ consists of i.i.d. Gaussian blocks, by (13), its limiting R-transform has a form R Q3 W,Z,Θ( G3) = 0 (R3) Demystifying Disagreement-on-the-Line in High Dimensions where the non-zero blocks of R3 are R3 1,1 = ρsωs γ G3 2,2 ρs γ G3 3,6, R3 2,2 = ψρsωs γϕ G3 1,1 + ρsψG3 5,4, R3 2,8 = ρsψG3 5,10, R3 2,14 = ρsψG3 5,13, R3 4,5 = ρs G3 2,2, R3 4,11 = ρs G3 2,8, R3 6,3 = ρs γϕ G3 1,1, R3 7,7 = ρsωs γ G3 8,8 ρs R3 8,2 = ρsψG3 11,4 = 0, R3 8,8 = ψρsωs γϕ G3 7,7 + ρsψG3 11,10, R3 8,14 = ρsψG3 11,13, R3 10,5 = ρs G3 8,2 = 0, R3 10,11 = ρs G3 8,8, R3 12,9 = ρs γϕ G3 7,7, R3 13,5 = ρs G3 14,2 = 0, R3 13,11 = ρs G3 14,8 = 0. We used the fact that G3 11,4 = G3 8,2 = G3 14,2 = G3 14,8 = 0, which we obtain from block matrix inversion of Q3. Further from block matrix inversion of Q3 and equations (19), (21), we have G3 1,1 = G3 7,7 = γEtr(K 1) = G0 1,1 = γτ, G3 2,2 = G3 8,8 = γEtr( ˆK 1) = G0 2,2 = γ τ, G3 3,6 = G3 9,12 = γ ρs Etr[Σs W ˆK 1W] N = G0 3,6 = γ ρs τIs 1,1, G3 5,4 = G3 11,10 = ρs Etr[Σs ZK 1Z ] d = G0 5,4 = ρsτIs 1,1 ϕ . Plugging these into (28), we have the following self-consistent equations G3 2,14 = γ2ρs τ 2ψ2G3 5,10G3 11,13 + γ ρs τψG3 5,13, G3 5,10 = ρsτ 2 ϕ Is 2,2 + ρ ϕ Is 2,2G3 2,8, G3 2,8 = γ2 ρs τ 2ψG3 5,10, G3 5,13 = ρsτIj 2,2G3 2,8 + γ ρsτ 2 τ ϕ Ij 3,2, G3 11,13 = Ij 1,1 ρs . Solving for G3 5,10 gives G3 5,10 = ρsτ 2Is 2,2 ϕ ψκ2Is 2,2 . Plugging in G3 5,10, G3 11,13, G3 5,13 to find G3 2,14, we get ψ G3 2,14 = 2ρjψϕκ2Is 2,2Ij 1,2 ρs(ϕ ψκ2Is 2,2) + 2ρjκ2 ρsϕ Ij 3,2. (29) B.1.8. COMPUTATION OF D32 Demystifying Disagreement-on-the-Line in High Dimensions and G4 = (id Etr)((Q4) 1). Then, G4 2,8 = ρs d N 2 EW,Zi tr F2K 1 2 Z 2 Σs Z1K 1 1 F 1 = ρs 2ρjωj D32. We augment Q4 to the symmetric matrix Q4 as Q4 = 0 (Q4) Q4 = Z4 Q4 W,Z,Θ Q4 Σ = 0 I2n+8d+2N I2n+8d+2N 0 0 (Q4 W,Z,Θ) Defining G4 below, = 0 (id Etr)((Q4) 1) (id Etr)(((Q4) ) 1) 0 = (id Etr) 0 (Q4) 1 ((Q4) ) 1 0 = (id Etr)(( Q4) 1). Demystifying Disagreement-on-the-Line in High Dimensions It can be viewed as the operator-valued Cauchy transform of Q4 W,Z,Θ + Q4 Σ (in the space we consider in Remark A.7), i.e., G4 = (id Etr)( Z4 Q4 W,Z,Θ Q4 Σ) 1 = G Q4 W,Z,Θ+ Q4 Σ( Z4). Further by the subordination formula (12), G4 = G Q4 Σ( Z R Q4 W,Z,Θ( G4)) = (id Etr)( Z4 R Q4 W,Z,Θ( G4) Q4 Σ) 1. (30) Since Q4 W,Z,Θ consists of i.i.d. Gaussian blocks, by (13), its limiting R-transform has a form R Q4 W,Z,Θ( G4) = 0 (R4) where the non-zero blocks of R4 are R4 1,1 = ρsωs γ G4 2,2 ρs γ G4 3,6, R4 2,2 = ρsωsψ γϕ G4 1,1 + ρsψG4 5,4, R4 2,8 = ρsψG4 5,10, R4 4,5 = ρs G4 2,2 R4 4,11 = ρs G4 2,8, R4 6,3 = ρs γϕ G4 1,1, R4 7,7 = ρsωs γ G4 8,8 ρs γ G4 9,12, R4 8,2 = ρsψG4 11,4 = 0, R4 8,8 = ρsωsψ γϕ G4 7,7 + ρsψG4 11,10, R4 10,5 = ρs G4 8,2 = 0, R4 10,11 = ρs G4 8,8, R4 12,9 = ρs We used the fact that G4 11,4 = G4 8,2 = 0, which we obtain from block matrix inversion of Q4. 
Further from block matrix inversion of Q4 and equations (19), (21), we have G4 1,1 = G4 7,7 = γEtr(K 1) = G0 1,1 = γτ, G4 2,2 = G4 8,8 = γEtr( ˆK 1) = G0 2,2 = γ τ, G4 3,6 = G4 9,12 = γ ρs Etr[Σs W ˆK 1W] N = G0 3,6 = γ ρs τIs 1,1, G4 5,4 = G4 11,10 = ρs Etr[Σs ZK 1Z ] d = G0 5,4 = ρsτIs 1,1 ϕ . Plugging these into (30), we have the following self-consistent equations G4 2,8 = ρsψϕ2G4 5,10 (ϕ + ρsτψ(ωs + Is 1,1))2 , G4 5,10 = ρsτ 2 ϕ Is 2,2 + ρ ϕ Is 2,2G4 2,8 Solving for G4 2,8 and plugging in to D32, we get D32 = 2ρjωj ρs G4 2,8 = 2ρjωjψκ2Is 2,2 ρs(ϕ ψκ2Is 2,2). (31) B.1.9. COMPUTATION OF Disj SW(ϕ, ψ, γ) Combining equations (26), (17), (27), (29), (31), we get Disj SW(ϕ, ψ, γ) = Disj I (ϕ, ψ, γ) 2ρjψκ2(ωj + ϕIj 1,2)Is 2,2 ρs(ϕ ψκ2Is 2,2) . B.2. Proof of Corollary 3.2 Since κ 1/ωs for any γ > 0 by (4), we know limγ 0 γκ = 0. Thus from (6), we have lim γ 0 γτ = |ψ ϕ| + ψ ϕ 2ψ , lim γ 0 γ τ = 1 ψ ϕ lim γ 0 γτ = |ψ ϕ| + ϕ ψ Demystifying Disagreement-on-the-Line in High Dimensions By Condition 2.3 and the dominated convergence theorem, the functionals Is a,b, It a,b and their derivatives with respect to κ are continuous in κ. Applying the implicit function theorem to the self-consistent equation (4), viewing it as a function of κ and γ, we find that κ is differentiable with respect to γ and thus continuous. Therefore, the limit of κ, Is a,b, It a,b when γ 0 is well defined. Plugging these limits into Theorem 3.1, we reach lim γ 0 Disj I (ϕ, ψ, γ) = 2ρjψκ ρs|ϕ ψ|(σ2 ε + Is 1,1)(ωj + Ij 1,1) + 2ρjκ(σ2 ε+ϕIs 1,2)Ij 2,2 ρs(ωs+ϕIs 1,2) ϕ > ψ, 2ρjκ(ωj+ϕIj 1,2)Is 2,2 ρs(ωs+ϕIs 1,2) ϕ < ψ, lim γ 0 Disj SS(ϕ, ψ, γ) = lim γ 0 Disj I (ϕ, ψ, γ) 2ρjκ2(σ2 ε + ϕIs 1,2)Ij 2,2 ρs(1 κ2Is 2,2) , (32) lim γ 0 Disj SW(ϕ, ψ, γ) = lim γ 0 Disj I (ϕ, ψ, γ) 2ρjψκ2(ωj + ϕIj 1,2)Is 2,2 ρs(ϕ ψκ2Is 2,2) . (33) From the equation (5), we have Is 1,1 = ϕIs 1,2 + κIs 2,2. Also by (4) and (6), ωs = 1 γτ κ Is 1,1. Therefore, ωs + ϕIs 1,2 = 1 γτ κ Is 1,1 + ϕIs 1,2 = 1 γτ κ κIs 2,2. (34) In the ridgeless limit γ 0, the equation (34) gives lim γ 0 1 ωs + ϕIs 1,2 = limγ 0 κ 1 κ2Is 2,2 ϕ > ψ, limγ 0 ψκ ϕ ψκ2Is 2,2 ϕ < ψ. (35) Putting (32), (33), (35) together, we conclude lim γ 0 Disj SS(ϕ, ψ, γ) = 2ρjψκ ρs|ϕ ψ|(σ2 ε + Is 1,1)(ωj + Ij 1,1) + (ωj+ϕIj 1,2)Is 2,2 ωs+ϕIs 1,2 κ(σ2 ε+ϕIs 1,2)Ij 2,2 1 κ2Is 2,2 lim γ 0 Disj SW(ϕ, ψ, γ) = 2ρjψκ ρs|ϕ ψ|(σ2 ε + Is 1,1)(ωj + Ij 1,1) + (σ2 ε+ϕIs 1,2)Ij 2,2 ωs+ϕIs 1,2 ψκ(ωj+ϕIj 1,2)Is 2,2 ϕ ψκ2Is 2,2 B.3. Proof of Theorem 4.1 By Corollary 3.2, disagreement in the ridgeless and overparametrized regime is given by lim γ 0 Disj I (ϕ, ψ, γ) = 2ρjψκ ρs|ϕ ψ|(σ2 ε + Is 1,1)(ωj + Ij 1,1) + 2ρjκ(σ2 ε + ϕIs 1,2)Ij 2,2 ρs(ωs + ϕIs 1,2) , lim γ 0 Disj SS(ϕ, ψ, γ) = 2ρjψκ ρs|ϕ ψ|(σ2 ε + Is 1,1)(ωj + Ij 1,1). The self-consistent equation (7) in the overpametrized regime ϕ > ψ is κ = 1 ωs + Is 1,1(κ), which is independent of ψ. Consequently, the unique positive solution κ is also independent of ψ. This proves that the slope a and the intercept b I defined in Theorem 4.1 are independent of ψ as well. Checking the equation (9) can be done by using (35) and a simple algebra. Demystifying Disagreement-on-the-Line in High Dimensions B.4. Proof of Theorem 4.3 Let a(γ), b I(γ), b SS(γ) be defined by (8), but with κ in the self-consistent equation (4) with general γ, instead of the self-consistent equation (7) in the ridgeless limit. With this notation, we have a = a(0), b I = b I(0), b SS = b SS(0). 
By Theorem 3.1 and the triangle inequality, deviation from the line is bounded by |Dist i(ϕ, ψ,γ) a Diss i(ϕ, ψ, γ) bi| |Dist i(ϕ, ψ, γ) a(γ)Diss i(ϕ, ψ, γ) bi(γ)| + |a(γ) a(0)||Diss i(ϕ, ψ, γ)| + |bi(γ) bi(0)| A1 + A2 + Diss i(ϕ, ψ, γ)|a(γ) a(0)| + |bi(γ) bi(0)|, i {I, SS}, (36) A1 = 2ψγτκIs 2,2|ρt(ωt + ϕIt 1,2) aρs(ωs + ϕIs 1,2)| ϕγ + ρs(ψγτ + ϕγ τ)(ωs + ϕIs 1,2) , A2 = 2(σ2 ε + ϕIs 1,2)|ρt It 2,2 aρs Is 2,2| κϕγ τ ϕγ + ρs(ψγτ + ϕγ τ)(ωs + ϕIs 1,2) κ2 ρs(1 κ2Is 2,2) In what follows, we bound each of these terms. We will use O( ) notation to hide constants depending on ϕ, µ, σ2 ε, σ. For example, we can write Ij a,b = O(1) for j {s, t} since we assume in Condition 2.3 that µ is compactly supported. B.4.1. BOUNDING A1 We know a ρt(ωt + It 1,1)/ρsωs by (8). Thus, Is 2,2|ρt(ωt + ϕIt 1,2) aρs(ωs + ϕIs 1,2)| = O(1). (37) By (6) and since p x2 + y2 |x| + |y| for any x, y R, (ψ ϕ)2 + 4κψϕγ/ρs + ψ ϕ Again by (6), ψγτ + ϕγ τ = p (ψ ϕ)2 + 4κψϕγ/ρs. Therefore, κ ϕγ + ρs(ψγτ + ϕγ τ)(ωs + ϕIs 1,2) κ ρs(ψγτ + ϕγ τ)(ωs + ϕIs 1,2) = O 1 1 ψ/ϕ + ψγ Here, we used κ 1 ωs = O(1) by (4). Combining (37), (38), (39), we reach A1 = O ψγ 1 ψ/ϕ + ψγ B.4.2. BOUNDING A2 Similar to (37), we have 2(σ2 ε + ϕIs 1,2)|ρt It 2,2 aρs Is 2,2| = O(1). (41) ρs(1 κ2Is 2,2) = κ2 ρs[γτ + κ(ωs + ϕIs 1,2)]. (42) From (42) and κ = γρsτ τ, κϕγ τ ϕγ + ρs(ψγτ + ϕγ τ)(ωs + ϕIs 1,2) κ2 ρs(1 κ2Is 2,2) = κ2(ωs + ϕIs 1,2)ψγτ [ϕγ + ρs(ψγτ + ϕγ τ)(ωs + ϕIs 1,2)][γτ + κ(ωs + ϕIs 1,2)]. Demystifying Disagreement-on-the-Line in High Dimensions From (38), (39), and κ(ωs + ϕIs 1,2)/[γτ + κ(ωs + ϕIs 1,2)] 1, we get κϕγ τ ϕγ + ρs(ψγτ + ϕγ τ)(ωs + ϕIs 1,2) κ2 ρs(1 κ2Is 2,2) = O ψγ 1 ψ/ϕ + ψγ Putting (41) and (43) together, A2 = O ψγ 1 ψ/ϕ + ψγ B.4.3. BOUNDING Diss I(ϕ, ψ, γ) AND Diss SS(ϕ, ψ, γ) By Theorem 3.1 and the equations (6), (38), (39), we have Diss I(ϕ, ψ, γ) = O 1 + ψγ 1 ψ/ϕ + ψγ By Theorem 3.1 and the equations (38), (39), (43), we have Diss SS(ϕ, ψ, γ) = O ψ + ψγ 1 ψ/ϕ + ψγ B.4.4. BOUNDING |a(γ) a(0)| From the argument in Section B.2, we know a(γ) is differentiable with respect to γ. By the chain rule and (5), γ It 2,2(ωs + Is 1,1) + Is 2,2(ωt + It 1,1) (ωs + Is 1,1)2 . (47) By implicit differentiation of (4), we have κ γ = κ ϕγ + ρs(ψγτ + ϕγ τ)(ωs + ϕIs 1,2). (48) We have |( It 2,2(ωs + Is 1,1) + Is 2,2(ωt + It 1,1))/(ωs + Is 1,1)2| = O(1) and (ψ ϕ)2 + ψϕγ since ψγτ + ϕγ τ = p (ψ ϕ)2 + 4κψϕγ/ρs. Therefore, |a(γ) a(0)| = a γ (u)du Z γ (ψ ϕ)2 + ψϕu du = O γ 1 ψ/ϕ + ψγ B.4.5. BOUNDING |b I(γ) b I(0)| From the argument in Section B.2, we know b I(γ) is differentiable with respect to γ. In (8), the terms κ2 1 κ2Is 2,2 , σ2 ε + ϕIs 1,2, ρt aρs Is 2,2 and their derivatives with respect to κ are O(1). Thus, (ψ ϕ)2 + ψϕγ Demystifying Disagreement-on-the-Line in High Dimensions |b I(γ) b I(0)| = γ (u)du Z γ (ψ ϕ)2 + ψϕu du = O γ 1 ψ/ϕ + ψγ Theorem 4.3 is proved by combining the equations (36), (40), (44), (45), (46), (49), (50). B.5. Proof of Corollary 4.4 By Ej = Bj + Vj = Bj + 1 2Disj I (ϕ, ψ, γ) and (51), we have |Et a Es brisk| 1 2|Dist I(ϕ, ψ, γ) a Diss I(ϕ, ψ, γ) b I| + Bt a Bs lim γ 0(Bt a Bs) . Since the derivatives of Ij 1,1, Ij 1,2 with respect to γ is O(1). We have Bt a Bs lim γ 0(Bt a Bs) = O(γ) by the mean value theorem. The conclusion follows from Theorem 4.3. C. Recap of Tripuraneni et al. (2021) In this section, we restate some relevant results of Tripuraneni et al. (2021), in the special cases Σ = Σs or Σ = Σt. See Tripuraneni et al. (2021) for the original theorems. 
For a test distribution x N(0, Σ ), define the risk by EΣ = Ex,β,X,Y,W [(β x ˆy W,X,Y (x))2]. We have the following bias-variance decomposition EΣ = Ex,β[(β x EW,X,Y [ˆy W,X,Y (x)])2] + Ex,β[VW,X,Y (ˆy W,X,Y (x))] = BΣ + VΣ . We consider the high-dimensional limit n, d, N with d/n ϕ and d/N ψ of the above quantities when Σ = Σs or Σ = Σt, Ej = lim n,d,N EΣj, Bj = lim n,d,N BΣj, Vj = lim n,d,N VΣj, j {s, t}. Theorem C.1 (Theorem 5.1 of Tripuraneni et al. (2021)). For j {s, t}, the asymptotic bias and variance are given by 2 mj + 2 1 rρj ρs Ij 1,1 + ρjϕ Is 1,1(ωs + ϕIs 1,2)(ωj + Ij 1,1) + ϕ2 ψ γ τIs 1,2Ij 2,2 +γτIs 2,2(ωj + ϕIj 1,2) + σ2 ε (ωs + ϕIs 1,2)(ωj + Ij 1,1) + ϕ ψ γ τIj 2,2 where κ, τ, τ are defined in (4) and (6). In the ridgeless limit γ 0, the variance Vj is further simplified as follows. Demystifying Disagreement-on-the-Line in High Dimensions Corollary C.2 (Corollary 5.1 of Tripuraneni et al. (2021)). For j {s, t}, the asymptotic variance in the ridgeless limit is lim γ 0 Vj = ρjψκ ρs|ϕ ψ|(σ2 ε + Is 1,1)(ωj + Ij 1,1) + 1 κ(ωs σ2 ε) 1 κ2Is 2,2 Ij 2,2 ϕ ψ, ρjκ2ψIs 2,2 ρs(ϕ κ2ψIs 2,2)(ωj + ϕIj 1,2) ϕ < ψ, where κ is defined in (7). Another important observation is that there is a linear relation between the asymptotic error under the source and target domain. Proposition C.3 (Proposition 5.6 of Tripuraneni et al. (2021)). We assume ϕ is fixed. In the ridgeless limit γ 0 and the overparametrized regime ϕ ψ, the error Et is linear in Es, as a function of ψ. That is, lim γ 0 Et = brisk + ρt(ωt + It 1,1) ρs(ωs + Is 1,1) lim γ 0 Es, where the intercept 2b I + lim γ 0(Bt a Bs) (51) and the slope ρt(ωt + It 1,1)/ρs(ωs + Is 1,1) are independent of ψ. D. Additional Experiments D.1. Estimation of the Slope Let ˆΣs, ˆΣt be sample covariance of test inputs from the source and target domain, respectively. Denote the eigenvalues and corresponding eigenvectors of ˆΣs by ˆλs 1, . . . , ˆλs d and ˆv1, . . . , ˆvd. Define ˆλt i = ˆv i ˆΣtˆvi for i [d]. For j {s, t}, we estimate Ij a,b(κ) by ˆIj a,b(κ) = ϕ (ˆλs i)a 1ˆλj i (ϕ + κˆλs i)b . We estimate the constants defined in (3) by replacing mj with ˆmj = tr(ˆΣj), j {s, t}. Now, the self-consistent equation (7) is estimated by ˆκ = min(1, ϕ/ψ) ˆωs + ˆIs 1,1(ˆκ) , and its unique non-negative solution is denoted by ˆκ. The existence and uniqueness of ˆκ follows from Lemma A1.2 of Tripuraneni et al. (2021). We use ˆa = ˆρt(ˆωt + ˆIt 1,1(ˆκ)) ˆρs(ˆωs + ˆIs 1,1(ˆκ)) as an estimate of the slope a = ρt(ωt + It 1,1)/ρs(ωs + Is 1,1). D.2. Deviation from the Line Figure 5 displays deviation from the line for I disagreement and risk, when non-zero ridge regularization γ is used. Similar to Figure 3 (b), the deviation is smaller for γ closer to zero. However, unlike SS disagreement, the deviation is non-zero even in the infinite overparameterization limit ψ 0. This is consistent with the upper bound we present in Theorem 4.3 and Corollary 4.4. D.3. Varying Corruption Severity CIFAR-10-C and Tiny Image Net-C have different severity of corruption ranging from 1 to 5. We only included a few selected results in the main text due to space limitations. We present the plots for all severity levels in Figure 7. Demystifying Disagreement-on-the-Line in High Dimensions 0.00 0.25 0.50 0.75 1.00 ψ/φ 0.00 0.25 0.50 0.75 1.00 ψ/φ Figure 5. (a) Deviation from the line, Dist I (ϕ, ψ, γ) a Diss I (ϕ, ψ, γ) b I, as a function of ψ for non-zero γ. (b) Deviation from the line, Et a Es brisk, as a function of ψ for non-zero γ. 
We use ϕ = 0.5, σε² = 10⁻⁴, ReLU activation σ, and µ = 0.4 δ(0.1,1) + 0.6 δ(1,0.1).

D.4. I and SW Disagreement

In Figure 8, Figure 9, and Figure 6 (a), (b), we repeat the experiment of Section D.3 for I and SW disagreement. Since our theory suggests that the disagreement-on-the-line phenomenon does not occur for SW disagreement, we do not plot theoretical predictions for SW disagreement.

D.5. Accuracy and Agreement

In the main text, we consider disagreement and risk defined in terms of mean squared error; here we present classification accuracy and 0-1 agreement, as studied in Hacohen et al. (2020); Chen et al. (2021); Jiang et al. (2021); Nakkiran & Bansal (2020); Baek et al. (2022); Atanov et al. (2022); Pliushch et al. (2022); Kirsch & Gal (2022). See Figure 10 and Figure 6 (c).

Figure 6. (a) Target vs. source independent disagreement of the random features model trained on Camelyon17. (b) Target vs. source shared-weight disagreement of the random features model trained on Camelyon17. (c) Target vs. source accuracy and agreement of the random features model trained on Camelyon17. Experimental setting is identical to Section 5.

Figure 7. Target vs. source shared-sample disagreement on CIFAR-10 and Tiny ImageNet with varying corruption severity. Experimental setting is identical to Section 5.

Figure 8. Target vs. source independent disagreement on CIFAR-10 and Tiny ImageNet with varying corruption severity. Experimental setting is identical to Section 5.

Figure 9. Target vs. source shared-weight disagreement on CIFAR-10 and Tiny ImageNet with varying corruption severity. Experimental setting is identical to Section 5.

Figure 10. Target vs. source classification accuracy and agreement on CIFAR-10 and Tiny ImageNet with varying corruption severity. Experimental setting is identical to Section 5.
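Finally, as a complement to the plots above, here is a minimal synthetic-data sketch of shared-sample disagreement in the setting of Figures 5 and 7 (ϕ = 0.5, σε² = 10⁻⁴, ReLU activation, µ = 0.4 δ(0.1,1) + 0.6 δ(1,0.1)). This is our own illustration, not the code behind the figures: a small ridge γ > 0 stands in for the ridgeless limit (so, by Theorem 4.3, the source-target relation is only approximately linear), β is drawn with i.i.d. standard Gaussian entries, and all sample sizes are modest toy choices.

```python
import numpy as np

rng = np.random.default_rng(4)
relu = lambda u: np.maximum(u, 0.0)

d, n = 200, 400                          # phi = d/n = 0.5
gamma, sigma_eps = 1e-3, 1e-2            # small ridge; sigma_eps^2 = 1e-4
lam_s = np.where(rng.random(d) < 0.4, 0.1, 1.0)   # mu = 0.4*delta_(0.1,1) + 0.6*delta_(1,0.1)
lam_t = np.where(lam_s == 0.1, 1.0, 0.1)          # paired target eigenvalues

def fit(W, X, Y):
    """Random-features ridge fit; returns a predictor mapping a d x m test matrix to m predictions."""
    F = relu(W @ X / np.sqrt(d))                                  # N x n feature matrix
    N = W.shape[0]
    A = np.linalg.solve(F.T @ F / N + gamma * np.eye(n), Y)       # K^{-1} Y
    return lambda Xte: relu(W @ Xte / np.sqrt(d)).T @ F @ A / N   # (1/N) f^T F K^{-1} Y

def ss_disagreement(N, n_test=2000):
    """Shared-sample disagreement: same training data (X, Y), independent weights W1, W2,
    evaluated on source and on target test inputs."""
    beta = rng.normal(size=d)
    X = np.sqrt(lam_s)[:, None] * rng.normal(size=(d, n))
    Y = X.T @ beta / np.sqrt(d) + sigma_eps * rng.normal(size=n)
    yhat1 = fit(rng.normal(size=(N, d)), X, Y)
    yhat2 = fit(rng.normal(size=(N, d)), X, Y)
    dis = []
    for lam in (lam_s, lam_t):
        Xte = np.sqrt(lam)[:, None] * rng.normal(size=(d, n_test))
        dis.append(np.mean((yhat1(Xte) - yhat2(Xte)) ** 2))
    return dis

for N in (450, 600, 800, 1200, 2000):    # psi = d/N < phi: overparameterized regime
    src, tgt = ss_disagreement(N)
    print(f"psi = {d/N:.3f}: source Dis_SS = {src:.4f}, target Dis_SS = {tgt:.4f}")
# Plotting the (source, target) pairs across psi gives an approximately linear trend,
# consistent with the shared-sample disagreement-on-the-line result (Theorem 4.1).
```

Replacing the two weight matrices by a shared W and two independent training sets gives the analogous SW experiment.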