ON BIAS-VARIANCE ALIGNMENT IN DEEP MODELS

Lin Chen, Michal Lukasik, Wittawat Jitkrittum, Chong You, Sanjiv Kumar
Google Research, {linche,mlukasik,wittawat,cyou,sanjivk}@google.com

Classical wisdom in machine learning holds that the generalization error can be decomposed into bias and variance, and that these two terms exhibit a trade-off. However, in this paper we show that for an ensemble of deep learning based classification models, bias and variance are aligned at a sample level: squared bias is approximately equal to variance for correctly classified sample points. We present empirical evidence confirming this phenomenon in a variety of deep learning models and datasets. Moreover, we study this phenomenon from two theoretical perspectives: calibration and neural collapse. We first show theoretically that, under the assumption that the models are well calibrated, we can observe the bias-variance alignment. Second, starting from the picture provided by the neural collapse theory, we show an approximate correlation between bias and variance.

1 INTRODUCTION

The concepts of bias and variance, obtained from decomposing the generalization error, are of fundamental importance in machine learning. Classical wisdom suggests that there is a trade-off between bias and variance: models of low capacity have high bias and low variance, while models of high capacity have low bias and high variance. This understanding served as an important guiding principle for developing generalizable machine learning models, suggesting that they should be neither too large nor too small (Bishop, 2006). Recently, a line of research found that deep models defy this classical wisdom (Belkin et al., 2019): their variance curves exhibit a unimodal shape that first increases with model size, then decreases beyond the point at which the models can perfectly fit the training data (Neal et al., 2018; Yang et al., 2020). While the unimodal variance curve explains why over-parameterized deep models generalize well, there is a lack of understanding of why it occurs.

[Figure 1 panels: (a) illustration of bias-variance alignment in bias-variance space (log scale), with high, medium, and low bias-variance clusters, and in output space; (b) bias-variance for a vision task (variance vs. squared bias, colored by whether each sample is correctly classified); (c) bias-variance for an NLP task.]

Figure 1: The bias-variance alignment phenomenon. (a) Given an input $x$ and its associated label $y$, bias-variance alignment refers to the phenomenon that the bias and variance of a deep model satisfy $\log \mathrm{Bias}^2_{h_\theta,(x,y)} \approx \log \mathrm{Vari}_{h_\theta,(x,y)}$ for correctly classified points, illustrated as the dashed line (see the left subfigure). In the right subfigure, each cross represents a prediction of the model ensemble $\{h_\theta\}$ on a sample $x$, and the center is the one-hot encoding $e_y$ of the corresponding label $y$. Green, yellow, and red clusters of crosses correspond to the three groups also shown in the left subplot, with small, medium, and large bias, respectively. Bias-variance alignment implies that the three groups have small, medium, and large variance, respectively. (b) Per-example bias and variance for ResNet-50 trained on ImageNet, where each dot corresponds to a test sample and is colored according to whether the sample is correctly classified by the model.
Bias and variance are estimated from 20 independently trained networks with different initial weights and over different bootstrap samples from the training set, following the methodology of Neal et al. (2018). (c) Per-example bias and variance for BERT trained on the TREC dataset (see Section E.7 in the Appendix for details).

This paper revisits the study of bias and variance to understand their behavior in deep models. We perform a per-sample measurement of bias and variance in popular deep classification models. Our study reveals a curious phenomenon, which is radically different from the classical trade-off perspective on bias and variance, while being concordant with more recent works (Belkin et al., 2019; Hastie et al., 2022; Mei & Montanari, 2022). Given a sample $x$ and its corresponding label $y$ from a dataset of test examples $\{(x_i, y_i)\}_{i \in [n]}$, let $\mathrm{Bias}_{h_\theta,(x,y)}$ and $\mathrm{Vari}_{h_\theta,(x,y)}$ be the bias and variance, respectively, of an ensemble of deep models $\{h_\theta\}$. Here, the randomness in calculating the bias and variance comes from $\theta$, which depends on the randomness in parameter initialization, batching, and sampling of the training data (Neal et al., 2018; Yang et al., 2020). Our key empirical observations can be summarized by the following two statements, which we call the Bias-Variance Alignment and the Upper Bounded Variance. As we explain in the remainder of the paper, and as summarized in Table 1, these statements encapsulate our empirical observations and also capture the special cases we prove from the calibration and neural collapse assumptions.

Bias-Variance Alignment. First, we find that for correctly classified points $(x, y) \in \{(x_i, y_i)\}_{i \in [n]}$:
$$\log \mathrm{Vari}_{h_\theta,(x_i,y_i)} \approx \log \mathrm{Bias}^2_{h_\theta,(x_i,y_i)} + E_{h_\theta} \quad (E_{h_\theta}\ \text{is a constant independent of}\ i),$$
or, equivalently,
$$\log \mathrm{Vari}_{h_\theta,(x_i,y_i)} = \log \mathrm{Bias}^2_{h_\theta,(x_i,y_i)} + E_{h_\theta} + \varepsilon_i , \qquad (1)$$
where $\varepsilon_i$ is random noise with mean vanishing across the dataset (i.e., $\mathbb{E}_{i \sim \mathrm{Unif}([n])}[\varepsilon_i] = 0$). Specifically, our quantitative results show that (1) a simple linear regression of $\log \mathrm{Vari}_{h_\theta,(x_i,y_i)}$ on $\log \mathrm{Bias}^2_{h_\theta,(x_i,y_i)}$ yields a remarkably high coefficient of determination $R^2$; and (2) the residuals of the simple linear regression exhibit an approximately normal distribution (we provide evidence for (1) and (2) in Section 3.1).

Table 1: A summary of our findings on the bias-variance alignment.
- Empirical (Sec. 3). Assumptions: large model size; correctly classified data. Finding (logarithmic scale): $\log \mathrm{Vari}_{h_\theta,(x_i,y_i)} \approx \log \mathrm{Bias}^2_{h_\theta,(x_i,y_i)} + E_{h_\theta}$. Finding (linear scale): $\mathrm{Vari}_{h_\theta,(x_i,y_i)} = C_{h_\theta}\,\mathrm{Bias}^2_{h_\theta,(x_i,y_i)} + \xi_i$.
- Theoretical (Sec. 4). Assumption: perfect calibration. Finding (linear scale): $\mathrm{Bias}^2_{h_\theta,(x_i,y_i)} \approx \mathrm{Vari}_{h_\theta,(x_i,y_i)}$. (Note a)
- Theoretical (Sec. 5). Assumptions: neural collapse; binary classification. Finding (logarithmic scale): $\log \mathrm{Bias}^2_{h_\theta,(x_i,y_i)}(k) - \log \mathrm{Vari}_{h_\theta,(x_i,y_i)}(k) \in (1.114, 4)$. Finding (linear scale): $\mathrm{Bias}^2_{h_\theta,(x_i,y_i)} / \mathrm{Vari}_{h_\theta,(x_i,y_i)} \in \big((2s-1)^2/\exp(2s),\, 3\big)$. (Note b)

Note a: The result $\mathrm{Bias}^2_{h_\theta,(x,y)} \approx \mathrm{Vari}_{h_\theta,(x,y)}$ corresponds to $C = 1$ in our main empirical observation on the linear scale presented in Eq. (2), and we bound $\xi_i = \mathrm{Vari}_{h_\theta,(x,y)} - \mathrm{Bias}^2_{h_\theta,(x,y)}$ by the calibration error in Section 4.
Note b: Here, $k \in \{1, 2\}$ is the class index and $s$ is (roughly speaking) the $\ell_2$ norm of the prelogit of $h_\theta(x)$ (i.e., before softmax).

In linear scale, we can represent Equation (1) as
$$\mathrm{Vari}_{h_\theta,(x_i,y_i)} = C_{h_\theta}\,\mathrm{Bias}^2_{h_\theta,(x_i,y_i)} + \xi_i , \qquad \xi_i = O\big(\mathrm{Bias}^2_{h_\theta,(x_i,y_i)}\big)\,\eta_i , \qquad (2)$$
where $C_{h_\theta} = e^{E_{h_\theta}}\,\mathbb{E}_{i \sim \mathrm{Unif}([n])}[e^{\varepsilon_i}] > 0$ is a constant and $\eta_i$ is noise such that $\mathbb{E}_{i \sim \mathrm{Unif}([n])}[\eta_i] = 0$. A formal statement of the above formulations can be found in Proposition E.1 in Appendix E.4.
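To make these per-sample quantities concrete, below is a minimal sketch of how one might estimate the squared bias and variance entering Eqs. (1) and (2) from an ensemble of softmax predictions (using the MSE decomposition defined in Section 2.1) and fit the constant $C_{h_\theta}$ on correctly classified points. The array shapes, variable names, and the synthetic stand-in data are our own illustrative assumptions, not the authors' released code.

```python
import numpy as np

def per_sample_bias_variance(probs, labels):
    """Per-sample squared bias and variance of an ensemble under the MSE decomposition.

    probs:  array of shape (M, N, K) -- softmax outputs of M ensemble members
            on N test samples with K classes (illustrative names).
    labels: integer array of shape (N,) with the true classes.
    Returns (bias_sq, variance), each of shape (N,).
    """
    M, N, K = probs.shape
    mean_pred = probs.mean(axis=0)                       # mean prediction over the ensemble, shape (N, K)
    onehot = np.eye(K)[labels]                           # one-hot labels e_y, shape (N, K)
    bias_sq = ((mean_pred - onehot) ** 2).sum(axis=1)    # squared L2 distance of mean prediction to e_y
    variance = ((probs - mean_pred) ** 2).sum(axis=2).mean(axis=0)  # mean squared L2 spread around the mean
    return bias_sq, variance

# Example with synthetic stand-in predictions for an ensemble of 20 "models".
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=(20, 1000))      # shape (20, 1000, 10)
labels = rng.integers(0, 10, size=1000)
bias_sq, variance = per_sample_bias_variance(probs, labels)
correct = probs.mean(axis=0).argmax(axis=1) == labels    # correctness of the ensemble's mean prediction
C_hat = np.exp(np.mean(np.log(variance[correct]) - np.log(bias_sq[correct])))  # geometric mean of Vari / Bias^2
print(f"fitted C = {C_hat:.3f}")
```

In practice, `probs` would hold the test-set predictions of the independently trained networks described in Figure 1, and a fitted `C_hat` close to 1 would correspond to the alignment claimed in Eq. (2).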
Note that because the noise term ξi scales with squared bias, Equation (2) predicts that the sample-wise bias-variance in linear scale has a cone-shaped distribution (i.e., as bias increases, an increasingly wider range of variance is covered by examples). We discuss this in more detail in Appendix E.4. Upper Bounded Variance. Second, we find that the following relation approximately holds for all examples (i.e., for both correctly and incorrectly classified examples): Bias2 hθ,(x,y) Chθ Varihθ,(x,y), (x, y) {(xi, yi)}i [n] , (3) where Chθ is the same constant as in Equation (2). In Figure 1, we illustrate these findings on an illustrative example. Observe that for correctly classified sample points, the bias and variance align closely along the line of Bias2 hθ,(x,y) = Varihθ,(x,y), i.e., the equality in (1) holds. We refer to this phenomenon as the bias-variance alignment. For incorrectly classified samples, we observe Bias2 hθ,(x,y) > Varihθ,(x,y), hence the inequality in (3) approximately holds for all examples. It is worth noting that Eq. (3) provides an explanation for why deep models have limited variance, and in Published as a conference paper at ICLR 2024 effect, good generalization, i.e., the variance of a model is always bounded from above by the squared bias at every sample. The paper provides both empirical and theoretical analyses of the bias-variance alignment phenomenon. We organize the paper as follows. We begin with empirical investigations, in which we observe the bias-variance alignment across architectures and datasets (Section 3). We then move on to theoretical explanations of these phenomena. We start from a statistical perspective, where we connect calibration and the bias-variance alignment (Section 4). In the process, we generalize the theory from previous works on calibration implying the generalization-disagreement equality (Jiang et al., 2022; Kirsch & Gal, 2022) (Section 4.1). Next, we show how starting from a separate perspective of neural collapse (Papyan et al., 2020) can lead to the bias-variance approximate equality result (Section 5). We conclude with the discussion of wider implications of our findings (Section 6). Our main contributions are: (1) We conduct experiments to show that the bias-variance alignment holds for a variety of model architectures and on different datasets. (2) We provide evidence that the phenomenon does not occur if the model is small. This suggests that the bias-variance alignment is specific to large neural networks and provides more evidence that there could be a sharp difference between small and large models. (3) Theoretically, we prove the bias-variance alignment under the assumption that the model is well-calibrated (i.e., the output of the softmax layer aligns with the true conditional probability of each class given the data). As a side product, we provide a unified definition for a variety of definitions of calibration introduced in previous works. (4) We show that the neural collapse theory predicts the approximate bias-variance alignment. 2 BACKGROUND AND RELATED WORK 2.1 BACKGROUND ON BIAS-VARIANCE DECOMPOSITION Consider the task of learning a multi-class classification model hθ : X M([K]) RK, where X is the input domain, M([K]) is the set of distributions on [K], and K is the number of classes. Let {hθ : X M([K])} be an ensemble of trained models, where θ is a random variable taking values from Θ1. For any input x X, we use hθ( | x) to represent the corresponding distribution. That is, hθ( | x) (hθ(1 | x), . . . 
, $h_\theta(K \mid x))$ is the vector of predictive probabilities from model $h_\theta$. Given any sample $(X, Y) \in \mathcal{X} \times [K]$, the bias and variance of $\{h_\theta\}$ with respect to the mean squared error (MSE) loss are defined as follows.

Definition 2.1 (Bias and Variance). Let $\bar{h}(\cdot \mid x) \triangleq \mathbb{E}_\theta h_\theta(\cdot \mid x)$ be the mean function of $\{h_\theta\}$. The bias, variance, and bias-variance gap of the $i$-th entry on $(X, Y)$, for each $i \in [K]$, are defined as
$$\mathrm{Bias}_{h_\theta,(X,Y)}(i) = \beta_{h_\theta,(X,Y)}(i) \triangleq \big|\bar{h}(i \mid X) - \mathbb{1}\{Y = i\}\big| , \qquad (4)$$
$$\mathrm{Vari}_{h_\theta,(X,Y)}(i) = \varsigma^2_{h_\theta,(X,Y)}(i) \triangleq \mathbb{E}_\theta\big(h_\theta(i \mid X) - \bar{h}(i \mid X)\big)^2 , \qquad (5)$$
$$\mathrm{BVG}_{h_\theta,(X,Y)}(i) \triangleq \mathrm{Bias}^2_{h_\theta,(X,Y)}(i) - \mathrm{Vari}_{h_\theta,(X,Y)}(i) . \qquad (6)$$
Throughout this paper, we use $\mathrm{Bias}_{h_\theta,(X,Y)}(i)$ and $\beta_{h_\theta,(X,Y)}(i)$ interchangeably as synonyms, and we call $\varsigma_{h_\theta,(X,Y)}(i)$ the standard deviation of the $i$-th entry. Moreover, the total bias, variance, and bias-variance gap are defined as
$$\mathrm{Bias}_{h_\theta,(X,Y)} \triangleq \sqrt{\textstyle\sum_{i \in [K]} \mathrm{Bias}^2_{h_\theta,(X,Y)}(i)} = \big\|\bar{h}(\cdot \mid X) - e_Y\big\|_2 , \qquad (7)$$
$$\mathrm{Vari}_{h_\theta,(X,Y)} \triangleq \textstyle\sum_{i \in [K]} \mathrm{Vari}_{h_\theta,(X,Y)}(i) = \mathbb{E}_\theta\big\|h_\theta(\cdot \mid X) - \bar{h}(\cdot \mid X)\big\|_2^2 , \qquad (8)$$
$$\mathrm{BVG}_{h_\theta,(X,Y)} \triangleq \mathrm{Bias}^2_{h_\theta,(X,Y)} - \mathrm{Vari}_{h_\theta,(X,Y)} , \qquad (9)$$
where $e_i \in \mathbb{R}^K$ is a vector whose $i$-th entry is 1 and all other entries are 0.

It is well known that bias and variance provide a decomposition of the expected risk with respect to the MSE loss (Hastie et al., 2009, equation (2.25)). That is,
$$\mathrm{Risk}_{h_\theta,(X,Y)} \triangleq \mathbb{E}_\theta\big\|h_\theta(\cdot \mid X) - e_Y\big\|_2^2 = \mathrm{Bias}^2_{h_\theta,(X,Y)} + \mathrm{Vari}_{h_\theta,(X,Y)} . \qquad (10)$$

Footnote 1: In all of our empirical results, the randomness comes from weight initialization and bootstrapping of the training set, following Neal et al. (2018). The effect of different sources of randomness is studied in Appendix E.6.

Table 2: Summary of findings about calibration, generalization, and disagreement.
- Premise: Calibration. Finding: Generalization error = Disagreement. Pointwise vs. in aggregate: In aggregate. Empirical vs. theoretical: Both. References: (Jiang et al., 2022; Kirsch & Gal, 2022).
- Premise: Generalization. Finding: Calibration. Pointwise vs. in aggregate: In aggregate. Empirical vs. theoretical: Empirical. References: (Carrell et al., 2022).
- Premise: Multi-domain calibration. Finding: Out-of-domain generalization. Pointwise vs. in aggregate: In aggregate. Empirical vs. theoretical: Empirical. References: (Wald et al., 2021).
- Premise: Calibration. Finding: Bias^2 ≈ Variance. Pointwise vs. in aggregate: Pointwise. Empirical vs. theoretical: Both. References: This work.
- Premise: Neural collapse. Finding: Bias^2 ≈ Variance. Pointwise vs. in aggregate: Pointwise. Empirical vs. theoretical: Both. References: This work.

We focus on models trained using the cross-entropy (CE) loss throughout the paper. However, we analyze the bias and variance of these models using a decomposition of the mean squared error (MSE) loss. We present results for bias and variance from decomposing the CE loss in Section E.2. We now introduce several notations that will be used throughout the paper.

Definition 2.2. The prediction, confidence, and accuracy of $h$ on $x$ are defined by
$$\mathrm{pred}_h(x) = \arg\max_{j \in [K]} h(j \mid x) , \qquad \mathrm{conf}_h(x) = h(\mathrm{pred}_h(x) \mid x) , \qquad \mathrm{acc}_h(x) = P_{Y|X}(\mathrm{pred}_h(x) \mid x) .$$
The uncertainty of the ensemble $\{h_\theta\}$ on $x$ is $\mathrm{Unce}_{h_\theta}(x) = 1 - \mathbb{E}_\theta\|h_\theta(\cdot \mid x)\|_2^2$.

2.2 RELATED WORK

Bias-variance decomposition in deep learning. In the classical statistical learning theory of the bias-variance tradeoff, increasing the model capacity beyond a certain point leads to overfitting (Geman et al., 1992). However, deep neural networks in practice usually contain a large number of parameters but still generalize well. Towards bridging the gap between theory and practice, one of the most famous works is Belkin et al. (2019), which reveals a double-descent curve that subsumes the U-shaped tradeoff curve. This surprising observation motivated the work of Neal et al. (2018) to measure the bias and variance in popular deep models, leading to the discovery of a unimodal variance phenomenon (Yang et al., 2020).
Subsequent work includes Adlam & Pennington (2020) and Lin & Dobriban (2021), which study variance under fine-grained decompositions, and Rocks & Mehta (2022a;b), which analyze variance under simplified regression or random feature models.

Calibration. Calibration is a fundamental quantity in machine learning which, informally speaking, measures the degree to which the output distribution from a model agrees with the Bayes probability of the labels over the data (Guo et al., 2017). Previous works proposed a theory on calibration implying the generalization-disagreement equality (Jiang et al., 2022; Kirsch & Gal, 2022). In Table 2 we summarize several related results connecting calibration with other fundamental concepts from previous works. Our work can be viewed as extending these works, as we connect calibration and the bias-variance alignment.

Neural collapse. Towards understanding the last-layer features learned in deep network based classification models, the work of Papyan et al. (2020) reveals the neural collapse phenomenon, which offers a clear mathematical characterization: within-class features collapse to their corresponding class means, and between-class separation of the class means is maximized. This observation motivates a sequence of theoretical work on justifying its occurrence (Fang et al., 2021; Zhu et al., 2021; Tirer & Bruna, 2022; Poggio & Liao, 2020; Thrampoulidis et al., 2022), and practical work on leveraging the insights to improve model performance (Liang & Davis, 2023; Yang et al., 2023; Li et al., 2022).

3 EMPIRICAL ANALYSIS OF BIAS-VARIANCE ALIGNMENT

We begin by providing a quantitative measure in Section 3.1 of the alignment of bias and variance illustrated in Figure 1b. Then, we provide empirical evidence in Section 3.2 that the bias-variance alignment phenomenon occurs prevalently for networks beyond ResNets and for datasets other than ImageNet. Finally, in Section 3.3 we study the effect of network size, showing that bias-variance alignment is a phenomenon for over-parameterized models.

3.1 QUANTITATIVE REGRESSION ANALYSIS OF BIAS-VARIANCE ALIGNMENT

Table 3: Coefficient of determination ($R^2$) and slope for the linear regression of $\log \mathrm{Vari}_{h_\theta,(x_i,y_i)}$ on $\log \mathrm{Bias}^2_{h_\theta,(x_i,y_i)}$.
- ResNet-8 (CIFAR-10): $R^2$ = 0.979, slope = 0.882
- ResNet-56 (CIFAR-10): $R^2$ = 0.996, slope = 0.964
- ResNet-110 (CIFAR-100): $R^2$ = 0.986, slope = 0.901
- ResNet-50 (ImageNet): $R^2$ = 0.977, slope = 0.897

[Figure 2 panels: Q-Q plots (ordered residual values vs. theoretical quantiles) for ResNet-56 (CIFAR-10), ResNet-8 (CIFAR-10), ResNet-50 (ImageNet), and ResNet-110 (CIFAR-100).]

Figure 2: Q-Q plots of the residuals of the linear regression of $\log \mathrm{Vari}_{h_\theta,(x_i,y_i)}$ on $\log \mathrm{Bias}^2_{h_\theta,(x_i,y_i)}$ for the four models.

We start by conducting a quantitative regression analysis of bias and variance on the logarithmic scale to verify the bias-variance alignment phenomenon in Eq. (1). First, in Table 3, we present the coefficient of determination $R^2$ and the slope of the linear regression of $\log \mathrm{Vari}_{h_\theta,(x_i,y_i)}$ on $\log \mathrm{Bias}^2_{h_\theta,(x_i,y_i)}$ for the following models and datasets: ResNet-56 (on CIFAR-10), ResNet-8 (on CIFAR-10), ResNet-50 (on ImageNet), and ResNet-110 (on CIFAR-100). Notice how the coefficient of determination for all four settings is extremely close to 1 (at least 0.977) and the slope is also very close to 1, demonstrating a very strong linear alignment of the two quantities.
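The regression and residual diagnostics reported in Table 3 and Figure 2 can be reproduced along the following lines. This is a sketch under our own assumptions: it reuses the `bias_sq`, `variance`, and `correct` arrays from the sketch in Section 1, and relies on standard SciPy routines (`linregress`, `probplot`).

```python
import numpy as np
from scipy import stats

def alignment_regression(bias_sq, variance, correct):
    """Regress log variance on log squared bias over correctly classified samples;
    returns the slope, R^2, and the residuals used for Q-Q plots as in Figure 2."""
    x, y = np.log(bias_sq[correct]), np.log(variance[correct])
    fit = stats.linregress(x, y)                     # least-squares fit of y on x
    residuals = y - (fit.slope * x + fit.intercept)
    return fit.slope, fit.rvalue ** 2, residuals

# `bias_sq`, `variance`, and `correct` as produced by the per-sample sketch in Section 1.
slope, r2, residuals = alignment_regression(bias_sq, variance, correct)
print(f"slope = {slope:.3f}, R^2 = {r2:.3f}")        # Table 3 reports slopes ~0.9 and R^2 >= 0.977

# Residual-normality diagnostic corresponding to Figure 2 (theoretical vs. ordered sample quantiles).
theo_quantiles, ordered_residuals = stats.probplot(residuals, dist="norm", fit=False)
```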
We next analyze the normality of the residuals of the logarithmic linear regression. The Q-Q plots are shown in Figure 2. We observe that the residuals of the linear regression on the four models are all approximately normal, especially for the data points whose sample quantile is between 1.5 and 1.5. 3.2 PREVALENCE OF BIAS-VARIANCE ALIGNMENT ACROSS ARCHITECTURES AND DATASETS In Figure 1b, we showed that the bias-variance alignment occurs for Res Net-50 trained on Image Net. Here, we provide additional evidence in Figure 3 that the bias-variance alignment occurs for other model architectures and datasets. In particular, Figure 3a and Figure 3b show the sample-wise bias-variance of Efficient Net-B0 (Tan & Le, 2019) and Mobile Net-V2 (Sandler et al., 2018), respectively, trained on the Image Net dataset. Figure 3c and Figure 3d show the sample-wise bias-variance of Res Net-110 trained on CIFAR-10 and CIFAR-100, respectively. In all cases, we observe a similar pattern, demonstrating the prevalence of the bias-variance alignment with respect to network architectures and choice of dataset. Finally, in Figure 12 we confirm the observation on the NLP benchmark, where we finetune BERT models on TREC the dataset. (a) Image Net (b) Image Net (c) CIFAR-10 (d) CIFAR-100 Figure 3: Sample-wise bias and variance for (a, b): Varying model architectures trained on the Image Net dataset, and (c, d): Two additional datasets, namely CIFAR-10 and CIFAR-100. For Image Net, CIFAR-10 and CIFAR-100, bias and variance are estimated from 10, 100 and 100 independently trained models, respectively. 3.3 ROLE OF OVER-PARAMETERIZATION We investigate the impact of model size on the bias-variance alignment phenomenon, demonstrating that it only occurs for large and over-parameterized models. To vary the size of the model, we consider Res Nets with depths of 8, 20, 56, and 110, which we denote as Res Net-{8, 20, 56, 110}. To obtain even smaller models, we reduce the width of Res Net-8 from 16 to 8, 4, 2, and 1. Here, the width refers to the number of filters in the convolutional layers of Published as a conference paper at ICLR 2024 Res Net-8. Specifically, the convolutional layers of Res Net-8 can be divided into three stages with 16, 32, and 64 filters, respectively. Therefore, Res Net-8 with width k, denoted as Res Net-8-W[k], refers to a Res Net-8 with k, 2k, and 4k filters in the three stages, respectively. Due to space limit, we choose three models, namely Res Net-8-W1, Res Net-8-W16, and Res Net-110W16 from the eight model sizes discussed above, and present their sample-wise bias and variance evaluated on CIFAR-10 in Figures 4a to 4c. These three models represent a small, medium, and large size model, respectively. Results for all the other model sizes are provided in Figure 5 (see Appendix). We can see that bias-variance alignment becomes increasingly pronounced with larger models. (a) Res Net-8-W1 (b) Res Net-8-W16 (c) Res Net-110-W16 (d) Varying model size Figure 4: Figures 4a to 4c: Sample-wise bias and variance for networks of varying scale trained on CIFAR-10. Here, Res Net-[p]-W[q] refers to Res Net with p layers and width q. The model size monotonically increases from the leftmost figure to the rightmost figure. For all cases, the bias and variance are estimated from 50 independently trained models. Figure 4d: Averaged bias and variance over all test samples under varying model sizes. 
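For reference, the ResNet-8-W[k] width-scaling convention described above amounts to the stage widths computed below; the helper name is our own and the snippet is purely illustrative.

```python
def resnet8_stage_widths(k: int) -> list[int]:
    """Filters in the three convolutional stages of the width-scaled ResNet-8-W[k].
    The default ResNet-8 (k = 16) has stages with 16, 32, and 64 filters."""
    return [k, 2 * k, 4 * k]

for k in (1, 2, 4, 8, 16):
    print(f"ResNet-8-W{k}: stage widths = {resnet8_stage_widths(k)}")
```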
The results in Figure 4 suggest that the emergence of bias-variance alignment is associated with over-parameterization, which refers to large capacity deep models that can perfectly fit any training data. Classical theory suggests that bias and variance exhibit a trade-off relation, where larger model size reduces bias and increases variance. However, Yang et al. (2020) shows that this trade-off holds only in the regime where the model size is relatively small. For over-parameterized models, the variance does not continue to grow; rather, it exhibits a unimodal shape. To examine bias-variance alignment in association with the unimodal variance phenomenon of Yang et al. (2020), we compute the averaged bias and variance over all test samples for results reported in Figures 4a to 4c (and also results of five additional models reported in Figure 5), and plot them as a function of model size in Figure 4d. The result in Figure 4d aligns with the observation in Yang et al. (2020), namely, bias is monotonically decreasing and variance curves is unimodal. Moreover, the model that exhibits strong bias-variance alignment in Figure 4, namely Res Net-56-W16, clearly is outside of the classical regime of bias-variance tradeoff shown in Figure 4d. Meanwhile, model in the classical trade-off regime, e.g. Res Net-8-W1, does not have bias-variance alignment. 4 CALIBRATION IMPLIES BIAS-VARIANCE ALIGNMENT In this section we show how model calibration can imply bias-variance alignment. We start from unifying the different calibration views from previous work, and then show how each of these assumptions implies a different version of the bias-variance correlation. 4.1 A UNIFIED VIEW OF CALIBRATION In previous work, various definitions of calibration have been introduced. In the following, we present a general definition that encompasses a wide range of these definitions. Suppose that there is a collection of sub-σ-algebras {Σi}i [K] of σ(X), where σ(X) is the σ-algebra generated by the random variable X. 2 In this section, we refer to the σ-algebra generated by a random variable or event using the notation σ( ). 2In Definition 4.1, the σ-algebras Σi must be sub-σ-algebras of σ(X). This is because we want the random variables and events we are conditioning on to be functions of X. Thus, the conditional expectation in the definition of calibration averages over all data examples that have the same value of a function of X. Published as a conference paper at ICLR 2024 ECE {Σi}i [K] P CWCE {Σi}i [K] P Σsamp i (Kirsch & Gal, 2022, Expectation of Eq. (6)) This work (Definition 4.2) Σpre i (Guo et al., 2017, Eq. (2)) (see our Appendix F.2) (Kirsch & Gal, 2022, Eq. (36)) Σbin i (Guo et al., 2017, Eq. (3)) (Naeini et al., 2015) (Nixon et al., 2019) (see our Appendix F.3) This work (Definition 4.2) Table 4: Summary of the various definitions of ECE {Σi}i [K] P and CWCE {Σi}i [K] P that have been proposed in previous work. Our unified definition subsumes these previous definitions. Definition 4.1 (Perfect (confidence) calibration). A function h : X M([K]) has perfect calibration with respect to {Σi}i [K] if the following equation holds for all i [K]: E[ (i | X) | Σi] = 0 , (i | x) h(i | x) PY |X(i | x) . (11) The function has perfect confidence calibration if the following equation holds: E[ (predh(X) | X) | Σpredh(X)] = E[(confh(X) acch(X)) | Σpredh(X)] = 0 . (12) The sub-σ-algebras {Σi}i [K] control the granularity of perfect calibration. 
In other words, (11) in Definition 4.1 says that (i | X) vanishes on average, and {Σi}i [K] specifies the set of samples that we average over. Here are examples of ways of choosing {Σi}i [K]: Sample-wise perfect calibration Σsamp i = σ(X). In this case, Definition 4.1 is equivalent to h(i | X) = PY |X(i | X) for every i [K]. In other words, perfect calibration holds for every sample X. Pre-image perfect calibration Σpre i = σ(h(i | X)). This may be the most widely used definition of calibration (Guo et al., 2017; Kirsch & Gal, 2022) and characterizes the calibration averaged over all samples that share a common prediction h( | X). Bin-wise perfect confidence calibration Σbin i = σ (1 { Mh(i | X) }) where M is a positive integer (Guo et al., 2017). The map x 7 Mh(i | x) assigns x to M bins (if h(i | x) > 0 for all x): (0, 1 M ], . . . , ( M 1 M , 1] according to the value of h(i | x). Then the samples that fall into the same bin are averaged over. Definition 4.2 (Calibration Errors). If we define (i | x) as in (11), the expected calibration error (ECE {Σi}i [K] P ) and the class-wise calibration error (CWCE {Σi}i [K] P ) of a function h : X M([K]) with respect to {Σi}i [K] on the data distribution P M(X Y) are given by ECE {Σi}i [K] P (h) E E[ (predh(X) | X) | Σpredh(X)] , CWCE {Σi}i [K] P (h) X i [K] E |E[ (i | X) | Σi]| . We elucidate how our unified definition subsumes the various definitions in previous work, and we summarize it in Table 4. In Appendix F.2 and Appendix F.3, we show concretely what ECE {Σi}i [K] P looks like with respect to Σpre i and Σbin i , respectively. 4.2 CALIBRATION MEETS THE BIAS-VARIANCE DECOMPOSITION In this subsection we then show that, under the model calibration assumption, there is a correlation between squared bias and variance. Moreover, in the case of imperfect model calibration, the discrepancy between the squared bias and variance can be bounded by the calibration error. Recall the definitions of bias and variance in Section 2.1. Theorem 4.3 characterizes the discrepancy between the squared bias and variance, and provides an upper bound for it by the class calibration error. Corollary 4.4, which follows the theorem, shows that in the case of perfect calibration, the variance is upper bounded by the squared bias. Moreover, if the model outputs a completely certain prediction (outputs a one-hot vector), it is theoretically guaranteed that the squared bias equals variance, i.e., the bias-variance correlation appears. If the model does not output a completely certain prediction but a highly certain prediction, an approximate bias-variance correlation follows. Theorem 4.3. If {hθ} is an ensemble whose mean function h(i | X) is Σi-measurable, then we have EY |X BVGhθ,(X,Y )(i) = 2h(i | X) (i | X) + PY |X(i | X) Eθhθ(i | X)2 , E E EY |X[BVGhθ,(X,Y )(i)] PY |X(i | X) + Eθhθ(i | X)2 | Σi 2 CCE {Σi}i [K] P (i) , Published as a conference paper at ICLR 2024 Width factor LHS of (13) RHS of (13) CWCE{Σbin i } P 1/2 0.64 0.90 0.45 1/4 0.63 0.74 0.37 1/8 0.66 0.70 0.35 1/16 0.76 0.90 0.45 Table 5: Empirical summary of values for CWCE{Σbin i } P and the bias and variance, both calculated w.r.t. Σbin i (where we use 20 equally spaced bins) from Res Net-8 models on CIFAR-10 across varying width. where CCE {Σi}i [K] P (i) E |E[ (i | X) | Σi]| is the class calibration error for class i. 
If Σ = T i [K] σ(Σi), for total squared bias and variance, the following equations holds E EY |X[BVGhθ,(X,Y )] | Σ = E [Uncehθ(X) | Σ] + 2 X i [K] E [E[ (i | X) | Σi]h(i | X) | Σ] , E EY |X[BVGhθ,(X,Y )] Uncehθ(X) | Σ 2 CWCE {Σi}i [K] P (13) We observe that, without any additional assumptions, h(i|X) is automatically Σsamp i - and Σpre i - measurable. If one chooses Σsamp i = σ(X), then E[EY |X[ ] | Σ] = EY |X[ ]. In this case, Theorem 4.3 holds for every example x X. Corollary 4.4. If h has perfect calibration with respect to {Σi}i [K], then E[ (i | X) | Σi] = 0 and therefore we have E EY |X[BVGhθ,(X,Y )(i)] | Σi = E PY |X(i | X) Eθhθ(i | X)2 | Σi = E h(i | X) Eθhθ(i | X)2 | Σi = E [Eθ [hθ(i | X)(1 hθ(i | X))] | Σi] , (14) E EY |X[BVGhθ,(X,Y )] | Σ = E [Uncehθ(X) | Σ] 0 . (15) Moreover, if hθ outputs a one-hot vector, we have hθ( | X) 2 2 = 1 and therefore E EY |X[BVGhθ,(X,Y )] | Σ = 0. In other words, in expectation E[EY |X[ ] | Σ], the squared bias equals variance. Corollary 4.4 shows that if h has perfect calibration, then in expectation E[EY |X[ ] | Σ], the variance is upper bounded by the bias and the gap between them is Uncehθ(X). If hθ( | X) is highly confident (i.e., maxi [K] hθ( | X) 1), then Bias2 hθ,(X,Y ) Varihθ,(X,Y ). The extreme case is that hθ outputs a one-hot vector. Corollary 4.5. If h has perfect calibration with respect to {Σi}i [K] and hθ(i | X) a for every θ (a is either 0 or 1), then E EY |X βhθ,(X,Y )(i) ςhθ,(X,Y )(i) | Σi 0 . Corollary 4.5 shows that when hθ(i | X) is highly confident (hθ(i | X) a {0, 1}), then βhθ,(X,Y )(i) ςhθ,(X,Y )(i) 0 in the mean E EY |X[ ] | Σi , which is entrywise bias-variance correlation. Moreover, since E EY |X βhθ,(X,Y )(i) ςhθ,(X,Y )(i) | Σi = E[EY |X[1{Y = i}(1 h(i | X) ςhθ,(X,Y )(i)) + 1{Y = i}(h(i | X) ςhθ,(X,Y )(i))] | Σi] , if PY |X(Y = i | Σi) 1, we have h(i | X) 1 ςhθ,(X,Y )(i); and if PY |X(Y = i | Σi) 0, we have h(i | X) ςhθ,(X,Y )(i). Experiments confirm our theory. We now present empirical results that support our theory. We empirically verify the inequality (13) across models of varying widths when using Σbin i (where we use 20 equally spaced bins) for the definitions of calibration, uncertainty, bias and variance. Note that calibration requires estimating the true probability distribution, and so direct computation for Σpred i or Σsamp i is infeasible. In Table 5 we empirically verify the relationship between bias, variance, and uncertainty when using Σbin i . The left-hand side of (13), which is E EY |X[BVGhθ,(X,Y )] Uncehθ(X) | Σ , is computed from the bias and variance values. The right-hand side of (13) is 2 CWCE {Σbin i }i [K] P . We see that across architectures, Equation (13) of Theorem 4.3 holds. 5 NEURAL COLLAPSE IMPLIES APPROXIMATE BIAS-VARIANCE ALIGNMENT Neural collapse (Papyan et al., 2020) is a phenomenon pertaining the last layer features and classifier weights of a trained deep classification model. This section considers a statistical modeling of the Published as a conference paper at ICLR 2024 prediction of the network ensemble {hθ} on an arbitrary test data (X, Y ) that is motivated from neural collapse, upon which we derive a bound on the ratio between Bias2 hθ,(X,Y ) and Varihθ,(X,Y ). Modeling assumption motivated by neural collapse. We assume that each model hθ in the ensemble can be written as hθ( | x) = softmax(Wψτ(x)), where θ = (τ, W) denotes trainable parameters. In the above, we refer to ψτ(x) Rd as the feature vector, and W as the classifier weight. 
The neural collapse phenomenon states that during training, W converges to a rotated and scaled version of the simplex equiangular tight frame (ETF) matrix W ETF. That is, W W ETFR , where W ETF = w ETF 1 , . . . , w ETF K and R Rd K is an orthogonal matrix (i.e., R R = IK). In above, IK RK K is an identity matrix, and 1K RK is a column vector with all entries being 1. We summarize the properties of W ETF in Appendix G.1. Moreover, for any training data (xtrain, ytrain), neural collapse predicts that the feature ψτ(xtrain) is aligned with its classifier weight, i.e., ψτ(xtrain) = s Rw ETF ytrain for some s > 0 independent of (xtrain, ytrain). For a test sample (X, Y ), neural collapse does not predict the distribution of its feature ψτ(X). However, it is reasonable to assume that it slightly deviates from the training feature of class Y . This motivates us to assume that ψτ(X) = R sw ETF Y + v , where v is the noise vector, which leads to hθ( | X) = softmax(W ETF(sw ETF Y + v)). Hence, the prediction of the network ensemble {hθ} may be modeled as follows. Assumption 5.1. We assume {hθ( | X)} = {softmax W ETF(sw ETF Y + v) } for any test sample (X, Y ), where v has i.i.d. entries drawn according to β q K K 1vi Gumbel(µ, β). Appendix G.2 shows that the assumption on v aligns with the observation in practical networks. Bias-variance analysis. Theorem 5.2 computes the entrywise bias and standard deviation under Assumption 5.1, for entries corresponding to the true class. Theorem 5.2. Consider the model ensemble in Assumption 5.1. Let K = K 1, c = es K/K φK (c) c K 1(c 1 K ) K (c K K log(c K ) 1) c K 1 + (c 1 K PK 1 j=1 (K j)(c 1/K )jc j+K 1 Then we have βhθ,(X,Y )(Y ) = |cφK (c) 1| and ςhθ,(X,Y )(Y ) = c r dc + φK (c)2 , where βhθ,(X,Y )( ) and ςhθ,(X,Y )( ) are defined in (4) of Definition 2.1. Due to technical difficulties, Theorem 5.2 does not provide bias and variance for entries other than those corresponding to the true class. However, under the special case of binary classification, we are able to provide bias and variance for all entries as follows. Corollary 5.3. If K = 2, then for i {1, 2} we have βhθ,(X,Y )(i) = |log(c)c c+1| (c 1)2 , ςhθ,(X,Y )(i) = c((c 1)2 c log2(c)) (c 1)2 , where c = e2s. Furthermore, we have: (1) on the linear scale, βhθ,(X,Y )(i) ςhθ,(X,Y )(i) is a decreasing function of s and 1.74 > βhθ,(X,Y )(i) ςhθ,(X,Y )(i) 2s 1 es , or equivalently,1.742 > Bias2 hθ,(X,Y ) Varihθ,(X,Y ) e2s ; (2) on the logarithmic scale, log Biashθ,(X,Y )(i) log Varihθ,(X,Y )(i) (0.557, 2). Corollary 5.3 illustrates the entrywise bias and standard deviation for binary classification. We prove the approximate correlation between the entrywise bias and standard deviation by providing an upper and lower bound for the ratio of the entrywise bias to the standard deviation. 6 CONCLUSION We show that bias and variance align at a sample level for ensembles of deep learning models, suggesting a more nuanced bias-variance relation in deep learning. We study this phenomenon from two theoretical perspectives, calibration and neural collapse, and provide new insights into the bias-variance alignment. Published as a conference paper at ICLR 2024 7 ACKNOWLEDGEMENTS We would like to acknowledge helpful comments from Aditya Krishna Menon (Google Research) and Christina Baek (Carnegie Mellon University). Taiga Abe, E Kelly Buchanan, Geoff Pleiss, and John P Cunningham. Pathologies of predictive diversity in deep ensembles. ar Xiv preprint ar Xiv:2302.00704, 2023. Ben Adlam and Jeffrey Pennington. 
Understanding double descent requires a fine-grained biasvariance decomposition. Advances in neural information processing systems, 33:11022 11032, 2020. Alexander Atanasov, Blake Bordelon, Sabarish Sainathan, and Cengiz Pehlevan. The onset of variance-limited behavior for networks in the lazy and rich regimes. ar Xiv preprint ar Xiv:2212.12147, 2022. Christina Baek, Yiding Jiang, Aditi Raghunathan, and Zico Kolter. Agreement-on-the-line: Predicting the performance of neural networks under distribution shift, 2023. Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849 15854, 2019. Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006. ISBN 0387310738. Gavin Brown, Jeremy L Wyatt, Peter Tino, and Yoshua Bengio. Managing diversity in regression ensembles. Journal of machine learning research, 6(9), 2005. A. Michael Carrell, Neil Mallinar, James Lucas, and Preetum Nakkiran. The calibration generalization gap, 2022. URL https://arxiv.org/abs/2210.01964. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pp. 4171 4186, 2019. Cong Fang, Hangfeng He, Qi Long, and Weijie J Su. Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training. Proceedings of the National Academy of Sciences, 118(43):e2103091118, 2021. Stuart Geman, Elie Bienenstock, and Ren e Doursat. Neural networks and the bias/variance dilemma. Neural computation, 4(1):1 58, 1992. Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, pp. 1321 1330. PMLR, 2017. Neha Gupta, Jamie Smith, Ben Adlam, and Zelda E Mariet. Ensembles of classifiers: a bias-variance perspective. Transactions on Machine Learning Research, 2022. Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009. Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in highdimensional ridgeless least squares interpolation. Annals of statistics, 50(2):949, 2022. Tom Heskes. Bias/variance decompositions for likelihood-based estimators. Neural Computation, 10 (6):1425 1433, 1998. Eduard Hovy, Laurie Gerber, Ulf Hermjakob, Chin-Yew Lin, and Deepak Ravichandran. Toward semantics-based answer pinpointing. In Proceedings of the First International Conference on Human Language Technology Research, 2001. URL https://www.aclweb.org/ anthology/H01-1069. Published as a conference paper at ICLR 2024 Alan Jeffares, Tennison Liu, Jonathan Crabb e, and Mihaela van der Schaar. Joint training of deep ensembles fails due to learner collusion. ar Xiv preprint ar Xiv:2301.11323, 2023. Yiding Jiang, Vaishnavh Nagarajan, Christina Baek, and J Zico Kolter. Assessing generalization of SGD via disagreement. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Wv OGCEAQhxl. Andreas Kirsch and Yarin Gal. A note on assessing generalization of sgd via disagreement . Transactions on Machine Learning Research, 2022. Xiao Li, Sheng Liu, Jinxin Zhou, Xinyu Lu, Carlos Fernandez-Granda, Zhihui Zhu, and Qing Qu. 
Principled and efficient transfer learning of deep models via neural collapse. ar Xiv preprint ar Xiv:2212.12206, 2022. Xin Li and Dan Roth. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics, 2002. URL https://www.aclweb.org/anthology/ C02-1150. Tong Liang and Jim Davis. Inducing neural collapse to a fixed hierarchy-aware frame for reducing mistake severity. ar Xiv preprint ar Xiv:2303.05689, 2023. Licong Lin and Edgar Dobriban. What causes the test error? going beyond bias-variance via anova. The Journal of Machine Learning Research, 22(1):6925 7006, 2021. Neil Mallinar, James Simon, Amirhesam Abedsoltan, Parthe Pandit, Misha Belkin, and Preetum Nakkiran. Benign, tempered, or catastrophic: Toward a refined taxonomy of overfitting. Advances in Neural Information Processing Systems, 35:1182 1195, 2022. Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75 (4):667 766, 2022. John Miller, Rohan Taori, Aditi Raghunathan, Shiori Sagawa, Pang Wei Koh, Vaishaal Shankar, Percy Liang, Yair Carmon, and Ludwig Schmidt. Accuracy on the line: On the strong correlation between out-of-distribution and in-distribution generalization, 2021. Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015. Brady Neal, Sarthak Mittal, Aristide Baratin, Vinayak Tantia, Matthew Scicluna, Simon Lacoste Julien, and Ioannis Mitliagkas. A modern take on the bias-variance tradeoff in neural networks. ar Xiv preprint ar Xiv:1810.08591, 2018. Jeremy Nixon, Michael W Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. Measuring calibration in deep learning. In CVPR workshops, volume 2, 2019. Luis A Ortega, Rafael Caba nas, and Andres Masegosa. Diversity and generalization in neural network ensembles. In International Conference on Artificial Intelligence and Statistics, pp. 11720 11743. PMLR, 2022. Vardan Papyan, XY Han, and David L Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40): 24652 24663, 2020. David Pfau. A generalized bias-variance decomposition for bregman divergences. Unpublished Manuscript, 2013. Tomaso Poggio and Qianli Liao. Explicit regularization and implicit bias in deep network classifiers trained with the square loss. ar Xiv preprint ar Xiv:2101.00072, 2020. Jason W Rocks and Pankaj Mehta. Bias-variance decomposition of overparameterized regression with random linear features. Physical Review E, 106(2):025304, 2022a. Published as a conference paper at ICLR 2024 Jason W Rocks and Pankaj Mehta. Memorizing without overfitting: Bias, variance, and interpolation in overparameterized models. Physical Review Research, 4(1):013201, 2022b. Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510 4520, 2018. Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pp. 6105 6114. PMLR, 2019. Christos Thrampoulidis, Ganesh Ramachandra Kini, Vala Vakilian, and Tina Behnia. Imbalance trouble: Revisiting neural-collapse geometry. 
Advances in Neural Information Processing Systems, 35:27225 27238, 2022. Tom Tirer and Joan Bruna. Extended unconstrained features model for exploring deep neural collapse. In International Conference on Machine Learning, pp. 21478 21505. PMLR, 2022. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. Yoav Wald, Amir Feder, Daniel Greenfeld, and Uri Shalit. On calibration and out-of-domain generalization. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum? id=XWYJ25-y TRS. Yibo Yang, Haobo Yuan, Xiangtai Li, Zhouchen Lin, Philip Torr, and Dacheng Tao. Neural collapse inspired feature-classifier alignment for few-shot class incremental learning. ar Xiv preprint ar Xiv:2302.03004, 2023. Zitong Yang, Yaodong Yu, Chong You, Jacob Steinhardt, and Yi Ma. Rethinking bias-variance trade-off for generalization of neural networks. In International Conference on Machine Learning, pp. 10767 10777. PMLR, 2020. Helong Zhou, Liangchen Song, Jiajie Chen, Ye Zhou, Guoli Wang, Junsong Yuan, and Qian Zhang. Rethinking soft labels for knowledge distillation: A bias-variance tradeoff perspective. ar Xiv preprint ar Xiv:2102.00650, 2021. Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, and Qing Qu. A geometric analysis of neural collapse with unconstrained features. Advances in Neural Information Processing Systems, 34:29820 29834, 2021. Published as a conference paper at ICLR 2024 Table of Contents A Societal Impact Statement 13 B Limitations 13 C Further related work. 14 D Practical applications. 14 E More Empirical Analysis of Bias-variance Alignment 15 E.1 Additional results on role of over-parameterization . . . . . . . . . . . . . . . . 15 E.2 Bias-variance decomposition of cross-entropy loss . . . . . . . . . . . . . . . . 15 E.3 Permutation test for the residual and the log-bias . . . . . . . . . . . . . . . . . 15 E.4 Linear vs logarithmic scale . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 E.5 Correlation to prediction uncertainty and logit norm . . . . . . . . . . . . . . . 17 E.6 Effect on the sources of randomness . . . . . . . . . . . . . . . . . . . . . . . . 17 E.7 TREC experiments with BERT . . . . . . . . . . . . . . . . . . . . . . . . . . 18 F Further Results on Calibration and the Bias-Variance Correlation 18 F.1 Perfect calibration does not necessarily imply perfect confidence calibration . . . 18 F.2 Pre-image expected calibration error . . . . . . . . . . . . . . . . . . . . . . . 19 F.3 Bin-wise expected calibration error . . . . . . . . . . . . . . . . . . . . . . . . 19 F.4 Proof of Corollary 4.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 F.5 Proof of Theorem 4.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 F.6 No bias-variance correlation in Kullback-Leibler convergence. . . . . . . . . . . 21 F.7 Theorem 4.3 implies generalization disagreement equality (GDE) . . . . . . . . 22 G Further Results on Neural Collapse and the Bias-Variance Correlation 23 G.1 Properties of simplex equiangular tight frame (ETF) . . . . . . . . . . . . . . . 23 G.2 Verifying Assumption 5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 G.3 Verifying Corollary 5.3: Binary classification . . . . . . . . . . . . . . . . . . . 
25 G.4 Relationship between Gumbel and exponential distribution . . . . . . . . . . . . 25 G.5 Proof of Theorem 5.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 G.6 Proof of Corollary 5.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 H Future Work 28 A SOCIETAL IMPACT STATEMENT This paper aims to expose a peculiar bias-variance alignment phenomenon and characterize it both theoretically and empirically. We do not foresee any negative societal consequences from this work. B LIMITATIONS We identify the following limitations of our work: We only considered the squared loss and cross entropy loss functions. It would be interesting to extend our results to other loss functions, such as the 0/1 loss. Published as a conference paper at ICLR 2024 Our theory is based on the binary classification assumption. We plan to extend it to multiclass classification in future work. It would be interesting to study the bias-variance alignment theoretically in an end-to-end manner, using tools such as the neural tangent kernel theory or a mean-field analysis. We conducted our experiments in the image classification domain. It would be interesting to verify our findings in other domains, such as natural language processing (NLP). We believe that these limitations do not detract from the overall significance of our work. Our findings provide new insights into the bias-variance alignment phenomenon, and we hope that this paper will stimulate further research on this important topic. C FURTHER RELATED WORK. It was shown that statistics such as accuracy (Miller et al., 2021) and disagreement (Baek et al., 2023) are highly correlated when contrasted across in-distribution and out-distribution data. This points at a potential extension to our work to consider how our findings translate to the out of domain data. Gupta et al. (2022) investigate the theoretical underpinnings of ensemble methods for classification tasks, extending the bias-variance decomposition to derive generalized laws of total expectation and variance for nonsymmetric losses. Their work sheds light on the mechanisms by which ensembles reduce variance and potentially bias, providing valuable insights for improving the performance of ensemble classifiers. Ortega et al. (2022) provide a comprehensive theoretical framework for understanding the relationship between diversity and generalization error in neural network ensembles. They analyze the impact of diversity on ensemble performance for various loss functions and model combination strategies, offering valuable insights for designing effective ensemble learning algorithms. Brown et al. (2005) focus on managing diversity in regression ensembles. They introduce a control mechanism through the error function, demonstrating its effectiveness in improving ensemble performance over traditional methods. This work provides insights into systematic control of the bias-variance-covariance trade-off in regression ensembles. Abe et al. (2023) challenge conventional wisdom on predictive diversity in deep neural network ensembles. While diversity benefits small models, the authors find that encouraging diversity harms high-capacity deep ensembles used for classification. Their experiments show that diversity-encouraging regularizers hinder performance, suggesting that the best strategy for deep ensembles may involve using more accurate but less diverse component models. 
In contrast to traditional ensemble methods, deep ensembles of neural networks offer the potential for direct optimization of overall performance. However, Jeffares et al. (2023) reveal that jointly minimizing ensemble loss induces base learners to collude, inflating artificial diversity. This pseudo-diversity fails to generalize, highlighting limitations in direct joint optimization and its impact on generalization gaps. D PRACTICAL APPLICATIONS. We believe that our findings on the bias-variance alignment can be used to develop new methods for validating deep learning models and selecting generalizable models in practice. One practical application is estimating the test error of a deep learning model using variance. This is possible because our finding is that bias and variance are aligned, and so we can estimate bias from variance. This means that even when the true labels of the test data are unavailable, we can still get a good estimate of the test error by measuring variance over multiple models on the test data. Compared to Jiang et al. (2022) which analyzed the alignment between disagreement and test error across the entire dataset, our method is a per-example approach and thus enables example-level validation of deep learning models. This is a novel way of validating deep learning models and selecting generalizable models in practice even when the true labels of the test data are unavailable. Additionally, our method is simple to implement, so it could be easily adopted by practitioners. Moreover, inspired by our result, one could consider practical algorithms leveraging the observation of bias and variance alignment. As one example, one can consider routing between ensembles of models. Given two ensembles of models, one could dynamically route between such two ensembles, depending which one yields lower variance. Given the above possible applications, we believe that our work has the potential to make a significant contribution to the field of deep learning. Published as a conference paper at ICLR 2024 E MORE EMPIRICAL ANALYSIS OF BIAS-VARIANCE ALIGNMENT E.1 ADDITIONAL RESULTS ON ROLE OF OVER-PARAMETERIZATION In Section 3.3 we showed that the bias-variance alignment phenomenon becomes more pronounced for over-parameterized models, by plotting sample-wise bias-variance for three models of varying sizes in Figure 4. Here we present results on five additional models of varying size Figure 5 that complement the results in Figure 4. Figure 5: Sample-wise bias and variance for networks of varying scale trained on CIFAR-10. E.2 BIAS-VARIANCE DECOMPOSITION OF CROSS-ENTROPY LOSS Deep neural networks for classification tasks are typically trained with the cross-entropy (CE) loss. Here, we investigate whether the bias and variance from decomposing the CE loss also exhibit the alignment phenomenon. The risk with respect to the CE loss can be decomposed as follows (Pfau, 2013): Eθ e Y , log(hθ( | X)) | {z } Risk = DKL(e Y h( | X)) | {z } Bias2 + EθDKL( h( | X)) hθ( | X))) | {z } Variance where DKL denotes the KL divergence. In the above equation, h( |X) is obtained by taking the expectation of the log-probabilities and then applying a softmax function. In other words, h(i | X)) = exp Eθ log(hθ(i | X))) P i exp Eθ log(hθ(i | X))) (18) Intuitively, h( |X) represents the average prediction under the KL divergence, assigning a probability proportional to exp Eθ log(hθ(i | X)) to each class i. 
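Concretely, a minimal sketch of this KL-based decomposition (assuming, as in our earlier sketches, an array of per-model predictive probabilities; the helper name and numerical safeguards are our own) is:

```python
import numpy as np

def ce_bias_variance(probs, labels, eps=1e-12):
    """KL-based bias/variance from decomposing the cross-entropy risk (see Eq. (18)).

    probs:  (M, N, K) ensemble softmax outputs; labels: (N,) integer classes.
    Returns per-sample (bias, variance) where
      bias     = KL(e_Y || h_bar),  with  h_bar(i|x) proportional to exp(E_theta log h_theta(i|x)),
      variance = E_theta KL(h_bar || h_theta).
    """
    log_probs = np.log(probs + eps)
    mean_log = log_probs.mean(axis=0)                         # E_theta log h_theta(i|x)
    h_bar = np.exp(mean_log - mean_log.max(axis=1, keepdims=True))
    h_bar /= h_bar.sum(axis=1, keepdims=True)                 # normalized "geometric-mean" prediction
    rows = np.arange(labels.shape[0])
    bias = -np.log(h_bar[rows, labels] + eps)                 # KL(e_Y || h_bar) = -log h_bar(Y|x)
    variance = (h_bar[None] * (np.log(h_bar[None] + eps) - log_probs)).sum(axis=2).mean(axis=0)
    return bias, variance
```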
The bias term measures the KL divergence between the true distribution e Y and the average prediction h( |X), quantifying the deviation of the ensemble s mean prediction from the actual class distribution. The variance term, on the other hand, captures the average KL divergence between the individual predictions in the ensemble and the mean prediction h( |X), reflecting the overall variability of the ensemble s predictions. In Figure 7 we present the sample-wise bias and variance from decomposing the CE loss under the same setup as that in Figure 1b. In other words, the only difference between Figure 7 and Figure 1b is that the bias and variance are computed from decomposing the CE and MSE loss, respectively. It can be seen that the bias no longer aligns well with the variance. In Appendix F.6, we theoretically explain this phenomenon. E.3 PERMUTATION TEST FOR THE RESIDUAL AND THE LOG-BIAS We perform linear regression of log Varihθ,(xi,yi) on log Biashθ,(xi,yi) for the following models and datasets: Res Net-56 (on CIFAR-10), Res Net-8 (on CIFAR-10), Res Net-50 (on Image Net), and Res Net-110 (on CIFAR-100). We would like to test whether the residual is linearly correlated of the exogenous variable log Biashθ,(xi,yi). To this end, we perform permutation tests with the Pearson s Published as a conference paper at ICLR 2024 0.0250.000 0.025 0 Res Net-56 (CIFAR-10) p-value: 0.9949 Null Distribution Statistic 0.025 0.000 0.025 0 Res Net-8 (CIFAR-10) p-value: 0.9957 Null Distribution Statistic 0.02 0.00 0 Res Net-50 (Image Net) p-value: 0.9999 Null Distribution Statistic 0.0250.000 0.025 0 Res Net-110 (CIFAR-100) p-value: 0.9947 Null Distribution Statistic Figure 6: Null distribution, the statistic and the p-value of the permutation test results for the residual against log Bias2 hθ,(xi,yi) in linear regression of log Varihθ,(xi,yi) on log Biashθ,(xi,yi). Figure 7: Sample-wise bias and variance of the CE loss. Figure 8: Sample-wise bias and variance plotted in linear scale (with correctly classified samples only). correlation coefficient on the residuals and their corresponding log Biashθ,(xi,yi) values. The null hypothesis is that the residual and log Biashθ,(xi,yi) are not correlated. We plot the null distribution, the statistic, and the p-value of the test results on the four models in Figure 6. The results show that the null hypothesis is not rejected, suggesting that it may be true. E.4 LINEAR VS LOGARITHMIC SCALE In Section 1, the bias-variance alignment is presented first in the logarithmic scale (see Eq. (1)) and subsequently in the linear scale (see Eq. 2). Here, we provide a rigorous analysis on their connections. In addition, we explain the implication of the bias-variance alignment when plotted in the linear scale, which is complemented by empirical results. Connection between bias-variance alignment in linear vs logarithmic scale. First, we provide a formal statement on the connection between the linear and log scale of the bias-variance alignment. Proposition E.1. If log Varihθ,(xi,yi) = log Bias2 hθ,(xi,yi) +Ehθ + εi where εi is independent of Bias2 hθ,(xi,yi) and Ei Unif([n])[εi] = 0, we have Varihθ,(xi,yi) = Chθ Bias2 hθ,(xi,yi) +Dhθ Bias2 hθ,(xi,yi) ηi , where Chθ = e Ehθ Ei Unif([n])[eεi] > 0, Dhθ = e Ehθ > 0, ηi = eεi E[eεi] and Ei Unif([n])[ηi] = 0. Proof. 
We exponentiate both sides of log Varihθ,(xi,yi) = log Bias2 hθ,(xi,yi) +Ehθ + εi where Ei Unif([n])[εi] = 0 and obtain Varihθ,(xi,yi) = e Ehθ Bias2 hθ,(xi,yi) eεi = e Ehθ Bias2 hθ,(xi,yi)(ηi + E[eεi]) = Chθ Bias2 hθ,(xi,yi) +Dhθ Bias2 hθ,(xi,yi) ηi , where ηi has mean 0 by definition. Published as a conference paper at ICLR 2024 (a) Res Net-8-W1 (b) Res Net-8-W16 (c) Res Net-110-W16 Figure 9: Same as Figure 4(a-c) but plotted in linear scale and with correctly classified samples only. Sample-wise bias and variance plotted in linear scale. Unlike in the log scale where the noise term ϵi (see Eq. (1)) is independent of the bias and variance, in linear scale the noise term ξi is multiplied by a factor that scales with the squared bias (see Eq. 2). This implies that instead of aligning along a straight line, sample-wise bias and vairance in linear scale has a cone-shaped distribution. That is, as bias increases, an increasingly wider range of variance is covered by the samples and such a range forms a cone. To illustrate this, we regenerate the plot of Figure 1b but with linear (instead of log) scale in both the x and y axis, and the result is shown in Figure 8 (we also removed the incorrectly classified data points from the plot). Furthermore, to observe the effect of model size on bias-variance alignment in linear scale, we regenerate the Figure 4(a-c) with x and y axis switched to linear scale and present the result in Figure 9. E.5 CORRELATION TO PREDICTION UNCERTAINTY AND LOGIT NORM Figure 1b demonstrates that the bias and variance of varying sample points exhibit the alignment phenomenon for points that are correctly classified. Here, in addition to the correctness of prediction, we also examine the relation between the alignment phenomenon with the prediction uncertainty and the logit norm. (a) Correctness (b) Uncertainty (c) Norm of Logit Vector Figure 10: Same as Figure 1b, but with each sample colored according to (a): Correctness of model prediction, (b): Uncertainty of model prediction, and (c): ℓ2 norm of the logit vector. Figure 10a is the same as the one in Figure 1b for the reader s reference. In Figure 10b, we show how the uncertainty in model predictive distribution, i.e., Unceh(x) (see Definition 2.2), correlates with bias and variance. It can be seen that samples with large variance are those with large uncertainty scores. We give a formal relation between bias, variance, and uncertainty in Theorem 4.3. Finally, Figure 10c shows the lack of correlation between bias/variance and the ℓ2 norm of the logit vector. E.6 EFFECT ON THE SOURCES OF RANDOMNESS The decomposition of the generalization into the summation of bias and variance requires one to specify a source of randomness in obtaining a collection of models. In classical bias-variance tradeoff, this source of randomness is usually taken to be the sampling of the training dataset. Correspondingly, the numerical estimation of bias and variance can be achieved by sampling a given dataset via bootstrap (see e.g. Neal et al. (2018)). This is the approach that we adopt in all numerical experiments in this paper, other than those in this section. On the other hand, modern deep networks often have other sources of randomness as well, such as the initialization of the model parameters, random sampling of the batches in the training process. 
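As a concrete illustration of the two protocols compared in this section, the following sketch shows one way to generate per-member training configurations with and without bootstrap resampling of the training set; the function and field names are our own assumptions rather than the exact training scripts used for the experiments.

```python
import numpy as np

def make_ensemble_configs(n_models, n_train, bootstrap=True, seed=0):
    """Training configurations for an ensemble: each member gets its own initialization seed
    and, optionally, its own bootstrap resample of the training indices (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    configs = []
    for _ in range(n_models):
        idx = (rng.integers(0, n_train, size=n_train)   # sample n_train indices with replacement
               if bootstrap else np.arange(n_train))    # or reuse the full training set
        configs.append({"init_seed": int(rng.integers(2**31)), "train_indices": idx})
    return configs

with_bootstrap = make_ensemble_configs(50, 50_000, bootstrap=True)
seeds_only = make_ensemble_configs(50, 50_000, bootstrap=False)  # randomness only from init/batching
```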
In this section, we study whether the emergence of bias-variance alignment is due purely to the randomness in sampling the training dataset, or whether other sources of randomness may also give rise to a similar phenomenon. To this end, we conduct experiments in which multiple deep neural networks are trained without data bootstrapping. In such cases, the randomness in the collection of networks comes only from random initialization and data batching. The result is shown in Figure 11. Comparing it with Figure 4, where the only difference lies in the bootstrapping of the training dataset, it can be seen that the source of randomness has a very small impact on the bias-variance alignment phenomenon.

Figure 11: Same as Figure 4 but without bootstrapping of the training dataset; panels: (a) ResNet-8-W1, (b) ResNet-8-W16, (c) ResNet-110-W16, (d) varying model size. Hence, the randomness in computing the bias and variance comes from random initialization and data batching, and there is no randomness in the sampling of the training dataset.

E.7 TREC EXPERIMENTS WITH BERT

We next show that the bias-variance alignment observation also holds on NLP datasets with a Transformer-based model (Vaswani et al., 2017), more specifically BERT (Devlin et al., 2019). This is shown in Figure 12, where we consider the TREC dataset (Hovy et al., 2001; Li & Roth, 2002) with its fine-grained labels (i.e., 47 classes) and vary the number of layers in the BERT model. In this experiment, each of the two ensembles consists of 20 BERT models. In each case, every model was initialized from the same pre-trained checkpoint and trained for 20 epochs with a learning rate of $2\times 10^{-5}$ using Adam. We use a polynomial-decay learning rate schedule with the number of warm-up steps set to 10% of the total number of update steps. The training batch size was set to 8.

Figure 12: Sample-wise bias and variance of BERT fine-tuned on TREC, for (a) BERT with 2 layers and (b) BERT with 8 layers. We confirm the bias-variance alignment phenomenon.

F FURTHER RESULTS ON CALIBRATION AND THE BIAS-VARIANCE CORRELATION

F.1 PERFECT CALIBRATION DOES NOT NECESSARILY IMPLY PERFECT CONFIDENCE CALIBRATION

Perfect calibration does not necessarily imply perfect confidence calibration. To illustrate this, consider the following example: let $\mathcal X = \mathcal Y = \{1, 2\}$, and let $X$ be a uniformly random variable on $\{1, 2\}$. Let the probability of $Y = i$ given $X$ be $P(i \mid X) = \mathbf{1}\{X \neq i\}$, and let the classifier $h$ be defined as $h(i \mid X) = \mathbf{1}\{X = i\}$. In addition, let $\Sigma_i$ represent the trivial $\sigma$-algebra for all $i$ in the set $\{1, 2\}$. In this case, we have $\mathbb{E}[\Delta(i\mid X) \mid \Sigma_i] = \mathbb{E}[\Delta(i\mid X)] = \mathbb{E}[2\cdot\mathbf{1}\{X=i\} - 1] = 0$, which means that $h$ has perfect calibration with respect to $\{\Sigma_i\}_{i=1,2}$. However, since $\mathrm{pred}_h(x) = x$, we have $\mathbb{E}[\Delta(\mathrm{pred}_h(X)\mid X) \mid \Sigma_{\mathrm{pred}_h(X)}] = \mathbb{E}[\Delta(\mathrm{pred}_h(X)\mid X)] = \mathbb{E}[\Delta(X\mid X)] = 1$. Therefore, $h$ does not have perfect confidence calibration.

The implication also fails for preimage perfect calibration, i.e., calibration with respect to $\{\Sigma^{\mathrm{pre}}_i\}$. Consider the following counterexample with $K = 4$ classes and $X$ uniform on $\{1, 2\}$:

h(i | x):
            i = 1    i = 2    i = 3    i = 4
  x = 1     0.3      0.25     0.2      0.25
  x = 2     0.3      0.5      0.2      0

P_{Y|X}(i | x):
            i = 1    i = 2    i = 3    i = 4
  x = 1     0.4      0.25     0.1      0.25
  x = 2     0.2      0.5      0.3      0

We set $P(X = 1) = P(X = 2) = 1/2$. It is clear that
$$\mathbb{E}\big[\,h(i\mid X) - P_{Y\mid X}(i\mid X)\;\big|\;h(i\mid X)\,\big] = 0 \qquad (19)$$
holds for $i = 2, 4$. If $i = 1$, we have $h(i\mid X) = 0.3$ for both values of $X$. Therefore,
$$\mathbb{E}\big[h(1\mid X) - P_{Y\mid X}(1\mid X)\;\big|\;h(1\mid X) = 0.3\big] = 0.3 - \mathbb{E}\big[P_{Y\mid X}(1\mid X)\;\big|\;h(1\mid X) = 0.3\big] = 0.3 - \tfrac{0.4 + 0.2}{2} = 0\,.$$
Similarly, we can show that (19) holds for $i = 3$.
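As a quick numeric check (our illustration, not part of the paper's code), the script below evaluates the conditional expectations in Eq. (19) for every class of the counterexample above, and also the confidence-conditioned expectation used in the next step; it confirms that per-class (preimage) calibration holds while confidence calibration fails.

```python
import numpy as np

px = np.array([0.5, 0.5])                  # P(X = 1) = P(X = 2) = 1/2
h = np.array([[0.3, 0.25, 0.2, 0.25],      # h(i | x = 1)
              [0.3, 0.50, 0.2, 0.00]])     # h(i | x = 2)
p = np.array([[0.4, 0.25, 0.1, 0.25],      # P_{Y|X}(i | x = 1)
              [0.2, 0.50, 0.3, 0.00]])     # P_{Y|X}(i | x = 2)

# Preimage calibration: for each class i and each value q taken by h(i|X),
# the conditional expectation E[h(i|X) - P(i|X) | h(i|X) = q] should be 0.
for i in range(4):
    for q in np.unique(h[:, i]):
        mask = h[:, i] == q
        w = px[mask] / px[mask].sum()
        gap = ((h[mask, i] - p[mask, i]) * w).sum()
        print(f"class i={i + 1}, q={q}: E[h - P | h = q] = {gap:+.3f}")

# Confidence calibration: condition on conf_h(X) = max_i h(i|X).
pred = h.argmax(axis=1)
conf = h[np.arange(2), pred]
acc = p[np.arange(2), pred]                # E[1{Y = pred_h(X)} | X]
for q in np.unique(conf):
    mask = conf == q
    w = px[mask] / px[mask].sum()
    gap = ((conf[mask] - acc[mask]) * w).sum()
    print(f"conf = {q}: E[conf - acc | conf = q] = {gap:+.3f}")
```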
For $\Sigma^{(2)}_i$, we have $\Sigma^{(2)}_{\mathrm{pred}_h(X)} = \sigma\big(h(\mathrm{pred}_h(X)\mid X)\big) = \sigma\big(\mathrm{conf}_h(X)\big)$. Note that in this example, $\mathrm{conf}_h(X)$ can take only two values, 0.3 and 0.5. Since
$$\mathbb{E}\big[\mathrm{conf}_h(X) - \mathrm{acc}_h(X)\;\big|\;\mathrm{conf}_h(X) = 0.3\big] = \mathbb{E}\big[\mathrm{conf}_h(X) - \mathrm{acc}_h(X)\;\big|\;X = 1\big] = -0.1 \neq 0\,, \qquad (20)\text{--}(21)$$
perfect confidence calibration is not satisfied.

F.2 PRE-IMAGE EXPECTED CALIBRATION ERROR

The expected calibration error (ECE) with respect to $\{\Sigma^{\mathrm{pre}}_i\}_{i\in[K]}$ recovers the ECE from Equation (2) in Guo et al. (2017). Recall the definitions of $\mathrm{conf}$ and $\mathrm{acc}$ in Section 2.1. The ECE with respect to $\{\Sigma^{\mathrm{pre}}_i\}_{i\in[K]}$ is
$$\mathrm{ECE}_{\{\Sigma_i\}_{i\in[K]}}(P) = \mathbb{E}\,\big|\mathbb{E}[\Delta(\mathrm{pred}_h(X)\mid X)\mid \Sigma^{\mathrm{pre}}_{\mathrm{pred}_h(X)}]\big| = \mathbb{E}\,\big|\mathbb{E}[\Delta(\mathrm{pred}_h(X)\mid X)\mid h(\mathrm{pred}_h(X)\mid X)]\big| = \mathbb{E}\,\big|\mathbb{E}[\mathrm{conf}_h(X) - \mathrm{acc}_h(X)\mid \mathrm{conf}_h(X)]\big|\,. \qquad (22)\text{--}(25)$$
Equation (25) follows from $\Delta(\mathrm{pred}_h(X)\mid X) = \mathrm{conf}_h(X) - \mathrm{acc}_h(X)$ and $h(\mathrm{pred}_h(X)\mid X) = \mathrm{conf}_h(X)$. This recovers the definition of the ECE in (Guo et al., 2017, Equation (2)).

F.3 BIN-WISE EXPECTED CALIBRATION ERROR

The expected calibration error (ECE) with respect to $\{\Sigma^{\mathrm{bin}}_i\}_{i\in[K]}$ recovers the ECE from Equation (3) in Guo et al. (2017). Let $E_j$ represent the event $\frac{j-1}{M} < \mathrm{conf}_h(X) \le \frac{j}{M}$, i.e., $\{\lceil M\,\mathrm{conf}_h(X)\rceil = j\}$. The ECE with respect to $\{\Sigma^{\mathrm{bin}}_i\}_{i\in[K]}$ is
$$\mathrm{ECE}_{\{\Sigma_i\}_{i\in[K]}}(P) = \sum_{j\in[M]} P(E_j)\,\mathbb{E}\Big[\big|\mathbb{E}[\Delta(\mathrm{pred}_h(X)\mid X)\mid \Sigma^{\mathrm{bin}}_{\mathrm{pred}_h(X)}]\big| \,\Big|\, E_j\Big] = \sum_{j\in[M]} P(E_j)\,\mathbb{E}\Big[\big|\mathbb{E}[(\mathrm{conf}_h(X) - \mathrm{acc}_h(X))\mid \lceil M\,\mathrm{conf}_h(X)\rceil]\big| \,\Big|\, E_j\Big] = \sum_{j\in[M]} P(E_j)\,\big|\mathbb{E}[(\mathrm{conf}_h(X) - \mathrm{acc}_h(X))\mid E_j]\big|\,. \qquad (26)\text{--}(29)$$
Equation (28) follows from $\Delta(\mathrm{pred}_h(X)\mid X) = \mathrm{conf}_h(X) - \mathrm{acc}_h(X)$ and $\lceil M\,h(\mathrm{pred}_h(X)\mid X)\rceil = \lceil M\,\mathrm{conf}_h(X)\rceil$. Equation (29) is the ECE with respect to $\{\Sigma^{\mathrm{bin}}_i\}_{i\in[K]}$ on the population $P(X, Y)$. If one wants to estimate the ECE from an empirical distribution formed by sampling $n$ i.i.d. samples $\{(x_i, y_i)\}_{i\in[n]}$ from $P(X, Y)$, then $P(E_j)$ is $\frac{|B_j|}{n}$, where $B_j$ denotes the set of indices of the samples whose confidence falls into the bin $\big(\frac{j-1}{M}, \frac{j}{M}\big]$. Under the empirical distribution, we have
$$\mathbb{E}[\mathrm{acc}_h(X)\mid E_j] = \frac{1}{|B_j|}\sum_{i\in B_j}\mathbf{1}\{\mathrm{pred}_h(x_i) = y_i\}\,, \qquad (30)$$
$$\mathbb{E}[\mathrm{conf}_h(X)\mid E_j] = \frac{1}{|B_j|}\sum_{i\in B_j}\mathrm{conf}_h(x_i)\,. \qquad (31)$$
We recover the definition of the ECE in (Naeini et al., 2015) and (Guo et al., 2017, Equation (3)).

F.4 PROOF OF COROLLARY 4.5

Proof of Corollary 4.5. By (14) and the bounded convergence theorem, we have
$$\mathbb{E}\big[\mathbb{E}_{Y|X}[\beta(i)^2 - \sigma(i)^2]\,\big|\,\Sigma_i\big] = \mathbb{E}\big[\mathbb{E}_\theta[h_\theta(i\mid X)(1 - h_\theta(i\mid X))]\,\big|\,\Sigma_i\big] \to 0\,. \qquad (32)\text{--}(33)$$
Let us write $\delta(i) = \beta(i) - \sigma(i)$. It follows that
$$\mathbb{E}\big[\mathbb{E}_{Y|X}[\beta(i)^2 - \sigma(i)^2]\,\big|\,\Sigma_i\big] = \mathbb{E}\big[\mathbb{E}_{Y|X}[(\sigma(i)+\delta(i))^2 - \sigma(i)^2]\,\big|\,\Sigma_i\big] = \mathbb{E}\big[\mathbb{E}_{Y|X}[\delta(i)(\delta(i)+2\sigma(i))]\,\big|\,\Sigma_i\big] \ge \mathbb{E}\big[\mathbb{E}_{Y|X}[\delta(i)^2]\,\big|\,\Sigma_i\big] \ge 0\,. \qquad (34)\text{--}(37)$$
As a result, we obtain $\mathbb{E}\big[\mathbb{E}_{Y|X}[\delta(i)^2]\,\big|\,\Sigma_i\big] \to 0$, which implies $\mathbb{E}\big[\mathbb{E}_{Y|X}[|\delta(i)|]\,\big|\,\Sigma_i\big] \to 0$ since $L_2$ convergence of random variables implies $L_1$ convergence.

F.5 PROOF OF THEOREM 4.3

Proof of Theorem 4.3. We have
$$\mathrm{Bias}^2_{h_\theta,(X,Y)}(i) = \bar h(i\mid X)^2 + \mathbf{1}\{Y=i\} - 2\cdot\mathbf{1}\{Y=i\}\,\bar h(i\mid X)\,, \qquad (38)$$
$$\mathrm{Vari}_{h_\theta,(X,Y)}(i) = \mathbb{E}_\theta h_\theta(i\mid X)^2 - \bar h(i\mid X)^2\,. \qquad (39)$$
Therefore we get
$$\mathrm{Bias}^2_{h_\theta,(X,Y)}(i) - \mathrm{Vari}_{h_\theta,(X,Y)}(i) = 2\big(\bar h(i\mid X)^2 - \mathbf{1}\{Y=i\}\,\bar h(i\mid X)\big) + \mathbf{1}\{Y=i\} - \mathbb{E}_\theta h_\theta(i\mid X)^2\,. \qquad (40)\text{--}(41)$$
Taking the expectation over the conditional distribution of $Y \mid X$ yields
$$\mathbb{E}_{Y|X}\big[\mathrm{Bias}^2_{h_\theta,(X,Y)}(i) - \mathrm{Vari}_{h_\theta,(X,Y)}(i)\big] = 2\,\bar h(i\mid X)\,\Delta(i\mid X) + P_{Y|X}(i\mid X) - \mathbb{E}_\theta h_\theta(i\mid X)^2\,. \qquad (42)$$
Then we further take the conditional expectation $\mathbb{E}[\,\cdot \mid \Sigma_i]$, re-arrange the terms, and obtain
$$\mathbb{E}\Big[\mathbb{E}_{Y|X}\big[\mathrm{Bias}^2_{h_\theta,(X,Y)}(i) - \mathrm{Vari}_{h_\theta,(X,Y)}(i)\big] - P_{Y|X}(i\mid X) + \mathbb{E}_\theta h_\theta(i\mid X)^2 \,\Big|\, \Sigma_i\Big] = 2\,\bar h(i\mid X)\,\mathbb{E}\big[\Delta(i\mid X)\mid\Sigma_i\big]\,. \qquad (43)$$
Taking the absolute value and then the outer expectation gives
$$\mathbb{E}\,\Big|\mathbb{E}\Big[\mathbb{E}_{Y|X}\big[\mathrm{Bias}^2_{h_\theta,(X,Y)}(i) - \mathrm{Vari}_{h_\theta,(X,Y)}(i)\big] - P_{Y|X}(i\mid X) + \mathbb{E}_\theta h_\theta(i\mid X)^2 \,\Big|\, \Sigma_i\Big]\Big| = 2\,\mathbb{E}\big[\bar h(i\mid X)\,\big|\mathbb{E}[\Delta(i\mid X)\mid\Sigma_i]\big|\big] \le 2\,\mathbb{E}\big[\big|\mathbb{E}[\Delta(i\mid X)\mid\Sigma_i]\big|\big]\,. \qquad (44)$$
Summing (43) over $i\in[K]$ and taking the outer conditional expectation $\mathbb{E}[\,\cdot\mid\Sigma]$ gives
$$\mathbb{E}\Big[\mathbb{E}_{Y|X}\big[\mathrm{Bias}^2_{h_\theta,(X,Y)} - \mathrm{Vari}_{h_\theta,(X,Y)}\big]\,\Big|\,\Sigma\Big] = 1 - \mathbb{E}\Big[\mathbb{E}_{Y|X}\mathbb{E}_\theta\,\|h_\theta(\cdot\mid X)\|_2^2\,\Big|\,\Sigma\Big] + 2\sum_{i\in[K]}\mathbb{E}\big[\mathbb{E}[\Delta(i\mid X)\mid\Sigma_i]\,\bar h(i\mid X)\,\big|\,\Sigma\big]\,. \qquad (45)\text{--}(46)$$
Re-arranging the terms and taking the absolute value yields
$$\mathbb{E}\,\Big|\mathbb{E}\Big[\mathbb{E}_{Y|X}\big[\mathrm{Bias}^2_{h_\theta,(X,Y)} - \mathrm{Vari}_{h_\theta,(X,Y)}\big] - 1 + \mathbb{E}_{Y|X}\mathbb{E}_\theta\,\|h_\theta(\cdot\mid X)\|_2^2\,\Big|\,\Sigma\Big]\Big| \le 2\sum_{i\in[K]}\mathbb{E}\big[\big|\mathbb{E}[\Delta(i\mid X)\mid\Sigma_i]\big|\,\bar h(i\mid X)\big] \le 2\,\mathrm{CWCE}_{\{\Sigma_i\}_{i\in[K]}}(P)\,. \qquad (47)\text{--}(48)$$

F.6 NO BIAS-VARIANCE CORRELATION IN THE KULLBACK-LEIBLER DIVERGENCE

The expected Kullback-Leibler (KL) divergence $\mathbb{E}_\theta D_{\mathrm{KL}}\big(e_Y \,\|\, h_\theta(\cdot\mid X)\big)$ can also be decomposed (Heskes, 1998; Zhou et al., 2021; Yang et al., 2020) into the bias $\mathrm{Bias}^2_{h_\theta,(X,Y)}$ and the variance $\mathrm{Vari}_{h_\theta,(X,Y)}$:
$$\mathrm{Bias}^2_{h_\theta,(X,Y)} = D_{\mathrm{KL}}\big(e_Y \,\|\, \bar h(\cdot\mid X)\big)\,, \qquad \mathrm{Vari}_{h_\theta,(X,Y)} = \mathbb{E}_\theta D_{\mathrm{KL}}\big(e_Y \,\|\, h_\theta(\cdot\mid X)\big) - \mathrm{Bias}^2_{h_\theta,(X,Y)}\,,$$
where the mean function $\bar h$ and its partition function $Z$ are defined by
$$\bar h(i\mid X) = \frac{1}{Z}\exp\big(\mathbb{E}_\theta \log h_\theta(i\mid X)\big)\,, \qquad Z = \sum_{i\in[K]}\exp\big(\mathbb{E}_\theta \log h_\theta(i\mid X)\big)\,. \qquad (49)\text{--}(50)$$
We can see that $\mathrm{Vari}_{h_\theta,(X,Y)} = -\log Z$. The following Proposition F.1 demonstrates that there is no correlation between bias and variance in the KL divergence, unlike in the mean squared error. Specifically, we prove that the ratio of expected bias to expected variance in the decomposition of the KL divergence can take any value in the range $(0, \infty)$.

Proposition F.1. There exists a data distribution $P(X, Y)$ such that for any value $r\in(0,\infty)$, there is an ensemble $\{h_\theta\}_\theta$ such that its mean function $\mathbb{E}_\theta h_\theta$ has samplewise perfect calibration, and the ratio of expected bias to expected variance under the KL divergence satisfies
$$\frac{\mathbb{E}_{Y|X}\,\mathrm{Bias}^2_{h_\theta,(X,Y)}}{\mathbb{E}_{Y|X}\,\mathrm{Vari}_{h_\theta,(X,Y)}} = r\,.$$

Proof. Suppose that there are $K = 2$ classes and, for every $x$, $P(i\mid x) = 1/2$ ($i = 1, 2$). Moreover, define $h_1(1\mid x) = h_2(2\mid x) = \varepsilon$ and $h_1(2\mid x) = h_2(1\mid x) = 1 - \varepsilon$, and let $\theta$ be a uniformly random variable on $\{1, 2\}$. Then the mean function $\bar h$ satisfies $\bar h(1\mid x) = \bar h(2\mid x) = 1/2$, which does not depend on $\varepsilon$. The expected bias $\mathbb{E}_{Y|X}\,\mathrm{Bias}^2_{h_\theta,(X,Y)}$ is $\log 2$. The partition function $Z$ equals $2\exp\big((\log\varepsilon + \log(1-\varepsilon))/2\big)$, from which we obtain the variance
$$\mathbb{E}_{Y|X}\,\mathrm{Vari}_{h_\theta,(X,Y)} = \mathrm{Vari}_{h_\theta,(X,Y)} = -\log 2 - \tfrac12\log\big(\varepsilon(1-\varepsilon)\big)\,.$$
As $\varepsilon \to 0^+$, the variance $\mathrm{Vari}_{h_\theta,(X,Y)}$ tends to $\infty$. As $\varepsilon \to 1/2$, the variance $\mathrm{Vari}_{h_\theta,(X,Y)}$ vanishes. Therefore the ratio $\mathbb{E}_{Y|X}\,\mathrm{Bias}^2_{h_\theta,(X,Y)} \big/ \mathbb{E}_{Y|X}\,\mathrm{Vari}_{h_\theta,(X,Y)}$ can take any value in the range $(0, \infty)$.

In Proposition F.1, we can let $r$ approach $0$. In this limit, there exists a collection of ensembles $\{h_\theta\}_\theta$ (which depends on $r$, hence the term "collection") such that $\mathbb{E}_{Y|X}\,\mathrm{Bias}^2_{h_\theta,(X,Y)} \big/ \mathbb{E}_{Y|X}\,\mathrm{Vari}_{h_\theta,(X,Y)} \to 0$. Conversely, we can let $r$ approach infinity, in which case $\mathbb{E}_{Y|X}\,\mathrm{Bias}^2_{h_\theta,(X,Y)} \big/ \mathbb{E}_{Y|X}\,\mathrm{Vari}_{h_\theta,(X,Y)} \to \infty$. Therefore, either bias or variance can be arbitrarily large relative to the other, implying that there is no alignment of bias and variance under the KL divergence.

F.7 THEOREM 4.3 IMPLIES GENERALIZATION DISAGREEMENT EQUALITY (GDE)

In this section, we show that Theorem 4.3 implies the generalization disagreement equality (GDE), which is the main result of Jiang et al. (2022) and Kirsch & Gal (2022). We first recap the GDE using the notation of this paper, beginning with the definitions of the test error, disagreement, and class aggregated calibration error (CACE) originally given in (Jiang et al., 2022; Kirsch & Gal, 2022).
Definition F.2 (Test error, disagreement, and class aggregated calibration error (Jiang et al., 2022; Kirsch & Gal, 2022)). Let $\{h_\theta : \mathcal X \to \mathcal M([K])\}$ be an ensemble of trained models, each of which has a deterministic prediction, i.e., $h_\theta(\cdot\mid x)$ is a one-hot vector for every $x\in\mathcal X$. Let $\bar h(\cdot\mid x) \coloneqq \mathbb{E}_\theta h_\theta(\cdot\mid x)$ be the mean function of $\{h_\theta\}$. Then, the test error, disagreement, and class aggregated calibration error (CACE) are defined as follows:
$$\mathrm{TestErr}_P(h_\theta) = \mathbb{E}_{(X,Y)\sim P}\big[\mathbf{1}\{h_\theta(\cdot\mid X) \neq e_Y\}\big]\,, \qquad \mathrm{Dis}_P(h_\theta, h_{\theta'}) = \mathbb{E}_{(X,Y)\sim P}\big[\mathbf{1}\{h_\theta(X) \neq h_{\theta'}(X)\}\big]\,,$$
$$\mathrm{CACE}_{P,\bar h} = \int_0^1 \sum_{i\in[K]} \Big|\,P\big(Y = i,\ \bar h(i\mid X) = q\big) - q\,P\big(\bar h(i\mid X) = q\big)\Big|\,dq\,.$$
Note that while the test error $\mathrm{TestErr}_P(h_\theta)$ and the disagreement $\mathrm{Dis}_P(h_\theta, h_{\theta'})$ are expected values over $P$, they still carry randomness due to $\theta$. Moreover, note that Jiang et al. (2022) use an integer $i\in[K]$ to denote the prediction of $h_\theta$, whereas we use a one-hot vector $e_i\in\mathbb{R}^K$. We will see the mathematical convenience of representing the prediction with a one-hot vector in our proof of Theorem F.3. In particular, our proof of Theorem F.3 shows that, in expectation, the disagreement is equal to the variance (defined in Equation 8) and the test error equals half the risk (defined in Equation 10).

Theorem F.3 (Theorem 4.2 of (Jiang et al., 2022)). If $h_\theta$ outputs a one-hot vector (as assumed in (Jiang et al., 2022)) and $\theta, \theta'$ are i.i.d., the following inequality holds:
$$\big|\mathbb{E}_{\theta,\theta'}[\mathrm{Dis}_P(h_\theta, h_{\theta'})] - \mathbb{E}_\theta[\mathrm{TestErr}_P(h_\theta)]\big| \le \mathrm{CACE}_{P,\bar h}\,.$$
If the ensemble $\{h_\theta\}$ satisfies pre-image perfect calibration ($\mathbb{E}[\Delta(i\mid X)\mid\Sigma^{\mathrm{pre}}_i] = 0$ for $i\in[K]$), the following generalization disagreement equality (GDE) holds:
$$\mathbb{E}_{\theta,\theta'}[\mathrm{Dis}_P(h_\theta, h_{\theta'})] = \mathbb{E}_\theta[\mathrm{TestErr}_P(h_\theta)]\,.$$

Proof. We first show that the disagreement is equal to the variance in expectation:
$$\mathbb{E}_{\theta,\theta'}[\mathrm{Dis}_P(h_\theta, h_{\theta'})] = \mathbb{E}_{\theta,\theta'}\mathbb{E}_{(X,Y)\sim P}\left[\frac{\|h_\theta(X) - h_{\theta'}(X)\|_2^2}{2}\right] = \mathbb{E}_{(X,Y)\sim P}\Big[\mathbb{E}_\theta\big[\|h_\theta(X)\|_2^2\big] - \big\|\mathbb{E}_\theta[h_\theta(X)]\big\|_2^2\Big] = \mathbb{E}_{(X,Y)\sim P}\big[\mathrm{Vari}_{h_\theta,(X,Y)}\big]\,.$$
Next, we show that the test error is equal to half the risk in expectation:
$$\mathbb{E}_\theta[\mathrm{TestErr}_P(h_\theta)] = \mathbb{E}_\theta\mathbb{E}_{(X,Y)\sim P}\left[\frac{\|h_\theta(\cdot\mid X) - e_Y\|_2^2}{2}\right] = \mathbb{E}_{(X,Y)\sim P}\left[\frac{\mathrm{Risk}_{h_\theta,(X,Y)}}{2}\right] = \mathbb{E}_{(X,Y)\sim P}\left[\frac{\mathrm{Bias}^2_{h_\theta,(X,Y)}}{2} + \frac{\mathrm{Vari}_{h_\theta,(X,Y)}}{2}\right]\,.$$
Subtracting the two equations above gives
$$\mathbb{E}_\theta[\mathrm{TestErr}_P(h_\theta)] - \mathbb{E}_{\theta,\theta'}[\mathrm{Dis}_P(h_\theta, h_{\theta'})] = \mathbb{E}_{(X,Y)\sim P}\left[\frac{\mathrm{Bias}^2_{h_\theta,(X,Y)}}{2} + \frac{\mathrm{Vari}_{h_\theta,(X,Y)}}{2} - \mathrm{Vari}_{h_\theta,(X,Y)}\right] = \mathbb{E}_{(X,Y)\sim P}\big[\mathrm{BVG}_{h_\theta,(X,Y)}\big]\,.$$
Applying Theorem 4.3 with $\Sigma_i = \Sigma^{\mathrm{pre}}_i = \sigma(\bar h(i\mid X))$ and using $\mathrm{Unce}_{h_\theta}(X) = 0$ (because $h_\theta(\cdot\mid x)$ is a one-hot vector for every $x\in\mathcal X$), we obtain
$$\mathbb{E}_{(X,Y)\sim P}\big[\mathrm{BVG}_{h_\theta,(X,Y)}\big] = \sum_{i\in[K]}\mathbb{E}\big[\mathbb{E}[\Delta(i\mid X)\mid\Sigma^{\mathrm{pre}}_i]\,\bar h(i\mid X)\big]\,. \qquad (51)$$
We see immediately that if $\mathbb{E}[\Delta(i\mid X)\mid\Sigma^{\mathrm{pre}}_i] = 0$ for all $i\in[K]$, the GDE is satisfied. The right-hand side of Equation (51) equals
$$\sum_{i\in[K]}\mathbb{E}\big[\mathbb{E}[\Delta(i\mid X)\mid\Sigma^{\mathrm{pre}}_i]\,\bar h(i\mid X)\big] = \int_0^1\sum_{i\in[K]}\big(q - P(Y=i\mid\bar h(i\mid X)=q)\big)\,q\,P\big(\bar h(i\mid X)=q\big)\,dq = \int_0^1\sum_{i\in[K]} q\,\big(q\,P(\bar h(i\mid X)=q) - P(Y=i,\ \bar h(i\mid X)=q)\big)\,dq\,.$$
In the first equality, we expand the expectations: to compute the outer expectation, we condition on the value $q$ of $\bar h(i\mid X)$, the mean prediction for class index $i$, and integrate with respect to the distribution of $\bar h(i\mid X)$; the inner expectation is taken over all $X$ such that $\bar h(i\mid X)$ equals the value $q$ that we condition on. In the last equality, we apply the conditional probability rule $P(Y=i,\ \bar h(i\mid X)=q) = P(Y=i\mid\bar h(i\mid X)=q)\,P(\bar h(i\mid X)=q)$. Taking the absolute value gives
$$\big|\mathbb{E}_{\theta,\theta'}[\mathrm{Dis}_P(h_\theta, h_{\theta'})] - \mathbb{E}_\theta[\mathrm{TestErr}_P(h_\theta)]\big| \le \int_0^1\sum_{i\in[K]} q\,\big|q\,P(\bar h(i\mid X)=q) - P(Y=i,\ \bar h(i\mid X)=q)\big|\,dq \le \int_0^1\sum_{i\in[K]}\big|q\,P(\bar h(i\mid X)=q) - P(Y=i,\ \bar h(i\mid X)=q)\big|\,dq = \mathrm{CACE}_{P,\bar h}\,,$$
where the last inequality uses $q\in[0,1]$.
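As an aside, the two identities used at the start of this proof (for one-hot predictors, disagreement equals variance and test error equals half the risk) are easy to check numerically. The snippet below is our own synthetic sanity check, not an experiment from the paper; all names and the generating process are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, K = 50, 2000, 5                         # models, test points, classes
labels = rng.integers(0, K, size=n)
onehot_y = np.eye(K)[labels]                  # e_Y

# Each model h_theta outputs a one-hot prediction; here predictions are drawn
# from an arbitrary per-sample categorical distribution (a stand-in ensemble).
class_probs = rng.dirichlet(np.ones(K), size=n)                                      # (n, K)
pred_idx = np.array([rng.choice(K, size=T, p=class_probs[j]) for j in range(n)]).T   # (T, n)
preds = np.eye(K)[pred_idx]                   # (T, n, K) one-hot h_theta(.|x)

mean_pred = preds.mean(axis=0)                                   # \bar h
vari = (preds ** 2).sum(axis=2).mean(axis=0) - (mean_pred ** 2).sum(axis=1)
risk = ((preds - onehot_y) ** 2).sum(axis=2).mean(axis=0)        # Bias^2 + Vari

# Disagreement between two independent draws theta, theta', averaged over pairs.
dis = (pred_idx[:, None, :] != pred_idx[None, :, :]).mean(axis=(0, 1))
test_err = (pred_idx != labels[None, :]).mean(axis=0)

print("E[Dis]     =", dis.mean(),      "  E[Vari]   =", vari.mean())
print("E[TestErr] =", test_err.mean(), "  E[Risk]/2 =", risk.mean() / 2)
```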
G FURTHER RESULTS ON NEURAL COLLAPSE AND THE BIAS-VARIANCE CORRELATION

G.1 PROPERTIES OF THE SIMPLEX EQUIANGULAR TIGHT FRAME (ETF)

$W_{\mathrm{ETF}}$ has the following properties. First, it is symmetric. Second, the inner product between any two distinct columns is equal to $-\frac{1}{K-1}$. Third, this pairwise separation is maximal: there does not exist a set of $K$ unit-norm vectors whose pairwise inner products are all smaller than $-\frac{1}{K-1}$.

G.2 VERIFYING ASSUMPTION 5.1

As stated in Assumption 5.1, we assume that the logits of a deep network for any test sample $(X, Y)$ are drawn from the following distribution:
$$\mathrm{softmax}\big(W_{\mathrm{ETF}}(s\,w^{\mathrm{ETF}}_Y + v)\big) = \mathrm{softmax}\Big(\tfrac{sK}{K-1}\big(e_Y - \tfrac1K\mathbf 1_K\big) + W_{\mathrm{ETF}}\,v\Big) = \mathrm{softmax}\Big(\tfrac{sK}{K-1}\,e_Y + \sqrt{\tfrac{K}{K-1}}\,v\Big)\,, \qquad (52)$$
where we use the fact that $W_{\mathrm{ETF}}\,w^{\mathrm{ETF}}_Y = \tfrac{K}{K-1}\big(e_Y - \tfrac1K\mathbf 1_K\big)$, and the final equality follows from the fact that the discarded terms are parallel to $\mathbf 1_K$ and adding a vector parallel to $\mathbf 1_K$ does not change the value of the softmax function. In particular, $v$ above is a random vector with i.i.d. entries such that $\beta\sqrt{\tfrac{K}{K-1}}\,v_i \sim \mathrm{Gumbel}(\mu, \beta)$. In this section, we verify this assumption from two perspectives. First, we plot the distributions of logits in practical neural networks and show that they align with (52). Second, we show through simulation that, if the logits are generated according to (52), then we observe bias-variance alignment.

Distribution of logits. From (52), the logits corresponding to the correct class (i.e., $Y$) and to any incorrect class $Y' \neq Y$ are given by
$$\tfrac{sK}{K-1} + \sqrt{\tfrac{K}{K-1}}\,v_Y \qquad\text{and}\qquad \sqrt{\tfrac{K}{K-1}}\,v_{Y'}\,, \qquad (53)$$
respectively. In particular, since the Gumbel distribution has a unimodal probability density function, the distributions of both the positive and all the negative logits are unimodal according to (53). To verify that this aligns with practical observations, we compute the distributions of logit values on various datasets and model architectures, for both positive and negative classes. The results are presented in Figure 13. We observe unimodal logit distributions for both positive and negative classes in all cases. On the other hand, one may notice that while (53) predicts the positive and negative logits to have different means but the same variance, in many cases in Figure 13 the positive and negative logits have notably different variances. Hence, Assumption 5.1 is a simplified model that makes our theoretical analysis tractable, but it is not meant to perfectly model the distribution of logits in practice. We show next that such a simplified model is sufficient for producing the bias-variance alignment phenomenon that we observe in practice.

Synthesizing bias-variance alignment. To justify Assumption 5.1, we synthetically generate a collection of logit vectors according to (53) and plot the sample-wise bias and variance obtained from these logit vectors. Specifically, given a number $n$ of samples and a number $K$ of classes, we first generate a collection of $n$ random labels, each drawn uniformly at random from $[K]$. For each label, we sample $T$ logit vectors independently according to (53) (for the Gumbel distribution, we take $\mu = 0$ and $\beta = 1$). Here, $T$ is interpreted as the number of independently trained models for estimating bias and variance. The results with $n = 200$, $K = 2$, and $T = 10$, under varying choices of $s \in \{5, 10, 20, 100\}$, are reported in Figure 14. In all cases, we observe a clear bias-variance alignment.
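The following is a minimal re-implementation sketch (ours, not the authors' code) of the synthetic experiment just described: logits are generated following the form of Eq. (53) as reconstructed above, and the per-sample squared bias and variance are computed from $T$ simulated "models". The exact scaling constants in the logit model are our assumption and may differ from the authors' setup; the qualitative alignment is what the sketch is meant to illustrate.

```python
import numpy as np

def synth_bias_variance(n=200, K=2, T=10, s=20.0, mu=0.0, beta=1.0, seed=0):
    """Generate logits per (the reconstructed) Eq. (53) and return per-sample
    squared bias and variance under the MSE decomposition."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=n)
    noise_scale = np.sqrt(K / (K - 1))            # multiplies the Gumbel noise
    margin = s * K / (K - 1)                      # added to the correct-class logit
    logits = noise_scale * rng.gumbel(mu, beta, size=(T, n, K))
    logits[:, np.arange(n), labels] += margin
    probs = np.exp(logits - logits.max(axis=2, keepdims=True))
    probs /= probs.sum(axis=2, keepdims=True)
    onehot = np.eye(K)[labels]
    mean_pred = probs.mean(axis=0)
    bias2 = ((mean_pred - onehot) ** 2).sum(axis=1)
    var = ((probs - mean_pred) ** 2).sum(axis=2).mean(axis=0)
    return bias2, var

for s in (5, 10, 20, 100):
    bias2, var = synth_bias_variance(s=s)
    r = np.corrcoef(np.log(bias2), np.log(var))[0, 1]
    print(f"s={s:>3}: correlation of log bias^2 and log variance = {r:.3f}")
```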
In particular, we did not draw correctly and incorrectly classified samples in different colors (as in Figure 1b) because all the samples in these cases are correctly classified.

Figure 13: Distribution of logits for positive and negative classes. Panels: (a) CIFAR-10, ResNet-8; (b) CIFAR-10, ResNet-56; (c) CIFAR-10, ResNet-110; (d) CIFAR-100, ResNet-8; (e) CIFAR-100, ResNet-56; (f) CIFAR-100, ResNet-110; (g) ImageNet, MobileNetV2; (h) ImageNet, EfficientNet-B0; (i) ImageNet, ResNet-50.

Figure 14: Sample-wise bias and variance for synthetic data generated according to (53). From left to right, $s$ is varied in the set $\{5, 10, 20, 100\}$.

G.3 VERIFYING COROLLARY 5.3: BINARY CLASSIFICATION

We note that our neural collapse based analysis in Corollary 5.3 relies on a binary classification assumption. To check that the bias-variance alignment results hold empirically for such a setup, we construct a binary classification problem based on the CIFAR-10 dataset: each example in the first five classes is assigned label 0, and each example in the last five classes is assigned label 1. We call the resulting dataset CIFAR-2. The results are shown in Figure 15.

Figure 15: Squared bias and variance computed for various model sizes on the CIFAR-2 problem: (a) ResNet-8, (b) ResNet-14, (c) ResNet-56, (d) ResNet-110. See Appendix G.3 for details.

G.4 RELATIONSHIP BETWEEN THE GUMBEL AND EXPONENTIAL DISTRIBUTIONS

Lemma G.1. Let $X \sim \mathrm{Gumbel}(\mu, \beta)$. Then $e^{-X/\beta} \sim \mathrm{Exp}(e^{\mu/\beta})$.

Proof. Recall that the cumulative distribution function (CDF) of the Gumbel distribution is given by
$$P(X \le x) = e^{-e^{-(x-\mu)/\beta}}\,. \qquad (54)$$
Thus we get
$$P\big(e^{-X/\beta} \le e^{-x/\beta}\big) = 1 - e^{-e^{-(x-\mu)/\beta}}\,. \qquad (55)$$
Substituting $t = e^{-x/\beta}$, we get
$$P\big(e^{-X/\beta} \le t\big) = 1 - e^{-t\,e^{\mu/\beta}}\,. \qquad (56)$$
This is the CDF of $\mathrm{Exp}(e^{\mu/\beta})$.

G.5 PROOF OF THEOREM 5.2

Proof of Theorem 5.2. As the first step, we compute the output of the function $h_\theta$:
$$h_\theta(X) = \mathrm{softmax}\big(W\,\psi_\tau(X)\big) = \mathrm{softmax}\big(W_{\mathrm{ETF}}\,R^\top R\,(s\,w^{\mathrm{ETF}}_Y + v)\big) = \mathrm{softmax}\Big(\tfrac{sK}{K-1}\,e_Y + u\Big)\,,$$
where $u \coloneqq \sqrt{\tfrac{K}{K-1}}\,v$ and the last equality holds up to a vector parallel to $\mathbf 1_K$, as in (52). Let us denote $w \coloneqq h_\theta(\cdot\mid X)$. Without loss of generality and for ease of presentation, we label the $Y$-th entry as the first entry ($Y = 1$). Then we have
$$w = \mathrm{softmax}\Big(\tfrac{sK}{K-1}\,e_1 + u\Big) = \left(\frac{a\,e^{u_1}}{a\,e^{u_1} + e^{u_2} + \cdots + e^{u_K}},\ \frac{e^{u_2}}{a\,e^{u_1} + e^{u_2} + \cdots + e^{u_K}},\ \ldots,\ \frac{e^{u_K}}{a\,e^{u_1} + e^{u_2} + \cdots + e^{u_K}}\right)\,,$$
where $a = e^{sK/(K-1)}$. By the Gumbel assumption on the noise and Lemma G.1, the $e^{u_i}$, $i\in[K]$, are i.i.d. $\mathrm{Exp}(\lambda)$ with $\lambda \coloneqq e^{\mu/\beta}$. Let us look at the first entry $w_1$ of $w$. It equals
$$w_1 = \frac{a\,e^{u_1}}{a\,e^{u_1} + e^{u_2} + \cdots + e^{u_K}} = \frac{a}{a + \frac{e^{u_2} + \cdots + e^{u_K}}{e^{u_1}}} = \frac{a}{a + (K-1)F} = \frac{c}{c + F}\,, \qquad (59)$$
where $c = \frac{a}{K-1} = \frac{e^{sK/(K-1)}}{K-1}$ and
$$F = \frac{\big(e^{u_2} + \cdots + e^{u_K}\big)\big/\big(2(K-1)\big)}{e^{u_1}/2} = \frac{\big(e^{u_2} + \cdots + e^{u_K}\big)/(K-1)}{e^{u_1}} \sim F\big(2(K-1),\ 2\big)$$
follows the F distribution (this is because $e^{u_i} \sim \mathrm{Exp}(\lambda) = \frac{1}{2\lambda}\chi^2_2$). The expectation of $w_1/c$ is given by
$$\mathbb{E}\Big[\frac{w_1}{c}\Big] = \mathbb{E}_{F\sim F(2K',2)}\Big[\frac{1}{c+F}\Big] = \frac{1}{K'\,\mathrm{B}(K', 1)}\int_0^\infty \frac{x^{K'-1}}{(c+x)\,(x + 1/K')^{K'+1}}\,dx\,, \qquad (61)$$
where $K' \coloneqq K - 1$ and $\mathrm{B}(\cdot,\cdot)$ is the Beta function. We denote this expectation by $\varphi_{K'}(c)$; its closed-form evaluation is given in Eqs. (62)-(64).
As a result, the bias of the first entry $w_1$ is
$$\beta_{h_\theta,(X,Y)}(1) = \big|\mathbb{E}[w_1] - 1\big| = \big|c\,\varphi_{K'}(c) - 1\big|\,. \qquad (65)$$
To get the variance of the first entry $w_1$, we compute its second moment as the first step:
$$\mathbb{E}[w_1^2] = c^2\,\mathbb{E}\Big[\frac{1}{(c+F)^2}\Big] = -c^2\,\frac{d}{dc}\,\mathbb{E}\Big[\frac{1}{c+F}\Big] = -c^2\,\frac{d\varphi_{K'}(c)}{dc}\,. \qquad (66)$$
Therefore, it follows that
$$\mathrm{Vari}_{h_\theta,(X,Y)}(1) = \mathbb{E}[w_1^2] - \mathbb{E}[w_1]^2 = -c^2\,\frac{d\varphi_{K'}(c)}{dc} - c^2\,\varphi_{K'}(c)^2\,, \qquad (67)$$
which yields
$$\varsigma_{h_\theta,(X,Y)}(1) = \sqrt{\mathrm{Vari}_{h_\theta,(X,Y)}(1)} = c\,\sqrt{-\frac{d\varphi_{K'}(c)}{dc} - \varphi_{K'}(c)^2}\,. \qquad (68)$$

G.6 PROOF OF COROLLARY 5.3

Proof of Corollary 5.3. If $K = 2$, we get $\mathbb{E}_{F\sim F(2,2)}\big[\frac{1}{c+F}\big] = \frac{c - \log(c) - 1}{(c-1)^2}$. As in the proof of Theorem 5.2, without loss of generality and for ease of presentation, we label the $Y$-th entry as the first entry ($Y = 1$). We then have the following expectations:
$$\mathbb{E}_u[w_1] = \frac{c\,(c - \log(c) - 1)}{(c-1)^2}\,, \qquad \mathbb{E}_u[w_2] = 1 - c\,\mathbb{E}_{F\sim F(2,2)}\Big[\frac{1}{c+F}\Big] = \frac{-c + c\log(c) + 1}{(c-1)^2}\,.$$
To obtain the variances of $w_1, w_2$, we first calculate the second moments:
$$\mathbb{E}_u[w_1^2] = \frac{c^2\,(c^2 - 2c\log(c) - 1)}{(c-1)^3\,c}\,, \qquad (69)$$
$$\mathbb{E}_u[w_2^2] = \mathbb{E}_u[(1 - w_1)^2] = \frac{c^2 - 2c\log(c) - 1}{(c-1)^3}\,. \qquad (70)$$
As a result, we have
$$\mathrm{Var}_u[w_1] = \mathrm{Var}_u[w_2] = \frac{c\,\big((c-1)^2 - c\log^2(c)\big)}{(c-1)^4}\,. \qquad (71)$$
Therefore, we obtain
$$\beta_{h_\theta,(X,Y)}(1) = \big|\mathbb{E}_u[w_1] - 1\big| = \frac{|c\log(c) - c + 1|}{(c-1)^2}\,, \qquad \beta_{h_\theta,(X,Y)}(2) = \big|\mathbb{E}_u[w_2] - 0\big| = \beta_{h_\theta,(X,Y)}(1)\,,$$
$$\varsigma_{h_\theta,(X,Y)}(1) = \varsigma_{h_\theta,(X,Y)}(2) = \frac{\sqrt{c\,\big((c-1)^2 - c\log^2(c)\big)}}{(c-1)^2}\,.$$
The ratio
$$\frac{\beta_{h_\theta,(X,Y)}(i)}{\varsigma_{h_\theta,(X,Y)}(i)} = \frac{|-c + c\log(c) + 1|}{\sqrt{c\,\big((c-1)^2 - c\log^2(c)\big)}}$$
is a decreasing function of $c\in(1,\infty)$ and $\lim_{c\to1^+}\frac{\beta_{h_\theta,(X,Y)}(i)}{\varsigma_{h_\theta,(X,Y)}(i)} = \sqrt{3}$. Therefore, the ratio satisfies $\frac{\beta_{h_\theta,(X,Y)}(i)}{\varsigma_{h_\theta,(X,Y)}(i)} \le \sqrt{3} < 1.74$ for $c > 1$. To show
$$\frac{\beta_{h_\theta,(X,Y)}(i)}{\varsigma_{h_\theta,(X,Y)}(i)} = \frac{|-c + c\log(c) + 1|}{\sqrt{c\,\big((c-1)^2 - c\log^2(c)\big)}} \ \ge\ \frac{\log(c) - 1}{\sqrt{c}} = \frac{2s - 1}{e^{s}}\,,$$
it suffices to prove
$$\frac{(-c + c\log(c) + 1)^2}{(c-1)^2 - c\log^2(c)} \ \ge\ (\log(c) - 1)^2\,,$$
which is equivalent to
$$\log(c)\,\big(c\log^3(c) - 2c\log^2(c) + (3c - 1)\log(c) - 2c + 2\big) \eqqcolon \log(c)\,f(c) \ \ge\ 0\,.$$
Since $f'(c) = -\frac{1}{c} + \log^3(c) + \log^2(c) - \log(c) + 1 \ge 0$ and $f(1) = 0$, we complete the proof of $\frac{\beta_{h_\theta,(X,Y)}(i)}{\varsigma_{h_\theta,(X,Y)}(i)} \ge \frac{\log(c) - 1}{\sqrt{c}}$. On the log scale, the ratio $\log\mathrm{Bias}^2_{h_\theta,(X,Y)}(i)\,\big/\,\log\mathrm{Vari}_{h_\theta,(X,Y)}(i) = \log\big(\beta_{h_\theta,(X,Y)}(i)^2\big)\,\big/\,\log\big(\varsigma_{h_\theta,(X,Y)}(i)^2\big)$ is a monotone function of $c$ for $c > 1$. As $c$ approaches $1$ from the right, the limit of this ratio is $\log(4)/\log(12) > 0.557$; as $c$ approaches infinity, the limit is $2$.

H FUTURE WORK

In the realm of future research, an intriguing avenue lies in the exploration of the bias-variance alignment phenomenon within diverse machine learning model ensembles. While our present work sheds light on this phenomenon in the context of deep neural network ensembles, there remains a substantial gap in our understanding of its presence or absence in other model types, such as decision trees and support vector machines. An essential direction for future investigation involves unraveling the root causes of bias-variance alignment, particularly as it pertains to distinct types of overfitting in machine learning models, be it benign, tempered, or catastrophic (Mallinar et al., 2022). Our conjecture extends towards discerning whether bias-variance alignment is a ubiquitous trait across various overfitting scenarios or whether its manifestation is contingent on the specific type of overfitting. Notably, we hypothesize that catastrophic overfitting may exhibit a bias-variance tradeoff, while benign and tempered overfitting might demonstrate a unique bias-variance alignment. Unraveling these intricacies is pivotal for advancing our comprehension of ensemble behaviors across different machine learning paradigms. Moreover, Atanasov et al.
(2022) demonstrates an inverse relationship between the feature learning strength and the variance across initializations. Exploring the interplay between bias, variance, and feature learning strength in neural networks is also an intriguing area of future research.