# how_many_classifiers_do_we_need__fbf825ff.pdf How many classifiers do we need? Hyunsuk Kim Department of Statistics University of California, Berkeley hyskim7@berkeley.edu Liam Hodgkinson School of Mathematics and Statistics University of Melbourne, Australia lhodgkinson@unimelb.edu.au Ryan Theisen Harmonic Discovery ryan@harmonicdiscovery.com Michael W. Mahoney ICSI, LBNL, and Dept. of Statistics University of California, Berkeley mmahoney@stat.berkeley.edu As performance gains through scaling data and/or model size experience diminishing returns, it is becoming increasingly popular to turn to ensembling, where the predictions of multiple models are combined to improve accuracy. In this paper, we provide a detailed analysis of how the disagreement and the polarization (a notion we introduce and define in this paper) among classifiers relate to the performance gain achieved by aggregating individual classifiers, for majority vote strategies in classification tasks. We address these questions in the following ways. (1) An upper bound for polarization is derived, and we propose what we call a neural polarization law: most interpolating neural network models are 4/3-polarized. Our empirical results not only support this conjecture but also show that polarization is nearly constant for a dataset, regardless of hyperparameters or architectures of classifiers. (2) The error rate of the majority vote classifier is considered under restricted entropy conditions, and we present a tight upper bound that indicates that the disagreement is linearly correlated with the error rate, and that the slope is linear in the polarization. (3) We prove results for the asymptotic behavior of the disagreement in terms of the number of classifiers, which we show can help in predicting the performance for a larger number of classifiers from that of a smaller number. Our theoretical findings are supported by empirical results on several image classification tasks with various types of neural networks. 1 Introduction As performance gains through scaling data and/or model size experience diminishing returns, it is becoming increasingly popular to turn to ensembling, where the predictions of multiple models are combined, both to improve accuracy and to form more robust conclusions than any individual model alone can provide. In some cases, ensembling can produce substantial benefits, particularly when increasing model size becomes prohibitive. In particular, for large neural network models, deep ensembles [LPB17] are especially popular. These ensembles consist of independently trained models on the same dataset, often using the same hyperparameters, but starting from different initializations. The cost of producing new classifiers can be steep, and it is often unclear whether the additional performance gains are worth the cost. Assuming that constructing two or three classifiers is relatively cheap, procedures capable of deciding whether to continue producing more classifiers are needed. To do so requires a precise understanding of how to predict ensemble performance. Of particular interest are majority vote strategies in classification tasks, noting that regression tasks can also be formulated in this way by clustering outputs. In this case, one of the most effective avenues for 38th Conference on Neural Information Processing Systems (Neur IPS 2024). predicting performance is the disagreement [JNBK22, BJRK22]: measuring the degree to which classifiers provide different conclusions over a given dataset. Disagreement is concrete, easy to compute, and strongly linearly correlated with majority vote prediction accuracy, leading to its use in many applications. However, a priori, the precise linear relationship between disagreement and accuracy is unclear, preventing the use of disagreement for predicting ensemble performance. Our goal in this paper is to go beyond disagreement-based analysis to provide a more quantitative understanding of the number of classifiers one should use to achieve a desired level of performance in modern practical applications, in particular for neural network models. In more detail, our contributions are as follows. (i) We introduce and define the concept of polarization, a notion that measures the higherorder dispersity of the error rates at each data point, and which indicates how polarized the ensemble is relative to the ground truth. We state and prove an upper bound for polarization (Theorem 1). Inspired by the theorem, we propose what we call a neural polarization law (Conjecture 1): most interpolating (Definition 2) neural network models are 4/3-polarized. We provide empirical results supporting the conjecture (Figures 1 and 2). (ii) Using the notion of polarization, we develop a refined set of bounds on the majority vote test error rate. For one, we provide a sharpened bound for any ensembles with a finite number of classifiers (Corollary 1). For the other, we offer a new, tighter bound under an additional condition on the entropy of the ensemble (Theorem 4). We provide empirical results that demonstrate our new bounds perform significantly better than the existing bounds on the majority vote test error (Figure 3). (iii) The asymptotic behavior of the majority vote error rate is determined as the number of classifiers increases (Theorem 5). Consequently, we show that we can predict the performance for a larger number of classifiers from that of a smaller number. We provide empirical results that show such predictions are accurate across various pairs of model architecture and dataset (Figure 4). In Section 2, we define the notations that will be used throughout the paper, and we introduce upper bounds for the error rate of the majority vote from previous work. The next three sections are the main part of the paper. In Section 3, we introduce the notion of polarization, ηρ, which plays a fundamental role in relating the majority vote error rate to average error rate and disagreement. We explore the properties of the polarization and present empirical results that corroborate our claims. In Section 4, we present tight upper bounds for the error rate of the majority vote for ensembles that satisfy certain conditions; and in Section 5, we prove how disagreement behaves in terms of the number of classifiers. All of these ingredients are put together to estimate the error rate of the majority vote for a large number of classifiers using information from only three sampled classifiers. In Section 6, we provide a brief discussion and conclusion. Additional material is presented in the appendices. 2 Preliminaries In this section, we introduce notation that we use throughout the paper, and we summarise previous work on the performance of the majority vote error rate. 2.1 Notations We focus on K-class classification problems, with features X X, labels Y [K] = {1, 2, ..., K} and feature-label pairs (X, Y ) D. A classifier h : X [K] is a function that maps a feature to a label. We define the error rate of a single classifier h, and the disagreement and the tandem loss [MLIS20] between two classifiers, h and h , as the following: Error rate : L(h) = ED[1(h(X) = Y )] Disagreement : D(h, h ) = ED[1(h(X) = h (X))] Tandem loss : L(h, h ) = ED[1(h(X) = Y )1(h (X) = Y )], where the expectation ED is used to denote E(X,Y ) D. Next, we consider a distribution of classifiers, ρ, which may be viewed as an ensemble of classifiers. This distribution can represent a variety of different cases. Examples include: (1) a discrete distribution over finite number of hi, e.g., a weighted sum of hi; and (2) a distribution over a parametric family hθ, e.g., a distribution of classifiers resulting from one or multiple trained neural networks. Given the ensemble ρ, the (weighted) majority vote h MV ρ : X [K] is defined as h MV ρ (x) = arg max y [K] Eρ[1(h(x) = y)]. Again, Eρ denotes Eh ρ, and we use Eρ, Eρ2, Pρ for Eh ρ, E(h,h ) ρ2, Ph ρ, respectively, throughout the paper. In this sense, Eρ[L(h)] represents the average error rate under a distribution of classifiers ρ and Eρ2[D(h, h )] represents the average disagreement between classifiers under ρ. Hereafter, we refer to Eρ[L(h)], Eρ2[D(h, h )], and L(h MV ρ ) as the average error rate, the disagreement, and the majority vote error rate, respectively, with L(h MV ρ ) = ED[1(h MV ρ (X) = Y )]. Lastly, we define the point-wise error rate, Wρ(X, Y ), which will serve a very important role in this paper (for clarity, we will denote Wρ(X, Y ) by Wρ unless otherwise necessary): Wρ(X, Y ) = Eρ[1(h(X) = Y )]. (1) 2.2 Bounds on the majority vote error rate The simplest relationship between the majority vote error L(h MV ρ ) and the average error rate Eρ[L(h)] was introduced in [Mc A98]. It states that the error in the majority vote classifier cannot exceed twice the average error rate: L(h MV ρ ) 2Eρ[L(h)] (2) A simple proof for this relationship can be found in [MLIS20] using Markov s inequality. Although (2) does not provide useful information in practice, it is worth noting that this bound is, in fact, tight. There exist pathological examples where h MV ρ exhibits twice the average error rate (see Appendix C in [TKY+24]). This suggests that we can hardly obtain a useful or tighter bound by relying on only the first-order term, Eρ[L(h)]. Accordingly, more recent work constructed bounds in terms of second-order quantities, Eρ2[L(h, h )] and Eρ2[D(h, h )]. In particular, [LMRR17] and [MLIS20] designed a so-called C-bound using the Chebyshev-Cantelli inequality, establishing that, if Eρ[L(h)] < 1/2, then L(h MV ρ ) Eρ2[L(h, h )] Eρ[L(h)]2 Eρ2[L(h, h )] Eρ[L(h)] + 1 As an alternative approach, [MLIS20] incorporated the disagreement Eρ2[D(h, h )] into the bound as well, albeit restricted to the binary classification problem, to obtain: L(h MV ρ ) 4Eρ[L(h)] 2Eρ2[D(h, h )]. (4) While (3) and (4) may be tighter in some cases, once again, there do exist pathological examples where this bound is as uninformative as the first-order bound (2). Motivated by these weak results, [TKY+24] take a new approach by restricting ρ to be a good ensemble, and introducing the competence condition (see Definition 3 in our Appendix A). Informally, competent ensembles are those where it is more likely in average across the data that more classifiers are correct than not. Based on this notion, [TKY+24] prove that competent ensembles are guaranteed to have weighted majority vote error smaller than the weighted average error of individual classifiers: L(h MV ρ ) Eρ[L(h)]. (5) That is, the majority vote classifier is always beneficial. Moreover, [TKY+24] proves that any competent ensemble ρ of K-class classifiers satisfy the following inequality. L(h MV ρ ) 4(K 1) 2Eρ2[D(h, h )] . (6) We defer further discussion of competence to Appendix A, where we introduce simple cases for which competence does not hold. In these cases, we show how one can overcome this issue so that the bounds (5) and (6) still hold. In particular, in Appendix A.3, we provide an example to show the bound (6) is tight. 3 The Polarization of an Ensemble In this section, we introduce a new quantity, ηρ, which we refer to as the polarization of an ensemble ρ. First, we provide examples as to what this quantity represents and draw a connection to previous studies. Then, we present theoretical and empirical results that show this quantity plays a fundamental role in relating the majority vote error rate to average error rate and disagreement. In Theorem 1, we prove an upper bound for the polarization ηρ, which highlights a fundamental relationship between the polarization and the constant 4 3. Inspired from the theorem, we propose Conjecture 1 which we call a neural polarization law. Figures 1 and 2 present empirical results on an image recognition task that corroborates the conjecture. We start by defining the polarization of an ensemble. In essence, the polarization is an improved (smaller) coefficient on the Markov s inequality on PD(Wρ > 0.5), where Wρ is the point-wise error rate defined as equation (1). It measures how much the ensemble is polarized from the truth, with consideration of the distribution of Wρ. Definition 1 (POLARIZATION). An ensemble ρ is η-polarized if η ED[W 2 ρ ] PD(Wρ > 1/2). (7) The polarization of an ensemble ρ is ηρ := PD(Wρ > 1/2) ED[W 2ρ ] , (8) which is the smallest value of η satisfies inequality (7). Note that the polarization always takes a value in [0, 4], due to the positivity constraint and Markov s inequality. Also note that ensemble ρ with polarization ηρ is η-polarized for any η ηρ. To understand better what this quantity represents, consider the following examples. The first example demonstrates that polarization increases as the majority vote becomes more polarized from the truth, while the second example demonstrates how polarization increases when the constituent classifiers are more evenly split. Example 1. Consider an ensemble ρ where 75% of classifiers output Label 1 with probability one, and the other 25% classifiers output Label 2 with probability one. - Case 1. The true label is Label 1 for the whole data. In this case, the majority vote in ρ results in zero error rate. The point-wise error rate Wρ is 0.25 on the entire dataset, and thus PD(Wρ > 0.5) = 0. The polarization ηρ is 0. - Case 2. The true label is Label 1 for half of the data and is Label 2 for the other half. In this case, the majority vote is only correct for half of the data. The point-wise error rate Wρ is 0.25 for this half, and is 0.75 for the other half. The polarization ηρ is 0.5/0.3125 = 1.6. - Case 3. The true label is Label 2 for the whole data. In this case, the majority vote in ρ is wrong on every data point. The point-wise error rate Wρ is 0.75 on the entire dataset and thus PD(Wρ > 0.5) = 1. The polarization ηρ is 1/0.3125 = 3.2. Example 2. Now consider an ensemble ρ of which 51% of classifiers always output Label 1, and the other 49% classifiers always output Label 2. - Case 1. The polarization ηρ is now 0, the same as in Example 1. - Case 2. The polarization ηρ is 0.5/0.2501 2, which is larger than 1.6 in Example 1. - Case 3. The polarization ηρ is now 1/0.2501 4 , which is larger than 3.2 in Example 1. In addition, the following proposition draws a connection between polarization and the competence condition mentioned in Section 2.2. It states that the polarization of competent ensembles cannot be very large. The proof is deferred to Appendix A.2. Proposition 1. Competent ensembles are 2-polarized. Now we delve more into this new quantity. We introduce Theorem 1, which establishes (by means of concentration inequalities) an upper bound on the polarization ηρ. The proof of Theorem 1 is deferred to Appendix B.1. Figure 1: Polarizations ηρ obtained from Res Net18 trained on CIFAR-10 with various sets of hyper-parameters tested on (a) an out-of-sample CIFAR-10 and (b) an out-of-distribution dataset, CIFAR-10.1. Red dashed line indicates y = 4/3, a suggested value of polarization appears in Theorem 1 and Conjecture 1. Theorem 1. Let {(Xi, Yi)}m i=1 be independent and identically distributed samples from D that are independent of an ensemble ρ. Then the polarization of the ensemble, ηρ, satisfies with probability at least 1 δ, where S = 1 m Pm i=1 W 2 ρ (Xi, Yi) and P = 1 m Pm i=1 1(Wρ(Xi, Yi) > 1/2).f Surprisingly, in practice, ηρ = 4 3 appears to be a good choice for a wide variety of cases. See Figure 1 and Figure 2, which show the polarization ηρ obtained from VGG11 [SZ14], Dense Net40 [HLVDMW17], Res Net18, Res Net50 and Res Net101 [HZRS16] trained on CIFAR-10 [Kri09] with various hyperparameters choices. The trend does not deviate even when evaluated on an out-ofdistribution dataset, CIFAR-10.1 [RRSS18, TFF08]. For more details on these empirical results, see Appendix C. Remark. We emphasize that values for ηρ that are larger than 4 3 does not contradict Theorem 1. This happens when the non-constant second term in (9) is larger than 4 3, which is often the case for classifiers which are not interpolating (or, indeed, that underfit or perform poorly). Definition 2 (INTERPOLATING, [BHMM19]). A classifier is interpolating if it achieves an accuracy of 100% on the training data. Putting Theorem 1 and the consistent empirical trend shown in Figure 2(b) together, we propose the following conjecture. Conjecture 1 (NEURAL POLARIZATION LAW). The polarization of ensembles comprised of independently trained interpolating neural networks is smaller than 4 4 Entropy-Restricted Ensembles In this section, we first present an upper bound on the majority vote error rate, L(h MV ρ ), in Theorem 2, using our notion of polarization ηρ which we introduced and defined in the previous section. Then, we present Theorems 3 and 4 which are the main elements in obtaining tighter upper bounds on L(h MV ρ ). Figure 3 shows our proposed bound offers a significant improvement over state-of-the-art results. The new upper bounds are inspired from the fact that classifier prediction probabilities tend to concentrate on a small number of labels, rather than be uniformly spread over all the possible labels. Figure 2: Polarization ηρ obtained (a) from various architectures trained on CIFAR-10 and (b) only from interpolating classifiers trained on various datasets. Red dashed line indicates y = 4/3. In subplot (b), we observe that the polarization of all interpolating models expect one are smaller than 4/3, which aligns with Conjecture 1. This is analogous to the phenomenon of neural collapse [Kot22]. As an example, in the context of a computer vision model, when presented with a photo of a dog, one might expect that a large portion of reasonable models might classify the photo as an animal other than a dog, but not as a car or an airplane. We start by stating an upper bound on the majority vote error, L(h MV ρ ) as a function of polarization ηρ. This upper bound is tighter (smaller) than the previous bound in inequality (6) when the polarization is lower than 2, which is the case for competent ensembles. The proof is deferred to Appendix B.2. Theorem 2. For an ensemble ρ of K-class classifiers, L(h MV ρ ) 2ηρ(K 1) 2Eρ2[D(h, h )] , where ηρ is the polarization of the ensemble ρ. Based on the upper bound stated in Theorem 2, we add a restriction on the entropy of constituent classifiers to obtain Theorem 3. The theorem provides a tighter scalable bound that does not have explicit dependency on the total number of labels, with a small cost in terms of the entropy of constituent classifiers. The proof of Theorem 3 is deferred to Appendix B.3. Theorem 3. Let ρ be any η-polarized ensemble of K-class classifiers that satisfies Pρ(h(x) / A(x)) , where y A(x) [K] and |A(x)| M, for all data points (x, y) D. Then, we have L(h MV ρ ) 2η(M 1) 2Eρ2[D(h, h )] . While Theorem 3 might provide a tighter bound than prior work, coming up with pairs (M, ) that satisfy the constraint is not an easy task. This is not an issue for a discrete ensemble, however. If ρ is a discrete distribution of N classifiers, then we observe that the assumption of Theorem 3 must always hold with (M, ) = (N +1, 0). We state this as the following corollary. Corollary 1 (FINITE ENSEMBLE). For an ensemble ρ that is a weighted sum of N classifiers, we have L(h MV ρ ) 2ηρN 2Eρ2[D(h, h )] , (10) where ηρ is the polarization of the ensemble ρ. Figure 3: Comparing our new bound from Corollary 1 (colored black), which is the right hand side of inequality (10), with bounds from previous studies. Green corresponds to the C-bound in inequality (3), and blue corresponds to the right hand side of inequality (6). Res Net18, Res Net50, Res Net101 models with various sets of hyperparameters are trained on CIFAR-10 then tested on (a) the out-of-sample CIFAR-10, (b) an out-of-distribution dataset, CIFAR-10.1. See Figure 3, which provides empirical results that compare the bound in Corollary 1 with the C-bound in inequality (3), and with inequality (6) proposed in [TKY+24]. We can observe that the new bound in Corollary 1 is strictly tighter than the others. For more details on these empirical results, see Appendix C. Although the bound in Corollary 1 is tighter than the bounds from previous studies, it s still not tight enough to use it as an estimator for L(h MV ρ ). In the following theorem, we use a stronger condition on the entropy of an ensemble to obtain a tighter bound. The proof is deferred to Appendix B.4. Theorem 4. For any η-polarized ensemble ρ that satisfies 1 2ED Pρ2 (h(X) = Y, h (X) = Y, h(X) = h (X)) ε ED [Pρ (h(X) = Y )] , (11) L(h MV ρ ) η (1 + ε) Eρ[L(h)] 1 2Eρ2[D(h, h )] . The condition (11) can be rephrased as follows: compared to the error Pρ(h(x) = y), the entropy of the distribution of wrong predictions is small, and it is concentrated on a small number of labels. A potential problem is that one must know or estimate the smallest possible value of ε in advance. At least, we can prove that ε = K 2 2(K 1) always satisfies the condition (11) for an ensemble of K-class classifiers. The proof is deferred to Appendix B.4. Corollary 2. For any η-polarized ensemble ρ of K-class classifiers, we have L(h MV ρ ) η 1 + K 2 2(K 1) 2Eρ2[D(h, h )] . Naturally, this ε is not good enough for our goal. We discuss more on how to estimate the smallest possible value of ε in the following section. 5 A Universal Law for Ensembling In this section, our goal is to predict the majority vote error rate of an ensemble with large number of classifiers by just using information we can obtain from an ensemble with a small number, e.g., three, of classifiers. Among the elements in the bound in Theorem 4, η (1 + ε) Eρ[L(h)] 1 2Eρ2[D(h, h )] , we plug in η = 4 3 as a result of Theorem 1; and since Eρ[L(h)] is invariant to the number of classifiers, it remains to predict the behavior of Eρ2[D(h, h )] and the smallest possible value of ε, ερ = ED[Pρ2(h(X) =Y,h (X) =Y,h(X) =h (X))] 2ED[Pρ(h(X) =Y )] . Since the denominator ED [Pρ (h(X) = Y )] = Eρ[L(h)] is invariant to the number of classifiers, and the numerator resembles the disagreement between classifiers, ερ is expected to follow a similar pattern as Eρ2[D(h, h )]. Note that the numerator of ερ has the same form as the disagreement, differing by only one less label. Both are V -statistics that can be expressed as a multiple of a U-statistic, as shown in equation (12). In the next theorem, we show that the disagreement for a finite number of classifiers can be expressed as the sum of a hyperbolic curve and an unbiased random walk. Here, [x] denotes the greatest integer less than or equal to x and D[0, 1] is the Skorokhod space on [0, 1] (see Appendix B.5). Theorem 5. Let ρN denote an empirical distribution of N independent classifiers {hi}N i=1 sampled from a distribution ρ and σ2 1 = Varh ρ(Eh ρPD(h(X) = h (X))). Then, there exists D > 0 such that E(h,h ) ρ2 N [D(h, h )] = 1 1 where EZN = 0, Var ZN σ2 1 and { t σ1 Z[Nt]}t [0,1] converges weakly to a standard Wiener process in D[0, 1] as N . Proof. Let Φ(hi, hj) = PD(hi(X) = hj(X)). We observe that N(N 1)E(h,h ) ρ2 N [D(h, h )] = 1 N(N 1) i,j=1 PD(hi(X) = hj(X)) i,j=1 Φ(hi, hj) = Φ: symmetric Φ(hi,hi)=0 1 i 1/2), that is, the probability that at least half of the classifiers agree on the correct answer. This quantity is especially well-behaved, and it frequently appears in our proofs. (Indeed, every bound presented in this work serves as an upper bound for Pρ(Wρ > 1/2).) We conjecture that this quantity is useful much more generally. Acknowledgements. We would like to thank the DOE, IARPA, NSF, and ONR for providing partial support of this work. [BHMM19] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849 15854, 2019. [Bil13] Patrick Billingsley. Convergence of probability measures. John Wiley & Sons, 2nd edition, 2013. [BJRK22] Christina Baek, Yiding Jiang, Aditi Raghunathan, and Zico Kolter. Agreement-onthe-line: Predicting the performance of neural networks under distribution shift. Advances in Neural Information Processing Systems, 35:19274 19289, 2022. [CBIK+18] Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical japanese literature. ar Xiv preprint ar Xiv:1812.01718, 2018. [Den12] Li Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141 142, 2012. [HLVDMW17] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700 4708, 2017. [How17] Andrew G Howard. Mobilenets: Efficient convolutional neural networks for mobile vision applications. ar Xiv preprint ar Xiv:1704.04861, 2017. [HZRS16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770 778, 2016. [JNBK22] Yiding Jiang, Vaishnavh Nagarajan, Christina Baek, and J Zico Kolter. Assessing generalization of SGD via disagreement. In International Conference on Learning Representations, 2022. [Kal21] Olav Kallenberg. Foundations of modern probability. Springer, 3rd edition, 2021. [KB13] Vladimir S. Korolyuk and Yu V. Borovskich. Theory of U-statistics. Springer Science & Business Media, 2013. [Kot22] Vignesh Kothapalli. Neural collapse: A review on modelling principles and generalization. ar Xiv preprint ar Xiv:2206.04041, 2022. [Kri09] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. [LMRR17] François Laviolette, Emilie Morvant, Liva Ralaivola, and Jean-Francis Roy. Risk upper bounds for general ensemble methods with an application to multiclass classification. Neurocomputing, 219:15 25, 2017. [LPB17] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems, 30, 2017. [Mc A98] David A. Mc Allester. Some PAC-Bayesian theorems. In Proceedings of the eleventh annual conference on Computational Learning Theory, pages 230 234, 1998. [MLIS20] Andrés Masegosa, Stephan Lorenzen, Christian Igel, and Yevgeny Seldin. Second order PAC-Bayesian bounds for the weighted majority vote. Advances in Neural Information Processing Systems, 33:5263 5273, 2020. [RRSS18] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do CIFAR-10 classifiers generalize to CIFAR-10? ar Xiv preprint ar Xiv:1806.00451, 2018. [SZ14] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. ar Xiv preprint ar Xiv:1409.1556, 2014. [TFF08] Antonio Torralba, Rob Fergus, and William T Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE transactions on Pattern Analysis and Machine Intelligence, 30(11):1958 1970, 2008. [TKY+24] Ryan Theisen, Hyunsuk Kim, Yaoqing Yang, Liam Hodgkinson, and Michael W. Mahoney. When are ensembles really effective? Advances in Neural Information Processing Systems, 36, 2024. [XRV17] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. ar Xiv preprint ar Xiv:1708.07747, 2017. A More discussion on competence In this section, we delve more into the competence condition that was introduced in [TKY+24]. We explore in which cases the competence condition might not work and how to overcome these issues. We discuss a few milder versions of competence that are enough for bounds (5) and (6) to hold. Then we discuss how to check whether these weaker competence conditions hold in practice, with or without a separate validation set. We start by formally stating the original competence condition. Definition 3 (Competence, [TKY+24]). The ensemble ρ is competent if for every 0 t 1/2, PD(Wρ [t, 1/2)) PD(Wρ [1/2, 1 t]). (16) A.1 Cases when competence fails One tricky part in the definition of competence is that it requires inequality (16) to hold for every 0 t 1/2. In case t = 1/2, the inequality becomes 0 PD(Wρ = 1/2). This is not a significant issue in the case that ρ is a continuous distribution over classifiers, e.g., a Bayes posterior or a distribution over a parametric family hθ, as {Wρ = 1/2} would be a measure-zero set. In the case that ρ is a discrete distribution over finite number of classifiers, however, PD(Wρ = 1/2) is likely to be a positive quantity, in which case it can violate the competence condition. That being said, {(x, y) | Wρ(x, y) = 1/2} represent tricky data points that deserves separate attention. This event can be divided into two cases: 1) all the classifiers that incorrectly made a prediction output the same label; or 2) incorrect predictions consist of multiple labels so that the majority vote outputs the true label. Among these two possibilities, the first case is troublesome. We denote such data points by TIE(ρ, D): TIE(ρ, D) := {(x, y) | Pρ(1(h(x) = j)) = Pρ(1(h(x) = y)) = 1/2 for true label y and an incorrect label j}. In this case, the true label and an incorrect label are chosen by exactly the same ρ weights of classifiers. An easy way to resolve this issue is to slightly tweak the weights. For instance, if ρ is an equally weighted sum of two classifiers, we can change each of their weights to be (1/2 + ϵ, 1/2 ϵ), instead of (1/2, 1/2). This change may seem manipulative, but it corresponds to a deterministic tie-breaking rule which prioritizes one classifier over the other, which is a commonly used tiebreaking rule. Definition 4 (Tie-free ensemble). An ensemble is tie-free if PD(TIE(ρ, D)) = 0. Proposition 2. An ensemble with a deterministic tie-breaking rule is tie-free. With such tweak to make the set TIE(ρ, D) to be an empty set or a measure-zero set, we present a slightly milder condition that is enough for the bounds (5) and (6) to still hold. Definition 5 (Semi-competence). The ensemble ρ is semi-competent if for every 0 t < 1/2, P(Wρ [t, 1/2]) P(Wρ (1/2, 1 t]). (17) Note that inequality (17) is a strictly weaker condition than inequality (16), and hence competence implies semi-competence. The converse is not true. An ensemble is semi-competent even if the pointwise error Wρ(X, Y ) = 1/2 on every data points, but such an ensemble is not competent. Theorem 6. For a tie-free ensemble and semi-competent ensemble ρ, L(h MV ρ ) Eρ[L(h)] and L(h MV ρ ) 4(K 1) 2Eρ2[D(h, h )] holds in K-class classification setting. We provide the proof as a separate subsection below. A.2 Proof of Theorem 6 and Proposition 1 We start with the following lemma, which is a semi-competence version of Lemma 2 from [TKY+24]. Lemma 1. For a semi-competent ensemble ρ and any increasing function g satisfying g(0) = 0, ED[g(Wρ)1Wρ 1/2] ED[g( f Wρ)1g Wρ<1/2], where f Wρ = 1 Wρ. Proof. For every x [0, 1], it holds that PD(Wρ1Wρ 1/2 x) = PD(Wρ [x, 1/2]) 1x 1/2, PD( f Wρ1g Wρ<1/2 x) = PD( f Wρ [x, 1/2)) 1x 1/2 = PD(Wρ (1/2, 1 x]) 1x 1/2. From the definition of semi-competence, this implies that PD(Wρ1Wρ 1/2 x) PD( f Wρ1g Wρ<1/2 x) for every x [0, 1]. Using the fact that g(x 1x c) = g(x)1x c for any increasing function g with g(0) = 0, we obtain PD(h(Wρ)1Wρ 1/2 x) PD(h( f Wρ)1g Wρ<1/2 x). Putting these together with a well-known equality EX = R 0 P(X x)dx for a non-negative random variable X proves the lemma. Now we use Lemma 1 and Theorem 2 to prove Theorem 6. Proof of Theorem 6. Applying Lemma 1 with g(x) = 2x2 gives, ED[2W 2 ρ 1Wρ 1/2] ED[2 f Wρ 21g Wρ<1/2] = ED[(2 4Wρ + 2W 2 ρ )1Wρ>1/2]. (18) Putting this together with the following decomposition of ED[2W 2 ρ ] shows that the ensemble ρ is 2-polarized: ED[2W 2 ρ ] ED[2W 2 ρ 1Wρ>1/2] + ED[2W 2 ρ 1Wρ 1/2] (18) ED[2W 2 ρ 1Wρ>1/2] + ED[(2 4Wρ + 2W 2 ρ )1Wρ>1/2] ED[(1 2Wρ)21Wρ>1/2] + PD(Wρ > 1/2) PD(Wρ > 1/2). Therefore, applying Theorem 2 with constant η = 2 concludes the proof. We also state the following proof of Proposition 1 for completeness. Proof of Proposition 1. Inequality (19) with Lemma 3 proves the proposition. A.3 Example that the bound (6) is tight Here, we provide a combination of (ρ, D) of which L(h MV ρ ) is arbitrarily close to the bound. Consider, for each feature x, that exactly (1 ϵ) fraction of classifiers predict the correct label, and that the remaining ϵ fraction of classifiers predict a wrong label. In this case, L(h MV ρ ) = 0, Eρ[L(h)] = ϵ, and Eρ2[D(h, h )] = 2ϵ(1 ϵ). Hence, the upper bound (6) is 4(K 1) K ϵ2, which can be arbitrarily close to 0. B Proofs of our main results In this section, we provide proofs for our main results. B.1 Proof of Theorem 1 We start with the following lemma which shows the concentration of a linear combination of W 2 ρ and 1(Wρ > 1/2). Lemma 2. For sampled data points {(Xi, Yi)}m i=1 D, define Z2 := Pm i=1 W 2 ρ (Xi, Yi) and Z0 := Pm i=1 1(Wρ(Xi, Yi) > 1/2). The ensemble ρ is η-polarized with probability at least 1 δ if 1 m(ηZ2 Z0) > 4 , 1} 2m log 1 Proof. Let Z2i = W 2 ρ (Xi, Yi) and Z0i = 1(Wρ(Xi, Yi) > 1/2). Observe that η Z2i Z0i always takes a value between [ η 4 1, max{ η 4, η 1}] since Wρ(Xi, Yi) [0, 1]. This implies that ηZ2i Z0is are i.i.d. sub-Gaussian random variable with parameter σ = max{ 3η By letting A2 = E[ηW 2 ρ 1(Wρ 1/2)] and using the Hoeffding s inequality, we obtain 1 m(ηZ2 Z0) A2 4 , 1} 2m log 1 with probability at least 1 δ. Therefore, ρ is η-polarized with probability at least 1 δ if 1 m(ηZ2 Z0) > 4 , 1} 2m log 1 Now we use Lemma 2 to prove Theorem 1. Proof of Theorem 1. Observe that S = 1 m Z2, P = 1 m Z0, and thus 1 m(ηZ2 Z0) = ηS P. For 3, the lower bound in Lemma 2 is simply q 3η 8m log 1 δ , and the inequality (20) can be viewed as a quadratic inequality in terms of η. From quadratic formula, we know that 3η 8m log 1 3η 8m log 1 2S , then ηS P 3η 8m log 1 Putting this together with Lemma 2 proves the theorem: 3η 8m log 1 δ Lemma2 ρ is η-polarized w.p. 1 δ, and thus the polarization ηρ, the smallest η such that ρ is η-polarized, is upper bounded by the right hand side of inequality (21). B.2 Proof of Theorem 2 We start by proving the following lemma which relates the error rate of the majority vote, L(h MV ρ ), with the point-wise error rate, Wρ, using Markov s inequality. In general, L(h MV ρ ) PD(Wρ 1/2) is true for any ensemble ρ. We prove a tighter version of this. The difference between the two can be non-negligible when dealing with an ensemble with finite number of classifiers. Refer to Appendix A.1 and Definition 4 for more details regarding this difference and tie-free ensembles. Lemma 3. For a tie-free ensemble ρ, we have the inequality L(h MV ρ ) PD(Wρ > 1/2). Proof. For given feature x, Wρ 1/2 implies that more than or exactly ρ weighted half of the classifiers outputs the true label. Since the ensemble ρ is tie-free, h MV ρ outputs the true label if Wρ 1/2. Therefore, {(x, y) | Wρ(x, y) 1/2} {(x, y) | h MV ρ (x) = y}. Applying PD on the both sides proves the lemma. The following lemma appears as Lemma 2 in [MLIS20]. This lemma draws the connection between the point-wise error rate, Wρ and the tandem loss, Eρ2[L(h, h )]. Lemma 4. The equality ED[Wρ 2] = Eρ2[L(h, h )] holds. The next lemma appears as Lemma 4 in [TKY+24]. This lemma provides an upper bound on the tandem loss, Eρ2[L(h, h )], in terms of the average error rate, Eρ[L(h)], and the average disagreement, Eρ2[D(h, h )]. Lemma 5. For the K-class problem, Eρ2[L(h, h )] 2(K 1) 2Eρ2[D(h, h )] . Now we use these results to prove Theorem 2. Proof of Theorem 2. Putting Lemmas 3, 4, and 5 and the definition of the polarization together proves the theorem: L(h MV ρ ) Lemma 3PD(Wρ > 1/2) polarization ηρ ED[W 2 ρ ] = Lemma 4 ηρ Eρ2[L(h, h )] = Lemma 5 2Eh,h [D(h, h )] . B.3 Proof of Theorem 3 We start with a lemma which is a corollary of Newton s inequality. Lemma 6. For any collection of probabilities p1, . . . , pn, the following inequality holds. 1 i 1/2) η-polarized η ED[Wρ 2] = Lemma 4 η Eρ2[L(h, h )]. (22) From this, it suffices to prove that hα Eρ2[L(h, h )] is smaller than the upper bound in the theorem. First, observe the following decomposition of Eρ2[L(h, h )]: Eρ2[L(h, h )] = ED Pρ(h(X) = Y )2 = ED Pρ(h(X) = Y ) Pρ2(h(X) = Y, h (X) = Y ) . (23) For any predictor mapping into K classes, let y denote the true label for an input x. Now we derive a lower bound of Pρ2(h(X) = Y, h (X) = Y ) using the following decomposition of Pρ2(h(x) = h (x)): 1 2 Pρ2(h(x) = h (x)) = Pρ2(h(x) = y, h (x) = y) + X i/ A(x) j A(x)\{y} i,j A(x)\{y} i