# rethinking_fanos_inequality_in_ensemble_learning__66b6d11e.pdf

Rethinking Fano's Inequality in Ensemble Learning

Terufumi Morishita 1 Gaku Morio * 1 Shota Horiguchi * 1 Hiroaki Ozaki 1 Nobuo Nukaga 1

We propose a fundamental theory on ensemble learning that evaluates a given ensemble system by a well-grounded set of metrics. Previous studies used a variant of Fano's inequality of information theory and derived a lower bound of the classification error rate on the basis of the accuracy and diversity of models. We revisit the original Fano's inequality and argue that the studies did not take into account the information lost when multiple model predictions are combined into a final prediction. To address this issue, we generalize the previous theory to incorporate the information loss. Further, we empirically validate and demonstrate the proposed theory through extensive experiments on actual systems. The theory reveals the strengths and weaknesses of systems on each metric, which will push the theoretical understanding of ensemble learning and give us insights into designing systems.

1. Introduction

Ensemble learning has had great success in various fields of machine learning. Bagging (Breiman, 1996) trains diverse models from artificial datasets built by random sub-sampling on the original one. It is common to train models with different weight initializations (Lakshminarayanan et al., 2017) or models with different network architectures (Qummar et al., 2019; Morishita et al., 2020b). While models are usually combined by voting on predictions, other methods focus on how to combine them cleverly (Omari & Figueiras-Vidal, 2015; Morio et al., 2020a). Stacking (Wolpert, 1992) trains meta-estimators that make final predictions from model predictions as their inputs. Mixture of Experts (Jacobs et al., 1991; Shazeer et al., 2017) focuses more on the models that are best specialized for a given dataset instance.

*Equal contribution 1Hitachi, Ltd. Research and Development Group, Kokubunji, Tokyo, Japan. Correspondence to: Terufumi Morishita <terufumi.morishita.wp@hitachi.com>.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

It has been widely believed that accurate and diverse models lead to better performance for ensemble systems. Guided by this intuition, many heuristic metrics have been proposed to measure accuracy and diversity (Kohavi et al., 1996; Skalak et al., 1996; Cunningham & Carney, 2000; Shipp & Kuncheva, 2002). However, these metrics lack theoretical grounding, and indeed, Kuncheva & Whitaker (2003) empirically showed through a broad range of experiments that there are no connections between the metrics and system performance. Turning to theoretical viewpoints, Geman et al. (1992) decomposed the squared error loss used in regression tasks into the bias and covariance of models. Bias here corresponds to accuracy, and covariance to diversity. For classification tasks, Tumer & Ghosh (1995) showed that the error rate reduction obtained by unweighted voting is a decreasing function of the models' correlations, indicating that diverse models lead to better performance.

While the theory of Tumer & Ghosh (1995) deals with classification tasks under a limited setting, Brown (2009) and Zhou & Li (2010) first derived accuracy and diversity in a general setting. Using Fano's inequality of information theory, they derived a lower bound to the error rate of a given system. Then, they decomposed the lower bound into relevance $I_{\mathrm{relev}}$ and redundancy $I_{\mathrm{redun}}$ (Lemma 2.3, illustrated in Figure 1). $I_{\mathrm{relev}}$ is the information theoretical version of accuracy, and $I_{\mathrm{redun}}$ that of diversity. Their framework is promising as a fundamental theory of ensemble learning since it derives well-believed metrics in a general setting. However, the validity of the framework has not been examined much from either the theoretical or the empirical perspective. Theoretically, we find that the framework rests on implicit assumptions used by a variant of Fano's inequality, which generally do not hold in ensemble learning. As a result, the framework fails to capture important aspects of ensemble learning. Empirically, the experiments of the studies were not extensive enough to justify the framework. In particular, they did not check whether the framework can predict representative phenomena in ensemble learning.

In this paper, we rethink the theoretical framework from both perspectives. We first revisit the theory (Sections 2 and 3). We argue that the framework does not take into account the information lost when multiple model predictions are combined into a single final prediction. We call this information loss combination loss. To address the issue, we propose a generalized framework that incorporates the third metric of combination loss $I_{\mathrm{combloss}}$ based on the original Fano's inequality (Lemma 3.1, illustrated in Figure 1). We also solve the issue of the previous framework producing a loose lower bound when the number of classes is small.

[Figure 1 diagram. Left panel: Lemma 2.3. Right panel: Lemma 3.1 (ours). Labels: prediction; combination ℱ (e.g., voting and Stacking); relevance (accuracy); redundancy (diversity); information lost when predictions are combined; error rate; error rate lower bound.]

Figure 1: Previous framework (Brown, 2009; Zhou & Li, 2010) (left) and ours (right).

Next, we turn to empirical viewpoints. We first validate the proposed framework in Sections 4 and 5. In contrast to the previous studies, (i) we directly check whether the framework can predict phenomena in ensemble learning, (ii) we use various ensemble systems (Table 2), and (iii) we use various tasks (Table E.10). Additionally, to be modern and realistic, we use state-of-the-art DNNs such as BERT (Devlin et al., 2019), and the tasks are chosen from widely used benchmarks such as GLUE (Wang et al., 2018). Extensive experiments reveal that the previous framework cannot predict phenomena such as the performance ranking of ensemble systems (Figure 2) and performance scaling behavior (Figure 3), because it ignores combination loss. These results refute the previous framework. In contrast, the proposed framework justifies itself by predicting all these phenomena. Finally, we demonstrate the proposed framework (Section 6). We analyze DNN ensemble systems and answer why a system performs well or badly through its strengths and weaknesses in terms of the three metrics (Table 4). Such analysis pushes the theoretical understanding of ensemble learning and gives us insights into designing systems. In summary,

We propose a fundamental theoretical framework that measures a given ensemble system from a well-grounded set of metrics: relevance $I_{\mathrm{relev}}$, redundancy $I_{\mathrm{redun}}$, and combination loss $I_{\mathrm{combloss}}$. The metrics are tied to the bound on the performance of a system. The framework applies to any ensemble system.

We validate the framework through extensive experiments on DNN ensemble systems.

We demonstrate the framework. We analyze the DNN ensemble systems and answer why a system performs well or badly as follows:

1. Systems with models that simply differ in the training seeds perform well because the models are accurate (large $I_{\mathrm{relev}}$) and combinable (small $I_{\mathrm{combloss}}$).

2. Heterogeneous systems, which use various types of DNNs, also perform well. While some DNNs are inaccurate (small $I_{\mathrm{relev}}$), the DNNs are diverse (small $I_{\mathrm{redun}}$). Further, such systems should perform the best among all the systems when DNNs are combined by meta-estimators.

3. Bagging-based systems do not perform that well. Their models are diverse (small $I_{\mathrm{redun}}$) but inaccurate (small $I_{\mathrm{relev}}$) and uncombinable (large $I_{\mathrm{combloss}}$).

4. Systems with models with randomly chosen hyperparameters do not perform that well. The models are diverse (small $I_{\mathrm{redun}}$) but inaccurate (small $I_{\mathrm{relev}}$).

5. Meta-estimators generally push the performance of the systems by combining models smartly to reduce $I_{\mathrm{combloss}}$. Further, meta-estimators benefit systems such as 2 and 4 the most, since the amount of information of the true label is unevenly distributed on models of varied accuracies and such information is recovered well by meta-estimators. Finally, a simple estimator such as logistic regression should be enough on strong DNNs.

We release our code as open source.1

2. Conventional Framework Based on Variant of Fano's Inequality

2.1. Fano's Inequality

Let $Y \in \{1, 2, \dots, Y_{\max}\}$ be a discrete stochastic variable representing the input and $O \in \mathbb{R}^m$ be $m$ stochastic variables representing an observation after a noisy channel. We want to recover $Y$ from $O$ by using the reconstruction function $F : O \mapsto \hat{Y} \in \{1, 2, \dots, Y_{\max}\}$. Note that $Y \to O \to \hat{Y}$ forms a Markov chain. Fano's inequality relates the information lost in a noisy channel to the error rate when recovering the input as follows.

¹ Available at: https://github.com/hitachi-nlp/ensemble-metrics

Lemma 2.1 (Fano's inequality (Fano, 1961)). For any function $F$, the following holds:

$$H_2(p_{\mathrm{err}}) + p_{\mathrm{err}} \log_2(Y_{\max} - 1) \ge H(Y \mid \hat{Y}), \tag{1}$$

where $p_{\mathrm{err}} = \Pr[\hat{Y} \neq Y] \in [0, 1]$ is the reconstruction error rate, $H_2(p) = -p \log_2(p) - (1 - p) \log_2(1 - p)$ is the binary cross entropy, and $H(Y \mid \hat{Y})$ is the conditional entropy (C.2).

From the Markov property, the amount of information carried by $\hat{Y}$ is never more than that carried by $O$; thus, the right-hand side of (1) is lower bounded as

$$H(Y \mid \hat{Y}) \ge H(Y \mid O). \tag{2}$$

Since the binary cross entropy never exceeds one, the left-hand side of (1) is upper bounded as

$$H_2(p_{\mathrm{err}}) + p_{\mathrm{err}} \log_2(Y_{\max} - 1) \le 1 + p_{\mathrm{err}} \log_2(Y_{\max} - 1) < 1 + p_{\mathrm{err}} \log_2(Y_{\max}). \tag{3}$$

From (1) to (3), we obtain the following well-known variant of Fano's inequality:

Lemma 2.2 (An error rate lower bound (Fano, 1961)).

$$p_{\mathrm{err}} > \frac{H(Y \mid O) - 1}{\log_2 Y_{\max}}.$$
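As an illustration of how the right-hand side of Lemma 2.2 can be evaluated, the following minimal sketch estimates $H(Y \mid O)$ from paired samples with a plug-in (frequency-count) estimator. The helper names (`entropy`, `conditional_entropy`, `fano_variant_bound`) and the toy data are hypothetical and are not taken from the released code.

```python
import math
from collections import Counter


def entropy(samples):
    """Plug-in Shannon entropy (in bits) of a list of hashable outcomes."""
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())


def conditional_entropy(y, o):
    """H(Y | O) = H(Y, O) - H(O), estimated from paired samples."""
    return entropy(list(zip(y, o))) - entropy(o)


def fano_variant_bound(y, o, y_max):
    """Right-hand side of Lemma 2.2: (H(Y | O) - 1) / log2(Y_max)."""
    return (conditional_entropy(y, o) - 1.0) / math.log2(y_max)


# Hypothetical toy channel: 16 balanced classes, and the observation
# keeps only the parity of the label.
y = list(range(16)) * 10
o = [label % 2 for label in y]

h_cond = conditional_entropy(y, o)
bound = fano_variant_bound(y, o, y_max=16)
print(f"H(Y|O) = {h_cond:.3f} bits, Lemma 2.2 bound = {bound:.3f}")
```

In this toy channel, knowing the parity leaves eight equally likely labels, so $H(Y \mid O) = 3$ bits and the bound evaluates to $(3 - 1)/4 = 0.5$, while the best achievable error rate is $7/8$.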

2.2. Error Rate Lower Bound of Ensemble Systems

In ensemble learning, $Y$ denotes a label on a given instance, and $O = \{O_1, O_2, \dots, O_N\}$ is the output from $N$ models. Note that the output of the $i$-th model, $O_i$, can be a predicted label (a one-dimensional output) or class probabilities (an output with two or more dimensions). $F$ denotes a model combination method such as voting or Stacking. Lemma 2.2 gives a lower bound of the classification error rate $p_{\mathrm{err}}$ of an ensemble system.

Brown (2009) decomposed the lower bound into relevance and redundancy, the formulation of which was later simplified by Zhou & Li (2010) as follows:

Lemma 2.3 (Zhou & Li, 2010).

$$p_{\mathrm{err}} > B(I(O, Y)) := \frac{H(Y) - I(O, Y) - 1}{\log_2 Y_{\max}}. \tag{4}$$

$I(O, Y)$ is defined as follows:

$$I(O, Y) := I_{\mathrm{relev}}(O, Y) - I_{\mathrm{redun}}(O, Y), \tag{5}$$

$$I_{\mathrm{relev}}(O, Y) := \sum_{i=1}^{N} I(O_i; Y),$$

$$I_{\mathrm{redun}}(O, Y) := I_{\mathrm{multi}}(O) - I_{\mathrm{multi}}(O \mid Y),$$

where $H$ denotes entropy (C.1), $I$ denotes mutual information (C.3), and $I_{\mathrm{multi}}$ denotes multi-information (C.4) to (C.5), a multivariate generalization of mutual information.

Since H(Y ) and Ymax are constants given a machine learning task, the important term in (4) is I(O, Y ) defined in (5), which denotes the amount of unique information on Y carried by O. The first term Irelev(O, Y ) is the relevance, whose component I(Oi; Y ) denotes the amount of information on Y given by Oi. It can be seen as the accuracy of the model i from the information theoretical point of view. The second term Iredun(O, Y ) is the redundancy, which indicates how strongly the model outputs O = {O1, O2, . . . ON} are correlated with each other. In other words, it describes the amount of redundant (duplicated) information. Overall, Lemma 2.3 reveals that an ensemble system should include accurate (large Irelev) and diverse (small Iredun) models to get a small lower bound for the error rate B(I).
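The decomposition in (5) can be checked numerically on a small sample. The sketch below is only illustrative: it assumes that the multi-information of (C.4) to (C.5) is the usual total correlation, $I_{\mathrm{multi}}(O) = \sum_i H(O_i) - H(O)$, together with its conditional counterpart (the appendix is not reproduced here, so this form is an assumption). All function names are hypothetical, and the data are the six rows printed in Table 1b.

```python
import math
from collections import Counter


def entropy(samples):
    """Plug-in Shannon entropy in bits."""
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())


def mutual_information(a, b):
    """I(A; B) = H(A) + H(B) - H(A, B)."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


def multi_information(columns):
    """Assumed form of I_multi(O): sum_i H(O_i) - H(O_1, ..., O_N)."""
    return sum(entropy(c) for c in columns) - entropy(list(zip(*columns)))


def conditional_multi_information(columns, y):
    """Assumed form of I_multi(O | Y): sum_i H(O_i | Y) - H(O | Y)."""
    h_y = entropy(y)
    joint = list(zip(*columns))
    marginals = sum(entropy(list(zip(c, y))) - h_y for c in columns)
    return marginals - (entropy(list(zip(joint, y))) - h_y)


def relevance_redundancy(columns, y):
    """Return (I_relev, I_redun) as defined under (5)."""
    relev = sum(mutual_information(c, y) for c in columns)
    redun = multi_information(columns) - conditional_multi_information(columns, y)
    return relev, redun


# The six rows printed in Table 1b (five model predictions per instance).
rows = [(1, 1, 1, 0, 0), (1, 1, 1, 1, 1), (1, 0, 0, 1, 1),
        (0, 1, 1, 0, 1), (0, 0, 0, 0, 0), (0, 0, 0, 1, 1)]
y = [1, 1, 1, 0, 0, 0]
columns = [list(col) for col in zip(*rows)]

relev, redun = relevance_redundancy(columns, y)
joint_info = mutual_information(rows, y)  # I(O; Y) computed directly
print(f"I_relev = {relev:.3f}, I_redun = {redun:.3f}, I(O;Y) = {joint_info:.3f}")
```

Under the stated assumption, `relev - redun` coincides with the directly computed joint mutual information, which is what makes $I(O, Y)$ in (5) interpretable as the unique information on $Y$ carried by $O$.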

3. Proposed Framework Based on Original Fano's Inequality

3.1. Error Rate Lower Bound with Better Properties

To derive Lemma 2.2, which is the basis of Lemma 2.3, two bounds, (2) and (3), are used. However, in an ensemble learning context, neither is tight, so Lemma 2.3 would not give a good approximation of the lower bound.

The problem with relying on (2) is that the existence of a perfect reconstruction function $F$ is implicitly assumed. In the information theoretical context, using the noisy-channel coding theorem (Shannon, 1948), we can construct a smart reconstruction function $F$ so that the information lost by $F$ vanishes, i.e., $H(Y \mid \hat{Y}) - H(Y \mid O) \to 0$. Thus, the equality in (2) holds. On the other hand, in the ensemble learning context, we usually use a simple function such as voting or a meta-estimator trained on a limited amount of data as $F$. Therefore, the information loss $H(Y \mid \hat{Y}) - H(Y \mid O)$ caused by combining the outputs from multiple models $O$ into a single prediction $\hat{Y}$ should also be taken into account. We refer to this loss as combination loss.

Table 1: Extreme toy ensemble systems on an imaginary binary classification task for discussing combination loss (Section 3.2). Each row shows the predicted labels on an instance from the dataset. $O = \{O_1, \dots, O_5\}$: model predictions, $\hat{Y} = F(O)$: ensemble prediction, and $Y$: ground-truth label. Red 0/1 shows wrong ensemble predictions. Orange 0/1 shows correct but neglected model predictions. Blue 0 shows a correct prediction recovered by weighted voting. Tables 1b and 1c use the same $O$.

(a) $\hat{Y}_{\mathrm{vote}}$: voting on $O$.

| $O$ | $\hat{Y}_{\mathrm{vote}}$ | $Y$ |
|---|---|---|
| 11111 | 1 | 1 |
| 11111 | 1 | 1 |
| 11111 | 1 | 1 |
| ... | ... | ... |
| 00000 | 0 | 0 |
| 11111 | 1 | 0 |
| 00000 | 0 | 0 |
| ... | ... | ... |

(b) $\hat{Y}_{\mathrm{vote}}$: voting on $O$.

| $O$ | $\hat{Y}_{\mathrm{vote}}$ | $Y$ |
|---|---|---|
| 11100 | 1 | 1 |
| 11111 | 1 | 1 |
| 10011 | 1 | 1 |
| ... | ... | ... |
| 01101 | 1 | 0 |
| 00000 | 0 | 0 |
| 00011 | 0 | 0 |
| ... | ... | ... |

(c) $\hat{Y}_{\mathrm{w.vote}}$: just using $O_1$.

| $O$ | $\hat{Y}_{\mathrm{w.vote}}$ | $Y$ |
|---|---|---|
| 11100 | 1 | 1 |
| 11111 | 1 | 1 |
| 10011 | 1 | 1 |
| ... | ... | ... |
| 01101 | 0 | 0 |
| 00000 | 0 | 0 |
| 00011 | 0 | 0 |
| ... | ... | ... |

(d) $\hat{Y}_{\mathrm{vote}}$: voting on $O$, $\hat{Y}_{\mathrm{w.vote}}$: on $O_3$ to $O_5$.

| $O$ | $\hat{Y}_{\mathrm{vote}}$ | $\hat{Y}_{\mathrm{w.vote}}$ | $Y$ |
|---|---|---|---|
| 11100 | 1 | 0 | 1 |
| 11111 | 1 | 1 | 1 |
| 00011 | 0 | 1 | 1 |
| ... | ... | ... | ... |
| 11100 | 1 | 0 | 0 |
| 00000 | 0 | 0 | 0 |
| 00011 | 0 | 1 | 0 |
| ... | ... | ... | ... |

The problem with relying on (3) is that an exponentially large number of classes is assumed, i.e., $Y_{\max} \gg 1$. In information theory, $Y$ is assumed to be a sequence of symbols (e.g., bits). Suppose that the sequence length $L \gg 1$ and that there are $C$ types of symbols; $Y_{\max}$ becomes exponentially large as $Y_{\max} = C^L$. Then, the second term of the left-hand side of (3) is approximated as $p_{\mathrm{err}} \log_2(Y_{\max} - 1) \approx p_{\mathrm{err}} L \log_2 C \gg 1$. Since the first term ($H_2(p_{\mathrm{err}}) \le 1$) becomes negligible, it can be safely replaced with its upper bound (i.e., 1) without loosening the inequality much. On the other hand, in the ensemble learning context, the number of classes $Y_{\max}$ can be small; thus, simply neglecting $H_2(p_{\mathrm{err}})$ produces a loose bound. For example, for binary classification problems, the bound by Lemma 2.2 is never positive, $\frac{H(Y \mid O) - 1}{\log_2 Y_{\max}} \le 0$, because $H(Y \mid O) \le 1$ when $Y_{\max} = 2$.
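The looseness for small $Y_{\max}$ can be made concrete by comparing Lemma 2.2 with a direct numerical inversion of the original inequality (1). The sketch below is an illustration, not the bound used in this paper: it finds the smallest $p$ with $H_2(p) + p \log_2(Y_{\max} - 1) \ge h$ by bisection, using the fact that the left-hand side of (1) is increasing on $[0, (Y_{\max} - 1)/Y_{\max}]$. The function names and the value $h = 0.5$ bits are hypothetical.

```python
import math


def binary_entropy(p):
    """H2(p) in bits, with the convention 0 * log2(0) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def fano_lhs(p, y_max):
    """Left-hand side of (1): H2(p) + p * log2(Y_max - 1)."""
    return binary_entropy(p) + p * math.log2(y_max - 1)


def fano_inverted_bound(h_cond, y_max, tol=1e-12):
    """Smallest p in [0, (Y_max - 1) / Y_max] with fano_lhs(p) >= h_cond.

    Assumes 0 <= h_cond <= log2(Y_max), so that a solution exists.
    """
    if h_cond <= 0.0:
        return 0.0
    lo, hi = 0.0, (y_max - 1) / y_max
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if fano_lhs(mid, y_max) >= h_cond:
            hi = mid
        else:
            lo = mid
    return hi


def fano_variant_bound(h_cond, y_max):
    """Right-hand side of Lemma 2.2."""
    return (h_cond - 1.0) / math.log2(y_max)


# Hypothetical binary task with 0.5 bits of residual uncertainty.
h = 0.5
variant = fano_variant_bound(h, y_max=2)
inverted = fano_inverted_bound(h, y_max=2)
print(f"Lemma 2.2: {variant:.3f}, inverting (1): {inverted:.3f}")
```

For this binary example, Lemma 2.2 gives $-0.5$, which says nothing about the error rate, whereas inverting (1) gives a bound of roughly 0.11.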

To address these two problems, we lower-bound the error rate using the original Fano's inequality (Lemma 2.1) directly:

Lemma 3.1 (Decomposition of error rate lower bound into three metrics). Let $U(p) = H_2(p) + p \log_2(Y_{\max} - 1)$ and $U'(p) = \frac{dU}{dp}(p)$, and let $p_0 \in [0, 1]$ be the approximate error rate. Then, for any $p_0$, the error rate $p_{\mathrm{err}}$ is bounded as

$$p_{\mathrm{err}} \ge B^{\mathrm{tight}}_{p_0}(E(O, Y, \hat{Y})) := p_0 + \frac{U'(p_0)}{4}\left(1 - \sqrt{1 - 8\,\frac{H(Y) - E(O, Y, \hat{Y}) - U(p_0)}{U'(p_0)^2}}\right), \tag{6}$$

where the ensemble strength $E(O, Y, \hat{Y})$ is given by

$$E(O, Y, \hat{Y}) := I_{\mathrm{relev}}(O, Y) - I_{\mathrm{redun}}(O, Y) - I_{\mathrm{combloss}}(O, Y, \hat{Y}), \tag{7}$$

$$I_{\mathrm{combloss}}(O, Y, \hat{Y}) := H(Y \mid \hat{Y}) - H(Y \mid O).$$

Proof. In Lemma 2.1, we expand H2(perr) by using strong convexity and solve for perr. Appendix D.1 shows the proof.
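To make the "expand and solve" step of the proof tangible, here is a minimal sketch under stated assumptions. It assumes the quadratic expansion $U(p) \le U(p_0) + U'(p_0)(p - p_0) - 2(p - p_0)^2$, which follows from $H_2''(p) \le -4/\ln 2 < -4$, and solves the resulting quadratic inequality for $p$. The constant 2, the closed form in `tight_bound`, and all names are assumptions of this sketch; Appendix D.1 remains the authoritative derivation.

```python
import math


def binary_entropy(p):
    """H2(p) in bits, with the convention 0 * log2(0) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def U(p, y_max):
    """U(p) = H2(p) + p * log2(Y_max - 1)."""
    return binary_entropy(p) + p * math.log2(y_max - 1)


def U_prime(p, y_max):
    """dU/dp = log2((1 - p) / p) + log2(Y_max - 1), valid for 0 < p < 1."""
    return math.log2((1 - p) / p) + math.log2(y_max - 1)


def tight_bound(h_y, strength, p0, y_max):
    """Assumed closed form: smaller root of the quadratic inequality.

    Requires 0 < p0 < (Y_max - 1) / Y_max so that U'(p0) > 0.
    """
    slope = U_prime(p0, y_max)
    gap = h_y - strength - U(p0, y_max)  # H(Y | Y_hat) - U(p0)
    disc = max(0.0, 1.0 - 8.0 * gap / slope ** 2)
    return p0 + slope / 4.0 * (1.0 - math.sqrt(disc))


# Hypothetical binary task: H(Y) = 1 bit, ensemble strength E = 0.5 bits,
# approximate error rate p0 = 0.155.
bound = tight_bound(h_y=1.0, strength=0.5, p0=0.155, y_max=2)
print(f"sketched lower bound: {bound:.4f}")
```

Because $U$ is increasing on $[0, (Y_{\max} - 1)/Y_{\max}]$, any valid lower bound $b$ must satisfy $U(b) \le H(Y \mid \hat{Y})$; the sketch satisfies this for the example above while staying well above the negative value produced by Lemma 2.2.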

Lemma 3.1 differs from Lemma 2.3 in that (i) the ensemble strength $E$ (7) includes the third metric of combination loss, and (ii) the bound function is tighter²: $B^{\mathrm{tight}}_{p_0}(E) \ge B(E)$, which is the result of removing the large $Y_{\max}$ assumption.

² If $p_0$ is not far from the lower bound values (Appendix D.3).

Since $E = I - I_{\mathrm{combloss}}$ holds, $E$ denotes the amount of unique information on $Y$ carried by $O$ that can be extracted when a combination $F$ is applied to $O$. $B^{\mathrm{tight}}_{p_0}$ is still a decreasing function of $E$ when $p_0 \in \left[0, \frac{Y_{\max} - 1}{Y_{\max}}\right]$, where $\frac{Y_{\max} - 1}{Y_{\max}}$ denotes the error rate of a random-guessing system on a balanced label dataset. Thus, Lemma 3.1 reveals that an ensemble system should include accurate (large $I_{\mathrm{relev}}$) and diverse (small $I_{\mathrm{redun}}$) models and keep $I_{\mathrm{combloss}}$ small in order to have a small lower bound.
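The interpretation of $E$ can be spelled out with a short worked derivation. It assumes, as the step from Lemma 2.2 to Lemma 2.3 suggests, that $I(O, Y) = H(Y) - H(Y \mid O)$; under that assumption the $H(Y \mid O)$ terms cancel:

```latex
\begin{align*}
E(O, Y, \hat{Y})
  &= I(O, Y) - I_{\mathrm{combloss}}(O, Y, \hat{Y}) \\
  &= \bigl[ H(Y) - H(Y \mid O) \bigr] - \bigl[ H(Y \mid \hat{Y}) - H(Y \mid O) \bigr] \\
  &= H(Y) - H(Y \mid \hat{Y}) \\
  &= I(\hat{Y}; Y).
\end{align*}
```

Read this way, the ensemble strength is the mutual information between the final prediction and the label, so the right-hand side of (1) is $H(Y \mid \hat{Y}) = H(Y) - E$, and (2) guarantees $I_{\mathrm{combloss}} \ge 0$.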

3.2. What Kind of Systems Produce Combination Loss?

To clarify in what kind of ensemble systems combination loss becomes apparent, four toy ensemble systems on an imaginary binary classification task are shown in Table 1. The systems differ in terms of models O = {O1, O2, O3, O4, O5} or combination function F. Although the systems examined here are extremely simplified and the claims here are hypothetical, they can illustrate certain aspects of empirical behaviors of ensemble systems as discussed in Section 6.

Table 1a shows the case where the outputs from each model in O are perfectly correlated, i.e., there is no diversity between models. Information theoretically, the system has large redundancy Iredun. In this case, simple voting ˆYvote does not lose any information carried by O, so the combination loss is trivially zero.

Tables 1b and 1c show the cases where the models differ in accuracy, among which O1 performs best. Information theoretically, the amount of information on Y given by the models is unevenly distributed on the models, and especially concentrated on O1. Note that the same model set is shown in both tables. If naive voting is used for model combination (Table 1b), it produces a prediction error 1 even though some of the models (O1 and O4 in this case) give correct predictions 0. These correct but neglected minorities are the source of combination loss. On the other hand, if weighted voting that focuses more on the best model (i.e., O1) is used (Table 1c), it will succeed in recovering the correct prediction, 0.

Table 2: Ensemble methods used in this study. We built 16 ensemble systems using all combinations of model generation and combination methods. Note that Stacking has three variations (i.e., LogR, SVM, and RForest). All generation methods train $N$ ($\le 30$) models using a different seed for each model. The seed affects random aspects of training, e.g., weight initialization or hidden units dropped when using dropout. See Section 4.2 for details.

| Type | Method | Description |
|---|---|---|
| Model Generation | Random-HyP | Train models with different hyperparameters randomly sampled around the best value. |
| Model Generation | Bagging | Train models using different dataset instance sets. Each set contains instances randomly sampled from the original dataset. |
| Model Generation | Random-Seed | Train models that differ only in the seed of fine-tuning. |
| Model Generation | Hetero-DNNs | Train models from $L$ (= 5) types of DNNs, with $M$ models from each type so that $L \times M = N$. |
| Model Combination | Voting | Take a majority vote on labels predicted by models. |
| Model Combination | Stacking (LogR / SVM / RForest) | Use meta-estimators that make predictions from outputs of models as inputs. We used two-layered stacking with a single meta-estimator, which takes predicted labels as inputs. We trained logistic regression (LogR), Support Vector Machine (Platt, 1999) with RBF kernel (SVM), and Random Forest (Breiman, 2001) (RForest) as meta-estimators. |

Table 1d shows the case where the models' outputs are diverse but have the same accuracy. Information theoretically, information on $Y$ given by $O$ is uniformly distributed on all the models. In this case, weighted voting will not help much in recovering the correct predictions compared with simple voting, since there are no better models to be focused on.

From the discussion above, it is expected that (i) redundancy among models decreases combination loss, and (ii) smart combination functions help reduce combination loss, especially when the accuracies of models are varied.
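The contrast between Tables 1b and 1c can be reproduced numerically. The following sketch is only illustrative: it uses the six rows printed in those tables (the elided rows are not reproduced), estimates entropies by frequency counts, and evaluates $I_{\mathrm{combloss}} = H(Y \mid \hat{Y}) - H(Y \mid O)$ for majority voting and for relying on $O_1$ alone. The helper names are hypothetical.

```python
import math
from collections import Counter


def entropy(samples):
    """Plug-in Shannon entropy in bits."""
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())


def conditional_entropy(target, given):
    """H(target | given) = H(target, given) - H(given)."""
    return entropy(list(zip(target, given))) - entropy(given)


def majority_vote(row):
    """Unweighted vote over binary predictions (odd number of models)."""
    return int(2 * sum(row) > len(row))


# The six rows printed in Tables 1b and 1c.
O = [(1, 1, 1, 0, 0), (1, 1, 1, 1, 1), (1, 0, 0, 1, 1),
     (0, 1, 1, 0, 1), (0, 0, 0, 0, 0), (0, 0, 0, 1, 1)]
Y = [1, 1, 1, 0, 0, 0]


def combination_loss(y_hat):
    """I_combloss = H(Y | Y_hat) - H(Y | O) on the rows above."""
    return conditional_entropy(Y, y_hat) - conditional_entropy(Y, O)


y_vote = [majority_vote(row) for row in O]  # Table 1b
y_first = [row[0] for row in O]             # Table 1c: rely on O_1 only

loss_vote = combination_loss(y_vote)
loss_first = combination_loss(y_first)
print(f"voting: {loss_vote:.3f} bits, O_1 only: {loss_first:.3f} bits")
```

On these rows, voting misclassifies the instance with predictions 01101 and incurs a combination loss of about 0.54 bits, while relying on $O_1$ loses nothing. With only six rows the plug-in estimates are of course crude; the point is the direction of the effect.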

4. Experiments

We empirically validate and demonstrate Lemma 3.1. To this end, we built various ensemble systems and measured their error rates, error rate lower bounds, and the three metric values. To be modern and realistic, we built ensemble systems on top of state-of-the-art DNNs, specifically pretrained language models such as BERT (Devlin et al., 2019). We used various tasks from the GLUE and SuperGLUE benchmarks (Wang et al., 2018; 2019). These benchmarks include challenging tasks from different domains of NLP and are commonly used to compare state-of-the-art models.

Below, we briefly describe these setups. For reproducibility, we show the details in Appendix E and release the code.

4.1. Models

We fine-tuned the following five types of language models on downstream tasks: BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), ELECTRA (Clark et al., 2020), ALBERT (Lan et al., 2020), and BART (Lewis et al., 2020).

4.2. Ensemble Systems

To build an ensemble system, we must specify a model generation method (i.e., how to train models that produce $O$) and a combination method (i.e., $F$). We used well-established methods that can be used with DNNs (Table 2). These methods are commonly used with DNNs in a wide range of domains (Kumar et al., 2016; Liu et al., 2017; Qummar et al., 2019; Ma & Chu, 2019), especially in competitions where the highest performance is required (Szegedy et al., 2015; Yan et al., 2015; Atwood et al., 2020; Morishita et al., 2020a; Morio et al., 2020b). We built 16 systems using all the combinations of generation and combination methods.

For later convenience, we define the baseline system $s_0$ in each task, which is a single DNN (i.e., no ensemble) that performs the best among DNNs: ELECTRA for MRPC/Boolq/SST and RoBERTa for the other tasks.

Random-Seed, Random-HyP, and Bagging used the same single DNN type as $s_0$. Hetero-DNNs used $L = 5$ DNN types.

4.3. Estimation of Metric Values and Lower Bound

We estimated the three metric values (Irelev, Iredun, and Icombloss) and the other quantities appearing in Lemmas 2.3 and 3.1 on the basis of the observed frequency distribution of the labels (O, ˆY , Y ). Then, we computed the lower bounds by Lemmas 2.3 and 3.1. All such operations were done on test sets3.

³ We did so in order to eliminate from our discussion the statistical fluctuation caused by dataset splitting. Such a confounding factor is undesirable for verifying the theory.

[Figure 2 panels, each plotting lower bound reduction [%] against error rate reduction [%]: (a) Lemma 2.3 $B(I)$, Pearson coef. = -0.238; (b) $B^{\mathrm{tight}}(I)$, Pearson coef. = -0.165; (c) Lemma 3.1 $B^{\mathrm{tight}}(E)$, Pearson coef. = 0.984.]

Figure 2: Correlations between error rate reductions and lower bound reductions. Each panel uses a different type of lower bound. Each point shows the quantity of a specific ensemble system $s$, averaged over the eight tasks. See Table 4a for the real value of each point. We used the 16 ensemble systems described in Section 4.2. Each system $s$ used $N = 15$ models. Baseline values in (8) and (9) are: ER($s_0$): 15.5 %, LB($s_0$) by $B^{\mathrm{tight}}(E)$: 2.8 %, LB($s_0$) by $B^{\mathrm{tight}}(I)$: 2.8 %, and LB($s_0$) by $B(I)$: 2.0 %.

To tackle the count sparsity of high-dimensional variables $O = \{O_i \mid 1 \le i \le N,\ O_i \in \{1, 2, \dots, Y_{\max}\}\}$, we used the trick of MTI$_{k=3}$ introduced by Zhou & Li (2010).

We set the approximate error rate $p_0$ in (6) as the error rate of the baseline $s_0$. Below, we simply denote $B^{\mathrm{tight}}_{p_0}$ as $B^{\mathrm{tight}}$.

4.4. Tasks

We used eight classification tasks with moderately sized datasets for computational reasons: Boolq (Clark et al., 2019), CoLA (Dolan & Brockett, 2005), CosmosQA (Khot et al., 2018), MNLI (Williams et al., 2018), MRPC (Dolan & Brockett, 2005), SciTail (Khot et al., 2018), SST (Socher et al., 2013), and QQP.

4.5. Computational Resources / Experimental Runs

A single run of experiments required about 200 GPUs (V100) × 1 day. We ran the experiments three times.

5. Validation of Framework Through its Predictive Power to Ensemble Phenomena

We show that we can predict various phenomena observed on actual ensemble systems using Lemma 3.1. We show the results aggregated over the eight tasks here and those for each task in Appendix K. The discussions here are valid for all tasks, showing their significance.

Lemma 3.1 $B^{\mathrm{tight}}(E)$ differs from Lemma 2.3 $B(I)$ in two ways, i.e., it has a tightened bound function $B^{\mathrm{tight}}$ and an ensemble strength $E$ that includes combination loss. To separate the contribution of each, we analyze three types of lower bounds hereafter: $B(I)$, $B^{\mathrm{tight}}(I)$, and $B^{\mathrm{tight}}(E)$.

5.1. Effect of Bound Function Btight

First, as theoretically expected, the lower bound $B^{\mathrm{tight}}(I)$ was tighter than Lemma 2.3 $B(I)$. For example, for the baseline system $s_0$, $B^{\mathrm{tight}}(I_{s_0}) = 2.8\%$ and $B(I_{s_0}) = 2.0\%$ (average of eight tasks). The captions of Tables K.15 to K.22 show the error rates and the error rate lower bounds for the eight tasks.

5.2. Correlation between Error Rate and Lower Bound

The error rate lower bound denotes the best-case error rate. Thus, a system with a smaller lower bound has a higher chance of having a smaller error rate (Brown, 2009; Zhou & Li, 2010). Guided by this intuition, we measured the correlation between the error rates and lower bounds of the ensemble systems.

Figure 2 plots the following normalized versions of the error rate and lower bound for each ensemble system $s$:

$$\text{Error Rate Reduction}(s) = \frac{\mathrm{ER}(s_0) - \mathrm{ER}(s)}{\mathrm{ER}(s_0)} \times 100\ [\%], \tag{8}$$

$$\text{Lower Bound Reduction}(s) = \frac{\mathrm{LB}(s_0) - \mathrm{LB}(s)}{|\mathrm{LB}(s_0)|} \times 100\ [\%]. \tag{9}$$

$s_0$ denotes the single DNN baseline defined in Section 4.2. ER($s$) denotes the error rate (i.e., 100% − accuracy) and LB($s$) the lower bound. Note that the Pearson correlation coefficient is invariant under this transformation.
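The invariance claim can be checked with a few lines of code. The sketch below takes the rounded reductions of the 16 systems from Table 4a, recovers raw error rates and lower bounds from the reported baselines (ER($s_0$) = 15.5 %, LB($s_0$) by $B^{\mathrm{tight}}(E)$ = 2.8 %), and compares the Pearson coefficient before and after the transformation. The function names are hypothetical and the recovered raw values are approximations, since the table is rounded and averaged over tasks.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def reduction(baseline, value):
    """Relative reduction in percent; matches (8) and (9) for these baselines."""
    return (baseline - value) / abs(baseline) * 100.0


# Rounded means from Table 4a, rows: Random-HyP, Bagging, Random-Seed,
# Hetero-DNNs; columns: Voting, LogR, SVM, RForest.
err_red = [6.8, 8.5, 8.4, 7.6, 7.3, 8.2, 9.0, 5.8,
           9.6, 10.1, 9.5, 8.7, 6.5, 11.9, 10.4, 9.1]
lb_red = [5.8, 7.2, 7.5, 6.6, 6.8, 6.9, 8.2, 4.8,
          8.8, 8.5, 8.7, 7.7, 5.5, 10.3, 9.2, 7.8]

# Approximate raw quantities implied by the reported baselines.
er0, lb0 = 15.5, 2.8
er = [er0 * (1 - r / 100) for r in err_red]
lb = [lb0 * (1 - r / 100) for r in lb_red]

r_reduction = pearson(err_red, lb_red)
r_raw = pearson(er, lb)
round_trip = [reduction(er0, e) for e in er]
print(f"r on reductions: {r_reduction:.3f}, r on raw values: {r_raw:.3f}")
```

Both coefficients agree because (8) and (9) are affine maps with slopes of the same sign, and with the rounded table values the coefficient lands close to the 0.984 reported in Figure 2c.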

[Figure 3 panels, each plotted against the number of models $N$ (0 to 30) for Random-HyP, Bagging, Random-Seed, and Hetero-DNNs: (a) Error rate reduction [%]; (b) Lower bound reduction [%] by Lemma 2.3 $B(I)$; (c) Lower bound reduction [%] by $B^{\mathrm{tight}}(I)$; (d) Lower bound reduction [%] by Lemma 3.1 $B^{\mathrm{tight}}(E)$.]

Figure 3: Change in error rate reduction and lower bound reduction when the number of models $N$ was changed. Each value is an average of eight tasks. Ensemble systems used SVM model combination.

Table 3: Pearson correlation coefficients between error rate reduction and lower bound reduction. In each task, we used the 16 ensemble systems described in Section 4.2, and each system used $N = 15$ models.

| Task | Lemma 2.3 $B(I)$ | $B^{\mathrm{tight}}(I)$ | Lemma 3.1 $B^{\mathrm{tight}}(E)$ |
|---|---|---|---|
| Boolq | 0.341 | 0.330 | 0.910 |
| CoLA | -0.211 | -0.210 | 0.991 |
| CosmosQA | -0.324 | -0.320 | 1.000 |
| MNLI | 0.226 | 0.216 | 0.961 |
| MRPC | 0.332 | 0.252 | 0.989 |
| QQP | -0.131 | -0.076 | 0.998 |
| SciTail | -0.237 | -0.191 | 0.966 |
| SST | -0.242 | -0.252 | 0.998 |
| average⁴ | -0.238 | -0.165 | 0.984 |

⁴ The correlation coefficient between the averaged error rate reductions and lower bound reductions. The average is taken over the eight tasks.

Neither the lower bound reduction by Lemma 2.3 $B(I)$ nor that by $B^{\mathrm{tight}}(I)$ correlated with the error rate reduction, as shown in Figures 2a and 2b. In addition, Lemma 2.3 $B(I)$ predicted the same lower bound reduction value for different systems that share the same model generation method. This behavior of the lower bounds can be seen from the points on the same horizontal lines in Figure 2a. This behavior is theoretically expected: since Lemma 2.3 $B(I)$ does not include $I_{\mathrm{combloss}}$, it does not consider model combination methods. This behavior was also observed on $B^{\mathrm{tight}}(I)$ for the same reason.

By contrast, the lower bound reduction by Lemma 3.1 Btight(E) was very strongly correlated with the error rate reduction, as shown in Figure 2c. Strong correlations were observed for all eight tasks (Table 3) and also for different Ns (Tables G.11 to G.14). These results justify Lemma 3.1 and show that Btight(E) can be used for comparing systems. These results also show the importance of combination loss given that the only difference between Btight(E) and Btight(I) is combination loss.

5.3. Predicting Error Rate Scaling Curve

Figure 3 shows the change in error rate reduction and lower bound reductions when the number of models N was changed.

Neither Lemma 2.3 $B(I)$ (Figure 3b) nor $B^{\mathrm{tight}}(I)$ (Figure 3c) could predict the shape of the error rate reduction curve (Figure 3a), especially the saturation at around $N = 15$ and beyond. By contrast, Lemma 3.1 $B^{\mathrm{tight}}(E)$ (Figure 3d) could predict such phenomena. The results again justify Lemma 3.1 and show the importance of combination loss.

Refer to Appendix H for more detailed discussions, where we examine the scaling properties of each metric value.

6. Analysis of Ensemble Systems by Framework

We demonstrate how we can reveal the strengths and weaknesses of the systems on the basis of the metrics in Lemma 3.1. The results here are summarized in Section 1. We show the results aggregated over the eight tasks here and those for each task in Appendix K. The discussions here are valid for all tasks, showing their significance.

6.1. Justification of Three Metrics for Ensemble System Analysis

Table 4 shows the statistics of the ensemble systems. First, the ranking of the lower bound reduction by Btight(E) in Table 4a matches the ranking of E in Table 4b. This is theoretically expected because Btight is a decreasing function. Thus, E can be used for comparing systems, instead of Btight(E). Furthermore, since E is decomposed into the three metrics (Irelev, Iredun, Icombloss) as in (7), the three metrics can be used to analyze ensemble systems.

Below, we use per-model metrics $i_{\{\mathrm{relev},\, \mathrm{redun},\, \mathrm{combloss}\}} = I_{\{\mathrm{relev},\, \mathrm{redun},\, \mathrm{combloss}\}} / N$ for intuitive understanding.
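As a quick sanity check of how the per-model metrics recombine into the ensemble strength, the relation $E = (i_{\mathrm{relev}} - i_{\mathrm{redun}} - i_{\mathrm{combloss}}) \times N$ can be evaluated with the rounded, normalized means of Table 4b ($N = 15$). The two systems below are picked only as examples:

```latex
\begin{align*}
\text{Random-Seed + Voting:} \quad
  E &\approx (15.1 - 7.79) \times 15 = 109.65
  && \text{(tabulated: } 109.2\text{)} \\
\text{Hetero-DNNs + LogR:} \quad
  E &\approx (16.1 - 8.73) \times 15 = 110.55
  && \text{(tabulated: } 110.9\text{)}
\end{align*}
```

The small gaps are consistent with rounding: the tabulated $i_{\mathrm{relev}} - i_{\mathrm{redun}}$ is given to one decimal place, and an error of 0.05 there is amplified by the factor $N = 15$.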

6.2. Analysis of Model Generation Methods

Random-Seed and Hetero-DNNs systems performed the best or second best in each column of Table 4a (i.e. among the systems with the same combination method). Looking into the per-model relevance irelev in Table 4b, Random Seed had the largest irelev in each column. irelev denotes the average accuracy of the models. Indeed, the ranking of irelev

Rethinking Fano s Inequality in Ensemble Learning

Table 4: Statistics of ensemble systems described in Section 4.2. Rows and columns list model generation and combination methods of Table 2, respectively. Each cell shows quantity of specific system s. Each quantity is average over eight tasks. Each system contains N = 15 models. Color shows rank within each column (brighter is better).

(a) Error rate and lower bound reductions. Baseline values used in (8) and (9) were ER(s0): 15.5 %, LB(s0) by Btight(E): 2.8 %, LB(s0) by Btight(I): 2.8 %, and LB(s0) by B(I): 2.0 %.

Error rate reductions [%] Lower bound reductions by Btight(E) [%]

Voting Log R SVM RForest Voting Log R SVM RForest

Random-Hy P 6.8 1.4 8.5 0.9 8.4 1.2 7.6 0.7 5.8 1.4 7.2 1.0 7.5 1.1 6.6 0.7 Bagging 7.3 2.0 8.2 1.9 9.0 1.9 5.8 2.0 6.8 2.1 6.9 2.0 8.2 2.1 4.8 2.0 Random-Seed 9.6 1.2 10.1 0.7 9.5 0.7 8.7 0.2 8.8 1.2 8.5 0.7 8.7 0.8 7.7 0.1 Hetero-DNNs 6.5 1.4 11.9 0.8 10.4 1.5 9.1 1.9 5.5 1.4 10.3 0.8 9.2 1.5 7.8 1.9

(b) Breakdown of ensemble strength defined in (7). We show per-model metric values defined as i{relev, redun, combloss} = I{relev, redun, combloss}/N. Thus, E = (irelev iredun icombloss) N holds. For intuitive understanding, all values are normalized by ensemble strength of baseline Es0, for example, Irelev = ˆIrelev/Es0 100 where ˆIrelev is raw value.

Ensemble strength E Per-model metric values

irelev iredun icombloss irelev iredun Voting Log R SVM RForest Voting Log R SVM RForest

Baseline (s0) 100 (the raw value is 0.478) 100 0 0 0 0 0 100

Random-Hy P 105.0 1.4 107.4 1.0 107.5 1.3 105.2 0.7 89.4 0.9 74.5 0.9 7.96 0.36 7.80 0.34 8.00 0.37 7.94 0.29 15.0 1.3 Bagging 105.3 1.9 105.7 1.8 108.0 1.1 103.1 1.4 90.1 0.3 73.5 0.3 9.56 0.08 9.54 0.03 9.40 0.05 9.71 0.05 16.6 0.4 Random-Seed 109.2 1.1 108.5 0.9 108.8 1.3 107.7 1.0 100.0 0.0 84.9 0.3 7.79 0.21 7.84 0.26 7.82 0.19 7.89 0.23 15.1 0.3 Hetero-DNNs 104.5 1.1 110.9 0.8 110.6 1.0 107.8 1.7 86.0 0.4 69.9 0.2 9.16 0.24 8.73 0.26 8.75 0.29 8.94 0.19 16.1 0.4

coincided with the ranking of the average error rate shown as avg. in Table 5. Random-Seed had the most accurate models because it used only the best DNN type (cf. Hetero-DNNs), all the dataset instances (cf. Bagging), and only the best hyperparameter (cf. Random-HyP).

On the per-model redundancy $i_{\mathrm{redun}}$, Hetero-DNNs had a value smaller than that of Random-Seed (i.e., it had more diverse models), benefitting from the diverse DNN types.

For the per-model combination loss $i_{\mathrm{combloss}}$^5, Random-Seed had the smallest value in the voting column. We attribute this to it having the lowest diversity (i.e., the largest $i_{\mathrm{redun}}$), similarly to Table 1a. However, the meta-estimators (LogR, SVM, and RForest) reduced $i_{\mathrm{combloss}}$ more on Hetero-DNNs than on Random-Seed. This pushed the performance of Hetero-DNNs to the highest among all the systems. Regarding this phenomenon, Hetero-DNNs is analogous to Tables 1b to 1c and Random-Seed to Table 1d: since Hetero-DNNs uses various DNN types of varied accuracies, the information on Y is concentrated more on the better models compared with Random-Seed. Thus, Hetero-DNNs benefitted more from the meta-estimators, which focused on these models and recovered the information to reduce $i_{\mathrm{combloss}}$, similarly to the transition from Tables 1b to 1c. This phenomenon did not occur in Random-Seed since it uses models of similar accuracies, similarly to Table 1d.

^5 The magnitude of $i_{\mathrm{combloss}}$ is smaller than those of $i_{\mathrm{relev}}$ and $i_{\mathrm{redun}}$. However, $i_{\mathrm{relev}}$ and $i_{\mathrm{redun}}$ are strongly correlated, and thus $i_{\mathrm{combloss}}$ is not negligible compared with $i_{\mathrm{relev}} - i_{\mathrm{redun}}$, as shown. Thus, combination loss is significant.

Indeed, we can see the information concentration and how the meta-estimator handled such information more directly. To this end, we propose an auxiliary metric of n-model concentration $\mathrm{Conc}^{N}_{n}$ (Appendix J), which measures the degree to which the amount of information given by N models $O = \{O_1, \ldots, O_N\}$ is concentrated on the top-n models $\Omega^{N,\max}_{n}$:

$$
\mathrm{Conc}^{N}_{n}(O, Y) = \frac{I(\Omega^{N,\max}_{n}; Y) - I(\Omega^{N,\min}_{n}; Y)}{I(O; Y)} \in [0, 1], \tag{10}
$$

$$
I(\Omega^{N,\max/\min}_{n}; Y) = \max/\min_{\{i_1, i_2, \ldots, i_n\} \in \Omega^{N}_{n}} I(\{O_{i_1}, O_{i_2}, \ldots, O_{i_n}\}; Y).
$$

Table 5 shows $\mathrm{Conc}^{N=15}_{n=3}$ for each model generation method. Intuitively, $\mathrm{Conc}^{N}_{n}$ and the standard deviation of model error rates, which denotes the variety in accuracies, were strongly correlated. Hetero-DNNs had a larger $\mathrm{Conc}^{N}_{n}$ and Random-Seed a smaller one, as expected. Table 6 shows that the meta-estimator for Hetero-DNNs distributed weight $W_t$ to each DNN type t in accordance with its error rate. Overall, we can see a clear analogy of Hetero-DNNs to Tables 1b and 1c, and of Random-Seed to Table 1d.
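The concentration metric can be illustrated with a minimal sketch. It assumes discrete model outputs and uses a plain plug-in (frequency count) estimate of mutual information over the full joint distribution, which is only practical for a handful of models; the function names `mutual_information` and `concentration` and the toy data are hypothetical and are not taken from the authors' code.

```python
import itertools
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def concentration(outputs, labels, n_top):
    """(max - min) of I(subset; Y) over size-n_top model subsets, divided by I(O; Y)."""
    subset_mi = []
    for idx in itertools.combinations(range(len(outputs)), n_top):
        # Treat the selected models' joint prediction as one discrete variable.
        joint = list(zip(*(outputs[i] for i in idx)))
        subset_mi.append(mutual_information(joint, labels))
    total = mutual_information(list(zip(*outputs)), labels)
    return (max(subset_mi) - min(subset_mi)) / total if total > 0 else 0.0


# Toy data (assumption): one perfect model and two uninformative ones.
labels = [0, 0, 1, 1, 0, 1, 0, 1]
perfect = list(labels)
constant = [0] * len(labels)
conc_skewed = concentration([perfect, constant, constant], labels, n_top=1)
conc_uniform = concentration([perfect, perfect, perfect], labels, n_top=1)
```

In this toy setting all information sits in a single model, so `conc_skewed` reaches the upper end of the range, whereas three equally informative models give `conc_uniform` at the lower end.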

Bagging and Random-HyP performed the third or fourth best in each column of Table 4a. Similarly to Hetero-DNNs, they had a smaller $i_{\mathrm{relev}}$ and $i_{\mathrm{redun}}$ compared with Random-Seed (Table 4b). The smaller $i_{\mathrm{relev}}$ is attributed to Bagging using smaller subsets of training instances and Random-HyP using randomly sampled non-optimal hyperparameters, which degraded model accuracies. The smaller $i_{\mathrm{redun}}$ is due to the diverse instance sets of Bagging and the diverse hyperparameters of Random-HyP.

Table 5: The information concentration metric $\mathrm{Conc}^{N=15}_{n=3}$. See (10). Color shows rank (brighter is better) in each column. Values are averages over eight tasks.

| Model generation | $\mathrm{Conc}^{N=15}_{n=3}$ | Error rates of models [%]: avg. | Error rates of models [%]: std. |
|---|---|---|---|
| Baseline (s0) | 0 | 16.1 ± 0.9 | - |
| Random-HyP | 0.28 ± 0.02 | 17.3 ± 0.1 | 3.4 ± 0.2 |
| Bagging | 0.08 ± 0.00 | 17.1 ± 0.0 | 0.8 ± 0.0 |
| Random-Seed | 0.08 ± 0.00 | 15.5 ± 0.1 | 0.7 ± 0.1 |
| Hetero-DNNs | 0.20 ± 0.00 | 18.1 ± 0.0 | 2.3 ± 0.0 |

Table 6: Logistic regression meta-estimator weight $W_t$ distributed to each DNN type t. N = 15 models are generated by Hetero-DNNs (i.e., 3 models per DNN type). Values are averages over eight tasks. See Appendix F.2 for details.

| DNN t | Average error rate of models [%] | $W_t$ |
|---|---|---|
| RoBERTa | 15.1 ± 0.3 | 0.49 |
| ELECTRA | 17.0 ± 0.1 | 0.40 |
| BART | 17.9 ± 0.1 | 0.25 |
| BERT | 18.7 ± 0.1 | 0.24 |
| ALBERT | 20.4 ± 0.1 | 0.21 |

Bagging had the largest $i_{\mathrm{combloss}}$ in each column, and more importantly, the meta-estimators (LogR, SVM, and RForest) could not reduce $i_{\mathrm{combloss}}$ as much as they could on Hetero-DNNs and Random-HyP. This phenomenon should be due to Bagging's smaller $\mathrm{Conc}^{N}_{n}$ (Table 5), which is the result of models of similar accuracies, similarly to Table 1d. Such models were generated because Bagging used dataset subsets of the same size.

6.3. Analysis of Model Combination Methods

Stacking (i.e., LogR, SVM, and RForest) generally outperformed voting in each row of Table 4a. This is due to the smaller $i_{\mathrm{combloss}}$, since $i_{\mathrm{relev}}$ and $i_{\mathrm{redun}}$ are the same in each row. Simply put, the meta-estimators combined the models better and thereby reduced $i_{\mathrm{combloss}}$.

Interestingly, the simple meta-estimator of LogR performed on par with or better than the complex ones of SVM and RForest. We conjecture that the DNNs' predictions were so good that simple combinations were enough, and complex ones were superfluous.

7. Other Considerations

The limitations of the study are listed in Appendix A. Ethical matters and social impacts are discussed in Appendix B.

8. Conclusion

We proposed a novel and fundamental theoretical framework that measures a given ensemble system on the basis of a well-grounded set of metrics. We also validated and demonstrated the framework through experiments on DNN ensemble systems. In the future, we will analyze a broader range of systems, including recent DNN ensemble systems optimized in an end-to-end manner. We will also incorporate combination loss into ensemble systems as an optimization target (i.e., as a loss term) for better performance.

Acknowledgement

We thank the three anonymous reviewers and the meta-reviewer, who gave us insightful comments and suggestions. Computational resources of AI Bridging Cloud Infrastructure (ABCI) provided by the National Institute of Advanced Industrial Science and Technology (AIST) were used. We thank Dr. Masaaki Shimizu at Hitachi for arranging additional computational resources. We thank Dr. Naoaki Okazaki, professor at Tokyo Institute of Technology, for the keen comments.

References

Atwood, J., Halpern, Y., Baljekar, P., Breck, E., Sculley, D., Ostyakov, P., Nikolenko, S. I., Ivanov, I., Solovyev, R., Wang, W., et al. The inclusive images competition. In The NeurIPS '18 Competition, pp. 155–186. Springer, 2020.

Breiman, L. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

Breiman, L. Random forests. Machine Learning, 45(1):5–32, 2001. ISSN 0885-6125. doi: 10.1023/A:1010933404324. URL http://dx.doi.org/10.1023/A%3A1010933404324.

Brown, G. An information theoretic perspective on multiple classifier systems. In International Workshop on Multiple Classifier Systems, pp. 344–353, 2009.

Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2924–2936, 2019.

Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations, 2020.

Cunningham, P. and Carney, J. Diversity versus quality in classification ensembles based on feature selection. In European Conference on Machine Learning, pp. 109–116. Springer, 2000.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186, 2019.

Dolan, W. B. and Brockett, C. Automatically constructing a corpus of sentential paraphrases. In International Workshop on Paraphrasing, 2005.

Fano, R. M. Transmission of information: A statistical theory of communications. American Journal of Physics, 29(11):793–794, 1961.

Geman, S., Bienenstock, E., and Doursat, R. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58, 1992.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.

Khot, T., Sabharwal, A., and Clark, P. SciTail: A textual entailment dataset from science question answering. In AAAI Conference on Artificial Intelligence, volume 32, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Kohavi, R., Wolpert, D. H., et al. Bias plus variance decomposition for zero-one loss functions. In ICML, volume 96, pp. 275–83, 1996.

Kumar, A., Kim, J., Lyndon, D., Fulham, M., and Feng, D. An ensemble of fine-tuned convolutional neural networks for medical image classification. IEEE Journal of Biomedical and Health Informatics, 21(1):31–40, 2016.

Kuncheva, L. I. and Whitaker, C. J. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51(2):181–207, 2003.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, volume 30, 2017.

Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations, 2020.

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Annual Meeting of the Association for Computational Linguistics, pp. 7871–7880, 2020.

Liu, W., Zhang, M., Luo, Z., and Cai, Y. An ensemble deep learning method for vehicle type classification on visual traffic surveillance sensors. IEEE Access, 5:24417–24425, 2017.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Ma, S. and Chu, F. Ensemble deep learning-based fault diagnosis of rotor bearing systems. Computers in Industry, 105:143–152, 2019.

Morio, G., Morishita, T., Ozaki, H., and Miyoshi, T. Hitachi at SemEval-2020 task 10: Emphasis distribution fusion on fine-tuned language models. In Workshop on Semantic Evaluation, pp. 1658–1664, 2020a.

Morio, G., Morishita, T., Ozaki, H., and Miyoshi, T. Hitachi at SemEval-2020 task 11: An empirical study of pre-trained transformer family for propaganda detection. In Workshop on Semantic Evaluation, pp. 1739–1748, 2020b.

Morishita, T., Morio, G., Horiguchi, S., Ozaki, H., and Miyoshi, T. Hitachi at SemEval-2020 task 8: Simple but effective modality ensemble for meme emotion recognition. In Workshop on Semantic Evaluation, pp. 1126–1134, 2020a.

Morishita, T., Morio, G., Ozaki, H., and Miyoshi, T. Hitachi at SemEval-2020 task 7: Stacking at scale with heterogeneous language models for humor recognition. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pp. 791–803, 2020b.

Omari, A. and Figueiras-Vidal, A. R. Post-aggregation of classifier ensembles. Information Fusion, 26:96–102, 2015.

Phang, J., Yeres, P., Swanson, J., Liu, H., Tenney, I. F., Htut, P. M., Vania, C., Wang, A., and Bowman, S. R. jiant 2.0: A software toolkit for research on general-purpose text understanding models. http://jiant.info/, 2020.

Platt, J. C. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large-Margin Classifiers, pp. 61–74. 1999.

Qummar, S., Khan, F. G., Shah, S., Khan, A., Shamshirband, S., Rehman, Z. U., Khan, I. A., and Jadoon, W. A deep learning ensemble approach for diabetic retinopathy detection. IEEE Access, 7:150530–150539, 2019.

Shannon, C. E. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017.

Shipp, C. A. and Kuncheva, L. I. Relationships between combination methods and measures of diversity in combining classifiers. Information Fusion, 3(2):135–148, 2002.

Skalak, D. B. et al. The sources of increased accuracy for two proposed boosting algorithms. In Integrating Multiple Learned Models Workshop, volume 1129, pp. 1133, 1996.

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL https://aclanthology.org/D13-1170.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.

Tumer, K. and Ghosh, J. Theoretical foundations of linear and order statistics combiners for neural pattern classifiers. IEEE Trans. Neural Networks, 1995.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355, 2018.

Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, volume 32, 2019.

Williams, A., Nangia, N., and Bowman, S. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/N18-1101.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. Transformers: State-of-the-art natural language processing. In Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, 2020.

Wolpert, D. H. Stacked generalization. Neural Networks, 5:241–259, 1992.

Yan, J., Yu, Y., Zhu, X., Lei, Z., and Li, S. Z. Object detection by labeling superpixels. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5107–5116, 2015. doi: 10.1109/CVPR.2015.7299146.

Zhou, Z.-H. and Li, N. Multi-information ensemble diversity. In Multiple Classifier Systems, pp. 134–144, 2010.


A. Limitations

The study has the following limitations:

• As stated in Section 1, the framework (Lemma 3.1) deals only with classification tasks.

• As stated in Section 3.2, the claims made on the toy ensemble systems of Table 1 are hypothetical rather than theoretically derived, although they explain certain aspects of the experimental results, as discussed in Section 6.

B. Ethics and Social Impacts

Ensemble learning is a generic technology to boost the performance of machine learning models. This study provides a theoretical framework on ensemble learning for evaluating a given ensemble system by a set of specific metrics. The framework enables us to reveal the strengths and weaknesses of ensemble systems on each metric, which will give us insights into the design of ensemble systems. Thus, this study should ultimately lead to better performance of machine learning models.

While it is possible that inappropriate use of improved machine learning models poses negative effects on society, we believe that this study does not directly pose negative effects on society.

C. Definitions

We show the definitions of the information-theoretical quantities used in this study. Below, we assume that S and T denote sets of discrete stochastic variables:

$$
S = \{S_1, S_2, \ldots, S_L\}, \quad L \in \mathbb{N},
$$

$$
T = \{T_1, T_2, \ldots, T_M\}, \quad M \in \mathbb{N},
$$

where $S_i$ and $T_i$ are discrete stochastic variables. We denote $s_i, t_i$ as the values of $S_i, T_i$, and $p$ as the probability distribution function.

Definition C.1 (Entropy of S).

$$
H(S) = -\sum_{s_1, \ldots, s_L} p(s_1, \ldots, s_L) \log_2 p(s_1, \ldots, s_L). \tag{C.1}
$$

Definition C.2 (Conditional entropy of T given S).

$$
H(T|S) = -\sum_{s_1, \ldots, s_L} \sum_{t_1, \ldots, t_M} p(s_1, \ldots, s_L, t_1, \ldots, t_M) \log_2 p(t_1, \ldots, t_M | s_1, \ldots, s_L). \tag{C.2}
$$

Definition C.3 (Mutual information between S and T).

$$
I(S; T) = H(T) - H(T|S). \tag{C.3}
$$

Definition C.4 (Multi-information of S).

$$
I_{\mathrm{multi}}(S) = \sum_{s_1, \ldots, s_L} p(s_1, \ldots, s_L) \log_2 \frac{p(s_1, \ldots, s_L)}{p(s_1) \cdots p(s_L)}. \tag{C.4}
$$

Definition C.5 (Conditional multi-information of T given S).

$$
I_{\mathrm{multi}}(T|S) = \sum_{s_1, \ldots, s_L} \sum_{t_1, \ldots, t_M} p(s_1, \ldots, s_L, t_1, \ldots, t_M) \log_2 \frac{p(t_1, \ldots, t_M | s_1, \ldots, s_L)}{p(t_1 | s_1, \ldots, s_L) \cdots p(t_M | s_1, \ldots, s_L)}. \tag{C.5}
$$

For the interpretation of (C.4) to (C.5), see Brown (2009) and Zhou & Li (2010).
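As an illustration of Definitions C.1 to C.5, the quantities can be estimated from samples with plug-in (frequency count) estimators. The sketch below is a minimal example under that assumption; it uses the standard identities H(T|S) = H(S, T) − H(S), I_multi(S) = Σ_i H(S_i) − H(S), and I_multi(T|S) = Σ_i H(T_i|S) − H(T|S), which are equivalent to the definitions above. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(samples):
    """Plug-in estimate of H(S) in bits; each sample is one hashable joint value."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())


def conditional_entropy(t, s):
    """H(T|S) via the chain rule H(S, T) - H(S)."""
    return entropy(list(zip(s, t))) - entropy(s)


def mutual_information(s, t):
    """I(S; T) = H(T) - H(T|S), as in (C.3)."""
    return entropy(t) - conditional_entropy(t, s)


def multi_information(columns):
    """I_multi(S) = sum_i H(S_i) - H(S_1, ..., S_L); one column per variable."""
    return sum(entropy(col) for col in columns) - entropy(list(zip(*columns)))


def conditional_multi_information(columns, s):
    """I_multi(T|S) = sum_i H(T_i|S) - H(T_1, ..., T_M | S)."""
    joint = list(zip(*columns))
    return sum(conditional_entropy(col, s) for col in columns) - conditional_entropy(joint, s)


# Toy check (assumption): two independent fair bits versus two identical bits.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
independent = multi_information([a, b])
duplicated = multi_information([a, a])
```

Independent variables share no information, so `independent` vanishes, while duplicating a fair bit yields one bit of multi-information.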

D. About Lemma 3.1

D.1. Full Proof

$$
\begin{aligned}
H(Y|O) + \underbrace{H(Y|\hat{Y}) - H(Y|O)}_{\text{combination loss}}
&= H(Y|\hat{Y}), \\
&\leq H_2(p_{\mathrm{err}}) + p_{\mathrm{err}} \log_2(Y_{\max} - 1) =: U(p_{\mathrm{err}}), \\
&\leq \underbrace{H_2(p_0) + H_2'(p_0)(p_{\mathrm{err}} - p_0) - \frac{m}{2}(p_{\mathrm{err}} - p_0)^2}_{=:\ \hat{H}_2(p_{\mathrm{err}})} + p_{\mathrm{err}} \log_2(Y_{\max} - 1), \\
&=: \hat{U}^{\mathrm{tight}}_{m,p_0}(p_{\mathrm{err}}).
\end{aligned}
\tag{D.1}
$$

The first inequality follows from Fano's inequality (Lemma 2.1). In the second inequality, we used the strong concavity of the binary entropy function $H_2(p_{\mathrm{err}})$ to upper bound it by a quadratic function $\hat{H}_2(p_{\mathrm{err}})$ tangent to $H_2(p_{\mathrm{err}})$ at $p_{\mathrm{err}} = p_0$. (D.1) holds for any $p_0 \in [0, 1]$ and $m \leq 4$.

$m$ represents the curvature of $\hat{H}_2(p_{\mathrm{err}})$. Setting $m = 4$ produces the most curved quadratic function $\hat{H}_2(p_{\mathrm{err}})$, and hence the tightest upper bound of $H_2(p_{\mathrm{err}})$. Then, decomposing $H(Y|O)$ on the left-hand side as in Lemma 2.3 and solving (D.1) for $p_{\mathrm{err}}$ derives Lemma 3.1.
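The quadratic upper bound used in the second inequality can be checked numerically. The following is a minimal sketch, assuming a tangent point of 0.155 (the baseline error rate of 15.5 % reported in Table 4a) and a grid over [0, 1]; the function names are hypothetical.

```python
import math


def h2(p):
    """Binary entropy in bits, with the convention H2(0) = H2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def h2_prime(p):
    """Derivative of H2 on the open interval (0, 1)."""
    return math.log2((1 - p) / p)


def h2_quadratic(p, p0, m=4.0):
    """Quadratic surrogate tangent to H2 at p0 with curvature m."""
    return h2(p0) + h2_prime(p0) * (p - p0) - 0.5 * m * (p - p0) ** 2


p0 = 0.155  # tangent point; here the baseline error rate (assumption for the demo)
grid = [i / 1000 for i in range(1001)]

# Gap between the surrogate and H2; it should never be negative for m <= 4.
gaps_m4 = [h2_quadratic(p, p0, m=4.0) - h2(p) for p in grid]
# The tangent line (m = 0) is a looser bound than the m = 4 quadratic.
gaps_m0 = [h2_quadratic(p, p0, m=0.0) - h2(p) for p in grid]
```

The gap is zero at the tangent point and grows as the error rate moves away from it, which is the reason the choice of the tangent point matters for tightness.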

The choice of p0 of Lemma 3.1 is discussed in Appendix D.2.

D.2. Which Choice of p0 is Preferable for Ensemble System Comparison

Lemma 3.1 gives lower bounds that depend on $p_0$. For fair comparisons of ensemble systems, we must first choose and fix a specific value of $p_0$ from $[0, 1]$. Any choice of $p_0$ is acceptable since it does not change the ranking of lower bounds. In our experiments, we chose the baseline error rate as our $p_0$ for the following reason.

As stated in Appendix D.1, we approximated the binary entropy function $H_2(p_{\mathrm{err}})$ by a quadratic function $\hat{H}_2(p_{\mathrm{err}})$ tangent to $H_2(p_{\mathrm{err}})$ at $p_{\mathrm{err}} = p_0$. Thus, the approximation error $\hat{U}^{\mathrm{tight}}_{m=4,p_0}(p_{\mathrm{err}}) - U(p_{\mathrm{err}})$ is the smallest when $p_0 \approx p_{\mathrm{err}}$, where $p_{\mathrm{err}} = B^{\mathrm{tight}}_{p_0}(E)$ is the actual lower bound obtained from the ensemble strength $E$ of each ensemble system. This means that we should choose a value of $p_0$ close to the error rate lower bounds of the target ensemble systems.

Since we do not know the error lower bounds of the systems before we choose $p_0$ and solve $p_{\mathrm{err}} = B^{\mathrm{tight}}_{p_0}(E)$, tuning the value of $p_0$ is somewhat complicated, although possible. Thus, in the experiments of this study, we chose the baseline error rate as our $p_0$ rather than tuning $p_0$. The baseline error rate is expected to be similar to the error rates of the ensemble systems, and hence it should not be much different from the lower bounds of the systems.

D.3. Comparison between Tightness of Lemma 3.1 and Lemma 2.3

Lemma 3.1 differs from Lemma 2.3 in the lower bound functions. That is, Lemma 3.1 uses $B^{\mathrm{tight}}_{p_0}(E)$ while Lemma 2.3 uses $B(E)$. In this section, we show that the bound function $B^{\mathrm{tight}}_{p_0}(E)$ is tighter (i.e., larger) than $B(E)$ if $E$ is in a specific range in which $B^{\mathrm{tight}}_{p_0}(E)$ is not much different from $p_0$. Hereafter we assume $p_0 \leq \frac{Y_{\max} - 1}{Y_{\max}}$, $B^{\mathrm{tight}}_{p_0}(E) \leq \frac{Y_{\max} - 1}{Y_{\max}}$, and $B(E) \leq \frac{Y_{\max} - 1}{Y_{\max}}$, where $\frac{Y_{\max} - 1}{Y_{\max}}$ is the error rate of a random guessing system on a balanced label dataset.

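Firstly, we show how the two lemmas are derived.

Lemma 3.1 is derived using (D.1) as follows:

1. Set $m = 4$. That is, we use $\hat{U}^{\mathrm{tight}}_{m=4,p_0}(p_{\mathrm{err}})$ for the upper bound function.

2. Solving for $p_{\mathrm{err}}$ derives Lemma 3.1: $p_{\mathrm{err}} \geq B^{\mathrm{tight}}_{p_0}(E)$.

Lemma 2.3 is derived in a similar way from:

$$
\begin{aligned}
H(Y|O) + \underbrace{H(Y|\hat{Y}) - H(Y|O)}_{\text{combination loss}}
&\leq \hat{U}^{\mathrm{tight}}_{m,p_0}(p_{\mathrm{err}}), \\
&\leq \hat{U}^{\mathrm{tight}}_{m,p_0}(p_{\mathrm{err}}) + p_{\mathrm{err}} \log_2 \frac{Y_{\max}}{Y_{\max} - 1} =: \hat{U}_{m,p_0}(p_{\mathrm{err}}).
\end{aligned}
\tag{D.2}
$$

1. Set $m = 0$, $p_0 = \frac{1}{2}$. That is, we use $\hat{U}_{m=0,p_0=\frac{1}{2}}(p_{\mathrm{err}})$ for the upper bound function.

2. Loosen the left-hand side as $H(Y|O) + \underbrace{H(Y|\hat{Y}) - H(Y|O)}_{\text{combination loss}} \geq H(Y|O)$, that is, ignore the combination loss.

3. Then, solving for $p_{\mathrm{err}}$ derives Lemma 2.3: $p_{\mathrm{err}} \geq B(I)$.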
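As a worked check of these steps (a sketch that follows only from the definitions of $\hat{U}^{\mathrm{tight}}_{m,p_0}$ and $\hat{U}_{m,p_0}$ above, not a restatement of Lemma 2.3 itself), substituting $m = 0$ and $p_0 = \frac{1}{2}$, and using $H_2(\frac{1}{2}) = 1$ and $H_2'(\frac{1}{2}) = 0$, gives:

$$
\begin{aligned}
\hat{U}_{m=0,p_0=\frac{1}{2}}(p_{\mathrm{err}})
&= H_2(\tfrac{1}{2}) + H_2'(\tfrac{1}{2})\,(p_{\mathrm{err}} - \tfrac{1}{2}) + p_{\mathrm{err}} \log_2(Y_{\max} - 1) + p_{\mathrm{err}} \log_2 \frac{Y_{\max}}{Y_{\max} - 1} \\
&= 1 + p_{\mathrm{err}} \log_2 Y_{\max}.
\end{aligned}
$$

Combined with step 2, this yields $H(Y|O) \leq 1 + p_{\mathrm{err}} \log_2 Y_{\max}$, that is, $p_{\mathrm{err}} \geq \frac{H(Y|O) - 1}{\log_2 Y_{\max}}$, which is the familiar weakened form of Fano's inequality. The upper bound function is linear in $p_{\mathrm{err}}$, so no curvature of $H_2$ is exploited.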

Viewing these, we can immediately show that if we use $p_0 = \frac{1}{2}$ for Lemma 3.1, its bound function is tighter, as $\forall E,\ B^{\mathrm{tight}}_{p_0=\frac{1}{2}}(E) \geq B(E)$. This follows from $\hat{U}^{\mathrm{tight}}_{m=4,p_0=\frac{1}{2}}(p_{\mathrm{err}}) \leq \hat{U}^{\mathrm{tight}}_{m=0,p_0=\frac{1}{2}}(p_{\mathrm{err}}) + p_{\mathrm{err}} \log_2 \frac{Y_{\max}}{Y_{\max} - 1} = \hat{U}_{m=0,p_0=\frac{1}{2}}(p_{\mathrm{err}})$. We also point out that Lemma 2.3 relies on the following assumptions, which may lead to a loose bound: (i) $m = 0$. This means that the upper bound function $\hat{U}_{m=0,p_0=\frac{1}{2}}(p_{\mathrm{err}})$ is a line. (ii) The existence of the positive term $p_{\mathrm{err}} \log_2 \frac{Y_{\max}}{Y_{\max} - 1}$.

When we use $p_0 \neq \frac{1}{2}$, which is more general, the tightness $B^{\mathrm{tight}}_{p_0}(E) \geq B(E)$ holds in limited ranges of $E$. As stated in Appendix D.2, the approximation error produced by $\hat{U}^{\mathrm{tight}}_{m,p_0}$ is the smallest if $B^{\mathrm{tight}}_{p_0}(E) \approx p_0$. Thus, roughly speaking, $B^{\mathrm{tight}}_{p_0}(E) \geq B(E)$ holds if $B^{\mathrm{tight}}_{p_0}(E) \approx p_0$, and $B^{\mathrm{tight}}_{p_0}(E) \leq B(E)$ holds if $B^{\mathrm{tight}}_{p_0}(E)$ and $p_0$ differ much in their values.
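The claim for a tangent point of one half can be checked numerically by inverting both upper bound functions at the same left-hand-side value. The sketch below makes several assumptions: `h` is a hypothetical stand-in for that left-hand-side value (so the difference caused by the combination loss is ignored), the bounds are clamped to the interval between zero and the random-guessing error rate, and the inversion uses bisection instead of the closed form of Lemma 3.1.

```python
import math


def h2(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def u_tight(p, y_max, p0=0.5, m=4.0):
    """Quadratic surrogate of H2 around p0 plus the p * log2(y_max - 1) term."""
    slope = math.log2((1 - p0) / p0)
    quad = h2(p0) + slope * (p - p0) - 0.5 * m * (p - p0) ** 2
    return quad + p * math.log2(y_max - 1)


def u_loose(p, y_max):
    """Linear bound with m = 0 and p0 = 1/2; equals 1 + p * log2(y_max)."""
    return u_tight(p, y_max, p0=0.5, m=0.0) + p * math.log2(y_max / (y_max - 1))


def lower_bound(u, h, y_max, iters=80):
    """Smallest p in [0, (y_max - 1) / y_max] with u(p) >= h, for increasing u."""
    lo, hi = 0.0, (y_max - 1) / y_max
    if u(lo, y_max) >= h:
        return lo
    if u(hi, y_max) <= h:
        return hi
    for _ in range(iters):
        mid = (lo + hi) / 2
        if u(mid, y_max) < h:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Compare both bounds over a grid of entropy values for 2, 3, and 4 classes.
results = []
for y_max in (2, 3, 4):
    for step in range(101):
        h = math.log2(y_max) * step / 100
        results.append((lower_bound(u_tight, h, y_max), lower_bound(u_loose, h, y_max)))
```

On this grid the bound obtained from the quadratic function is never smaller than the one obtained from the linear function, in line with the argument above.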

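We can discuss the details as follows. The tightness condition on ensemble strength $E$ is given by:

$$
B^{\mathrm{tight}}_{p_0}(E) \geq B(E). \tag{D.3}
$$

Let the lower bound function be $p_{p_0}(E) = B^{\mathrm{tight}}_{p_0}(E)$. Solving (D.3) for $E$ gives the range of $p_{p_0}(E)$ where the tightness holds:

$$
p_{p_0}(E) \leq \min(p_0 + \Delta^{+}_{p_0},\ p_0 + \Delta^{-}_{p_0}), \tag{D.4}
$$

$$
\max(p_0 + \Delta^{+}_{p_0},\ p_0 + \Delta^{-}_{p_0}) \leq p_{p_0}(E), \tag{D.5}
$$

$$
\Delta^{\pm}_{p_0} := \tau(p_0) \left( 1 \pm \sqrt{1 - \frac{1}{2}\, \frac{1 - H_2(p_0) - p_0 \log_2 \frac{Y_{\max} - 1}{Y_{\max}}}{\tau(p_0)^2}} \right),
$$

$$
\tau(p_0) := \frac{1}{4} \left( \frac{dH_2}{dp}(p_0) - \log_2 \frac{Y_{\max}}{Y_{\max} - 1} \right).
$$

We assumed $1 - \frac{1}{2}\, \frac{1 - H_2(p_0) - p_0 \log_2 \frac{Y_{\max} - 1}{Y_{\max}}}{\tau(p_0)^2} \geq 0$. Otherwise, the tightness (D.3) always holds.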

We proceed by specifying $p_0$. Firstly, suppose $p_0$ is mildly small: $p_0 \leq \frac{Y_{\max} - 1}{2Y_{\max} - 1}$. Then, we can show $\tau(p_0) \geq 0$. Thus, $p_0 + \Delta^{-}_{p_0} \leq p_0 + \Delta^{+}_{p_0}$ holds, and (D.4) becomes:

$$
p_{p_0}(E) \leq p_0 + \Delta^{-}_{p_0}. \tag{D.6}
$$

Additionally, we can show that $\Delta^{-}_{p_0} \geq 0$. Thus, (D.6) shows that if the lower bound $p_{p_0}(E)$ is not much larger than $p_0$, the tightness (D.3) holds. In particular, if $p_{p_0}(E) \leq p_0$, the tightness holds. This condition applies to the experiments of this study. We have also directly shown that $B^{\mathrm{tight}}_{p_0}(I) > B(I)$ in Table 4a.

If $p_0$ is large, $p_0 \geq \frac{Y_{\max} - 1}{2Y_{\max} - 1}$, we can show $\tau(p_0) \leq 0$. Thus, $p_0 + \Delta^{+}_{p_0} \leq p_0 + \Delta^{-}_{p_0}$ holds, and (D.5) becomes:

$$
p_{p_0}(E) \geq p_0 + \Delta^{-}_{p_0}. \tag{D.7}
$$

Additionally, we can show that $\Delta^{-}_{p_0} \leq 0$. Thus, (D.7) shows that if the lower bound $p_{p_0}(E)$ is not much smaller than $p_0$, the tightness (D.3) holds.


E. Details of Experimental Setup

E.1. Models

DNN types: Table E.7 shows the five types of pre-trained language models used in this study. Pre-trained language models are essentially large neural networks with self-attention layers that are trained on huge text corpora in an unsupervised manner. These models have been shown to obtain state-of-the-art performance when fine-tuned on downstream tasks (Liu et al., 2019; Lan et al., 2020; Clark et al., 2020; Devlin et al., 2019; Lewis et al., 2020). In addition, since they differ in terms of model architecture and pre-training method, they should produce strong diversity, and hence are suitable for ensembles.

Fine-tuning procedures: We trained each DNN on each downstream task following the standard practice of language model fine-tuning (see Devlin et al. (2019) for example) as follows.

We added a new softmax layer on top of the embedding layers of the DNNs. We preprocessed the input text by the following steps: (i) we tokenized the input text with a DNN-type-specific tokenizer, (ii) if the text included more than two sentences, we added DNN-type-specific separator tokens between sentences, (iii) we tensorized each token into a one-hot vector using DNN-type-specific vocabulary.

We trained these models on the training sets of the tasks. Validation sets were used only during the preliminary experiments to adjust some hyperparameters (shown below). Please refer to Appendix F.4 for the details of the dataset splitting strategy.

We used the hyperparameters shown in Table E.8 to finetune all of the DNN types. The values were chosen on the basis of the original papers (Liu et al., 2019; Lan et al., 2020; Clark et al., 2020; Devlin et al., 2019; Lewis et al., 2020) and our preliminary experiments. Note that language models require only a few epochs for convergence.

Some of the ensemble methods in Table 2 use different seeds for fine-tuning to produce diverse DNN models. In our study, seeds affect (i) the initial weights of the softmax layer, (ii) the hidden units dropped by dropout, and (iii) the shuffling order of the training instances.

Implementations: We implemented the fine-tuning of DNNs described here using the jiant library (Phang et al., 2020) (v2.2.0)^6, which in turn utilizes Hugging Face's Transformers library (Wolf et al., 2020). Jiant enables us to fine-tune various types of pre-trained language models on various NLP tasks. See our code for details.

^6 GitHub hash: 961bd577f736449956ddb2c15dcfce68bbb75e59

E.2. Ensemble Systems

For the random hyperparameter sampling of Random-HyP, we sampled the fine-tuning learning rate since it affects the resulting model the most. We sampled the learning rate around the best value of 3e-5, i.e., from [1e-5, 1e-4], as shown in Table E.8.

The baseline system s0 was a single DNN (i.e., no ensemble) that performed the best among the DNNs. These baselines are shown in bold in Table E.7.

We implemented the model generation methods in Table 2 by ourselves.

We implemented the model combination methods in Table 2 using scikit-learn^7. For the training of the Stacking meta-estimators, we used the hyperparameters shown in Table E.9. We tuned some of the hyperparameters using scikit-learn's GridSearchCV with 5-fold cross validation. Appendix F gives other details on the Stacking ensemble used in this study.

E.3. Estimation of metric values and lower bound

Trick of MTI

In our experiments, we estimated the three metric values on the basis of the frequency distribution observed for the datasets. We used the trick of MTI introduced by Zhou & Li (2010), which approximates the quantities appearing in the three metrics that depend on the high-dimensional stochastic variables $O$. Please refer to Zhou & Li (2010) for more details.

We repeat the three terms of Lemma 3.1 below:

$$
I_{\mathrm{relev}}(O, Y) = \sum_{i=1}^{N} I(O_i; Y),
$$

$$
I_{\mathrm{redun}}(O, Y) = I_{\mathrm{multi}}(O) - I_{\mathrm{multi}}(O|Y),
$$

$$
I_{\mathrm{combloss}}(O, Y, \hat{Y}) = H(Y|\hat{Y}) - H(Y|O).
$$

From the above, it can be seen that some terms (i.e., $I_{\mathrm{multi}}(O)$, $I_{\mathrm{multi}}(O|Y)$, and $H(Y|O)$) depend on the high-dimensional variable $O = \{O_1, \ldots, O_N\}$, where $N$ is the number of models. Since $N$ can be as large as 30 in our experiments, these terms might not be estimated reliably because the counts become sparse for the limited number of dataset instances.

^7 https://scikit-learn.org/stable/

Table E.7: DNNs used in the study and their error rates for each task. The convention of variant names follows Hugging Face's Transformers library (Wolf et al., 2020). Bold shows the best model in each task, which is used as the baseline s0 stated in Section 4.2.

| DNN type | variant | avg. | BoolQ | CoLA | Cosmos QA | MNLI | MRPC | QQP | SciTail | SST |
|---|---|---|---|---|---|---|---|---|---|---|
| RoBERTa (Liu et al., 2019) | base | 15.5 ± 0.3 | 24.1 ± 0.6 | 15.6 ± 0.2 | 28.2 ± 0.5 | 18.7 ± 1.2 | 13.6 ± 0.5 | 14.1 ± 0.8 | 4.2 ± 0.2 | 5.8 ± 0.7 |
| ELECTRA (Clark et al., 2020) | base-discriminator | 17.3 ± 0.3 | 23.1 ± 1.3 | 17.0 ± 0.5 | 29.8 ± 0.7 | 22.6 ± 1.0 | 13.3 ± 0.7 | 18.5 ± 0.9 | 7.1 ± 0.2 | 5.7 ± 0.5 |
| BART (Lewis et al., 2020) | base | 17.9 ± 0.2 | 25.5 ± 1.3 | 20.9 ± 0.5 | 30.1 ± 0.5 | 22.3 ± 0.8 | 15.8 ± 0.7 | 15.9 ± 1.2 | 4.7 ± 0.3 | 8.3 ± 0.5 |
| BERT (Devlin et al., 2019) | base-uncased | 18.7 ± 0.1 | 26.0 ± 0.8 | 17.2 ± 0.4 | 34.1 ± 0.6 | 26.3 ± 0.2 | 17.1 ± 0.7 | 16.5 ± 0.7 | 4.5 ± 0.2 | 8.0 ± 0.7 |
| ALBERT (Lan et al., 2020) | base-v1 | 20.4 ± 0.1 | 25.3 ± 2.2 | 19.5 ± 0.2 | 43.2 ± 0.1 | 27.1 ± 0.3 | 14.5 ± 0.4 | 18.8 ± 0.8 | 4.9 ± 0.1 | 9.6 ± 0.3 |

Table E.8: Hyperparameters used for fine-tuning of DNNs.

| hyperparameter | value |
|---|---|
| learning rate | 3e-5 ([1e-5, 1e-4] for the random sampling of Random-HyP) |
| optimizer | Adam (Kingma & Ba, 2015) (ϵ = 1e-8) with linear warmup (data size proportion = 0.1), described in (Devlin et al., 2019) |
| gradient clipping | 1.0 |
| gradient accumulation steps | 1 |
| epochs | 5 |
| dropout | DNN-specific values (follows jiant (Phang et al., 2020)) |
| training batch size | 16 |
| inference batch size | 32 |
| number of softmax layer | 1 |

Table E.9: Meta-estimator hyperparameters. Hyperparameter names follow scikit-learn. Most of the hyperparameters are set to the default values of scikit-learn (version 0.22.2).

| meta-estimator | hyperparameter | value / search range |
|---|---|---|
| logistic regression | C | [1e-2, 3e-2, 1e-1, 3e-1, 1e0] |
| logistic regression | penalty | L2 |
| logistic regression | solver | liblinear |
| logistic regression | max_iter | 1000 |
| logistic regression | multi_class | auto |
| logistic regression | random_state | 0 |
| SVM | C | [1e-2, 3e-2, 1e-1, 3e-1, 1e0] |
| SVM | max_iter | -1 |
| SVM | decision_function_shape | ovr |
| SVM | random_state | 0 |
| Random Forest | ccp_alpha | [0.0, 0.03, 0.1, 0.3] |
| Random Forest | random_state | 0 |
| Random Forest | criterion | gini |
| Random Forest | max_depth | None |

Thus, we use the trick of MTI introduced by Zhou & Li (2010), which approximates the quantities by replacing $O$ with its smaller subset $\Omega$ as follows:

$$
I_{\mathrm{multi}}(O) = \sum_{i=1}^{N} I(O_i; O_{1:i-1}) \geq \sum_{i=1}^{N} \max_{\Omega^{i-1}_{k}} I(O_i; \Omega^{i-1}_{k}), \tag{E.1}
$$

$$
I_{\mathrm{multi}}(O|Y) = \sum_{i=1}^{N} I(O_i; O_{1:i-1}|Y) \geq \sum_{i=1}^{N} \max_{\Omega^{i-1}_{k}} I(O_i; \Omega^{i-1}_{k}|Y), \tag{E.2}
$$

$$
H(Y|O) \leq \min_{\Omega^{N}_{k}} H(Y|\Omega^{N}_{k}), \tag{E.3}
$$

where $\Omega^{i} = \{O_1, \ldots, O_i\}$, and $\Omega^{i-1}_{k}$ is a subset of $\Omega^{i-1}$ of size $k$.

The first equalities of (E.1) and (E.2) were proved by Zhou & Li (2010). The last inequality in each equation is understood as follows. By replacing $O$ with its subset $\Omega^{i}_{k}$, we lose some amount of information carried by $O$. Thus, this transformation might make the mutual information in (E.1) and (E.2) smaller than the original value and the entropy in (E.3) larger. However, if we find the $\Omega^{i}_{k}$ that contains the largest amount of information (corresponding to the max and min operations in each equation), the difference from the original value (i.e., the approximation error) is the smallest.

Zhou & Li (2010) empirically showed that the method works well and produces an almost exact value. In our experiments, we used $k = 3$ ($\mathrm{MTI}_{k=3}$).
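The subset approximation of (E.3) can be sketched as follows. This is a minimal illustration with plug-in estimates and an exhaustive search over size-k subsets; it is not the authors' implementation, and the function names and the XOR toy data are assumptions chosen to show how the choice of k changes the estimate.

```python
import itertools
import math
from collections import Counter


def entropy(samples):
    """Plug-in estimate of entropy in bits."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())


def conditional_entropy(y, x):
    """H(Y|X) = H(X, Y) - H(X)."""
    return entropy(list(zip(x, y))) - entropy(x)


def mti_conditional_entropy(outputs, labels, k=3):
    """Approximate H(Y|O) by the minimum of H(Y|subset) over size-k model subsets."""
    k = min(k, len(outputs))
    best = float("inf")
    for idx in itertools.combinations(range(len(outputs)), k):
        # Joint prediction of the selected models as one discrete variable.
        subset = list(zip(*(outputs[i] for i in idx)))
        best = min(best, conditional_entropy(labels, subset))
    return best


# Toy data (assumption): the label is the XOR of models a and b; c is constant.
a = [0, 0, 1, 1] * 2
b = [0, 1, 0, 1] * 2
c = [0] * 8
y = [0, 1, 1, 0] * 2

approx_k1 = mti_conditional_entropy([a, b, c], y, k=1)
approx_k2 = mti_conditional_entropy([a, b, c], y, k=2)
exact = conditional_entropy(y, list(zip(a, b, c)))
```

With single-model subsets the approximation misses the information that only the pair of models carries, while pairs already recover the exact value in this toy case; the subset size trades estimation reliability against approximation error.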


On the choice of p0

We set the approximate error rate p0 in (6) as the error rate of the baseline s0 defined in Section 4.2. We state the reason for this in Appendix D.2.

Table E.10 details the eight tasks used in this study.


Table E.10: Tasks used in this study.Mmajority of tasks are from GLUE benchmark (Wang et al., 2018) (shown as ) and Super GLUE benchmark (Wang et al., 2019) (shown as ). All datasets are publicly available.

| Task | Dataset size | # classes (Ymax) | Description |
|---|---|---|---|
| Boolq (Boolean Question) (Clark et al., 2019) | 9.5k | 2 | We are required to answer yes or no to a given question about a given passage. The questions occur naturally in the Google search engine rather than being artificially built. Answering them often requires querying for complex, non-factoid information and difficult entailment-like inference. |
| CoLA (Corpus of Linguistic Acceptability) (Dolan & Brockett, 2005) | 8.5k | 2 | We are required to judge the linguistic acceptability (i.e., grammatical or ungrammatical) of a given text such as "What did Bill buy potatoes?". The texts are drawn from books and journal articles on linguistic theory. Answering the questions requires rich grammatical knowledge, from local word dependencies such as subject-verb-object order to non-local dependencies. |
| CosmosQA (Khot et al., 2018) | 25k | 4 | After reading a short narrative passage, we are required to answer a question about the passage (such as "What's a possible reason the writer needed someone to dress him every morning?") by choosing one answer from four candidates. The passages are taken from blogs on the web and personal narratives. Understanding the narrative requires common sense, such as inference on the causes and effects of events, even when they are not mentioned explicitly in the text. |
| MNLI (Multi-Genre Natural Language Inference) (Williams et al., 2018) | 400k (10k used) | 3 | Given two pieces of text, we answer the relationship of one piece to the other from three choices: "entails", "neutral", "contradicts". The dataset is composed of texts from various distinct genres of written English. An example pair is "At 8:34, the Boston Center controller received a third transmission from American 11" and "The Boston Center controller got a third transmission from American 11." Answering the question requires overall natural language understanding ability, e.g., handling lexical entailment, quantification, coreference, tense, belief, modality, and lexical and syntactic ambiguity. |
| MRPC (Microsoft Research Paraphrase Corpus) (Dolan & Brockett, 2005) | 3k | 2 | We are required to judge whether two given sentences are semantically equivalent. The sentences are automatically extracted from online news sources and Twitter. An example pair is "Charles O. Prince, 53, was named as Mr. Weill's successor." and "Mr. Weill's longtime confidant, Charles O. Prince, 53, was named as his successor." Recognizing such paraphrases is a fundamental skill needed for various tasks in NLP. |
| QQP (Quora Question Pairs) (footnote 8) | 300k (10k used) | 2 | We are required to determine whether a pair of questions is semantically equivalent. The questions are taken from the social Q&A website Quora. This skill is used by question-answering systems to recognize semantically identical questions with different linguistic expressions. |
| SciTail (Khot et al., 2018) | 23k | 4 | We are required to answer a given scientific question such as "Which of the following best explains how stems transport water to other parts of the plant?" by choosing one answer from four candidates. We have access to additional relevant text. The questions arise naturally on the web rather than being artificially created. |
| SST (Stanford Sentiment Treebank) (Socher et al., 2013) | 50k (10k used) | 2 | We predict the sentiment label (i.e., positive or negative) of a given sentence. The sentences are taken from movie reviews. The task requires an understanding of the compositionality of language. |


F. Stacking Ensemble

Figure F.4: Stacking ensemble used in this study. (Diagram: the predictions of several models are fed into a meta-estimator.)

Here, we give details of the stacking ensemble used in this study.

F.1. Architecture:

Figure F.4 illustrates the stacking ensemble used in this study. We used a two-layer stacking ensemble in which the first-layer models are fine-tuned DNNs and the second-layer model (i.e., the meta-estimator) is another classification model. For the meta-estimator, we used logistic regression, a Support Vector Machine (Platt, 1999) with an RBF kernel, and Random Forest (Breiman, 2001). As the inputs of the meta-estimator, we used the class labels predicted by the models.

Below, we show the details for the logistic regression meta-estimator. The meta-estimator estimates the probability $p_{i,c} \in [0, 1]$ that a given instance $i$ belongs to class $c$ from the class labels predicted by the $N$ models, $\hat{\mathbf{y}}_i = \{\hat{y}_i^1, \hat{y}_i^2, \ldots, \hat{y}_i^N\}$ with $\hat{y}_i^m \in \{0, 1\}$, as:

$$p_{i,c} = \frac{1}{1 + \exp(-l_{i,c})}, \qquad l_{i,c} = w_c^0 + \sum_{m=1}^{N} w_c^m \, \hat{y}_i^m.$$

The class with the largest $p_{i,c}$ is chosen as the final answer. The meta-estimator is trained using the meta-feature dataset $D_{\mathrm{meta}} = \{(\hat{\mathbf{y}}_1, y_1), (\hat{\mathbf{y}}_2, y_2), \ldots, (\hat{\mathbf{y}}_{|D|}, y_{|D|})\}$, where $y_i$ denotes the ground-truth label. Details of the meta-estimator training are shown in E.2 and F.4.
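As a minimal sketch of the logistic regression meta-estimator described above, the following Python snippet evaluates $p_{i,c}$ for one instance and picks the class with the largest value. The weights and predictions are hypothetical values chosen for illustration; they are not taken from the trained meta-estimators of this study.

```python
import math

def meta_estimator_probs(y_hat, w0, w):
    """Return p_{i,c} for every class c.

    y_hat: list of N labels (0 or 1) predicted by the N models for instance i.
    w0:    list of per-class biases w_c^0.
    w:     per-class lists of the N per-model weights w_c^m.
    """
    probs = []
    for c in range(len(w0)):
        logit = w0[c] + sum(w[c][m] * y_hat[m] for m in range(len(y_hat)))
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs

def final_answer(y_hat, w0, w):
    """Choose the class with the largest p_{i,c}."""
    probs = meta_estimator_probs(y_hat, w0, w)
    return max(range(len(probs)), key=lambda c: probs[c])

# Hypothetical weights for N = 3 models and two classes (illustration only).
w0 = [0.5, -0.5]
w = [[-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]]
y_hat = [1, 1, 0]  # labels predicted by the three models

probs = meta_estimator_probs(y_hat, w0, w)
final = final_answer(y_hat, w0, w)
```

Here two of the three models predict label 1, so the logit of class 1 is larger and class 1 is chosen.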

F.2. Weight Distribution of Table 6

The DNN-type-wise weight sum mentioned in Table 6 is calculated as follows:

$$\sum_{m \in M_t} \left| w^m_{c=1} \right|,$$

where $m$ denotes the model index, $t$ a specific DNN type, and $M_t$ the set of indices of the models of DNN type $t$. Note that since our study used binary classification tasks, it suffices to look at $c = 1$.
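The weight sum above can be sketched in a few lines of Python. The weights and the type labels `type_a` and `type_b` are hypothetical placeholders, not values from Table 6.

```python
def type_wise_weight_sum(weights_c1, model_types):
    """Sum |w^m_{c=1}| over the models m of each DNN type t."""
    sums = {}
    for weight, dnn_type in zip(weights_c1, model_types):
        sums[dnn_type] = sums.get(dnn_type, 0.0) + abs(weight)
    return sums

# Hypothetical meta-estimator weights for class c = 1 and the type of each model.
weights_c1 = [0.8, -0.3, 0.1, -0.6]
model_types = ["type_a", "type_a", "type_b", "type_b"]

sums = type_wise_weight_sum(weights_c1, model_types)
```

Taking the absolute value means that a model counts as influential whether its weight pushes toward or away from class 1.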

F.3. Meta-estimator training

Hyperparameters: The hyperparameters of the meta-estimators (i.e., logistic regression and the SVM used by the stacking ensemble) are shown in Table E.9.

Implementation: We implemented the model combination methods in Table 2 using scikit-learn 9.

F.4. Dataset splitting

To train the meta-estimator of a stacking ensemble, we must adopt a cross-validation-based dataset splitting strategy (Wolpert, 1992). Below, we describe this strategy, which is illustrated in Figure F.5. Note that the same data splitting strategy was used for the voting-based systems for fair comparison.

Training the stacking meta-estimators requires the meta-feature dataset $D_{\mathrm{meta}} = \{(\hat{\mathbf{y}}_1, y_1), (\hat{\mathbf{y}}_2, y_2), \ldots, (\hat{\mathbf{y}}_{|D|}, y_{|D|})\}$, as stated in Appendix F. Here, $\hat{\mathbf{y}}_i = \{\hat{y}_i^1, \hat{y}_i^2, \ldots, \hat{y}_i^N\}$, where $\hat{y}_i^m \in \{0, 1\}$ denotes the label predicted by model $m$ on instance $i$, and $y_i$ denotes the ground-truth label of the same instance. To prevent the meta-estimators from overfitting, the model predictions $\{\hat{\mathbf{y}}_1, \hat{\mathbf{y}}_2, \ldots\}$ must be label-leak-free. Thus, the model predictions are usually obtained using n-fold cross-validation as follows.

Meta-feature dataset construction: For each model m, we use n-fold cross-validation to obtain its label-leak-free predictions. Specifically:

1. Choose model m.

2. Divide the dataset $D = \{(x_1, y_1), \ldots, (x_{|D|}, y_{|D|})\}$ into n sets.

3. Set one of them (i.e., base-test-i) aside for testing later.

4. Train the model m on the remaining sets (i.e., base-train-i).

5. Apply the trained model m to the test set (i.e., base-test-i) to get label-leak-free predictions.

6. Repeat 3-5 for every i to collect label-leak-free predictions on the whole dataset, $\{\hat{y}^m_1, \hat{y}^m_2, \ldots, \hat{y}^m_{|D|}\}$, where $\hat{y}^m_i$ denotes the label predicted by model m on instance i, as stated in Appendix F.

7. Repeat 1-6 for every m to collect the label-leak-free predictions of all the models, $\{\hat{\mathbf{y}}_1, \hat{\mathbf{y}}_2, \ldots, \hat{\mathbf{y}}_{|D|}\}$. Then, merge the predicted labels and the ground-truth labels $\{y_1, \ldots, y_{|D|}\}$ into the meta-feature dataset $D_{\mathrm{meta}} = \{(\hat{\mathbf{y}}_1, y_1), (\hat{\mathbf{y}}_2, y_2), \ldots, (\hat{\mathbf{y}}_{|D|}, y_{|D|})\}$.
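The seven steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation used in this study: the two "models" are toy stand-ins for fine-tuned DNNs, the folds are assigned round-robin without shuffling, and all function names are hypothetical.

```python
from collections import Counter

def make_folds(num_instances, n):
    """Step 2: assign instance indices to n folds (round-robin, no shuffling)."""
    return [list(range(k, num_instances, n)) for k in range(n)]

def out_of_fold_predictions(xs, ys, train_fn, n):
    """Steps 3-6: label-leak-free predictions of one model on the whole dataset."""
    preds = [None] * len(xs)
    for test_idx in make_folds(len(xs), n):
        held_out = set(test_idx)                      # base-test-i
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        predict = train_fn([xs[i] for i in train_idx],
                           [ys[i] for i in train_idx])  # train on base-train-i
        for i in test_idx:
            preds[i] = predict(xs[i])
    return preds

def build_meta_dataset(xs, ys, train_fns, n):
    """Step 7: D_meta = [(y_hat_i, y_i)], y_hat_i = tuple of the N predictions."""
    per_model = [out_of_fold_predictions(xs, ys, f, n) for f in train_fns]
    return [(tuple(p[i] for p in per_model), ys[i]) for i in range(len(xs))]

# Toy stand-ins for the first-layer models (hypothetical, for illustration only).
def train_majority(train_xs, train_ys):
    label = Counter(train_ys).most_common(1)[0][0]
    return lambda x: label

def train_threshold(train_xs, train_ys):
    pos = [x for x, y in zip(train_xs, train_ys) if y == 1]
    neg = [x for x, y in zip(train_xs, train_ys) if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: int(x > cut)

xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.15, 0.85, 0.25, 0.75]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
d_meta = build_meta_dataset(xs, ys, [train_majority, train_threshold], n=5)
```

Each prediction in `d_meta` comes from a model that never saw the corresponding instance during training, which is what makes the meta-features label-leak-free.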

9 https://scikit-learn.org/stable/


Figure F.5: Our dataset splitting strategy (Appendix F.4) in the 3-split case. (Diagram, upper part, "Meta-feature dataset construction": the dataset D = {..., (x_i, y_i), ...} is split into Base-train-k and Base-test-k for k = 1, 2, 3; a base model is trained on each Base-train-k and predicts on Base-test-k, yielding the meta-feature dataset Dmeta = {..., (ŷ_i, y_i), ...}. Lower part, "Meta-estimator training and scoring": Dmeta is split into Meta-train-k and Meta-test-k; a meta-estimator is trained on each Meta-train-k, its error rate is calculated on Meta-test-k, and the error rates are averaged.)

Note that, since the test set (i.e., base-test-i) is never used for model training, the predictions on the test set are label-leak-free.

In this study, we used n = 5.

Meta-estimator training and scoring:

Some of the datasets used in this study are small, as shown in Table E.10. The official test sets of the datasets are also small. For example, the test set of the RTE dataset includes only 277 instances. We supposed that performance measurements conducted on such small test sets might not be reliable. Thus, we conducted the following l-fold cross-validation to train and score the meta-estimators:

1. Divide Dmeta into l sets.

2. One of them (i.e. meta-test-i) is set aside for testing.

3. Train a meta-estimator on the remaining sets (i.e., meta-train-i).

4. Apply the meta-estimator to the test set (i.e., meta-test-i) and calculate its error rate.

5. Repeat 2-4 for i to get error rates on the test-sets, then calculate the average of them.

In this study, we used l = 4.
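The l-fold scoring procedure above can be sketched as follows. The meta-estimator here is a toy lookup table (a hypothetical stand-in for the logistic regression, SVM, and Random Forest meta-estimators), and the meta-feature dataset is made up for illustration.

```python
from collections import Counter, defaultdict

def train_lookup_meta_estimator(meta_train):
    """Toy meta-estimator: most frequent label per prediction pattern,
    falling back to a majority vote for patterns not seen in training."""
    table = defaultdict(Counter)
    for y_hat, y in meta_train:
        table[y_hat][y] += 1

    def predict(y_hat):
        if y_hat in table:
            return table[y_hat].most_common(1)[0][0]
        return Counter(y_hat).most_common(1)[0][0]

    return predict

def cross_validated_error_rate(d_meta, train_fn, l):
    """Steps 1-5: average error rate over the l meta-test folds."""
    error_rates = []
    for k in range(l):
        test_idx = set(range(k, len(d_meta), l))        # meta-test-k
        meta_train = [d for i, d in enumerate(d_meta) if i not in test_idx]
        meta_test = [d for i, d in enumerate(d_meta) if i in test_idx]
        predict = train_fn(meta_train)                  # train on meta-train-k
        errors = sum(predict(y_hat) != y for y_hat, y in meta_test)
        error_rates.append(errors / len(meta_test))
    return sum(error_rates) / l

# Hypothetical meta-feature dataset: (predictions of N = 3 models, true label).
d_meta = [((1, 1, 0), 1), ((0, 0, 1), 0), ((1, 0, 1), 1)] * 4
avg_error = cross_validated_error_rate(d_meta, train_lookup_meta_estimator, l=4)
```

Averaging over the l folds uses every instance of Dmeta for testing exactly once, which is the motivation given above for not relying on the small official test sets.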


G. Pearson Correlation Coefficients between Error Rate Reductions and Lower Bound Reductions for Various Number of Models N

Tables G.11 to G.14 show the Pearson correlation coefficients between the error reductions and lower bound reductions of the ensemble systems in each task. Each table shows the results of different N, which is the number of models used by the ensemble systems.

See Section 5.2 for the discussion of such correlations.
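For reference, the Pearson correlation coefficient reported in these tables can be computed as in the minimal sketch below. The input values are hypothetical, not measurements from this study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Hypothetical error rate reductions and lower bound reductions of four systems.
error_rate_reductions = [1.0, 2.0, 3.0, 4.0]
lower_bound_reductions = [2.0, 4.0, 6.0, 8.0]

r_aligned = pearson(error_rate_reductions, lower_bound_reductions)
r_reversed = pearson(error_rate_reductions, lower_bound_reductions[::-1])
```

A coefficient close to 1 means that systems with a larger lower bound reduction also tend to have a larger error rate reduction, which is the property examined in the tables of this appendix.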

Table G.11: N = 10. Pearson correlation coefficients between error rate reduction and lower bound reduction. In each task we used the 16 ensemble systems described in Section 4.2.

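| Task | Lemma 2.3 B(I) | Btight(I) | Lemma 3.1 Btight(E) |
|---|---|---|---|
| Boolq | 0.413 | 0.377 | 0.869 |
| CoLA | -0.259 | -0.245 | 0.993 |
| CosmosQA | -0.188 | -0.174 | 1.000 |
| MNLI | -0.275 | -0.385 | 0.955 |
| MRPC | 0.218 | 0.218 | 0.983 |
| QQP | -0.359 | -0.330 | 0.999 |
| SciTail | -0.076 | -0.092 | 0.944 |
| SST | 0.286 | 0.357 | 0.998 |
| average (footnote 10) | -0.482 | -0.431 | 0.975 |

The three value columns are the lower bound types.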

Table G.12: N = 15. Pearson correlation coefficients between error rate reduction and lower bound reduction. In each task we used the 16 ensemble systems described in Section 4.2.

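| Task | Lemma 2.3 B(I) | Btight(I) | Lemma 3.1 Btight(E) |
|---|---|---|---|
| Boolq | 0.341 | 0.330 | 0.910 |
| CoLA | -0.211 | -0.210 | 0.991 |
| CosmosQA | -0.324 | -0.320 | 1.000 |
| MNLI | 0.226 | 0.216 | 0.961 |
| MRPC | 0.332 | 0.252 | 0.989 |
| QQP | -0.131 | -0.076 | 0.998 |
| SciTail | -0.237 | -0.191 | 0.966 |
| SST | -0.242 | -0.252 | 0.998 |
| average (footnote 10) | -0.238 | -0.165 | 0.984 |

The three value columns are the lower bound types.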

10 The correlation coefficient between the averaged error rate reductions and lower bound reductions. The average is taken over the eight tasks.

Table G.13: N = 20. Pearson correlation coefficients between error rate reduction and lower bound reduction. In each task we used the 16 ensemble systems described in Section 4.2.

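| Task | Lemma 2.3 B(I) | Btight(I) | Lemma 3.1 Btight(E) |
|---|---|---|---|
| Boolq | 0.323 | 0.311 | 0.915 |
| CoLA | -0.324 | -0.320 | 0.995 |
| CosmosQA | -0.510 | -0.512 | 1.000 |
| MNLI | -0.190 | -0.192 | 0.976 |
| MRPC | -0.235 | -0.199 | 0.964 |
| QQP | 0.411 | 0.390 | 0.999 |
| SciTail | -0.286 | -0.307 | 0.958 |
| SST | 0.032 | 0.024 | 0.997 |
| average (footnote 10) | -0.452 | -0.425 | 0.985 |

The three value columns are the lower bound types.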

Table G.14: N = 30. Pearson correlation coefficients between error rate reduction and lower bound reduction. In each task we used the 16 ensemble systems described in Section 4.2.

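| Task | Lemma 2.3 B(I) | Btight(I) | Lemma 3.1 Btight(E) |
|---|---|---|---|
| Boolq | 0.158 | 0.146 | 0.940 |
| CoLA | -0.215 | -0.213 | 0.994 |
| CosmosQA | -0.592 | -0.588 | 1.000 |
| MNLI | -0.048 | -0.050 | 0.976 |
| MRPC | -0.471 | -0.498 | 0.974 |
| QQP | 0.187 | 0.231 | 0.999 |
| SciTail | -0.379 | -0.377 | 0.954 |
| SST | 0.213 | 0.208 | 0.996 |
| average (footnote 10) | -0.330 | -0.288 | 0.990 |

The three value columns are the lower bound types.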


H. Behavior of Ensemble Quantities When Number of Models N is Changed

In this section, we examine the behavior of the ensemble quantities when the number of models is changed (Figure G.6). Most importantly: (i) neither Lemma 2.3 B(I) (Figure G.6b) nor Btight(I) (Figure G.6c) could predict the shape of the error rate reduction curve (Figure G.6a), especially the saturation from roughly N = 15 onward; (ii) by contrast, Lemma 3.1 Btight(E) (Figure G.6d) could predict this phenomenon. This success is attributed to the ensemble strength, which accounts for the combination loss (Figure G.6j).

Figure G.6e shows the per-model relevance $i_{\mathrm{relev}} = I_{\mathrm{relev}}/N$, which denotes the average amount of information on Y conveyed by a single model, or the average accuracy of the models. All the systems kept it nearly constant, since their model training procedures do not change with N.

Figure G.6f shows the per-model redundancy $i_{\mathrm{redun}} = I_{\mathrm{redun}}/N$, which denotes the average amount of information on Y conveyed by a single model that is redundant with the other models. In all of the systems, it increased to about the same value as $i_{\mathrm{relev}}$. It increased because, as more models come into an ensemble system, it becomes more difficult for a new model to output a prediction distribution that differs from those of the existing models. As a result, new models eventually become totally redundant as $i_{\mathrm{redun}}$ approaches $i_{\mathrm{relev}}$.

$i_{\mathrm{relev}} - i_{\mathrm{redun}}$ (Figure G.6g), the average amount of unique information conveyed by a single model, converged to nearly zero. Because of this diversity saturation, the increase in $I = N (i_{\mathrm{relev}} - i_{\mathrm{redun}})$ slowed at large scale (Figure G.6h). However, its saturation speed was smaller than the observed one (Figure G.6a). As a result, the lower bound reductions of both Lemma 2.3 B(I) (Figure G.6b) and Btight(I) (Figure G.6c) could not predict the saturation behavior.

Figure G.6i shows the combination loss $I_{\mathrm{combloss}}$. $I_{\mathrm{combloss}}$ increased in proportion to the increase of $I$, since $I_{\mathrm{combloss}}$ represents the amount of information lost from $I$ (Appendix I gives the intuition behind this increase). Overall, $E = I - I_{\mathrm{combloss}}$ saturated at large scale (Figure G.6j). Thus, the lower bound reduction by Lemma 3.1 (Figure G.6d), which is produced by $E$, succeeded in detecting the observed saturation behavior (Figure G.6a).


Figure G.6: The change in ensemble quantities when the number of models N is changed. Each panel shows a specific quantity. The ensemble systems used the SVM model combination. Each value is an average over the eight tasks. $i$ denotes the per-model metric values defined as $i_{\{\mathrm{relev, redun, combloss}\}} = I_{\{\mathrm{relev, redun, combloss}\}}/N$.

(Plots omitted. Each panel has N from 0 to 30 on the x-axis and one curve each for Random-HyP, Bagging, Random-Seed, and Hetero-DNNs. Panels: (a) Error rate reduction. (b) Lower bound reduction by Lemma 2.3 B(I). (c) Lower bound reduction by Btight(I). (d) Lower bound reduction by Lemma 3.1 Btight(E). (e) Per-model relevance $i_{\mathrm{relev}}$. (f) Per-model redundancy $i_{\mathrm{redun}}$. (g) $i_{\mathrm{relev}} - i_{\mathrm{redun}}$. (h) $I = N(i_{\mathrm{relev}} - i_{\mathrm{redun}})$. (i) $I_{\mathrm{combloss}}$. (j) $E = N(i_{\mathrm{relev}} - i_{\mathrm{redun}}) - I_{\mathrm{combloss}}$.)


I. On Increase in Combination Loss with respect to N

We describe the reason for the increase in combination loss Icombloss with respect to number of models N observed in Figure G.6i.

In particular, we argue that the increase of Icombloss with N does not contradict the fact that a larger N leads to better performance.

I.1. Information theoretical view

From an information-theoretic viewpoint:

1. Since incoming models bring more information, $I$, which denotes the total amount of information carried by the models, increases with N, as shown in Figure G.6h.

2. Since $I_{\mathrm{combloss}}$ represents the amount of information lost from $I$ when a combination function F is applied, $I_{\mathrm{combloss}}$ generally increases as $I$ increases. This is shown in Figure G.6i. This is not counter-intuitive: for example, if the information loss rate is a constant $c$, then $I_{\mathrm{combloss}} = c \, I$ increases at the same rate as $I$.

3. Since $I$ grows faster than $I_{\mathrm{combloss}}$, $E = I - I_{\mathrm{combloss}}$, which denotes the total amount of information remaining after the combination, also increases, as shown in Figure G.6j.

4. Since $E$ represents the performance of an ensemble system, increasing $E$ leads to better performance.

As seen, fact 2, that $I_{\mathrm{combloss}}$ increases with N, does not contradict facts 3 and 4, that a larger N leads to better performance.
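The constant-loss-rate example in item 2 can be written out explicitly. This is only an illustration under the assumption of a fixed loss rate $c$ with $0 \le c < 1$; the actual loss rate of a system need not be constant.

```latex
I_{\mathrm{combloss}} = c \, I
\quad\Longrightarrow\quad
E = I - I_{\mathrm{combloss}} = (1 - c) \, I .
```

Under this assumption, both $I_{\mathrm{combloss}}$ and $E$ are increasing functions of $I$, so a growing combination loss and a growing ensemble strength are compatible.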

I.2. Viewing through neglected minority model predictions

In Section 3.2, we discussed that the source of Icombloss is neglected but correct model predictions. We can also discuss Icombloss from this view as follows:

1. The number of neglected minority predictions on a misclassified dataset instance increases as the total number of predictions on the instance increases. Since the latter is roughly proportional to N, the former is also roughly proportional to N.

2. The total number of misclassified dataset instances, which corresponds to the error rate, decreases more slowly than linearly with N. This is empirically known, for example as shown in Figure G.6a.

3. The total number of neglected minority predictions in a dataset, which is the source of Icombloss, is roughly estimated as [the number of neglected minority predictions on a misclassified dataset instance] × [the total number of misclassified dataset instances]. From 1 and 2, this quantity increases roughly linearly with N.

As seen, fact 3, that Icombloss increases with N, does not contradict fact 2, that the error rate decreases with N.

J. Measurements of Information Concentration

To observe this directly, we defined the n-model concentration ($\mathrm{Conc}^N_n$), which measures the degree of concentration on the top-n models as a value in [0, 1]:

$$\mathrm{Conc}^N_n(O, Y) = \frac{I(\Omega^{N,\max}_n; Y) - I(\Omega^{N,\min}_n; Y)}{I(O; Y)} \in [0, 1],$$

$$I(\Omega^{N,\max/\min}_n; Y) = \underset{\{i_1, i_2, \ldots, i_n\} \in \Omega^N_n}{\max/\min} \; I(\{O_{i_1}, O_{i_2}, \ldots, O_{i_n}\}; Y),$$

where $I$ is the mutual information defined by (C.3) and $\Omega^N_n$ is the set of all possible combinations of n integers from [1, N]. Since the amount of information on Y carried by a subset $\{O_{i_1}, \ldots, O_{i_n}\}$ can never be more than that of the full set O, $I(\Omega^{N,\max/\min}_n; Y) \le I(O; Y)$. This leads to $\mathrm{Conc}^N_n(O, Y) \in [0, 1]$. $\mathrm{Conc}^N_n$ takes the value 1 when all the information carried by O can be reconstructed from the top-n $O_i$ and the bottom-n $O_i$ carry no information (i.e., $I(\Omega^{N,\max}_n; Y) = I(O; Y)$ and $I(\Omega^{N,\min}_n; Y) = 0$). $\mathrm{Conc}^N_n$ is small when the amount of information in the top-n $O_i$ is similar to that in the bottom-n $O_i$ (i.e., $I(\Omega^{N,\max}_n; Y) \approx I(\Omega^{N,\min}_n; Y)$).
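As an illustration of the definition above, the sketch below computes the n-model concentration by exhaustively enumerating all n-model subsets and using a plug-in (empirical frequency) estimate of the mutual information. Both choices are assumptions made for this toy example; they are feasible only for small N and are not necessarily the estimation procedure used in this study. The model outputs are made up.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(columns, y):
    """Empirical I(X; Y) in nats, where X is the joint output of the given models."""
    n = len(y)
    xs = list(zip(*columns))  # joint output of the selected models per instance
    count_x, count_y, count_xy = Counter(xs), Counter(y), Counter(zip(xs, y))
    return sum((c / n) * math.log(c * n / (count_x[x] * count_y[t]))
               for (x, t), c in count_xy.items())

def concentration(outputs, y, n_top):
    """(max - min of I over all n_top-model subsets) divided by I(O; Y)."""
    subset_mi = [mutual_information([outputs[i] for i in idx], y)
                 for idx in combinations(range(len(outputs)), n_top)]
    return (max(subset_mi) - min(subset_mi)) / mutual_information(outputs, y)

# Hypothetical outputs of N = 3 models on eight instances.
y = [0, 1, 0, 1, 0, 1, 0, 1]
outputs = [
    [0, 1, 0, 1, 0, 1, 0, 1],  # model 1 reproduces Y exactly
    [0, 0, 1, 1, 0, 0, 1, 1],  # model 2 is independent of Y
    [0, 0, 0, 0, 1, 1, 1, 1],  # model 3 is independent of Y
]
conc = concentration(outputs, y, n_top=1)
```

In this toy case all the information on Y is carried by a single model, so the 1-model concentration reaches its maximum value of 1.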


K. Results of each task

Below, we show the experimental results of the eight tasks. The discussion in Sections 5 and 6 holds in each task, that is:

- Btight generates a tighter lower bound than B. This is discussed in Section 5.1.
- The lower bound reduction by Lemma 3.1 Btight(E) is strongly correlated with the error rate reduction, while those by Lemma 2.3 B(I) and Btight(I) are not. This is discussed in Section 5.2.
- The lower bound reduction by Lemma 3.1 Btight(E) successfully predicts the shape of the error rate reduction curve when the number of models N is changed, while those by Lemma 2.3 B(I) and Btight(I) do not. This is discussed in Section 5.3.
- The strengths and weaknesses of the ensemble systems can be analyzed in terms of the three metrics. This is discussed in Section 6.


Figure K.7: Boolq task. Correlations between error rate reductions and lower bound reductions. Each panel uses a different type of lower bound. Each point in the panels shows a quantity of a specific ensemble system s and the quantity is the average over the eight tasks. See Table K.15 for the real value of each point. We used the 16 ensemble systems described in Section 4.2. Each system s used N = 15 models. The baseline values in (8) and (9) were as follows: ER(s0): 24.1 %. LB(s0) by Btight(E): 3.1 %. LB(s0) by Btight(I): 3.1 %. LB(s0) by B(I): 2.0 %.

(Scatter plots omitted; x-axis: error rate reduction, y-axis: lower bound reduction. Panels: (a) Lemma 2.3 B(I), Pearson coefficient 0.341. (b) Btight(I), Pearson coefficient 0.33. (c) Lemma 3.1 Btight(E), Pearson coefficient 0.91.)

Figure K.8: Boolq task. The change in ensemble quantities when the number of models N is changed. Each panel shows a specific quantity. The ensemble systems used the SVM model combination. Each value is an average over the eight tasks. $i$ denotes the per-model metric values defined as $i_{\{\mathrm{relev, redun, combloss}\}} = I_{\{\mathrm{relev, redun, combloss}\}}/N$.

(Plots omitted. Each panel has N from 0 to 30 on the x-axis and one curve each for Random-HyP, Bagging, Random-Seed, and Hetero-DNNs. Panels: (a) Error rate reduction. (b) Lower bound reduction by Lemma 2.3 B(I). (c) Lower bound reduction by Btight(I). (d) Lower bound reduction by Lemma 3.1 Btight(E). (e), (f) Panel titles not recovered. (g) $i_{\mathrm{relev}} - i_{\mathrm{redun}}$. (h) $I = N(i_{\mathrm{relev}} - i_{\mathrm{redun}})$. (i) $I_{\mathrm{combloss}}$. (j) $E = N(i_{\mathrm{relev}} - i_{\mathrm{redun}}) - I_{\mathrm{combloss}}$.)


Table K.15: Boolq task. Statistics of ensemble systems described in Section 4.2. The rows and columns list the model generation and combination methods of Table 2, respectively. Each cell shows a quantity of a specific system s. Each quantity is the average over the eight tasks. Each system contains N = 15 models. Color shows the rank within each column (brighter is better).

(a) Error rate reductions and lower bound reductions. The baseline values used in (8) and (9) were as follows. ER(s0): 24.1 %. LB(s0) by Btight(E): 3.1 %. LB(s0) by Btight(I): 3.1 %. LB(s0) by B(I): 2.0 %. ER denotes the error rate reductions (8) and LB the lower bound reductions (9).

| Model generation | ER, Voting | ER, LogR | ER, SVM | ER, RForest | LB Lemma 3.1 Btight(E), Voting | LB Btight(E), LogR | LB Btight(E), SVM | LB Btight(E), RForest | LB Btight(I) | LB Lemma 2.3 B(I) |
|---|---|---|---|---|---|---|---|---|---|---|
| Random-HyP | 7.9 ± 1.2 | 9.6 ± 2.0 | 9.1 ± 2.0 | 10.3 ± 1.4 | 3.3 ± 1.2 | 6.3 ± 1.8 | 5.7 ± 1.0 | 5.8 ± 1.7 | 33 ± 2 | 73 ± 5 |
| Bagging | 5.3 ± 2.0 | 6.4 ± 3.0 | 6.2 ± 2.4 | 5.4 ± 1.4 | 3.1 ± 1.7 | 2.8 ± 2.7 | 2.7 ± 1.8 | 1.0 ± 1.3 | 43 ± 2 | 94 ± 3 |
| Random-Seed | 9.2 ± 1.9 | 9.5 ± 1.8 | 8.5 ± 3.1 | 8.8 ± 2.9 | 6.9 ± 1.7 | 6.0 ± 1.7 | 4.5 ± 3.1 | 4.7 ± 2.5 | 38 ± 1 | 83 ± 3 |
| Hetero-DNNs | 9.8 ± 0.3 | 11.8 ± 1.6 | 12.6 ± 0.4 | 12.2 ± 1.5 | 5.5 ± 0.4 | 7.2 ± 1.3 | 7.6 ± 0.4 | 6.2 ± 1.4 | 52 ± 2 | 114 ± 7 |

(b) Breakdown of ensemble strength defined in (7). We show per-model metric values defined as $i_{\{\mathrm{relev, redun, combloss}\}} = I_{\{\mathrm{relev, redun, combloss}\}}/N$. Thus, $E = (i_{\mathrm{relev}} - i_{\mathrm{redun}} - i_{\mathrm{combloss}}) \times N$ holds. For intuitive understanding, all the values are normalized by the ensemble strength of the baseline $E_{s_0}$, for example, $I_{\mathrm{relev}} = \hat{I}_{\mathrm{relev}} / E_{s_0} \times 100$, where $\hat{I}_{\mathrm{relev}}$ is the raw value.

Baseline (s0): E(O, Y, Ŷ) = 100 (the raw value is 0.182), irelev = 100, iredun = 0, icombloss = 0 for all four combination methods, irelev − iredun = 100.

| Model generation | E(O, Y, Ŷ), Voting | E, LogR | E, SVM | E, RForest | irelev | iredun | icombloss, Voting | icombloss, LogR | icombloss, SVM | icombloss, RForest | irelev − iredun |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Random-HyP | 110.1 ± 3.8 | 119.2 ± 6.0 | 117.2 ± 3.5 | 117.7 ± 5.6 | 74.2 ± 3.2 | 60.9 ± 2.8 | 5.97 ± 0.32 | 5.36 ± 0.47 | 5.50 ± 0.48 | 5.46 ± 0.55 | 13.3 ± 4.3 |
| Bagging | 109.5 ± 5.3 | 108.8 ± 8.5 | 108.4 ± 5.7 | 103.2 ± 3.9 | 80.0 ± 2.4 | 64.7 ± 2.5 | 7.97 ± 0.12 | 8.02 ± 0.41 | 8.04 ± 0.28 | 8.39 ± 0.07 | 15.3 ± 3.4 |
| Random-Seed | 120.8 ± 5.6 | 118.1 ± 5.8 | 113.9 ± 9.8 | 114.3 ± 8.0 | 100.0 ± 0.0 | 85.8 ± 0.3 | 6.18 ± 0.27 | 6.36 ± 0.17 | 6.64 ± 0.41 | 6.61 ± 0.27 | 14.2 ± 0.3 |
| Hetero-DNNs | 116.4 ± 1.6 | 121.4 ± 3.1 | 122.8 ± 1.8 | 118.7 ± 4.9 | 85.4 ± 2.0 | 68.4 ± 1.5 | 9.33 ± 0.65 | 8.99 ± 0.92 | 8.90 ± 0.63 | 9.18 ± 0.41 | 17.1 ± 2.5 |
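As a quick consistency check of the relation $E = (i_{\mathrm{relev}} - i_{\mathrm{redun}} - i_{\mathrm{combloss}}) \times N$, the following worked example plugs in the Random-HyP row of Table K.15 (b) with the Voting combination and N = 15. The small gap to the tabulated value is presumably due to rounding of the reported per-model values; that explanation is an assumption.

```latex
E \approx (74.2 - 60.9 - 5.97) \times 15 = 7.33 \times 15 = 109.95
\quad \text{(tabulated value: } 110.1\text{)} .
```

The same check with the LogR combination gives $(74.2 - 60.9 - 5.36) \times 15 = 119.1$, against a tabulated value of 119.2.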


Figure K.9: CoLA task. Correlations between error rate reductions and lower bound reductions. Each panel uses a different type of lower bound. Each point in the panels shows a quantity of a specific ensemble system s and the quantity is the average over the eight tasks. See Table K.16 for the real value of each point. We used the 16 ensemble systems described in Section 4.2. Each system s used N = 15 models. The baseline values in (8) and (9) were as follows: ER(s0): 15.6 %. LB(s0) by Btight(E): 2.6 %. LB(s0) by Btight(I): 2.6 %. LB(s0) by B(I): 2.1 %.

(Scatter plots omitted; x-axis: error rate reduction, y-axis: lower bound reduction. Panels: (a) Lemma 2.3 B(I), Pearson coefficient -0.211. (b) Btight(I), Pearson coefficient -0.21. (c) Lemma 3.1 Btight(E), Pearson coefficient 0.991.)

Figure K.10: CoLA task. The change in ensemble quantities when the number of models N is changed. Each panel shows a specific quantity. The ensemble systems used the SVM model combination. Each value is an average over the eight tasks. $i$ denotes the per-model metric values defined as $i_{\{\mathrm{relev, redun, combloss}\}} = I_{\{\mathrm{relev, redun, combloss}\}}/N$.

(Plots omitted. Each panel has N from 0 to 30 on the x-axis and one curve each for Random-HyP, Bagging, Random-Seed, and Hetero-DNNs. Panels: (a) Error rate reduction. (b) Lower bound reduction by Lemma 2.3 B(I). (c) Lower bound reduction by Btight(I). (d) Lower bound reduction by Lemma 3.1 Btight(E). (e), (f) Panel titles not recovered. (g) $i_{\mathrm{relev}} - i_{\mathrm{redun}}$. (h) $I = N(i_{\mathrm{relev}} - i_{\mathrm{redun}})$. (i) $I_{\mathrm{combloss}}$. (j) $E = N(i_{\mathrm{relev}} - i_{\mathrm{redun}}) - I_{\mathrm{combloss}}$.)


Table K.16: CoLA task. Statistics of ensemble systems described in Section 4.2. The rows and columns list the model generation and combination methods of Table 2, respectively. Each cell shows a quantity of a specific system s. Each quantity is the average over the eight tasks. Each system contains N = 15 models. Color shows the rank within each column (brighter is better).

(a) Error rate reductions and lower bound reductions. The baseline values used in (8) and (9) were as follows. ER(s0): 15.6 %. LB(s0) by Btight(E): 2.6 %. LB(s0) by Btight(I): 2.6 %. LB(s0) by B(I): 2.1 %. ER denotes the error rate reductions (8) and LB the lower bound reductions (9).

| Model generation | ER, Voting | ER, LogR | ER, SVM | ER, RForest | LB Lemma 3.1 Btight(E), Voting | LB Btight(E), LogR | LB Btight(E), SVM | LB Btight(E), RForest | LB Btight(I) | LB Lemma 2.3 B(I) |
|---|---|---|---|---|---|---|---|---|---|---|
| Random-HyP | -1.2 ± 1.3 | 5.6 ± 2.2 | 7.5 ± 1.6 | 5.2 ± 2.5 | -0.9 ± 1.2 | 5.1 ± 2.1 | 8.1 ± 1.6 | 4.7 ± 2.3 | 42 ± 3 | 48 ± 3 |
| Bagging | 0.5 ± 1.5 | 2.5 ± 1.5 | 5.9 ± 1.2 | -2.0 ± 2.3 | 0.4 ± 1.3 | 2.1 ± 1.4 | 6.9 ± 1.6 | -1.8 ± 2.1 | 59 ± 3 | 66 ± 3 |
| Random-Seed | 3.9 ± 0.5 | 6.6 ± 1.7 | 10.4 ± 0.5 | 5.8 ± 1.0 | 3.4 ± 0.4 | 5.9 ± 1.6 | 10.8 ± 1.1 | 5.2 ± 0.9 | 49 ± 3 | 55 ± 3 |
| Hetero-DNNs | -5.4 ± 3.0 | 6.8 ± 0.9 | 8.7 ± 1.0 | 5.6 ± 2.0 | -4.8 ± 2.5 | 6.2 ± 0.9 | 8.6 ± 0.7 | 5.0 ± 1.9 | 42 ± 1 | 47 ± 1 |

(b) Breakdown of ensemble strength defined in (7). We show per-model metric values defined as $i_{\{\mathrm{relev, redun, combloss}\}} = I_{\{\mathrm{relev, redun, combloss}\}}/N$. Thus, $E = (i_{\mathrm{relev}} - i_{\mathrm{redun}} - i_{\mathrm{combloss}}) \times N$ holds. For intuitive understanding, all the values are normalized by the ensemble strength of the baseline $E_{s_0}$, for example, $I_{\mathrm{relev}} = \hat{I}_{\mathrm{relev}} / E_{s_0} \times 100$, where $\hat{I}_{\mathrm{relev}}$ is the raw value.

Baseline (s0): E(O, Y, Ŷ) = 100 (the raw value is 0.252), irelev = 100, iredun = 0, icombloss = 0 for all four combination methods, irelev − iredun = 100.

| Model generation | E(O, Y, Ŷ), Voting | E, LogR | E, SVM | E, RForest | irelev | iredun | icombloss, Voting | icombloss, LogR | icombloss, SVM | icombloss, RForest | irelev − iredun |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Random-HyP | 98.6 ± 1.8 | 108.1 ± 3.2 | 112.8 ± 2.6 | 107.4 ± 3.6 | 77.1 ± 2.6 | 66.0 ± 2.3 | 4.53 ± 0.38 | 3.89 ± 0.52 | 3.58 ± 0.25 | 3.94 ± 0.55 | 11.1 ± 3.5 |
| Bagging | 100.6 ± 2.0 | 103.4 ± 2.2 | 110.9 ± 2.5 | 97.1 ± 3.2 | 84.6 ± 0.5 | 71.7 ± 0.2 | 6.14 ± 0.16 | 5.96 ± 0.29 | 5.45 ± 0.44 | 6.37 ± 0.12 | 12.9 ± 0.5 |
| Random-Seed | 105.3 ± 0.6 | 109.3 ± 2.5 | 117.1 ± 1.6 | 108.3 ± 1.5 | 100.0 ± 0.0 | 88.2 ± 0.3 | 4.80 ± 0.22 | 4.53 ± 0.32 | 4.01 ± 0.17 | 4.61 ± 0.33 | 11.8 ± 0.3 |
| Hetero-DNNs | 92.4 ± 4.0 | 109.8 ± 1.4 | 113.6 ± 1.1 | 107.9 ± 3.0 | 80.7 ± 0.4 | 69.7 ± 0.4 | 4.88 ± 0.23 | 3.72 ± 0.08 | 3.47 ± 0.05 | 3.84 ± 0.18 | 11.0 ± 0.6 |


Figure K.11: CosmosQA task. Correlations between error rate reductions and lower bound reductions. Each panel uses a different type of lower bound. Each point in the panels shows a quantity of a specific ensemble system s and the quantity is the average over the eight tasks. See Table K.17 for the real value of each point. We used the 16 ensemble systems described in Section 4.2. Each system s used N = 15 models. The baseline values in (8) and (9) were as follows: ER(s0): 28.2 %. LB(s0) by Btight(E): 6.2 %. LB(s0) by Btight(I): 6.2 %. LB(s0) by B(I): 2.0 %.

(Scatter plots omitted; x-axis: error rate reduction, y-axis: lower bound reduction. Panels: (a) Lemma 2.3 B(I), Pearson coefficient -0.324. (b) Btight(I), Pearson coefficient -0.32. (c) Lemma 3.1 Btight(E), Pearson coefficient 1.0.)

Figure K.12: CosmosQA task. The change in ensemble quantities when the number of models N is changed. Each panel shows a specific quantity. The ensemble systems used the SVM model combination. Each value is an average over the eight tasks. $i$ denotes the per-model metric values defined as $i_{\{\mathrm{relev, redun, combloss}\}} = I_{\{\mathrm{relev, redun, combloss}\}}/N$.

(Plots omitted. Each panel has N from 0 to 30 on the x-axis and one curve each for Random-HyP, Bagging, Random-Seed, and Hetero-DNNs. Panels: (a) Error rate reduction. (b) Lower bound reduction by Lemma 2.3 B(I). (c) Lower bound reduction by Btight(I). (d) Lower bound reduction by Lemma 3.1 Btight(E). (e), (f) Panel titles not recovered. (g) $i_{\mathrm{relev}} - i_{\mathrm{redun}}$. (h) $I = N(i_{\mathrm{relev}} - i_{\mathrm{redun}})$. (i) $I_{\mathrm{combloss}}$. (j) $E = N(i_{\mathrm{relev}} - i_{\mathrm{redun}}) - I_{\mathrm{combloss}}$.)


Table K.17: CosmosQA task. Statistics of ensemble systems described in Section 4.2. The rows and columns list the model generation and combination methods of Table 2, respectively. Each cell shows a quantity of a specific system s. Each quantity is the average over the eight tasks. Each system contains N = 15 models. Color shows the rank within each column (brighter is better).

(a) Error rate reductions and lower bound reductions. The baseline values used in (8) and (9) were as follows. ER(s0): 28.2 %. LB(s0) by Btight(E): 6.2 %. LB(s0) by Btight(I): 6.2 %. LB(s0) by B(I): 2.0 %. ER denotes the error rate reductions (8) and LB the lower bound reductions (9).

| Model generation | ER, Voting | ER, LogR | ER, SVM | ER, RForest | LB Lemma 3.1 Btight(E), Voting | LB Btight(E), LogR | LB Btight(E), SVM | LB Btight(E), RForest | LB Btight(I) | LB Lemma 2.3 B(I) |
|---|---|---|---|---|---|---|---|---|---|---|
| Random-HyP | 8.7 ± 1.0 | 10.4 ± 1.0 | 10.7 ± 1.4 | 10.3 ± 0.6 | 7.5 ± 0.9 | 8.9 ± 0.8 | 9.2 ± 1.1 | 9.0 ± 0.5 | 248 ± 7 | 837 ± 34 |
| Bagging | 3.2 ± 0.8 | 4.8 ± 0.2 | 4.7 ± 0.3 | 4.2 ± 0.7 | 2.7 ± 0.7 | 4.0 ± 0.2 | 3.9 ± 0.2 | 3.6 ± 0.6 | 302 ± 3 | 1022 ± 20 |
| Random-Seed | 10.6 ± 1.1 | 11.7 ± 0.3 | 11.9 ± 0.6 | 9.1 ± 2.8 | 9.1 ± 1.0 | 10.0 ± 0.3 | 10.2 ± 0.6 | 7.9 ± 2.2 | 246 ± 1 | 827 ± 7 |
| Hetero-DNNs | 15.6 ± 1.3 | 18.4 ± 0.7 | 17.3 ± 0.7 | 19.9 ± 1.2 | 13.8 ± 1.1 | 16.2 ± 0.6 | 15.2 ± 0.6 | 17.6 ± 1.1 | 277 ± 5 | 935 ± 9 |

(b) Breakdown of ensemble strength defined in (7). We show per-model metric values defined as $i_{\{\mathrm{relev, redun, combloss}\}} = I_{\{\mathrm{relev, redun, combloss}\}}/N$. Thus, $E = (i_{\mathrm{relev}} - i_{\mathrm{redun}} - i_{\mathrm{combloss}}) \times N$ holds. For intuitive understanding, all the values are normalized by the ensemble strength of the baseline $E_{s_0}$, for example, $I_{\mathrm{relev}} = \hat{I}_{\mathrm{relev}} / E_{s_0} \times 100$, where $\hat{I}_{\mathrm{relev}}$ is the raw value.

Baseline (s0): E(O, Y, Ŷ) = 100 (the raw value is 0.683), irelev = 100, iredun = 0, icombloss = 0 for all four combination methods, irelev − iredun = 100.

| Model generation | E(O, Y, Ŷ), Voting | E, LogR | E, SVM | E, RForest | irelev | iredun | icombloss, Voting | icombloss, LogR | icombloss, SVM | icombloss, RForest | irelev − iredun |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Random-HyP | 110.8 ± 1.2 | 112.9 ± 1.3 | 113.3 ± 1.8 | 112.9 ± 0.9 | 91.5 ± 1.9 | 60.6 ± 1.6 | 23.57 ± 0.47 | 23.43 ± 0.54 | 23.40 ± 0.56 | 23.43 ± 0.51 | 31.0 ± 2.5 |
| Bagging | 103.9 ± 1.0 | 105.8 ± 0.3 | 105.6 ± 0.4 | 105.1 ± 0.9 | 82.7 ± 0.4 | 46.4 ± 0.3 | 29.41 ± 0.27 | 29.28 ± 0.33 | 29.29 ± 0.30 | 29.33 ± 0.36 | 36.3 ± 0.5 |
| Random-Seed | 113.1 ± 1.3 | 114.5 ± 0.3 | 114.7 ± 0.7 | 111.4 ± 3.3 | 100.0 ± 0.0 | 69.3 ± 0.4 | 23.16 ± 0.46 | 23.07 ± 0.39 | 23.06 ± 0.42 | 23.28 ± 0.25 | 30.7 ± 0.4 |
| Hetero-DNNs | 119.9 ± 1.8 | 123.3 ± 1.1 | 122.0 ± 1.0 | 125.5 ± 1.8 | 82.6 ± 0.7 | 48.8 ± 0.0 | 25.84 ± 0.66 | 25.61 ± 0.76 | 25.70 ± 0.77 | 25.47 ± 0.70 | 33.8 ± 0.8 |


[Scatter plots of error rate reduction and lower bound reduction, one panel per lower bound. (a) Lemma 2.3 $B(I)$: Pearson coefficient 0.226. (b) $B^{\mathrm{tight}}(I)$: Pearson coefficient 0.216. (c) Lemma 3.1 $B^{\mathrm{tight}}(E)$: Pearson coefficient 0.961.]

Figure K.13: MNLI task. Correlations between error rate reductions and lower bound reductions. Each panel uses a different type of lower bound. Each point shows a quantity of a specific ensemble system s, and the quantity is the average over the eight tasks. See Table K.18 for the actual value of each point. We used the 16 ensemble systems described in Section 4.2. Each system s used N = 15 models. The baseline values in (8) and (9) were as follows: ER($s_0$): 18.6%. LB($s_0$) by $B^{\mathrm{tight}}(E)$: 3.7%. LB($s_0$) by $B^{\mathrm{tight}}(I)$: 3.7%. LB($s_0$) by $B(I)$: 1.1%.

[Plots against the number of models $N$ (axis ticks 0 to 30) for Random-HyP, Bagging, Random-Seed, and Hetero-DNNs. Labeled panels: (a) Error rate reduction. (b) Lower bound reduction by Lemma 2.3 $B(I)$. (c) Lower bound reduction by $B^{\mathrm{tight}}(I)$. (d) Lower bound reduction by Lemma 3.1 $B^{\mathrm{tight}}(E)$. (g) $i_{\mathrm{relev}} - i_{\mathrm{redun}}$. (h) $I = N(i_{\mathrm{relev}} - i_{\mathrm{redun}})$. (i) $I_{\mathrm{combloss}}$. (j) $E = N(i_{\mathrm{relev}} - i_{\mathrm{redun}}) - I_{\mathrm{combloss}}$.]

Figure K.14: MNLI task. The change in ensemble quantities as the number of models N is varied. Each panel shows a specific quantity. The ensemble systems used the SVM model combination. Each value is an average over the eight tasks. $i$ denotes the per-model metric values defined as $i_{\{\mathrm{relev},\,\mathrm{redun},\,\mathrm{combloss}\}} = I_{\{\mathrm{relev},\,\mathrm{redun},\,\mathrm{combloss}\}}/N$.

Table K.18: MNLI task. Statistics of ensemble systems described in Section 4.2. The rows and columns list the model generation and combination methods of Table 2, respectively. Each cell shows a quantity of a specific system s. Each quantity is the average over the eight tasks. Each system contains N = 15 models. Color shows the rank within each column (brighter is better).

(a) Error rate reductions and lower bound reductions. The baseline values used in (8) and (9) were as follows. ER($s_0$): 18.6%. LB($s_0$) by $B^{\mathrm{tight}}(E)$: 3.7%. LB($s_0$) by $B^{\mathrm{tight}}(I)$: 3.7%. LB($s_0$) by $B(I)$: 1.1%.

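| Model generation | Error rate reduction (8): Voting | (8): LogR | (8): SVM | (8): RForest | Lower bound reduction (9), Lemma 3.1 $B^{\mathrm{tight}}(E)$: Voting | (9), $B^{\mathrm{tight}}(E)$: LogR | (9), $B^{\mathrm{tight}}(E)$: SVM | (9), $B^{\mathrm{tight}}(E)$: RForest | (9), $B^{\mathrm{tight}}(I)$ | (9), Lemma 2.3 $B(I)$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Random-HyP | 6.2 ± 3.0 | 5.8 ± 3.4 | 5.1 ± 1.8 | 5.8 ± 1.8 | 4.9 ± 2.9 | 4.7 ± 3.4 | 5.0 ± 2.3 | 5.8 ± 2.0 | 247 ± 20 | 1172 ± 52 |
| Bagging | 11.2 ± 3.8 | 10.5 ± 3.9 | 10.1 ± 3.8 | 7.2 ± 3.0 | 9.9 ± 3.9 | 9.2 ± 3.7 | 9.3 ± 3.4 | 6.4 ± 2.6 | 278 ± 21 | 1324 ± 103 |
| Random-Seed | 8.3 ± 1.4 | 10.8 ± 0.6 | 6.2 ± 2.2 | 9.7 ± 4.2 | 7.1 ± 1.3 | 9.3 ± 0.3 | 6.2 ± 1.0 | 9.2 ± 3.9 | 229 ± 22 | 1092 ± 137 |
| Hetero-DNNs | 4.4 ± 2.0 | 8.7 ± 2.8 | 6.2 ± 1.4 | 6.5 ± 3.3 | 3.1 ± 2.1 | 7.5 ± 2.6 | 5.8 ± 1.1 | 7.1 ± 2.4 | 257 ± 13 | 1226 ± 116 |

(b) Breakdown of ensemble strength defined in (7). We show per-model metric values defined as $i_{\{\mathrm{relev},\,\mathrm{redun},\,\mathrm{combloss}\}} = I_{\{\mathrm{relev},\,\mathrm{redun},\,\mathrm{combloss}\}}/N$. Thus, $E = (i_{\mathrm{relev}} - i_{\mathrm{redun}} - i_{\mathrm{combloss}}) \times N$ holds. For intuitive understanding, all the values are normalized by the ensemble strength of the baseline $E_{s_0}$, for example, $I_{\mathrm{relev}} = \hat{I}_{\mathrm{relev}}/E_{s_0} \times 100$, where $\hat{I}_{\mathrm{relev}}$ is the raw value.

| Model generation | $E(O, Y, \hat{Y})$: Voting | $E$: LogR | $E$: SVM | $E$: RForest | $i_{\mathrm{relev}}$ | $i_{\mathrm{redun}}$ | $i_{\mathrm{combloss}}$: Voting | $i_{\mathrm{combloss}}$: LogR | $i_{\mathrm{combloss}}$: SVM | $i_{\mathrm{combloss}}$: RForest | $i_{\mathrm{relev}} - i_{\mathrm{redun}}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline ($s_0$) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 100 |
| Random-HyP | 104.3 ± 2.6 | 104.2 ± 3.0 | 104.4 ± 2.1 | 105.1 ± 1.8 | 95.6 ± 0.5 | 74.2 ± 1.2 | 14.47 ± 1.26 | 14.48 ± 1.28 | 14.46 ± 1.21 | 14.41 ± 1.13 | 21.4 ± 1.3 |
| Bagging | 108.8 ± 3.4 | 108.2 ± 3.3 | 108.2 ± 3.1 | 105.7 ± 2.3 | 95.4 ± 0.6 | 72.1 ± 1.7 | 16.05 ± 0.97 | 16.09 ± 1.05 | 16.09 ± 1.01 | 16.26 ± 1.24 | 23.3 ± 1.8 |
| Random-Seed | 106.3 ± 1.1 | 108.3 ± 0.2 | 105.5 ± 0.9 | 108.1 ± 3.5 | 100.0 ± 0.0 | 79.6 ± 1.4 | 13.27 ± 1.38 | 13.14 ± 1.39 | 13.32 ± 1.44 | 13.15 ± 1.61 | 20.4 ± 1.4 |
| Hetero-DNNs | 102.7 ± 1.9 | 106.7 ± 2.3 | 105.2 ± 1.0 | 106.3 ± 2.1 | 81.0 ± 0.8 | 58.9 ± 1.0 | 15.19 ± 0.83 | 14.93 ± 0.83 | 15.03 ± 0.89 | 14.96 ± 0.93 | 22.0 ± 1.3 |

The baseline value of $E$ is shared by all combination methods; its raw value is 0.681.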
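The Pearson coefficient reported in Figure K.13(c) can be recomputed from the 16 pairs of error rate reductions and Lemma 3.1 $B^{\mathrm{tight}}(E)$ lower bound reductions listed in Table K.18(a). The following is a minimal sketch, not the authors' evaluation code: the helper `pearson` and the list names are hypothetical, and the inputs are the rounded means from the table, so the result should land close to, but need not exactly equal, the value in the figure.

```python
import math

# Means copied from Table K.18(a) (MNLI). Rows are ordered Random-HyP, Bagging,
# Random-Seed, Hetero-DNNs; within each row the order is Voting, LogR, SVM, RForest.
error_rate_reduction = [
    6.2, 5.8, 5.1, 5.8,
    11.2, 10.5, 10.1, 7.2,
    8.3, 10.8, 6.2, 9.7,
    4.4, 8.7, 6.2, 6.5,
]
lower_bound_reduction = [
    4.9, 4.7, 5.0, 5.8,
    9.9, 9.2, 9.3, 6.4,
    7.1, 9.3, 6.2, 9.2,
    3.1, 7.5, 5.8, 7.1,
]


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson(error_rate_reduction, lower_bound_reduction)
print(f"Pearson coefficient: {r:.3f}")
```

The same helper can be applied to the $B^{\mathrm{tight}}(I)$ or $B(I)$ columns, but those columns hold one value per model generation method, so each value has to be repeated across the four combination methods before pairing it with the error rate reductions.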

[Scatter plots of error rate reduction and lower bound reduction, one panel per lower bound. (a) Lemma 2.3 $B(I)$: Pearson coefficient 0.332. (b) $B^{\mathrm{tight}}(I)$: Pearson coefficient 0.252. (c) Lemma 3.1 $B^{\mathrm{tight}}(E)$: Pearson coefficient 0.989.]

Figure K.15: MRPC task. Correlations between error rate reductions and lower bound reductions. Each panel uses a different type of lower bound. Each point shows a quantity of a specific ensemble system s, and the quantity is the average over the eight tasks. See Table K.19 for the actual value of each point. We used the 16 ensemble systems described in Section 4.2. Each system s used N = 15 models. The baseline values in (8) and (9) were as follows: ER($s_0$): 13.6%. LB($s_0$) of $B^{\mathrm{tight}}(E)$: 2.6%. LB($s_0$) of $B^{\mathrm{tight}}(I)$: 2.6%. LB($s_0$) of $B(I)$: 4.0%.

[Plots against the number of models $N$ (axis ticks 0 to 30) for Random-HyP, Bagging, Random-Seed, and Hetero-DNNs. Labeled panels: (a) Error rate reduction. (b) Lower bound reduction by Lemma 2.3 $B(I)$. (c) Lower bound reduction by $B^{\mathrm{tight}}(I)$. (d) Lower bound reduction by Lemma 3.1 $B^{\mathrm{tight}}(E)$. (g) $i_{\mathrm{relev}} - i_{\mathrm{redun}}$. (h) $I = N(i_{\mathrm{relev}} - i_{\mathrm{redun}})$. (i) $I_{\mathrm{combloss}}$. (j) $E = N(i_{\mathrm{relev}} - i_{\mathrm{redun}}) - I_{\mathrm{combloss}}$.]

Figure K.16: MRPC task. The change in ensemble quantities as the number of models N is varied. Each panel shows a specific quantity. The ensemble systems used the SVM model combination. Each value is an average over the eight tasks. $i$ denotes the per-model metric values defined as $i_{\{\mathrm{relev},\,\mathrm{redun},\,\mathrm{combloss}\}} = I_{\{\mathrm{relev},\,\mathrm{redun},\,\mathrm{combloss}\}}/N$.

Table K.19: MRPC task. Statistics of ensemble systems described in Section 4.2. The rows and columns list the model generation and combination methods of Table 2, respectively. Each cell shows a quantity of a specific system s. Each quantity is the average over the eight tasks. Each system contains N = 15 models. Color shows the rank within each column (brighter is better).

(a) Error rate reductions and lower bound reductions. The baseline values used in (8) and (9) were as follows. ER($s_0$): 13.6%. LB($s_0$) of $B^{\mathrm{tight}}(E)$: 2.6%. LB($s_0$) of $B^{\mathrm{tight}}(I)$: 2.6%. LB($s_0$) of $B(I)$: 4.0%.

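| Model generation | Error rate reduction (8): Voting | (8): LogR | (8): SVM | (8): RForest | Lower bound reduction (9), Lemma 3.1 $B^{\mathrm{tight}}(E)$: Voting | (9), $B^{\mathrm{tight}}(E)$: LogR | (9), $B^{\mathrm{tight}}(E)$: SVM | (9), $B^{\mathrm{tight}}(E)$: RForest | (9), $B^{\mathrm{tight}}(I)$ | (9), Lemma 2.3 $B(I)$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Random-HyP | 7.7 ± 3.7 | 6.4 ± 2.3 | 6.7 ± 3.8 | 6.3 ± 2.5 | 7.3 ± 3.7 | 5.4 ± 2.3 | 7.1 ± 3.7 | 5.5 ± 2.4 | 63 ± 4 | 58 ± 4 |
| Bagging | 5.0 ± 4.5 | 5.6 ± 3.0 | 6.3 ± 3.2 | 3.0 ± 2.4 | 4.4 ± 4.2 | 4.9 ± 3.0 | 6.5 ± 2.6 | 2.2 ± 2.2 | 72 ± 4 | 66 ± 5 |
| Random-Seed | 12.0 ± 3.2 | 4.3 ± 2.5 | 7.0 ± 2.0 | 6.0 ± 2.1 | 11.2 ± 3.0 | 3.5 ± 2.2 | 8.6 ± 2.5 | 5.2 ± 1.9 | 67 ± 6 | 62 ± 6 |
| Hetero-DNNs | 15.7 ± 2.1 | 18.0 ± 2.2 | 16.0 ± 4.3 | 11.3 ± 5.6 | 14.9 ± 2.1 | 17.2 ± 2.3 | 16.0 ± 4.7 | 10.5 ± 6.1 | 71 ± 2 | 66 ± 4 |

(b) Breakdown of ensemble strength defined in (7). We show per-model metric values defined as $i_{\{\mathrm{relev},\,\mathrm{redun},\,\mathrm{combloss}\}} = I_{\{\mathrm{relev},\,\mathrm{redun},\,\mathrm{combloss}\}}/N$. Thus, $E = (i_{\mathrm{relev}} - i_{\mathrm{redun}} - i_{\mathrm{combloss}}) \times N$ holds. For intuitive understanding, all the values are normalized by the ensemble strength of the baseline $E_{s_0}$, for example, $I_{\mathrm{relev}} = \hat{I}_{\mathrm{relev}}/E_{s_0} \times 100$, where $\hat{I}_{\mathrm{relev}}$ is the raw value.

| Model generation | $E(O, Y, \hat{Y})$: Voting | $E$: LogR | $E$: SVM | $E$: RForest | $i_{\mathrm{relev}}$ | $i_{\mathrm{redun}}$ | $i_{\mathrm{combloss}}$: Voting | $i_{\mathrm{combloss}}$: LogR | $i_{\mathrm{combloss}}$: SVM | $i_{\mathrm{combloss}}$: RForest | $i_{\mathrm{relev}} - i_{\mathrm{redun}}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline ($s_0$) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 100 |
| Random-HyP | 108.5 ± 4.1 | 106.3 ± 2.6 | 108.3 ± 4.1 | 106.5 ± 3.0 | 99.5 ± 1.7 | 87.9 ± 1.5 | 4.37 ± 0.50 | 4.51 ± 0.52 | 4.38 ± 0.49 | 4.50 ± 0.41 | 11.6 ± 2.3 |
| Bagging | 105.3 ± 5.1 | 105.8 ± 3.6 | 107.8 ± 3.2 | 102.6 ± 2.6 | 90.1 ± 1.9 | 77.8 ± 1.5 | 5.30 ± 0.20 | 5.26 ± 0.23 | 5.13 ± 0.27 | 5.48 ± 0.55 | 12.3 ± 2.4 |
| Random-Seed | 113.1 ± 3.2 | 104.1 ± 2.6 | 110.1 ± 2.6 | 106.1 ± 2.2 | 100.0 ± 0.0 | 88.0 ± 0.5 | 4.42 ± 0.67 | 5.02 ± 0.72 | 4.62 ± 0.70 | 4.89 ± 0.69 | 12.0 ± 0.5 |
| Hetero-DNNs | 117.5 ± 2.7 | 120.3 ± 3.1 | 118.9 ± 5.9 | 112.5 ± 7.4 | 90.1 ± 1.4 | 77.8 ± 1.2 | 4.45 ± 0.24 | 4.26 ± 0.14 | 4.35 ± 0.12 | 4.78 ± 0.19 | 12.3 ± 1.9 |

The baseline value of $E$ is shared by all combination methods; its raw value is 0.336.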

[Scatter plots of error rate reduction and lower bound reduction, one panel per lower bound. (a) Lemma 2.3 $B(I)$: Pearson coefficient -0.131. (b) $B^{\mathrm{tight}}(I)$: Pearson coefficient -0.076. (c) Lemma 3.1 $B^{\mathrm{tight}}(E)$: Pearson coefficient 0.998.]

Figure K.17: QQP task. Correlations between error rate reductions and lower bound reductions. Each panel uses a different type of lower bound. Each point shows a quantity of a specific ensemble system s, and the quantity is the average over the eight tasks. See Table K.20 for the actual value of each point. We used the 16 ensemble systems described in Section 4.2. Each system s used N = 15 models. The baseline values in (8) and (9) were as follows: ER($s_0$): 14.0%. LB($s_0$) of $B^{\mathrm{tight}}(E)$: 2.1%. LB($s_0$) of $B^{\mathrm{tight}}(I)$: 2.1%. LB($s_0$) of $B(I)$: 2.9%.

[Plots against the number of models $N$ (axis ticks 0 to 30) for Random-HyP, Bagging, Random-Seed, and Hetero-DNNs. Labeled panels: (a) Error rate reduction. (b) Lower bound reduction by Lemma 2.3 $B(I)$. (c) Lower bound reduction by $B^{\mathrm{tight}}(I)$. (d) Lower bound reduction by Lemma 3.1 $B^{\mathrm{tight}}(E)$. (g) $i_{\mathrm{relev}} - i_{\mathrm{redun}}$. (h) $I = N(i_{\mathrm{relev}} - i_{\mathrm{redun}})$. (i) $I_{\mathrm{combloss}}$. (j) $E = N(i_{\mathrm{relev}} - i_{\mathrm{redun}}) - I_{\mathrm{combloss}}$.]

Figure K.18: QQP task. The change in ensemble quantities as the number of models N is varied. Each panel shows a specific quantity. The ensemble systems used the SVM model combination. Each value is an average over the eight tasks. $i$ denotes the per-model metric values defined as $i_{\{\mathrm{relev},\,\mathrm{redun},\,\mathrm{combloss}\}} = I_{\{\mathrm{relev},\,\mathrm{redun},\,\mathrm{combloss}\}}/N$.

Table K.20: QQP task. Statistics of ensemble systems described in Section 4.2. The rows and columns list the model generation and combination methods of Table 2, respectively. Each cell shows a quantity of a specific system s. Each quantity is the average over the eight tasks. Each system contains N = 15 models. Color shows the rank within each column (brighter is better).

(a) Error rate reductions and lower bound reductions. The baseline values used in (8) and (9) were as follows. ER($s_0$): 14.0%. LB($s_0$) of $B^{\mathrm{tight}}(E)$: 2.1%. LB($s_0$) of $B^{\mathrm{tight}}(I)$: 2.1%. LB($s_0$) of $B(I)$: 2.9%.

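| Model generation | Error rate reduction (8): Voting | (8): LogR | (8): SVM | (8): RForest | Lower bound reduction (9), Lemma 3.1 $B^{\mathrm{tight}}(E)$: Voting | (9), $B^{\mathrm{tight}}(E)$: LogR | (9), $B^{\mathrm{tight}}(E)$: SVM | (9), $B^{\mathrm{tight}}(E)$: RForest | (9), $B^{\mathrm{tight}}(I)$ | (9), Lemma 2.3 $B(I)$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Random-HyP | 2.5 ± 2.7 | 1.1 ± 4.3 | 0.6 ± 3.7 | 2.5 ± 1.5 | 2.1 ± 2.9 | 0.6 ± 4.6 | -0.2 ± 4.0 | 1.9 ± 1.7 | 93 ± 17 | 80 ± 16 |
| Bagging | 5.8 ± 6.0 | 5.8 ± 7.1 | 5.3 ± 5.4 | 4.9 ± 6.0 | 5.9 ± 6.7 | 5.6 ± 7.7 | 5.3 ± 5.9 | 4.3 ± 6.5 | 101 ± 3 | 86 ± 2 |
| Random-Seed | 8.2 ± 0.3 | 7.3 ± 1.4 | 1.6 ± 0.4 | 7.8 ± 1.7 | 8.2 ± 0.3 | 6.8 ± 1.4 | 0.8 ± 0.4 | 7.3 ± 1.8 | 102 ± 15 | 87 ± 14 |
| Hetero-DNNs | -5.1 ± 1.8 | -2.2 ± 1.1 | -3.6 ± 4.0 | -0.3 ± 2.5 | -5.6 ± 1.9 | -2.7 ± 0.6 | -4.4 ± 4.0 | -1.0 ± 2.7 | 104 ± 11 | 89 ± 11 |

(b) Breakdown of ensemble strength defined in (7). We show per-model metric values defined as $i_{\{\mathrm{relev},\,\mathrm{redun},\,\mathrm{combloss}\}} = I_{\{\mathrm{relev},\,\mathrm{redun},\,\mathrm{combloss}\}}/N$. Thus, $E = (i_{\mathrm{relev}} - i_{\mathrm{redun}} - i_{\mathrm{combloss}}) \times N$ holds. For intuitive understanding, all the values are normalized by the ensemble strength of the baseline $E_{s_0}$, for example, $I_{\mathrm{relev}} = \hat{I}_{\mathrm{relev}}/E_{s_0} \times 100$, where $\hat{I}_{\mathrm{relev}}$ is the raw value.

| Model generation | $E(O, Y, \hat{Y})$: Voting | $E$: LogR | $E$: SVM | $E$: RForest | $i_{\mathrm{relev}}$ | $i_{\mathrm{redun}}$ | $i_{\mathrm{combloss}}$: Voting | $i_{\mathrm{combloss}}$: LogR | $i_{\mathrm{combloss}}$: SVM | $i_{\mathrm{combloss}}$: RForest | $i_{\mathrm{relev}} - i_{\mathrm{redun}}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline ($s_0$) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 100 |
| Random-HyP | 102.0 ± 2.8 | 100.6 ± 4.4 | 99.9 ± 3.8 | 101.8 ± 1.6 | 91.0 ± 1.6 | 78.5 ± 1.8 | 5.66 ± 1.12 | 5.76 ± 0.97 | 5.81 ± 0.99 | 5.68 ± 1.14 | 12.5 ± 2.4 |
| Bagging | 105.6 ± 6.4 | 105.4 ± 7.3 | 105.1 ± 5.7 | 104.1 ± 6.2 | 93.1 ± 2.5 | 80.2 ± 2.4 | 5.91 ± 0.42 | 5.92 ± 0.49 | 5.94 ± 0.38 | 6.01 ± 0.32 | 12.9 ± 3.5 |
| Random-Seed | 107.6 ± 0.5 | 106.4 ± 1.4 | 100.7 ± 0.3 | 106.8 ± 1.7 | 100.0 ± 0.0 | 87.0 ± 1.0 | 5.81 ± 0.95 | 5.89 ± 1.02 | 6.27 ± 0.98 | 5.86 ± 1.05 | 13.0 ± 1.0 |
| Hetero-DNNs | 94.8 ± 1.7 | 97.5 ± 0.6 | 95.9 ± 3.6 | 99.1 ± 2.4 | 85.7 ± 1.3 | 72.6 ± 0.6 | 6.78 ± 0.74 | 6.61 ± 0.81 | 6.71 ± 0.71 | 6.50 ± 0.87 | 13.1 ± 1.5 |

The baseline value of $E$ is shared by all combination methods; its raw value is 0.343.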

[Scatter plots of error rate reduction and lower bound reduction, one panel per lower bound. (a) Lemma 2.3 $B(I)$: Pearson coefficient -0.237. (b) $B^{\mathrm{tight}}(I)$: Pearson coefficient -0.191. (c) Lemma 3.1 $B^{\mathrm{tight}}(E)$: Pearson coefficient 0.966.]

Figure K.19: SciTail task. Correlations between error rate reductions and lower bound reductions. Each panel uses a different type of lower bound. Each point shows a quantity of a specific ensemble system s, and the quantity is the average over the eight tasks. See Table K.21 for the actual value of each point. We used the 16 ensemble systems described in Section 4.2. Each system s used N = 15 models. The baseline values in (8) and (9) were as follows: ER($s_0$): 5.7%. LB($s_0$) of $B^{\mathrm{tight}}(E)$: 1.2%. LB($s_0$) of $B^{\mathrm{tight}}(I)$: 1.2%. LB($s_0$) of $B(I)$: 5.2%.

[Plots against the number of models $N$ (axis ticks 0 to 30) for Random-HyP, Bagging, Random-Seed, and Hetero-DNNs. Labeled panels: (a) Error rate reduction. (b) Lower bound reduction by Lemma 2.3 $B(I)$. (c) Lower bound reduction by $B^{\mathrm{tight}}(I)$. (d) Lower bound reduction by Lemma 3.1 $B^{\mathrm{tight}}(E)$. (g) $i_{\mathrm{relev}} - i_{\mathrm{redun}}$. (h) $I = N(i_{\mathrm{relev}} - i_{\mathrm{redun}})$. (i) $I_{\mathrm{combloss}}$. (j) $E = N(i_{\mathrm{relev}} - i_{\mathrm{redun}}) - I_{\mathrm{combloss}}$.]

Figure K.20: SciTail task. The change in ensemble quantities as the number of models N is varied. Each panel shows a specific quantity. The ensemble systems used the SVM model combination. Each value is an average over the eight tasks. $i$ denotes the per-model metric values defined as $i_{\{\mathrm{relev},\,\mathrm{redun},\,\mathrm{combloss}\}} = I_{\{\mathrm{relev},\,\mathrm{redun},\,\mathrm{combloss}\}}/N$.

Table K.21: SciTail task. Statistics of ensemble systems described in Section 4.2. The rows and columns list the model generation and combination methods of Table 2, respectively. Each cell shows a quantity of a specific system s. Each quantity is the average over the eight tasks. Each system contains N = 15 models. Color shows the rank within each column (brighter is better).

(a) Error rate reductions and lower bound reductions. The baseline values used in (8) and (9) were as follows. ER($s_0$): 5.7%. LB($s_0$) of $B^{\mathrm{tight}}(E)$: 1.2%. LB($s_0$) of $B^{\mathrm{tight}}(I)$: 1.2%. LB($s_0$) of $B(I)$: 5.2%.

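| Model generation | Error rate reduction (8): Voting | (8): LogR | (8): SVM | (8): RForest | Lower bound reduction (9), Lemma 3.1 $B^{\mathrm{tight}}(E)$: Voting | (9), $B^{\mathrm{tight}}(E)$: LogR | (9), $B^{\mathrm{tight}}(E)$: SVM | (9), $B^{\mathrm{tight}}(E)$: RForest | (9), $B^{\mathrm{tight}}(I)$ | (9), Lemma 2.3 $B(I)$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Random-HyP | 18.2 ± 1.1 | 18.7 ± 1.1 | 19.5 ± 2.3 | 15.4 ± 1.5 | 17.2 ± 1.3 | 16.4 ± 1.8 | 16.7 ± 2.6 | 15.5 ± 1.5 | 95 ± 2 | 23 ± 0 |
| Bagging | 12.5 ± 1.5 | 17.0 ± 2.5 | 17.3 ± 2.7 | 10.2 ± 2.6 | 12.4 ± 1.4 | 14.2 ± 2.9 | 14.5 ± 3.1 | 9.2 ± 2.6 | 125 ± 2 | 30 ± 0 |
| Random-Seed | 14.8 ± 1.4 | 22.1 ± 0.5 | 20.9 ± 0.9 | 14.2 ± 0.7 | 14.8 ± 1.6 | 19.1 ± 0.5 | 18.3 ± 0.7 | 13.6 ± 0.4 | 104 ± 4 | 25 ± 1 |
| Hetero-DNNs | 20.9 ± 2.0 | 26.7 ± 1.0 | 25.3 ± 2.2 | 17.8 ± 4.5 | 20.9 ± 2.0 | 24.2 ± 1.1 | 23.1 ± 2.2 | 17.1 ± 4.3 | 114 ± 2 | 27 ± 1 |

(b) Breakdown of ensemble strength defined in (7). We show per-model metric values defined as $i_{\{\mathrm{relev},\,\mathrm{redun},\,\mathrm{combloss}\}} = I_{\{\mathrm{relev},\,\mathrm{redun},\,\mathrm{combloss}\}}/N$. Thus, $E = (i_{\mathrm{relev}} - i_{\mathrm{redun}} - i_{\mathrm{combloss}}) \times N$ holds. For intuitive understanding, all the values are normalized by the ensemble strength of the baseline $E_{s_0}$, for example, $I_{\mathrm{relev}} = \hat{I}_{\mathrm{relev}}/E_{s_0} \times 100$, where $\hat{I}_{\mathrm{relev}}$ is the raw value.

| Model generation | $E(O, Y, \hat{Y})$: Voting | $E$: LogR | $E$: SVM | $E$: RForest | $i_{\mathrm{relev}}$ | $i_{\mathrm{redun}}$ | $i_{\mathrm{combloss}}$: Voting | $i_{\mathrm{combloss}}$: LogR | $i_{\mathrm{combloss}}$: SVM | $i_{\mathrm{combloss}}$: RForest | $i_{\mathrm{relev}} - i_{\mathrm{redun}}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline ($s_0$) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 100 |
| Random-HyP | 104.5 ± 0.4 | 104.3 ± 0.5 | 104.3 ± 0.7 | 104.0 ± 0.4 | 88.8 ± 1.8 | 80.5 ± 1.8 | 1.35 ± 0.03 | 1.36 ± 0.01 | 1.36 ± 0.02 | 1.38 ± 0.03 | 8.3 ± 2.5 |
| Bagging | 103.2 ± 0.4 | 103.7 ± 0.7 | 103.8 ± 0.8 | 102.4 ± 0.7 | 96.2 ± 0.3 | 87.4 ± 0.3 | 1.95 ± 0.01 | 1.92 ± 0.03 | 1.91 ± 0.04 | 2.01 ± 0.04 | 8.8 ± 0.5 |
| Random-Seed | 103.8 ± 0.4 | 105.0 ± 0.2 | 104.8 ± 0.2 | 103.6 ± 0.1 | 100.0 ± 0.0 | 91.5 ± 0.1 | 1.55 ± 0.07 | 1.47 ± 0.08 | 1.49 ± 0.08 | 1.57 ± 0.08 | 8.5 ± 0.1 |
| Hetero-DNNs | 105.4 ± 0.5 | 106.3 ± 0.3 | 106.0 ± 0.6 | 104.5 ± 1.1 | 94.8 ± 0.5 | 86.1 ± 0.5 | 1.62 ± 0.08 | 1.56 ± 0.07 | 1.58 ± 0.09 | 1.68 ± 0.12 | 8.6 ± 0.7 |

The baseline value of $E$ is shared by all combination methods; its raw value is 0.641.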

[Scatter plots of error rate reduction and lower bound reduction, one panel per lower bound. (a) Lemma 2.3 $B(I)$: Pearson coefficient -0.242. (b) $B^{\mathrm{tight}}(I)$: Pearson coefficient -0.252. (c) Lemma 3.1 $B^{\mathrm{tight}}(E)$: Pearson coefficient 0.998.]

Figure K.21: SST task. Correlations between error rate reductions and lower bound reductions. Each panel uses a different type of lower bound. Each point shows a quantity of a specific ensemble system s, and the quantity is the average over the eight tasks. See Table K.22 for the actual value of each point. We used the 16 ensemble systems described in Section 4.2. Each system s used N = 15 models. The baseline values in (8) and (9) were as follows: ER($s_0$): 15.7%. LB($s_0$) of $B^{\mathrm{tight}}(E)$: 2.3%. LB($s_0$) of $B^{\mathrm{tight}}(I)$: 2.3%. LB($s_0$) of $B(I)$: 3.0%.

[Plots against the number of models $N$ (axis ticks 0 to 30) for Random-HyP, Bagging, Random-Seed, and Hetero-DNNs. Labeled panels: (a) Error rate reduction. (b) Lower bound reduction by Lemma 2.3 $B(I)$. (c) Lower bound reduction by $B^{\mathrm{tight}}(I)$. (d) Lower bound reduction by Lemma 3.1 $B^{\mathrm{tight}}(E)$. (g) $i_{\mathrm{relev}} - i_{\mathrm{redun}}$. (h) $I = N(i_{\mathrm{relev}} - i_{\mathrm{redun}})$. (i) $I_{\mathrm{combloss}}$. (j) $E = N(i_{\mathrm{relev}} - i_{\mathrm{redun}}) - I_{\mathrm{combloss}}$.]

Figure K.22: SST task. The change in ensemble quantities as the number of models N is varied. Each panel shows a specific quantity. The ensemble systems used the SVM model combination. Each value is an average over the eight tasks. $i$ denotes the per-model metric values defined as $i_{\{\mathrm{relev},\,\mathrm{redun},\,\mathrm{combloss}\}} = I_{\{\mathrm{relev},\,\mathrm{redun},\,\mathrm{combloss}\}}/N$.

Table K.22: SST task. Statistics of ensemble systems described in Section 4.2. The rows and columns list the model generation and combination methods of Table 2, respectively. Each cell shows a quantity of a specific system s. Each quantity is the average over the eight tasks. Each system contains N = 15 models. Color shows the rank within each column (brighter is better).

(a) Error rate reductions and lower bound reductions. The baseline values used in (8) and (9) were as follows. ER($s_0$): 15.7%. LB($s_0$) of $B^{\mathrm{tight}}(E)$: 2.3%. LB($s_0$) of $B^{\mathrm{tight}}(I)$: 2.3%. LB($s_0$) of $B(I)$: 3.0%.

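| Model generation | Error rate reduction (8): Voting | (8): LogR | (8): SVM | (8): RForest | Lower bound reduction (9), Lemma 3.1 $B^{\mathrm{tight}}(E)$: Voting | (9), $B^{\mathrm{tight}}(E)$: LogR | (9), $B^{\mathrm{tight}}(E)$: SVM | (9), $B^{\mathrm{tight}}(E)$: RForest | (9), $B^{\mathrm{tight}}(I)$ | (9), Lemma 2.3 $B(I)$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Random-HyP | 4.8 ± 7.4 | 10.5 ± 3.4 | 8.2 ± 2.9 | 4.7 ± 7.6 | 4.7 ± 7.3 | 10.3 ± 3.5 | 8.0 ± 2.8 | 4.6 ± 7.5 | 166 ± 28 | 58 ± 11 |
| Bagging | 15.1 ± 7.7 | 12.8 ± 6.1 | 16.2 ± 10.8 | 13.8 ± 9.9 | 15.2 ± 8.1 | 12.5 ± 6.3 | 16.6 ± 11.0 | 13.7 ± 10.2 | 177 ± 28 | 62 ± 11 |
| Random-Seed | 9.4 ± 5.8 | 8.1 ± 2.8 | 9.4 ± 5.8 | 8.2 ± 2.8 | 9.5 ± 6.1 | 7.7 ± 2.8 | 9.8 ± 5.6 | 8.2 ± 2.9 | 146 ± 16 | 51 ± 6 |
| Hetero-DNNs | -3.5 ± 4.5 | 7.0 ± 1.8 | 1.1 ± 5.8 | 0 | -3.7 ± 4.5 | 6.6 ± 1.8 | 1.8 ± 6.1 | -0.3 ± 1.9 | 186 ± 14 | 65 ± 7 |

(b) Breakdown of ensemble strength defined in (7). We show per-model metric values defined as $i_{\{\mathrm{relev},\,\mathrm{redun},\,\mathrm{combloss}\}} = I_{\{\mathrm{relev},\,\mathrm{redun},\,\mathrm{combloss}\}}/N$. Thus, $E = (i_{\mathrm{relev}} - i_{\mathrm{redun}} - i_{\mathrm{combloss}}) \times N$ holds. For intuitive understanding, all the values are normalized by the ensemble strength of the baseline $E_{s_0}$, for example, $I_{\mathrm{relev}} = \hat{I}_{\mathrm{relev}}/E_{s_0} \times 100$, where $\hat{I}_{\mathrm{relev}}$ is the raw value.

| Model generation | $E(O, Y, \hat{Y})$: Voting | $E$: LogR | $E$: SVM | $E$: RForest | $i_{\mathrm{relev}}$ | $i_{\mathrm{redun}}$ | $i_{\mathrm{combloss}}$: Voting | $i_{\mathrm{combloss}}$: LogR | $i_{\mathrm{combloss}}$: SVM | $i_{\mathrm{combloss}}$: RForest | $i_{\mathrm{relev}} - i_{\mathrm{redun}}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline ($s_0$) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 | 100 |
| Random-HyP | 101.6 ± 2.5 | 103.5 ± 1.2 | 102.8 ± 0.9 | 101.6 ± 2.6 | 97.7 ± 0.5 | 87.2 ± 0.2 | 3.76 ± 0.72 | 3.63 ± 0.63 | 3.68 ± 0.67 | 3.76 ± 0.60 | 10.5 ± 0.6 |
| Bagging | 105.3 ± 2.8 | 104.3 ± 2.2 | 105.8 ± 3.9 | 104.8 ± 3.6 | 98.4 ± 1.2 | 87.6 ± 0.6 | 3.78 ± 0.53 | 3.84 ± 0.58 | 3.74 ± 0.45 | 3.81 ± 0.50 | 10.8 ± 1.4 |
| Random-Seed | 103.2 ± 2.1 | 102.7 ± 1.0 | 103.4 ± 1.9 | 102.8 ± 0.9 | 100.0 ± 0.0 | 90.0 ± 0.3 | 3.16 ± 0.23 | 3.19 ± 0.39 | 3.15 ± 0.24 | 3.19 ± 0.29 | 10.0 ± 0.3 |
| Hetero-DNNs | 98.7 ± 1.5 | 102.3 ± 0.7 | 100.7 ± 2.1 | 99.9 ± 0.7 | 87.7 ± 0.6 | 76.7 ± 0.3 | 4.41 ± 0.27 | 4.17 ± 0.34 | 4.28 ± 0.31 | 4.33 ± 0.33 | 11.0 ± 0.6 |

The baseline value of $E$ is shared by all combination methods; its raw value is 0.705.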