# Ensemble Pruning for Out-of-distribution Generalization

Fengchun Qiao, Xi Peng
Deep REAL Lab, Department of Computer and Information Sciences, University of Delaware, DE, USA. Correspondence to: Xi Peng.
Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024.

## Abstract

Ensembles of deep neural networks have achieved great success in hedging against single-model failure under distribution shift. However, existing techniques suffer from producing redundant models, limiting predictive diversity and yielding compromised generalization performance. Existing ensemble pruning methods can only guarantee predictive diversity for in-distribution data, which may not transfer well to out-of-distribution (OoD) data. To address this gap, we propose a principled optimization framework for ensemble pruning under distribution shifts. Since the annotations of test data are not available, we explore relationships between the prediction distributions of the models, encapsulated in a topology graph. By incorporating this topology into a combinatorial optimization framework, complementary models with high predictive diversity are selected with theoretical guarantees. Our approach is model-agnostic and can be applied on top of a broad spectrum of off-the-shelf ensembling methods for improved generalization performance. Experiments on common benchmarks demonstrate the superiority of our approach in both multi- and single-source OoD generalization. The source code is publicly available at: https://github.com/joffery/TEP.

## 1. Introduction

Despite the documented successes, the complex prediction rules learned by modern machine learning (ML) models, such as deep neural networks, are vulnerable to out-of-distribution (OoD) data. This means that an ML model, while highly accurate on average, may fail dramatically when faced with rare or unseen data distributions (Sagawa et al., 2019; Qiao et al., 2020). Although numerous algorithms have been proposed to improve OoD generalization, recent studies (Wiles et al., 2021; Gulrajani & Lopez-Paz, 2020; Koh et al., 2021) indicate that no single model consistently outperforms others across all theoretical or empirical contexts, and many perform worse than the standard baseline of empirical risk minimization (Vapnik, 1999).

Ensemble learning (Dong et al., 2020) emerges as a promising approach by leveraging the diversity of multiple models to mitigate the risk of single-model failure under distribution shifts. This diversity originates from variations in initialization (Lakshminarayanan et al., 2017), architectures (Li et al., 2022), and training processes (Wortsman et al., 2022; Ramé et al., 2023; Lin et al., 2024). However, due to the lack of access to data from target distributions during training, current methods tend to generate an excessive number of models in the hope of ensuring diversity at test time. These methods often produce redundant models, and naively combining all models can reduce their complementary strengths on the target distribution, resulting in suboptimal generalization performance (see Fig. 1). This underscores the importance of ensemble pruning (Tsoumakas et al., 2009), the process of selectively choosing an optimal subset from a pool of pretrained models. Ensemble pruning aims to retain the models that are most complementary to each other so as to enhance overall performance.
Both theoretical analyses and empirical studies (Martínez-Muñoz et al., 2008; Caruana et al., 2004) support the effectiveness of ensemble pruning in boosting the generalization ability of ensembles, a concept sometimes referred to as the many-could-be-better-than-all theorem (Zhou et al., 2002). Existing pruning methods typically rely on supervised diversity metrics computed on training or validation data to inform the selection process. This approach, however, runs into difficulties under distribution shifts: diversity measured on in-distribution data may not be relevant for out-of-distribution data. Consequently, efficiently pruning redundant models to improve performance under distribution shifts remains an unexplored problem.

In this paper, we introduce a novel post-hoc optimization approach designed to prune redundant models at test time, thereby enhancing generalization to out-of-distribution (OoD) data. The primary challenge is selecting a complementary subset of models without access to data annotations. To overcome this challenge, we propose to learn an ensemble topology graph that captures the relationships between model predictions. This graph is integrated into a combinatorial optimization framework, enabling the adaptive selection of a diverse ensemble of models tailored to the test distribution. A distinctive feature of our method is its reliance solely on test data predictions for model selection, without any updates to model parameters. This sets it apart from test-time adaptation (Sun et al., 2020) strategies, which typically involve re-training models. Our method is model-agnostic, making it compatible with a wide array of existing ensembling techniques to boost generalization performance. Through rigorous testing on well-established benchmarks, our method has demonstrated its superiority in improving generalization across both multi-source and single-source OoD scenarios.

Figure 1 (panels include VLCS, Terra Incognita, and OfficeHome). We evaluate the out-of-distribution (OoD) accuracy of a uniform ensemble and its variations using the DomainBed benchmark (Gulrajani & Lopez-Paz, 2020). To construct our ensemble, we employ the state-of-the-art DiWA (Ramé et al., 2022) method, training ten models with varied hyperparameters. Our empirical analysis reveals that selectively removing a particular model from the ensemble pool can enhance OoD performance. This finding suggests that model redundancy is a widespread issue in ensemble learning for OoD generalization, indicating the feasibility of creating more compact ensembles that nonetheless exhibit superior generalization capabilities.

Our contributions are threefold: (i) We pioneer the exploration of ensemble pruning in the context of distribution shifts, presenting a topology-informed post-hoc optimization framework that selectively curates a highly diverse ensemble for the test distribution. (ii) Our approach is model-agnostic and can be applied on top of a broad spectrum of off-the-shelf ensembling methods for improved generalization performance. (iii) Experiments on common benchmarks demonstrate the effectiveness of our method in enhancing OoD generalization, setting new standards in both multi-source and single-source contexts.

## 2. Preliminaries

The goal of ensemble pruning is to search for a subset of members that performs as well as, or better than, the original ensemble. Error analysis of continuous problems (Breiman, 2001) shows that the ensemble error can be represented by a linear combination of individual accuracy terms and pairwise diversity terms. Motivated by this, Zhang et al. (2006) formulated ensemble pruning as a quadratic integer programming problem and used this linear combination as the optimization objective. First, a matrix $P$ records the errors of all member classifiers on the validation set:

$$P_{ij} = \begin{cases} 0, & \text{if classifier } j \text{ correctly predicts sample } i \\ 1, & \text{otherwise.} \end{cases} \quad (1)$$

Thus, $G = P^\top P$ is a matrix with the useful properties that the diagonal entries $G_{ii}$ count the misclassification errors made by each classifier $i$ on the validation data, while the off-diagonal entries $G_{ij}$ count the common errors made by classifiers $i$ and $j$. Normalization is then applied to $G$ to put its elements on the same scale:

$$\tilde{G}_{ii} = \frac{G_{ii}}{M}, \qquad \tilde{G}_{ij,\, i \neq j} = \frac{1}{2}\left(\frac{G_{ij}}{G_{ii}} + \frac{G_{ij}}{G_{jj}}\right), \quad (2)$$

where $M$ is the number of data points. Intuitively, a smaller $\tilde{G}_{ii}$ corresponds to a more accurate member classifier, while a smaller $\tilde{G}_{ij}$ implies a more different classifier pair. It is therefore straightforward that a small value of $\sum_{ij} \tilde{G}_{ij}$ implies a good ensemble. Specifically, given $N$ pre-trained models, selecting a sub-ensemble of size $K$ ($K \leq N$) with both accuracy and diversity can be formulated as:

$$\min_{z} \; z^\top \tilde{G} z \quad \text{s.t.} \; \sum_{i} z_i = K, \; z_i \in \{0, 1\}. \quad (3)$$

The binary variable $z_i$ serves as a 0/1 weight or indicator: when $z_i = 1$, the $i$-th classifier is selected. The parameter $K$ controls the size of the pruned sub-ensemble and needs to be specified beforehand. Compared to existing heuristics that use simple greedy search as the optimization method, solving Eq. 3 can rigorously reduce the generalization error with a theoretical guarantee when there are no distribution shifts.
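To make this formulation concrete, here is a minimal NumPy sketch that builds $P$ (Eq. 1) and $\tilde{G}$ (Eq. 2) from hard class predictions on a labeled validation set and solves Eq. 3 by brute-force enumeration for small $N$; the function names and the enumeration-based solver are illustrative assumptions, not the optimization procedure used in the paper.

```python
import itertools
import numpy as np

def normalized_error_matrix(preds, labels):
    """Build the error matrix P (Eq. 1) and its normalized form G-tilde (Eq. 2).

    preds:  (N, M) array of hard class predictions from N classifiers on M
            validation samples; labels: (M,) array of ground-truth classes.
    """
    P = (preds != labels[None, :]).astype(float).T   # (M, N); entry 1 marks an error
    G = P.T @ P                                      # G_ii: errors of i; G_ij: common errors of i and j
    M, N = P.shape
    diag = np.clip(np.diag(G), 1e-12, None)          # guard against division by zero
    G_tilde = np.empty_like(G)
    for i in range(N):
        for j in range(N):
            if i == j:
                G_tilde[i, j] = G[i, i] / M
            else:
                G_tilde[i, j] = 0.5 * (G[i, j] / diag[i] + G[i, j] / diag[j])
    return G_tilde

def prune_by_enumeration(G_tilde, K):
    """Solve Eq. 3 exactly by enumerating all size-K subsets (feasible for small N only)."""
    N = G_tilde.shape[0]
    best_subset, best_obj = None, np.inf
    for subset in itertools.combinations(range(N), K):
        z = np.zeros(N)
        z[list(subset)] = 1.0
        obj = float(z @ G_tilde @ z)
        if obj < best_obj:
            best_subset, best_obj = subset, obj
    return best_subset

# Toy usage: 5 classifiers, 100 validation samples, 3 classes.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=100)
preds = rng.integers(0, 3, size=(5, 100))
print(prune_by_enumeration(normalized_error_matrix(preds, labels), K=3))
```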
However, this method, as well as other ensemble pruning methods, suffers from a critical limitation under distribution shifts: it relies on supervised diversity metrics computed on training/validation data to inform the selection process, and diversity on in-distribution data may not transfer to out-of-distribution data. Consequently, the problem of efficiently pruning redundant models to improve performance under distribution shifts remains unexplored.

Figure 2 (panels: (a) Uniform Ensemble, (b) Greedy Selection, (c) Topology-informed Ensemble Pruning). Overview of different strategies for ensemble pruning. Different from existing works that are tailored for in-distribution data, our proposed topology-informed ensemble pruning is capable of promoting predictive diversity and lowering generalization errors on shifted distributions with theoretical guarantees.

## 3. Topology-informed Ensemble Pruning

It is non-trivial to select a subset of complementary models for out-of-distribution (OoD) data, as annotations are not available. To bridge this gap, we propose a topology-informed optimization method for ensemble pruning under distribution shifts. Our key idea is to learn an ensemble topology that captures the relationships between model predictions. By incorporating this topological information into the optimization in Eq. 3, we can theoretically guarantee a selection of models with improved diversity on shifted test distributions, thereby improving OoD robustness. Our method consists of two steps: Topology Learning (Sec. 3.1) and Learning on Topology (Sec. 3.2).

### 3.1. Topology Learning

We represent the ensemble topology as a weighted graph $\mathcal{G} = (V, E, W)$ that captures predictive relationships between models. The nodes $V$ represent individual models, the edges $E$ depict connectivities between models, and the adjacency matrix $W$ contains the edge weights. The edge weight between models $i$ and $j$ is defined as $W_{ij} = \exp(-D_{ij}^2/2)$, where $D_{ij}$ is the distance between models $i$ and $j$. We employ the discrepancy between Fisher Information Matrices (FIM) (Fisher, 1922) as the distance metric due to its sensitivity in capturing parameter-output relationships. This allows us to assess model similarities in their prediction mechanisms, laying the foundation for further analysis of model redundancy. For a network with parameters $\theta$, the Fisher matrix $F_\theta$ is a positive semi-definite matrix that expresses how changes to $\theta$ impact the output distribution:

$$F_\theta = \mathbb{E}_x\left[\mathbb{E}_{y \sim p_\theta(y|x)}\left[\nabla_\theta \log p_\theta(y|x)\left(\nabla_\theta \log p_\theta(y|x)\right)^\top\right]\right]. \quad (4)$$

The full Fisher matrix is intractable to compute and store for large networks. A common approximation uses the diagonal Fisher (Matena & Raffel, 2022), which has a cost similar to backpropagating gradients over $M$ examples. Specifically, we estimate the diagonal Fisher $\hat{F}_\theta$ via:

$$\hat{F}_\theta = \frac{1}{M}\sum_{i=1}^{M} \mathbb{E}_{y \sim p_\theta(y|x_i)}\left[\left(\nabla_\theta \log p_\theta(y \mid x_i)\right)^2\right]. \quad (5)$$

In Eq. 5, the gradients are computed over $M$ inputs sampled from the test set.

Alternative distance metrics. In our preliminary experiments, we explored several common distance metrics, including the $\ell_2$ distance, Earth Mover's Distance (EMD), and Maximum Mean Discrepancy (MMD). We empirically found these metrics are not as effective as the FIM-based distance (see Fig. 6). This indicates that parameter-output sensitivity plays a key role in topology learning.
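The following PyTorch sketch illustrates the topology-learning step under the assumptions that all ensemble members share the same architecture (so their diagonal Fisher vectors are directly comparable), that each model outputs classification logits, and that the expectation in Eq. 5 is approximated with a single label sampled per input. The `diagonal_fisher` and `ensemble_topology` helpers and the Euclidean discrepancy between Fisher vectors are illustrative choices, not necessarily the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def diagonal_fisher(model, inputs):
    """Monte Carlo estimate of the diagonal Fisher (Eq. 5) from unlabeled inputs.

    For each input x, a label y is sampled from the model's own predictive
    distribution and the squared gradient of log p(y|x) is accumulated.
    """
    model.eval()
    params = [p for p in model.parameters() if p.requires_grad]
    fisher = [torch.zeros_like(p) for p in params]
    for x in inputs:                                   # M test inputs
        model.zero_grad()
        log_probs = F.log_softmax(model(x.unsqueeze(0)), dim=-1)
        y = torch.multinomial(log_probs.exp().squeeze(0).detach(), 1).item()
        log_probs[0, y].backward()                     # grad of log p_theta(y|x)
        for f, p in zip(fisher, params):
            f += p.grad.detach() ** 2
    return torch.cat([f.flatten() for f in fisher]) / len(inputs)

def ensemble_topology(models, inputs):
    """Adjacency matrix with W_ij = exp(-D_ij^2 / 2), D_ij a Fisher discrepancy."""
    fishers = [diagonal_fisher(m, inputs) for m in models]
    N = len(models)
    W = torch.zeros(N, N)
    for i in range(N):
        for j in range(N):
            D = torch.norm(fishers[i] - fishers[j])    # assumed discrepancy measure
            W[i, j] = torch.exp(-D ** 2 / 2)
    return W
```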
### 3.2. Learning on Topology

After acquiring the ensemble topology $\mathcal{G}$, we integrate its adjacency matrix $W$ into the integer programming problem in Eq. 3 and formulate the optimization as:

$$\min_{z} \; \underbrace{z^\top \tilde{G} z}_{\text{In-Distribution}} + \lambda\, \underbrace{z^\top W z}_{\text{Out-of-Distribution}} \quad \text{s.t.} \; \sum_{i} z_i = K, \; z_i \in \{0, 1\}. \quad (6)$$

Here, $\lambda$ is a hyperparameter that balances in-distribution accuracy and out-of-distribution diversity. Directly solving the problem in Eq. 6 is challenging due to its NP-hard nature and binary constraints. To make it tractable, we relax the binary constraints by replacing $z_i \in \{0, 1\}$ with $0 \leq z_i \leq 1$. This turns the combinatorial optimization problem into a continuous one, which is easier to solve. To further simplify, we employ a semi-definite relaxation by introducing a matrix $Z = zz^\top$:

$$\min_{Z} \; \operatorname{tr}(\tilde{G} Z) + \lambda \operatorname{tr}(W Z) \quad \text{s.t.} \; \operatorname{tr}(Z) = K, \; Z \succeq zz^\top, \; Z \succeq 0. \quad (7)$$

Eq. 7 can be solved by standard semi-definite programming (SDP) solvers. After solving the relaxed continuous problem, we obtain a relaxed solution matrix $Z$. To convert $Z$ into a discrete 0-1 solution $z$ that selects the models, we apply an efficient rounding procedure: we pick the $K$ indices corresponding to the $K$ largest values on the diagonal of $Z$, set them to 1 in $z$, and set all other entries to 0. We empirically found that the SDP relaxation yields almost the same OoD accuracy as brute force (see Tab. 5), indicating that the relaxation provides a tight approximation to the original combinatorial problem.
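A minimal sketch of the relaxation in Eq. 7 using CVXPY (with its default SDP-capable solver) is given below, followed by the top-$K$ diagonal rounding described above. Encoding $Z \succeq zz^\top$ through the standard Schur-complement lifting, the relaxed cardinality constraint on $z$, and the function name and default $\lambda$ are illustrative assumptions rather than the authors' exact implementation.

```python
import cvxpy as cp
import numpy as np

def prune_by_sdp(G_tilde, W, K, lam=1.0):
    """Solve the SDP relaxation of Eq. 7 and round to a 0/1 selection vector.

    The lifted variable Y = [[Z, z], [z^T, 1]] is constrained to be PSD,
    which encodes Z >= z z^T via a Schur complement.
    """
    N = G_tilde.shape[0]
    Y = cp.Variable((N + 1, N + 1), PSD=True)
    Z, z = Y[:N, :N], Y[:N, N]
    constraints = [Y[N, N] == 1,
                   cp.trace(Z) == K,       # tr(Z) = K
                   cp.sum(z) == K,         # relaxed cardinality constraint
                   z >= 0, z <= 1]
    objective = cp.Minimize(cp.trace(G_tilde @ Z) + lam * cp.trace(W @ Z))
    cp.Problem(objective, constraints).solve()
    # Rounding: keep the K indices with the largest diagonal entries of Z.
    selected = np.argsort(-np.diag(Z.value))[:K]
    z_binary = np.zeros(N, dtype=int)
    z_binary[selected] = 1
    return z_binary

# Toy usage with matrices from the previous sketches (as NumPy arrays):
# z = prune_by_sdp(G_tilde, W, K=3, lam=1.0)
```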
The proposed topology-informed optimization framework seamlessly incorporates ensemble topology and accuracy information to select a subset of models suited for OoD data. By relaxing the integer constraints, we transform the intractable problem into an efficiently solvable SDP, and the resulting continuous solution is rounded to retrieve a 0-1 model selection vector $z$. Our approach provides the following key advantages: (i) it jointly optimizes for OoD diversity and in-distribution accuracy without separate steps; (ii) it is agnostic to the model family and applicable to any pre-trained models; (iii) it is efficient at test time, requiring only forwarding samples through the models without re-training.

In summary, the ensemble topology graph captures predictive relationships, while the Gram matrix provides accuracy validation. Fusing both sources of information into a combinatorial optimization formulation allows us to intelligently prune the ensemble and improve robustness under distribution shift. In Sec. 4, we show that the pruned ensemble obtained by solving the quadratic programming problem in Eq. 7 enjoys promoted diversity (Lem. 4.1) and lower generalization error (Thm. 4.4) on target distributions.

## 4. Theoretical Analysis

In this section, we provide theoretical guarantees for the proposed topology-aware ensemble pruning. In Lem. 4.1 (Diversity Promotion), we show that the pruned ensemble promotes diversity compared to the original set of models. In Thm. 4.4 (Generalization Error of the Pruned Ensemble), we present a bias-variance-diversity decomposition of the expected ensemble risk, which highlights the role of diversity in reducing the generalization error on target distributions. We provide the step-by-step proofs in the Appendix.

Lemma 4.1 (Diversity Promotion). Let $S \subseteq V$ be the pruned ensemble of size $K$, obtained by solving the optimization problem in Eq. 6. Define the average pairwise similarity among all models in $V$ as:

$$\bar{W}_V = \frac{2}{N(N-1)} \sum_{i < j} W_{ij}.$$
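To make the quantity in Lemma 4.1 concrete, the short sketch below computes the average pairwise similarity over all models and over a pruned subset from the adjacency matrix $W$; the random matrix and the subset are placeholders for illustration only, and a lower value for the subset indicates higher predictive diversity among the selected models.

```python
import numpy as np

def avg_pairwise_similarity(W, idx=None):
    """Average of W_ij over unordered pairs i < j within the given index set."""
    idx = np.arange(W.shape[0]) if idx is None else np.asarray(idx)
    sub = W[np.ix_(idx, idx)]
    return sub[np.triu_indices(len(idx), k=1)].mean()

# W_bar_V over all N models vs. W_bar_S over a pruned subset S.
rng = np.random.default_rng(0)
W = rng.random((10, 10)); W = (W + W.T) / 2   # placeholder adjacency matrix
S = [0, 2, 5, 7]                              # placeholder pruned subset
print(avg_pairwise_similarity(W), avg_pairwise_similarity(W, S))
```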