# Packed-Ensembles for Efficient Uncertainty Estimation

Published as a conference paper at ICLR 2023

Olivier Laurent,1,2,* Adrien Lafage,2,* Enzo Tartaglione,3 Geoffrey Daniel,1 Jean-Marc Martinez,1 Andrei Bursuc4 & Gianni Franchi2
Université Paris-Saclay, CEA, SGLS,1 U2IS, ENSTA Paris, Institut Polytechnique de Paris,2 LTCI, Télécom Paris, Institut Polytechnique de Paris,3 valeo.ai4
*Equal contribution. Corresponding author: gianni.franchi@ensta-paris.fr

ABSTRACT

Deep Ensembles (DE) are a prominent approach for achieving excellent performance on key metrics such as accuracy, calibration, uncertainty estimation, and out-of-distribution detection. However, hardware limitations of real-world systems constrain them to smaller ensembles and lower-capacity networks, significantly deteriorating their performance and properties. We introduce Packed-Ensembles (PE), a strategy to design and train lightweight structured ensembles by carefully modulating the dimension of their encoding space. We leverage grouped convolutions to parallelize the ensemble into a single shared backbone and forward pass to improve training and inference speeds. PE is designed to operate within the memory limits of a standard neural network. Our extensive research indicates that PE accurately preserves the properties of DE, such as diversity, and performs equally well in terms of accuracy, calibration, out-of-distribution detection, and robustness to distribution shift. We make our code available at github.com/ENSTA-U2IS/torch-uncertainty.

Figure 1: Evaluation of computation cost vs. performance trade-offs for multiple uncertainty quantification techniques on CIFAR-100. The y-axis and x-axis respectively show the accuracy (%) and the inference throughput in images per second (×10³). The circle area is proportional to the number of parameters. Optimal approaches are closer to the top-right corner. Packed-Ensembles strikes a good balance between predictive performance and speed.

1 INTRODUCTION

Real-world safety-critical machine learning decision systems such as autonomous driving (Levinson et al., 2011; McAllister et al., 2017) impose exceptionally high reliability and performance requirements across a broad range of metrics: accuracy, calibration, robustness to distribution shifts, uncertainty estimation, and computational efficiency under limited hardware resources. Despite significant improvements in performance in recent years, vanilla Deep Neural Networks (DNNs) still exhibit several shortcomings, notably overconfidence in both correct and wrong predictions (Nguyen et al., 2015; Guo et al., 2017; Hein et al., 2019). Deep Ensembles (Lakshminarayanan et al., 2017) have emerged as a prominent approach to address these challenges by leveraging predictions from multiple high-capacity neural networks. By averaging predictions or by voting, DE achieve high accuracy and robustness since potentially unreliable predictions are exposed via the disagreement between individual members. Thanks to the simplicity and effectiveness of the ensembling strategy (Dietterich, 2000), DE have become widely used and dominate performance across various benchmarks (Ovadia et al., 2019; Gustafsson et al., 2020). DE meet most of the real-world application requirements except computational efficiency.
Specifically, DE are computationally demanding in terms of memory storage, number of operations, and inference time during both training and testing, as their costs grow linearly with the number of individual members. Their computational costs are, therefore, prohibitive under tight hardware constraints. This limitation of DE has inspired numerous approaches proposing computationally efficient alternatives: multi-head networks (Lee et al., 2015; Chen & Shrivastava, 2020), ensemble-imitating layers (Wen et al., 2019; Havasi et al., 2020; Ramé et al., 2021), multiple forward passes on different weight subsets of the same network (Gal & Ghahramani, 2016; Durasov et al., 2021), ensembles of smaller networks (Kondratyuk et al., 2020; Lobacheva et al., 2020), computing ensembles from a single training run (Huang et al., 2017; Garipov et al., 2018), and efficient Bayesian Neural Networks (Maddox et al., 2019; Franchi et al., 2020). These approaches typically improve storage usage, training cost, or inference time at the price of lower accuracy and lower diversity in the predictions.

An essential property of ensembles for improving predictive uncertainty estimation is the diversity of their predictions. Perrone & Cooper (1992) show that the independence of individual members is critical to the success of ensembling. Fort et al. (2019) argue that the diversity of DE, stemming from the randomness of weight initialization, data augmentation and batching, and stochastic gradient updates, is superior to that of other efficient ensembling alternatives, despite their predictive performance boosts. Few approaches manage to mirror this property of DE while remaining computationally close to a single DNN (in terms of memory usage, number of forward passes, and image throughput).

In this work, we aim to design a DNN architecture that closely mimics the properties of ensembles, in particular having a set of independent networks, in a computationally efficient manner. Previous works propose ensembles composed of small models (Kondratyuk et al., 2020; Lobacheva et al., 2020) and achieve performance comparable to that of a single large model. We build upon this idea and devise a strategy based on small networks that aims to match the performance of an ensemble of large networks. To this end, we leverage grouped convolutions (Krizhevsky et al., 2012) to delineate multiple subnetworks within the same network. The parameters of each subnetwork are not shared across subnetworks, leading to independent smaller models. This method enables fast training and inference times while keeping predictive uncertainty quantification close to that of DE (Figure 1). In summary, our contributions are the following:

- We propose Packed-Ensembles (PE), an efficient ensembling architecture relying on grouped convolutions, as a formalization of structured sparsity for Deep Ensembles;
- We extensively evaluate PE regarding accuracy, calibration, OOD detection, and distribution shift on classification and regression tasks, and show that PE achieves state-of-the-art predictive uncertainty quantification;
- We thoroughly study and discuss the properties of PE (diversity, sparsity, stability, behavior of subnetworks) and release our PyTorch implementation.

2 BACKGROUND

In this section, we present the formalism for this work and offer a brief background on grouped convolutions and ensembles of DNNs. Appendix A summarizes the main notations in Table 3.
2.1 BACKGROUND ON CONVOLUTIONS

The convolutional layer (LeCun et al., 1989) consists of a series of cross-correlations between feature maps $h^j \in \mathbb{R}^{C_j \times H_j \times W_j}$, regrouped in batches of size $B$, and a weight tensor $\omega^j \in \mathbb{R}^{C_{j+1} \times C_j \times s_j^2}$, with $C_j$, $H_j$, $W_j$ three integers representing the number of channels, the height, and the width of $h^j$, respectively. $C_{j+1}$ and $s_j$ are also two integers corresponding to the number of channels of $h^{j+1}$ (the output of the layer) and the kernel size. Finally, $j$ is the layer's index and will be fixed in the following formulae. The bias of convolution layers is omitted in the following for simplicity.

Figure 2: Overview of the considered architectures: (left) baseline vanilla network; (center) Deep Ensembles; (right) Packed-Ensembles-(α, M = 3, γ = 2).

Hence, the output of the convolution layer, denoted $z^{j+1}$, is:

$$z^{j+1}(c,:,:) = (h^j * \omega^j)(c,:,:) = \sum_{k=0}^{C_j - 1} \omega^j(c,k,:,:) \star h^j(k,:,:), \tag{1}$$

where $c \in$ J0, $C_{j+1}-1$K is the index of the considered channel of the output feature map, $\star$ is the classical 2D cross-correlation operator, and $z^j$ is the pre-activation feature map such that $h^j = \phi(z^j)$, with $\phi$ an activation function.

To embed an ensemble of subnetworks, we leverage grouped convolutions, already used in ResNeXt (Xie et al., 2017) to train several DNN branches in parallel. The grouped convolution operation with $\gamma$ groups and weights $\omega^j_\gamma \in \mathbb{R}^{C_{j+1} \times \frac{C_j}{\gamma} \times s_j^2}$ is given in (2), $\gamma$ dividing $C_j$ for all layers. Any output channel $c$ is produced by a specific group (set of filters), identified by the integer $\left\lfloor \frac{\gamma c}{C_{j+1}} \right\rfloor$, which only uses $\frac{1}{\gamma}$ of the input channels:

$$z^{j+1}(c,:,:) = (h^j * \omega^j_\gamma)(c,:,:) = \sum_{k=0}^{\frac{C_j}{\gamma} - 1} \omega^j_\gamma(c,k,:,:) \star h^j\!\left(k + \left\lfloor \frac{\gamma c}{C_{j+1}} \right\rfloor \frac{C_j}{\gamma},\, :,\, :\right). \tag{2}$$

The grouped convolution layer is mathematically equivalent to a classical convolution where the weights are multiplied element-wise by the binary tensor $\text{mask}^j \in \{0,1\}^{C_{j+1} \times C_j \times s_j^2}$ such that $\text{mask}^j_m(k,l,:,:) = 1$ if $\left\lfloor \frac{\gamma l}{C_j} \right\rfloor = \left\lfloor \frac{\gamma k}{C_{j+1}} \right\rfloor = m$, and $0$ otherwise, for each group $m \in$ J0, $\gamma-1$K. The complete layer mask is finally defined as $\text{mask}^j = \sum_{m=0}^{\gamma-1} \text{mask}^j_m$, and the grouped convolution can therefore be rewritten as $z^{j+1} = h^j * (\omega^j \odot \text{mask}^j)$, where $\odot$ is the Hadamard product.
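To make this equivalence concrete, the following minimal PyTorch sketch (not the paper's released code) checks numerically that a grouped convolution with γ groups matches a standard convolution whose weight tensor has been masked block-diagonally, i.e., $z^{j+1} = h^j * (\omega^j \odot \text{mask}^j)$:

```python
# Sketch: grouped convolution == full convolution with a block-diagonal mask.
import torch
import torch.nn.functional as F

B, C_in, C_out, H, W, k, gamma = 2, 8, 12, 16, 16, 3, 4

x = torch.randn(B, C_in, H, W)
w_grouped = torch.randn(C_out, C_in // gamma, k, k)  # grouped-conv weights

# Build the equivalent dense weight: each output channel only "sees" the input
# channels of its own group; every other entry is zeroed by the mask.
w_dense = torch.zeros(C_out, C_in, k, k)
for m in range(gamma):
    out_sl = slice(m * C_out // gamma, (m + 1) * C_out // gamma)
    in_sl = slice(m * C_in // gamma, (m + 1) * C_in // gamma)
    w_dense[out_sl, in_sl] = w_grouped[out_sl]

y_grouped = F.conv2d(x, w_grouped, groups=gamma, padding=k // 2)
y_masked = F.conv2d(x, w_dense, padding=k // 2)
print(torch.allclose(y_grouped, y_masked, atol=1e-5))  # True
```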
2.2 BACKGROUND ON DEEP ENSEMBLES

For an image classification problem, let us define a dataset $D = \{x_i, y_i\}_{i=1}^{|D|}$ containing $|D|$ pairs of samples $x_i = h^0_i \in \mathbb{R}^{C_0 \times H_0 \times W_0}$ and one-hot-encoded labels $y_i \in \mathbb{R}^{N_C}$, modeled as the realization of a joint distribution $P(X, Y)$, where $N_C$ is the number of classes in the dataset. The input data $x_i$ is processed via a neural network $f_\theta$, which is a parametric probabilistic model such that $\hat{y}_i = f_\theta(x_i) = P(Y = y_i \mid X = x_i; \theta)$. This approach consists in considering the prediction $\hat{y}_i$ as the parameters of a Multinoulli distribution.

Figure 3: Equivalent architectures for Packed-Ensembles: (a) the first, sequential version; (b) the version with the rearrange operation and grouped convolutions; and (c) the final version beginning with a full convolution.

To improve the quality of both predictions and estimated uncertainties, as well as the detection of OOD samples, Lakshminarayanan et al. (2017) propose to ensemble $M$ randomly initialized DNNs into a large predictor called Deep Ensembles. These ensembles can be seen as a discrete approximation of the intractable Bayesian marginalization over the weights, according to Wilson & Izmailov (2020). If we denote by $\{\theta_m\}_{m=0}^{M-1}$ the set of trained weights of the $M$ DNNs, Deep Ensembles consist in averaging the predictions of these $M$ DNNs as in equation (3):

$$P(y_i \mid x_i, D) = \frac{1}{M} \sum_{m=0}^{M-1} P(y_i \mid x_i, \theta_m). \tag{3}$$

3 PACKED-ENSEMBLES

This section describes how to efficiently train multiple subnetworks using grouped convolutions. Then, we explain how our new architectures are equivalent to training several networks in parallel.

3.1 REVISITING DEEP ENSEMBLES

Although Deep Ensembles provide undisputed benefits, they also come with the significant drawback that the training time and the memory usage at inference increase linearly with the number of networks. To alleviate these problems, we propose assembling small subnetworks, which are essentially DNNs with fewer parameters. Moreover, while ensembles have mostly been trained sequentially to this day, we suggest leveraging grouped convolutions to massively accelerate their training and inference computations thanks to their smaller size. The propagation of grouped convolutions with $M$ groups, $M$ being the number of subnetworks in the ensemble, ensures that the subnetworks are trained independently while dividing their encoding dimension by a factor $M$. More details on the usefulness of grouped convolutions for training ensembles can be found in subsection 3.3.

To create Packed-Ensembles (illustrated in Figure 2), we build on small subnetworks but compensate for the dramatic decrease of the model capacity by multiplying the width by the hyperparameter $\alpha$, which can be seen as an expansion factor. Hence, we propose Packed-Ensembles-(α, M, 1) as a flexible formalization of ensembles of small subnetworks. For an ensemble of $M$ subnetworks, Packed-Ensembles-(α, M, 1) therefore modifies the encoding dimension by a factor $\alpha/M$, and the inference of our ensemble is computed with the following formula, omitting the index $i$ of the sample:

$$P(y \mid x, D) = \frac{1}{M} \sum_{m=0}^{M-1} P(y \mid \theta_{\alpha,m}, x), \quad \text{with } \theta_{\alpha,m} = \{\omega^j_\alpha \odot \text{mask}^j_m\}_j, \tag{4}$$

where $\omega^j_\alpha$ is the weight tensor of layer $j$, of dimension $(\alpha C_{j+1}) \times (\alpha C_j) \times s_j^2$.

Figure 4: Diagram representation of a subnetwork mask $\text{mask}^j$ for a fully connected layer $j$, with (a) M = 2, γ = 1 and (b) M = 2, γ = 2.

In the following, we introduce another hyperparameter $\gamma$ corresponding to the number of groups within each subnetwork of the Packed-Ensembles, creating another level of sparsity. These groups, also called subgroups, are applied to the different subnetworks. Formally, we denote our technique Packed-Ensembles-(α, M, γ), with the hyperparameters in parentheses. In this work, we consider a constant number of subgroups across the layers; therefore, $\gamma$ divides $\alpha C_j$ for all $j$.

3.2 COMPUTATIONAL COST

For a convolutional layer involving $C_j$ input channels, $C_{j+1}$ output channels, kernels of size $s_j$, and $\gamma$ subgroups, the number of parameters is equal to

$$M \left[ \frac{\alpha C_{j+1}}{M} \times \frac{\alpha C_j}{M} \times s_j^2 \times \gamma^{-1} \right] = \frac{\alpha^2}{M \gamma}\, C_{j+1} C_j s_j^2.$$

The same formula applies to dense layers, seen as 1×1 convolutions. Two notable cases emerge when the architectures of the subnetworks are fully convolutional or dense. If $\alpha^2 = M\gamma$, the number of parameters in the ensemble equals the number of parameters of a single model. With $\alpha = M$ (and $\gamma = 1$), each subnetwork corresponds to a single model, and the ensemble is therefore equivalent in size to DE.

3.3 IMPLEMENTATION DETAILS

We propose a simple way of designing efficient ensemble convolutional layers using grouped convolutions.
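As a rough illustration of the idea (the class and argument names below are hypothetical and do not reproduce the released torch-uncertainty API), such a layer can be written as a single grouped nn.Conv2d whose width is expanded by α and whose M·γ groups keep the M subnetworks independent:

```python
# Minimal sketch of a Packed-Ensembles convolution (assumed names, not the
# released implementation): one grouped convolution holds M independent
# subnetworks whose encoding dimension is scaled by alpha / M.
import torch
import torch.nn as nn

class PackedConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size,
                 num_estimators=4, alpha=2, gamma=1, **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(
            alpha * in_channels,            # expanded encoding dimension
            alpha * out_channels,
            kernel_size,
            groups=num_estimators * gamma,  # M subnetworks, gamma subgroups each
            **kwargs,
        )

    def forward(self, x):
        return self.conv(x)

# Packed feature map carrying all M=4 subnetworks at once (alpha * 64 channels).
M, alpha = 4, 2
x = torch.randn(8, alpha * 64, 16, 16)
layer = PackedConv2d(64, 128, 3, num_estimators=M, alpha=alpha, padding=1)
print(layer(x).shape)  # torch.Size([8, 256, 16, 16])
```

Because the groups never mix channels across subnetworks, the gradients of each subnetwork remain independent, which is the property exploited below.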
To take advantage of the parallelization capabilities of GPUs in training and inference, we replace the sequential training architecture, (a) in Figure 3, with the parallel implementations (b) and (c). Figure 3 summarizes the different equivalent architectures for a simple ensemble of M = 3 DNNs with three convolutional layers and a final dense layer (equivalent to a 1×1 convolution), with α = γ = 1. In (b), we propose to stack the feature maps on the channel dimension (denoted as the rearrange operation).¹ This yields a feature map $h^j$ of size $(M \cdot C_j) \times H_j \times W_j$, regrouped in batches of size only $B/M$, with $B$ the batch size of the ensemble. One solution to keep the same batch size is to repeat the batch $M$ times so that its size equals $B$ after the rearrangement. Using convolutions with $M$ groups and $\gamma$ subgroups per subnetwork, each feature map is convolved separately by each subnetwork and yields its own independent output. Grouped convolutions are propagated until the end of the network to ensure that gradients remain independent between subnetworks. Other operations, such as Batch Normalization (Ioffe & Szegedy, 2015), can be applied directly as long as they can be grouped or act independently on each channel. Figure 4a illustrates the mask used to encode Packed-Ensembles in the case where M = 2; similarly, Figure 4b shows the mask with M = 2 and γ = 2.

Finally, (b) and (c) are also equivalent. It is indeed possible to replace the rearrange operation and the first grouped convolution with a standard convolution if the same images are provided simultaneously to all the subnetworks. We confirm in Appendix F that this procedure is not detrimental to the ensemble's performance, and we take advantage of this property for this final optimization and simplification.

¹See https://einops.rocks/api/rearrange/

4 EXPERIMENTS

To validate the performance of our method, we conduct experiments on classification tasks and measure the influence of the parameters α and γ. Regression tasks are detailed in Appendix N.

4.1 DATASETS AND ARCHITECTURES

First, we demonstrate the efficiency of Packed-Ensembles on CIFAR-10 and CIFAR-100 (Krizhevsky, 2009), showing how the method adapts to tasks of different complexities. As we propose to replace a single model architecture with several subnetworks, we study the behavior of PE on architectures of various sizes: ResNet-18, ResNet-50 (He et al., 2016), and WideResNet28-10 (Zagoruyko & Komodakis, 2016). We compare it against Deep Ensembles (Lakshminarayanan et al., 2017) and three other approximate ensembles from the literature: BatchEnsemble (Wen et al., 2019), MIMO (Havasi et al., 2020), and Masksembles (Durasov et al., 2021). Second, we report our results for Packed-Ensembles on ImageNet (Deng et al., 2009), which we compare against all baselines. We run experiments with ResNet-50 and ResNet-50x4. All training runs start from scratch.

4.1.1 METRICS, OOD DATASETS, AND IMPLEMENTATION

We evaluate the overall performance of the models on classification tasks using the accuracy (Acc) in % and the Negative Log-Likelihood (NLL). We choose the classical Expected Calibration Error (ECE) (Naeini et al., 2015) to assess the calibration of uncertainties² and measure the quality of OOD detection using the Area Under the Precision-Recall curve (AUPR) and the Area Under the ROC curve (AUC), as well as the False Positive Rate at 95% recall (FPR95), all expressed in %, similarly to Hendrycks & Gimpel (2017).
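For reference, these OOD detection metrics can be computed from maximum softmax probabilities with a few lines of code; the sketch below assumes scikit-learn is available and treats OOD samples as the positive class (conventions differ across papers, so it is an illustration rather than the exact evaluation code used here):

```python
# Sketch of AUC, AUPR, and FPR95 computed from maximum softmax probabilities.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, roc_curve

def ood_metrics(msp_in, msp_ood):
    """msp_in / msp_ood: max softmax probabilities on ID and OOD samples."""
    scores = np.concatenate([msp_in, msp_ood])
    labels = np.concatenate([np.zeros_like(msp_in), np.ones_like(msp_ood)])
    # OOD samples are detected by a *low* MSP, hence the sign flip.
    auc = roc_auc_score(labels, -scores)
    aupr = average_precision_score(labels, -scores)
    fpr, tpr, _ = roc_curve(labels, -scores)
    fpr95 = float(fpr[np.searchsorted(tpr, 0.95)])  # FPR at 95% OOD recall
    return 100 * auc, 100 * aupr, 100 * fpr95
```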
We use accuracy as the validation criterion (i.e., the final trained model is the one with the highest validation accuracy). During inference, we average the softmax probabilities of all subnetworks and consider the index of the maximum of the output vector to be the predicted class of the ensemble. We define the prediction confidence as this maximum value (also called the maximum softmax probability). For OOD detection tasks on CIFAR-10 and CIFAR-100, we use the SVHN dataset (Netzer et al., 2011) as an out-of-distribution dataset and transform the initial classification problem into a binary classification between in-distribution and OOD data, using the maximum softmax probability as the criterion. We discuss the different OOD criteria in Appendix E. For ImageNet, we use two out-of-distribution datasets, ImageNet-O (Hendrycks et al., 2021b) and Texture (Wang et al., 2022), and use the Mutual Information (MI) as the criterion for the ensemble techniques (see Appendix E for details on MI) and the maximum softmax probability for the single model and MIMO. To measure robustness under distribution shift, we use ImageNet-R (Hendrycks et al., 2021a) and evaluate the accuracy, ECE, and NLL on this dataset, denoted rAcc, rECE, and rNLL, respectively.

We implement our models using the PyTorch-Lightning framework built on top of PyTorch; both are open-source Python frameworks. Appendix B and Table 4 detail the hyperparameters used in our experiments across architectures and datasets. Most training runs are completed on a single Nvidia RTX 3090, except for ImageNet, for which we use 2 to 8 Nvidia A100-80GB GPUs.

4.1.2 RESULTS

Table 1 presents the average performance on the classification tasks over five runs using the hyperparameters in Table 4. We demonstrate that Packed-Ensembles, in the setting α = 2 and γ = 2, yields results similar to Deep Ensembles while having a lower memory cost than a single model. For CIFAR-10, the relative performance of PE compared to DE appears to increase as the original architecture becomes larger. When using ResNet-18, Packed-Ensembles matches Deep Ensembles on OOD detection metrics but shows slightly worse performance on the others. However, using ResNet-50, both models perform similarly, and PE slightly outperforms DE in classification performance with WideResNet28-10.

²Note that the benchmark uncertainty-baselines only uses ECE to measure calibration.

Table 1: Performance comparison (averaged over five runs) on CIFAR-10/100 using ResNet-18 (R18), ResNet-50 (R50), and WideResNet28-10 (WR) architectures. All ensembles have M = 4 subnetworks; we highlight the best performances in bold. For our method, we consider α = γ = 2, except for WR on C100, where γ = 1. Mult-Adds corresponds to the inference cost, i.e., the number of giga multiply-add operations for a forward pass, estimated with Torchinfo (2022).
| Method | Data | Net | Acc | NLL | ECE | AUPR | AUC | FPR95 | Params (M) | Mult-Adds (G) |
|---|---|---|---|---|---|---|---|---|---|---|
| Single Model | C10 | R18 | 94.0 | 0.238 | 0.035 | 94.0 | 89.7 | 33.8 | 11.17 | 0.56 |
| BatchEnsemble | C10 | R18 | 92.9 | 0.257 | 0.031 | 92.4 | 87.8 | 32.1 | 11.21 | 2.22 |
| MIMO (ρ=1) | C10 | R18 | 94.0 | 0.228 | 0.033 | 94.4 | 90.2 | 28.6 | 11.19 | 0.56 |
| Masksembles | C10 | R18 | 94.0 | 0.188 | 0.009 | 93.6 | 89.5 | 27.8 | 11.24 | 2.22 |
| Packed-Ensembles | C10 | R18 | 94.3 | 0.178 | 0.007 | 94.7 | 91.3 | 23.2 | 8.18 | 0.48 |
| Deep Ensembles | C10 | R18 | 95.1 | 0.156 | 0.008 | 94.7 | 91.3 | 18.0 | 44.70 | 2.22 |
| Single Model | C10 | R50 | 95.1 | 0.211 | 0.031 | 95.2 | 91.9 | 23.6 | 23.52 | 1.30 |
| BatchEnsemble | C10 | R50 | 93.9 | 0.255 | 0.033 | 94.7 | 91.3 | 20.1 | 23.63 | 5.19 |
| MIMO (ρ=1) | C10 | R50 | 95.4 | 0.197 | 0.030 | 95.1 | 90.8 | 26.0 | 23.59 | 1.30 |
| Masksembles | C10 | R50 | 95.3 | 0.175 | 0.019 | 95.7 | 92.2 | 22.1 | 23.81 | 5.19 |
| Packed-Ensembles | C10 | R50 | 95.9 | 0.137 | 0.008 | 97.3 | 95.2 | 14.4 | 14.55 | 1.00 |
| Deep Ensembles | C10 | R50 | 96.0 | 0.136 | 0.008 | 97.0 | 94.7 | 15.5 | 94.08 | 5.19 |
| Single Model | C10 | WR | 95.4 | 0.200 | 0.029 | 96.1 | 93.2 | 20.4 | 36.49 | 5.95 |
| BatchEnsemble | C10 | WR | 95.6 | 0.206 | 0.027 | 95.5 | 92.5 | 22.1 | 36.59 | 23.81 |
| MIMO (ρ=1) | C10 | WR | 94.7 | 0.234 | 0.034 | 94.9 | 90.6 | 30.9 | 36.51 | 5.96 |
| Masksembles | C10 | WR | 94.0 | 0.186 | 0.016 | 97.2 | 95.0 | 14.5 | 36.53 | 23.82 |
| Packed-Ensembles | C10 | WR | 96.2 | 0.133 | 0.009 | 98.1 | 96.5 | 11.1 | 19.35 | 4.06 |
| Deep Ensembles | C10 | WR | 95.8 | 0.143 | 0.013 | 97.8 | 96.0 | 12.5 | 145.96 | 23.82 |
| Single Model | C100 | R18 | 75.1 | 1.016 | 0.093 | 88.6 | 79.5 | 55.0 | 11.22 | 0.56 |
| BatchEnsemble | C100 | R18 | 71.2 | 1.236 | 0.116 | 86.0 | 75.4 | 60.2 | 11.25 | 2.22 |
| MIMO (ρ=1) | C100 | R18 | 75.3 | 0.962 | 0.069 | 89.2 | 80.7 | 52.9 | 11.36 | 0.56 |
| Masksembles | C100 | R18 | 74.2 | 1.054 | 0.061 | 86.7 | 76.3 | 59.8 | 11.24 | 2.22 |
| Packed-Ensembles | C100 | R18 | 76.4 | 0.858 | 0.041 | 88.7 | 79.8 | 57.1 | 8.27 | 0.48 |
| Deep Ensembles | C100 | R18 | 78.2 | 0.800 | 0.018 | 90.2 | 82.4 | 50.5 | 44.88 | 2.22 |
| Single Model | C100 | R50 | 78.3 | 0.905 | 0.089 | 87.4 | 77.9 | 57.6 | 23.70 | 1.30 |
| BatchEnsemble | C100 | R50 | 66.6 | 1.788 | 0.182 | 85.2 | 74.6 | 60.6 | 23.81 | 5.19 |
| MIMO (ρ=1) | C100 | R50 | 79.0 | 0.876 | 0.079 | 87.5 | 76.9 | 64.7 | 24.33 | 1.30 |
| Masksembles | C100 | R50 | 78.5 | 0.832 | 0.046 | 90.3 | 81.9 | 52.3 | 23.81 | 5.19 |
| Packed-Ensembles | C100 | R50 | 81.2 | 0.703 | 0.020 | 90.0 | 81.7 | 56.5 | 15.55 | 1.00 |
| Deep Ensembles | C100 | R50 | 80.9 | 0.713 | 0.026 | 89.2 | 80.8 | 52.5 | 94.82 | 5.19 |
| Single Model | C100 | WR | 80.3 | 0.963 | 0.156 | 81.0 | 64.2 | 80.1 | 36.55 | 5.95 |
| BatchEnsemble | C100 | WR | 82.3 | 0.835 | 0.130 | 88.1 | 78.2 | 69.8 | 36.65 | 23.81 |
| MIMO (ρ=1) | C100 | WR | 80.2 | 0.822 | 0.028 | 84.9 | 72.0 | 72.8 | 36.74 | 5.96 |
| Masksembles | C100 | WR | 74.4 | 0.937 | 0.063 | 76.1 | 60.0 | 75.1 | 36.59 | 23.82 |
| Packed-Ensembles | C100 | WR | 83.9 | 0.678 | 0.089 | 86.2 | 73.2 | 80.7 | 36.62 | 5.95 |
| Deep Ensembles | C100 | WR | 82.5 | 0.903 | 0.229 | 81.6 | 67.9 | 71.3 | 146.19 | 23.82 |

On CIFAR-100, Deep Ensembles outperform Packed-Ensembles with ResNet-18. However, we argue that the ResNet-18 architecture needs more representation capacity before being divided into subnetworks for CIFAR-100. Indeed, looking at the results for ResNet-50, Packed-Ensembles obtains better results than Deep Ensembles. This analysis demonstrates that, given a sufficiently large network, Packed-Ensembles is able to match Deep Ensembles with only 16% of its parameters. In Appendix D, we discuss the influence of the representation capacity. Based on the results in Table 2, we can conclude that Packed-Ensembles improves uncertainty quantification for OOD detection and distribution shift on ImageNet compared to Deep Ensembles and the single model, and that it improves accuracy at a moderate training and inference cost.

4.1.3 STUDY ON THE PARAMETERS α AND γ

Table 1 reports results for α = 2 and γ = 2. However, the optimal values of these hyperparameters depend on the balance between computational cost and performance.
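As a quick aid for this trade-off, the per-layer parameter count of Section 3.2 can be turned into a small helper; the sketch below ignores biases, the first layer, and the per-group channel minimum of Appendix B, so it only gives the idealized ratio α²/(Mγ):

```python
# Back-of-the-envelope parameter count of one packed convolution vs. one
# standard convolution (idealized; biases and special-cased layers ignored).
def packed_conv_params(c_in, c_out, k, alpha, M, gamma=1):
    return M * (alpha * c_out // M) * (alpha * c_in // (M * gamma)) * k * k

single = packed_conv_params(64, 128, 3, alpha=1, M=1)
for alpha, M, gamma in [(2, 4, 1), (2, 4, 2), (3, 4, 1)]:
    ratio = packed_conv_params(64, 128, 3, alpha, M, gamma) / single
    print(f"PE-({alpha},{M},{gamma}): {ratio:.2f}x the single-model layer")
# PE-(2,4,1) -> 1.00x, PE-(2,4,2) -> 0.50x, PE-(3,4,1) -> 2.25x
```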
To help users strike the best compromise, we propose Figures 6 and 7 in Appendix D, which illustrate the impact of changing α on the performance of Packed-Ensembles.

Table 2: Performance comparison on ImageNet using ResNet-50 (R50) and ResNet-50x4 (R50x4). All ensembles have M = 4 subnetworks and γ = 1. We highlight the best performances in bold. For the OOD tasks, we use ImageNet-O (IO) and Texture (T), and for distribution shift we use ImageNet-R. The number of parameters and operations are available in Appendix M.

| Method | Net | Acc | ECE | AUPR-T | AUC-T | FPR95-T | AUPR-IO | AUC-IO | FPR95-IO | rAcc | rNLL | rECE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Single Model | R50 | 77.8 | 0.121 | 18.0 | 80.9 | 68.6 | 3.6 | 50.8 | 90.8 | 23.5 | 5.187 | 0.082 |
| BatchEnsemble | R50 | 75.9 | 0.035 | 20.2 | 81.6 | 66.5 | 4.0 | 55.2 | 82.3 | 21.0 | 6.148 | 0.165 |
| MIMO (ρ=1) | R50 | 77.6 | 0.147 | 18.4 | 81.6 | 66.8 | 3.7 | 52.2 | 90.6 | 23.4 | 5.115 | 0.059 |
| Masksembles | R50 | 73.6 | 0.209 | 13.6 | 79.7 | 68.3 | 3.3 | 47.7 | 87.7 | 21.2 | 5.139 | 0.011 |
| Packed-Ensembles α=3 | R50 | 77.9 | 0.180 | 35.1 | 88.2 | 43.7 | 9.9 | 68.4 | 80.9 | 23.8 | 4.978 | 0.022 |
| Deep Ensembles | R50 | 79.2 | 0.233 | 19.6 | 83.4 | 62.1 | 3.7 | 52.5 | 85.5 | 24.9 | 4.879 | 0.018 |
| Single Model | R50x4 | 80.2 | 0.022 | 20.5 | 82.6 | 63.9 | 4.9 | 60.2 | 87.4 | 26.0 | 5.190 | 0.1721 |
| BatchEnsemble | R50x4 | 77.7 | 0.024 | 23.8 | 82.8 | 63.8 | 4.4 | 58.4 | 80.5 | 23.4 | 6.079 | 0.203 |
| MIMO (ρ=1) | R50x4 | 80.3 | 0.015 | 19.3 | 82.5 | 66.1 | 4.9 | 60.7 | 86.4 | 25.8 | 5.278 | 0.189 |
| Masksembles | R50x4 | 79.8 | 0.137 | 21.5 | 83.3 | 63.5 | 4.4 | 58.4 | 80.5 | 23.4 | 6.079 | 0.207 |
| Packed-Ensembles α=2 | R50x4 | 81.3 | 0.103 | 34.6 | 88.1 | 50.3 | 9.6 | 69.9 | 79.2 | 26.6 | 4.848 | 0.075 |
| Deep Ensembles | R50x4 | 82.1 | 0.053 | 23.0 | 85.6 | 58.1 | 5.0 | 62.7 | 81.9 | 28.2 | 4.789 | 0.105 |

5 DISCUSSIONS

We have shown that Packed-Ensembles has attractive properties, mainly by providing a similar quality of uncertainty quantification as Deep Ensembles while using a reduced architecture and computing cost. Several questions can be raised, and we conducted some studies - detailed in the Appendix sections - to provide possible answers.

Discussion on the sparsity. As described in Section 3, one could interpret PE as leveraging grouped convolutions to approximate Deep Ensembles with a mask operation applied to some components. In Appendix C, using a simplified model, we propose a bound on the approximation error based on the Kullback-Leibler divergence between the DE and its pruned version. This bound depends on the density of ones in the mask, p, and, more specifically, on p(1−p) and (1−p)²/p. By manipulating these terms, which corresponds to modifying the number of subnetworks M, the number of groups γ, and the expansion factor α, we could theoretically control the approximation error.

On the sources of stochasticity. Diversity is essential in ensembles and is usually obtained by exploiting two primary sources of stochasticity: the random initialization of the model's parameters and the shuffling of the batches. A last source of stochasticity is introduced during training by the non-deterministic behavior of the backpropagation algorithms. In Appendix F, we study the function-space diversities that arise from every possible combination of these sources. It follows that a single one of these sources is often sufficient to generate diversity, and no particular pattern seems to emerge to predict the best combination. Specifically, we highlight that the use of non-deterministic algorithms alone introduces enough diversity between the subnetworks of the ensemble.
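For completeness, the quantities involved in these diversity and uncertainty discussions can be computed directly from the per-member softmax outputs; the sketch below (an illustration, not the paper's evaluation code) derives the averaged prediction of equation (3), its maximum softmax probability, the predictive entropy, and the mutual information used later as an ensemble criterion:

```python
# Sketch of ensemble criteria computed from per-member softmax outputs.
import torch

def ensemble_scores(probs):                    # probs: (M, B, num_classes)
    mean_p = probs.mean(dim=0)                 # ensemble prediction, Eq. (3)
    msp = mean_p.max(dim=-1).values            # maximum softmax probability
    h_mean = -(mean_p * mean_p.clamp_min(1e-12).log()).sum(-1)   # total entropy
    h_members = -(probs * probs.clamp_min(1e-12).log()).sum(-1)  # (M, B)
    mutual_info = h_mean - h_members.mean(dim=0)  # disagreement between members
    return msp, h_mean, mutual_info

probs = torch.softmax(torch.randn(4, 8, 10), dim=-1)  # M=4 members, batch of 8
msp, entropy, mi = ensemble_scores(probs)
```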
Ablation study. We perform ablation studies to assess the impact of the parameters M, α, and γ on the performance of Packed-Ensembles. Appendix D provides in-depth details of this study. No clear pattern emerges from the results we obtained. A trend suggests that a higher number of subnetworks helps OOD detection, but the improvement in AUPR is not significant.

Training speed. Depending on the chosen hyperparameters α, M, and γ, PE may have fewer parameters than the single model, as shown in Table 1. This translates into an expected lower number of operations. A study of the training and inference speeds, developed in Appendix H, shows that using PE-(2,4,1) does not significantly increase the training and testing times compared to the single model while improving accuracy and uncertainty quantification performance. However, this also hints that the group-convolution speedup is not optimal, despite the significant acceleration offered by 16-bit floating-point computation.

OOD criteria. The maximum softmax probability is often used as a criterion for discriminating OOD elements. However, this criterion is not unique, and others can be used, such as the Mutual Information, the maximum logit, or the Shannon entropy of the mean prediction. Although no relationship is expected between this criterion and PE, we obtained different OOD detection performance depending on the selected criterion. The results on CIFAR-100 are detailed in Appendix E and show that an approach based on the maximum logit seems to give the best results in detecting OOD samples. It should be noted that the notion of OOD depends on the training distribution, so this discussion does not necessarily generalize to all datasets. Indeed, preliminary results have shown that Mutual Information outperforms the other criteria for our method applied to the ImageNet dataset.

6 RELATED WORK

Ensembles and uncertainty quantification. Bayesian Neural Networks (BNNs) (MacKay, 1992; Neal, 1995) are the cornerstone and primary source of inspiration for uncertainty quantification in deep learning. Despite the progress enabled by variational inference (Jordan et al., 1999; Blundell et al., 2015), BNNs remain challenging to scale and train for large DNN architectures (Dusenberry et al., 2020). DE (Lakshminarayanan et al., 2017) arise as a practical and efficient instance of BNNs, coarsely but effectively approximating the posterior distribution of weights (Wilson & Izmailov, 2020). DE are currently the best-performing approach for both predictive performance and uncertainty estimation (Ovadia et al., 2019; Gustafsson et al., 2020).

Efficient ensembles. The appealing properties in performance and diversity of DE (Fort et al., 2019), but also their major downside related to computational cost, have inspired a large cohort of approaches aiming to mitigate it. BatchEnsemble (Wen et al., 2019) spawns an ensemble at each layer thanks to an efficient parameterization of subnetwork-specific parameters trained in parallel. MIMO (Havasi et al., 2020) shows that a large network can encapsulate multiple subnetworks using a multi-input multi-output configuration. A single network can be used in ensemble mode by disabling different subsets of weights at each forward pass (Gal & Ghahramani, 2016; Durasov et al., 2021). Liu et al. (2022) leverage the sparse-network training algorithm of Mocanu et al. (2018) to produce ensembles of sparse networks.
Ensembles can also be computed from a single training run by collecting intermediate model checkpoints (Huang et al., 2017; Garipov et al., 2018), by computing the posterior distribution of the weights by tracking their trajectory during training (Maddox et al., 2019; Franchi et al., 2020), or by ensembling predictions over multiple augmentations of the input sample (Ashukha et al., 2020). However, most of these approaches require multiple forward passes.

Neural network compression. The most intuitive approach for reducing the size of a model is to employ DNNs that are memory-efficient by design, relying on, e.g., channel shuffling (Zhang & Yang, 2021), point-wise convolutional filters (Liang et al., 2021), weight sharing (Bender et al., 2020), or a combination of them. Some of the most popular architectures that leverage such designs are SqueezeNet (Iandola et al., 2016), ShuffleNet (Zhang et al., 2018b), and MobileNet-V3 (Howard et al., 2019). Some approaches conduct automatic model-size reduction, e.g., network sparsification (Molchanov et al., 2017; Louizos et al., 2018; Frankle & Carbin, 2018; Tartaglione et al., 2022). These approaches aim at removing as many parameters as possible from the model to improve memory and computation efficiency, also at training time (Bragagnolo et al., 2022). Similarly, quantization approaches (Han et al., 2016; Lin et al., 2017) avoid or minimize the cost of floating-point operations and exploit the much more efficient integer computation.

Grouped convolutions. To the best of our knowledge, grouped convolutions (groups of convolutions) were introduced by Krizhevsky et al. (2012). By enabling the computation of several independent convolutions in parallel, they made it possible to run a single model on multiple GPU devices. Xie et al. (2017) demonstrate that using grouped convolutions leads to accuracy improvements and model complexity reduction. So far, grouped convolutions have been used primarily for computational efficiency, but also to compute multiple output branches in parallel (Chen & Shrivastava, 2020). PE re-purposes them to delineate multiple subnetworks within a network and efficiently train an ensemble of such subnetworks.

7 CONCLUSIONS

We propose a new ensemble framework, Packed-Ensembles, that can approximate Deep Ensembles in terms of uncertainty quantification and accuracy. Our research provides several new findings. First, we show that small independent neural networks can be as effective as large, deep neural networks when used in ensembles. Secondly, we demonstrate that not all sources of stochasticity are necessary to obtain ensemble diversity. Thirdly, we show that Packed-Ensembles are more stable than single DNNs. Fourthly, we highlight that there is a trade-off between accuracy and the number of parameters, and that Packed-Ensembles enables us to create flexible and efficient ensembles. In the future, we intend to explore Packed-Ensembles for more complex downstream tasks.

8 REPRODUCIBILITY

Alongside this paper, we provide the source code of the Packed-Ensembles layers. Additionally, we have created two notebooks demonstrating how to train ResNet-50-based Packed-Ensembles using public datasets such as CIFAR-10 and CIFAR-100. To ensure reproducibility, we report the performance for a specific random seed with a deterministic training process. Furthermore, it should be noted that the source code contains two PyTorch Module classes to produce Packed-Ensembles efficiently.
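As an example of the kind of determinism this relies on, the sketch below seeds the usual random number generators and enables PyTorch's deterministic algorithms; the exact settings used in the released code may differ:

```python
# Sketch of a seeding routine for a deterministic training process.
import random
import numpy as np
import torch

def seed_everything(seed: int = 0):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    # Raises an error if a non-deterministic op is used; some CUDA kernels
    # additionally require the CUBLAS_WORKSPACE_CONFIG environment variable.
    torch.use_deterministic_algorithms(True)
```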
A readme file at the root of the project details how to install and run experiments. In addition, we showcase how to obtain Packed-Ensembles from LeNet (LeCun et al., 1998). To further promote accessibility, we have created an open-source, pip-installable PyTorch package, torch-uncertainty, that includes Packed-Ensembles layers. With these resources, we hope to encourage the broader research community to engage with and build upon our work.

ETHICS STATEMENT

The purpose of this paper is to propose a new method for better estimation of uncertainty in deep-learning-based models. Nevertheless, we acknowledge the limitations of such models, which could become particularly concerning when applied to safety-critical systems. While this work aims to improve the reliability of Deep Neural Networks, this approach is not ready for deployment in safety-critical systems; we show the limitations of our approach in several experiments. Many more validation and verification steps would be crucial before considering its real-world implementation, to ensure robustness to various unknown situations, including corner cases, adversarial attacks, and potential biases.

ACKNOWLEDGMENTS

This work was supported by AID Project ACoCaTherm and Hi!Paris. This work was performed using HPC resources from GENCI-IDRIS (Grant 2021-AD011011970R1 and Grant 2022-AD011011970R2).

REFERENCES

Arsenii Ashukha, Alexander Lyzhov, Dmitry Molchanov, and Dmitry Vetrov. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. In ICLR, 2020.
William H Beluch, Tim Genewein, Andreas Nürnberger, and Jan M Köhler. The power of ensembles for active learning in image classification. In CVPR, 2018.
Gabriel Bender, Hanxiao Liu, Bo Chen, Grace Chu, Shuyang Cheng, Pieter-Jan Kindermans, and Quoc V. Le. Can weight sharing outperform random architecture search? An investigation with TuNAS. In CVPR, 2020.
Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In ICML, 2015.
Andrea Bragagnolo, Enzo Tartaglione, and Marco Grangetto. To update or not to update? Neurons at equilibrium in deep models. In NeurIPS, 2022. URL https://openreview.net/forum?id=LGDfv0U7MJR.
Hao Chen and Abhinav Shrivastava. Group ensemble: Learning an ensemble of convnets in a single convnet. arXiv preprint arXiv:2007.00649, 2020.
Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. RandAugment: Practical automated data augmentation with a reduced search space. In CVPR, 2020.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
Thomas G Dietterich. Ensemble methods in machine learning. In IWMCS, 2000.
Nikita Durasov, Timur Bagautdinov, Pierre Baque, and Pascal Fua. Masksembles for uncertainty estimation. In CVPR, 2021.
Michael Dusenberry, Ghassen Jerfel, Yeming Wen, Yian Ma, Jasper Snoek, Katherine Heller, Balaji Lakshminarayanan, and Dustin Tran. Efficient and scalable Bayesian neural nets with rank-1 factors. In ICML, 2020.
Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019.
Gianni Franchi, Andrei Bursuc, Emanuel Aldea, Séverine Dubuisson, and Isabelle Bloch. TRADI: Tracking deep neural network weight distributions. In ECCV, 2020.
2, 9 Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2018. 9, 22 Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016. 2, 9, 25 Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. In Neur IPS, 2018. 2, 9 Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In ICML, 2017. 2 Fredrik K Gustafsson, Martin Danelljan, and Thomas B Schon. Evaluating scalable bayesian deep learning methods for robust computer vision. In CVPR Workshops, 2020. 2, 9 Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In ICLR, 2016. 9 Marton Havasi, Rodolphe Jenatton, Stanislav Fort, Jeremiah Zhe Liu, Jasper Snoek, Balaji Lakshminarayanan, Andrew Mingbo Dai, and Dustin Tran. Training independent subnetworks for robust prediction. In ICLR, 2020. 2, 6, 9 Published as a conference paper at ICLR 2023 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 6, 17 Matthias Hein, Maksym Andriushchenko, and Julian Bitterwolf. Why relu networks yield highconfidence predictions far away from the training data and how to mitigate the problem. In CVPR, 2019. 2 Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, 2019. 23, 24 Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In ICLR, 2017. 6 Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In CVPR, 2021a. 6 Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, 2021b. 6 Jos e Miguel Hern andez-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of bayesian neural networks. In ICML, 2015. 25 Andrew G. Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for Mobile Net V3. In ICCV, 2019. 9 Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E Hopcroft, and Kilian Q Weinberger. Snapshot ensembles: Train 1, get M for free. In ICLR, 2017. 2, 9 Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeeze Net: Alex Net-level accuracy with 50x fewer parameters and < 0.5 mb model size. ar Xiv preprint ar Xiv:1602.07360, 2016. 9 Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015. 5 Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine learning, 1999. 9 Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Neur IPS, 2017. 25 Dan Kondratyuk, Mingxing Tan, Matthew Brown, and Boqing Gong. When ensembling smaller models is more efficient than single large models. ar Xiv preprint ar Xiv:2005.00570, 2020. 2 Alex Krizhevsky. 
Learning multiple layers of features from tiny images. Technical report, MIT, 2009. 6, 23 Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Neur IPS, 2012. 2, 9 Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Neur IPS, 2017. 2, 4, 6, 9, 20, 25 Yann Le Cun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1989. 2 Yann Le Cun, L eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998. 10 Stefan Lee, Senthil Purushwalkam, Michael Cogswell, David Crandall, and Dhruv Batra. Why M heads are better than one: Training a diverse ensemble of deep networks. ar Xiv preprint ar Xiv:1511.06314, 2015. 2 Published as a conference paper at ICLR 2023 Jesse Levinson, Jake Askeland, Jan Becker, Jennifer Dolson, David Held, Soeren Kammel, J. Zico Kolter, Dirk Langer, Oliver Pink, Vaughan Pratt, Michael Sokolsky, Ganymed Stanek, David Stavens, Alex Teichman, Moritz Werling, and Sebastian Thrun. Towards fully autonomous driving: Systems and algorithms. In IV, 2011. 1 Feng Liang, Zhichao Tian, M. Dong, Shuting Cheng, Li Sun, Hai Helen Li, Yiran Chen, and Guohe Zhang. Efficient neural network using pointwise convolution kernels with linear phase constraint. Neurocomputing, 2021. 9 Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. In Neur IPS, 2017. 9 Shiwei Liu, Tianlong Chen, Zahra Atashgahi, Xiaohan Chen, Ghada Sokar, Elena Mocanu, Mykola Pechenizkiy, Zhangyang Wang, and Decebal Constantin Mocanu. Deep ensembling with no overhead for either training or testing: The all-round blessings of dynamic sparsity. In ICLR, 2022. 9 Ekaterina Lobacheva, Nadezhda Chirkova, Maxim Kodryan, and Dmitry Vetrov. On power laws in deep ensembles. In Neur IPS, 2020. 2 Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through l 0 regularization. In ICLR, 2018. 9 David JC Mac Kay. A practical bayesian framework for backpropagation networks. Neural computation, 1992. 9 Wesley J Maddox, Pavel Izmailov, Timur Garipov, Dmitry P Vetrov, and Andrew Gordon Wilson. A simple baseline for bayesian uncertainty in deep learning. In Neur IPS, 2019. 2, 9 Rowan Mc Allister, Yarin Gal, Alex Kendall, Mark Van Der Wilk, Amar Shah, Roberto Cipolla, and Adrian Weller. Concrete problems for autonomous vehicle safety: Advantages of bayesian deep learning. In IJCAI, 2017. 1 Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature communications, 2018. 9 Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In ICML, 2017. 9 Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In AAAI, 2015. 6 Brady Neal, Sarthak Mittal, Aristide Baratin, Vinayak Tantia, Matthew Scicluna, Simon Lacoste Julien, and Ioannis Mitliagkas. A modern take on the bias-variance tradeoff in neural networks. In ICML Workshops, 2019. 24 Radford M Neal. Bayesian learning for neural networks. Ph D thesis, University of Toronto, 1995. 
9 Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In Neur IPS Workshops, 2011. 6 A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In CVPR, 2015. 2 Thao Nguyen, Maithra Raghu, and Simon Kornblith. Do wide and deep networks learn the same things? Uncovering how neural network representations vary with width and depth. In ICLR, 2020. 22 D.A. Nix and A.S. Weigend. Estimating the mean and variance of the target probability distribution. In ICNN, 1994. 25 Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua V. Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model s uncertainty? evaluating predictive uncertainty under dataset shift. In Neur IPS, 2019. 2, 9, 24 Published as a conference paper at ICLR 2023 Michael P Perrone and Leon N Cooper. When networks disagree: Ensemble methods for hybrid neural networks. Technical report, Brown University, 1992. 2 Alexandre Ram e, R emy Sun, and Matthieu Cord. Mixmo: Mixing multiple inputs for multiple outputs via deep subnetworks. In ICCV, 2021. 2 Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016. 17 Enzo Tartaglione, Andrea Bragagnolo, Attilio Fiandrotti, and Marco Grangetto. Loss-based sensitivity regularization: towards deep sparse neural networks. Neural Networks, 2022. 9 Torchinfo. Torchinfo. https://github.com/Tyler Yep/torchinfo, 2022. Version: 1.7.1. 7, 25 Haoqi Wang, Zhizhong Li, Litong Feng, and Wayne Zhang. Vi M: Out-of-distribution with virtuallogit matching. In CVPR, 2022. 6 Yeming Wen, Dustin Tran, and Jimmy Ba. Batch Ensemble: an alternative approach to efficient ensemble and lifelong learning. In ICLR, 2019. 2, 6, 9 Ross Wightman. Pytorch image models. https://github.com/rwightman/ pytorch-image-models, 2019. 17 Ross Wightman, Hugo Touvron, and Herve Jegou. Resnet strikes back: An improved training procedure in timm. In Neur IPS 2021 - Workshop Image Net PPF, 2021. 17 Andrew G Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. In Neur IPS, 2020. 4, 9 Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017. 3, 9, 24 Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In CVPR, 2019. 17 Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016. 6, 17 Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018a. 17 Qing-Long Zhang and Yubin Yang. SA-Net: Shuffle attention for deep convolutional neural networks. In ICASSP, 2021. 9 Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In CVPR, 2018b. 
TABLE OF CONTENTS - SUPPLEMENTARY MATERIAL

A Notations
B Implementation details
C Discussion on the sparsity
D Ablation study
E Discussion about OOD criteria
F Discussion about the sources of stochasticity
G Discussion about the subnetworks
H Discussion about the training velocity
I Distribution shift
J Stabilization of the performance
K On the equivalence between sequential training and Packed-Ensembles
L Using groups is not sufficient to equal Packed-Ensembles
M Efficiency of the networks trained on ImageNet
N Regression

A NOTATIONS

We summarize the main notations used in the paper in Table 3.

Table 3: Summary of the main notations of the paper.

| Notation | Meaning |
|---|---|
| $D = \{(x_i, y_i)\}_{i=1}^{\lvert D \rvert}$ | The set of $\lvert D \rvert$ data samples and the corresponding labels |
| $j, m, L$ | The index of the current layer, the current subnetwork, and the number of layers |
| $z^j$ | The pre-activation feature map, output of layer $j-1$ and input of layer $j$ |
| $\phi$ | The activation function (considered constant throughout the network) |
| $h^j$ | The feature map and output of layer $j$, $h^j = \phi(z^j)$ |
| $H_j, W_j$ | The height and width of the feature maps and output of layer $j-1$ |
| $C_j$ | The number of channels of the feature maps and output of layer $j-1$ |
| $n_j$ | The number of parameters of layer $j$ |
| $B$ | The batch size of the training procedure |
| $\text{mask}^j_m$ | The mask corresponding to layer $j$ of subnetwork $m$ |
| $\lfloor \cdot \rfloor$ | The floor function |
| $\star$, $*$, $\odot$ | The 2D cross-correlation, the convolution, and the Hadamard product |
| $s_j$ | The size of the kernel of layer $j$ |
| $M$ | The number of subnetworks in an ensemble |
| $\hat{y}^m_i$ | The prediction of subnetwork $m$ for the input $x_i$ |
| $\hat{y}_i$ | The prediction of the ensemble for the input $x_i$ |
| $\alpha$ | The width-augmentation factor of Packed-Ensembles |
| $\gamma$ | The number of subgroups of Packed-Ensembles |
| $\theta_{\alpha,m}$ | The set of weights of subnetwork $m$ with width factor $\alpha$ |
| $\omega^j_{\alpha,\gamma}$ | The weights of layer $j$ with $\gamma$ groups and width factor $\alpha$ |

B IMPLEMENTATION DETAILS

Table 4: Hyperparameters for image classification experiments. HFlip denotes the classical horizontal flip.

| Dataset | Network | Epochs | Batch size | Start lr | Momentum | Weight decay | γ-lr | Milestones | Data augmentations |
|---|---|---|---|---|---|---|---|---|---|
| C10 | R18 | 75 | 128 | 0.05 | 0.9 | 5e-4 | 0.1 | 25, 50 | HFlip |
| C10 | R50 | 200 | 128 | 0.1 | 0.9 | 5e-4 | 0.2 | 60, 120, 160 | HFlip |
| C10 | WR28-10 | 200 | 128 | 0.1 | 0.9 | 5e-4 | 0.2 | 60, 120, 160 | HFlip |
| C100 | R18 | 75 | 128 | 0.05 | 0.9 | 1e-4 | 0.2 | 25, 50 | HFlip |
| C100 | R50 | 200 | 128 | 0.1 | 0.9 | 5e-4 | 0.2 | 60, 120, 160 | HFlip |
| C100 | WR28-10 | 200 | 128 | 0.1 | 0.9 | 5e-4 | 0.2 | 60, 120, 160 | Medium |

General Considerations. Table 4 summarizes all the hyperparameters used in the paper for CIFAR-10 and CIFAR-100. In all cases, we use SGD combined with a multi-step learning-rate scheduler multiplying the rate by γ-lr at each milestone. Note that BatchEnsemble based on ResNet-50 uses a lower learning rate of 0.08 instead of 0.1 for stability. The Medium data augmentation corresponds to a combination of mixup (Zhang et al., 2018a) and CutMix (Yun et al., 2019) with a 0.5 switch probability, using timm's augmentation classes (Wightman, 2019), with coefficients 0.5 and 0.2, respectively. In this case, we also use RandAugment (Cubuk et al., 2020) with m = 9, n = 2, and mstd = 1, and label smoothing (Szegedy et al., 2016) of intensity 0.1. To ensure that the layers convey sufficient information and are not weakened by groups, we have set a constant minimum number of channels per group of 64 for all experiments presented in the paper.
If the number of channels per group is lower than this threshold, γ is reduced. Moreover, we do not apply subgroups (parameterized by γ) to the first layer of the network, nor to the first layer of ResNet blocks. Experiments in which this minimum number of channels could play a significant role and bring confusion are not presented (see, for instance, PE-(1, 4, 4) in Table 5). For ImageNet, we use the A3 procedure from Wightman et al. (2021) for all models. Training with the exact A3 procedure was not always possible; refer to the specific subsection for more details. Please note that the hyperparameters of the training procedures have not been optimized for our method and have been taken directly from the literature (He et al., 2016; Wightman et al., 2021). We strengthened the data augmentations for WideResNet on CIFAR-100 as we were not able to replicate the results of Zagoruyko & Komodakis (2016).

Masksembles. We use the code proposed by Durasov et al. (2021).³ We modified the mask-generation function to use binary search, as proposed by the authors, since the original was unable to build masks for ResNet-50x4. We note that the code implies performing batch repeats at the start of the forward pass; all the results regarding this technique are therefore computed with this specification. The ResNet implementations are built using Masksemble2D layers with M = 4 and a scale factor of 2 after each convolution.

BatchEnsemble. For BatchEnsemble, we use two different values for the weight decay: Table 4 provides the weight decay applied to the shared weights, but we do not apply weight decay to the vectors S and R (which generate the rank-1 matrices).

ImageNet. The batch size of Masksembles ResNet-50x4 is reduced to 1120 because of memory constraints. Concerning the BatchEnsembles based on ResNet-50 and ResNet-50x4, we clip the norm of the gradients to 0.0005 to avoid divergence.

³Available at github.com/nikitadurasov/masksembles

C DISCUSSION ON THE SPARSITY

In this section, we estimate the expected distance between a dense layer and a sparse one. For simplicity, we assume here that we operate on a fully-connected layer. First, let us state our proposition.

Proposition C.1. Given a fully connected layer $j+1$ defined by

$$z^{j+1}(c) = \sum_{k=0}^{C_j - 1} \omega^j(c,k)\, h^j(k) \tag{5}$$

and its approximation defined by

$$\tilde{z}^{j+1}(c) = \sum_{k=0}^{C_j - 1} \left(\omega^j(c,k)\, \text{mask}^j(k,c)\right) h^j(k), \tag{6}$$

under the assumption that $h^j$ follows a Gaussian distribution $h^j \sim \mathcal{N}(\mu^j, \Sigma^j)$, where $\Sigma^j$ is the covariance matrix and $\mu^j$ the mean vector, the Kullback-Leibler divergence between the layer and its approximation is bounded by

$$D_{KL}(z, \tilde{z})(c) \;\le\; \frac{1-p}{2} \;+\; p\,(1-p)\,\frac{\sum_{k=0}^{C_j - 1} \omega^j(c,k)^2\, \mu^j(k)^2}{(\sigma^{j+1}_z)^2(c)} \;+\; \frac{(1-p)^2}{p}\,\frac{\mu^{j+1}_z(c)^2}{(\sigma^{j+1}_z)^2(c)}, \tag{7}$$

where $p \in [0, 1]$ is the fraction of the parameters of $z^{j+1}(c)$ included in the approximation $\tilde{z}^{j+1}(c)$.

A plot of (7) is provided in Figure 5.

Figure 5: KL divergence for different values of $p$ and $\sigma^{j+1}_z$, with $\mu^j(k) = 0.1$ for all $j, k$ and $\omega^j(c,k) = 0.1$ for all $j, c, k$.

Proof. To prove Proposition C.1, we first state that, since $h^j(k)$ follows a Gaussian distribution, and considering that $\omega^j$ is constant at inference time and linearly combined with a Gaussian random variable, $z^{j+1}$ is also Gaussian-distributed. From the linearity of expectation, the mean of $z^{j+1}(c)$ is

$$\mu^{j+1}_z(c) = \sum_{k=0}^{C_j - 1} \omega^j(c,k)\, \mu^j(k) \tag{8}$$

and the variance is

$$(\sigma^{j+1}_z)^2(c) = \sum_{k=0}^{C_j - 1} \omega^j(c,k)^2\, \Sigma(k,k) \;+\; 2 \sum_{k < l} \omega^j(c,k)\, \omega^j(c,l)\, \Sigma(k,l).$$
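To make the quantities of Proposition C.1 concrete, the following numeric sketch (an illustration under the stated Gaussian assumption, not the paper's experiment) computes the exact KL divergence between the dense pre-activation and a randomly masked approximation for several densities p:

```python
# Numeric illustration of Proposition C.1's setting: both pre-activations are
# Gaussian, so their KL divergence is available in closed form.
import numpy as np

rng = np.random.default_rng(0)
C = 64
mu = np.full(C, 0.1)                 # mean of h_j (as in Figure 5)
Sigma = np.eye(C) * 0.05             # covariance of h_j
w = np.full(C, 0.1)                  # weights of one output channel

def gauss_params(weights):
    return weights @ mu, weights @ Sigma @ weights

def kl_univariate(m1, v1, m2, v2):
    # KL( N(m1, v1) || N(m2, v2) )
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

m_full, v_full = gauss_params(w)
for p in (0.25, 0.5, 0.75, 1.0):
    mask = rng.random(C) < p         # keep a fraction p of the parameters
    m_sparse, v_sparse = gauss_params(w * mask)
    print(f"p={p:.2f}  KL={kl_univariate(m_sparse, v_sparse, m_full, v_full):.4f}")
```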