# Neural Bootstrapper

Minsuk Shin¹, Hyungjoo Cho², Hyun-seok Min³, Sungbin Lim⁴
Department of Statistics, University of South Carolina¹; Department of Transdisciplinary Studies, Seoul National University²; Tomocube Inc.³; Artificial Intelligence Graduate School, UNIST⁴
sungbin@unist.ac.kr

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

**Abstract.** Bootstrapping has been a primary tool for ensemble and uncertainty quantification in machine learning and statistics. However, because of its reliance on repeated training and resampling, bootstrapping deep neural networks is computationally burdensome, which makes it difficult to apply in practice to uncertainty estimation and related tasks. To overcome this computational bottleneck, we propose a novel approach called Neural Bootstrapper (NeuBoots), which learns to generate bootstrapped neural networks through a single model training. NeuBoots injects the bootstrap weights into the high-level feature layers of the backbone network and outputs the bootstrapped predictions of the target, without additional parameters and without repetitive computations from scratch. We apply NeuBoots to various machine learning tasks related to uncertainty quantification, including prediction calibration in image classification and semantic segmentation, active learning, and detection of out-of-distribution samples. Our empirical results show that NeuBoots outperforms other bagging-based methods at a much lower computational cost without losing the validity of bootstrapping.

## 1 Introduction

Bootstrapping [7] and bagging [3] procedures have been commonly used as primary tools for quantifying the uncertainty of statistical inference, e.g., evaluations of standard errors, confidence intervals, and hypothetical null distributions. Despite their success in statistics and machine learning, the naive use of bootstrap procedures in deep neural network applications has been impractical due to their computational intensity: bootstrap procedures require evaluating a large number of models, and training multiple deep neural networks is infeasible in terms of computational cost.

To utilize the bootstrap for deep neural networks, we propose a novel bootstrapping procedure called Neural Bootstrapper (NeuBoots). The proposed method is mainly motivated by the Generative Bootstrap Sampler (GBS) [38], which trains a bootstrap generator by model parameterization based on the Random Weight Bootstrapping (RWB, [37]) framework. For many statistical models, the idea of GBS is more theoretically valid than the amortized bootstrap [31], which trains an implicit model to approximate the bootstrap distribution over model parameters. However, GBS is hardly scalable to modern deep neural networks containing millions of parameters. In contrast, the proposed method is effortlessly scalable and universally applicable to various architectures. The key idea of NeuBoots is simple: multiply bootstrap weights into the final layer of the backbone network instead of parameterizing the model. Hence it outputs the bootstrapped predictions of the target without additional parameters and without repetitive computations from scratch. NeuBoots outperforms the previous sampling-based methods [13, 24, 31] on various uncertainty quantification tasks with deep convolutional networks [17, 20, 22].

| Method | Memory Efficiency | Fast Training | Fast Prediction |
|---|---|---|---|
| Standard Bootstrap [7] | ✗ | ✗ | ✗ |
| MCDrop [13] | ✓ | ✓ | ✗ |
| Deep Ensemble [24] | ✗ | ✗ | ✗ |
| NeuBoots | ✓ | ✓ | ✓ |

Table 1.1. Computational comparison between bagging-based uncertainty estimation methods in terms of memory efficiency and computational speed during the training and prediction steps.
Throughout this paper, we show that NeuBoots has multiple advantages over existing uncertainty quantification procedures in terms of memory efficiency and computational speed (see Table 1.1). To verify the empirical power of the proposed method, we apply NeuBoots to a wide range of experiments related to uncertainty quantification and bagging: prediction calibration, active learning, out-of-distribution (OOD) detection, semantic segmentation, and learning on imbalanced datasets. Notably, we test the proposed method on high-resolution biomedical data, the NIH3T3 dataset [5]. In Section 4, our results show that NeuBoots achieves performance at least comparable to, and often better than, the state-of-the-art methods in the considered applications.

## 2 Preliminaries

As preliminaries, we briefly review standard bootstrapping [7] and introduce the idea of the Generative Bootstrap Sampler (GBS, [38]), which is the primary motivation for the proposed method. Let $[m] := \{1, \dots, m\}$ and denote the training data by $\mathcal{D} = \{(X_i, y_i) : i \in [n]\}$, where each feature $X_i \in \mathcal{X} \subseteq \mathbb{R}^p$ and each response $y_i \in \mathbb{R}^d$. We denote by $\mathcal{M}$ the class of models $f : \mathbb{R}^p \to \mathbb{R}^d$. For standard bootstrapping, we sample $B$ bootstrap datasets $\mathcal{D}^{(b)} = \{(X_i^{(b)}, y_i^{(b)}) : i \in [n]\}$ with replacement for $b \in [B]$. For each bootstrap dataset $\mathcal{D}^{(b)}$, we define a loss functional $\mathcal{L}$ on $f \in \mathcal{M}$:

$$\mathcal{L}(f, \mathcal{D}^{(b)}) := \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(X_i^{(b)}), y_i^{(b)}\big), \tag{2.1}$$

where $\ell : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is an arbitrary loss function. We then minimize (2.1) with respect to $f \in \mathcal{M}$ to obtain the bootstrapped models: for $b \in [B]$,

$$\hat{f}^{(b)} = \arg\min_{f \in \mathcal{M}} \mathcal{L}(f, \mathcal{D}^{(b)}). \tag{2.2}$$

**Random Weight Bootstrapping** It is well known that the standard bootstrap uses only (approximately) 63% of the observations in each bootstrap evaluation [24]. To resolve this problem, we use Random Weight Bootstrapping (RWB, [37]), which reformulates (2.2) as a sampling of bootstrap weights for a weighted loss functional. Let $\mathcal{W} = \{w \in \mathbb{R}^n_{\geq 0} : \sum_{i=1}^{n} w_i = n\}$ be a dilated standard $(n-1)$-simplex. For $w = (w_1, \dots, w_n) \in \mathcal{W}$ and the original training data $\mathcal{D}$, we define the Weighted Bootstrapping Loss (WBL) functional on $f \in \mathcal{M}$ as follows:

$$\mathcal{L}(f, w, \mathcal{D}) := \frac{1}{n} \sum_{i=1}^{n} w_i \, \ell\big(f(X_i), y_i\big). \tag{2.3}$$

Then for any resampled dataset $\mathcal{D}^{(b)}$, there exists a unique $w \in \mathcal{W}$ such that (2.1) matches (2.3). This reformulation provides a relaxation that uses the full dataset without any omission in bootstrapping. Precisely, as a continuous relaxation of the standard bootstrap, we use the Dirichlet distribution [32]: $\mathbb{P}_{\mathcal{W}} = n \cdot \mathrm{Dirichlet}(1, \dots, 1)$, where $\mathbb{P}_{\mathcal{W}}$ is a probability distribution on the simplex $\mathcal{W}$. Hence RWB fully utilizes the observed data points, since sampled bootstrap weights $w \sim \mathbb{P}_{\mathcal{W}}$ are strictly positive. Also, [34] showed that RWB achieves the same theoretical properties as the standard bootstrap, i.e., $\mathbb{P}_{\mathcal{W}} = \mathrm{Multinomial}(n; 1/n, \dots, 1/n)$ in (2.3).
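To make the RWB step concrete, the following is a minimal sketch of one weighted-loss evaluation of (2.3) in PyTorch; `model`, `x`, and `y` are illustrative placeholders rather than part of the paper's released code.

```python
# A minimal sketch of the WBL functional (2.3) under RWB, assuming a generic
# PyTorch classifier `model` and a labeled batch (x, y); names are illustrative.
import torch
import torch.nn.functional as F

def rwb_weighted_loss(model, x, y):
    n = x.shape[0]
    # w ~ n * Dirichlet(1, ..., 1): strictly positive weights summing to n,
    # the continuous relaxation of multinomial resampling counts.
    w = torch.distributions.Dirichlet(torch.ones(n)).sample().to(x.device) * n
    per_example = F.cross_entropy(model(x), y, reduction="none")
    return (w * per_example).mean()  # (1/n) * sum_i w_i * loss_i
```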
**Bootstrap Distribution Generator** Although RWB resolves the data-discard problem, training multiple networks $\hat{f}^{(1)}, \dots, \hat{f}^{(B)}$ remains computationally demanding, and one has to store the parameters of every network for prediction. To reduce these bottlenecks, GBS [38] proposes a procedure that trains a generator function of bootstrapped estimators for parametric statistical models. The main idea of GBS is to parameterize the model parameter with the bootstrap weight $w \in \mathcal{W}$. Applied to bootstrapping neural networks, GBS considers a bootstrap generator $g : \mathbb{R}^p \times \mathcal{W} \to \mathbb{R}^d$ with parameter $\theta(w) \in \mathbb{R}^{d_\theta}$, where $d_\theta$ is the total number of neural network parameters in $g$, so that $g(X, w) = g_{\theta(w)}(X)$. Based on (2.3), we define a new WBL functional:

$$\mathcal{L}(g, \mathcal{D}) = \mathbb{E}_{w \sim \mathbb{P}_{\mathcal{W}}}\big[\mathcal{L}(g, w, \mathcal{D})\big], \qquad \mathcal{L}(g, w, \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} w_i \, \ell\big(g(X_i, w), y_i\big). \tag{2.4}$$

Note that we use the Dirichlet distribution for $\mathbb{P}_{\mathcal{W}}$; hence the functional $\mathcal{L}(g, \mathcal{D})$ includes the RWB procedure itself. Analogous to (2.2), we obtain the bootstrap generator $\hat{g}$ by optimizing $\mathcal{L}(g, \mathcal{D})$:

$$\hat{g} = \arg\min_g \mathcal{L}(g, \mathcal{D}). \tag{2.5}$$

The learned $\hat{g}$ can then generate bootstrap samples for given target data $X^*$ by plugging an arbitrary $w \in \mathcal{W}$ into $\hat{g}(X^*, \cdot)$. We refer to [38, Section 2] for detailed theoretical results on GBS.

**Block Bootstrapping** The bootstrap generator $g$ above receives a bootstrap weight vector $w$ of dimension $n$; hence its optimization via (2.5) becomes difficult when the number of data points $n$ is large. We therefore utilize a block bootstrapping procedure to reduce the dimension of the bootstrap weight vector. We allocate the index set $[n]$ to $S$ blocks. Let $u : [n] \to [S]$ denote the assignment function. We then impose the same weight on all elements of a block: $w_i = \alpha_s$ for $u(i) = s \in [S]$, where $\alpha = (\alpha_1, \dots, \alpha_S) \sim S \cdot \mathrm{Dirichlet}(1, \dots, 1)$. Instead of $w$, we plug $g(X, \alpha) = g_{\theta(\alpha)}(X)$ into the generator and compute the weighted loss function in (2.4):

$$\mathcal{L}(g, \mathcal{D}) = \mathbb{E}_{\alpha \sim S \cdot \mathrm{Dirichlet}(1, \dots, 1)}\left[\frac{1}{n} \sum_{i=1}^{n} \alpha_{u(i)} \, \ell\big(g(X_i, \alpha), y_i\big)\right]. \tag{2.6}$$

This procedure asymptotically converges to the same target distribution as the conventional non-block bootstrap. See Appendix A for the detailed procedure and proofs.
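As a sketch under the notation above, the $n$-dimensional weight vector can be produced from only $S$ Dirichlet draws; the helper name `block_weights` and the random block assignment are our assumptions (any fixed assignment $u$ works).

```python
# Hedged sketch of block bootstrap weights: one Dirichlet coordinate per
# block, shared by all indices assigned to that block via u : [n] -> [S].
import torch

def block_weights(n, S):
    u = torch.randint(0, S, (n,))  # a random assignment function u
    alpha = torch.distributions.Dirichlet(torch.ones(S)).sample() * S
    return alpha[u], u             # w_i = alpha_{u(i)}: n weights from S draws
```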
## 3 Neural Bootstrapper

We now propose the Neural Bootstrapper (NeuBoots), which reduces the computational complexity and memory requirements of learning the bootstrap distribution so that it becomes suitable for deep neural networks. How should one implement the bootstrap generator $g$ for deep neural networks? One may consider directly applying GBS to existing deep neural networks by modeling a neural network $\theta(\cdot)$ that outputs the parameters of $g$. However, this approach is computationally challenging due to the high dimensionality of the output of $\theta(\cdot)$. Indeed, [38] proposes an architecture that concatenates the bootstrap weight vector to every layer of a given neural network (Figure 3.1(b)) and trains it with (2.6). However, the bagging performance of GBS gradually degrades as it is applied to deeper neural networks. This may be because the information carried by bootstrap weights injected into the earlier layers propagates less, as the model shrinks the parameters attached to these weights during training.

### 3.1 Adaptive Block Bootstrapping

We found that the bootstrap weights in the final layer mainly affect the bootstrap performance of GBS. This fact motivates the following adaptive block bootstrapping, which is the key idea of NeuBoots. Take a neural network $f \in \mathcal{M}$ with parameter $\phi$. Let $M_{\phi_1}$ and $F_{\phi_2}$ be the single-layer neural network in the final layer and the feature extractor of $f$, respectively, with parameter $\phi = (\phi_1, \phi_2)$, so that we can decompose $f$ into $M_{\phi_1} \circ F_{\phi_2}$. Set $S := \dim(F_{\phi_2}(X))$ as the number of blocks for block bootstrapping. Then we redefine the bootstrap generator as follows:

$$g_\phi(X, \alpha) := g_{(\phi, \alpha)}(X) = M_{\phi_1}\big(F_{\phi_2}(X) \odot \alpha\big), \tag{3.1}$$

where $\odot$ denotes elementwise multiplication. The bootstrap generator (3.1) can also be trained with (2.6); hence the optimized $\hat{g}_\phi(X, \alpha)$ generates a bootstrapped prediction for each $\alpha$ we plug in.

Figure 3.1. A comparison between the bootstrapping procedures of (a) standard bootstrapping [7] (sampling with replacement and repetitive computation), (b) GBS [38] (parameterization; the weight vector $\alpha^{(b)} \sim S \cdot \mathrm{Dirichlet}(1, \dots, 1)$ is concatenated to every layer), and (c) NeuBoots (elementwise multiplication, $g(\cdot, \alpha^{(b)}) = M_{\phi_1}(F_{\phi_2} \odot \alpha^{(b)})$). This figure is best viewed in color.

This modification brings a computational benefit, since we can generate bootstrap samples quickly and memory-efficiently by reusing the precomputed tensor $F_{\phi_2}(X)$ without repetitive computation from scratch. See Figure 3.1 for the comparison between the previous methods and NeuBoots. In our empirical experience, the bootstrap evaluations were consistent across different groupings for all examples examined in this article.

**Training and Prediction** At every epoch, we resample the weights $w = \{\alpha_{u(1)}, \dots, \alpha_{u(n)}\}$, and the expectation in (2.6) is approximated by the average over the sampled weights. When using stochastic gradient descent (SGD) to update the parameter $\phi$ via a mini-batch sequence $\{\mathcal{D}_k : \mathcal{D}_k \subseteq \mathcal{D}\}_{k=1}^{K}$, we plug the mini-batch-sized bootstrap weight vector $\{\alpha_{u(i)} : X_i \in \mathcal{D}_k\}$ into (2.6) without changing $\alpha$. Each element of $w$ is not reused within an epoch, so the sampling and replacement steps in Algorithm 1 are conducted once at the beginning of each epoch. After we obtain the optimized network $\hat{g}_\phi$, for prediction we use the generator $\hat{g}_{X^*}(\cdot) = \hat{g}_\phi(X^*, \cdot)$ for a given data point $X^*$. We can then generate bootstrapped predictions by plugging $\alpha^{(1)}, \dots, \alpha^{(B)}$ into the generator $\hat{g}_{X^*}(\cdot)$, as described in Algorithm 2.

Algorithm 1: Training step in NeuBoots.
Input: dataset $\mathcal{D}$; epochs $T$; feature dimension $S$; index function $u$; learning rate $\eta$.
1. Initialize the neural network parameter $\phi^{(0)}$ and set $n := |\mathcal{D}|$.
2. for $t \in \{0, \dots, T-1\}$ do
3. Sample $\alpha^{(t)} = (\alpha^{(t)}_1, \dots, \alpha^{(t)}_S) \sim S \cdot \mathrm{Dirichlet}(1, \dots, 1)$.
4. Replace $w^{(t)} \leftarrow (\alpha^{(t)}_{u(1)}, \dots, \alpha^{(t)}_{u(n)})$.
5. For each mini-batch $\mathcal{D}_k$, update $\phi^{(t+1)} \leftarrow \phi^{(t)} - \eta \, \nabla_\phi \mathcal{L}\big(g(\cdot, \alpha^{(t)}), w^{(t)}, \mathcal{D}_k\big)$.

Algorithm 2: Prediction step in NeuBoots.
Input: data point $X^* \in \mathbb{R}^p$; number of bootstrap samples $B$.
1. Compute the feed-forward network $\hat{g}_{X^*}(\cdot) = \hat{g}_\phi(X^*, \cdot)$ a priori.
2. for $b \in \{1, \dots, B\}$ do
3. Generate $\alpha^{(b)} \sim S \cdot \mathrm{Dirichlet}(1, \dots, 1)$ and evaluate $\hat{y}^{(b)} = \hat{g}_{X^*}(\alpha^{(b)})$.
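Below is a minimal PyTorch sketch of (3.1) and Algorithms 1 and 2, assuming the backbone's output is a flat $S$-dimensional feature vector; the wrapper class and the loader convention (a dataset that also yields example indices) are our assumptions, not the authors' released implementation.

```python
# Sketch of g_phi(X, alpha) = M_{phi_1}(F_{phi_2}(X) (*) alpha), with per-epoch
# weight sampling (Algorithm 1) and cached-feature prediction (Algorithm 2).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuBoots(nn.Module):
    def __init__(self, backbone, head):
        super().__init__()
        self.backbone, self.head = backbone, head  # F_{phi_2} and M_{phi_1}

    def forward(self, x, alpha):
        # alpha has shape (S,) and broadcasts across the batch dimension.
        return self.head(self.backbone(x) * alpha)

def train_epoch(model, loader, u, optimizer, S, device):
    # Algorithm 1: one alpha ~ S * Dirichlet(1, ..., 1) per epoch; example i
    # is weighted by alpha_{u(i)} while the same alpha gates the features.
    alpha = torch.distributions.Dirichlet(torch.ones(S, device=device)).sample() * S
    for idx, x, y in loader:  # loader is assumed to yield example indices
        x, y = x.to(device), y.to(device)
        losses = F.cross_entropy(model(x, alpha), y, reduction="none")
        loss = (alpha[u[idx].to(device)] * losses).mean()  # weighted loss (2.6)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

@torch.no_grad()
def predict(model, x, S, B, device):
    # Algorithm 2: compute F_{phi_2}(x) once, then reuse it for all B draws.
    feat = model.backbone(x.to(device))
    dist = torch.distributions.Dirichlet(torch.ones(S, device=device))
    return torch.stack([model.head(feat * (dist.sample() * S)) for _ in range(B)])
```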
### 3.2 Discussion

**NeuBoots vs. Standard Bootstrap** To examine the approximation power of NeuBoots, we measured the frequentist coverage rate of confidence bands (Figure 3.2(a)). We estimate the 95% confidence band of a nonparametric regression function using NeuBoots and compare it with credible bands (or confidence bands) evaluated by the standard bootstrap, Gaussian process (GP) regression, and MCDrop [13]. We adopt Algorithm 1 to train the NeuBoots generator with 3 hidden layers of 500 hidden nodes each. For the standard bootstrap, we train 1,000 neural networks. The results show that the confidence band from NeuBoots stably covers the true regression function at each predictor value with close to 95% frequency, compatible with standard bootstrapping. In contrast, the coverage of MCDrop is unstable and sometimes falls below 70%. This indicates that NeuBoots performs comparably with standard bootstrapping in uncertainty quantification tasks.

Figure 3.2. (a) Frequentist coverage rate of the 95% confidence bands; (b) curve fitting with different nonlinear functions, showing the 95% confidence band of the regression mean from NeuBoots; (c) the standard bootstrap. Each red dashed line indicates the mean, and the blue dotted lines show the true regression function.

**NeuBoots vs. Amortized Bootstrap** We applied NeuBoots to the classification and regression experiments presented for the amortized bootstrap [31]. Every experiment demonstrates that NeuBoots outperforms the amortized bootstrap in bagging performance across tasks: rotated-MNIST classification (Table 3.1), classification with different numbers of data points $N$ (Figure B.1), and regression on two datasets (Figure B.2). We remark that the expected calibration error (ECE, [30]) on rotated MNIST improves from 15.00 to 2.98 under NeuBoots as the number of bootstrap samples $B$ increases.

| Method | $B = 1$ | $B = 5$ | $B = 25$ |
|---|---|---|---|
| Traditional Bootstrap | 22.57 | 19.68 | 18.57 |
| Amortized Bootstrap | 17.03 | 16.82 | 16.18 |
| NeuBoots | 17.94 ± 0.74 | 14.98 ± 0.31 | 14.45 ± 0.31 |

Table 3.1. Test error on rotated-MNIST classification with different numbers of bootstrap samples $B$.

**NeuBoots vs. Dropout** At first glance, NeuBoots resembles Dropout in that the final neurons are multiplied by random variables. However, the random weights imposed by Dropout have no connection to the loss function or the working model, whereas the bootstrap weights of NeuBoots appear in the loss function (2.6) and have an explicit connection to bootstrapping. We briefly verify the effect of the loss function using 3-layer MLPs with 50, 100, and 200 hidden variables on the MNIST image classification task. With batch normalization [21], we apply Dropout with probability $p = 0.1$ only to the final layer of the MLP. We measure ECE, the negative log-likelihood (NLL), and the Brier score for comparison. NeuBoots and Dropout record the same accuracy; however, Figure B.3 shows that NeuBoots is more suitable for confidence-aware learning and clearly outperforms Dropout in terms of ECE, NLL, and the Brier score.
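For reference, here is a hedged NumPy sketch of the three metrics used in these comparisons; the 15-bin equal-width binning for ECE is a common convention and an assumption on our part.

```python
# Sketch of ECE, NLL, and Brier score for softmax probabilities `probs`
# of shape (N, C) and integer labels `labels` of shape (N,).
import numpy as np

def calibration_metrics(probs, labels, n_bins=15):
    conf, pred = probs.max(axis=1), probs.argmax(axis=1)
    acc = (pred == labels).astype(float)
    ece = 0.0
    for lo in np.linspace(0.0, 1.0, n_bins, endpoint=False):
        in_bin = (conf > lo) & (conf <= lo + 1.0 / n_bins)
        if in_bin.any():  # |B_m|/N * |acc(B_m) - conf(B_m)|
            ece += in_bin.mean() * abs(acc[in_bin].mean() - conf[in_bin].mean())
    nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
    brier = ((probs - np.eye(probs.shape[1])[labels]) ** 2).sum(axis=1).mean()
    return ece, nll, brier
```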
**Computation time and cost** As mentioned earlier, the algorithm evaluates the network from scratch only once to store the tensor $F_{\phi_2}(X^*)$, whereas standard bootstrapping and MCDrop [13] need repeated feed-forward propagations. To check this empirically, we measure the prediction time of ResNet-34 under NeuBoots and MCDrop on the CIFAR-10 test set with Nvidia V100 GPUs. NeuBoots produces $B = 100$ bootstrapped predictions in 1.9 s, whereas MCDrop takes 112 s to generate 100 outputs. NeuBoots is also computationally more efficient than the standard bootstrap and the sparse GP [40] (Figure 3.3).

| Method | Training Time | Test Time | Memory Usage |
|---|---|---|---|
| Deep Ensemble | O(LK) | O(LK) | O(MK + I) |
| BatchEnsemble | O(LK) | O(LK) | O(M + IK) |
| MIMO | O(L + 2K) | O(L + 2K) | O(M + IK) |
| NeuBoots | O(L) | O(L + K) | O(M + I) |

Table 3.2. A comparison of computational costs. Notation: $L$ is the number of layers, $K$ the number of bootstrap samples (or ensemble members), $M$ the parameter size of a single model, and $I$ the memory size of the input data.

Figure 3.3. Comparison of computational time (standard bootstrap, sparse GP, NeuBoots) with different numbers of training data $n$ for the example in Figure 3.2.

We also compare NeuBoots to MIMO [16] and BatchEnsemble [42] in terms of training, test, and memory complexity (see Table 3.2). Since NeuBoots does not require repeated forward computations, its training and test costs are O(L) and O(L + K), respectively, lower than the O(L + 2K) of MIMO and the O(LK) of BatchEnsemble. Note that MIMO needs to copy each input image $K$ times to supply its input layers. Even though it computes in a single forward pass, it requires more memory to load multiple inputs when the input data are high-dimensional (e.g., MRI/CT). The memory complexity of BatchEnsemble is similar to that of MIMO, since the memory usage of the fast weights in BatchEnsemble is proportional to the input and output dimensions. This computational bottleneck is significant in application fields requiring on-device training or inference; the proposed method is free from this problem, since multiple computations occur only at the final layer. For quantitative comparisons, we refer to Appendix B.2.

**Diversity of predictions** Diversity of predictions is a reliable measure for examining overfitting and the performance of uncertainty quantification in ensemble procedures [10, 35, 42]. In the presence of overfitting, the diversity across ensemble predictions is expected to be minimal, because the resulting ensemble members produce similar predictions overfitted to the training data. To examine the diversity of NeuBoots, we consider various diversity measures, including ratio-error, Q-statistics, correlation coefficient, and prediction disagreement (see [1, 10, 35]); a sketch of the disagreement measure follows the table below. Table 3.3 summarizes the results for CIFAR-100 with DenseNet-100. NeuBoots outperforms MCDrop on every diversity metric and shows results comparable to Deep Ensemble.

| Method | Ratio-error (↑) | Q-stat (↓) | Correlation (↓) | Disagreement (↑) |
|---|---|---|---|---|
| Deep Ensemble | 98.00 | 61.31 | 78.56 | 23.41 |
| MCDrop | 27.38 | 96.33 | 92.00 | 10.40 |
| NeuBoots | 93.79 | 63.95 | 76.11 | 32.20 |

Table 3.3. A comparison of diversity performances.
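As one concrete instance of these measures, prediction disagreement can be computed as below; the helper and its input convention (`preds` as a (B, N) array of class predictions) are our own.

```python
# Sketch of the pairwise prediction-disagreement diversity measure: the mean
# fraction of test points on which two bootstrap members predict differently.
import itertools
import numpy as np

def disagreement(preds):
    pairs = itertools.combinations(range(len(preds)), 2)
    return np.mean([(preds[i] != preds[j]).mean() for i, j in pairs])
```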
## 4 Empirical Studies

In this section, we conduct a wide range of empirical studies of NeuBoots for uncertainty quantification and bagging performance. We apply NeuBoots to prediction calibration, active learning, out-of-distribution detection, bagging for semantic segmentation, and learning on imbalanced datasets with various deep convolutional neural networks. Our code is publicly available.³

³https://github.com/sungbinlim/NeuBoots

### 4.1 Prediction Calibration

**Setting** We evaluate the proposed method on prediction calibration for image classification. We apply NeuBoots to image classification on CIFAR and SVHN with ResNet-110 and DenseNet-100. We take $k = 5$ predictions for MCDrop and Deep Ensemble for calibration. For fair comparison, we also set the number of bootstrap samples $B = 5$ and fix the other hyperparameters to match the baseline models. All models are trained using SGD with momentum 0.9, an initial learning rate of 0.1, and a weight decay of 0.0005, with a mini-batch size of 128. We use cosine annealing for the learning rate schedule. We implement MCDrop and evaluate its performance with dropout rate $p = 0.2$, a setting close to the original paper. For Deep Ensemble, we utilize adversarial training with the Brier loss [24] and the cross-entropy loss [2]. For metrics, we evaluate the error rate, ECE, NLL, and Brier score. We also record each method's training and prediction times to compare relative speed against the baseline.

**Results** See Tables B.3 and B.4 for full results. NeuBoots generally shows calibration ability comparable to MCDrop and Deep Ensemble. Figure 4.1(a) shows the reliability diagrams of ResNet-110 and DenseNet-100 on CIFAR-100. We observe that NeuBoots secures both accuracy and prediction calibration in these image classification tasks. NeuBoots is at least three times faster in prediction than MCDrop and Deep Ensemble, and at least nine times faster in training than Deep Ensemble; this gap widens as the number of predictions $k$ increases. We conclude that NeuBoots outperforms MCDrop and is comparable with Deep Ensemble in prediction calibration, with relatively faster prediction.

Figure 4.1. (a) Comparison of reliability diagrams for ResNet-110 and DenseNet-100 on CIFAR-100. Confidence is the value of the maximal softmax output. The dashed black line represents a perfectly calibrated prediction; points below this line indicate under-confident predictions, whereas points above it indicate over-confident predictions. (b) Active learning performance on CIFAR-10 (left) and CIFAR-100 (right) with Random, MCDrop, and NeuBoots. Curves are averaged over five runs, and shaded regions denote confidence intervals.

### 4.2 Active Learning

**Setting** We evaluate NeuBoots on active learning with the ResNet-18 architecture on CIFAR. For comparison, we consider MCDrop and Deep Ensemble with entropy-based sampling, as well as random sampling. We follow a standard protocol for evaluating active learning (see [29]). Initially, 2,000 randomly sampled labeled images are given and a model is trained. Based on each model's uncertainty estimates, we sample 2,000 additional images from the unlabeled dataset and add them to the labeled dataset for the next stage. We continue this process ten times per trial and repeat five trials per model.

**Results** Figure 4.1(b) shows the sequential performance improvement on CIFAR-10 and CIFAR-100; note that CIFAR-100 is the more challenging of the two. Both plots demonstrate that NeuBoots is superior to the other sampling methods in active learning. NeuBoots reaches 71.6% accuracy on CIFAR-100, a 2.5% gap over MCDrop and Deep Ensemble. This experiment verifies that NeuBoots has a significant advantage in active learning.

### 4.3 Out-of-Distribution Detection

**Setting** As an important application of uncertainty quantification, we apply NeuBoots to the detection of out-of-distribution (OOD) samples. The setting follows the Mahalanobis method [26]. We first train ResNet-34 on the classification task using only the CIFAR-10 training set (in-distribution). We then evaluate the OOD-detection performance of NeuBoots on the test sets of the in-distribution dataset and of SVHN (out-of-distribution). Using a validation set held out from the test sets, we train a logistic-regression detector to discriminate OOD samples from in-distribution samples. For the input vectors of the OOD detector, we extract the following four statistics, which can be computed from the sampled output (logit) vectors of NeuBoots: the maximum of the predictive mean vector, the standard deviation of the logit vectors, the expected entropy, and the predictive entropy.
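A hedged sketch of these four detector features is given below, computed from $B$ sampled logit tensors; the exact definitions may differ in detail from the paper, so treat the formulas as our reading of the text.

```python
# Sketch of the four OOD-detector features from B sampled logit vectors
# `logits` of shape (B, N, C); names and epsilon smoothing are our choices.
import torch
import torch.nn.functional as F

def ood_features(logits, eps=1e-12):
    probs = F.softmax(logits, dim=-1)                          # (B, N, C)
    mean_p = probs.mean(0)                                     # predictive mean
    max_mean = mean_p.max(-1).values                           # max of mean vector
    std_logit = logits.std(0).mean(-1)                         # std of logit vectors
    exp_ent = -(probs * (probs + eps).log()).sum(-1).mean(0)   # expected entropy
    pred_ent = -(mean_p * (mean_p + eps).log()).sum(-1)        # predictive entropy
    return torch.stack([max_mean, std_logit, exp_ent, pred_ent], dim=1)  # (N, 4)
```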
To evaluate the detector, we measure the true negative rate (TNR) at 95% true positive rate (TPR), the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPR), and the detection accuracy. For comparison, we examine the baseline method [19], MCDrop, Deep Ensemble [24], Deep Ensemble_CE (trained with cross-entropy loss) [2], ODIN [27], and Mahalanobis [26].

| Method | TNR at TPR 95% | AUROC | Detection Accuracy | AUPR In | AUPR Out |
|---|---|---|---|---|---|
| Baseline | 32.47 | 89.88 | 85.06 | 85.40 | 93.96 |
| MCDrop | 51.40 | 92.01 | 89.46 | 86.82 | 95.41 |
| Deep Ensemble [24] | 56.70 | 91.85 | 88.91 | 81.66 | 95.46 |
| Deep Ensemble_CE [2] | 48.50 | 92.29 | 90.48 | 86.33 | 95.49 |
| ODIN | 86.55 | 96.65 | 91.08 | 92.54 | 98.52 |
| Mahalanobis | 54.51 | 93.92 | 89.13 | 91.54 | 98.52 |
| Mahalanobis + Calibration | 96.42 | 99.14 | 95.75 | 98.26 | 99.60 |
| NeuBoots | 89.40 | 97.26 | 93.80 | 93.97 | 98.86 |
| NeuBoots + Calibration | **99.00** | **99.14** | **96.52** | 97.78 | **99.68** |

Table 4.1. OOD detection. All values are percentages, and the best results are indicated in bold.

**Results** Table 4.1 shows that NeuBoots significantly outperforms the baseline method [19], Deep Ensemble [2, 24], and ODIN [27] in OOD detection without any calibration technique. Furthermore, with the input pre-processing technique studied in [27], NeuBoots is superior on most metrics to Mahalanobis [26], which employs both feature ensembling and input pre-processing as calibration techniques. This validates that NeuBoots can discriminate OOD samples effectively. To see how detector performance changes with the bootstrap sample size, we evaluate the predictive standard deviation estimated by the proposed method for $B \in \{2, 5, 10, 20, 30\}$. Figure B.5 illustrates that NeuBoots successfully detects the in-distribution samples (top row) and the out-of-distribution samples (bottom row).

### 4.4 Bagging Performance for Semantic Segmentation

**Setting** To demonstrate the applicability of NeuBoots to other computer vision tasks, we validate it on the PASCAL VOC 2012 semantic segmentation benchmark [9] with DeepLab-v3 [4] on ResNet-50 and ResNet-101 backbones. We modify the final 1×1 convolution layer after the Atrous Spatial Pyramid Pooling (ASPP) module by multiplying in channel-wise bootstrap weights (see the sketch below); this is a natural modification of the segmentation architecture, analogous to the fully connected layer of classification networks. Additionally, we apply NeuBoots to a real 3D image segmentation task on the NIH3T3 dataset [5], acquired with a commercial ODT microscope, which is challenging not only for models but also for humans due to its large 512×512×64 resolution and endogenous cellular variability. We use two different U-Net-like models for this 3D segmentation task, U-ResNet and SCNAS, and amend the bottleneck layer in the same way as in the 2D version. As in image classification, we set $B = 5$ and $k = 5$; otherwise we follow the usual settings.
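The following is an illustrative sketch of that modification, assuming `in_channels` equals the number of ASPP output channels; the module name and wiring are our own, not the DeepLab-v3 source.

```python
# Sketch of the segmentation head: channel-wise bootstrap weights multiply
# the ASPP output before the final 1x1 convolution.
import torch
import torch.nn as nn

class NeuBootsSegHead(nn.Module):
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.classifier = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, aspp_out, alpha):
        # alpha: (in_channels,) ~ S * Dirichlet(1, ..., 1), broadcast over H, W
        return self.classifier(aspp_out * alpha.view(1, -1, 1, 1))
```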
**Results** Table 4.2 shows that NeuBoots significantly improves mean IoU and ECE compared to the baseline. Furthermore, as in image classification, NeuBoots records faster prediction times than MCDrop and Deep Ensemble. This experiment verifies that NeuBoots applies to a wider scope of computer vision tasks beyond image classification.

| Dataset | Architecture | Method | mIoU (%) | ECE (%) | Relative Prediction Time |
|---|---|---|---|---|---|
| 2D (PASCAL VOC [9]) | ResNet-50 | Baseline | 84.57 ± 0.72 | 15.35 ± 0.21 | 1.0 |
| | | MCDrop | 87.81 ± 1.83 | 6.60 ± 0.10 | 5.4 |
| | | Deep Ensemble [24] | 90.09 ± 0.61 | 17.31 ± 0.74 | 5.5 |
| | | Deep Ensemble_CE [2] | 86.95 ± 0.57 | 12.36 ± 0.53 | 5.5 |
| | | NeuBoots | **90.14 ± 2.17** | **6.00 ± 0.10** | 2.7 |
| | ResNet-101 | Baseline | 85.35 ± 0.23 | 15.49 ± 0.44 | 1.0 |
| | | MCDrop | 88.08 ± 1.80 | 6.48 ± 0.08 | 5.3 |
| | | Deep Ensemble [24] | 90.40 ± 0.11 | 17.94 ± 0.03 | 5.3 |
| | | Deep Ensemble_CE [2] | 87.48 ± 0.09 | 11.52 ± 0.02 | 5.3 |
| | | NeuBoots | **90.56 ± 1.71** | **6.14 ± 0.11** | 2.5 |
| 3D (NIH3T3 [5]) | U-ResNet | Baseline | 61.54 ± 1.14 | 1.85 ± 0.19 | 1.0 |
| | | MCDrop | 64.15 ± 0.48 | 1.53 ± 0.09 | 5.5 |
| | | Deep Ensemble [24] | 59.71 ± 1.82 | 1.78 ± 0.29 | 5.5 |
| | | Deep Ensemble_CE [2] | 65.71 ± 1.69 | **0.94 ± 0.24** | 5.5 |
| | | NeuBoots | **67.78 ± 1.01** | 1.67 ± 0.19 | 3.5 |
| | SCNAS | Baseline | 67.52 ± 1.95 | 1.45 ± 0.19 | 1.0 |
| | | MCDrop | 65.37 ± 1.13 | 0.64 ± 0.17 | 5.2 |
| | | Deep Ensemble [24] | 60.04 ± 2.11 | 1.39 ± 0.05 | 5.3 |
| | | Deep Ensemble_CE [2] | 68.66 ± 2.58 | 0.83 ± 0.09 | 5.3 |
| | | NeuBoots | **70.80 ± 1.58** | **0.63 ± 0.16** | 2.1 |

Table 4.2. Semantic segmentation. The best results are indicated in bold.

### 4.5 Imbalanced Datasets

**Setting** To validate efficacy on imbalanced datasets, we apply NeuBoots to two imbalanced sets, an imbalanced CIFAR-10 and a white-blood-cell dataset, with ResNet-18. To construct the imbalanced CIFAR-10, we randomly subsample the CIFAR-10 training set so that the classes follow different counts: [50, 100, 150, 200, 250, 300, 350, 400, 450, 500] for [airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck] (a construction sketch follows at the end of this subsection). The white-blood-cell dataset was acquired using a commercial ODT microscope, and each image is an 80×80 grayscale. The dataset comprises four types of white blood cells, with training distribution [144, 281, 2195, 3177] for [eosinophil, monocyte, lymphocyte, neutrophil]. A ResNet-18 model and its MCDrop extension serve as the baseline and the comparative model, respectively, with the same settings as in Section 4.2. We measure the per-class F1 score for evaluation.

**Results** Comparing Baseline, MCDrop, and Deep Ensemble, NeuBoots performs best on both the imbalanced CIFAR-10 and the white-blood-cell dataset, as shown in Figure 4.2. In particular, NeuBoots excels at identifying eosinophils, the class with the fewest samples in the white-blood-cell dataset, and does so with low variance. This result shows that NeuBoots boosts prediction power for under-sampled classes with high stability via a simple implementation.

Figure 4.2. Comparisons of classification performance on the imbalanced datasets. The minor class is the class with the fewest samples; the major class is the class with the most samples.
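A small sketch of how the imbalanced CIFAR-10 split described above can be constructed from torchvision's CIFAR-10 `targets` attribute; the helper itself is our assumption.

```python
# Sketch: subsample each CIFAR-10 class to the stated per-class counts.
import numpy as np

def imbalanced_indices(targets, counts, seed=0):
    rng = np.random.default_rng(seed)
    targets = np.asarray(targets)
    idx = [rng.choice(np.where(targets == c)[0], n, replace=False)
           for c, n in enumerate(counts)]
    return np.concatenate(idx)

# counts per class: airplane, automobile, ..., truck
counts = [50, 100, 150, 200, 250, 300, 350, 400, 450, 500]
```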
## 5 Related Work

**Bootstrapping Neural Networks** Since [7] first proposed nonparametric bootstrapping for quantifying uncertainty in general settings, a rich literature has investigated the theoretical advantages of bootstrap procedures for parametric models [8, 14, 15]. For neural networks, [12] investigated the bootstrap consistency of one-layer MLPs under strong regularity conditions, and [36] considered conventional nonparametric bootstrapping to robustify classifiers under noisy labels. However, due to the repetitive computations involved, the practical application of such procedures to large datasets is nontrivial. [31] proposed an approximation of bootstrapping for neural networks via amortized variational Bayes. Despite its computational efficiency, the amortized bootstrap does not induce the exact target bootstrap distribution, and its theoretical justification is lacking. Recently, [25] proposed a bootstrapping method for neural processes; they utilize a residual bootstrap to resolve the data-discard problem, but their approach is not scalable since it requires multiple encoder computations.

**Ensemble Methods** Various advances in neural network ensembles have improved computational efficiency and uncertainty quantification. Havasi et al. [16] introduce Multiple Input Multiple Output (MIMO), which approximates independent neural networks by imposing multiple inputs and outputs, and Wen et al. [42] propose a low-rank approximation of ensemble networks called BatchEnsemble. Latent Posterior Bayesian NN (LP-BNN, [11]) extends BatchEnsemble to a Bayesian paradigm by imposing a VAE structure on the individual low-rank factors; LP-BNN outperforms MIMO and BatchEnsemble in prediction calibration and OOD detection, but its computational burden is heavier than BatchEnsemble's. Stochastic Weight Averaging Gaussian (SWAG, [28]) computes the posterior of the base neural network via a low-rank approximation with batch sampling. Although these strategies reduce the cost of training each ensemble network, unlike NeuBoots they still demand multiple optimizations, and their computational cost increases linearly with the ensemble size.

**Uncertainty Estimation** Numerous approaches quantify the uncertainty of neural network predictions. Deep Confidence [6] proposes a framework for computing confidence intervals for individual predictions using snapshot ensembling and conformal prediction. A calibration procedure for approximating confidence intervals has also been proposed based on Bayesian neural networks [23]. Gal and Ghahramani [13] propose MCDrop, which captures model uncertainty by casting dropout training in neural networks as approximate variational Bayes. Smith and Gal [39] examine various measures of uncertainty for adversarial example detection. Lakshminarayanan et al. [24] propose a non-Bayesian approach, Deep Ensemble, which estimates predictive uncertainty via ensembles and adversarial training. Compared to Deep Ensemble, NeuBoots requires neither adversarial training nor learning multiple models.

## 6 Conclusion

We introduced NeuBoots, a novel and scalable bootstrapping method for neural networks, and applied it to a wide range of machine learning tasks related to uncertainty quantification: prediction calibration, active learning, out-of-distribution detection, and imbalanced datasets. NeuBoots also demonstrates superior bagging performance in semantic segmentation. Our empirical studies show that NeuBoots has significant potential for quantifying uncertainty in large-scale applications, such as high-resolution biomedical data analysis. As future research, one could apply NeuBoots to natural language processing tasks using Transformers [41].

## 7 Acknowledgements

The authors especially thank Dr. Hyokun Yun for his fruitful comments. Minsuk Shin acknowledges support from the National Science Foundation (NSF-DMS award #2015528).
This work was also supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-01336, Artificial Intelligence Graduate School Program (UNIST)) and by the National Research Foundation of Korea (NRF) funded by the Korea government (MSIT) (2021R1C1C1009256).

## References

[1] Aksela, M. (2003). Comparison of classifier selection methods for improving committee performance. In International Workshop on Multiple Classifier Systems, pages 84–93. Springer.
[2] Ashukha, A., Lyzhov, A., Molchanov, D., and Vetrov, D. (2020). Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. In International Conference on Learning Representations.
[3] Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123–140.
[4] Chen, L.-C., Papandreou, G., Schroff, F., and Adam, H. (2017). Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
[5] Choi, J., Kim, H.-J., Sim, G., Lee, S., Park, W. S., Park, J. H., Kang, H.-Y., Lee, M., Heo, W. D., Choo, J., Min, H., and Park, Y. (2021). Label-free three-dimensional analyses of live cells with deep-learning-based segmentation exploiting refractive index distributions. bioRxiv.
[6] Cortés-Ciriano, I. and Bender, A. (2018). Deep confidence: A computationally efficient framework for calculating reliable prediction errors for deep neural networks. Journal of Chemical Information and Modeling, 59(3):1269–1281.
[7] Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1):1–26.
[8] Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82(397):171–185.
[9] Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338.
[10] Fort, S., Hu, H., and Lakshminarayanan, B. (2020). Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757.
[11] Franchi, G., Bursuc, A., Aldea, E., Dubuisson, S., and Bloch, I. (2020). Encoding the latent posterior of Bayesian neural networks for uncertainty quantification. In NeurIPS Workshop on Bayesian Deep Learning.
[12] Franke, J. and Neumann, M. H. (2000). Bootstrapping neural networks. Neural Computation, 12(8):1929–1949.
[13] Gal, Y. and Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059.
[14] Hall, P. (1986). On the bootstrap and confidence intervals. The Annals of Statistics.
[15] Hall, P. (1992). On bootstrap confidence intervals in nonparametric regression. The Annals of Statistics, pages 695–711.
[16] Havasi, M., Jenatton, R., Fort, S., Liu, J. Z., Snoek, J., Lakshminarayanan, B., Dai, A. M., and Tran, D. (2021). Training independent subnetworks for robust prediction. In International Conference on Learning Representations.
[17] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
[18] Hendrycks, D. and Dietterich, T. (2019). Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations.
[19] Hendrycks, D. and Gimpel, K. (2017). A baseline for detecting misclassified and out-of-distribution examples in neural networks. In Proceedings of the International Conference on Learning Representations.
[20] Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708.
[21] Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR.
[22] Kim, S., Kim, I., Lim, S., Baek, W., Kim, C., Cho, H., Yoon, B., and Kim, T. (2019). Scalable neural architecture search for 3D medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 220–228. Springer.
[23] Kuleshov, V., Fenner, N., and Ermon, S. (2018). Accurate uncertainties for deep learning using calibrated regression. In International Conference on Machine Learning, pages 2796–2804.
[24] Lakshminarayanan, B., Pritzel, A., and Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413.
[25] Lee, J., Lee, Y., Kim, J., Yang, E., Hwang, S. J., and Teh, Y. W. (2020). Bootstrapping neural processes. arXiv preprint arXiv:2008.02956.
[26] Lee, K., Lee, K., Lee, H., and Shin, J. (2018). A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pages 7167–7177.
[27] Liang, S., Li, Y., and Srikant, R. (2018). Enhancing the reliability of out-of-distribution image detection in neural networks. In 6th International Conference on Learning Representations, ICLR 2018.
[28] Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. (2019). A simple baseline for Bayesian uncertainty in deep learning. Advances in Neural Information Processing Systems, 32:13153–13164.
[29] Moon, J., Kim, J., Shin, Y., and Hwang, S. (2020). Confidence-aware learning for deep neural networks. In International Conference on Machine Learning.
[30] Naeini, M. P., Cooper, G. F., and Hauskrecht, M. (2015). Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 2015, page 2901. NIH Public Access.
[31] Nalisnick, E. and Smyth, P. (2017). The amortized bootstrap. In ICML Workshop on Implicit Models.
[32] Newton, M. A. and Raftery, A. E. (1994). Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society: Series B (Methodological), 56(1):3–26.
[33] Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J. V., Lakshminarayanan, B., and Snoek, J. (2019). Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems.
[34] Præstgaard, J. and Wellner, J. A. (1993). Exchangeably weighted bootstraps of the general empirical process. The Annals of Probability, pages 2053–2086.
[35] Rame, A. and Cord, M. (2021). DICE: Diversity in deep ensembles via conditional redundancy adversarial estimation. In International Conference on Learning Representations.
[36] Reed, S., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., and Rabinovich, A. (2014). Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596.
[37] Shao, J. and Tu, D. (1996). The Jackknife and Bootstrap. Springer Science & Business Media.
[38] Shin, M., Lee, Y., and Liu, J. S. (2020). Scalable uncertainty quantification via generative bootstrap sampler. arXiv preprint arXiv:2006.00767.
[39] Smith, L. and Gal, Y. (2018). Understanding measures of uncertainty for adversarial example detection. In UAI.
[40] Snelson, E. and Ghahramani, Z. (2006). Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264.
[41] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In NIPS.
[42] Wen, Y., Tran, D., and Ba, J. (2020). BatchEnsemble: An alternative approach to efficient ensemble and lifelong learning. In International Conference on Learning Representations.

## Checklist

(i) For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] See Section 1.
(b) Did you describe the limitations of your work? [Yes] See Section 6.
(c) Did you discuss any potential negative societal impacts of your work? [No] We could not find any negative societal effect of our work.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
(ii) If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes] See Appendix.
(b) Did you include complete proofs of all theoretical results? [Yes] See Appendix.
(iii) If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See Section 4.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] We report error bars except for the OOD experiment.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Section 3.2.
(iv) If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes]
(b) Did you mention the license of the assets? [Yes]
(c) Did you include any new assets either in the supplemental material or as a URL? [No] NIH3T3 [5] is proprietary.
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [No] There is no issue on this point; NIH3T3 [5] is cell-line data.
(v) If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] Unnecessary.
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] Unnecessary.
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A] Unnecessary.