# GFlowOut: Dropout with Generative Flow Networks

Dianbo Liu ¹ ², Moksh Jain ¹ ³, Bonaventure F. P. Dossou ¹ ⁴ ⁵, Qianli Shen ⁶, Salem Lahlou ¹ ³, Anirudh Goyal ⁷, Nikolay Malkin ¹ ³, Chris C. Emezue ¹ ⁸, Dinghuai Zhang ¹ ³, Nadhir Hassen ¹ ³, Xu Ji ¹ ³, Kenji Kawaguchi ⁶, Yoshua Bengio ¹ ³ ⁹

Bayesian inference offers principled tools to tackle many critical problems with modern neural networks such as poor calibration and generalization, and data inefficiency. However, scaling Bayesian inference to large architectures is challenging and requires restrictive approximations. Monte Carlo Dropout has been widely used as a relatively cheap way to approximate inference and estimate uncertainty with deep neural networks. Traditionally, the dropout mask is sampled independently from a fixed distribution. Recent research shows that the dropout mask can be seen as a latent variable, which can be inferred with variational inference. These methods face two important challenges: (a) the posterior distribution over masks can be highly multi-modal, which can be difficult to approximate with standard variational inference, and (b) it is not trivial to fully utilize sample-dependent information and correlation among dropout masks to improve posterior estimation. In this work, we propose GFlowOut to address these issues. GFlowOut leverages the recently proposed probabilistic framework of Generative Flow Networks (GFlowNets) to learn the posterior distribution over dropout masks. We empirically demonstrate that GFlowOut results in predictive distributions that generalize better to out-of-distribution data and provide uncertainty estimates which lead to better performance in downstream tasks.

¹ Mila, Quebec AI Institute ² Broad Institute of MIT and Harvard ³ University of Montreal ⁴ McGill University ⁵ Lelapa AI ⁶ National University of Singapore ⁷ Google DeepMind ⁸ Technical University of Munich ⁹ CIFAR AI Chair.
Correspondence to: Dianbo Liu, Moksh Jain. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

## 1. Introduction

A key shortcoming of modern deep neural networks is that they are often overconfident about their predictions, especially when there is a distributional shift between the training and test datasets (Daxberger et al., 2021; Nguyen et al., 2015; Guo et al., 2017). In risk-sensitive scenarios such as clinical practice and drug discovery, where mistakes can be extremely costly, it is important that models provide predictions with reliable uncertainty estimates (Bhatt et al., 2021). Bayesian inference offers principled tools to model the parameters of neural networks as random variables, placing a prior on them and inferring their posterior given some observed data (MacKay, 1992; Neal, 2012). The posterior captures the uncertainty in the predictions of the model and also serves as an effective regularization strategy resulting in improved generalization (Wilson & Izmailov, 2020; Lotfi et al., 2022). In practice, exact Bayesian inference is often intractable, and existing Bayesian deep learning methods rely on assumptions that result in posteriors that are less expressive and can provide poorly calibrated uncertainty estimates (Ovadia et al., 2019; Fort et al., 2019; Foong et al., 2020; Daxberger et al., 2021). In addition, even with several approximations, Bayesian deep learning methods are often significantly more computationally expensive and slower to train than non-Bayesian methods (Kuleshov et al., 2018; Boluki et al., 2020).

Gal and Ghahramani (2016) show that deep neural networks with dropout perform approximate Bayesian inference and approximate the posterior of a deep Gaussian process (Damianou & Lawrence, 2013).
One can obtain samples from this predictive distribution by taking multiple forward passes through the neural network with independently sampled dropout masks. Due to its simplicity and minimal computational overhead, dropout has since been used as a method to estimate uncertainty and improve robustness in neural networks. Different variants of dropout have been proposed and can be interpreted as different variational approximations to model the posterior over the neural network parameters (Ba & Frey, 2013; Kingma et al., 2015; Gal et al., 2017; Ghiasi et al., 2018; Fan et al., 2021; Pham & Le, 2021).

There are a few major challenges in approximating the Bayesian posterior over model parameters using dropout: (1) the multimodal nature of the posterior distribution makes it difficult to approximate with standard variational inference (Gal & Ghahramani, 2016; Le Folgoc et al., 2021), which assumes factorized priors; (2) dropout masks are discrete objects, making gradient-based optimization difficult (Boluki et al., 2020); (3) variational inference methods can suffer from high gradient variance resulting in optimization instability (Kingma et al., 2015); (4) modeling dependence between dropout masks from different layers is non-trivial.

Figure 1. In this work, we propose a Generative Flow Network (GFlowNet) based binary dropout mask generator, which we refer to as GFlowOut. Purple squares are GFlowNet-based dropout mask generators parameterized as multi-layer perceptrons. $z_{i,l}$ refers to the dropout mask for the data point indexed by $i$ at layer $l$ of the model. $h_{i,l}$ refers to the activations of the model at layer $l$ given input $x_i$. $q(\cdot)$ are auxiliary variational functions used and adapted only during model training, in which the posterior distribution over dropout masks is conditioned implicitly on the input covariates ($x_i$) and directly on the label ($y_i$) of the data point to make the estimation easier. $p(\cdot)$ are mask generation functions used at test time, which are conditioned only on $x_i$ and trained by minimizing the Kullback-Leibler (KL) divergence with $q(\cdot)$. In addition, both $q(\cdot)$ and $p(\cdot)$ condition explicitly on the dropout masks of all previous layers.

The recently proposed Generative Flow Networks (GFlowNets) (Bengio et al., 2021a;b) frame the problem of generating discrete objects as a control problem based on the sequential construction of discrete components. GFlowNets learn probabilistic policies that sample objects proportionally to a reward function (or exp(-energy)). They have demonstrated better generalization to multimodal distributions (Nica et al., 2022) and have lower gradient variance compared with policy gradient-based variational methods (Malkin et al., 2023), making them an interesting choice for posterior inference for dropout.

Contributions. In this work, to address the limitations of standard variational inference, we develop a GFlowNet-based binary dropout mask generator, which we refer to as GFlowOut, to estimate the posterior distribution of binary dropout masks. GFlowOut generates the dropout mask for a layer conditioned on the masks generated for the previous layers, therefore accounting for inter-layer dropout dependence. Furthermore, the GFlowOut estimator can be conditioned on the data point: GFlowOut improves posterior estimation by utilizing both input covariates and labels in the training set of supervised learning tasks via an auxiliary variational function.
To investigate the quality of the posterior distribution learned by GFlowOut, we design empirical experiments, including evaluating robustness to distribution shift during inference, detecting out-of-distribution examples with uncertainty estimates, and transfer learning, using both benchmark datasets and a real-world clinical dataset.

## 2. Related work

### 2.1. Dropout as a Bayesian approximation

Deep learning tools have shown tremendous power in different applications. However, traditional deep learning tools lack mechanisms to capture uncertainty, which is of crucial importance in many fields. Uncertainty quantification (UQ) is studied extensively as a fundamental problem of deep learning, and a large number of Bayesian deep learning tools have emerged in recent years. For example, Gal & Ghahramani (2016) showed that training deep learning models with dropout can be cast as an approximation of Bayesian inference in deep Gaussian processes and allows uncertainty estimation without extra computational cost. Kingma et al. (2015) proposed variational dropout, where a dropout posterior over parameters is learned by treating dropout regularization as approximate inference in deep models. Gal et al. (2017) developed a continuous relaxation of discrete dropout masks to improve uncertainty estimation, especially in reinforcement learning settings. Lee et al. (2020) introduced meta-dropout, which involves an additional global term shared across all data points during inference to improve generalization. Xie et al. (2019) replaced the hard dropout mask following a Bernoulli distribution with a soft mask following a beta distribution and conducted the optimization using a stochastic gradient variational Bayesian algorithm to control the dropout rate. Boluki et al. (2020) combined a model-agnostic dropout scheme with variational auto-encoders (VAEs), resulting in semi-implicit VAE models.
Instead of using a mean-field family for variational inference, Nguyen et al. (2021) utilized a structured representation of multiplicative Gaussian noise for better posterior estimation. More recently, Fan et al. (2021) developed contextual dropout, which optimizes variational objectives in a sample-dependent manner and, to the best of our knowledge, is the closest approach to GFlowOut in the literature. GFlowOut differs from contextual dropout in several aspects. First, both methods take trainable priors into account, but GFlowOut also takes into account priors that depend on the input covariates of each data point. Second, the variational posterior of contextual dropout depends only on the input covariates ($x$), while in GFlowOut the variational posterior is also conditioned on the label $y$, which provides more information for training. Third, within each neural network layer, the mask of contextual dropout is conditioned on previous masks implicitly, while the mask of GFlowOut is conditioned on previous masks explicitly by directly feeding previous masks as inputs into the generator, which improves the training process. Finally, instead of the REINFORCE-based gradient estimator used for contextual dropout training, GFlowOut employs GFlowNets for the variational posterior.

### 2.2. Generative flow networks

Generative flow networks (GFlowNets) (Bengio et al., 2021a;b) are a family of probabilistic models that amortize sampling of discrete compositional objects proportionally to a given unnormalized density function. GFlowNets learn a stochastic policy to construct objects through a sequence of actions, akin to deep reinforcement learning (Sutton & Barto, 2018). GFlowNets are trained so as to make the likelihood of reaching a terminating state proportional to the reward. Recent works have shown close connections of GFlowNets to other generative models (Zhang et al., 2022a) and to hierarchical variational inference (Malkin et al., 2023).
GFlowNets have achieved great empirical success in learning energy-based models (Zhang et al., 2022b), small-molecule generation (Bengio et al., 2021a; Nica et al., 2022; Malkin et al., 2022; Madan et al., 2023; Pan et al., 2023), biological sequence generation (Malkin et al., 2022; Jain et al., 2022; Madan et al., 2023), and structure learning (Deleu et al., 2022). Several training objectives have been proposed for GFlowNets, including Flow Matching (FM) (Bengio et al., 2021a), Detailed Balance (DB) (Bengio et al., 2021b), Trajectory Balance (TB) (Malkin et al., 2022), and the more recent Sub-Trajectory Balance (SubTB) (Madan et al., 2023). In this work, we use the Trajectory Balance (TB) objective.

## 3. Method

In this section, we define the problem setting and mathematical notation used in this study, and describe the proposed method, GFlowOut, for dropout mask generation in detail.

### 3.1. Background and notation

Dropout. In a vanilla feed-forward neural network (MLP) with $L$ layers, each layer of the model has a weight matrix $w_l$ and a bias vector $b_l$. It takes as input the activations $h_{l-1}$ from the previous layer with layer index $l-1$, and computes as output $h_l = \sigma(w_l h_{l-1} + b_l)$, where $\sigma$ is a non-linear activation function. Dropout consists of dropping out units from the output of a layer. Formally, this can be described as applying a sampled binary mask $z_l \sim p(z_l)$ on the output of the layer, $h_l = z_l \odot \sigma(w_l h_{l-1} + b_l)$, at each layer in the model. In regular random dropout, $z_l$ is a collection of i.i.d. Bernoulli($r$) variables, where $r$ is a fixed parameter for all the layers. Recently, several approaches have been proposed to learn $p(z_l)$ along with the model parameters. In these approaches, $z$ is viewed either as latent variables or as part of the model parameters.
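As a minimal illustration of the masked layer computation $h_l = z_l \odot \sigma(w_l h_{l-1} + b_l)$ described above, the following sketch applies regular random dropout to one layer in plain Python. It is not the authors' implementation: the helper names (`relu`, `linear`, `sample_bernoulli_mask`, `dropout_layer`) are hypothetical, ReLU stands in for $\sigma$, and the Bernoulli parameter is assumed here to be the probability that a unit is kept ($z = 1$).

```python
import random


def relu(v):
    """Element-wise ReLU, playing the role of the non-linearity sigma."""
    return [max(0.0, a) for a in v]


def linear(w, b, h):
    """Compute w h + b, with w given as a list of rows."""
    return [sum(wij * hj for wij, hj in zip(row, h)) + bi
            for row, bi in zip(w, b)]


def sample_bernoulli_mask(n_units, keep_prob, rng):
    """Draw i.i.d. binary entries; 1 keeps a unit, 0 drops it (assumed convention)."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(n_units)]


def dropout_layer(w, b, h_prev, keep_prob, rng):
    """Return h_l = z_l * sigma(w_l h_{l-1} + b_l) and the sampled mask z_l."""
    activated = relu(linear(w, b, h_prev))
    z = sample_bernoulli_mask(len(activated), keep_prob, rng)
    return [zi * a for zi, a in zip(z, activated)], z


# Toy layer with 2 inputs and 3 units (values are arbitrary).
rng = random.Random(0)
w = [[1.0, -1.0], [0.5, 0.5], [2.0, 0.0]]
b = [0.0, 0.0, 0.0]
h, z = dropout_layer(w, b, [1.0, 2.0], keep_prob=0.5, rng=rng)
print("mask:", z, "masked activations:", h)
```

Learned-dropout methods, including the one proposed here, keep the same layer computation and replace only the fixed sampling step with a learned mask distribution.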
We consider two variants of our proposed method: GFlowOut, where the dropout masks $z$ are viewed as sample-dependent latent variables, and ID-GFlowOut, which generates masks in a sample-independent manner, where $z$ is viewed as a part of the model parameters shared across all samples. Next, we briefly introduce GFlowNets and describe how they model the dropout masks $z$ given the data.

GFlowNets. Let $G = (\mathcal{S}, \mathcal{A})$ be a directed acyclic graph (DAG) where the vertices $s \in \mathcal{S}$ are states, including a special initial state $s_0$ with no incoming edges, and the directed edges $(s \to s') \in \mathcal{A}$ are actions. $\mathcal{X} \subseteq \mathcal{S}$ denotes the terminal states, which have no outgoing edges. A complete trajectory $\tau = (s_0 \to \dots \to s_{i-1} \to s_i \to \dots \to z) \in \mathcal{T}$ in $G$ is a sequence of states starting at $s_0$ and terminating at $z \in \mathcal{X}$, where each $(s_{i-1} \to s_i) \in \mathcal{A}$. The forward policy $P_F(\cdot \mid s)$ is a collection of distributions over the children of each non-terminal node $s \in \mathcal{S}$ and defines a distribution over complete trajectories, $P_F(\tau) = \prod_{(s_{i-1} \to s_i) \in \tau} P_F(s_i \mid s_{i-1})$. We can sample terminal states $z \in \mathcal{X}$ by sampling trajectories following $P_F$. Let $\pi(z)$ be the marginal likelihood of sampling terminal state $z$, $\pi(z) = \sum_{\tau = (s_0 \to \dots \to z) \in \mathcal{T}} P_F(\tau)$. Given a non-negative reward function $R : \mathcal{X} \to \mathbb{R}^+$, the learning problem tackled in GFlowNets is to estimate $P_F$ such that $\pi(z) \propto R(z)$, $\forall z \in \mathcal{X}$. We refer the reader to Bengio et al. (2021b) and Malkin et al. (2022) for a more thorough introduction to GFlowNets. We adopt the Trajectory Balance (TB) (Malkin et al., 2022) parameterization, which includes $P_F(\cdot \mid \cdot\,; \phi)$, $P_B(\cdot \mid \cdot\,; \phi)$, and $Z_\gamma$, where $\phi$ and $\gamma$ are the learnable parameters. The backward policy $P_B$ is a distribution over the parents of every non-initial state, and $Z_\gamma$ is an estimate of the partition function.

Within the context of generating dropout masks, a complete dropout mask $z \in \mathcal{X}$ is a binary vector of dimension $M$, where $M$ is the number of units in the neural network, i.e. $\mathcal{X}$ is equal to $\{0, 1\}^M$.
A partially constructed mask $s \in \mathcal{S}$ is a binary vector of dimension $m < M$ representing the mask for a set of initial layers in the model, and an action consists of appending the mask for the subsequent layer to this vector. That is, each action in the sequence samples the mask for an entire layer (in parallel), conditioned on the masks for the previous layers. In the next section, we formally describe how GFlowNets can be used for generating dropout masks, as well as practical implementation details.

### 3.2. GFlowOut

We consider a generative model of the form $p(x, y, z) = p(x)\,p(z \mid x)\,p(y \mid x, z)$, where $x$ is the input data with corresponding label $y$ and $z$ is a local discrete latent variable representing the sample-dependent dropout mask, along with a dataset of observations $D = \{(x_i, y_i)\}_{i=1}^{N}$. GFlowOut learns to approximate the posterior $p(z \mid x, y)$ using the given dataset $D$. In a supervised learning task where the goal is to learn the predictive distribution $p(y \mid x)$, with the assumed generative model above, the following variational bound can be derived:

$$
\begin{aligned}
\log \prod_{i=1}^{N} p(y_i \mid x_i) \qquad & (1) \\
&= \sum_{i=1}^{N} \log \sum_{z_i \in \mathcal{X}} p(z_i \mid x_i)\, p(y_i \mid x_i, z_i) \\
&= \sum_{i=1}^{N} \log \sum_{z_i \in \mathcal{X}} q(z_i \mid x_i, y_i)\, \frac{p(z_i \mid x_i)}{q(z_i \mid x_i, y_i)}\, p(y_i \mid x_i, z_i) \\
&= \sum_{i=1}^{N} \log \mathbb{E}_{q(z_i \mid x_i, y_i)} \left[ \frac{p(z_i \mid x_i)}{q(z_i \mid x_i, y_i)}\, p(y_i \mid x_i, z_i) \right] \qquad (2) \\
&\geq \sum_{i=1}^{N} \mathbb{E}_{q(z_i \mid x_i, y_i)} \left[ \log \frac{p(z_i \mid x_i)}{q(z_i \mid x_i, y_i)}\, p(y_i \mid x_i, z_i) \right] \\
&= \sum_{i=1}^{N} \Big( \mathbb{E}_{q(z_i \mid x_i, y_i)} \big[ \log p(y_i \mid x_i, z_i) \big] - \mathrm{KL}\big( q(z_i \mid x_i, y_i) \,\|\, p(z_i \mid x_i) \big) \Big) \qquad (3)
\end{aligned}
$$

where $p(z_i \mid x_i)$ is part of the generative process and $q(z_i \mid x_i, y_i)$ is the variational distribution used to approximate the posterior of $z_i$. To improve the efficiency of training and fully utilize the information available in each data point, we design $q(z_i \mid x_i, y_i)$ so that the distribution of $z_i$ is conditioned on both $x_i$ and $y_i$. As a consequence, $q(z_i \mid x_i, y_i)$ is not accessible during inference, where $y_i$ is not available. Instead, $p(z_i \mid x_i)$, which is trained by minimizing the KL divergence with $q(z_i \mid x_i, y_i)$, is used for inference.

Algorithm 1 GFlowOut

The whole system has the following 3 components:

- Backbone model (e.g., classifier) neural network $p(y_i \mid x_i, z_i; \theta)$. This algorithm section is written assuming $p(y_i \mid x_i, z_i; \theta)$ is an MLP with $L$ hidden layers, but it can be easily extended to other architectures.
- GFlowNet $q(z_i \mid x_i, y_i; \phi)$, which approximates the posterior distribution over dropout masks $z_i$ conditioned on both $x_i$ and $y_i$ from the data point. Its tempered version $q'(z_i \mid x_i, y_i; \phi)$ is used for dropout mask sampling during training.
- $p(z_i \mid x_i; \xi)$, which generates the dropout mask distribution conditioned only on $x_i$, is optimized by minimizing the KL divergence with $q(z_i \mid x_i, y_i; \phi)$ and is used for dropout mask sampling at test time.

$q(z_i \mid x_i, y_i; \phi)$ and $p(z_i \mid x_i; \xi)$ are both implemented as groups of MLPs, one MLP for each layer $l$; they do not share parameters with each other nor between different layers. Next, we explain how dropout masks are generated and how $p(y_i \mid x_i, z_i; \theta)$, $q(z_i \mid x_i, y_i; \phi)$ and $p(z_i \mid x_i; \xi)$ are computed.

- for epoch do
  - for each data point $(x_i, y_i)$ do (batches are used in actual training)
    - var1 = 0, var2 = 0 (variables to store log-probabilities from each layer)
    - for layer $l$ in $1 : L-1$ do (use the current layer's activation and the dropout masks of all previous layers for mask generation)
      - $h'_{l,i} = \mathrm{ReLU}(b_l + w_l h_{l-1,i})$
      - $z_{i,l} \sim q'_l(z_{i,l} \mid h'_{l,i}, y_i, (z_{i,j})_{j=1}^{l-1}; \phi)$ (dropout masks generated)
      - $h_{l,i} = z_{i,l} \odot h'_{l,i}$ (apply dropout)
      - var1 += $\log q_l(z_{i,l} \mid x_i, y_i, (z_{i,j})_{j=1}^{l-1}; \phi)$ (calculate log-probabilities)
      - var2 += $\log p_l(z_{i,l} \mid x_i, (z_{i,j})_{j=1}^{l-1}; \xi)$
    - end for
    - $\log q(z_i \mid x_i, y_i; \phi)$ = var1
    - $\log p(z_i \mid x_i; \xi)$ = var2
    - In the output layer, $\hat{y}_i = f_{out}(b_L + w_L h_{L-1,i})$, where $f_{out}$ is the output non-linearity of the output layer
    - Update $\theta, \phi, \xi, \omega, \gamma$ using equations 4-12
  - end for
- end for
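The inner loop of Algorithm 1 can be sketched in plain Python as follows. This is a rough sketch under stated assumptions rather than the authors' code: `toy_generator` is a hypothetical stand-in for the per-layer mask-generator MLPs (its functional form is arbitrary), the label is assumed to be a small integer, and masks are sampled directly from $q$ rather than from its tempered version. The point is only to show how each layer's mask is conditioned on the masks of earlier layers and how the two log-probabilities (var1 and var2 in Algorithm 1) are accumulated along the way.

```python
import math
import random


def relu(v):
    return [max(0.0, a) for a in v]


def linear(w, b, h):
    return [sum(wij * hj for wij, hj in zip(row, h)) + bi
            for row, bi in zip(w, b)]


def bernoulli_log_prob(z, probs, eps=1e-6):
    """Log-probability of a binary mask under independent per-unit probabilities."""
    total = 0.0
    for zi, p in zip(z, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical safety
        total += math.log(p) if zi == 1 else math.log(1.0 - p)
    return total


def toy_generator(h_pre, prev_masks, label=None):
    """Hypothetical stand-in for one per-layer mask-generator MLP.

    Maps the layer's activations, all previous masks, and optionally the
    label to per-unit keep probabilities. The functional form is arbitrary.
    """
    context = 0.1 * sum(sum(m) for m in prev_masks)
    if label is not None:
        context += 0.5 * label
    return [1.0 / (1.0 + math.exp(-(a - 1.0 + context))) for a in h_pre]


def forward_with_masks(x, y, hidden_layers, rng):
    """Sample masks layer by layer and accumulate log q and log p."""
    h, masks = x, []
    log_q, log_p = 0.0, 0.0
    for w, b in hidden_layers:
        h_pre = relu(linear(w, b, h))                      # pre-mask activations
        q_probs = toy_generator(h_pre, masks, label=y)     # q_l sees the label
        p_probs = toy_generator(h_pre, masks, label=None)  # p_l does not
        z = [1 if rng.random() < p else 0 for p in q_probs]
        log_q += bernoulli_log_prob(z, q_probs)            # var1 in Algorithm 1
        log_p += bernoulli_log_prob(z, p_probs)            # var2 in Algorithm 1
        h = [zi * a for zi, a in zip(z, h_pre)]            # apply dropout
        masks.append(z)
    return h, masks, log_q, log_p


# Two toy hidden layers (2 -> 3 -> 2 units) with arbitrary weights.
layers = [
    ([[0.5, -0.2], [0.3, 0.8], [1.0, 1.0]], [0.0, 0.1, -0.1]),
    ([[0.2, 0.4, 0.6], [-0.5, 0.5, 0.5]], [0.0, 0.0]),
]
h, masks, log_q, log_p = forward_with_masks([1.0, 2.0], 1, layers, random.Random(0))
print(masks, round(log_q, 3), round(log_p, 3))
```

Within a layer the units are sampled independently given the context, which is why a whole layer's mask can be drawn in parallel while the dependence across layers is kept.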
Parametrizing each of the terms as $p(y_i \mid x_i, z_i; \theta)$, $q(z_i \mid x_i, y_i; \phi)$ and $p(z_i \mid x_i; \xi)$, the goal is to maximize the lower bound derived above, which we denote as:

$$
B(D; \theta, \phi, \xi) = \sum_{i=1}^{N} \Big( \mathbb{E}_{q(z_i \mid x_i, y_i; \phi)} \big[ \log p(y_i \mid x_i, z_i; \theta) \big] - \mathrm{KL}\big( q(z_i \mid x_i, y_i; \phi) \,\|\, p(z_i \mid x_i; \xi) \big) \Big) \qquad (4)
$$

The gradients of $B$ with respect to $\theta$ and $\xi$ are:

$$
\nabla_\theta B(D; \theta, \phi, \xi) = \sum_{i=1}^{N} \nabla_\theta\, \mathbb{E}_{q(z_i \mid x_i, y_i)} \big[ \log p(y_i \mid x_i, z_i; \theta) \big]
$$

$$
\nabla_\xi B(D; \theta, \phi, \xi) = \sum_{i=1}^{N} \nabla_\xi\, \mathbb{E}_{q(z_i \mid x_i, y_i)} \big[ \log p(z_i \mid x_i; \xi) \big]
$$

The gradient of the variational objective $B$ with respect to $\phi$ requires a score function estimator, which is known to suffer from high gradient variance (Malkin et al., 2023). Instead of directly optimizing $B$ with respect to $\phi$, we first observe that, since $p(z_i \mid x_i)\, p(y_i \mid x_i, z_i) = p(y_i \mid x_i)\, p(z_i \mid x_i, y_i)$, $B$ can be written as:

$$
B = \sum_{i=1}^{N} \Big( \log p(y_i \mid x_i) - \mathrm{KL}\big( q(z_i \mid x_i, y_i) \,\|\, p(z_i \mid x_i, y_i) \big) \Big),
$$

making $p(z_i \mid x_i, y_i)$, the true posterior, a target for the variational distribution $q(z_i \mid x_i, y_i)$. We thus propose to use a GFlowNet with the Trajectory Balance loss to train $q(z_i \mid x_i, y_i; \phi)$ to match its target, given by its unnormalized density $R = p(y_i \mid x_i, z_i)\, p(z_i \mid x_i)$. As binary dropout masks $z$ are high-dimensional discrete objects that can be constructed sequentially, we consider them as the terminating states of a GFlowNet, and instead of learning a distribution over these terminating states directly, we exploit the DAG structure to learn a forward policy $P_F(\cdot \mid \cdot\,; \phi)$, for which the terminating state distribution is $q(z_i \mid x_i, y_i; \phi)$. The Trajectory Balance loss requires an additional parameter $Z_\gamma$ to train $q(z_i \mid x_i, y_i; \phi)$ (Bengio et al., 2021b; Malkin et al., 2022). The corresponding trajectory balance loss for a trajectory $\tau = (s_0, \dots, s_L)$ w.r.t. $(Z_\gamma, P_F(\cdot \mid \cdot\,; \phi))$ is

$$
\mathcal{L}_{TB}(\tau, D; \phi, \gamma) = \left( \log \frac{Z_\gamma \prod_{t=1}^{L} P_F(s_t \mid s_{t-1}; \phi)}{R} \right)^2
$$

where a state $s_l$ in the GFlowNet graph refers to the set of dropout masks sampled by the GFlowNet from layer 1 to $l$ of the model, $(z_j)_{j=1}^{l}$. $L$ is the number of layers involving dropout. $s_L$ indicates the termination of the trajectory sampling process. $P_F(s_t \mid s_{t-1}; \phi)$ refers to the forward policy in GFlowNets ¹.
$\log Z_\gamma = f(x_i, y_i; \gamma)$ is the partition function estimator with parameters $\gamma$, conditioned on both $x_i$ and $y_i$ from the data point indexed by $i$. Its parameter $\gamma$ is trained together with $\phi$. $R$ is the reward calculated from the likelihood of the data and the prior distribution of the states, which are sets of dropout masks (see equation 12). The parameters $\phi$ and $\gamma$ are updated by taking gradient steps on $\mathcal{L}_{TB}(\tau, D; \phi, \gamma)$ for $\tau$ sampled from some training policy. We choose to make the training policy a tempered version of $q(z_i \mid x_i, y_i; \phi)$, denoted $q'(z_i \mid x_i, y_i; \phi)$. The expected gradient update is thus equal to

$$
\sum_{i=1}^{N} \mathbb{E}_{z_i \sim q'(\cdot \mid x_i, y_i; \phi)}\, \nabla_{\phi, \gamma} \big( \log Z_\gamma + \log q(z_i \mid x_i, y_i; \phi) - \log R \big)^2 \qquad (6)
$$

where $\log R = \log p(y_i \mid x_i, z_i; \theta) + \log p(z_i \mid x_i; \xi)$. During inference, one estimates the posterior predictive as $p_{pred}(y_i \mid x_i) = \frac{1}{M} \sum_{j=1}^{M} p(y_i \mid x_i, z_j; \theta)$, where $M$ different $z_j$ are sampled from the distribution $p(z_i \mid x_i; \xi)$.

Implementation details. Algorithm 1 presents a high-level overview of the GFlowOut implementation. $q(z_i \mid x_i, y_i; \phi)$ is implemented as a set of multiple MLPs, one for each layer in the model that requires dropout mask generation (see Figure 1). At layer $l$ of the model, the dropout probabilities of all units in layer $l$ are estimated in parallel, conditioned on the previous layer's activation $h_{l-1,i}$, the label $y_i$, and all dropout masks in layers before $l$, using $q_l(z_{i,l} \mid h_{l-1,i}, y_i, (z_{i,j})_{j=1}^{l-1}; \phi)$, which is parameterized as an MLP. The same process is repeated for each layer in the model. In this way, the dropout mask probabilities at layer $l$ take into consideration the input $x_i$ through $h_{l-1,i}$, the label $y_i$, and the joint probability with all dropout masks in previous layers, but are independent of the other mask entries in the same layer. $p(z_i \mid x_i; \xi)$ follows the same implementation except that it is not conditioned on $y_i$, and hence it can be used at test time for prediction.
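To make the two scalar computations above concrete, here is a small sketch of the per-sample trajectory balance residual from equation (6) and of the Monte Carlo posterior predictive average. The function names are hypothetical, the inputs are assumed to be precomputed log-probabilities and class-probability vectors, and no gradient computation is shown; in practice an automatic differentiation library would differentiate this loss with respect to $\phi$ and $\gamma$.

```python
import math


def trajectory_balance_loss(log_z, log_q, log_likelihood, log_prior):
    """Squared residual (log Z + log q(z|x,y) - log R)^2.

    log R = log p(y|x,z) + log p(z|x), as in the text.
    """
    log_r = log_likelihood + log_prior
    return (log_z + log_q - log_r) ** 2


def posterior_predictive(per_mask_probs):
    """Average class probabilities over M sampled masks.

    per_mask_probs[j][c] is p(y = c | x, z_j) for the j-th sampled mask.
    """
    m = len(per_mask_probs)
    n_classes = len(per_mask_probs[0])
    return [sum(p[c] for p in per_mask_probs) / m for c in range(n_classes)]


# If q(z) equals R(z) / Z exactly, the residual vanishes.
# Here R = 0.4 * 0.5 = 0.2, Z = 0.8, so the matching q is 0.25.
loss = trajectory_balance_loss(
    log_z=math.log(0.8),
    log_q=math.log(0.25),
    log_likelihood=math.log(0.4),
    log_prior=math.log(0.5),
)
print(loss)

# Averaging predictions from two sampled masks for a 2-class problem.
print(posterior_predictive([[0.9, 0.1], [0.5, 0.5]]))
```

The loss is zero exactly when the sampler assigns each mask a probability proportional to its reward, which is the GFlowNet objective stated in Section 3.1.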
In convolutional neural networks, GFlowOut is implemented in a similar manner except that the units are dropped out channel-wise (Yang et al., 2020; Park & Kwak, 2016). Early stopping based on performance on the validation set is used to prevent overfitting. Details of the computational efficiency of GFlowOut are discussed in the Appendix.

¹ As the graph $G$ used in this study has a tree structure, the backward policy $P_B(s_{t-1} \mid s_t)$ is a constant, so we leave it out of the equations.

### 3.3. ID-GFlowOut: GFlowOut without sample-dependent information

To understand whether the sample-dependent information is needed, we introduce a variant of GFlowOut that uses only sample-independent information and keeps the rest of the algorithm as close to GFlowOut as possible for comparison. Consider a generative model of the form $p(x, y, z) = p(x)\,p(z)\,p(y \mid x, z)$. Given a supervised learning task $p(y \mid x)$, we generate a dropout mask $z$ that is not conditioned on the data point. We use $q(z)$ to approximate the posterior of $z$, which can be seen as part of the model parameters shared by all data points. The following bound can be derived:

$$
\begin{aligned}
\log \prod_{i=1}^{N} p(y_i \mid x_i) &= \log \sum_{z} p(z) \prod_{i=1}^{N} p(y_i \mid x_i, z) \\
&= \log \sum_{z} q(z)\, \frac{p(z)}{q(z)} \prod_{i=1}^{N} p(y_i \mid x_i, z) \\
&= \log \mathbb{E}_{z \sim q(z)} \left[ \frac{p(z)}{q(z)} \prod_{i=1}^{N} p(y_i \mid x_i, z) \right] \\
&\geq \mathbb{E}_{z \sim q(z)} \left[ \log \frac{p(z)}{q(z)} \prod_{i=1}^{N} p(y_i \mid x_i, z) \right] \\
&= \mathbb{E}_{z \sim q(z)} \left[ \sum_{i=1}^{N} \log p(y_i \mid x_i, z) \right] - \mathrm{KL}\big( q(z) \,\|\, p(z) \big) \qquad (7)
\end{aligned}
$$

where the same distribution $p(z)$ is shared across the whole dataset and $q$ is conditioned on the whole dataset implicitly.

$$
B(D; \theta', \phi') = \mathbb{E}_{q(z; \phi')} \left[ \sum_{i=1}^{N} \log p(y_i \mid x_i, z; \theta') \right] - \mathrm{KL}\big( q(z; \phi') \,\|\, p(z) \big) \qquad (8)
$$

where $B$ is a lower bound. We parameterize each of the terms as $p(y \mid x, z; \theta')$ and $q(z; \phi')$. $p(z)$ is a fixed prior with a dropout rate of 0.5 for each unit. The gradients for stochastic optimization of $\theta'$ can be obtained as

$$
\nabla_{\theta'} B(D; \theta', \phi') = \sum_{i=1}^{N} \mathbb{E}_{z \sim q(z; \phi')}\, \nabla_{\theta'} \log p(y_i \mid x_i, z; \theta')
$$

The distribution $q(z; \phi')$ can be trained as the policy of a GFlowNet, using a tempered version $q'(z; \phi')$ to sample trajectories for training.
The expected update direction for $\phi'$ and $\gamma$ can be shown to equal

$$
\sum_{i=1}^{N} \mathbb{E}_{z_i \sim q'(z_i; \phi')}\, \nabla_{\phi', \gamma} \big( \log Z_\gamma + \log q(z_i; \phi') - \log R \big)^2,
$$

where the log-reward is

$$
\log R = N \log p(y_i \mid x_i, z_i; \theta') + \log p(z_i). \qquad (11)
$$

In ID-GFlowOut, $\log Z_\gamma = \gamma$ and does not condition on any inputs. During inference, the posterior predictive estimate is $p_{pred}(y_i \mid x_i) = \frac{1}{M} \sum_{j=1}^{M} p(y_i \mid x_i, z_j; \theta')$, where $M$ different $z_j$ are sampled from the distribution $q(z; \phi')$.

## 4. Experiments

In this section, we empirically evaluate GFlowOut ² on a variety of tasks to understand its ability to generalize across different distributions and estimate uncertainties in prediction. We first evaluate the generalization performance of the posterior predictive approximated by GFlowOut on an image classification task. We also evaluate the efficacy of GFlowOut in the context of transfer learning. To understand the performance of GFlowOut when used in larger models and datasets, we conduct Visual Question Answering (VQA) experiments using Transformer architectures. Next, we evaluate the uncertainty captured by the posterior to detect out-of-distribution examples. Finally, we study a potential application of GFlowOut in a real-world clinical use case for the cross-hospital prediction of mortality in intensive care units (ICUs). We supplement these results with further analysis and additional experimental details in the Appendix.

Robustness to data distribution shift. To evaluate the robustness of GFlowOut to distribution shifts between the training and test data, we study its predictive performance on out-of-distribution (OOD) examples. We conduct experiments on the MNIST, CIFAR-10, and CIFAR-100 datasets with different types and levels of deformations. For MNIST, we train a two-layer MLP with 300 and 100 units respectively and evaluate predictions on MNIST images rotated by a uniformly sampled angle (0° to 360°).
Similarly, we use ResNet-18 (He et al., 2016) models for the CIFAR-10/CIFAR-100 datasets and evaluate their robustness to distribution shifts induced by random rotations. Additionally, we consider Snow, Frost, and Gaussian noise image corruptions (Hendrycks & Dietterich, 2019), and analyze the robustness of models to each type of deformation applied with varying intensities. We consider both the GFlowOut and ID-GFlowOut variants, and as baselines use Random Dropout (standard Bernoulli dropout) (Hinton et al., 2012), Contextual Dropout (Fan et al., 2021) and Concrete Dropout (Gal et al., 2017). The results, as summarized in Table 1, show that models trained using GFlowOut are in general more robust to random rotations, and GFlowOut outperforms (or at least matches the performance of) baselines in five out of six experiments with different levels of corruption (Figure 2, Appendix Figure 3 and Figure 4). These observations suggest that models trained with GFlowOut are more robust to distribution shifts than the baselines. Better generalization performance under distribution shifts indicates that GFlowOut potentially learns a better approximation of the Bayesian posterior over the model parameters.

² Code is available at https://github.com/kaiyuanmifen/GFNDropout

Table 1. Performance on clean and corrupted data to evaluate the robustness of models trained with different dropout methods to random image rotations at test time.

| Data | Method | Acc. | Acc. (rotated) |
|---|---|---|---|
| CIFAR-10 | Concrete | 90.13 | 30.68 |
| CIFAR-10 | Contextual | 90.12 | 29.11 |
| CIFAR-10 | Random | 91.47 | 27.04 |
| CIFAR-10 | ID-GFlowOut | 91.85 | 30.57 |
| CIFAR-10 | GFlowOut | 91.52 | 31.00 |
| CIFAR-100 | Concrete | 62.49 | 16.43 |
| CIFAR-100 | Contextual | 62.30 | 15.93 |
| CIFAR-100 | Random | 67.19 | 15.70 |
| CIFAR-100 | ID-GFlowOut | 69.99 | 16.29 |
| CIFAR-100 | GFlowOut | 69.80 | 17.01 |
| MNIST | Concrete | 97.36 | 66.10 |
| MNIST | Contextual | 98.15 | 66.20 |
| MNIST | Random | 87.38 | 43.96 |
| MNIST | ID-GFlowOut | 97.05 | 70.19 |
| MNIST | GFlowOut | 96.75 | 66.41 |
Visual Question Answering task using transformer architecture. To evaluate GFlowOut on large-scale tasks with larger models, we consider a transformer-based multi-modal architecture, MCAN (Yu et al., 2019), for the Visual Question Answering (VQA) task, following Fan et al. (2021). The task involves answering a textual question related to the content of a given image. There are three types of questions in the task, namely binary yes/no questions, numerical questions, and other questions. Dropout is applied on cross-modal attention between images and texts, within-data-type self-attention, and the feed-forward layers after attention. Our experimental results in Table 2 suggest that GFlowOut either outperforms or matches the performance of contextual and concrete dropout when tested on generalization to a noisy dataset in which Gaussian noise is added to the visual inputs (Fan et al., 2021).

Table 2. Performance on different question types in the Visual Question Answering task with a Transformer-based model trained with different methods.

| Method | Test Set | Acc. (All) | Acc. (Yes/No) | Acc. (Number) | Acc. (Other) |
|---|---|---|---|---|---|
| ID-GFlowOut | Original | 66.66 | 84.21 | 48.99 | 58.42 |
| GFlowOut | Original | 66.12 | 83.91 | 49.01 | 58.33 |
| Contextual | Original | 66.89 | 84.48 | 49.04 | 58.24 |
| Concrete | Original | 66.92 | 84.51 | 48.66 | 58.38 |
| ID-GFlowOut | Noisy | 50.27 | 74.01 | 32.16 | 36.12 |
| GFlowOut | Noisy | 50.33 | 73.31 | 32.64 | 40.17 |
| Contextual | Noisy | 49.72 | 73.81 | 32.4 | 35.97 |
| Concrete | Noisy | 50.2 | 73.5 | 31.45 | 37.39 |

Uncertainty estimation for out-of-distribution detection. Another way to evaluate the quality of the learned posterior is to analyze the uncertainty estimates on a downstream task. We consider the standard task of using uncertainty estimates for detecting out-of-distribution (OOD) examples (Nado et al., 2021). The intuition is that a well-calibrated model should produce high uncertainty in predictions on OOD examples. This can be useful in cases where difficult OOD examples can be delegated to humans for more careful consideration.
As in the previous experiments, we consider ResNet-18 models for CIFAR-10/CIFAR-100 classification and compute uncertainty estimates on the CIFAR-10/CIFAR-100 and SVHN (OOD) test sets. Uncertainty for the prediction on each example is calculated using the Dempster-Shafer metric (Sensoy et al., 2018). For baselines, we consider Contextual Dropout and Concrete Dropout, along with standard MC Dropout and Deep Ensembles, which are strong baselines for this task. We run the experiment with 5 seeds and report the mean and standard error. We study both GFlowOut with sample-dependent information and ID-GFlowOut with only sample-independent information. In Table 3, we present AUPR and AUROC for in-distribution classification (CIFAR-10 and CIFAR-100) and OOD classification (SVHN) using the uncertainty estimates from each method. We observe that GFlowOut outperforms the other dropout baselines with both CIFAR-10 and CIFAR-100 as the training dataset, indicating that the sample-dependent information used in GFlowOut results in more calibrated uncertainty estimates. ID-GFlowOut performs well on CIFAR-100 but performs poorly on CIFAR-10. Results of deep ensembles, a widely used state-of-the-art uncertainty estimation method, are also reported in Table 3 for comparison.

Adaptation after training on noisy data. Next, we evaluate the ability of models trained with GFlowOut to adapt quickly after being trained on noisy data. Concretely, during training we add label noise, i.e., we randomly assign the labels for a fraction (30%) of points in the training set, and then re-train the classifier on a small fraction of the dataset with clean labels. We adopt the same experimental setup as the previous experiments. We consider ResNet-18 models trained on the CIFAR-10/CIFAR-100 datasets. As baselines, we again use Contextual Dropout and Concrete Dropout. Figure 2 shows that the models trained with GFlowOut adapt faster than the dropout methods we use as baselines.
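The label-noise setup described above can be sketched as follows. This is an illustrative sketch only: the function name `add_label_noise` is hypothetical, and it assumes that "randomly assigning" a label means drawing a class uniformly from all classes, so a corrupted point may by chance keep its original label. The paper does not specify these details.

```python
import random


def add_label_noise(labels, num_classes, noise_fraction, rng):
    """Reassign labels uniformly at random for a fraction of the data points.

    Returns the noisy label list and the sorted indices that were resampled.
    Assumption: resampled labels are drawn uniformly over all classes.
    """
    n = len(labels)
    n_noisy = int(round(noise_fraction * n))
    noisy_idx = rng.sample(range(n), n_noisy)
    noisy = list(labels)
    for i in noisy_idx:
        noisy[i] = rng.randrange(num_classes)
    return noisy, sorted(noisy_idx)


# Toy example: 1000 points, 10 classes, 30% of labels resampled.
clean = [i % 10 for i in range(1000)]
noisy, idx = add_label_noise(clean, num_classes=10, noise_fraction=0.3,
                             rng=random.Random(0))
changed = sum(1 for a, b in zip(clean, noisy) if a != b)
print(len(idx), "labels resampled,", changed, "actually changed")
```

Under this assumption, with 10 classes roughly one in ten resampled labels coincides with the original, so the effective corruption rate is somewhat below the nominal 30%.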
We also observe that both the sample-dependent and sample-independent variants of GFlowOut achieve similar performance.

[Figure 2 panels. Top: CIFAR-10 with all deformations; CIFAR-100 with all deformations (x-axis: amount of deformation, 1 to 5). Bottom: CIFAR-10 transfer learning; CIFAR-100 transfer learning (x-axis: number of data points, 2000 to 8000). Methods: Concrete, Contextual, ID-GFlowOut, GFlowOut.]

Figure 2. Top: Evaluating the robustness of ResNet-18 models, trained with different dropout methods, to different amounts of deformation, including SNOW deformation, FROST deformation, or Gaussian noise, on CIFAR-10/CIFAR-100 at test time. See Figures 3 and 4 in the Appendix for detailed results. Bottom: Evaluating the transfer learning performance of ResNet-18 models trained on CIFAR-10/CIFAR-100 with label noise and fine-tuned on varying amounts of clean data (i.e., without any label noise).

Table 3. Performance on the out-of-distribution (OOD) detection task indicates that the posterior approximated by GFlowOut provides better uncertainty estimates (mean ± standard error).

| Data | Method | AUROC | AUPR |
|---|---|---|---|
| CIFAR-10 | Concrete | 0.909 ± 0.021 | 0.872 ± 0.017 |
| CIFAR-10 | Contextual | 0.915 ± 0.011 | 0.874 ± 0.014 |
| CIFAR-10 | MC Dropout | 0.882 ± 0.009 | 0.869 ± 0.010 |
| CIFAR-10 | Deep Ensembles | 0.935 ± 0.007 | 0.912 ± 0.006 |
| CIFAR-10 | ID-GFlowOut | 0.781 ± 0.021 | 0.843 ± 0.018 |
| CIFAR-10 | GFlowOut | 0.955 ± 0.013 | 0.924 ± 0.019 |
| CIFAR-100 | Concrete | 0.795 ± 0.017 | 0.691 ± 0.021 |
| CIFAR-100 | Contextual | 0.812 ± 0.013 | 0.728 ± 0.014 |
| CIFAR-100 | MC Dropout | 0.783 ± 0.011 | 0.715 ± 0.021 |
| CIFAR-100 | Deep Ensembles | 0.842 ± 0.008 | 0.731 ± 0.017 |
| CIFAR-100 | ID-GFlowOut | 0.819 ± 0.008 | 0.707 ± 0.012 |
| CIFAR-100 | GFlowOut | 0.839 ± 0.010 | 0.741 ± 0.012 |

Application on real-world clinical data. We explore the competence of GFlowOut as a probabilistic tool for solving real-world problems.
In intensive care units (ICUs), the ability to forecast the mortality of patients can help clinicians allocate limited resources to the individuals at the highest risk. However, to respect patient privacy, most hospitals have access to anonymized data from only a limited number of patients for training predictive models. Moreover, there are stringent regulations on the exchange of medical records among hospitals. To enable data-driven decision-making in these critical scenarios, we consider learning probabilistic classifiers for the problem of mortality prediction. We use patients' medication usage in the first 48 hrs of the ICU stay to make a binary prediction of mortality during the stay. We emphasize that the predictions are meant to help decision-makers (doctors) in making clinical decisions rather than being used directly.

Table 4. Performance of methods on cross-hospital mortality prediction on real-world clinical data from ICUs demonstrates superior performance of GFlowOut.

| Method | F1 (Macro) | Precision | Recall |
|---|---|---|---|
| Concrete | 0.528 | 0.524 | 0.641 |
| Contextual | 0.521 | 0.518 | 0.659 |
| Random | 0.49 | 0.5 | 0.499 |
| ID-GFlowOut | 0.499 | 0.506 | 0.534 |
| GFlowOut | 0.536 | 0.528 | 0.681 |

We use a 3-layer MLP trained with 4500 patients' ICU records, including 158 deaths, from one hospital, and tested with data from another hospital (4018 patients, 147 deaths). Records of both hospitals are obtained from the eICU database (Pollard et al., 2018). This task encompasses several important challenges in applying machine learning tools to real-world tasks: (1) the underlying prediction task is extremely difficult due to limited information and complex case-specific clinical details, (2) the data distribution is severely imbalanced, as deaths are rare events, (3) the training data are limited, resulting in a complex posterior, and (4) there are large distribution shifts between hospitals. Overall, this cross-hospital task setup is quite challenging.
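Because deaths are rare, plain accuracy is uninformative for this task, which is one reason to report macro-averaged metrics as in Table 4. The sketch below shows this on synthetic labels that reuse only the test-hospital class counts quoted above (4018 patients, 147 deaths). It assumes that "macro" means the unweighted mean of per-class scores, the usual convention; the source does not spell this out. The function name and the trivial "always predict survival" classifier are hypothetical, and nothing here reproduces the paper's results.

```python
def macro_metrics(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class precision, recall, and F1."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Synthetic labels with the class balance of the test hospital:
# 1 = death (147 patients), 0 = survival (the remaining 3871).
y_true = [1] * 147 + [0] * (4018 - 147)

# A trivial classifier that always predicts survival.
y_pred = [0] * len(y_true)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
metrics = macro_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, macro recall={metrics['recall']:.3f}, "
      f"macro F1={metrics['f1']:.3f}")
```

The trivial classifier is right on more than 96% of patients, yet its macro recall is exactly 0.5 and its macro F1 is below 0.5, so under macro averaging a score near 0.5 is the level a model with no discriminative ability would reach.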
Our results in Table 4 show that GFlowOut outperforms the baselines on all metrics. While the margins may appear to be small, in the context of real-world decision-making they can have a significant impact. The findings demonstrate GFlowOut's effectiveness in addressing risk-averse real-world problems.

5. Conclusion

In this work, we propose GFlowOut to learn the posterior distribution over dropout masks in a neural network. We evaluate GFlowOut on various downstream tasks such as uncertainty estimation, robustness to distribution shift, and transfer learning, using both benchmark datasets and real-world clinical datasets. Our empirical results show the favorable performance of GFlowOut over related methods like Concrete and Contextual Dropout. Future work should involve combining top-down and bottom-up dropout strategies, applying GFlowOut to larger models with complex architectures, and using it to promote exploration in RL problems, where accurate estimation of the posterior has been shown to enhance sample efficiency (Osband et al., 2013).

Author contributions

D.L., Y.B., and K.K. initiated the project. D.L., M.J., B.D., C.E., Q.S., and S.L. contributed to the implementation and experiments of the project. D.L., M.J., and A.G. designed the experimental studies. D.L., Y.B., A.G., N.M., M.J., X.J., and S.L. contributed to the conceptualization of the project. N.M., S.L., K.K., D.Z., D.L., M.J., Q.S., N.H., and Y.B. contributed to the mathematical parts of the project. D.L., A.G., and M.J. coordinated the project. Y.B. supervised the whole project and designed the whole framework. All authors contributed to the writing of the manuscript.

Acknowledgments

The authors thank CIFAR, Samsung, and IVADO for funding and NVIDIA for equipment.

References

Ba, J. and Frey, B. Adaptive dropout for training deep neural networks. Neural Information Processing Systems (NIPS), 2013.

Bengio, E., Jain, M., Korablyov, M., Precup, D., and Bengio, Y. Flow network based generative models for non-iterative diverse candidate generation. Neural Information Processing Systems (NeurIPS), 2021a.

Bengio, Y., Deleu, T., Hu, E. J., Lahlou, S., Tiwari, M., and Bengio, E. GFlowNet foundations. arXiv preprint arXiv:2111.09266, 2021b.

Bhatt, U., Antorán, J., Zhang, Y., Liao, Q. V., Sattigeri, P., Fogliato, R., Melançon, G., Krishnan, R., Stanley, J., Tickoo, O., et al. Uncertainty as a form of transparency: Measuring, communicating, and using uncertainty. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp. 401–413, 2021.

Boluki, S., Ardywibowo, R., Dadaneh, S. Z., Zhou, M., and Qian, X. Learnable Bernoulli dropout for Bayesian deep learning. Artificial Intelligence and Statistics (AISTATS), 2020.

Damianou, A. and Lawrence, N. D. Deep Gaussian processes. Artificial Intelligence and Statistics (AISTATS), 2013.

Daxberger, E., Nalisnick, E., Allingham, J. U., Antorán, J., and Hernández-Lobato, J. M. Bayesian deep learning via subnetwork inference. International Conference on Machine Learning (ICML), 2021.

Deleu, T., Góis, A., Emezue, C., Rankawat, M., Lacoste-Julien, S., Bauer, S., and Bengio, Y. Bayesian structure learning with generative flow networks. Uncertainty in Artificial Intelligence (UAI), 2022.

Fan, X., Zhang, S., Tanwisuth, K., Qian, X., and Zhou, M. Contextual dropout: An efficient sample-dependent dropout module. International Conference on Learning Representations (ICLR), 2021.

Foong, A., Burt, D., Li, Y., and Turner, R. On the expressiveness of approximate inference in Bayesian neural networks. Neural Information Processing Systems (NeurIPS), 2020.

Fort, S., Hu, H., and Lakshminarayanan, B. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning.
International Conference on Machine Learning (ICML), 2016.

Gal, Y., Hron, J., and Kendall, A. Concrete dropout. Neural Information Processing Systems (NIPS), 30, 2017.

Ghiasi, G., Lin, T.-Y., and Le, Q. V. DropBlock: A regularization method for convolutional networks. Neural Information Processing Systems (NeurIPS), 2018.

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. International Conference on Machine Learning (ICML), 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. Computer Vision and Pattern Recognition (CVPR), 2016.

Hendrycks, D. and Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. International Conference on Learning Representations (ICLR), 2019.

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Jain, M., Bengio, E., Hernandez-Garcia, A., Rector-Brooks, J., Dossou, B. F., Ekbote, C. A., Fu, J., Zhang, T., Kilgour, M., Zhang, D., Simine, L., Das, P., and Bengio, Y. Biological sequence design with GFlowNets. International Conference on Machine Learning (ICML), 2022.

Jain, M., Lahlou, S., Nekoei, H., Butoi, V., Bertin, P., Rector-Brooks, J., Korablyov, M., and Bengio, Y. DEUP: Direct epistemic uncertainty prediction. Transactions on Machine Learning Research (TMLR), 2023.

Kingma, D. P., Salimans, T., and Welling, M. Variational dropout and the local reparameterization trick. Neural Information Processing Systems (NIPS), 2015.

Kuleshov, V., Fenner, N., and Ermon, S. Accurate uncertainties for deep learning using calibrated regression. International Conference on Machine Learning (ICML), 2018.

Le Folgoc, L., Baltatzis, V., Desai, S., Devaraj, A., Ellis, S., Manzanera, O. E. M., Nair, A., Qiu, H., Schnabel, J., and Glocker, B.
Is MC dropout Bayesian? arXiv preprint arXiv:2110.04286, 2021.

Lee, H. B., Nam, T., Yang, E., and Hwang, S. J. Meta dropout: Learning to perturb latent features for generalization. International Conference on Learning Representations (ICLR), 2020.

Lotfi, S., Izmailov, P., Benton, G., Goldblum, M., and Wilson, A. G. Bayesian model selection, the marginal likelihood, and generalization. International Conference on Machine Learning (ICML), 2022.

MacKay, D. J. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.

Madan, K., Rector-Brooks, J., Korablyov, M., Bengio, E., Jain, M., Nica, A., Bosc, T., Bengio, Y., and Malkin, N. Learning GFlowNets from partial episodes for improved convergence and stability. International Conference on Machine Learning (ICML), 2023.

Malkin, N., Jain, M., Bengio, E., Sun, C., and Bengio, Y. Trajectory balance: Improved credit assignment in GFlowNets. Neural Information Processing Systems (NeurIPS), 2022.

Malkin, N., Lahlou, S., Deleu, T., Ji, X., Hu, E., Everett, K., Zhang, D., and Bengio, Y. GFlowNets and variational inference. International Conference on Learning Representations (ICLR), 2023.

Nado, Z., Band, N., Collier, M., Djolonga, J., Dusenberry, M., Farquhar, S., Filos, A., Havasi, M., Jenatton, R., Jerfel, G., Liu, J., Mariet, Z., Nixon, J., Padhy, S., Ren, J., Rudner, T., Wen, Y., Wenzel, F., Murphy, K., Sculley, D., Lakshminarayanan, B., Snoek, J., Gal, Y., and Tran, D. Uncertainty Baselines: Benchmarks for uncertainty & robustness in deep learning. arXiv preprint arXiv:2106.04015, 2021.

Neal, R. M. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.

Nguyen, A., Yosinski, J., and Clune, J. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. Computer Vision and Pattern Recognition (CVPR), 2015.

Nguyen, S., Nguyen, D., Nguyen, K., Than, K., Bui, H., and Ho, N.
Structured dropout variational inference for Bayesian neural networks. Neural Information Processing Systems (NeurIPS), 2021.

Nica, A. C., Jain, M., Bengio, E., Liu, C.-H., Korablyov, M., Bronstein, M. M., and Bengio, Y. Evaluating generalization in GFlowNets for molecule design. ICLR 2022 Machine Learning for Drug Discovery workshop, 2022.

Osband, I., Russo, D., and Van Roy, B. (More) efficient reinforcement learning via posterior sampling. Neural Information Processing Systems (NIPS), 2013.

Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J., Lakshminarayanan, B., and Snoek, J. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. Neural Information Processing Systems (NeurIPS), 2019.

Pan, L., Zhang, D., Courville, A. C., Huang, L., and Bengio, Y. Generative augmented flow networks. International Conference on Learning Representations (ICLR), 2023.

Park, S. and Kwak, N. Analysis on the dropout effect in convolutional neural networks. Asian Conference on Computer Vision, 2016.

Pham, H. and Le, Q. AutoDropout: Learning dropout patterns to regularize deep networks. Association for the Advancement of Artificial Intelligence (AAAI), 2021.

Pollard, T. J., Johnson, A. E., Raffa, J. D., Celi, L. A., Mark, R. G., and Badawi, O. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Scientific Data, 5(1):1–13, 2018.

Sensoy, M., Kaplan, L., and Kandemir, M. Evidential deep learning to quantify classification uncertainty. Neural Information Processing Systems (NeurIPS), 2018.

Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT Press, 2018.

Wilson, A. G. and Izmailov, P. Bayesian deep learning and a probabilistic perspective of generalization. Neural Information Processing Systems (NeurIPS), 2020.

Xie, J., Ma, Z., Zhang, G., Xue, J.-H., Tan, Z.-H., and Guo, J.
Soft dropout and its variational Bayes approximation. Machine Learning for Signal Processing (MLSP), 2019.

Yang, X., Tang, J., Torun, H. M., Becker, W. D., Hejase, J. A., and Swaminathan, M. Rx equalization for a high-speed channel based on Bayesian active learning using dropout. Electrical Performance of Electronic Packaging and Systems (EPEPS), 2020.

Yu, Z., Yu, J., Cui, Y., Tao, D., and Tian, Q. Deep modular co-attention networks for visual question answering. Computer Vision and Pattern Recognition (CVPR), 2019.

Zhang, D., Chen, R. T., Malkin, N., and Bengio, Y. Unifying generative models with GFlowNets. arXiv preprint arXiv:2209.02606, 2022a.

Zhang, D., Malkin, N., Liu, Z., Volokhova, A., Courville, A., and Bengio, Y. Generative flow networks for discrete probabilistic modeling. International Conference on Machine Learning (ICML), 2022b.

A. Appendix

A.0.1. HYPERPARAMETERS

Hyperparameters of the backbone ResNet and Transformer models were obtained from published baselines or architectures (He et al., 2016; Yu et al., 2019; Fan et al., 2021; Gal & Ghahramani, 2016; Gal et al., 2017). Several GFlowNet-specific hyperparameters are considered in this study, including the architecture of the variational function q(·) with its associated hyperparameters, and the temperature of q(·). For ID-GFlowOut, there is an additional hyperparameter, the prior p(z). The hyperparameters are picked via grid search using the validation set. The temperature of q(·) is set to 2. In addition, with probability 0.1, the forward policy chooses a random mask set in each layer.

A.0.2. COMPUTATIONAL EFFICIENCY

On a single RTX8000 GPU, training models with GFlowOut takes around the same time as Contextual Dropout and Concrete Dropout, and around twice the time (ResNet: 7 hrs; MCAN Transformer: 16 hrs) as a model with the same architecture and random dropout.
The three learned dropout methods have similar efficiency during inference.

A.1. Experimental details

A.1.1. SAMPLING DROPOUT MASKS

In the forward pass during inference, 20 samples are used for each data point. In the ResNet experiments, dropout masks are generated for each ResNet block. In the transformer VQA experiment, dropout is applied in each layer to both the self-attention and the feed-forward layer.

A.1.2. ROBUSTNESS TO DISTRIBUTION SHIFT

The performance of each method was obtained from 9 training runs with different random seeds. Early stopping on the validation set was used to prevent overfitting. The VQA Transformer experiments are designed according to Yu et al. (2019).

A.1.3. OOD DETECTION

For each data point, we take 20 forward passes and calculate the uncertainty of the prediction on each example using the Dempster-Shafer metric (Sensoy et al., 2018) and the algorithm from Jain et al. (2023). The uncertainty score is used to classify in-distribution vs. out-of-distribution data points, assuming the latter should have higher uncertainty.

A.1.4. ADAPTATION AFTER TRAINING ON NOISY DATA

When training the model with noisy CIFAR-10/100 data, a randomly picked 30% of the data points in the whole training set are assigned a random label. The resulting model is then fine-tuned using a small number of clean data points, all with correct labels. We conducted experiments with 1000, 2000, 4000, and 8000 data points used for fine-tuning.

A.1.5. REAL-WORLD CLINICAL DATA

The ICU dataset is a real-world dataset containing information about the deaths or survival of 126489 patients across 58 different hospitals, given a set of administered drugs. The goal of this experiment is to evaluate how well our approach generalizes in real-world settings. To imitate this, we built two sets. The first is a training set that contains data points about patients from all hospitals except the hospital with the highest number of patients (hospital ID 167).
This results in a dataset with 120945 entries, which is partitioned (70:30 ratio) into the actual training and validation sets. The second is a test set that contains information about 5544 patients. As each hospital follows a specific distribution, the test set was designed to measure the OOD performance of GFlowOut on the widest possible set of patients, reflecting a real-world scenario. We used a 3-layer MLP with the different dropout methods presented in Table 4. For the evaluation, we perform 20 forward passes and take the mean of the predictions.

[Figure 3 panels: CIFAR-10 with all deformations, with SNOW deformation, with FROST deformation, and with GAUSSIAN_NOISE deformation. x-axis: amount of deformation (1 to 5). Methods: Concrete, Contextual, ID-GFlowOut, GFlowOut.]

Figure 3. Robustness of ResNet-18 models trained with different dropout methods to different amounts of SNOW deformation, FROST deformation, or Gaussian noise in CIFAR-10 during test time.

A.2. Analysing dropout masks

Here, we analyze the behavior and dynamics of the binary masks generated by GFlowOut for data points corresponding to different labels and different augmentations. First, we want to verify that GFlowOut generates masks with probability proportional to the reward R as defined in equation (7). Our analysis shows a statistically significant correlation between the probabilities of a set of masks being generated by the GFlowNet and the corresponding rewards, with correlation 0.4 and p values 0.05. Next, we want to explore whether GFlowOut generates diverse dropout masks.
We take the mean dropout masks generated during inference for each data point and calculate Manhattan distances among different samples in the data set. The results are shown in Figure 6.

[Figure 4 panels: CIFAR-100 with all deformations, with SNOW deformation, with FROST deformation, and with GAUSSIAN_NOISE deformation. x-axis: amount of deformation (1 to 5). Methods: Concrete, Contextual, ID-GFlowOut, GFlowOut.]

Figure 4. Robustness of ResNet-18 models trained with different dropout methods to different amounts of SNOW deformation, FROST deformation, or Gaussian noise in CIFAR-100 during test time.

Figure 5. Robustness of ResNet-18 models trained with different dropout methods to different amounts of deformation, using ResNet blocks with components in a slightly different order: BatchNorm-Conv-BatchNorm-Conv.

[Figure 6 panels: CIFAR-10 with no augmentation, with snow augmentation, with frost augmentation, and with gaussian_noise augmentation (inter-sample). x-axis: amount of augmentation; y-axis: diversity (pairwise distance). Methods: Random, ID-GFlowOut, GFlowOut.]

Figure 6. Diversity of binary dropout masks among different data points, measured by Manhattan distance.
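The diversity measurement described in A.2 can be sketched as follows: average the binary masks sampled for each data point, then compute Manhattan (L1) distances between the mean masks of different data points. Summarizing the pairwise distances by their mean is an assumption made for this sketch, and the function names and toy masks are hypothetical; the source does not state how the distances are aggregated for Figure 6.

```python
from itertools import combinations


def mean_mask(mask_samples):
    """Element-wise mean of the binary masks sampled for one data point."""
    n = len(mask_samples)
    return [sum(column) / n for column in zip(*mask_samples)]


def manhattan(a, b):
    """L1 distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def mean_pairwise_distance(mean_masks):
    """Average Manhattan distance over all pairs of data points
    (an assumed summary statistic)."""
    pairs = list(combinations(mean_masks, 2))
    return sum(manhattan(a, b) for a, b in pairs) / len(pairs)


# Toy example: three data points, two sampled masks each, four units per mask.
samples = {
    "x1": [[1, 0, 1, 1], [1, 0, 1, 0]],
    "x2": [[0, 1, 1, 1], [0, 1, 1, 1]],
    "x3": [[1, 0, 1, 1], [1, 0, 1, 0]],
}

means = [mean_mask(masks) for masks in samples.values()]
diversity = mean_pairwise_distance(means)
print(f"mean inter-sample Manhattan distance: {diversity:.4f}")
```

In this toy case `x1` and `x3` receive identical mean masks (distance 0) while `x2` differs from both (distance 2.5), so the score reflects how much the masks vary across inputs. A sample-independent mask distribution would give every data point the same expected mask, so any nonzero distance would come only from sampling noise.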