Published as a conference paper at ICLR 2021

CONTEXTUAL DROPOUT: AN EFFICIENT SAMPLE-DEPENDENT DROPOUT MODULE

Xinjie Fan∗,1, Shujian Zhang∗,1, Korawat Tanwisuth1, Xiaoning Qian2, Mingyuan Zhou1
1The University of Texas at Austin, 2Texas A&M University
xfan@utexas.edu, szhang19@utexas.edu, korawat.tanwisuth@utexas.edu, xqian@ece.tamu.edu, mingyuan.zhou@mccombs.utexas.edu

ABSTRACT

Dropout has been demonstrated as a simple and effective module that not only regularizes the training of deep neural networks but also provides uncertainty estimates for prediction. However, the quality of uncertainty estimation is highly dependent on the dropout probabilities. Most current models use the same dropout distribution across all data samples for simplicity. Sample-dependent dropout, despite its potential gains in the flexibility of modeling uncertainty, is less explored, as it often encounters scalability issues or requires non-trivial model changes. In this paper, we propose contextual dropout with an efficient structural design as a simple and scalable sample-dependent dropout module, which can be applied to a wide range of models at the expense of only a slight increase in memory and computational cost. We learn the dropout probabilities with a variational objective, compatible with both Bernoulli dropout and Gaussian dropout. We apply the contextual dropout module to various models for image classification and visual question answering and demonstrate the scalability of the method on large-scale datasets, such as ImageNet and VQA 2.0. Our experimental results show that the proposed method outperforms baseline methods in terms of both accuracy and quality of uncertainty estimation.

1 INTRODUCTION

Deep neural networks (NNs) have become ubiquitous and achieved state-of-the-art results in a wide variety of research problems (LeCun et al., 2015).
To prevent over-parameterized NNs from overfitting, we often need to appropriately regularize their training. One way to do so is to use Bayesian NNs, which treat the NN weights as random variables and regularize them with appropriate prior distributions (MacKay, 1992; Neal, 2012). More importantly, we can obtain the model's confidence in its predictions by evaluating the consistency between predictions conditioned on different posterior samples of the NN weights. However, despite significant recent efforts in developing various types of approximate inference for Bayesian NNs (Graves, 2011; Welling & Teh, 2011; Li et al., 2016; Blundell et al., 2015; Louizos & Welling, 2017; Shi et al., 2018), the large number of NN weights makes it difficult to scale to real-world applications. Dropout has been demonstrated as another effective regularization strategy, which can be viewed as imposing a distribution over the NN weights (Gal & Ghahramani, 2016). Relating dropout to Bayesian inference provides a much simpler and more efficient way than vanilla Bayesian NNs to provide uncertainty estimation (Gal & Ghahramani, 2016), as there is no longer a need to explicitly instantiate multiple sets of NN weights. For example, Bernoulli dropout randomly shuts down neurons during training (Hinton et al., 2012; Srivastava et al., 2014). Gaussian dropout multiplies the neurons with independent and identically distributed (iid) Gaussian random variables drawn from $\mathcal{N}(1, \alpha)$, where the variance $\alpha$ is a tuning parameter (Srivastava et al., 2014). Variational dropout generalizes Gaussian dropout by reformulating it under a Bayesian setting and allowing $\alpha$ to be learned under a variational objective (Kingma et al., 2015; Molchanov et al., 2017).

∗Equal contribution. Corresponding to: mingyuan.zhou@mccombs.utexas.edu

However, the quality of uncertainty estimation depends heavily on the dropout probabilities (Gal et al., 2017).
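The two noise models mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the inverted-dropout rescaling is one common convention so that the expected activation matches test time.

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli_dropout(h, p, training=True):
    # Inverted Bernoulli dropout: zero each unit with probability p and
    # rescale survivors by 1/(1-p), so E[output] equals the input h.
    if not training:
        return h
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

def gaussian_dropout(h, alpha, training=True):
    # Gaussian dropout: multiply each unit by iid N(1, alpha) noise,
    # which already has mean 1, so no rescaling is needed.
    if not training:
        return h
    noise = rng.normal(1.0, np.sqrt(alpha), size=h.shape)
    return h * noise
```

In both cases the noise is multiplicative and applied elementwise, which is why dropout can be read as placing a distribution over the effective weights of the layer.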
To avoid grid search over the dropout probabilities, Gal et al. (2017) and Boluki et al. (2020) propose to automatically learn the dropout probabilities, which not only leads to a faster experiment cycle but also enables the model to have different dropout probabilities for each layer, bringing greater flexibility into uncertainty modeling. However, these methods still impose the restrictive assumption that the dropout probabilities are global parameters shared across all data samples. By contrast, we consider parameterizing the dropout probabilities as a function of the input covariates, treating them as data-dependent local variables. Applying covariate-dependent dropout allows different data samples to have different distributions over the NN weights. This generalization has the potential to greatly enhance the expressiveness of a Bayesian NN. However, learning covariate-dependent dropout rates is challenging. Ba & Frey (2013) propose standout, in which a binary belief network is laid over the original network, and develop a heuristic approximation to optimize the free energy. But, as pointed out by Gal et al. (2017), standout is not scalable, as it requires significantly increasing the model size. In this paper, we propose a simple and scalable contextual dropout module, whose dropout rates depend on the covariates x, as a new approximate Bayesian inference method for NNs. With a novel design that reuses the main network to define how the covariate-dependent dropout rates are produced, it boosts performance while only slightly increasing the memory and computational cost. Our method greatly enhances the flexibility of modeling, maintains the inherent advantages of dropout over conventional Bayesian NNs, and is simple to implement and scalable to large-scale applications. We plug the contextual dropout module into various types of NN layers, including fully connected, convolutional, and attention layers.
On a variety of supervised learning tasks, contextual dropout achieves good performance in terms of accuracy and quality of uncertainty estimation.

2 CONTEXTUAL DROPOUT

We introduce an efficient solution for data-dependent dropout: (1) treat the dropout probabilities as sample-dependent local random variables, (2) propose an efficient parameterization of the dropout probabilities by sharing parameters between the encoder and decoder, and (3) learn the dropout distribution with a variational objective.

2.1 BACKGROUND ON DROPOUT MODULES

Consider a supervised learning problem with training data $\mathcal{D} := \{x_i, y_i\}_{i=1}^N$, where we model the conditional probability $p_\theta(y_i \mid x_i)$ using a NN parameterized by $\theta$. Applying dropout to a NN often means element-wise reweighting of each layer with a data-specific Bernoulli/Gaussian distributed random mask $z_i$, drawn iid from a prior $p_\eta(z)$ parameterized by $\eta$ (Hinton et al., 2012; Srivastava et al., 2014). This implies that dropout training can be viewed as approximate Bayesian inference (Gal & Ghahramani, 2016). More specifically, one may view the learning objective of a supervised learning model with dropout as a log-marginal-likelihood: $\log \int \prod_{i=1}^N p(y_i \mid x_i, z)\,p(z)\,dz$. To maximize this often intractable log-marginal, it is common to resort to variational inference (Hoffman et al., 2013; Blei et al., 2017), which introduces a variational distribution $q(z)$ on the random mask $z$ and optimizes an evidence lower bound (ELBO):
$$\mathcal{L}(\mathcal{D}) = \mathbb{E}_{q(z)}\Big[\log \tfrac{\prod_{i=1}^N p_\theta(y_i \mid x_i, z)\,p_\eta(z)}{q(z)}\Big] = \textstyle\sum_{i=1}^N \mathbb{E}_{z_i \sim q(z)}[\log p_\theta(y_i \mid x_i, z_i)] - \mathrm{KL}(q(z)\,\|\,p_\eta(z)), \quad (1)$$
where $\mathrm{KL}(q(z)\,\|\,p_\eta(z)) = \mathbb{E}_{q(z)}[\log q(z) - \log p_\eta(z)]$ is a Kullback-Leibler (KL) divergence based regularization term. Whether the KL term is explicitly imposed is a key distinction between regular dropout (Hinton et al., 2012; Srivastava et al., 2014) and its Bayesian generalizations (Gal & Ghahramani, 2016; Gal et al., 2017; Kingma et al., 2015; Molchanov et al., 2017; Boluki et al., 2020).
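The two terms of the ELBO in (1) can be made concrete for Bernoulli masks. The sketch below (an illustration under the assumption of factorized Bernoulli distributions, not the paper's code) computes the KL regularizer in closed form and combines it with a Monte Carlo estimate of the expected log-likelihood.

```python
import numpy as np

def bernoulli_kl(q_p, prior_p, eps=1e-8):
    # Closed-form KL( Bernoulli(q_p) || Bernoulli(prior_p) ),
    # summed over the mask entries (factorized distributions assumed).
    q_p = np.clip(q_p, eps, 1 - eps)
    prior_p = np.clip(prior_p, eps, 1 - eps)
    kl = q_p * np.log(q_p / prior_p) \
        + (1 - q_p) * np.log((1 - q_p) / (1 - prior_p))
    return kl.sum()

def elbo_estimate(log_lik_samples, q_p, prior_p):
    # Monte Carlo ELBO: average log-likelihood over masks sampled
    # from q, minus the KL term that regularizes q toward the prior.
    return float(np.mean(log_lik_samples)) - bernoulli_kl(q_p, prior_p)
```

When `q_p` equals `prior_p` the KL term vanishes and the objective reduces to the expected log-likelihood alone, recovering the regular-dropout objective without an explicit KL penalty.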
2.2 COVARIATE-DEPENDENT WEIGHT UNCERTAINTY

In regular dropout, as shown in (1), while the dropout masks are data specific during optimization, their distribution is the same for all samples. This implies that while the NN weights can vary from data to data, their distribution is kept data invariant. In this paper, we propose contextual dropout, in which the distribution of the dropout masks $z_i$ depends on the covariates $x_i$ for each sample $(x_i, y_i)$. Specifically, we define the variational distribution as $q_\phi(z_i \mid x_i)$, where $\phi$ denotes its NN parameters. In the framework of amortized variational Bayes (Kingma & Welling, 2013; Rezende et al., 2014), we can view $q_\phi$ as an inference network (encoder) that approximates the posterior $p(z_i \mid y_i, x_i) \propto p(y_i \mid x_i, z_i)\,p(z_i)$. Note that, as we have no access to $y_i$ during testing, we parameterize our encoder such that it depends on $x_i$ but not $y_i$. From the optimization point of view, what we propose corresponds to the ELBO of $\log \prod_{i=1}^N \int p(y_i \mid x_i, z_i)\,p(z_i)\,dz_i$ given $q_\phi(z_i \mid x_i)$ as the encoder, which can be expressed as $\mathcal{L}(\mathcal{D}) = \sum_{i=1}^N \mathcal{L}(x_i, y_i)$, where
$$\mathcal{L}(x_i, y_i) = \mathbb{E}_{z_i \sim q_\phi(\cdot \mid x_i)}[\log p_\theta(y_i \mid x_i, z_i)] - \mathrm{KL}(q_\phi(z_i \mid x_i)\,\|\,p_\eta(z_i)). \quad (2)$$
This ELBO differs from that of regular dropout in (1) in that the dropout distribution for $z_i$ is now parameterized by $x_i$, and the single KL regularization term is replaced with the aggregation of $N$ data-dependent KL terms. Unlike in conventional Bayesian NNs, as $z_i$ is now a local random variable, the impact of the KL terms will not diminish as $N$ increases, and from the viewpoint of uncertainty quantification, contextual dropout relies only on aleatoric uncertainty to model its uncertainty on $y_i$ given $x_i$. Like conventional Bayesian NNs, we may add epistemic uncertainty by imposing a prior distribution on $\theta$ and/or $\phi$, and inferring their posterior given $\mathcal{D}$.
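The key difference from regular dropout can be sketched as follows: the keep-probabilities are computed from each sample's own features rather than shared globally. This is a toy stand-in, not the paper's parameter-sharing design; the linear encoder `W`, `b` is a hypothetical placeholder for the module that reuses the main network.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class ContextualBernoulliDropout:
    # Toy sketch of q_phi(z | x): a small linear "encoder" (hypothetical
    # parameters W, b) maps each sample's features to its own
    # keep-probabilities, so the mask distribution differs per sample.
    def __init__(self, dim, seed=0):
        g = np.random.default_rng(seed)
        self.W = g.normal(scale=0.5, size=(dim, dim))
        self.b = np.zeros(dim)
        self.rng = np.random.default_rng(seed + 1)

    def probs(self, h):
        # per-sample keep-probabilities computed from the input itself
        return sigmoid(h @ self.W + self.b)

    def __call__(self, h):
        p_keep = self.probs(h)
        mask = self.rng.random(h.shape) < p_keep
        # rescale by 1/p_keep so the expected activation is preserved
        return h * mask / np.clip(p_keep, 1e-6, None)
```

Two different inputs yield two different mask distributions, which is exactly the per-sample flexibility that global dropout probabilities cannot express.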
As contextual dropout with a point estimate on both $\theta$ and $\phi$ already achieves state-of-the-art performance, we leave that extension for future research. In what follows, we omit the data index $i$ for simplicity and formally define the model structure. Cross-layer dependence: For a NN with $L$ layers, we denote $z = \{z^1, \ldots, z^L\}$, with $z^l$ representing the dropout masks at layer $l$. As we expect $z^l$ to depend on the dropout masks in previous layers $\{z^j\}_{j<l}$