# Hierarchical VAEs Know What They Don't Know

Jakob D. Havtorn 1 2, Jes Frellsen 1, Søren Hauberg 1, Lars Maaløe 1 2

Deep generative models have been demonstrated as state-of-the-art density estimators. Yet, recent work has found that they often assign a higher likelihood to data from outside the training distribution. This seemingly paradoxical behavior has caused concerns over the quality of the attained density estimates. In the context of hierarchical variational autoencoders, we provide evidence to explain this behavior by out-of-distribution data having in-distribution low-level features. We argue that this is both expected and desirable behavior. With this insight in hand, we develop a fast, scalable and fully unsupervised likelihood-ratio score for OOD detection that requires data to be in-distribution across all feature levels. We benchmark the method on a vast set of data and model combinations and achieve state-of-the-art results on out-of-distribution detection.

1. Introduction

The reliability and safety of machine learning systems deployed in the real world are contingent on the ability to detect when an input differs from the training distribution. Supervised classifiers built as deep neural networks are well known to misclassify such out-of-distribution (OOD) inputs into known classes with high confidence (Goodfellow et al., 2015; Nguyen et al., 2015). Several approaches have been suggested to equip deep classifiers with OOD detection capabilities (Hendrycks & Gimpel, 2017; Lakshminarayanan et al., 2017; Hendrycks et al., 2019; DeVries & Taylor, 2018). However, such methods are inherently supervised and require in-distribution labels or examples of OOD data, limiting their applicability and generality.
Unsupervised generative models that estimate an explicit likelihood should understand what it means to be in- and out-of-distribution.

1 Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby, Denmark. 2 Corti AI, Copenhagen, Denmark. Correspondence to: Jakob D. Havtorn, Lars Maaløe.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Figure 1. Reconstructions using a hierarchical VAE trained on Fashion MNIST. Reconstruction quality of OOD data is comparable to that of in-distribution data, resulting in high likelihoods and poor OOD discrimination. By sampling the k bottom-most latent variables from the conditional prior distribution p(z≤l|z>l) (latent reconstructions) instead of the approximate posterior, …

$$\mathcal{L}^{>k} = \mathbb{E}_{p_\theta(z_{\le k}\mid z_{>k})\,q_\phi(z_{>k}\mid x)}\left[\log \frac{p_\theta(x\mid z)\,p_\theta(z_{>k})}{q_\phi(z_{>k}\mid x)}\right], \qquad (5)$$

where k ∈ {0, 1, …, L} (see Appendix for the derivation). We note that L>0 is the regular ELBO (1) and that, empirically, we always observe L ≥ L>k for all k, although this need not hold in general. The core idea behind this variation on the ELBO is to sample the k lowest latent variables from the conditional prior, z1, …, zk ∼ pθ(z≤k|z>k), and only the L − k highest from the approximate posterior, z(k+1), …, zL ∼ qφ(z>k|x). Importantly, this has the effect that the data likelihood p(x|z) depends on the approximate posterior through a latent variable z(k+1) different from z1 for all k ≥ 1. Thereby, the likelihood can be evaluated with a reconstruction from each of the latent variables zk of the hierarchical VAE. Hence, we can now test how well the input x is reconstructed from each latent variable. The notation L>k highlights that for the latent variables z>k, the bound is the regular ELBO, while for the latent variables z≤k, the bound is evaluated using the (conditional) prior rather than the approximate posterior as the proposal distribution.
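The sampling scheme behind the L>k bound can be sketched numerically. The following is a minimal illustration under toy assumptions, not the paper's model: a linear-Gaussian hierarchy with L = 2 stochastic layers and hypothetical, untrained random weights, where the bottom latent is drawn either from the approximate posterior (k = 0, the regular ELBO) or from the conditional prior (k = 1).

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 2  # toy data and latent dimensionality

# Toy linear-Gaussian hierarchy with L = 2 stochastic layers:
#   p(z2) = N(0, I),  p(z1|z2) = N(A z2, I),  p(x|z1) = N(B z1, I)
#   q(z2|x) = N(C x, I),  q(z1|x) = N(D x, I)   (untrained random weights)
A = rng.normal(size=(m, m)); B = rng.normal(size=(d, m))
C = rng.normal(size=(m, d)); D = rng.normal(size=(m, d))

def log_normal(v, mean):
    # Log-density of a unit-variance Gaussian, summed over the last axis.
    return -0.5 * np.sum((v - mean) ** 2 + np.log(2 * np.pi), axis=-1)

def L_gt_k(x, k, S=2000):
    """Monte Carlo estimate of L>k for k in {0, 1}; k = 0 is the ELBO."""
    z2 = C @ x + rng.normal(size=(S, m))              # z2 ~ q(z2|x)
    if k == 0:
        z1 = D @ x + rng.normal(size=(S, m))          # z1 ~ q(z1|x)
        log_w = (log_normal(x, z1 @ B.T) + log_normal(z1, z2 @ A.T)
                 + log_normal(z2, 0.0)
                 - log_normal(z1, D @ x) - log_normal(z2, C @ x))
    else:
        z1 = z2 @ A.T + rng.normal(size=(S, m))       # z1 ~ p(z1|z2): prior term
        log_w = (log_normal(x, z1 @ B.T) + log_normal(z2, 0.0)
                 - log_normal(z2, C @ x))              # p(z1|z2) cancels with proposal
    return float(np.mean(log_w))

x = rng.normal(size=d)
elbo, l_gt_1 = L_gt_k(x, k=0), L_gt_k(x, k=1)  # both are lower bounds on log p(x)
```

Note how in the k = 1 branch the conditional prior term cancels against the proposal, leaving only p(x|z), p(z>k) and q(z>k|x), exactly the three factors in the bound above.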
4.2. A likelihood-ratio score for all feature levels

While the L>k bound provides a score for performing semantic OOD detection, it still relies on the data-space likelihood function (see equation (7) below), which is known to be problematic for OOD detection (Section 3.3). To alleviate this, we phrase OOD detection as a likelihood-ratio test of being semantically in-distribution. A standard likelihood-ratio test (Buse, 1982) suggests considering the ratio between the associated likelihoods, which we can approximate on a log scale by the corresponding lower bounds L and L>k,

$$\mathrm{LLR}^{>k}(x) = \mathcal{L}(x) - \mathcal{L}^{>k}(x). \qquad (6)$$

Since, empirically, L ≥ L>k, the ratio is always positive, as is standard for likelihood-ratio tests. A low value of LLR>k(x) means that the ELBO and L>k are almost equally tight for the data. On the contrary, a high value indicates that L>k is looser on the data than the ELBO; hence, the data may be OOD. We can gather further insight about this score if we write the regular ELBO and the L>k bound in the exact form that includes the intractable KL-divergence between the approximate and true posteriors,

$$\mathcal{L} = \log p_\theta(x) - D_{\mathrm{KL}}\left(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)\right), \qquad (7)$$
$$\mathcal{L}^{>k} = \log p_\theta(x) - D_{\mathrm{KL}}\left(p_\theta(z_{\le k}\mid z_{>k})\,q_\phi(z_{>k}\mid x)\,\|\,p_\theta(z\mid x)\right).$$

Subtracting these cancels the two data likelihood terms log pθ(x), and only the KL-divergences from the approximate to the true posterior remain,

$$\mathrm{LLR}^{>k}(x) = -D_{\mathrm{KL}}\left(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)\right) + D_{\mathrm{KL}}\left(p_\theta(z_{\le k}\mid z_{>k})\,q_\phi(z_{>k}\mid x)\,\|\,p_\theta(z\mid x)\right). \qquad (8)$$

Hence, it is clear that compared to the likelihood bound L>k, this likelihood ratio measures divergence exclusively in the latent space, whereas L>k includes the log pθ(x) term similar to the ELBO. Therefore, the LLR>k score should be an improved method for semantic OOD detection compared to L>k.
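Computationally, the score in (6) is just the difference of two per-example bound estimates. A small sketch with hypothetical bound values (not results from the paper):

```python
import numpy as np

def llr_score(elbo, l_gt_k):
    """LLR>k(x) = L(x) - L>k(x); a larger value means L>k is looser,
    suggesting the input may be OOD."""
    return np.asarray(elbo) - np.asarray(l_gt_k)

# Hypothetical per-example bound estimates in nats (empirically L >= L>k):
elbo   = np.array([-92.1, -95.3, -90.8])
l_gt_k = np.array([-93.0, -99.9, -91.0])
scores = llr_score(elbo, l_gt_k)  # differences: [0.9, 4.6, 0.2]
```

Here the second example would be the strongest OOD candidate: its ELBO and L>k disagree the most.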
Now, it can be noted that if we replace the regular ELBO, L, in (7) with the strictly tighter importance-weighted bound (Burda et al., 2016),

$$\mathcal{L}_S = \mathbb{E}_{z^{(1)},\dots,z^{(S)} \sim q_\phi(z\mid x)}\left[\log \frac{1}{S}\sum_{s=1}^{S} \frac{p_\theta(x, z^{(s)})}{q_\phi(z^{(s)}\mid x)}\right], \qquad (9)$$

then, in the limit S → ∞, we have L_S → log pθ(x) and the likelihood ratio reduces to

$$\mathrm{LLR}^{>k}_S(x) \to D_{\mathrm{KL}}\left(p_\theta(z_{\le k}\mid z_{>k})\,q_\phi(z_{>k}\mid x)\,\|\,p_\theta(z\mid x)\right), \qquad (10)$$

which, in practice, is well approximated for finite S. We expect this importance-weighted likelihood ratio to monotonically improve upon the one in (8) as S increases, since the KL-divergence term stemming from the regular ELBO goes to zero.

Since the scores in (8) and (10) are estimated by sampling, their estimators are stochastic with nonzero variance. We note that

$$\mathrm{Var}(\widehat{\mathrm{LLR}}^{>k}) = \mathrm{Var}(\hat{\mathcal{L}}) + \mathrm{Var}(\hat{\mathcal{L}}^{>k}) - 2\,\mathrm{Cov}(\hat{\mathcal{L}}, \hat{\mathcal{L}}^{>k}).$$

Since log pθ(x) and part of the KL divergence are identical in the expressions for L and L>k, we expect the covariance to be positive, which reduces the total variance. Empirical results indeed show that the variance of the LLR>k estimator is larger than that of the ELBO estimator but smaller than that of the L>k estimator. Nevertheless, the variance of the estimators is guaranteed to go to zero as the number of samples is increased.

The OOD scores considered in this research all assume that what discriminates an out-of-distribution from an in-distribution data point are semantic, high-level features. Clearly, if this is not the case and the difference instead lies in low-level statistics, the scores would likely fail. We hypothesize that a complementary bound to (5), L<k, which instead samples the top-most latent variables from the prior, would cover this case. We report the LLR>k score with one and with S importance samples, the latter denoted LLR>k_S.

Selection of k: To determine whether an example is OOD in practice, the value of LLR>k is computed on the in-distribution test set for all k, and the resulting empirical distribution is used as reference. If, for any value of k, the LLR>k score of a new input differs significantly from this empirical distribution, the input is regarded as OOD. If it differs for multiple values of k, the value of k for which it differs the most is selected.
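The per-example "Selection of k" rule can be sketched as follows; the quantile threshold, the reference scores, and all names here are hypothetical stand-ins for the empirical distribution on the in-distribution test set, not the paper's exact decision rule:

```python
import numpy as np

def ood_flag(ref_scores_by_k, new_scores_by_k, q=0.99):
    """Flag an input as OOD if, for any k, its LLR>k score exceeds the
    q-quantile of the in-distribution reference scores (a one-sided,
    illustrative significance rule)."""
    return any(new_scores_by_k[k] > np.quantile(ref, q)
               for k, ref in ref_scores_by_k.items())

rng = np.random.default_rng(1)
# Hypothetical reference LLR>k scores computed on an in-distribution test set:
ref = {1: rng.normal(0.5, 0.1, 5000), 2: rng.normal(1.0, 0.2, 5000)}
in_example  = {1: 0.52, 2: 0.95}   # close to the reference for every k
ood_example = {1: 0.55, 2: 2.50}   # far out in the tail for k = 2
```

A single k in the tail suffices to flag the input, which matches the requirement that data be in-distribution across all feature levels.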
In our experiments, we consider an entire dataset at a time and report the results of LLR>k with the value of k that yielded the highest AUROC for that dataset in a threshold-free manner. In practice, slightly better performance may be achieved by choosing k per example. This would not exclude the use of batching in our method, since LLR>k is computed after the forward pass.

The likelihoods for our trained models are given in Table 1 alongside baseline results for in-distribution and OOD data. The main results of the paper on the OOD tasks can be seen, along with comparisons to the baseline methods, in Table 2. We note that for all our results, the value of the score (L>k and LLR>k) for the training and test splits of the in-distribution data was observed to have the same empirical distribution to within sampling error, hence yielding an AUROC score of 0.5, as expected. Results on additional commonly used datasets are found in Appendix G.

6.1. Likelihood-based OOD detection

We first report the results of the different variations of the L>k bound for OOD detection. We reconfirm the results of Nalisnick et al. (2019a) by observing that our hierarchical latent variable models also assign a higher L>0 to the OOD dataset in the Fashion MNIST/MNIST and CIFAR10/SVHN cases, resulting in an AUROC inferior to random (Table 2).

² Serrà et al. (2020) performs best when high likelihoods are assigned to OOD data such that the overlap with in-distribution data is low. Performance is worse when the overlap is high, cf. Serrà et al. (2020, Table 1), as seen with complex images.

| Method | Dataset | log p(x) | L>1 | L>2 | L>3 |
|---|---|---|---|---|---|
| **Trained on Fashion MNIST** | | | | | |
| Glow | Fashion MNIST | 2.96 | – | – | – |
| | MNIST | 1.83 | – | – | – |
| HVAE (ours) | Fashion MNIST | 0.420 | 0.476 | 0.579 | – |
| | MNIST | 0.317 | 0.601 | 0.881 | – |
| **Trained on CIFAR10** | | | | | |
| Glow | CIFAR10 | 3.46 | – | – | – |
| | SVHN | 2.39 | – | – | – |
| HVAE (ours) | CIFAR10 | 3.74 | 17.8 | 54.3 | 75.7 |
| | SVHN | 2.62 | 10.2 | 64.0 | 93.9 |
| BIVA (ours) | CIFAR10 | 3.46 | 8.74 | 19.7 | 37.3 |
| | SVHN | 2.35 | 6.62 | 25.1 | 59.0 |

Table 1.
Average bits per dimension of different datasets for models trained on Fashion MNIST and CIFAR10. For the hierarchical models we include the L>k bounds. The likelihoods of the training and test splits of the in-distribution data are in all cases close. Since we train on dynamically binarized Fashion MNIST, our bits/dim are smaller than for Glow. As k is increased for the L>k bound, the bound gets looser, but the model eventually assigns higher likelihood to the in-distribution data than to the OOD data. Glow refers to Kingma & Dhariwal (2018); Nalisnick et al. (2019a). BIVA refers to our implementation of Maaløe et al. (2019).

Switching the in-distribution data for the OOD data in both cases results in correctly detecting the OOD data; an asymmetry also reported by Nalisnick et al. (2019a). Figure 5a shows the density of L>0 in bits per dimension (Theis et al., 2016) for the model trained on Fashion MNIST when evaluated on the Fashion MNIST and MNIST test sets. We observe a high degree of overlap, with less separation of the OOD data compared to similar results for autoregressive and flow-based models, like Xiao et al. (2020). We then evaluate the looser L>k (5) for k ∈ {1, …, L}. Figure 5b shows the result for L>2, which yielded the highest AUROC, only slightly better than random. Like Maaløe et al. (2019), we see that increasing the value of k generally leads to improved OOD detection. However, we also observe that the two empirical distributions never cease to overlap. Importantly, depending on the OOD dataset, the amount of remaining overlap can be high, which limits the discriminatory power of the likelihood-based L>k bound. This is in line with the pathological behavior of the raw likelihood of latent variable models when used for OOD detection (Xiao et al., 2020). Since a high degree of overlap also seems present in Maaløe et al. (2019), and we see the same problem for our BIVA model trained on CIFAR10, we do not expect this to be due to the less expressive HVAE.
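The bits-per-dimension numbers above follow the standard conversion (Theis et al., 2016): divide the negative log-likelihood in nats by D·ln 2. A minimal sketch; the example log-likelihood value is hypothetical:

```python
import math

def bits_per_dim(log_likelihood_nats, num_dims):
    """Convert a log-likelihood in nats to average bits per dimension:
    bpd = -log p(x) / (D * ln 2)."""
    return -log_likelihood_nats / (num_dims * math.log(2))

# A 28x28 image (D = 784) with a hypothetical log p(x) of -228 nats:
bpd = bits_per_dim(-228.0, 28 * 28)  # roughly 0.42 bits/dim
```

Lower is better on this scale, and the division by D makes scores comparable across image resolutions.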
6.2. Likelihood-ratio-based OOD detection

We now move to the likelihood-ratio-based score. We find that LLR>k separates the OOD MNIST data from in-distribution Fashion MNIST to a higher degree than the likelihood estimates, as can be seen from the empirical densities of the score in Figure 5c.

Figure 5. Empirical densities of Fashion MNIST (in-distribution) and MNIST (OOD) using the raw likelihood (a), the L>2 bound (b) and the LLR>1 score (c). All densities are computed using the HVAE model. For the regular likelihood, MNIST is very clearly more likely on average than the Fashion MNIST test data, while with the L>2 bound separation is better but significant overlap remains. The LLR>1 score provides a high degree of separation. Likelihoods are reported in units of the natural log of the number of bits per dimension.

Figure 6. ROC curves with AUROC scores for detecting MNIST as OOD with the HVAE model trained on Fashion MNIST. A ROC curve is plotted for each of the L>k bounds, including the ELBO, along with one for the best-performing log likelihood-ratio LLR>1.

Figure 7. ROC curves with AUROC scores for detecting SVHN as OOD with the BIVA model trained on CIFAR10. A ROC curve is plotted for each of the L>k bounds, including the ELBO, along with one for the best-performing log likelihood-ratio LLR>2.

We note that the likelihood ratio between the ELBO and the L>k bound provides the highest degree of separation of MNIST and Fashion MNIST, as measured by the AUROC, for a value k = 1 smaller than L. This is not surprising, since the value of k that provides the maximal separation to the reference in-distribution dataset need not be the one for which LLR>k is overall maximal for the OOD dataset. We also visualize the ROC curves resulting from using the LLR>k score for OOD detection on both Fashion MNIST/MNIST and CIFAR10/SVHN and compare them to the ROC curves resulting from the different L>k bounds in Figures 6 and 7, respectively.
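The threshold-free AUROC used to rank the L>k and LLR>k scores can be computed directly as a rank statistic: the probability that a randomly chosen OOD example scores above a randomly chosen in-distribution example. A small sketch with hypothetical scores (not the paper's numbers):

```python
import numpy as np

def auroc(scores_in, scores_out):
    """AUROC as the Mann-Whitney statistic: P(score_out > score_in),
    counting ties as one half."""
    s_in = np.asarray(scores_in, dtype=float)
    s_out = np.asarray(scores_out, dtype=float)
    greater = (s_out[:, None] > s_in[None, :]).mean()  # all OOD/in-dist pairs
    ties = (s_out[:, None] == s_in[None, :]).mean()
    return greater + 0.5 * ties

# Per-k scores as (in-distribution, OOD); pick the k with the highest AUROC.
scores_by_k = {1: ([0.1, 0.2], [0.8, 0.9]), 2: ([0.1, 0.9], [0.2, 0.8])}
best_k = max(scores_by_k, key=lambda k: auroc(*scores_by_k[k]))  # k = 1 here
```

This pairwise formulation avoids choosing any decision threshold, matching the per-dataset selection of k described above.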
On both datasets we see significantly better discriminatory performance when using the LLR>k score. Table 2 shows that BIVA improves upon the HVAE model for OOD detection on CIFAR10, while Table 1 shows that the BIVA model also improves upon the HVAE in terms of likelihood. We hypothesize that models larger than our implementation of BIVA, with better likelihood scores, may perform even better (Maaløe et al., 2019; Vahdat & Kautz, 2020; Child, 2021).

6.3. Comparison to baselines

Performance: Table 2 summarizes our results compared to baselines on the commonly used AUROC, AUPRC and FPR80 metrics. Our method outperforms other generative-model-based methods such as WAIC (Choi et al., 2019) with a Glow model and performs similarly to the likelihood regret method of Xiao et al. (2020). Furthermore, our method performs similarly to the background contrastive likelihood ratio method of Ren et al. (2019) on Fashion MNIST/MNIST, but contrary to the failure of that method on CIFAR10/SVHN reported by Xiao et al. (2020), our method performs very well on this task too. Our approach outperforms all supervised approaches that use in-distribution labels or synthetic examples of OOD data derived from the in-distribution data, including ODIN (Liang et al., 2018) and the predictive distribution of a classifier p(ŷ|x) trained and evaluated in various ways (see Ren et al. (2019)).

**Fashion MNIST (in) / MNIST (out)**

| Method | AUROC | AUPRC | FPR80 |
|---|---|---|---|
| *Use prior knowledge of OOD* | | | |
| Backgr. contrast. LR (PixelCNN) [1] | 0.994 | 0.993 | 0.001 |
| Backgr. contrast. LR (VAE) [7] | 0.924 | – | – |
| Binary classifier [1] | 0.455 | 0.505 | 0.886 |
| p(ŷ\|x) with OOD as noise class [1] | 0.877 | 0.871 | 0.195 |
| p(ŷ\|x) with calibration on OOD [1] | 0.904 | 0.895 | 0.139 |
| Input complexity (S, Glow) [10] | 0.998 | – | – |
| Input complexity (S, PixelCNN++) [10] | 0.967 | – | – |
| *Use in-distribution data labels y* | | | |
| p(ŷ\|x) [1, 2] | 0.734 | 0.702 | 0.506 |
| Entropy of p(y\|x) [1] | 0.746 | 0.726 | 0.448 |
| ODIN [1, 3] | 0.752 | 0.763 | 0.432 |
| VIB [4, 7] | 0.941 | – | – |
| Mahalanobis distance, CNN [1] | 0.942 | 0.928 | 0.088 |
| Mahalanobis distance, DenseNet [5] | 0.986 | – | – |
| Ensemble, 20 classifiers [1, 6] | 0.857 | 0.849 | 0.240 |
| *No OOD-specific assumptions: ensembles* | | | |
| WAIC, 5 models, VAE [7] | 0.766 | – | – |
| WAIC, 5 models, PixelCNN [1] | 0.221 | 0.401 | 0.911 |
| *No OOD-specific assumptions: not ensembles* | | | |
| Likelihood regret [8] | **0.988** | – | – |
| L>0 + HVAE (ours) | 0.268 | 0.363 | 0.882 |
| L>1 + HVAE (ours) | 0.593 | 0.591 | 0.658 |
| L>2 + HVAE (ours) | 0.712 | 0.750 | 0.548 |
| LLR>1 + HVAE (ours) | 0.964 | 0.961 | 0.036 |
| LLR>1_250 + HVAE (ours) | 0.984 | **0.984** | **0.013** |

**CIFAR10 (in) / SVHN (out)**

| Method | AUROC | AUPRC | FPR80 |
|---|---|---|---|
| *Use prior knowledge of OOD* | | | |
| Backgr. contrast. LR (PixelCNN) [1] | 0.930 | 0.881 | 0.066 |
| Backgr. contrast. LR (VAE) [8] | 0.265 | – | – |
| Outlier exposure [9] | 0.984 | – | – |
| Input complexity (S, Glow) [10] | 0.950 | – | – |
| Input complexity (S, PixelCNN++) [10] | 0.929 | – | – |
| Input complexity (S, HVAE) (ours) [10] | 0.833 | 0.855 | 0.344 |
| *Use in-distribution data labels y* | | | |
| Mahalanobis distance [5] | 0.991 | – | – |
| *No OOD-specific assumptions: ensembles* | | | |
| WAIC, 5 models, Glow [7] | **1.000** | – | – |
| WAIC, 5 models, PixelCNN [1] | 0.628 | 0.616 | 0.657 |
| *No OOD-specific assumptions: not ensembles* | | | |
| Likelihood regret [8] | 0.875 | – | – |
| LLR>2 + HVAE (ours) | 0.811 | 0.837 | 0.394 |
| LLR>2 + BIVA (ours) | 0.891 | **0.875** | **0.172** |

Table 2. AUROC, AUPRC and FPR80 for OOD detection using scores on the respective in-distribution test sets as reference. We bold the best results within the "No OOD-specific assumptions" group since we only compare directly to those. HVAE (ours) refers to our hierarchical bottom-up VAE. BIVA (ours) refers to our implementation of the hierarchical BIVA model (Maaløe et al., 2019). [1] is Ren et al. (2019), [2] is Hendrycks & Gimpel (2017), [3] is Liang et al. (2018), [4] is Alemi et al. (2018), [5] is Lee et al. (2018), [6] is Lakshminarayanan et al. (2017), [7] is Choi et al. (2019), [8] is Xiao et al. (2020), [9] is Hendrycks et al. (2019), [10] is Serrà et al. (2020).

Runtime: For a full evaluation of a single example across all feature levels of a model with L stochastic layers, our method requires L − 1 forward passes through the inference and generative networks, as well as computing the likelihood ratio; the forward passes dominate. For a typical forward pass that is linear in the input dimensionality, D, and the number of stochastic layers, L, this amounts to O(DL) computation. Compared to related work that either requires an M > 1 sized batch of inputs of which either all or none are OOD (Nalisnick et al., 2019b) or cannot be applied to batches due to the required per-example optimization (Xiao et al., 2020), our method is additionally applicable to batches of any size that may mix OOD and in-distribution examples, which provides drastic speed-ups via vectorization and parallelization. Furthermore, the method of Xiao et al. (2020) requires refitting the inference network of a VAE, which can be computationally demanding. Compared to the likelihood ratio proposed in Ren et al. (2019), our method requires training only a single model on a single dataset.

7. Discussion

Deep generative models are state-of-the-art density estimators, but the OOD failures reported in recent years have raised concerns about the limitations of such density estimates.
Recent work on improving OOD detection has largely sidestepped this concern by relying on additional assumptions that strictly should not be needed for models with explicit likelihoods. While the engineering challenge of building reliable OOD detection schemes is important, it is of more fundamental importance to understand why the naive likelihood test fails. We have provided evidence that low-level features dominate the likelihood, which supplies a cause for the why. The fact that a simple score measuring the importance of semantic features yields state-of-the-art results on OOD detection, without access to additional information, lends validity to our hypothesis.

The findings from, amongst others, Nalisnick et al. (2019a); Serrà et al. (2020) have a clear relation to information theory and compression. Semantically complex in-distribution data yields models with diverse low-level feature sets that enable generalization across datasets. Simpler datasets can only yield models with less diverse low-level feature sets compared to complex training data. Hence, there can be an asymmetry where the likelihoods of simple OOD data can be high for a model trained on complex data, but not the other way around. Loosely put, the minimal number of bits required to losslessly compress data sampled from some distribution is the entropy of the generating process (Shannon, 1948; MacKay, 2003). Townsend et al. (2019) recently showed that VAEs can be used for lossless compression at rates superior to more generic algorithms.

We also note that since the hierarchical VAE is a probabilistic graphical latent variable model, it lends itself very naturally to manipulation at the feature level (Kingma et al., 2014; Maaløe et al., 2016; 2017). This property sets it apart from other generative models that do not explicitly define such a hierarchy of features.
This in turn enables reliable OOD detection with our methodology, while making no explicit assumptions about the nature of OOD data and using only a single model. This has not been achieved with autoregressive or flow-based models.

8. Conclusion

In this paper we study unsupervised out-of-distribution detection using hierarchical variational autoencoders. We provide evidence that highly generalizable low-level features contribute greatly to estimated likelihoods, resulting in poor OOD detection performance. We proceed to develop a likelihood-ratio-based score for OOD detection and define it to explicitly ensure that data must be in-distribution across all feature levels to be regarded as in-distribution. This ratio is mathematically shown to perform OOD detection in the latent space of the model, removing the reliance on the troublesome input-space likelihood. We point out that, contrary to much recent literature on OOD detection, our approach is fully unsupervised and does not make assumptions about the nature of OOD data. Finally, we demonstrate state-of-the-art performance on a wide range of OOD failure cases.

Acknowledgements

This research was partially funded by the Innovation Fund Denmark via the Industrial PhD Programme (grant no. 0153-00167B). JF and SH were funded in part by the Novo Nordisk Foundation (grant no. NNF20OC0062606) via the Center for Basic Machine Learning Research in Life Science (MLLS, https://www.mlls.dk). JF was further funded by the Novo Nordisk Foundation (grant no. NNF20OC0065611) and the Independent Research Fund Denmark (grant no. 9131-00082B). SH was further funded by VILLUM FONDEN (15334) and the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 757360).

References

Alemi, A. A., Fischer, I., and Dillon, J. V. Uncertainty in the Variational Information Bottleneck. July 2018. URL http://arxiv.org/abs/1807.00906. arXiv: 1807.00906.

Bengio, Y., Courville, A.
C., and Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013. doi: 10.1109/TPAMI.2013.50. URL https://doi.org/10.1109/TPAMI.2013.50.

Bishop, C. M. Novelty Detection and Neural-Network Validation. IEE Proceedings - Vision, Image and Signal Processing, 141(4):217–222, 1994. doi: 10.1049/ip-vis:19941330.

Burda, Y., Grosse, R., and Salakhutdinov, R. R. Importance Weighted Autoencoders. In Proceedings of the 4th International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2016. URL https://arxiv.org/abs/1509.00519.

Buse, A. The likelihood ratio, Wald, and Lagrange multiplier tests: An expository note. The American Statistician, 36(3a):153–157, 1982.

Child, R. Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images. In Proceedings of the 9th International Conference on Learning Representations (ICLR), 2021. URL https://arxiv.org/pdf/2011.10650.pdf.

Choi, H., Jang, E., and Alemi, A. A. WAIC, but Why? Generative Ensembles for Robust Anomaly Detection. May 2019. URL http://arxiv.org/abs/1810.01392. arXiv: 1810.01392.

Cremer, C., Li, X., and Duvenaud, D. Inference Suboptimality in Variational Autoencoders. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pp. 1078–1086, Stockholmsmässan, Stockholm, Sweden, July 2018. PMLR. URL http://proceedings.mlr.press/v80/cremer18a.html.

DeVries, T. and Taylor, G. W. Learning Confidence for Out-of-Distribution Detection in Neural Networks. February 2018. URL http://arxiv.org/abs/1802.04865. arXiv: 1802.04865.

Dieng, A. B., Kim, Y., Rush, A. M., and Blei, D. M. Avoiding latent variable collapse with generative skip models. In Chaudhuri, K. and Sugiyama, M.
(eds.), Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), volume 89, pp. 2397–2405, Naha, Okinawa, Japan, 2019. PMLR. URL http://proceedings.mlr.press/v89/dieng19a.html.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In Bengio, Y. and LeCun, Y. (eds.), Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015. URL http://arxiv.org/abs/1412.6572.

Hendrycks, D. and Gimpel, K. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 2017. URL http://arxiv.org/abs/1610.02136.

Hendrycks, D., Mazeika, M., and Dietterich, T. G. Deep anomaly detection with outlier exposure. In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 2019. URL https://openreview.net/forum?id=HyxCxhRcY7.

Ho, J., Chen, X., Srinivas, A., Duan, Y., and Abbeel, P. Flow++: Improving Flow-Based Generative Models with Variational Dequantization and Architecture Design. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 2019. URL http://proceedings.mlr.press/v97/ho19a/ho19a.pdf.

Kingma, D. P. and Dhariwal, P. Glow: Generative Flow with Invertible 1×1 Convolutions. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS), Montréal, Canada, 2018.

Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), Banff, AB, Canada, 2014. URL http://arxiv.org/abs/1312.6114. arXiv: 1312.6114.

Kingma, D. P., Rezende, D. J., Mohamed, S., and Welling, M. Semi-Supervised Learning with Deep Generative Models.
In Proceedings of the 28th International Conference on Neural Information Processing Systems (NeurIPS), Montréal, Quebec, Canada, June 2014. URL http://arxiv.org/abs/1406.5298. arXiv: 1406.5298.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NeurIPS), NIPS'16, pp. 4743–4751, Barcelona, Spain, 2016. ISBN 978-1-5108-3881-9. URL http://arxiv.org/abs/1606.04934.

Kipf, T. N. and Welling, M. Variational Graph Auto-Encoders. November 2016. URL http://arxiv.org/abs/1611.07308. arXiv: 1611.07308.

Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. Thesis, University of Toronto, 2009.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 2017. URL http://arxiv.org/abs/1612.01474.

LeCun, Y. A., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2323, 1998.

Lee, K., Lee, K., Lee, H., and Shin, J. A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS), Montréal, Quebec, Canada, 2018. URL https://papers.nips.cc/paper/2018/file/abdeb6f575ac5c6676b747bca8d09cc2-Paper.pdf.

Liang, S., Li, Y., and Srikant, R. Enhancing the reliability of out-of-distribution image detection in neural networks. In Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, Canada, 2018. URL https://openreview.net/forum?id=H1VGkIxRZ.

Maaløe, L., Sønderby, C. K., Sønderby, S.
K., and Winther, O. Auxiliary deep generative models. In Balcan, M. F. and Weinberger, K. Q. (eds.), Proceedings of the 33rd International Conference on Machine Learning (ICML), volume 48 of Proceedings of Machine Learning Research, pp. 1445–1453, New York, New York, USA, June 2016. PMLR. URL http://proceedings.mlr.press/v48/maaloe16.html.

Maaløe, L., Fraccaro, M., and Winther, O. Semi-Supervised Generation with Cluster-aware Generative Models. April 2017. URL http://arxiv.org/abs/1704.00637. arXiv: 1704.00637.

Maaløe, L., Fraccaro, M., Liévin, V., and Winther, O. BIVA: A Very Deep Hierarchy of Latent Variables for Generative Modeling. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS), pp. 6548–6558, Vancouver, Canada, February 2019. URL http://arxiv.org/abs/1902.02102.

MacKay, D. J. C. Information theory, inference, and learning algorithms. Cambridge University Press, 1st edition, 2003. ISBN 978-0-521-64298-9.

Mattei, P.-A. and Frellsen, J. Refit your encoder when new data comes by. In 3rd NeurIPS Workshop on Bayesian Deep Learning, 2018.

Nalisnick, E., Matsukawa, A., Teh, Y. W., Gorur, D., and Lakshminarayanan, B. Do Deep Generative Models Know What They Don't Know? In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 2019a. URL http://arxiv.org/abs/1810.09136. arXiv: 1810.09136.

Nalisnick, E., Matsukawa, A., Teh, Y. W., and Lakshminarayanan, B. Detecting Out-of-Distribution Inputs to Deep Generative Models Using Typicality. 2019b. URL https://arxiv.org/abs/1906.02994. arXiv: 1906.02994.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading Digits in Natural Images with Unsupervised Feature Learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011. URL http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf.
Nguyen, A., Yosinski, J., and Clune, J. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 427–436, 2015. ISBN 978-1-4673-6964-0. doi: 10.1109/CVPR.2015.7298640.

Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. In Proceedings of the 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, September 2016a. URL http://arxiv.org/abs/1609.03499.

Oord, A. v. d., Kalchbrenner, N., and Kavukcuoglu, K. Pixel Recurrent Neural Networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, August 2016b. URL http://arxiv.org/abs/1601.06759.

Paszke, A., Chanan, G., Lin, Z., Gross, S., Yang, E., Antiga, L., and DeVito, Z. Automatic differentiation in PyTorch. In Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS), 2017. URL https://pytorch.org/.

Ren, J., Liu, P. J., Fertig, E., Snoek, J., Poplin, R., Depristo, M., Dillon, J., and Lakshminarayanan, B. Likelihood Ratios for Out-of-Distribution Detection. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2019. URL https://papers.nips.cc/paper/2019/file/1e79596878b2320cac26dd792a6c51c9-Paper.pdf.

Rezende, D. J. and Mohamed, S. Variational Inference with Normalizing Flows. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 2015. URL http://arxiv.org/abs/1505.05770.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In Proceedings of Machine Learning Research, volume 32, pp. 1278–1286, Beijing, China, January 2014. PMLR. URL http://proceedings.mlr.press/v32/rezende14.pdf.

Salimans, T. and Kingma, D. P. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. In Proceedings of the 30th Conference on Neural Information Processing Systems (NeurIPS), Barcelona, Spain, February 2016. URL http://arxiv.org/abs/1602.07868.

Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, April 2017. URL http://arxiv.org/abs/1701.05517.

Serrà, J., Álvarez, D., Gómez, V., Slizovskaia, O., Núñez, J. F., and Luque, J. Input complexity and out-of-distribution detection with likelihood-based generative models. In Proceedings of the 8th International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 2020. URL https://openreview.net/forum?id=SyxIWpVYvr.

Shannon, C. E. A Mathematical Theory of Communication. The Bell System Technical Journal, 27:379–423, 1948.

Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. Ladder Variational Autoencoders. In Proceedings of the 29th Conference on Neural Information Processing Systems (NeurIPS), Barcelona, Spain, December 2016. URL http://arxiv.org/abs/1602.02282.

Theis, L., Oord, A. v. d., and Bethge, M. A note on the evaluation of generative models. In Proceedings of the 4th International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, May 2016. URL http://arxiv.org/abs/1511.01844.

Townsend, J., Bird, T., and Barber, D. Practical Lossless Compression With Latent Variables Using Bits Back Coding. In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 2019.

Vahdat, A. and Kautz, J.
NVAE: A Deep Hierarchical Variational Autoencoder. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS), Virtual, 2020. URL http://arxiv.org/abs/2007.03898.

van den Oord, A., Kalchbrenner, N., Espeholt, L., Kavukcuoglu, K., Vinyals, O., and Graves, A. Conditional image generation with PixelCNN decoders. In Proceedings of the 29th International Conference on Neural Information Processing Systems (NeurIPS), pp. 4790–4798, Barcelona, Spain, 2016. URL https://proceedings.neurips.cc/paper/2016/hash/b1301141feffabac455e1f90a7de2054-Abstract.html.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. 2017. URL https://arxiv.org/abs/1708.07747. arXiv: 1708.07747 [cs.LG].

Xiao, Z., Yan, Q., and Amit, Y. Likelihood Regret: An Out-of-Distribution Detection Score for Variational Auto-Encoder. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS), Virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/eddea82ad2755b24c4e168c5fc2ebd40-Abstract.html.