Minimal Achievable Sufficient Statistic Learning

Milan Cvitkovic 1   Günther Koliander 2

1 Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, California, USA. 2 Acoustics Research Institute, Austrian Academy of Sciences, Vienna, Austria. Correspondence to: Milan Cvitkovic.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

We introduce Minimal Achievable Sufficient Statistic (MASS) Learning, a machine learning training objective for which the minima are minimal sufficient statistics with respect to a class of functions being optimized over (e.g., deep networks). In deriving MASS Learning, we also introduce Conserved Differential Information (CDI), an information-theoretic quantity that, unlike standard mutual information, can be usefully applied to deterministically-dependent continuous random variables like the input and output of a deep network. In a series of experiments, we show that deep networks trained with MASS Learning achieve competitive performance on supervised learning, regularization, and uncertainty quantification benchmarks.

1. Introduction

The representation learning approach to machine learning focuses on finding a representation Z of an input random variable X that is useful for predicting a random variable Y (Goodfellow et al., 2016).

What makes a representation Z useful is much debated, but a common assertion is that Z should be a minimal sufficient statistic of X for Y (Adragni & Cook, 2009; Shamir et al., 2010; James et al., 2017; Achille & Soatto, 2018b). That is:

1. Z should be a statistic of X. This means Z = f(X) for some function f.
2. Z should be sufficient for Y. This means p(X | Z, Y) = p(X | Z).
3. Given that Z is a sufficient statistic, it should be minimal with respect to X. This means that for any measurable, non-invertible function g, g(Z) is no longer sufficient for Y.¹

¹Although this is not the most common phrasing of statistical minimality, we feel it is more understandable. For the equivalence of this phrasing and the standard definition, see Supplementary Material 7.1.

In other words: a minimal sufficient statistic is a random variable Z that tells you everything about Y you could ever care about, but if you do any irreversible processing to Z, you are guaranteed to lose some information about Y.

Minimal sufficient statistics have a long history in the field of statistics (Lehmann & Scheffé, 1950; Dynkin, 1951). But the minimality condition (3, above) is perhaps too strong to be useful in machine learning, since it is a statement about any measurable function g, rather than about functions in a practical hypothesis class like the class of deep neural networks. Instead, in this work we consider minimal achievable sufficient statistics: sufficient statistics that are minimal within some particular set of functions.

Definition 1 (Minimal Achievable Sufficient Statistic). Let f(X) be a sufficient statistic of X for Y. f(X) is minimal achievable with respect to a set of functions F if f ∈ F and for any Lipschitz continuous, non-invertible function g, g(f(X)) is no longer sufficient for Y.

In this work, we give a characterization of minimal achievable sufficient statistics that is applicable to deep neural networks and show that it can be used to train models with competitive performance on classification accuracy, uncertainty quantification, and out-of-distribution input detection.
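As a concrete, textbook-style illustration of these conditions (our addition, not an example from the paper), consider binary classification with Gaussian class-conditional densities sharing a covariance matrix:

```latex
% Illustrative example (ours): X | Y = y ~ N(mu_y, Sigma) for y in {0, 1}.
% The log-likelihood ratio is a scalar statistic of X,
\[
  Z = f(X) = \log\frac{p(X \mid Y=1)}{p(X \mid Y=0)}
           = (\mu_1 - \mu_0)^\top \Sigma^{-1} X + \mathrm{const},
\]
% and it is sufficient for Y: the class-dependent factor of p(x \mid y)
% depends on x only through (\mu_1 - \mu_0)^\top \Sigma^{-1} x, so
% p(X \mid Z, Y) = p(X \mid Z).  It is also minimal in the classical sense:
% every sufficient statistic determines the likelihood ratio, so Z is a
% function of any other sufficient statistic.  Intuitively, further
% non-invertible processing of Z merges inputs with different posteriors
% p(Y \mid X) and therefore loses information about Y.
```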
Contributions:

- We introduce Conserved Differential Information (CDI), an information-theoretic quantity that, unlike mutual information, is meaningful for deterministically-dependent continuous random variables, such as the input and output of a deep network.
- We introduce Minimal Achievable Sufficient Statistic Learning (MASS Learning), a training objective based on CDI for finding minimal achievable sufficient statistics.
- We provide empirical evidence that models trained by MASS Learning achieve competitive performance on supervised learning, regularization, and uncertainty quantification benchmarks.

2. Conserved Differential Information

Before we present MASS Learning, we need to introduce Conserved Differential Information (CDI), on which MASS Learning is based.

CDI is an information-theoretic quantity that addresses an oft-cited issue in machine learning (Bell & Sejnowski, 1995; Amjad & Geiger, 2018; Saxe et al., 2018; Nash et al., 2018; Goldfeld et al., 2018): for a continuous random variable X and a continuous, non-constant function f, the mutual information I(X, f(X)) is infinite. (See Supplementary Material 7.2 for details.) This makes I(X, f(X)) unsuitable for use in a learning objective when f is, for example, a standard deep network.

The infinitude of I(X, f(X)) has been circumvented in prior works by two strategies. One is to discretize X and f(X) (Tishby & Zaslavsky, 2015; Shwartz-Ziv & Tishby, 2017), though this is controversial (Saxe et al., 2018). Another is to use a random variable Z with distribution p(Z | X) as the representation of X, rather than using f(X) itself as the representation (Alemi et al., 2016; Kolchinsky et al., 2017; Achille & Soatto, 2018b). In this latter approach, p(Z | X) is usually implemented by adding noise to a deep network that takes X as input.

These are both reasonable strategies for avoiding the infinitude of I(X, f(X)). But another approach is to derive a new information-theoretic quantity that is better suited to this situation. To that end we present Conserved Differential Information:

Definition 2. For a continuous random variable X taking values in $\mathbb{R}^d$ and a Lipschitz continuous function f, the Conserved Differential Information (CDI) is
$$C(X, f(X)) := H(f(X)) - \mathbb{E}_X[\log J_f(X)], \quad (1)$$
where H denotes the differential entropy $H(Z) = -\int p(z) \log p(z)\, dz$ and $J_f$ is the Jacobian determinant of f,
$$J_f(x) = \sqrt{\det\!\big(Df(x)\, Df(x)^\top\big)},$$
with $Df(x) = \partial f(x) / \partial x^\top$ the Jacobian matrix of f at x.

Readers familiar with normalizing flows (Rezende & Mohamed, 2015) or Real NVP (Dinh et al., 2016) will note that the Jacobian determinant used in those methods is a special case of the Jacobian determinant in the definition of CDI. This is because normalizing flows and Real NVP are based on the change-of-variables formula for invertible mappings, while CDI is based in part on the more general change-of-variables formula for non-invertible mappings. More details on this connection are given in Supplementary Material 7.3. The mathematical motivation for CDI, based on the recent work of Koliander et al. (2016), is provided in Supplementary Material 7.4. Figure 1 gives a visual example of what CDI measures about a function.

[Figure 1. CDI of two functions f and g of the random variable X: the invertible map $f(X) = \tfrac{1}{2}X$ has $C(X, f(X)) = 0$, while the non-invertible map $g(X) = |X - \tfrac{1}{2}|$ has $C(X, g(X)) = -\log 2$. Even though the random variables f(X) and g(X) have the same distribution, C(X, f(X)) is different from C(X, g(X)). This is because f is an invertible function, while g is not.]
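As a concrete illustration of the $J_f$ term in Definition 2, the quantity $\log J_f(x) = \tfrac{1}{2}\log\det\!\big(Df(x)\,Df(x)^\top\big)$ can be computed with automatic differentiation. The following is a minimal sketch of our own (not the authors' released code); the toy network and input shapes are illustrative assumptions:

```python
import torch
from torch.autograd.functional import jacobian

def log_jacobian_det(f, x):
    """log J_f(x) = 0.5 * log det(Df(x) Df(x)^T), where Df(x) is the
    Jacobian matrix of f at a single (unbatched) input x (Definition 2)."""
    Df = jacobian(f, x)              # shape: (output_dim, input_dim)
    gram = Df @ Df.T                 # Df(x) Df(x)^T, positive semi-definite
    _, logabsdet = torch.slogdet(gram)
    return 0.5 * logabsdet

# Toy non-invertible map from R^4 to R^2 (illustrative only).
net = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ELU(), torch.nn.Linear(8, 2))
x = torch.randn(4)

# Monte-Carlo estimate of E_X[log J_f(X)] over a batch of inputs.
batch = torch.randn(64, 4)
estimate = torch.stack([log_jacobian_det(net, xi) for xi in batch]).mean()
print(float(log_jacobian_det(net, x)), float(estimate))
```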
The conserved differential information C(X, f(X)) between deterministically-dependent random variables behaves much like mutual information does on discrete random variables. For example, when f is invertible, $C(X, f(X)) = H(X)$, just as with the mutual information between discrete random variables. Most importantly for our purposes, though, C(X, f(X)) obeys the following data processing inequality:

Theorem 1 (CDI Data Processing Inequality). For Lipschitz continuous functions f and g with the same output space,
$$C(X, f(X)) \geq C(X, g(f(X))),$$
with equality if and only if g is invertible almost everywhere.

The proof is in Supplementary Material 7.5.

3. MASS Learning

With CDI and its data processing inequality in hand, we can give the following optimization-based characterization of minimal achievable sufficient statistics:

Theorem 2. Let X be a continuous random variable, Y be a discrete random variable, and F be any set of Lipschitz continuous functions with a common output space (e.g., different parameter settings of a deep network). If
$$f \in \operatorname*{arg\,min}_{S \in F} C(X, S(X)) \quad \text{s.t.} \quad I(S(X), Y) = \max_{S'} I(S'(X), Y),$$
then f(X) is a minimal achievable sufficient statistic of X for Y with respect to F.

Proof. First note the following lemma (Cover & Thomas, 2006):

Lemma 1. Z = f(X) is a sufficient statistic for a discrete random variable Y if and only if $I(Z, Y) = \max_{S'} I(S'(X), Y)$.

Lemma 1 guarantees that any f satisfying the conditions in Theorem 2 is sufficient. If such an f were not minimal achievable, there would exist a non-invertible, Lipschitz continuous g such that g(f(X)) was sufficient, and by Theorem 1, $C(X, g(f(X))) < C(X, f(X))$, contradicting the assumption that f minimizes C(X, S(X)).

We can turn Theorem 2 into a learning objective over functions f by relaxing the strict constraint into a Lagrangian formulation with Lagrange multiplier $1/\beta$, $\beta > 0$:
$$C(X, f(X)) - \frac{1}{\beta}\, I(f(X), Y).$$
The larger the value of β, the more our objective will encourage minimality over sufficiency. We can then simplify this formulation using the identity $I(f(X), Y) = H(Y) - H(Y \mid f(X))$: multiplying through by β and dropping the constant H(Y) gives the following optimization objective:
$$\mathcal{L}_{\mathrm{MASS}}(f) := H(Y \mid f(X)) + \beta H(f(X)) - \beta\, \mathbb{E}_X[\log J_f(X)]. \quad (2)$$
We refer to minimizing this objective as MASS Learning.

In practice, we are interested in using MASS Learning to train a deep network $f_\theta$ with parameters θ, using a finite dataset $\{(x_i, y_i)\}_{i=1}^N$ of N datapoints sampled from the joint distribution p(x, y) of X and Y. To do this, we introduce a parameterized variational approximation $q_\phi(f_\theta(x) \mid y) \approx p(f_\theta(x) \mid y)$. Using $q_\phi$, we minimize the following empirical upper bound to $\mathcal{L}_{\mathrm{MASS}}$:
$$\widehat{\mathcal{L}}_{\mathrm{MASS}}(\theta, \phi) := \frac{1}{N} \sum_{i=1}^{N} \Big( -\log q_\phi(y_i \mid f_\theta(x_i)) - \beta \log q_\phi(f_\theta(x_i)) - \beta \log J_{f_\theta}(x_i) \Big) \geq \mathcal{L}_{\mathrm{MASS}},$$
where the quantity $q_\phi(f_\theta(x_i))$ is computed as $\sum_y q_\phi(f_\theta(x_i) \mid y)\, p(y)$ and the quantity $q_\phi(y_i \mid f_\theta(x_i))$ is computed with Bayes' rule as
$$q_\phi(y_i \mid f_\theta(x_i)) = \frac{q_\phi(f_\theta(x_i) \mid y_i)\, p(y_i)}{\sum_y q_\phi(f_\theta(x_i) \mid y)\, p(y)}.$$

When Y is discrete and takes on finitely many values, as in classification problems, and when we choose a variational distribution $q_\phi$ that is differentiable with respect to φ (e.g., a multivariate Gaussian), we can minimize $\widehat{\mathcal{L}}_{\mathrm{MASS}}(\theta, \phi)$ using stochastic gradient descent.

To perform classification using our trained network, we use the learned variational distribution $q_\phi$ and Bayes' rule:
$$p(Y \mid X) \approx p(Y \mid f_\theta(X)) \approx \frac{q_\phi(f_\theta(X) \mid Y)\, p(Y)}{\sum_y q_\phi(f_\theta(X) \mid y)\, p(y)}.$$
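The following is a minimal PyTorch sketch of how $\widehat{\mathcal{L}}_{\mathrm{MASS}}$ and the Bayes-rule classifier could be assembled from the pieces above. It is our illustration, not the authors' released implementation; `net`, `class_dists`, and `log_prior` are assumed placeholders:

```python
import torch
from torch.autograd.functional import jacobian

def mass_loss(net, class_dists, log_prior, x, y, beta):
    """Sketch of the empirical MASS objective
    (1/N) * sum_i [ -log q(y_i|f(x_i)) - beta*log q(f(x_i)) - beta*log J_f(x_i) ].
    `class_dists` is a list with one torch.distributions object per class,
    playing the role of q_phi(.|y); `log_prior` is log p(y), shape [num_classes]."""
    z = net(x)                                                  # f_theta(x_i), shape (N, r)
    log_q_z_given_y = torch.stack([d.log_prob(z) for d in class_dists], dim=1)
    log_joint = log_q_z_given_y + log_prior                     # log q(z|y') + log p(y')
    log_q_z = torch.logsumexp(log_joint, dim=1)                 # log q_phi(f(x_i))
    log_q_y_given_z = log_joint.gather(1, y[:, None]).squeeze(1) - log_q_z  # Bayes' rule

    def log_jac(xi):                                            # log J_f(x_i), one input at a time
        # create_graph=True so the Jacobian term is differentiable w.r.t. net's parameters;
        # assumes net accepts unbatched inputs.
        Df = jacobian(net, xi, create_graph=True)
        return 0.5 * torch.slogdet(Df @ Df.T)[1]

    log_jacs = torch.stack([log_jac(xi) for xi in x])
    return (-log_q_y_given_z - beta * log_q_z - beta * log_jacs).mean()

def predict_proba(net, class_dists, log_prior, x):
    """Classification via Bayes' rule: p(y|x) ~ q_phi(f(x)|y) p(y) / sum_y' q_phi(f(x)|y') p(y')."""
    z = net(x)
    log_joint = torch.stack([d.log_prob(z) for d in class_dists], dim=1) + log_prior
    return torch.softmax(log_joint, dim=1)
```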
4. Related Work

4.1. Connection to the Information Bottleneck

The well-studied Information Bottleneck learning method (Tishby et al., 2000; Tishby & Zaslavsky, 2015; Strouse & Schwab, 2016; Alemi et al., 2016; Saxe et al., 2018; Amjad & Geiger, 2018; Goldfeld et al., 2018; Kolchinsky et al., 2018; Achille & Soatto, 2018b;a) is based on minimizing the Information Bottleneck Lagrangian
$$\mathcal{L}_{\mathrm{IB}}(Z) := \beta I(X, Z) - I(Y, Z), \quad \beta > 0,$$
where Z is the representation whose conditional distribution p(Z | X) we are trying to learn.

The $\mathcal{L}_{\mathrm{IB}}$ learning objective can be motivated on purely information-theoretic grounds. But some works, such as Shamir et al. (2010), also point out the connection between the $\mathcal{L}_{\mathrm{IB}}$ objective and minimal sufficient statistics, which is based on the following theorem:

Theorem 3. Let X be a discrete random variable drawn according to a distribution p(X | Y) determined by the discrete random variable Y. Let F be the set of deterministic functions of X to any target space. Then f(X) is a minimal sufficient statistic of X for Y if and only if
$$f \in \operatorname*{arg\,min}_{S \in F} I(X, S(X)) \quad \text{s.t.} \quad I(S(X), Y) = \max_{S' \in F} I(S'(X), Y).$$
The $\mathcal{L}_{\mathrm{IB}}$ objective can then be thought of as a Lagrangian relaxation of the optimization problem in this theorem.

Theorem 3 only holds for discrete random variables. For continuous X it holds only in the reverse direction, so minimizing $\mathcal{L}_{\mathrm{IB}}$ for continuous X has no formal connection to finding minimal sufficient statistics, let alone minimal achievable sufficient statistics. See Supplementary Material 7.6 for details. Nevertheless, the optimization problems in Theorem 2 and Theorem 3 are extremely similar, relying as they both do on Lemma 1 for their proofs. And the idea of relaxing the optimization problem in Theorem 2 into a Lagrangian formulation to obtain $\mathcal{L}_{\mathrm{MASS}}$ is directly inspired by the Information Bottleneck. So while MASS Learning and Information Bottleneck learning entail different network architectures and loss functions, there is an Information Bottleneck flavor to MASS Learning.

4.2. Jacobian Regularization

The presence of the $J_f$ term in $\widehat{\mathcal{L}}_{\mathrm{MASS}}$ is reminiscent of the contractive autoencoder (Rifai et al., 2011) and the Jacobian regularization literature (Sokolic et al., 2017; Ross & Doshi-Velez, 2017; Varga et al., 2017; Novak et al., 2018; Jakubovitz & Giryes, 2018). Both of these literatures suggest that minimizing $\mathbb{E}_X[\lVert Df(X) \rVert_F]$, where $Df(x) = \partial f(x)/\partial x^\top \in \mathbb{R}^{r \times d}$ is the Jacobian matrix, improves generalization and/or adversarial robustness. This may seem paradoxical at first, since by applying the AM-GM inequality to the eigenvalues of $Df(x)Df(x)^\top$ we have
$$\mathbb{E}_X\!\big[\lVert Df(X) \rVert_F^{2r}\big] = \mathbb{E}_X\!\big[\operatorname{Tr}\!\big(Df(X)Df(X)^\top\big)^r\big] \geq \mathbb{E}_X\!\big[r^r \det\!\big(Df(X)Df(X)^\top\big)\big] = \mathbb{E}_X\!\big[r^r J_f(X)^2\big],$$
and, by Jensen's inequality,
$$\log \mathbb{E}_X\!\big[r^r J_f(X)^2\big] \geq 2\, \mathbb{E}_X[\log J_f(X)] + r \log r,$$
while $\mathbb{E}_X[\log J_f(X)]$ is being maximized by $\widehat{\mathcal{L}}_{\mathrm{MASS}}$. So $\widehat{\mathcal{L}}_{\mathrm{MASS}}$ would seem to be optimizing for worse generalization according to the Jacobian regularization literature. However, the conditional entropy term in $\widehat{\mathcal{L}}_{\mathrm{MASS}}$ strongly encourages minimizing $\mathbb{E}_X[\lVert Df(X) \rVert_F]$. So overall $\widehat{\mathcal{L}}_{\mathrm{MASS}}$ seems to be seeking the right balance of sensitivity of the network to its inputs (dependent on the value of β), which is precisely in alignment with what the Jacobian regularization literature suggests.
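For concreteness, the Frobenius-norm penalty discussed above can be estimated with the same autograd machinery as the earlier $\log J_f$ sketch. This is again an illustrative sketch of ours, not code from the paper; the function name and batching convention are assumptions:

```python
import torch
from torch.autograd.functional import jacobian

def jacobian_frobenius_penalty(f, xs):
    """Monte-Carlo estimate of E_X[||Df(X)||_F], the quantity the Jacobian
    regularization literature minimizes, over a batch of unbatched inputs xs.
    Contrast with E_X[log J_f(X)], which the MASS objective maximizes."""
    norms = [torch.linalg.matrix_norm(jacobian(f, x)) for x in xs]  # Frobenius norm by default
    return torch.stack(norms).mean()
```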
5. Experiments

Code to reproduce all experiments is available online.² Full details on all experiments are in Supplementary Material 7.7.

²https://github.com/mwcvitkovic/MASS-Learning

In this section we compare MASS Learning to other approaches for training deep networks. We use the abbreviation "Softmax CE" to refer to the standard approach of training deep networks for classification problems by minimizing the softmax cross-entropy loss
$$\widehat{\mathcal{L}}_{\mathrm{SoftmaxCE}}(\theta) := -\frac{1}{N} \sum_{i=1}^{N} \log \operatorname{softmax}(f_\theta(x_i))_{y_i},$$
where $\operatorname{softmax}(f_\theta(x_i))_{y_i}$ is the $y_i$th element of the softmax function applied to the outputs $f_\theta(x_i)$ of the network's last linear layer. As usual, $\operatorname{softmax}(f_\theta(x_i))_{y_i}$ is taken to be the network's estimate of $p(y_i \mid x_i)$. We also compare against the Variational Information Bottleneck method for representation learning (Alemi et al., 2016), which we abbreviate as "VIB".

We use two networks in our experiments. "SmallMLP" is a feedforward network with two fully-connected layers of 400 and 200 hidden units, respectively, both with elu nonlinearities (Clevert et al., 2015). "ResNet20" is the 20-layer residual network of He et al. (2015).

In all our experiments, the variational distribution $q_\phi(x \mid y)$ for each possible output class y is a mixture of multivariate Gaussian distributions for which we learn the mixture weights, means, and covariance matrices.

Computing the $J_f$ term in $\widehat{\mathcal{L}}_{\mathrm{MASS}}(\theta, \phi)$ for every sample in a minibatch is too expensive to be practical. Doing so would require on the order of |Y| times more operations than computing $\widehat{\mathcal{L}}_{\mathrm{SoftmaxCE}}(\theta)$, since computing the $J_f$ term (in our implementation) requires computing the full Jacobian matrix of the network. Thus, to make training tractable, we use a subsampling strategy: we estimate the $J_f$ term using only a 1/|Y| fraction of the datapoints in a minibatch. In practice, we do not notice any performance detriment from subsampling, and the numerical value of the $J_f$ term during training with subsampling is indistinguishable from training without subsampling.

Subsampling the $J_f$ term results in a significant performance improvement, but it must nevertheless be emphasized that even with the subsampling strategy, our implementation of MASS Learning is roughly twice as computationally costly as Softmax CE training (unless β = 0, in which case the cost is the same as Softmax CE). This is by far the most significant drawback of our implementation of MASS Learning. There are many easier-to-compute upper bounds or estimates of $J_f$ that one could use to make MASS Learning faster, but we do not explore them in this work.

We performed all experiments on the CIFAR-10 dataset (Krizhevsky, 2009) and implemented all our models in PyTorch (Paszke et al., 2017).
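As an illustration of the subsampling strategy described above, the $J_f$ term can be estimated on a random 1/|Y| fraction of each minibatch. This is a sketch under our own assumptions (unbatched Jacobian computation, hypothetical function name), not the paper's implementation:

```python
import torch
from torch.autograd.functional import jacobian

def subsampled_log_jacobian_term(net, x, num_classes):
    """Estimate E_X[log J_f(X)] on a random 1/|Y| fraction of the minibatch only."""
    n = max(1, x.shape[0] // num_classes)              # 1/|Y| of the minibatch
    idx = torch.randperm(x.shape[0])[:n]               # random subset of datapoints
    logs = []
    for xi in x[idx]:
        Df = jacobian(net, xi, create_graph=True)      # Jacobian of one (unbatched) input
        logs.append(0.5 * torch.slogdet(Df @ Df.T)[1]) # log J_f(x_i)
    return torch.stack(logs).mean()                    # plugged into the beta * log J_f term
```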
5.1. Classification Accuracy and Regularization

We first confirm that networks trained by MASS Learning can make accurate predictions in supervised learning tasks. We also compare the classification accuracy of networks trained on varying amounts of data to see whether MASS Learning successfully regularizes networks and improves their generalization performance.

Classification accuracies for the SmallMLP network are shown in Table 1, and for the ResNet20 network in Table 2. For the SmallMLP network, MASS Learning does not appear to offer any performance benefit. For the larger ResNet20 network, the results show that while MASS Learning maintains or improves accuracy compared to Softmax CE training, often fairly significantly, these improvements do not seem to be due to the MASS loss $\widehat{\mathcal{L}}_{\mathrm{MASS}}(\theta, \phi)$ itself, since the same improvements are obtained even when the $H(f(X))$ and $\mathbb{E}_X[\log J_f(X)]$ terms in the MASS loss are set to 0 (i.e., the case β = 0). This suggests that it is the use of the variational distribution $q_\phi(x \mid y)$ to produce the output of the network, rather than the MASS Learning approach, that provides the benefit. This is an interesting finding, but it does not suggest an advantage to using the full MASS Learning method if one is concerned with accuracy or regularization.

5.2. Uncertainty Quantification

We also evaluate the ability of networks trained by MASS Learning to properly quantify uncertainty about their predictions. We assess uncertainty quantification in two ways: using proper scoring rules (Lakshminarayanan et al., 2016), which are scalar measures of how well a network's predictive distribution is calibrated, and by observing performance on an out-of-distribution (OOD) detection task.

Tables 3 and 4 show the uncertainty quantification performance of networks according to three proper scoring rules: the negative log-likelihood (NLL), the Brier score, and the entropy of the predictive distribution $p(y \mid f_\theta(x))$. With the SmallMLP network, Softmax CE and VIB training perform best, while with the ResNet20 network the results are more varied. In general, though, any benefits produced by MASS Learning seem to derive not from the MASS objective but from the network architecture, since MASS Learning with β = 0 gives performance comparable to MASS Learning with β ≠ 0.

Table 5 shows scalar metrics for performance on an OOD detection task in which the network is asked to identify whether an image is from its training distribution (CIFAR-10 images) or from another distribution (SVHN images (Netzer et al., 2011)).

Table 1. Test-set classification accuracy (percent) on the CIFAR-10 dataset using the SmallMLP network trained by various methods. Full experiment details are in Supplementary Material 7.7. Values are the mean classification accuracy over 4 training runs with different random seeds, plus or minus the standard deviation. Emboldened accuracies in the original table are those within one standard deviation of the maximum observed mean accuracy in the column. WD is weight decay; D is dropout.

| Method | Training set size: 2,500 | 10,000 | 40,000 |
|---|---|---|---|
| Softmax CE | 33.9 ± 0.5 | 44.5 ± 0.3 | 52.4 ± 1.1 |
| Softmax CE, WD | 26.2 ± 0.9 | 36.5 ± 0.8 | 47.8 ± 0.6 |
| Softmax CE, D | 33.0 ± 1.1 | 43.9 ± 0.6 | 54.2 ± 0.5 |
| VIB, β=1e-1 | 32.3 ± 0.4 | 40.6 ± 0.6 | 46.4 ± 0.6 |
| VIB, β=1e-2 | 34.2 ± 0.4 | 44.1 ± 0.5 | 51.6 ± 0.4 |
| VIB, β=1e-3 | 35.1 ± 0.7 | 44.2 ± 0.6 | 51.7 ± 0.7 |
| VIB, β=1e-1, D | 28.9 ± 0.9 | 39.9 ± 0.5 | 49.8 ± 0.1 |
| VIB, β=1e-2, D | 32.9 ± 1.2 | 43.7 ± 0.8 | 53.9 ± 0.4 |
| VIB, β=1e-3, D | 34.1 ± 1.0 | 44.3 ± 0.5 | 54.5 ± 0.3 |
| MASS, β=1e-2 | 30.3 ± 0.4 | 39.9 ± 1.1 | 45.4 ± 1.4 |
| MASS, β=1e-3 | 32.6 ± 0.6 | 40.9 ± 0.6 | 47.0 ± 0.8 |
| MASS, β=1e-4 | 33.4 ± 0.6 | 40.7 ± 0.4 | 47.1 ± 1.1 |
| MASS, β=0 | 34.0 ± 0.5 | 40.8 ± 1.0 | 47.0 ± 0.6 |
| MASS, β=1e-2, D | 29.6 ± 1.2 | 42.2 ± 0.5 | 51.9 ± 0.5 |
| MASS, β=1e-3, D | 31.8 ± 1.3 | 43.4 ± 0.4 | 53.0 ± 0.5 |
| MASS, β=1e-4, D | 31.9 ± 0.8 | 43.2 ± 0.2 | 52.9 ± 0.6 |
| MASS, β=0, D | 32.1 ± 1.3 | 43.4 ± 0.4 | 52.7 ± 0.4 |

Table 2. Test-set classification accuracy (percent) on the CIFAR-10 dataset using the ResNet20 network trained by various methods. No data augmentation or learning-rate scheduling was used; full details are in Supplementary Material 7.7. Values are the mean classification accuracy over 4 training runs with different random seeds, plus or minus the standard deviation. Emboldened accuracies in the original table are those within one standard deviation of the maximum observed mean accuracy in the column.
| Method | Training set size: 2,500 | 10,000 | 40,000 |
|---|---|---|---|
| Softmax CE | 37.4 ± 0.7 | 52.0 ± 1.1 | 67.8 ± 2.7 |
| VIB, β=1e-3 | 33.5 ± 0.9 | 49.1 ± 1.5 | 66.0 ± 0.6 |
| VIB, β=1e-4 | 34.0 ± 1.0 | 50.3 ± 1.6 | 67.1 ± 0.6 |
| VIB, β=1e-5 | 34.7 ± 0.6 | 50.2 ± 1.6 | 67.8 ± 0.6 |
| VIB, β=0 | 35.3 ± 0.7 | 50.0 ± 1.7 | 68.0 ± 0.1 |
| MASS, β=1e-3 | 38.5 ± 0.9 | 52.0 ± 1.0 | 67.1 ± 0.5 |
| MASS, β=1e-4 | 39.1 ± 0.3 | 52.7 ± 0.7 | 68.9 ± 1.1 |
| MASS, β=1e-5 | 39.0 ± 1.0 | 52.5 ± 1.1 | 69.5 ± 0.6 |
| MASS, β=0 | 39.7 ± 0.5 | 52.9 ± 0.4 | 69.0 ± 0.8 |

Following Hendrycks & Gimpel (2016) and Alemi et al. (2018), the metrics we report are the area under the ROC curve (AUROC) and the average precision score (APR). APR depends on whether the network is tasked with predicting whether an image is in-distribution or out-of-distribution; we report both variants as APR In and APR Out, respectively. The "Entropy" detection method uses the entropy of the network's learned predictive distribution $p(y \mid f_\theta(x))$ as the OOD detection score. The "$\max_i q_\phi(f_\theta(x) \mid y_i)$" detection method uses the maximum density value over the potential output classes $y_i$ as the OOD detection score. (For the Softmax CE-trained networks, $q_\phi(f_\theta(x) \mid y_i)$ was estimated by MLE of a mixture of 10 full-covariance, 10-dimensional multivariate Gaussians on the training set.) And for the VIB networks, the "Rate" detection method uses the KL divergence between the VIB's marginal distribution and the representation as the OOD detection score.

Here we see MASS Learning outperforming Softmax CE and VIB, but again with the caveat that the benefits appear to be due to the variational distribution in the network architecture rather than to the MASS loss function.
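As a note on how such metrics are typically computed (an illustrative scikit-learn sketch of our own, not the paper's evaluation code), AUROC, APR In, and APR Out can all be obtained from a scalar OOD score:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def ood_metrics(scores_in, scores_out):
    """scores_*: higher score = more likely out-of-distribution
    (e.g., predictive entropy, or negative max_i q(f(x)|y_i))."""
    scores = np.concatenate([scores_in, scores_out])
    is_out = np.concatenate([np.zeros_like(scores_in), np.ones_like(scores_out)])
    auroc = roc_auc_score(is_out, scores)
    apr_out = average_precision_score(is_out, scores)      # OOD as the positive class
    apr_in = average_precision_score(1 - is_out, -scores)  # in-distribution as the positive class
    return auroc, apr_in, apr_out
```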
5.3. Does MASS Learning finally solve the mystery of why stochastic gradient descent with the cross-entropy loss works so well in deep learning?

We do not believe so. MASS Learning and Softmax CE training seem to produce fairly different representations during training. Figure 2 shows how the values of the three terms in $\widehat{\mathcal{L}}_{\mathrm{MASS}}(\theta, \phi)$ change as the SmallMLP network trains on the CIFAR-10 dataset using either the usual Softmax CE training or MASS training. Despite achieving similar accuracy, the Softmax CE training method does not seem to be implicitly performing MASS Learning, based on the differing values of the entropy (orange) and Jacobian (green) terms between the two methods as training progresses.

6. Conclusion

MASS Learning is a new approach to representation learning based on the goal of finding minimal achievable sufficient statistics. We have shown that networks trained by MASS Learning perform well on classification tasks and on regularization and uncertainty quantification benchmarks, despite not being directly formulated for any of these tasks.

There remain many open questions about MASS Learning. Of primary interest is further investigation into the properties of the representations learned by MASS Learning and how they differ from those learned in standard deep learning. There is also much to learn about how best to minimize the MASS loss. In this paper we used optimizer settings tuned for standard softmax cross-entropy learning, but $\widehat{\mathcal{L}}_{\mathrm{MASS}}(\theta, \phi)$ is such a different optimization objective that there are likely many potential improvements to be made in how we train the networks. We also plan to explore more expressive variational distributions $q_\phi$. Finally, in terms of efficiency: although MASS Learning is applicable in principle to any deep learning architecture, there is currently a significant computational cost in computing the $J_f$ term in the MASS loss function. Finding non-invertible network architectures that admit more efficiently computable Jacobians, as is done in methods like normalizing flows (Rezende & Mohamed, 2015) or Real NVP (Dinh et al., 2016), would greatly increase the utility of MASS Learning.

Acknowledgements

We would like to thank Georg Pichler, Thomas Vidick, Alex Alemi, Alessandro Achille, and Joseph Marino for useful discussions.

Table 3. Uncertainty quantification metrics (proper scoring rules) on CIFAR-10 using the SmallMLP network trained on 40,000 datapoints. Values are the mean over 4 training runs with different random seeds, plus or minus the standard deviation. Emboldened values in the original table are those within one standard deviation of the minimum observed mean value in the column. Lower values are better.

| Method | Test Accuracy | NLL | Brier Score | Entropy |
|---|---|---|---|---|
| Softmax CE | 52.4 ± 1.1 | 4.19 ± 0.15 | 0.0835 ± 0.0018 | 0.230 ± 0.003 |
| Softmax CE, WD | 47.8 ± 0.6 | 1.47 ± 0.02 | 0.0662 ± 0.0006 | 1.511 ± 0.019 |
| Softmax CE, D | 54.2 ± 0.5 | 1.56 ± 0.01 | 0.0642 ± 0.0006 | 0.739 ± 0.007 |
| VIB, β=1e-1 | 46.4 ± 0.6 | 4.78 ± 0.13 | 0.0919 ± 0.0009 | 0.296 ± 0.008 |
| VIB, β=1e-2 | 51.6 ± 0.4 | 4.81 ± 0.10 | 0.0861 ± 0.0006 | 0.207 ± 0.002 |
| VIB, β=1e-3 | 51.7 ± 0.7 | 5.09 ± 0.27 | 0.0863 ± 0.0013 | 0.194 ± 0.008 |
| VIB, β=1e-1, D | 49.8 ± 0.1 | 1.49 ± 0.01 | 0.0642 ± 0.0001 | 1.101 ± 0.008 |
| VIB, β=1e-2, D | 53.9 ± 0.4 | 1.52 ± 0.00 | 0.0636 ± 0.0002 | 0.803 ± 0.010 |
| VIB, β=1e-3, D | 54.5 ± 0.3 | 1.53 ± 0.01 | 0.0641 ± 0.0002 | 0.754 ± 0.009 |
| MASS, β=1e-2 | 45.4 ± 1.4 | 6.85 ± 0.26 | 0.0979 ± 0.0027 | 0.207 ± 0.007 |
| MASS, β=1e-3 | 47.0 ± 0.8 | 5.85 ± 0.24 | 0.0943 ± 0.0019 | 0.218 ± 0.007 |
| MASS, β=1e-4 | 47.1 ± 1.1 | 5.71 ± 0.25 | 0.0942 ± 0.0025 | 0.219 ± 0.006 |
| MASS, β=0 | 47.0 ± 0.6 | 5.67 ± 0.28 | 0.0945 ± 0.0019 | 0.221 ± 0.004 |
| MASS, β=1e-2, D | 51.9 ± 0.5 | 1.60 ± 0.03 | 0.0662 ± 0.0004 | 0.846 ± 0.025 |
| MASS, β=1e-3, D | 53.0 ± 0.5 | 1.56 ± 0.02 | 0.0648 ± 0.0008 | 0.812 ± 0.017 |
| MASS, β=1e-4, D | 52.9 ± 0.6 | 1.55 ± 0.02 | 0.0646 ± 0.0005 | 0.831 ± 0.020 |
| MASS, β=0, D | 52.7 ± 0.4 | 1.55 ± 0.02 | 0.0648 ± 0.0004 | 0.832 ± 0.012 |

Table 4. Uncertainty quantification metrics (proper scoring rules) on CIFAR-10 using the ResNet20 network trained on 40,000 datapoints. Values are the mean over 4 training runs with different random seeds, plus or minus the standard deviation. Emboldened values in the original table are those within one standard deviation of the minimum observed mean value in the column. Lower values are better.

| Method | Test Accuracy | NLL | Brier Score | Entropy |
|---|---|---|---|---|
| Softmax CE | 67.8 ± 2.7 | 1.98 ± 0.15 | 0.0546 ± 0.0043 | 0.209 ± 0.021 |
| VIB, β=1e-3 | 66.0 ± 0.6 | 2.28 ± 0.12 | 0.0577 ± 0.0011 | 0.210 ± 0.004 |
| VIB, β=1e-4 | 67.1 ± 0.6 | 2.23 ± 0.07 | 0.0563 ± 0.0010 | 0.196 ± 0.003 |
| VIB, β=1e-5 | 67.8 ± 0.6 | 2.35 ± 0.11 | 0.0559 ± 0.0012 | 0.175 ± 0.003 |
| VIB, β=0 | 68.0 ± 0.1 | 2.45 ± 0.05 | 0.0558 ± 0.0003 | 0.167 ± 0.003 |
| MASS, β=1e-3 | 67.1 ± 0.5 | 1.77 ± 0.03 | 0.0555 ± 0.0010 | 0.227 ± 0.006 |
| MASS, β=1e-4 | 68.9 ± 1.1 | 1.91 ± 0.07 | 0.0533 ± 0.0018 | 0.193 ± 0.011 |
| MASS, β=1e-5 | 69.5 ± 0.6 | 1.96 ± 0.05 | 0.0522 ± 0.0011 | 0.188 ± 0.007 |
| MASS, β=0 | 69.0 ± 0.8 | 2.00 ± 0.08 | 0.0528 ± 0.0015 | 0.190 ± 0.003 |

Table 5. Out-of-distribution detection metrics on CIFAR-10 with SVHN digits as the out-of-distribution examples, using the ResNet20 network trained on 40,000 datapoints. Values are the mean over 4 training runs with different random seeds, plus or minus the standard deviation. Emboldened values in the original table are those within one standard deviation of the maximum observed mean value in the column. Higher values are better.
| Training Method | Test Accuracy | Detection Method | AUROC | APR In | APR Out |
|---|---|---|---|---|---|
| Softmax CE | 67.8 ± 2.7 | Entropy | 0.62 ± 0.01 | 0.66 ± 0.02 | 0.57 ± 0.01 |
|  |  | $\max_i q_\phi(f_\theta(x) \mid y_i)$ | 0.72 ± 0.02 | 0.73 ± 0.03 | 0.70 ± 0.03 |
| VIB, β=1e-3 | 66.0 ± 0.6 | Entropy | 0.57 ± 0.01 | 0.60 ± 0.01 | 0.53 ± 0.01 |
|  |  | Rate | 0.71 ± 0.03 | 0.71 ± 0.03 | 0.69 ± 0.02 |
| VIB, β=1e-4 | 67.1 ± 0.6 | Entropy | 0.57 ± 0.02 | 0.59 ± 0.03 | 0.53 ± 0.01 |
|  |  | Rate | 0.72 ± 0.04 | 0.71 ± 0.05 | 0.70 ± 0.04 |
| VIB, β=1e-5 | 67.8 ± 0.6 | Entropy | 0.56 ± 0.04 | 0.58 ± 0.05 | 0.53 ± 0.02 |
|  |  | Rate | 0.68 ± 0.01 | 0.68 ± 0.01 | 0.64 ± 0.02 |
| VIB, β=0 | 68.0 ± 0.1 | Entropy | 0.60 ± 0.03 | 0.63 ± 0.04 | 0.55 ± 0.02 |
|  |  | Rate | 0.61 ± 0.03 | 0.60 ± 0.02 | 0.57 ± 0.04 |
| MASS, β=1e-3 | 67.1 ± 0.5 | Entropy | 0.63 ± 0.02 | 0.68 ± 0.02 | 0.57 ± 0.02 |
|  |  | $\max_i q_\phi(f_\theta(x) \mid y_i)$ | 0.69 ± 0.02 | 0.68 ± 0.02 | 0.68 ± 0.02 |
| MASS, β=1e-4 | 68.9 ± 1.1 | Entropy | 0.64 ± 0.01 | 0.69 ± 0.01 | 0.58 ± 0.01 |
|  |  | $\max_i q_\phi(f_\theta(x) \mid y_i)$ | 0.74 ± 0.01 | 0.73 ± 0.01 | 0.72 ± 0.02 |
| MASS, β=1e-5 | 69.5 ± 0.6 | Entropy | 0.64 ± 0.01 | 0.68 ± 0.01 | 0.58 ± 0.01 |
|  |  | $\max_i q_\phi(f_\theta(x) \mid y_i)$ | 0.76 ± 0.04 | 0.75 ± 0.04 | 0.75 ± 0.04 |
| MASS, β=0 | 69.0 ± 0.8 | Entropy | 0.65 ± 0.01 | 0.69 ± 0.02 | 0.59 ± 0.01 |
|  |  | $\max_i q_\phi(f_\theta(x) \mid y_i)$ | 0.76 ± 0.03 | 0.76 ± 0.03 | 0.75 ± 0.03 |

[Figure 2. Value of each term in the MASS Learning loss function, $\mathcal{L}_{\mathrm{MASS}}(f) = H(Y \mid f(X)) + \beta H(f(X)) - \beta \mathbb{E}_X[\log J_f(X)]$, during training of the SmallMLP network on the CIFAR-10 dataset. The MASS training was performed with β = 0.001, though the plotted values are for the terms without being multiplied by the β coefficients. The values of these terms for Softmax CE training are estimated using a variational distribution $q_\phi(x \mid y)$, the parameters of which were estimated at each timestep by MLE over the training data.]

References

Achille, A. and Soatto, S. Emergence of invariance and disentanglement in deep representations. In 2018 Information Theory and Applications Workshop (ITA), pp. 1-9, February 2018a. doi: 10.1109/ITA.2018.8503149.

Achille, A. and Soatto, S. Information Dropout: Learning optimal representations through noisy computation, 2018b.

Adragni, K. P. and Cook, R. D. Sufficient dimension reduction and prediction in regression. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906):4385-4405, November 2009. doi: 10.1098/rsta.2009.0110.

Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep variational information bottleneck. arXiv:1612.00410, December 2016.

Alemi, A. A., Fischer, I., and Dillon, J. V. Uncertainty in the variational information bottleneck. arXiv:1807.00906, July 2018.

Amjad, R. A. and Geiger, B. C. Learning representations for neural network-based classification using the information bottleneck principle. arXiv:1802.09766, February 2018.

Bell, A. J. and Sejnowski, T. J. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129-1159, November 1995.

Clevert, D.-A., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv:1511.07289, November 2015.

Cover, T. M. and Thomas, J. A. Elements of Information Theory. Wiley-Interscience, Hoboken, NJ, 2nd edition, 2006.

Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. arXiv:1410.8516, October 2014.
Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. arXiv:1605.08803, May 2016.

Dynkin, E. B. Necessary and sufficient statistics for a family of probability distributions. Uspekhi Mat. Nauk, 6(1):68-90, 1951.

Federer, H. Geometric Measure Theory. Springer, New York, NY, 1969.

Goldfeld, Z., Berg, E. v. d., Greenewald, K., Melnyk, I., Nguyen, N., Kingsbury, B., and Polyanskiy, Y. Estimating information flow in neural networks. arXiv:1810.05728, October 2018.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. arXiv:1512.03385, December 2015.

Hendrycks, D. and Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv:1610.02136, October 2016.

Jakubovitz, D. and Giryes, R. Improving DNN robustness to adversarial attacks using Jacobian regularization. arXiv:1803.08680, March 2018.

James, R. G., Mahoney, J. R., and Crutchfield, J. P. Trimming the independent fat: Sufficient statistics, mutual information, and predictability from effective channel states. Physical Review E, 95(6), June 2017. doi: 10.1103/PhysRevE.95.060102.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv:1412.6980, December 2014.

Kolchinsky, A., Tracey, B. D., and Wolpert, D. H. Nonlinear information bottleneck. arXiv:1705.02436, May 2017.

Kolchinsky, A., Tracey, B. D., and Van Kuyk, S. Caveats for information bottleneck in deterministic scenarios. arXiv:1808.07593, August 2018.

Koliander, G., Pichler, G., Riegler, E., and Hlawatsch, F. Entropy and source coding for integer-dimensional singular random variables. IEEE Transactions on Information Theory, 62(11):6124-6154, November 2016. doi: 10.1109/TIT.2016.2604248.

Krantz, S. G. and Parks, H. R. Geometric Integration Theory. Birkhäuser, Basel, Switzerland, 2009.

Krizhevsky, A. Learning multiple layers of features from tiny images. 2009.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv:1612.01474, December 2016.

Lehmann, E. L. and Scheffé, H. Completeness, similar regions, and unbiased estimation: Part I. Sankhyā: The Indian Journal of Statistics, 10(4):305-340, 1950.

Nash, C., Kushman, N., and Williams, C. K. I. Inverting supervised representations with autoregressive neural density models. arXiv:1806.00400, June 2018.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. 2011.
Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Sensitivity and generalization in neural networks: An empirical study. arXiv:1802.08760, February 2018.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In NIPS-W, 2017.

Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. arXiv:1505.05770, May 2015.

Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X. Higher order contractive auto-encoder. In Gunopulos, D., Hofmann, T., Malerba, D., and Vazirgiannis, M. (eds.), Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, pp. 645-660. Springer Berlin Heidelberg, 2011.

Ross, A. S. and Doshi-Velez, F. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. arXiv:1711.09404, November 2017.

Saxe, A. M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B. D., and Cox, D. D. On the information bottleneck theory of deep learning. February 2018. URL https://openreview.net/forum?id=ry_WPG-A-.

Shamir, O., Sabato, S., and Tishby, N. Learning and generalization with the information bottleneck. Theoretical Computer Science, 411(29):2696-2711, June 2010. doi: 10.1016/j.tcs.2010.04.006.

Shwartz-Ziv, R. and Tishby, N. Opening the black box of deep neural networks via information. arXiv:1703.00810, March 2017.

Sokolic, J., Giryes, R., Sapiro, G., and Rodrigues, M. R. D. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265-4280, August 2017. doi: 10.1109/TSP.2017.2708039.

Strouse, D. J. and Schwab, D. J. The deterministic information bottleneck. arXiv:1604.00268, April 2016.

Tishby, N. and Zaslavsky, N. Deep learning and the information bottleneck principle. arXiv:1503.02406, March 2015.

Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. arXiv:physics/0004057, April 2000.

Varga, D., Csiszárik, A., and Zombori, Z. Gradient regularization improves accuracy of discriminative models. arXiv:1712.09936, December 2017.