Minimal Achievable Sufficient Statistic Learning

Milan Cvitkovic 1   Günther Koliander 2

1 Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, California, USA. 2 Acoustics Research Institute, Austrian Academy of Sciences, Vienna, Austria. Correspondence to: Milan Cvitkovic.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

We introduce Minimal Achievable Sufficient Statistic (MASS) Learning, a machine learning training objective for which the minima are minimal sufficient statistics with respect to a class of functions being optimized over (e.g., deep networks). In deriving MASS Learning, we also introduce Conserved Differential Information (CDI), an information-theoretic quantity that, unlike standard mutual information, can be usefully applied to deterministically-dependent continuous random variables like the input and output of a deep network. In a series of experiments, we show that deep networks trained with MASS Learning achieve competitive performance on supervised learning, regularization, and uncertainty quantification benchmarks.

1. Introduction

The representation learning approach to machine learning focuses on finding a representation Z of an input random variable X that is useful for predicting a random variable Y (Goodfellow et al., 2016).

What makes a representation Z useful is much debated, but a common assertion is that Z should be a minimal sufficient statistic of X for Y (Adragni & Cook, 2009; Shamir et al., 2010; James et al., 2017; Achille & Soatto, 2018b). That is:

1. Z should be a statistic of X. This means Z = f(X) for some function f.
2. Z should be sufficient for Y. This means p(X | Z, Y) = p(X | Z).
3. Given that Z is a sufficient statistic, it should be minimal with respect to X. This means that for any measurable, non-invertible function g, g(Z) is no longer sufficient for Y.¹

¹Although this is not the most common phrasing of statistical minimality, we feel it is more understandable. For the equivalence of this phrasing and the standard definition, see Supplementary Material 7.1.

In other words: a minimal sufficient statistic is a random variable Z that tells you everything about Y you could ever care about, but if you do any irreversible processing to Z, you are guaranteed to lose some information about Y.

Minimal sufficient statistics have a long history in the field of statistics (Lehmann & Scheffé, 1950; Dynkin, 1951). But the minimality condition (3, above) is perhaps too strong to be useful in machine learning, since it is a statement about any measurable function g, rather than about functions in a practical hypothesis class like the class of deep neural networks. Instead, in this work we consider minimal achievable sufficient statistics: sufficient statistics that are minimal within some particular set of functions.

Definition 1 (Minimal Achievable Sufficient Statistic). Let f(X) be a sufficient statistic of X for Y. f(X) is minimal achievable with respect to a set of functions F if f ∈ F and for any Lipschitz continuous, non-invertible function g, g(f(X)) is no longer sufficient for Y.

In this work, we give a characterization of minimal achievable sufficient statistics that is applicable to deep neural networks and show that it can be used to train models with competitive performance on classification accuracy, uncertainty quantification, and out-of-distribution input detection.
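As a concrete, textbook-style illustration of these conditions (our addition, not an example from the paper), consider binary classification with Gaussian class-conditional densities sharing a covariance matrix:

```latex
% Illustrative example (ours): X | Y = y ~ N(mu_y, Sigma) for y in {0, 1}.
% The log-likelihood ratio is a scalar statistic of X,
\[
  Z = f(X) = \log\frac{p(X \mid Y=1)}{p(X \mid Y=0)}
           = (\mu_1 - \mu_0)^\top \Sigma^{-1} X + \mathrm{const},
\]
% and it is sufficient for Y: the class-dependent factor of p(x \mid y)
% depends on x only through (\mu_1 - \mu_0)^\top \Sigma^{-1} x, so
% p(X \mid Z, Y) = p(X \mid Z).  It is also minimal in the classical sense:
% every sufficient statistic determines the likelihood ratio, so Z is a
% function of any other sufficient statistic.  Intuitively, further
% non-invertible processing of Z merges inputs with different posteriors
% p(Y \mid X) and therefore loses information about Y.
```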
Contributions:

- We introduce Conserved Differential Information (CDI), an information-theoretic quantity that, unlike mutual information, is meaningful for deterministically-dependent continuous random variables, such as the input and output of a deep network.
- We introduce Minimal Achievable Sufficient Statistic Learning (MASS Learning), a training objective based on CDI for finding minimal achievable sufficient statistics.
- We provide empirical evidence that models trained by MASS Learning achieve competitive performance on supervised learning, regularization, and uncertainty quantification benchmarks.

2. Conserved Differential Information

Before we present MASS Learning, we need to introduce Conserved Differential Information (CDI), on which MASS Learning is based.

CDI is an information-theoretic quantity that addresses an oft-cited issue in machine learning (Bell & Sejnowski, 1995; Amjad & Geiger, 2018; Saxe et al., 2018; Nash et al., 2018; Goldfeld et al., 2018): for a continuous random variable X and a continuous, non-constant function f, the mutual information I(X, f(X)) is infinite. (See Supplementary Material 7.2 for details.) This makes I(X, f(X)) unsuitable for use in a learning objective when f is, for example, a standard deep network.

The infinitude of I(X, f(X)) has been circumvented in prior works by two strategies. One is to discretize X and f(X) (Tishby & Zaslavsky, 2015; Shwartz-Ziv & Tishby, 2017), though this is controversial (Saxe et al., 2018). Another is to use a random variable Z with distribution p(Z | X) as the representation of X, rather than using f(X) itself as the representation (Alemi et al., 2016; Kolchinsky et al., 2017; Achille & Soatto, 2018b). In this latter approach, p(Z | X) is usually implemented by adding noise to a deep network that takes X as input.

These are both reasonable strategies for avoiding the infinitude of I(X, f(X)). But another approach is to derive a new information-theoretic quantity that is better suited to this situation. To that end we present Conserved Differential Information:

Definition 2. For a continuous random variable X taking values in $\mathbb{R}^d$ and a Lipschitz continuous function f, the Conserved Differential Information (CDI) is
$$C(X, f(X)) := H(f(X)) - \mathbb{E}_X[\log J_f(X)], \quad (1)$$
where H denotes the differential entropy $H(Z) = -\int p(z) \log p(z)\, dz$ and $J_f$ is the Jacobian determinant of f,
$$J_f(x) = \sqrt{\det\!\big(Df(x)\, Df(x)^\top\big)},$$
with $Df(x) = \partial f(x) / \partial x^\top$ the Jacobian matrix of f at x.

Readers familiar with normalizing flows (Rezende & Mohamed, 2015) or Real NVP (Dinh et al., 2016) will note that the Jacobian determinant used in those methods is a special case of the Jacobian determinant in the definition of CDI. This is because normalizing flows and Real NVP are based on the change-of-variables formula for invertible mappings, while CDI is based in part on the more general change-of-variables formula for non-invertible mappings. More details on this connection are given in Supplementary Material 7.3. The mathematical motivation for CDI, based on the recent work of Koliander et al. (2016), is provided in Supplementary Material 7.4. Figure 1 gives a visual example of what CDI measures about a function.

[Figure 1. CDI of two functions f and g of the random variable X: the invertible map $f(X) = \tfrac{1}{2}X$ has $C(X, f(X)) = 0$, while the non-invertible map $g(X) = |X - \tfrac{1}{2}|$ has $C(X, g(X)) = -\log 2$. Even though the random variables f(X) and g(X) have the same distribution, C(X, f(X)) is different from C(X, g(X)). This is because f is an invertible function, while g is not.]
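As a concrete illustration of the $J_f$ term in Definition 2, the quantity $\log J_f(x) = \tfrac{1}{2}\log\det\!\big(Df(x)\,Df(x)^\top\big)$ can be computed with automatic differentiation. The following is a minimal sketch of our own (not the authors' released code); the toy network and input shapes are illustrative assumptions:

```python
import torch
from torch.autograd.functional import jacobian

def log_jacobian_det(f, x):
    """log J_f(x) = 0.5 * log det(Df(x) Df(x)^T), where Df(x) is the
    Jacobian matrix of f at a single (unbatched) input x (Definition 2)."""
    Df = jacobian(f, x)              # shape: (output_dim, input_dim)
    gram = Df @ Df.T                 # Df(x) Df(x)^T, positive semi-definite
    _, logabsdet = torch.slogdet(gram)
    return 0.5 * logabsdet

# Toy non-invertible map from R^4 to R^2 (illustrative only).
net = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ELU(), torch.nn.Linear(8, 2))
x = torch.randn(4)

# Monte-Carlo estimate of E_X[log J_f(X)] over a batch of inputs.
batch = torch.randn(64, 4)
estimate = torch.stack([log_jacobian_det(net, xi) for xi in batch]).mean()
print(float(log_jacobian_det(net, x)), float(estimate))
```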
The conserved differential information C(X, f(X)) between deterministically-dependent random variables behaves much like mutual information does on discrete random variables. For example, when f is invertible, $C(X, f(X)) = H(X)$, just as with the mutual information between discrete random variables. Most importantly for our purposes, though, C(X, f(X)) obeys the following data processing inequality:

Theorem 1 (CDI Data Processing Inequality). For Lipschitz continuous functions f and g with the same output space,
$$C(X, f(X)) \geq C(X, g(f(X))),$$
with equality if and only if g is invertible almost everywhere.

The proof is in Supplementary Material 7.5.

3. MASS Learning

With CDI and its data processing inequality in hand, we can give the following optimization-based characterization of minimal achievable sufficient statistics:

Theorem 2. Let X be a continuous random variable, Y be a discrete random variable, and F be any set of Lipschitz continuous functions with a common output space (e.g., different parameter settings of a deep network). If
$$f \in \operatorname*{arg\,min}_{S \in F} C(X, S(X)) \quad \text{s.t.} \quad I(S(X), Y) = \max_{S'} I(S'(X), Y),$$
then f(X) is a minimal achievable sufficient statistic of X for Y with respect to F.

Proof. First note the following lemma (Cover & Thomas, 2006):

Lemma 1. Z = f(X) is a sufficient statistic for a discrete random variable Y if and only if $I(Z, Y) = \max_{S'} I(S'(X), Y)$.

Lemma 1 guarantees that any f satisfying the conditions in Theorem 2 is sufficient. If such an f were not minimal achievable, there would exist a non-invertible, Lipschitz continuous g such that g(f(X)) was sufficient, and by Theorem 1, $C(X, g(f(X))) < C(X, f(X))$, contradicting the assumption that f minimizes C(X, S(X)).

We can turn Theorem 2 into a learning objective over functions f by relaxing the strict constraint into a Lagrangian formulation with Lagrange multiplier $1/\beta$, $\beta > 0$:
$$C(X, f(X)) - \frac{1}{\beta}\, I(f(X), Y).$$
The larger the value of β, the more our objective will encourage minimality over sufficiency. We can then simplify this formulation using the identity $I(f(X), Y) = H(Y) - H(Y \mid f(X))$: multiplying through by β and dropping the constant H(Y) gives the following optimization objective:
$$\mathcal{L}_{\mathrm{MASS}}(f) := H(Y \mid f(X)) + \beta H(f(X)) - \beta\, \mathbb{E}_X[\log J_f(X)]. \quad (2)$$
We refer to minimizing this objective as MASS Learning.

In practice, we are interested in using MASS Learning to train a deep network $f_\theta$ with parameters θ, using a finite dataset $\{(x_i, y_i)\}_{i=1}^N$ of N datapoints sampled from the joint distribution p(x, y) of X and Y. To do this, we introduce a parameterized variational approximation $q_\phi(f_\theta(x) \mid y) \approx p(f_\theta(x) \mid y)$. Using $q_\phi$, we minimize the following empirical upper bound to $\mathcal{L}_{\mathrm{MASS}}$:
$$\widehat{\mathcal{L}}_{\mathrm{MASS}}(\theta, \phi) := \frac{1}{N} \sum_{i=1}^{N} \Big( -\log q_\phi(y_i \mid f_\theta(x_i)) - \beta \log q_\phi(f_\theta(x_i)) - \beta \log J_{f_\theta}(x_i) \Big) \geq \mathcal{L}_{\mathrm{MASS}},$$
where the quantity $q_\phi(f_\theta(x_i))$ is computed as $\sum_y q_\phi(f_\theta(x_i) \mid y)\, p(y)$ and the quantity $q_\phi(y_i \mid f_\theta(x_i))$ is computed with Bayes' rule as
$$q_\phi(y_i \mid f_\theta(x_i)) = \frac{q_\phi(f_\theta(x_i) \mid y_i)\, p(y_i)}{\sum_y q_\phi(f_\theta(x_i) \mid y)\, p(y)}.$$

When Y is discrete and takes on finitely many values, as in classification problems, and when we choose a variational distribution $q_\phi$ that is differentiable with respect to φ (e.g., a multivariate Gaussian), we can minimize $\widehat{\mathcal{L}}_{\mathrm{MASS}}(\theta, \phi)$ using stochastic gradient descent.

To perform classification using our trained network, we use the learned variational distribution $q_\phi$ and Bayes' rule:
$$p(Y \mid X) \approx p(Y \mid f_\theta(X)) \approx \frac{q_\phi(f_\theta(X) \mid Y)\, p(Y)}{\sum_y q_\phi(f_\theta(X) \mid y)\, p(y)}.$$
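The following is a minimal PyTorch sketch of how $\widehat{\mathcal{L}}_{\mathrm{MASS}}$ and the Bayes-rule classifier could be assembled from the pieces above. It is our illustration, not the authors' released implementation; `net`, `class_dists`, and `log_prior` are assumed placeholders:

```python
import torch
from torch.autograd.functional import jacobian

def mass_loss(net, class_dists, log_prior, x, y, beta):
    """Sketch of the empirical MASS objective
    (1/N) * sum_i [ -log q(y_i|f(x_i)) - beta*log q(f(x_i)) - beta*log J_f(x_i) ].
    `class_dists` is a list with one torch.distributions object per class,
    playing the role of q_phi(.|y); `log_prior` is log p(y), shape [num_classes]."""
    z = net(x)                                                  # f_theta(x_i), shape (N, r)
    log_q_z_given_y = torch.stack([d.log_prob(z) for d in class_dists], dim=1)
    log_joint = log_q_z_given_y + log_prior                     # log q(z|y') + log p(y')
    log_q_z = torch.logsumexp(log_joint, dim=1)                 # log q_phi(f(x_i))
    log_q_y_given_z = log_joint.gather(1, y[:, None]).squeeze(1) - log_q_z  # Bayes' rule

    def log_jac(xi):                                            # log J_f(x_i), one input at a time
        # create_graph=True so the Jacobian term is differentiable w.r.t. net's parameters;
        # assumes net accepts unbatched inputs.
        Df = jacobian(net, xi, create_graph=True)
        return 0.5 * torch.slogdet(Df @ Df.T)[1]

    log_jacs = torch.stack([log_jac(xi) for xi in x])
    return (-log_q_y_given_z - beta * log_q_z - beta * log_jacs).mean()

def predict_proba(net, class_dists, log_prior, x):
    """Classification via Bayes' rule: p(y|x) ~ q_phi(f(x)|y) p(y) / sum_y' q_phi(f(x)|y') p(y')."""
    z = net(x)
    log_joint = torch.stack([d.log_prob(z) for d in class_dists], dim=1) + log_prior
    return torch.softmax(log_joint, dim=1)
```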
4. Related Work

4.1. Connection to the Information Bottleneck

The well-studied Information Bottleneck learning method (Tishby et al., 2000; Tishby & Zaslavsky, 2015; Strouse & Schwab, 2016; Alemi et al., 2016; Saxe et al., 2018; Amjad & Geiger, 2018; Goldfeld et al., 2018; Kolchinsky et al., 2018; Achille & Soatto, 2018b;a) is based on minimizing the Information Bottleneck Lagrangian
$$\mathcal{L}_{\mathrm{IB}}(Z) := \beta I(X, Z) - I(Y, Z), \quad \beta > 0,$$
where Z is the representation whose conditional distribution p(Z | X) we are trying to learn.

The $\mathcal{L}_{\mathrm{IB}}$ learning objective can be motivated on purely information-theoretic grounds. But some works, such as Shamir et al. (2010), also point out the connection between the $\mathcal{L}_{\mathrm{IB}}$ objective and minimal sufficient statistics, which is based on the following theorem:

Theorem 3. Let X be a discrete random variable drawn according to a distribution p(X | Y) determined by the discrete random variable Y. Let F be the set of deterministic functions of X to any target space. Then f(X) is a minimal sufficient statistic of X for Y if and only if
$$f \in \operatorname*{arg\,min}_{S \in F} I(X, S(X)) \quad \text{s.t.} \quad I(S(X), Y) = \max_{S' \in F} I(S'(X), Y).$$
The $\mathcal{L}_{\mathrm{IB}}$ objective can then be thought of as a Lagrangian relaxation of the optimization problem in this theorem.

Theorem 3 only holds for discrete random variables. For continuous X it holds only in the reverse direction, so minimizing $\mathcal{L}_{\mathrm{IB}}$ for continuous X has no formal connection to finding minimal sufficient statistics, let alone minimal achievable sufficient statistics. See Supplementary Material 7.6 for details. Nevertheless, the optimization problems in Theorem 2 and Theorem 3 are extremely similar, relying as they both do on Lemma 1 for their proofs. And the idea of relaxing the optimization problem in Theorem 2 into a Lagrangian formulation to obtain $\mathcal{L}_{\mathrm{MASS}}$ is directly inspired by the Information Bottleneck. So while MASS Learning and Information Bottleneck learning entail different network architectures and loss functions, there is an Information Bottleneck flavor to MASS Learning.

4.2. Jacobian Regularization

The presence of the $J_f$ term in $\widehat{\mathcal{L}}_{\mathrm{MASS}}$ is reminiscent of the contractive autoencoder (Rifai et al., 2011) and the Jacobian regularization literature (Sokolic et al., 2017; Ross & Doshi-Velez, 2017; Varga et al., 2017; Novak et al., 2018; Jakubovitz & Giryes, 2018). Both of these literatures suggest that minimizing $\mathbb{E}_X[\lVert Df(X) \rVert_F]$, where $Df(x) = \partial f(x)/\partial x^\top \in \mathbb{R}^{r \times d}$ is the Jacobian matrix, improves generalization and/or adversarial robustness. This may seem paradoxical at first, since by applying the AM-GM inequality to the eigenvalues of $Df(x)Df(x)^\top$ we have
$$\mathbb{E}_X\!\big[\lVert Df(X) \rVert_F^{2r}\big] = \mathbb{E}_X\!\big[\operatorname{Tr}\!\big(Df(X)Df(X)^\top\big)^r\big] \geq \mathbb{E}_X\!\big[r^r \det\!\big(Df(X)Df(X)^\top\big)\big] = \mathbb{E}_X\!\big[r^r J_f(X)^2\big],$$
and, by Jensen's inequality,
$$\log \mathbb{E}_X\!\big[r^r J_f(X)^2\big] \geq 2\, \mathbb{E}_X[\log J_f(X)] + r \log r,$$
while $\mathbb{E}_X[\log J_f(X)]$ is being maximized by $\widehat{\mathcal{L}}_{\mathrm{MASS}}$. So $\widehat{\mathcal{L}}_{\mathrm{MASS}}$ would seem to be optimizing for worse generalization according to the Jacobian regularization literature. However, the conditional entropy term in $\widehat{\mathcal{L}}_{\mathrm{MASS}}$ strongly encourages minimizing $\mathbb{E}_X[\lVert Df(X) \rVert_F]$. So overall $\widehat{\mathcal{L}}_{\mathrm{MASS}}$ seems to be seeking the right balance of sensitivity of the network to its inputs (dependent on the value of β), which is precisely in alignment with what the Jacobian regularization literature suggests.
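For concreteness, the Frobenius-norm penalty discussed above can be estimated with the same autograd machinery as the earlier $\log J_f$ sketch. This is again an illustrative sketch of ours, not code from the paper; the function name and batching convention are assumptions:

```python
import torch
from torch.autograd.functional import jacobian

def jacobian_frobenius_penalty(f, xs):
    """Monte-Carlo estimate of E_X[||Df(X)||_F], the quantity the Jacobian
    regularization literature minimizes, over a batch of unbatched inputs xs.
    Contrast with E_X[log J_f(X)], which the MASS objective maximizes."""
    norms = [torch.linalg.matrix_norm(jacobian(f, x)) for x in xs]  # Frobenius norm by default
    return torch.stack(norms).mean()
```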
5. Experiments

Code to reproduce all experiments is available online.² Full details on all experiments are in Supplementary Material 7.7.

²https://github.com/mwcvitkovic/MASS-Learning

In this section we compare MASS Learning to other approaches for training deep networks. We use the abbreviation "Softmax CE" to refer to the standard approach of training deep networks for classification problems by minimizing the softmax cross-entropy loss
$$\widehat{\mathcal{L}}_{\mathrm{SoftmaxCE}}(\theta) := -\frac{1}{N} \sum_{i=1}^{N} \log \operatorname{softmax}(f_\theta(x_i))_{y_i},$$
where $\operatorname{softmax}(f_\theta(x_i))_{y_i}$ is the $y_i$th element of the softmax function applied to the outputs $f_\theta(x_i)$ of the network's last linear layer. As usual, $\operatorname{softmax}(f_\theta(x_i))_{y_i}$ is taken to be the network's estimate of $p(y_i \mid x_i)$. We also compare against the Variational Information Bottleneck method for representation learning (Alemi et al., 2016), which we abbreviate as "VIB".

We use two networks in our experiments. "SmallMLP" is a feedforward network with two fully-connected layers of 400 and 200 hidden units, respectively, both with elu nonlinearities (Clevert et al., 2015). "ResNet20" is the 20-layer residual network of He et al. (2015).

In all our experiments, the variational distribution $q_\phi(x \mid y)$ for each possible output class y is a mixture of multivariate Gaussian distributions for which we learn the mixture weights, means, and covariance matrices.

Computing the $J_f$ term in $\widehat{\mathcal{L}}_{\mathrm{MASS}}(\theta, \phi)$ for every sample in a minibatch is too expensive to be practical. Doing so would require on the order of |Y| times more operations than computing $\widehat{\mathcal{L}}_{\mathrm{SoftmaxCE}}(\theta)$, since computing the $J_f$ term (in our implementation) requires computing the full Jacobian matrix of the network. Thus, to make training tractable, we use a subsampling strategy: we estimate the $J_f$ term using only a 1/|Y| fraction of the datapoints in a minibatch. In practice, we do not notice any performance detriment from subsampling, and the numerical value of the $J_f$ term during training with subsampling is indistinguishable from training without subsampling.

Subsampling the $J_f$ term results in a significant performance improvement, but it must nevertheless be emphasized that even with the subsampling strategy, our implementation of MASS Learning is roughly twice as computationally costly as Softmax CE training (unless β = 0, in which case the cost is the same as Softmax CE). This is by far the most significant drawback of our implementation of MASS Learning. There are many easier-to-compute upper bounds or estimates of $J_f$ that one could use to make MASS Learning faster, but we do not explore them in this work.

We performed all experiments on the CIFAR-10 dataset (Krizhevsky, 2009) and implemented all our models in PyTorch (Paszke et al., 2017).
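As an illustration of the subsampling strategy described above, the $J_f$ term can be estimated on a random 1/|Y| fraction of each minibatch. This is a sketch under our own assumptions (unbatched Jacobian computation, hypothetical function name), not the paper's implementation:

```python
import torch
from torch.autograd.functional import jacobian

def subsampled_log_jacobian_term(net, x, num_classes):
    """Estimate E_X[log J_f(X)] on a random 1/|Y| fraction of the minibatch only."""
    n = max(1, x.shape[0] // num_classes)              # 1/|Y| of the minibatch
    idx = torch.randperm(x.shape[0])[:n]               # random subset of datapoints
    logs = []
    for xi in x[idx]:
        Df = jacobian(net, xi, create_graph=True)      # Jacobian of one (unbatched) input
        logs.append(0.5 * torch.slogdet(Df @ Df.T)[1]) # log J_f(x_i)
    return torch.stack(logs).mean()                    # plugged into the beta * log J_f term
```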
5.1. Classification Accuracy and Regularization

We first confirm that networks trained by MASS Learning can make accurate predictions in supervised learning tasks. We also compare the classification accuracy of networks trained on varying amounts of data to see whether MASS Learning successfully regularizes networks and improves their generalization performance.

Classification accuracies for the SmallMLP network are shown in Table 1, and for the ResNet20 network in Table 2. For the SmallMLP network, MASS Learning does not appear to offer any performance benefit. For the larger ResNet20 network, the results show that while MASS Learning maintains or improves accuracy compared to Softmax CE training, often fairly significantly, these improvements do not seem to be due to the MASS loss $\widehat{\mathcal{L}}_{\mathrm{MASS}}(\theta, \phi)$ itself, since the same improvements are obtained even when the $H(f(X))$ and $\mathbb{E}_X[\log J_f(X)]$ terms in the MASS loss are set to 0 (i.e., the case β = 0). This suggests that it is the use of the variational distribution $q_\phi(x \mid y)$ to produce the output of the network, rather than the MASS Learning approach, that provides the benefit. This is an interesting finding, but it does not suggest an advantage to using the full MASS Learning method if one is concerned with accuracy or regularization.

5.2. Uncertainty Quantification

We also evaluate the ability of networks trained by MASS Learning to properly quantify uncertainty about their predictions. We assess uncertainty quantification in two ways: using proper scoring rules (Lakshminarayanan et al., 2016), which are scalar measures of how well a network's predictive distribution is calibrated, and by observing performance on an out-of-distribution (OOD) detection task.

Tables 3 and 4 show the uncertainty quantification performance of networks according to three proper scoring rules: the negative log-likelihood (NLL), the Brier score, and the entropy of the predictive distribution $p(y \mid f_\theta(x))$. With the SmallMLP network, Softmax CE and VIB training perform best, while with the ResNet20 network the results are more varied. In general, though, any benefits produced by MASS Learning seem to derive not from the MASS objective but from the network architecture, since MASS Learning with β = 0 gives performance comparable to MASS Learning with β ≠ 0.

Table 5 shows scalar metrics for performance on an OOD detection task in which the network is asked to identify whether an image is from its training distribution (CIFAR-10 images) or from another distribution (SVHN images (Netzer et al., 2011)).

Table 1. Test-set classification accuracy (percent) on the CIFAR-10 dataset using the SmallMLP network trained by various methods. Full experiment details are in Supplementary Material 7.7. Values are the mean classification accuracy over 4 training runs with different random seeds, plus or minus the standard deviation. Emboldened accuracies in the original table are those within one standard deviation of the maximum observed mean accuracy in the column. WD is weight decay; D is dropout.

| Method | Training set size: 2,500 | 10,000 | 40,000 |
|---|---|---|---|
| Softmax CE | 33.9 ± 0.5 | 44.5 ± 0.3 | 52.4 ± 1.1 |
| Softmax CE, WD | 26.2 ± 0.9 | 36.5 ± 0.8 | 47.8 ± 0.6 |
| Softmax CE, D | 33.0 ± 1.1 | 43.9 ± 0.6 | 54.2 ± 0.5 |
| VIB, β=1e-1 | 32.3 ± 0.4 | 40.6 ± 0.6 | 46.4 ± 0.6 |
| VIB, β=1e-2 | 34.2 ± 0.4 | 44.1 ± 0.5 | 51.6 ± 0.4 |
| VIB, β=1e-3 | 35.1 ± 0.7 | 44.2 ± 0.6 | 51.7 ± 0.7 |
| VIB, β=1e-1, D | 28.9 ± 0.9 | 39.9 ± 0.5 | 49.8 ± 0.1 |
| VIB, β=1e-2, D | 32.9 ± 1.2 | 43.7 ± 0.8 | 53.9 ± 0.4 |
| VIB, β=1e-3, D | 34.1 ± 1.0 | 44.3 ± 0.5 | 54.5 ± 0.3 |
| MASS, β=1e-2 | 30.3 ± 0.4 | 39.9 ± 1.1 | 45.4 ± 1.4 |
| MASS, β=1e-3 | 32.6 ± 0.6 | 40.9 ± 0.6 | 47.0 ± 0.8 |
| MASS, β=1e-4 | 33.4 ± 0.6 | 40.7 ± 0.4 | 47.1 ± 1.1 |
| MASS, β=0 | 34.0 ± 0.5 | 40.8 ± 1.0 | 47.0 ± 0.6 |
| MASS, β=1e-2, D | 29.6 ± 1.2 | 42.2 ± 0.5 | 51.9 ± 0.5 |
| MASS, β=1e-3, D | 31.8 ± 1.3 | 43.4 ± 0.4 | 53.0 ± 0.5 |
| MASS, β=1e-4, D | 31.9 ± 0.8 | 43.2 ± 0.2 | 52.9 ± 0.6 |
| MASS, β=0, D | 32.1 ± 1.3 | 43.4 ± 0.4 | 52.7 ± 0.4 |

Table 2. Test-set classification accuracy (percent) on the CIFAR-10 dataset using the ResNet20 network trained by various methods. No data augmentation or learning-rate scheduling was used; full details are in Supplementary Material 7.7. Values are the mean classification accuracy over 4 training runs with different random seeds, plus or minus the standard deviation. Emboldened accuracies in the original table are those within one standard deviation of the maximum observed mean accuracy in the column.
| Method | Training set size: 2,500 | 10,000 | 40,000 |
|---|---|---|---|
| Softmax CE | 37.4 ± 0.7 | 52.0 ± 1.1 | 67.8 ± 2.7 |
| VIB, β=1e-3 | 33.5 ± 0.9 | 49.1 ± 1.5 | 66.0 ± 0.6 |
| VIB, β=1e-4 | 34.0 ± 1.0 | 50.3 ± 1.6 | 67.1 ± 0.6 |
| VIB, β=1e-5 | 34.7 ± 0.6 | 50.2 ± 1.6 | 67.8 ± 0.6 |
| VIB, β=0 | 35.3 ± 0.7 | 50.0 ± 1.7 | 68.0 ± 0.1 |
| MASS, β=1e-3 | 38.5 ± 0.9 | 52.0 ± 1.0 | 67.1 ± 0.5 |
| MASS, β=1e-4 | 39.1 ± 0.3 | 52.7 ± 0.7 | 68.9 ± 1.1 |
| MASS, β=1e-5 | 39.0 ± 1.0 | 52.5 ± 1.1 | 69.5 ± 0.6 |
| MASS, β=0 | 39.7 ± 0.5 | 52.9 ± 0.4 | 69.0 ± 0.8 |

Following Hendrycks & Gimpel (2016) and Alemi et al. (2018), the metrics we report are the area under the ROC curve (AUROC) and the average precision score (APR). APR depends on whether the network is tasked with predicting whether an image is in-distribution or out-of-distribution; we report both variants as APR In and APR Out, respectively. The "Entropy" detection method uses the entropy of the network's learned predictive distribution $p(y \mid f_\theta(x))$ as the OOD detection score. The "$\max_i q_\phi(f_\theta(x) \mid y_i)$" detection method uses the maximum density value over the potential output classes $y_i$ as the OOD detection score. (For the Softmax CE-trained networks, $q_\phi(f_\theta(x) \mid y_i)$ was estimated by MLE of a mixture of 10 full-covariance, 10-dimensional multivariate Gaussians on the training set.) And for the VIB networks, the "Rate" detection method uses the KL divergence between the VIB's marginal distribution and the representation as the OOD detection score.

Here we see MASS Learning outperforming Softmax CE and VIB, but again with the caveat that the benefits appear to be due to the variational distribution in the network architecture rather than to the MASS loss function.
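As a note on how such metrics are typically computed (an illustrative scikit-learn sketch of our own, not the paper's evaluation code), AUROC, APR In, and APR Out can all be obtained from a scalar OOD score:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def ood_metrics(scores_in, scores_out):
    """scores_*: higher score = more likely out-of-distribution
    (e.g., predictive entropy, or negative max_i q(f(x)|y_i))."""
    scores = np.concatenate([scores_in, scores_out])
    is_out = np.concatenate([np.zeros_like(scores_in), np.ones_like(scores_out)])
    auroc = roc_auc_score(is_out, scores)
    apr_out = average_precision_score(is_out, scores)      # OOD as the positive class
    apr_in = average_precision_score(1 - is_out, -scores)  # in-distribution as the positive class
    return auroc, apr_in, apr_out
```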
5.3. Does MASS Learning finally solve the mystery of why stochastic gradient descent with the cross-entropy loss works so well in deep learning?

We do not believe so. MASS Learning and Softmax CE training seem to produce fairly different representations during training. Figure 2 shows how the values of the three terms in $\widehat{\mathcal{L}}_{\mathrm{MASS}}(\theta, \phi)$ change as the SmallMLP network trains on the CIFAR-10 dataset using either the usual Softmax CE training or MASS training. Despite achieving similar accuracy, the Softmax CE training method does not seem to be implicitly performing MASS Learning, based on the differing values of the entropy (orange) and Jacobian (green) terms between the two methods as training progresses.

6. Conclusion

MASS Learning is a new approach to representation learning based on the goal of finding minimal achievable sufficient statistics. We have shown that networks trained by MASS Learning perform well on classification tasks and on regularization and uncertainty quantification benchmarks, despite not being directly formulated for any of these tasks.

There remain many open questions about MASS Learning. Of primary interest is further investigation into the properties of the representations learned by MASS Learning and how they differ from those learned in standard deep learning. There is also much to learn about how best to minimize the MASS loss. In this paper we used optimizer settings tuned for standard softmax cross-entropy learning, but $\widehat{\mathcal{L}}_{\mathrm{MASS}}(\theta, \phi)$ is such a different optimization objective that there are likely many potential improvements to be made in how we train the networks. We also plan to explore more expressive variational distributions $q_\phi$. Finally, in terms of efficiency: although MASS Learning is applicable in principle to any deep learning architecture, there is currently a significant computational cost in computing the $J_f$ term in the MASS loss function. Finding non-invertible network architectures that admit more efficiently computable Jacobians, as is done in methods like normalizing flows (Rezende & Mohamed, 2015) or Real NVP (Dinh et al., 2016), would greatly increase the utility of MASS Learning.

Acknowledgements

We would like to thank Georg Pichler, Thomas Vidick, Alex Alemi, Alessandro Achille, and Joseph Marino for useful discussions.

Table 3. Uncertainty quantification metrics (proper scoring rules) on CIFAR-10 using the SmallMLP network trained on 40,000 datapoints. Values are the mean over 4 training runs with different random seeds, plus or minus the standard deviation. Emboldened values in the original table are those within one standard deviation of the minimum observed mean value in the column. Lower values are better.

| Method | Test Accuracy | NLL | Brier Score | Entropy |
|---|---|---|---|---|
| Softmax CE | 52.4 ± 1.1 | 4.19 ± 0.15 | 0.0835 ± 0.0018 | 0.230 ± 0.003 |
| Softmax CE, WD | 47.8 ± 0.6 | 1.47 ± 0.02 | 0.0662 ± 0.0006 | 1.511 ± 0.019 |
| Softmax CE, D | 54.2 ± 0.5 | 1.56 ± 0.01 | 0.0642 ± 0.0006 | 0.739 ± 0.007 |
| VIB, β=1e-1 | 46.4 ± 0.6 | 4.78 ± 0.13 | 0.0919 ± 0.0009 | 0.296 ± 0.008 |
| VIB, β=1e-2 | 51.6 ± 0.4 | 4.81 ± 0.10 | 0.0861 ± 0.0006 | 0.207 ± 0.002 |
| VIB, β=1e-3 | 51.7 ± 0.7 | 5.09 ± 0.27 | 0.0863 ± 0.0013 | 0.194 ± 0.008 |
| VIB, β=1e-1, D | 49.8 ± 0.1 | 1.49 ± 0.01 | 0.0642 ± 0.0001 | 1.101 ± 0.008 |
| VIB, β=1e-2, D | 53.9 ± 0.4 | 1.52 ± 0.00 | 0.0636 ± 0.0002 | 0.803 ± 0.010 |
| VIB, β=1e-3, D | 54.5 ± 0.3 | 1.53 ± 0.01 | 0.0641 ± 0.0002 | 0.754 ± 0.009 |
| MASS, β=1e-2 | 45.4 ± 1.4 | 6.85 ± 0.26 | 0.0979 ± 0.0027 | 0.207 ± 0.007 |
| MASS, β=1e-3 | 47.0 ± 0.8 | 5.85 ± 0.24 | 0.0943 ± 0.0019 | 0.218 ± 0.007 |
| MASS, β=1e-4 | 47.1 ± 1.1 | 5.71 ± 0.25 | 0.0942 ± 0.0025 | 0.219 ± 0.006 |
| MASS, β=0 | 47.0 ± 0.6 | 5.67 ± 0.28 | 0.0945 ± 0.0019 | 0.221 ± 0.004 |
| MASS, β=1e-2, D | 51.9 ± 0.5 | 1.60 ± 0.03 | 0.0662 ± 0.0004 | 0.846 ± 0.025 |
| MASS, β=1e-3, D | 53.0 ± 0.5 | 1.56 ± 0.02 | 0.0648 ± 0.0008 | 0.812 ± 0.017 |
| MASS, β=1e-4, D | 52.9 ± 0.6 | 1.55 ± 0.02 | 0.0646 ± 0.0005 | 0.831 ± 0.020 |
| MASS, β=0, D | 52.7 ± 0.4 | 1.55 ± 0.02 | 0.0648 ± 0.0004 | 0.832 ± 0.012 |

Table 4. Uncertainty quantification metrics (proper scoring rules) on CIFAR-10 using the ResNet20 network trained on 40,000 datapoints. Values are the mean over 4 training runs with different random seeds, plus or minus the standard deviation. Emboldened values in the original table are those within one standard deviation of the minimum observed mean value in the column. Lower values are better.

| Method | Test Accuracy | NLL | Brier Score | Entropy |
|---|---|---|---|---|
| Softmax CE | 67.8 ± 2.7 | 1.98 ± 0.15 | 0.0546 ± 0.0043 | 0.209 ± 0.021 |
| VIB, β=1e-3 | 66.0 ± 0.6 | 2.28 ± 0.12 | 0.0577 ± 0.0011 | 0.210 ± 0.004 |
| VIB, β=1e-4 | 67.1 ± 0.6 | 2.23 ± 0.07 | 0.0563 ± 0.0010 | 0.196 ± 0.003 |
| VIB, β=1e-5 | 67.8 ± 0.6 | 2.35 ± 0.11 | 0.0559 ± 0.0012 | 0.175 ± 0.003 |
| VIB, β=0 | 68.0 ± 0.1 | 2.45 ± 0.05 | 0.0558 ± 0.0003 | 0.167 ± 0.003 |
| MASS, β=1e-3 | 67.1 ± 0.5 | 1.77 ± 0.03 | 0.0555 ± 0.0010 | 0.227 ± 0.006 |
| MASS, β=1e-4 | 68.9 ± 1.1 | 1.91 ± 0.07 | 0.0533 ± 0.0018 | 0.193 ± 0.011 |
| MASS, β=1e-5 | 69.5 ± 0.6 | 1.96 ± 0.05 | 0.0522 ± 0.0011 | 0.188 ± 0.007 |
| MASS, β=0 | 69.0 ± 0.8 | 2.00 ± 0.08 | 0.0528 ± 0.0015 | 0.190 ± 0.003 |

Table 5. Out-of-distribution detection metrics on CIFAR-10 with SVHN digits as the out-of-distribution examples, using the ResNet20 network trained on 40,000 datapoints. Values are the mean over 4 training runs with different random seeds, plus or minus the standard deviation. Emboldened values in the original table are those within one standard deviation of the maximum observed mean value in the column. Higher values are better.
| Training Method | Test Accuracy | Detection Method | AUROC | APR In | APR Out |
|---|---|---|---|---|---|
| Softmax CE | 67.8 ± 2.7 | Entropy | 0.62 ± 0.01 | 0.66 ± 0.02 | 0.57 ± 0.01 |
|  |  | $\max_i q_\phi(f_\theta(x) \mid y_i)$ | 0.72 ± 0.02 | 0.73 ± 0.03 | 0.70 ± 0.03 |
| VIB, β=1e-3 | 66.0 ± 0.6 | Entropy | 0.57 ± 0.01 | 0.60 ± 0.01 | 0.53 ± 0.01 |
|  |  | Rate | 0.71 ± 0.03 | 0.71 ± 0.03 | 0.69 ± 0.02 |
| VIB, β=1e-4 | 67.1 ± 0.6 | Entropy | 0.57 ± 0.02 | 0.59 ± 0.03 | 0.53 ± 0.01 |
|  |  | Rate | 0.72 ± 0.04 | 0.71 ± 0.05 | 0.70 ± 0.04 |
| VIB, β=1e-5 | 67.8 ± 0.6 | Entropy | 0.56 ± 0.04 | 0.58 ± 0.05 | 0.53 ± 0.02 |
|  |  | Rate | 0.68 ± 0.01 | 0.68 ± 0.01 | 0.64 ± 0.02 |
| VIB, β=0 | 68.0 ± 0.1 | Entropy | 0.60 ± 0.03 | 0.63 ± 0.04 | 0.55 ± 0.02 |
|  |  | Rate | 0.61 ± 0.03 | 0.60 ± 0.02 | 0.57 ± 0.04 |
| MASS, β=1e-3 | 67.1 ± 0.5 | Entropy | 0.63 ± 0.02 | 0.68 ± 0.02 | 0.57 ± 0.02 |
|  |  | $\max_i q_\phi(f_\theta(x) \mid y_i)$ | 0.69 ± 0.02 | 0.68 ± 0.02 | 0.68 ± 0.02 |
| MASS, β=1e-4 | 68.9 ± 1.1 | Entropy | 0.64 ± 0.01 | 0.69 ± 0.01 | 0.58 ± 0.01 |
|  |  | $\max_i q_\phi(f_\theta(x) \mid y_i)$ | 0.74 ± 0.01 | 0.73 ± 0.01 | 0.72 ± 0.02 |
| MASS, β=1e-5 | 69.5 ± 0.6 | Entropy | 0.64 ± 0.01 | 0.68 ± 0.01 | 0.58 ± 0.01 |
|  |  | $\max_i q_\phi(f_\theta(x) \mid y_i)$ | 0.76 ± 0.04 | 0.75 ± 0.04 | 0.75 ± 0.04 |
| MASS, β=0 | 69.0 ± 0.8 | Entropy | 0.65 ± 0.01 | 0.69 ± 0.02 | 0.59 ± 0.01 |
|  |  | $\max_i q_\phi(f_\theta(x) \mid y_i)$ | 0.76 ± 0.03 | 0.76 ± 0.03 | 0.75 ± 0.03 |

[Figure 2. Value of each term in the MASS Learning loss function, $\mathcal{L}_{\mathrm{MASS}}(f) = H(Y \mid f(X)) + \beta H(f(X)) - \beta \mathbb{E}_X[\log J_f(X)]$, during training of the SmallMLP network on the CIFAR-10 dataset. The MASS training was performed with β = 0.001, though the plotted values are for the terms without being multiplied by the β coefficients. The values of these terms for Softmax CE training are estimated using a variational distribution $q_\phi(x \mid y)$, the parameters of which were estimated at each timestep by MLE over the training data.]

References

Achille, A. and Soatto, S. Emergence of invariance and disentanglement in deep representations. In 2018 Information Theory and Applications Workshop (ITA), pp. 1-9, February 2018a. doi: 10.1109/ITA.2018.8503149.

Achille, A. and Soatto, S. Information Dropout: Learning optimal representations through noisy computation, 2018b.

Adragni, K. P. and Cook, R. D. Sufficient dimension reduction and prediction in regression. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906):4385-4405, November 2009. doi: 10.1098/rsta.2009.0110.

Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep variational information bottleneck. arXiv:1612.00410, December 2016.

Alemi, A. A., Fischer, I., and Dillon, J. V. Uncertainty in the variational information bottleneck. arXiv:1807.00906, July 2018.

Amjad, R. A. and Geiger, B. C. Learning representations for neural network-based classification using the information bottleneck principle. arXiv:1802.09766, February 2018.

Bell, A. J. and Sejnowski, T. J. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129-1159, November 1995.

Clevert, D.-A., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv:1511.07289, November 2015.

Cover, T. M. and Thomas, J. A. Elements of Information Theory. Wiley-Interscience, Hoboken, NJ, 2nd edition, 2006.

Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. arXiv:1410.8516, October 2014.
Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. arXiv:1605.08803, May 2016.

Dynkin, E. B. Necessary and sufficient statistics for a family of probability distributions. Uspekhi Mat. Nauk, 6(1):68-90, 1951.

Federer, H. Geometric Measure Theory. Springer, New York, NY, 1969.

Goldfeld, Z., Berg, E. v. d., Greenewald, K., Melnyk, I., Nguyen, N., Kingsbury, B., and Polyanskiy, Y. Estimating information flow in neural networks. arXiv:1810.05728, October 2018.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. arXiv:1512.03385, December 2015.

Hendrycks, D. and Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv:1610.02136, October 2016.

Jakubovitz, D. and Giryes, R. Improving DNN robustness to adversarial attacks using Jacobian regularization. arXiv:1803.08680, March 2018.

James, R. G., Mahoney, J. R., and Crutchfield, J. P. Trimming the independent fat: Sufficient statistics, mutual information, and predictability from effective channel states. Physical Review E, 95(6), June 2017. doi: 10.1103/PhysRevE.95.060102.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv:1412.6980, December 2014.

Kolchinsky, A., Tracey, B. D., and Wolpert, D. H. Nonlinear information bottleneck. arXiv:1705.02436, May 2017.

Kolchinsky, A., Tracey, B. D., and Van Kuyk, S. Caveats for information bottleneck in deterministic scenarios. arXiv:1808.07593, August 2018.

Koliander, G., Pichler, G., Riegler, E., and Hlawatsch, F. Entropy and source coding for integer-dimensional singular random variables. IEEE Transactions on Information Theory, 62(11):6124-6154, November 2016. doi: 10.1109/TIT.2016.2604248.

Krantz, S. G. and Parks, H. R. Geometric Integration Theory. Birkhäuser, Basel, Switzerland, 2009.

Krizhevsky, A. Learning multiple layers of features from tiny images. 2009.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv:1612.01474, December 2016.

Lehmann, E. L. and Scheffé, H. Completeness, similar regions, and unbiased estimation: Part I. Sankhyā: The Indian Journal of Statistics, 10(4):305-340, 1950.

Nash, C., Kushman, N., and Williams, C. K. I. Inverting supervised representations with autoregressive neural density models. arXiv:1806.00400, June 2018.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. 2011.
Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Sensitivity and generalization in neural networks: An empirical study. arXiv:1802.08760, February 2018.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In NIPS-W, 2017.

Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. arXiv:1505.05770, May 2015.

Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X. Higher order contractive auto-encoder. In Gunopulos, D., Hofmann, T., Malerba, D., and Vazirgiannis, M. (eds.), Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, pp. 645-660. Springer Berlin Heidelberg, 2011.

Ross, A. S. and Doshi-Velez, F. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. arXiv:1711.09404, November 2017.

Saxe, A. M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B. D., and Cox, D. D. On the information bottleneck theory of deep learning. February 2018. URL https://openreview.net/forum?id=ry_WPG-A-.

Shamir, O., Sabato, S., and Tishby, N. Learning and generalization with the information bottleneck. Theoretical Computer Science, 411(29):2696-2711, June 2010. doi: 10.1016/j.tcs.2010.04.006.

Shwartz-Ziv, R. and Tishby, N. Opening the black box of deep neural networks via information. arXiv:1703.00810, March 2017.

Sokolic, J., Giryes, R., Sapiro, G., and Rodrigues, M. R. D. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265-4280, August 2017. doi: 10.1109/TSP.2017.2708039.

Strouse, D. J. and Schwab, D. J. The deterministic information bottleneck. arXiv:1604.00268, April 2016.

Tishby, N. and Zaslavsky, N. Deep learning and the information bottleneck principle. arXiv:1503.02406, March 2015.

Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. arXiv:physics/0004057, April 2000.

Varga, D., Csiszárik, A., and Zombori, Z. Gradient regularization improves accuracy of discriminative models. arXiv:1712.09936, December 2017.