# Bayesian Compression for Deep Learning

Christos Louizos (University of Amsterdam, TNO Intelligent Imaging), c.louizos@uva.nl
Karen Ullrich (University of Amsterdam), k.ullrich@uva.nl
Max Welling (University of Amsterdam, CIFAR: Canadian Institute For Advanced Research), m.welling@uva.nl

## Abstract

Compression and computational efficiency in deep learning have become a problem of great significance. In this work, we argue that the most principled and effective way to attack this problem is by adopting a Bayesian point of view, where through sparsity inducing priors we prune large parts of the network. We introduce two novelties in this paper: 1) we use hierarchical priors to prune nodes instead of individual weights, and 2) we use the posterior uncertainties to determine the optimal fixed point precision to encode the weights. Both factors significantly contribute to achieving the state of the art in terms of compression rates, while still staying competitive with methods designed to optimize for speed or energy efficiency.

## 1 Introduction

While deep neural networks have become extremely successful in a wide range of applications, often exceeding human performance, they remain difficult to apply in many real world scenarios. For instance, making billions of predictions per day comes with substantial energy costs given the energy consumption of common Graphical Processing Units (GPUs). Also, real-time predictions are often about a factor 100 away in terms of speed from what deep NNs can deliver, and sending NNs with millions of parameters through band limited channels is still impractical. As a result, running them on hardware limited devices such as smart phones, robots or cars requires substantial improvements on all of these issues. For all those reasons, compression and efficiency have become a topic of interest in the deep learning community.

While all of these issues are certainly related, compression and performance optimizing procedures might not always be aligned. As an illustration, consider the convolutional layers of AlexNet, which account for only 4% of the parameters but 91% of the computation [65]. Compressing these layers will not contribute much to the overall memory footprint.

There is a variety of approaches to address these problem settings. However, most methods have the common strategy of reducing both the neural network structure and the effective fixed point precision for each weight. A justification for the former is the finding that NNs suffer from significant parameter redundancy [14]. Methods in this line of thought are network pruning, where unnecessary connections are removed [38, 24, 21], or student-teacher learning, where a large network is used to train a significantly smaller network [5, 26].

From a Bayesian perspective, network pruning and reducing bit precision for the weights are aligned with achieving high accuracy, because Bayesian methods search for the optimal model structure (which leads to pruning with sparsity inducing priors), and reward uncertain posteriors over parameters through the bits-back argument [27] (which leads to removing insignificant bits). This relation is made explicit in the MDL principle [20], which is known to be related to Bayesian inference.
In this paper we will use the variational Bayesian approximation for Bayesian inference, which has also been explicitly interpreted in terms of model compression [27]. By employing sparsity inducing priors for hidden units (and not individual weights) we can prune neurons including all their ingoing and outgoing weights. This avoids more complicated and inefficient coding schemes needed for pruning or vector quantizing individual weights. As an additional Bayesian bonus we can use the variational posterior uncertainty to assess which bits are significant and remove the ones which fluctuate too much under approximate posterior sampling. From this we derive the optimal fixed point precision per layer, which is still practical on chip.

## 2 Variational Bayes and Minimum Description Length

A fundamental theorem in information theory is the minimum description length (MDL) principle [20]. It relates to compression directly in that it defines the best hypothesis to be the one that communicates the sum of the model (complexity cost $\mathcal{L}^C$) and the data misfit (error cost $\mathcal{L}^E$) with the minimum number of bits [57, 58]. It is well understood that variational inference can be reinterpreted from an MDL point of view [54, 69, 27, 29, 19]. More specifically, assume that we are presented with a dataset $\mathcal{D}$ that consists of $N$ input-output pairs $\{(x_1, y_1), \dots, (x_N, y_N)\}$. Let $p(\mathcal{D}|\mathbf{w}) = \prod_{i=1}^N p(y_i|x_i, \mathbf{w})$ be a parametric model, e.g. a deep neural network, that maps inputs $x$ to their corresponding outputs $y$ using parameters $\mathbf{w}$ governed by a prior distribution $p(\mathbf{w})$. In this scenario, we wish to approximate the intractable posterior distribution $p(\mathbf{w}|\mathcal{D}) = p(\mathcal{D}|\mathbf{w})p(\mathbf{w})/p(\mathcal{D})$ with a fixed form approximate posterior $q_\phi(\mathbf{w})$ by optimizing the variational parameters $\phi$ according to:

$$\mathcal{L}(\phi) = \underbrace{\mathbb{E}_{q_\phi(\mathbf{w})}\big[\log p(\mathcal{D}|\mathbf{w})\big]}_{\mathcal{L}^E} + \underbrace{\mathbb{E}_{q_\phi(\mathbf{w})}\big[\log p(\mathbf{w})\big] + \mathcal{H}(q_\phi(\mathbf{w}))}_{\mathcal{L}^C}, \qquad (1)$$

where $\mathcal{H}(\cdot)$ denotes the entropy and $\mathcal{L}(\phi)$ is known as the evidence lower bound (ELBO) or negative variational free energy. As indicated in eq. 1, $\mathcal{L}(\phi)$ naturally decomposes into a minimum cost for communicating the targets $\{y_n\}_{n=1}^N$ under the assumption that the sender and receiver agreed on a prior $p(\mathbf{w})$ and that the receiver knows the inputs $\{x_n\}_{n=1}^N$ and the form of the parametric model.

By using sparsity inducing priors for groups of weights that feed into a neuron, the Bayesian mechanism will start pruning hidden units that are not strictly necessary for prediction, thus achieving compression. But there is also a second mechanism by which Bayes can help us compress. By explicitly entertaining noisy weight encodings through $q_\phi(\mathbf{w})$ we can benefit from the bits-back argument [27, 29] due to the entropy term; this is in contrast to infinitely precise weights that lead to $\mathcal{H}(\delta(\mathbf{w})) = -\infty$ (in practice this term is a large constant determined by the weight precision). Nevertheless, in practice the data misfit term $\mathcal{L}^E$ is intractable for neural network models under a noisy weight encoding, so as a solution Monte Carlo integration is usually employed. Continuous $q_\phi(\mathbf{w})$ allow for the reparametrization trick [34, 56]. Here, we replace sampling from $q_\phi(\mathbf{w})$ by a deterministic function of the variational parameters $\phi$ and random samples from some noise variables $\epsilon$:

$$\mathcal{L}(\phi) = \mathbb{E}_{p(\epsilon)}\big[\log p(\mathcal{D}|f(\phi, \epsilon))\big] + \mathbb{E}_{q_\phi(\mathbf{w})}\big[\log p(\mathbf{w})\big] + \mathcal{H}(q_\phi(\mathbf{w})), \qquad (2)$$

where $\mathbf{w} = f(\phi, \epsilon)$. By applying this trick, we obtain unbiased stochastic gradients of the ELBO with respect to the variational parameters $\phi$, thus resulting in a standard optimization problem that is fit for stochastic gradient ascent. The efficiency of the gradient estimator resulting from eq. 2 can be further improved for neural networks by utilizing local reparametrizations [35] (which we will use in our experiments); they provide variance reduction in an efficient way by locally marginalizing the weights at each layer and instead sampling the distribution of the pre-activations.
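To make the two estimators concrete, here is a minimal NumPy sketch, not the paper's implementation, contrasting the plain reparametrization trick of eq. 2 with the local reparametrization of [35] for a single fully connected layer; the layer sizes, initial values and random seed are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Factorized Gaussian posterior q_phi(W) = N(mu, sigma^2) over a 784 x 300 weight matrix.
mu = rng.normal(scale=0.05, size=(784, 300))
log_sigma2 = np.full((784, 300), -10.0)          # variational parameters phi = (mu, log sigma^2)
sigma = np.exp(0.5 * log_sigma2)

x = rng.normal(size=(32, 784))                   # a mini-batch of inputs

# (a) Plain reparametrization trick: w = f(phi, eps) = mu + sigma * eps, with eps ~ N(0, I).
eps = rng.standard_normal(mu.shape)
w_sample = mu + sigma * eps
pre_act_a = x @ w_sample                         # one Monte Carlo sample of the pre-activations

# (b) Local reparametrization: marginalize W at this layer and sample the
# pre-activations directly; for independent Gaussian weights they are Gaussian
# with mean x @ mu and variance x^2 @ sigma^2, which lowers gradient variance.
m = x @ mu
v = (x ** 2) @ (sigma ** 2)
pre_act_b = m + np.sqrt(v) * rng.standard_normal(m.shape)
```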
## 3 Related Work

One of the earliest ideas and most direct approaches to tackle efficiency is pruning. Originally introduced by [38], pruning has recently been demonstrated to be applicable to modern architectures [25, 21]. It has been demonstrated that an overwhelming amount of up to 99.5% of parameters can be pruned in common architectures. There have been quite a few encouraging results obtained by (empirical) Bayesian approaches that employ weight pruning [19, 7, 50, 67, 49]. Nevertheless, weight pruning is in general inefficient for compression since the matrix format of the weights is not taken into consideration, and therefore the Compressed Sparse Column (CSC) format has to be employed. Moreover, note that in conventional CNNs most flops are used by the convolution operation. Inspired by this observation, several authors proposed pruning schemes that take these considerations into account [70, 71] or even go as far as efficiency aware architectures to begin with [31, 15, 30]. From the Bayesian viewpoint, similar pruning schemes have been explored at [45, 51, 37, 33].

Given an optimal architecture, NNs can further be compressed by quantization. More precisely, there are two common techniques. First, the set of accessible weights can be reduced drastically. As an extreme example, [13, 46, 55, 72] and [11] trained NNs to use only binary or ternary weights with floating point gradients. This approach, however, requires significantly more parameters than its ordinary counterparts. Work by [18] explores various techniques beyond binary quantization: k-means quantization, product quantization and residual quantization. Later studies extend this set to optimal fixed point [42] and hashing quantization [10]. [25] apply k-means clustering and subsequent center training. From a practical point of view, however, all of these are fairly impractical at test time: for the computation of each feature map in a net, the original weight matrix must be reconstructed from the indexes in the matrix and a codebook that contains all the original weights. This is an expensive operation, which is why some studies propose a different approach than set quantization. Precision quantization simply reduces the bit size per weight. This has a great advantage over set quantization at inference time, since feature maps can simply be computed with lower precision weights. Several studies show that this has little to no effect on network accuracy when using 16-bit weights [47, 22, 12, 68, 9]. Somewhat orthogonal to the above discussion, but certainly relevant, are approaches that customize the implementation of CNNs for hardware limited devices [30, 4, 60].

## 4 Bayesian compression with scale mixtures of normals

Consider the following prior over a parameter $w$ where its scale $z$ is governed by a distribution $p(z)$:

$$z \sim p(z); \qquad w \sim \mathcal{N}(w; 0, z^2), \qquad (3)$$

with $z^2$ serving as the variance of the zero-mean normal distribution over $w$. By treating the scales of $w$ as random variables we can recover marginal prior distributions over the parameters that have heavier tails and more mass at zero; this subsequently biases the posterior distribution over $w$ to be sparse.
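As a quick illustration of why such priors favour sparsity, the following NumPy snippet (purely illustrative; the half-Cauchy mixing density used here is just one of the scale priors discussed below) samples from a scale mixture of normals and compares its marginal to a plain Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Scale mixture of normals: z ~ p(z), w | z ~ N(0, z^2).
# Here p(z) is a standard half-Cauchy, one of the choices considered later (horseshoe marginal).
z = np.abs(rng.standard_cauchy(n))               # z ~ C+(0, 1)
w_mixture = rng.normal(loc=0.0, scale=z)         # w | z ~ N(0, z^2)
w_gauss = rng.normal(loc=0.0, scale=1.0, size=n) # plain Gaussian for comparison

# The marginal p(w) has much more mass near zero and far heavier tails than a
# Gaussian of comparable scale, which is what biases the posterior towards sparsity.
print("P(|w| < 0.05):", np.mean(np.abs(w_mixture) < 0.05), "vs", np.mean(np.abs(w_gauss) < 0.05))
print("P(|w| > 10)  :", np.mean(np.abs(w_mixture) > 10), "vs", np.mean(np.abs(w_gauss) > 10))
```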
This family of distributions is known as scale-mixtures of normals [6, 2] and it is quite general, as many well known sparsity inducing distributions are special cases.

One example of the aforementioned framework is the spike-and-slab distribution [48], the gold standard for sparse Bayesian inference. Under the spike-and-slab, the mixing density of the scales is a Bernoulli distribution, thus the marginal $p(w)$ has a delta spike at zero and a continuous slab over the real line. Unfortunately, this prior leads to computationally expensive inference since we have to explore a space of $2^M$ models, where $M$ is the number of model parameters. Dropout [28, 64], one of the most popular regularization techniques for neural networks, can be interpreted as positing a spike and slab distribution over the weights where the variance of the slab is zero [17, 43]. Another example is the Laplace distribution, which arises by considering $p(z^2) = \mathrm{Exp}(\lambda)$. The mode of the posterior distribution under a Laplace prior is known as the Lasso [66] estimator and has been previously used for sparsifying neural networks at [70, 59]. While computationally simple, the Lasso estimator is prone to "shrinking" large signals [8] and only provides point estimates about the parameters. As a result it does not provide uncertainty estimates, it can potentially overfit and, according to the bits-back argument, is inefficient for compression. For these reasons, in this paper we will tackle the problem of compression and efficiency in neural networks by adopting a Bayesian treatment and inferring an approximate posterior distribution over the parameters under a scale mixture prior. We will consider two choices for the prior over the scales $p(z)$: the hyperparameter free log-uniform prior [16, 35] and the half-Cauchy prior, which results in a horseshoe [8] distribution. Both of these distributions correspond to a continuous relaxation of the spike-and-slab prior and we provide a brief discussion on their shrinkage properties at Appendix C.

### 4.1 Reparametrizing variational dropout for group sparsity

One potential choice for $p(z)$ is the improper log-uniform prior [35]: $p(z) \propto |z|^{-1}$. It turns out that we can recover the log-uniform prior over the weights $w$ if we marginalize over the scales $z$:

$$p(w) \propto \int \frac{1}{|z|}\,\mathcal{N}(w|0, z^2)\,dz = \frac{1}{|w|}. \qquad (4)$$

This alternative parametrization of the log-uniform prior is known in the statistics literature as the normal-Jeffreys prior and has been introduced by [16]. This formulation allows us to "couple" the scales of weights that belong to the same group (e.g. neuron or feature map) by simply sharing the corresponding scale variable $z_i$ in the joint prior (strictly speaking, the result of eq. 4 only holds when each weight has its own scale and not when that scale is shared across multiple weights; nevertheless, in practice we obtain a prior that behaves in a similar way, i.e. it biases the variational posterior to be sparse):

$$z_i \sim p(z_i); \qquad \mathbf{W} \sim \prod_{i}^{A}\prod_{j}^{B} \mathcal{N}(w_{ij}|0, z_i^2), \qquad (5)$$

where $\mathbf{W}$ is the weight matrix of a fully connected neural network layer, with $A$ being the dimensionality of the input and $B$ the dimensionality of the output. Now consider performing variational inference with a joint approximate posterior parametrized as follows:

$$q_\phi(\mathbf{W}, \mathbf{z}) = \prod_{i=1}^{A} \mathcal{N}(z_i|\mu_{z_i}, \mu_{z_i}^2\alpha_i)\prod_{i,j} \mathcal{N}(w_{ij}|z_i\mu_{ij}, z_i^2\sigma_{ij}^2), \qquad (6)$$

where $\alpha_i$ is the dropout rate [64, 35, 49] of the given group. As explained at [35, 49], the multiplicative parametrization of the approximate posterior over $\mathbf{z}$ suffers from high variance gradients; therefore we will follow [49] and re-parametrize it in terms of $\sigma_{z_i}^2 = \mu_{z_i}^2\alpha_i$, and hence optimize w.r.t. $\sigma_{z_i}^2$.
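For concreteness, a minimal NumPy sketch of sampling from the joint approximate posterior of eq. 6 in this non-centered, additive parametrization might look as follows; the layer sizes, initial values and variable names are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = 784, 300                                  # input / output dimensionality of the layer

# Variational parameters for the joint posterior of eq. 6; following the additive
# parametrization we keep sigma_z^2 directly instead of mu_z^2 * alpha.
mu_z, log_sigma2_z = np.ones(A), np.full(A, -6.0)                         # q(z_i) = N(mu_zi, sigma_zi^2)
mu_w, log_sigma2_w = rng.normal(scale=0.05, size=(A, B)), np.full((A, B), -10.0)

# Non-centered sample: w_ij = z_i * (mu_ij + sigma_ij * eps_ij), which matches
# q(w_ij | z_i) = N(z_i * mu_ij, z_i^2 * sigma_ij^2).
z = mu_z + np.exp(0.5 * log_sigma2_z) * rng.standard_normal(A)
w_tilde = mu_w + np.exp(0.5 * log_sigma2_w) * rng.standard_normal((A, B))
W = z[:, None] * w_tilde

# Implied dropout rate per group (one per input neuron / row), used later for pruning.
log_alpha = log_sigma2_z - np.log(mu_z ** 2)
```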
The lower bound under this prior and approximate posterior becomes:

$$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(\mathbf{z})q_\phi(\mathbf{W}|\mathbf{z})}\big[\log p(\mathcal{D}|\mathbf{W})\big] - \mathbb{E}_{q_\phi(\mathbf{z})}\big[KL(q_\phi(\mathbf{W}|\mathbf{z})\|p(\mathbf{W}|\mathbf{z}))\big] - KL(q_\phi(\mathbf{z})\|p(\mathbf{z})). \qquad (7)$$

Under this particular variational posterior parametrization the negative KL-divergence from the conditional prior $p(\mathbf{W}|\mathbf{z})$ to the approximate posterior $q_\phi(\mathbf{W}|\mathbf{z})$ is independent of $\mathbf{z}$:

$$KL(q_\phi(\mathbf{W}|\mathbf{z})\|p(\mathbf{W}|\mathbf{z})) = \frac{1}{2}\sum_{i,j}\left(\log\frac{z_i^2}{z_i^2\sigma_{ij}^2} + \frac{z_i^2\sigma_{ij}^2}{z_i^2} + \frac{z_i^2\mu_{ij}^2}{z_i^2} - 1\right). \qquad (8)$$

This independence can be better understood if we consider a non-centered parametrization of the prior [53]. More specifically, consider reparametrizing the weights as $\tilde{w}_{ij} = \frac{w_{ij}}{z_i}$; this will then result in $p(\mathbf{W}|\mathbf{z})p(\mathbf{z}) = p(\tilde{\mathbf{W}})p(\mathbf{z})$, where $p(\tilde{\mathbf{W}}) = \prod_{i,j}\mathcal{N}(\tilde{w}_{ij}|0, 1)$ and $\mathbf{W} = \mathrm{diag}(\mathbf{z})\tilde{\mathbf{W}}$. Now if we perform variational inference under the $p(\tilde{\mathbf{W}})p(\mathbf{z})$ prior with an approximate posterior that has the form of $q_\phi(\tilde{\mathbf{W}}, \mathbf{z}) = q_\phi(\tilde{\mathbf{W}})q_\phi(\mathbf{z})$, with $q_\phi(\tilde{\mathbf{W}}) = \prod_{i,j}\mathcal{N}(\tilde{w}_{ij}|\mu_{ij}, \sigma_{ij}^2)$, then we see that we arrive at the same expressions for the negative KL-divergence from the prior to the approximate posterior. Finally, the negative KL-divergence from the normal-Jeffreys scale prior $p(\mathbf{z})$ to the Gaussian variational posterior $q_\phi(\mathbf{z})$ depends only on the implied dropout rate, $\alpha_i = \sigma_{z_i}^2/\mu_{z_i}^2$, and takes the following form [49]:

$$-KL(q_\phi(\mathbf{z})\|p(\mathbf{z})) \approx \sum_{i}\big(k_1\,\sigma(k_2 + k_3\log\alpha_i) - 0.5\,m(-\log\alpha_i) - k_1\big), \qquad (9)$$

where $\sigma(\cdot)$ and $m(\cdot)$ are the sigmoid and softplus functions, i.e. $\sigma(x) = (1 + \exp(-x))^{-1}$ and $m(x) = \log(1 + \exp(x))$, and $k_1 = 0.63576$, $k_2 = 1.87320$, $k_3 = 1.48695$. We can now prune entire groups of parameters by simply specifying a threshold for the variational dropout rate of the corresponding group, e.g. $\log\alpha_i = (\log\sigma_{z_i}^2 - \log\mu_{z_i}^2) \geq t$. It should be mentioned that this prior parametrization readily allows for a more flexible marginal posterior over the weights, as we now have a compound distribution, $q_\phi(\mathbf{W}) = \int q_\phi(\mathbf{W}|\mathbf{z})q_\phi(\mathbf{z})d\mathbf{z}$; this is in contrast to the original parametrization and the Gaussian approximations employed by [35, 49]. Furthermore, this approach generalizes the low variance additive parametrization of variational dropout proposed for weight sparsity at [49] to group sparsity (which was left as an open question at [49]) in a principled way.

At test time, in order to have a single feedforward pass, we replace the distribution over $\mathbf{W}$ at each layer with a single weight matrix, the masked variational posterior mean:

$$\hat{\mathbf{W}} = \mathrm{diag}(\mathbf{m})\,\mathbb{E}_{q(\mathbf{z})q(\tilde{\mathbf{W}})}\big[\mathrm{diag}(\mathbf{z})\tilde{\mathbf{W}}\big] = \mathrm{diag}\big(\mathbf{m}\odot\boldsymbol{\mu}_z\big)\,\mathbf{M}_{\tilde{W}}, \qquad (10)$$

where $\mathbf{m}$ is a binary mask determined according to the group variational dropout rate and $\mathbf{M}_{\tilde{W}}$ are the means of $q_\phi(\tilde{\mathbf{W}})$. We further use the variational posterior marginal variances for this particular posterior approximation:

$$\mathbb{V}(w_{ij})_{NJ} = \sigma_{z_i}^2\big(\sigma_{ij}^2 + \mu_{ij}^2\big) + \sigma_{ij}^2\mu_{z_i}^2, \qquad (11)$$

to assess the bit precision of each weight in the weight matrix (the marginal variance follows from $\mathbb{V}(w_{ij}) = \mathbb{V}(z_i\tilde{w}_{ij}) = \mathbb{V}(z_i)\big(\mathbb{E}[\tilde{w}_{ij}]^2 + \mathbb{V}(\tilde{w}_{ij})\big) + \mathbb{V}(\tilde{w}_{ij})\mathbb{E}[z_i]^2$). More specifically, we employ the mean variance across the weight matrix $\hat{\mathbf{W}}$ to compute the unit round off necessary to represent the weights. This method gives us the number of significant bits, and by adding 3 exponent and 1 sign bits we arrive at the final bit precision for the entire weight matrix $\hat{\mathbf{W}}$. Notice that the fact that we are using mean-field variational approximations (which we chose for simplicity) can potentially underestimate the variance, and thus lead to higher bit precisions for the weights; we leave the exploration of more involved posteriors for future work. We provide more details at Appendix B.
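The pruning and quantization steps of eqs. 9-11 can be sketched as follows. This is a hedged illustration: the helper names are hypothetical, the threshold value is arbitrary, and the exact unit-round-off rule lives in Appendix B of the paper, so the `unit_round_off` line is an assumption on our part.

```python
import numpy as np

def neg_kl_z(log_alpha):
    """Approximate -KL(q(z) || p(z)) for the normal-Jeffreys prior (eq. 9)."""
    k1, k2, k3 = 0.63576, 1.87320, 1.48695
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    softplus = lambda x: np.log1p(np.exp(x))
    return k1 * sigmoid(k2 + k3 * log_alpha) - 0.5 * softplus(-log_alpha) - k1

def prune_and_quantize(mu_z, sigma2_z, mu_w, sigma2_w, threshold=3.0):
    """Group pruning (eq. 10) and a rough bit-width estimate from eq. 11.

    Sketch only: the significant-bit rule below (round-off ~ posterior std of the
    kept weights) is an assumed reading of the text; the paper's recipe is in Appendix B.
    """
    log_alpha = np.log(sigma2_z) - np.log(mu_z ** 2)
    mask = log_alpha <= threshold                          # keep groups with a low dropout rate
    W_hat = (mask * mu_z)[:, None] * mu_w                  # masked posterior mean, eq. 10

    # Marginal posterior variance of each weight, eq. 11.
    var_w = sigma2_z[:, None] * (sigma2_w + mu_w ** 2) + sigma2_w * mu_z[:, None] ** 2
    unit_round_off = np.sqrt(var_w[mask].mean())           # assumed unit-round-off rule
    significant_bits = max(int(np.ceil(np.log2(1.0 / unit_round_off))), 1)
    return W_hat, mask, significant_bits + 3 + 1           # + 3 exponent and 1 sign bit
```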
### 4.2 Group horseshoe with half-Cauchy scale priors

Another choice for $p(z)$ is a proper half-Cauchy distribution, $C^+(0, s) = 2\big(s\pi(1 + (z/s)^2)\big)^{-1}$; it induces a horseshoe prior [8] distribution over the weights, which is a well known sparsity inducing prior in the statistics literature. More formally, the prior hierarchy over the weights is expressed as (in a non-centered parametrization):

$$s \sim C^+(0, \tau_0); \qquad \tilde{z}_i \sim C^+(0, 1); \qquad \tilde{w}_{ij} \sim \mathcal{N}(0, 1); \qquad w_{ij} = \tilde{w}_{ij}\tilde{z}_i s, \qquad (12)$$

where $\tau_0$ is the free parameter that can be tuned for specific desiderata. The idea behind the horseshoe is that of "global-local" shrinkage; the global scale variable $s$ pulls all of the variables towards zero, whereas the heavy tailed local variables $\tilde{z}_i$ can compensate and allow for some weights to escape. Instead of directly working with the half-Cauchy priors we will employ a decomposition of the half-Cauchy that relies upon (inverse) gamma distributions [52], as this will allow us to compute the negative KL-divergence from the scale prior $p(\mathbf{z})$ to an approximate log-normal scale posterior $q_\phi(\mathbf{z})$ in closed form (the derivation is given in Appendix D). More specifically, we have that the half-Cauchy prior can be expressed in a non-centered parametrization as:

$$p(\tilde{\beta}) = \mathcal{IG}(0.5, 1); \qquad p(\tilde{\alpha}) = \mathcal{G}(0.5, k^2); \qquad z^2 = \tilde{\alpha}\tilde{\beta}, \qquad (13)$$

where $\mathcal{IG}(\cdot,\cdot)$ and $\mathcal{G}(\cdot,\cdot)$ correspond to the inverse Gamma and Gamma distributions in the scale parametrization, and $z$ follows a half-Cauchy distribution with scale $k$. Therefore we will re-express the whole hierarchy as:

$$s_b \sim \mathcal{IG}(0.5, 1); \quad s_a \sim \mathcal{G}(0.5, \tau_0^2); \quad \tilde{\beta}_i \sim \mathcal{IG}(0.5, 1); \quad \tilde{\alpha}_i \sim \mathcal{G}(0.5, 1); \quad \tilde{w}_{ij} \sim \mathcal{N}(0, 1); \quad w_{ij} = \tilde{w}_{ij}\sqrt{s_a s_b \tilde{\alpha}_i \tilde{\beta}_i}. \qquad (14)$$

It should be mentioned that the improper log-uniform prior is the limiting case of the horseshoe prior when the shapes of the (inverse) Gamma hyperpriors on $\tilde{\alpha}_i$, $\tilde{\beta}_i$ go to zero [8]. In fact, several well known shrinkage priors can be expressed in this form by altering the shapes of the (inverse) Gamma hyperpriors [3]. For the variational posterior we will employ the following mean field approximation:

$$q_\phi(s_b, s_a, \tilde{\boldsymbol{\beta}}) = \mathcal{LN}(s_b|\mu_{s_b}, \sigma_{s_b}^2)\,\mathcal{LN}(s_a|\mu_{s_a}, \sigma_{s_a}^2)\prod_i \mathcal{LN}(\tilde{\beta}_i|\mu_{\tilde{\beta}_i}, \sigma_{\tilde{\beta}_i}^2), \qquad (15)$$

$$q_\phi(\tilde{\boldsymbol{\alpha}}, \tilde{\mathbf{W}}) = \prod_i \mathcal{LN}(\tilde{\alpha}_i|\mu_{\tilde{\alpha}_i}, \sigma_{\tilde{\alpha}_i}^2)\prod_{i,j} \mathcal{N}(\tilde{w}_{ij}|\mu_{\tilde{w}_{ij}}, \sigma_{\tilde{w}_{ij}}^2), \qquad (16)$$

where $\mathcal{LN}(\cdot,\cdot)$ is a log-normal distribution. It should be mentioned that a similar form of non-centered variational inference for the horseshoe has also been successfully employed for undirected models at [32]. Notice that we can also apply local reparametrizations [35] when we are sampling $\sqrt{\tilde{\alpha}_i\tilde{\beta}_i}$ and $\sqrt{s_a s_b}$ by exploiting properties of the log-normal distribution (the product of log-normal r.v.s is another log-normal, and a power of a log-normal r.v. is another log-normal), thus forming the implied:

$$\tilde{z}_i = \sqrt{\tilde{\alpha}_i\tilde{\beta}_i} \sim \mathcal{LN}(\mu_{\tilde{z}_i}, \sigma_{\tilde{z}_i}^2); \qquad s = \sqrt{s_a s_b} \sim \mathcal{LN}(\mu_s, \sigma_s^2), \qquad (17)$$

$$\mu_{\tilde{z}_i} = \tfrac{1}{2}(\mu_{\tilde{\alpha}_i} + \mu_{\tilde{\beta}_i}); \quad \sigma_{\tilde{z}_i}^2 = \tfrac{1}{4}(\sigma_{\tilde{\alpha}_i}^2 + \sigma_{\tilde{\beta}_i}^2); \quad \mu_s = \tfrac{1}{2}(\mu_{s_a} + \mu_{s_b}); \quad \sigma_s^2 = \tfrac{1}{4}(\sigma_{s_a}^2 + \sigma_{s_b}^2). \qquad (18)$$

As a threshold rule for group pruning we will use the negative log-mode of the local log-normal r.v. $z_i = s\tilde{z}_i$, i.e. prune when $(\sigma_{z_i}^2 - \mu_{z_i}) \geq t$, with $\mu_{z_i} = \mu_{\tilde{z}_i} + \mu_s$ and $\sigma_{z_i}^2 = \sigma_{\tilde{z}_i}^2 + \sigma_s^2$ (empirically, this slightly better separates the scales compared to the negative log-mean $-(\mu_{z_i} + 0.5\sigma_{z_i}^2)$). This ignores dependencies among the $z_i$ elements induced by the common scale $s$, but nonetheless we found that it works well in practice. Similarly to the group normal-Jeffreys prior, we will replace the distribution over $\mathbf{W}$ at each layer with the masked variational posterior mean during test time:

$$\hat{\mathbf{W}} = \mathrm{diag}(\mathbf{m})\,\mathbb{E}_{q(\mathbf{z})q(\tilde{\mathbf{W}})}\big[\mathrm{diag}(\mathbf{z})\tilde{\mathbf{W}}\big] = \mathrm{diag}\big(\mathbf{m}\odot\exp(\boldsymbol{\mu}_z + \tfrac{1}{2}\boldsymbol{\sigma}_z^2)\big)\,\mathbf{M}_{\tilde{W}}, \qquad (19)$$

where $\mathbf{m}$ is a binary mask determined according to the aforementioned threshold, $\mathbf{M}_{\tilde{W}}$ are the means of $q(\tilde{\mathbf{W}})$, and $\boldsymbol{\mu}_z$, $\boldsymbol{\sigma}_z^2$ are the means and variances of the local log-normals over $z_i$. Furthermore, similarly to the group normal-Jeffreys approach, we will use the variational posterior marginal variances:

$$\mathbb{V}(w_{ij})_{HS} = \big(\exp(\sigma_{z_i}^2) - 1\big)\exp(2\mu_{z_i} + \sigma_{z_i}^2)\big(\sigma_{ij}^2 + \mu_{ij}^2\big) + \sigma_{ij}^2\exp(2\mu_{z_i} + \sigma_{z_i}^2), \qquad (20)$$

to compute the final bit precision for the entire weight matrix $\hat{\mathbf{W}}$.
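A corresponding sketch for the horseshoe posterior collapses the log-normal factors as in eqs. 17-18 and applies the negative log-mode threshold before forming the masked mean of eq. 19. The argument names and per-row group layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def horseshoe_prune(mu_sa, s2_sa, mu_sb, s2_sb, mu_a, s2_a, mu_b, s2_b,
                    mu_w, s2_w, threshold):
    """Group pruning for the horseshoe posterior (eqs. 17-19), as a sketch.

    Arguments are the means / variances of the log-normal factors over s_a, s_b
    (global, scalars) and alpha_i, beta_i (local, one per group), plus the Gaussian
    factors over the non-centered weights w_tilde (one row per group).
    """
    # Implied log-normals over z_tilde_i = sqrt(alpha_i * beta_i) and s = sqrt(s_a * s_b), eqs. 17-18.
    mu_zt = 0.5 * (mu_a + mu_b)
    s2_zt = 0.25 * (s2_a + s2_b)
    mu_s = 0.5 * (mu_sa + mu_sb)
    s2_s = 0.25 * (s2_sa + s2_sb)

    # Log-normal over the full local scale z_i = s * z_tilde_i.
    mu_z = mu_zt + mu_s
    s2_z = s2_zt + s2_s

    # Threshold rule on the negative log-mode of z_i: prune when (s2_z - mu_z) >= t.
    mask = (s2_z - mu_z) < threshold

    # Masked variational posterior mean used at test time, eq. 19
    # (the mean of a log-normal is exp(mu + sigma^2 / 2)).
    W_hat = (mask * np.exp(mu_z + 0.5 * s2_z))[:, None] * mu_w
    return W_hat, mask
```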
## 5 Experiments

We validated the compression and speed-up capabilities of our models on the well-known architectures of LeNet-300-100 [39] and LeNet-5-Caffe (https://github.com/BVLC/caffe/tree/master/examples/mnist) on MNIST [40] and, similarly to [49], VGG [61] (the adapted CIFAR 10 version described at http://torch.ch/blog/2015/07/30/cifar.html) on CIFAR 10 [36]. The groups of parameters were constructed by coupling the scale variables for each filter for the convolutional layers and for each input neuron for the fully connected layers. We provide the algorithms that describe the forward pass using local reparametrizations for fully connected and convolutional layers with each of the employed approximate posteriors at Appendix F. For the horseshoe prior we set the scale $\tau_0$ of the global half-Cauchy prior to a reasonably small value, e.g. $\tau_0 = 10^{-5}$. This further increases the prior mass at zero, which is essential for sparse estimation and compression. We also found that constraining the standard deviations as described at [44] and "warm-up" [62] help in avoiding bad local optima of the variational objective. Further details about the experimental setup can be found at Appendix A. Determining the threshold for pruning can easily be done with manual inspection, as usually there are two well separated clusters (signal and noise). We provide a sample visualization at Appendix E.

### 5.1 Architecture learning & bit precisions

We first demonstrate the group sparsity capabilities of our methods by illustrating the learned architectures at Table 1, along with the inferred bit precision per layer. As we can observe, our methods infer significantly smaller architectures for LeNet-300-100 and LeNet-5-Caffe, compared to Sparse Variational Dropout, Generalized Dropout and Group Lasso. Interestingly, we observe that for the VGG network almost all of the big 512-feature-map layers are drastically reduced to around 10 feature maps, whereas the initial layers are mostly kept intact. Furthermore, all of the Bayesian methods considered require far fewer than the standard 32 bits per layer to represent the weights, sometimes even allowing for 5 bit precisions.

Table 1: Learned architectures with Sparse VD [49], Generalized Dropout (GD) [63] and Group Lasso (GL) [70]. Bayesian Compression (BC) with group normal-Jeffreys (BC-GNJ) and group horseshoe (BC-GHS) priors correspond to the proposed models. We show the amount of neurons left after pruning along with the average bit precisions for the weights at each layer.
| Network & original size | Method | Pruned architecture | Bit precision (per layer) |
|---|---|---|---|
| LeNet-300-100 (784-300-100) | Sparse VD | 512-114-72 | 8-11-14 |
| | BC-GNJ | 278-98-13 | 8-9-14 |
| | BC-GHS | 311-86-14 | 13-11-10 |
| LeNet-5-Caffe (20-50-800-500) | Sparse VD | 14-19-242-131 | 13-10-8-12 |
| | GD | 7-13-208-16 | - |
| | GL | 3-12-192-500 | - |
| | BC-GNJ | 8-13-88-13 | 18-10-7-9 |
| | BC-GHS | 5-10-76-16 | 10-10-14-13 |
| VGG ((2x64)-(2x128)-(3x256)-(8x512)) | BC-GNJ | 63-64-128-128-245-155-63-26-24-20-14-12-11-11-15 | 10-10-10-10-8-8-8-5-5-5-5-5-6-7-11 |
| | BC-GHS | 51-62-125-128-228-129-38-13-9-6-5-6-6-6-20 | 11-12-9-14-10-8-5-5-6-6-6-8-11-17-10 |

### 5.2 Compression Rates

For the actual compression task we compare our method to current work in three different scenarios: (i) compression achieved only by pruning, where for non-group methods we use the CSC format to store parameters; (ii) compression based on the former but with reduced bit precision per layer (only for the weights); and (iii) the maximum compression rate as proposed by [25].

Table 2: Compression results for our methods. DC corresponds to the Deep Compression method introduced at [25], DNS to the method of [21] and SWS to the Soft-Weight Sharing of [67]. Numbers marked with * are best case guesses. The three rightmost columns report the compression rate with the resulting error % in parentheses.

| Model (original error %) | Method | \|w≠0\|/\|w\| (%) | Pruning | Fast Prediction | Maximum Compression |
|---|---|---|---|---|---|
| LeNet-300-100 (1.6) | DC | 8.0 | 6 (1.6) | - | 40 (1.6) |
| | DNS | 1.8 | 28* (2.0) | - | - |
| | SWS | 4.3 | 12* (1.9) | - | 64 (1.9) |
| | Sparse VD | 2.2 | 21 (1.8) | 84 (1.8) | 113 (1.8) |
| | BC-GNJ | 10.8 | 9 (1.8) | 36 (1.8) | 58 (1.8) |
| | BC-GHS | 10.6 | 9 (1.8) | 23 (1.9) | 59 (2.0) |
| LeNet-5-Caffe (0.9) | DC | 8.0 | 6* (0.7) | - | 39 (0.7) |
| | DNS | 0.9 | 55* (0.9) | - | 108 (0.9) |
| | SWS | 0.5 | 100* (1.0) | - | 162 (1.0) |
| | Sparse VD | 0.7 | 63 (1.0) | 228 (1.0) | 365 (1.0) |
| | BC-GNJ | 0.9 | 108 (1.0) | 361 (1.0) | 573 (1.0) |
| | BC-GHS | 0.6 | 156 (1.0) | 419 (1.0) | 771 (1.0) |
| VGG (8.4) | BC-GNJ | 6.7 | 14 (8.6) | 56 (8.8) | 95 (8.6) |
| | BC-GHS | 5.5 | 18 (9.0) | 59 (9.0) | 116 (9.2) |

We believe these to be relevant scenarios because (i) can be applied with already existing frameworks such as TensorFlow [1], and (ii) is a practical scheme given that upcoming GPUs and frameworks will be designed to work with low and mixed precision arithmetic [41, 23]. For (iii), we perform k-means clustering on the weights with k=32 and subsequently store a weight index that points to a codebook of available weights. Note that the latter achieves the highest compression rate but is fairly impractical at test time, since the original matrix needs to be restored for each layer.

As we can observe at Table 2, our methods are competitive with the state of the art for LeNet-300-100 while offering significantly better compression rates on the LeNet-5-Caffe architecture, without any loss in accuracy. Do note that group sparsity and weight sparsity can be combined so as to further prune some weights when a particular group is not removed; thus we can potentially further boost compression performance at e.g. LeNet-300-100. For the VGG network we observe that training from a random initialization yielded consistently lower accuracy (around 1%-2% less) compared to initializing the means of the approximate posterior from a pretrained network, similarly to [49], thus we only report the latter results (we also tried to finetune the same network with Sparse VD, but unfortunately it increased the error considerably, around 3% extra error, therefore we do not report those results). After initialization we trained the VGG network regularly for 200 epochs using Adam with the default hyperparameters. We observe a small drop in accuracy for the final models when using the deterministic version of the network for prediction, but nevertheless averaging across multiple samples restores the original accuracy. Note that in general we can maintain the original accuracy on VGG without sampling by simply finetuning with a small learning rate, as done at [49]. This will still induce (less) sparsity, but unfortunately it does not lead to good compression, as the bit precision remains very high due to not appropriately increasing the marginal variances of the weights.
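For intuition on how these three scenarios translate into numbers, a rough per-layer accounting could look like the sketch below. This is deliberately simplified (it ignores index structures, codebook sharing across layers and the CSC overhead for non-group methods) and is not the protocol used to produce Table 2; the example shapes are taken from Table 1 but serve only as an illustration.

```python
import numpy as np

def compression_rates(orig_shape, kept_shape, bits_per_weight, k=32):
    """Back-of-the-envelope per-layer compression rates for the three scenarios of Sec. 5.2."""
    orig_bits = np.prod(orig_shape) * 32                        # dense fp32 baseline
    kept = np.prod(kept_shape)                                  # weights left after group pruning

    pruning_only = orig_bits / (kept * 32)                      # (i)   prune, keep fp32 weights
    fast_prediction = orig_bits / (kept * bits_per_weight)      # (ii)  prune + learned bit precision
    max_compression = orig_bits / (kept * np.log2(k) + k * 32)  # (iii) k-means codebook + 5-bit indices
    return pruning_only, fast_prediction, max_compression

# Example: the first LeNet-300-100 layer pruned from 784x300 to 278x98 (BC-GNJ in Table 1),
# encoded with 8-bit weights.
print(compression_rates((784, 300), (278, 98), 8))
```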
### 5.3 Speed and energy consumption

We demonstrate that our method is competitive with [70], denoted as GL, a method that explicitly prunes convolutional kernels to reduce compute time. We measure the time and energy consumption of one forward pass of a mini-batch with batch size 8192 through LeNet-5-Caffe. We average over $10^4$ forward passes, and all experiments were run with TensorFlow 1.0.1, CUDA 8.0 and the respective cuDNN. We use either 16 CPUs run in parallel (CPU) or a Titan X (GPU). Note that we only use the pruned architecture, as lower bit precision would further increase the speed-up but is not implementable in any common framework. Further, all methods we compare to in the latter experiments would barely show an improvement at all, since they do not learn to prune groups but only individual parameters. In Figure 1 we present our results. As to be expected, the largest effect on the speed-up is caused by GPU usage. However, both our models and the best competing models reach a speed-up factor of around 8x. We can further save about 3x in energy costs by applying our architecture instead of the original one on a GPU. For larger networks the speed-up is even higher: for the VGG experiments with batch size 256 we have a speed-up factor of 51x.

Figure 1: Left: Average time a batch of 8192 samples takes to pass through LeNet-5-Caffe. Numbers on top of the bars represent the speed-up factor relative to the CPU implementation of the original network. Right: Energy consumption of the GPU for the same process (when run on GPU).

## 6 Conclusion

We introduced Bayesian compression, a way to tackle efficiency and compression in deep neural networks in a unified and principled way. Our proposed methods allow for theoretically principled compression of neural networks and improved energy efficiency with reduced computation, while naturally learning the bit precisions for each weight. This serves as a strong argument in favor of Bayesian methods for neural networks, when we are concerned with compression and speed-up.

## Acknowledgments

We would like to thank Dmitry Molchanov, Dmitry Vetrov, Klamer Schutte and Dennis Koelma for valuable discussions and feedback. This research was supported by TNO, NWO and Google.

## References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[2] D. F. Andrews and C. L. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society. Series B (Methodological), pages 99-102, 1974.

[3] A. Armagan, M. Clyde, and D. B. Dunson. Generalized beta mixtures of Gaussians. In Advances in Neural Information Processing Systems, pages 523-531, 2011.

[4] E. Azarkhish, D. Rossi, I. Loi, and L. Benini. Neurostream: Scalable and energy efficient deep learning with smart memory cubes. arXiv preprint arXiv:1701.06420, 2017.

[5] J. Ba and R. Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pages 2654-2662, 2014.
[6] E. Beale, C. Mallows, et al. Scale mixing of symmetric distributions with zero means. The Annals of Mathematical Statistics, 30(4):1145-1151, 1959.

[7] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, 2015.

[8] C. M. Carvalho, N. G. Polson, and J. G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465-480, 2010.

[9] S. Chai, A. Raghavan, D. Zhang, M. Amer, and T. Shields. Low precision neural networks using subband decomposition. arXiv preprint arXiv:1703.08595, 2017.

[10] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing convolutional neural networks. arXiv preprint arXiv:1506.04449, 2015.

[11] M. Courbariaux and Y. Bengio. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.

[12] M. Courbariaux, J.-P. David, and Y. Bengio. Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024, 2014.

[13] M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3105-3113, 2015.

[14] M. Denil, B. Shakibi, L. Dinh, N. de Freitas, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148-2156, 2013.

[15] X. Dong, J. Huang, Y. Yang, and S. Yan. More is less: A more complicated network with less inference complexity. arXiv preprint arXiv:1703.08651, 2017.

[16] M. A. Figueiredo. Adaptive sparseness using Jeffreys prior. Advances in Neural Information Processing Systems, 1:697-704, 2002.

[17] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. ICML, 2016.

[18] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector quantization. ICLR, 2015.

[19] A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348-2356, 2011.

[20] P. D. Grünwald. The minimum description length principle. MIT Press, 2007.

[21] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, pages 1379-1387, 2016.

[22] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. CoRR, abs/1502.02551, 392, 2015.

[23] P. Gysel. Ristretto: Hardware-oriented approximation of convolutional neural networks. Master's thesis, University of California, 2016.

[24] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems, pages 1135-1143, 2015.

[25] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. ICLR, 2016.

[26] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[27] G. E. Hinton and D. Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 5-13. ACM, 1993.

[28] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[29] A. Honkela and H. Valpola. Variational learning and bits-back coding: an information-theoretic view to Bayesian learning. IEEE Transactions on Neural Networks, 15(4):800-810, 2004.

[30] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[31] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. ICLR, 2017.

[32] J. B. Ingraham and D. S. Marks. Bayesian sparsity for intractable distributions. arXiv preprint arXiv:1602.03807, 2016.

[33] T. Karaletsos and G. Rätsch. Automatic relevance determination for deep generative models. arXiv preprint arXiv:1505.07765, 2015.

[34] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR), 2014.

[35] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparametrization trick. Advances in Neural Information Processing Systems, 2015.

[36] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images, 2009.

[37] N. D. Lawrence. Note relevance determination. In Neural Nets WIRN Vietri-01, pages 128-133. Springer, 2002.

[38] Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel. Optimal brain damage. In NIPS, volume 2, pages 598-605, 1989.

[39] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

[40] Y. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits, 1998.

[41] D. D. Lin and S. S. Talathi. Overcoming challenges in fixed point training of deep convolutional networks. Workshop, ICML, 2016.

[42] D. D. Lin, S. S. Talathi, and V. S. Annapureddy. Fixed point quantization of deep convolutional networks. arXiv preprint arXiv:1511.06393, 2015.

[43] C. Louizos. Smart regularization of deep architectures. Master's thesis, University of Amsterdam, 2015.

[44] C. Louizos and M. Welling. Multiplicative normalizing flows for variational Bayesian neural networks. arXiv e-prints, Mar. 2017.

[45] D. J. MacKay. Probable networks and plausible predictions: a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3):469-505, 1995.

[46] N. Mellempudi, A. Kundu, D. Mudigere, D. Das, B. Kaul, and P. Dubey. Ternary neural networks with fine-grained quantization. arXiv preprint arXiv:1705.01462, 2017.

[47] P. Merolla, R. Appuswamy, J. Arthur, S. K. Esser, and D. Modha. Deep neural networks are robust to weight binarization and other non-linear distortions. arXiv preprint arXiv:1606.01981, 2016.

[48] T. J. Mitchell and J. J. Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023-1032, 1988.

[49] D. Molchanov, A. Ashukha, and D. Vetrov. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.

[50] E. Nalisnick, A. Anandkumar, and P. Smyth. A scale mixture perspective of multiplicative noise in neural networks. arXiv preprint arXiv:1506.03208, 2015.

[51] R. M. Neal. Bayesian learning for neural networks. PhD thesis, Citeseer, 1995.
[52] S. E. Neville, J. T. Ormerod, M. Wand, et al. Mean field variational Bayes for continuous sparse signal shrinkage: pitfalls and remedies. Electronic Journal of Statistics, 8(1):1113-1151, 2014.

[53] O. Papaspiliopoulos, G. O. Roberts, and M. Sköld. A general framework for the parametrization of hierarchical models. Statistical Science, pages 59-73, 2007.

[54] C. Peterson. A mean field theory learning algorithm for neural networks. Complex Systems, 1:995-1019, 1987.

[55] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525-542. Springer, 2016.

[56] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 1278-1286, 2014.

[57] J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465-471, 1978.

[58] J. Rissanen. Stochastic complexity and modeling. The Annals of Statistics, pages 1080-1100, 1986.

[59] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini. Group sparse regularization for deep neural networks. arXiv preprint arXiv:1607.00485, 2016.

[60] S. Shi and X. Chu. Speeding up convolutional neural networks by exploiting the sparsity of rectifier units. arXiv preprint arXiv:1704.07724, 2017.

[61] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.

[62] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. Ladder variational autoencoders. arXiv preprint arXiv:1602.02282, 2016.

[63] S. Srinivas and R. V. Babu. Generalized dropout. arXiv preprint arXiv:1611.06791, 2016.

[64] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958, 2014.

[65] V. Sze, Y.-H. Chen, T.-J. Yang, and J. Emer. Efficient processing of deep neural networks: A tutorial and survey. arXiv preprint arXiv:1703.09039, 2017.

[66] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267-288, 1996.

[67] K. Ullrich, E. Meeds, and M. Welling. Soft weight-sharing for neural network compression. ICLR, 2017.

[68] G. Venkatesh, E. Nurvitadhi, and D. Marr. Accelerating deep convolutional networks using low-precision and sparsity. arXiv preprint arXiv:1610.00324, 2016.

[69] C. S. Wallace. Classification by minimum-message-length inference. In International Conference on Computing and Information, pages 72-81. Springer, 1990.

[70] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074-2082, 2016.

[71] T.-J. Yang, Y.-H. Chen, and V. Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. CVPR, 2017.

[72] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. ICLR, 2017.