# Binarized Neural Networks

Itay Hubara¹* (itayh@technion.ac.il), Matthieu Courbariaux²* (matthieu.courbariaux@gmail.com), Daniel Soudry³ (daniel.soudry@gmail.com), Ran El-Yaniv¹ (rani@cs.technion.ac.il), Yoshua Bengio²,⁴ (yoshua.umontreal@gmail.com)

(1) Technion, Israel Institute of Technology. (2) Université de Montréal. (3) Columbia University. (4) CIFAR Senior Fellow. (*) Indicates equal contribution.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

## Abstract

We introduce a method to train Binarized Neural Networks (BNNs), neural networks with binary weights and activations at run-time. At train-time, the binary weights and activations are used for computing the parameter gradients. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power-efficiency. To validate the effectiveness of BNNs, we conducted two sets of experiments on the Torch7 and Theano frameworks. On both, BNNs achieved nearly state-of-the-art results on the MNIST, CIFAR-10 and SVHN datasets. We also report our preliminary results on the challenging ImageNet dataset. Last but not least, we wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The code for training and running our BNNs is available on-line.

## Introduction

Deep Neural Networks (DNNs) have substantially pushed Artificial Intelligence (AI) limits in a wide range of tasks (LeCun et al., 2015). Today, DNNs are almost exclusively trained on one or many very fast and power-hungry Graphics Processing Units (GPUs) (Coates et al., 2013). As a result, it is often a challenge to run DNNs on target low-power devices, and substantial research effort is invested in speeding up DNNs at run-time on both general-purpose (Gong et al., 2014; Han et al., 2015b) and specialized computer hardware (Chen et al., 2014; Esser et al., 2015).

This paper makes the following contributions:

- We introduce a method to train Binarized Neural Networks (BNNs), neural networks with binary weights and activations, at run-time, and when computing the parameter gradients at train-time (see Section 1).
- We conduct two sets of experiments, each implemented on a different framework, namely Torch7 and Theano, which show that it is possible to train BNNs on MNIST, CIFAR-10 and SVHN and achieve near state-of-the-art results (see Section 2). Moreover, we report preliminary results on the challenging ImageNet dataset.
- We show that during the forward pass (both at run-time and train-time), BNNs drastically reduce memory consumption (size and number of accesses) and replace most arithmetic operations with bit-wise operations, which potentially leads to a substantial increase in power-efficiency (see Section 3). Moreover, a binarized CNN can lead to binary convolution kernel repetitions; we argue that dedicated hardware could reduce the time complexity by 60%.
- Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy (see Section 4).

The code for training and running our BNNs is available on-line (Theano: https://github.com/MatthieuCourbariaux/BinaryNet; Torch7: https://github.com/itayhubara/BinaryNet).
## 1 Binarized Neural Networks

In this section, we detail our binarization function, show how we use it to compute the parameter gradients, and how we backpropagate through it.

**Deterministic vs Stochastic Binarization.** When training a BNN, we constrain both the weights and the activations to either +1 or −1. Those two values are very advantageous from a hardware perspective, as we explain in Section 4. In order to transform the real-valued variables into those two values, we use two different binarization functions, as in (Courbariaux et al., 2015). Our first binarization function is deterministic:

$$x^b = \mathrm{Sign}(x) = \begin{cases} +1 & \text{if } x \ge 0, \\ -1 & \text{otherwise,} \end{cases} \qquad (1)$$

where $x^b$ is the binarized variable (weight or activation) and $x$ the real-valued variable. It is very straightforward to implement and works quite well in practice. Our second binarization function is stochastic:

$$x^b = \begin{cases} +1 & \text{with probability } p = \sigma(x), \\ -1 & \text{with probability } 1-p, \end{cases} \qquad (2)$$

where $\sigma$ is the "hard sigmoid" function:

$$\sigma(x) = \mathrm{clip}\!\left(\tfrac{x+1}{2}, 0, 1\right) = \max\!\left(0, \min\!\left(1, \tfrac{x+1}{2}\right)\right). \qquad (3)$$

The stochastic binarization is more appealing than the sign function, but harder to implement as it requires the hardware to generate random bits when quantizing. As a result, we mostly use the deterministic binarization function (i.e., the sign function), with the exception of activations at train-time in some of our experiments.

**Gradient Computation and Accumulation.** Although our BNN training method uses binary weights and activations to compute the parameter gradients, the real-valued gradients of the weights are accumulated in real-valued variables, as per Algorithm 1. Real-valued weights are likely required for Stochastic Gradient Descent (SGD) to work at all. SGD explores the space of parameters in small and noisy steps, and that noise is averaged out by the stochastic gradient contributions accumulated in each weight. Therefore, it is important to maintain sufficient resolution for these accumulators, which at first glance suggests that high precision is absolutely required. Moreover, adding noise to weights and activations when computing the parameter gradients provides a form of regularization that can help generalize better, as previously shown with variational weight noise (Graves, 2011), Dropout (Srivastava et al., 2014) and DropConnect (Wan et al., 2013). Our method of training BNNs can be seen as a variant of Dropout, in which instead of randomly setting half of the activations to zero when computing the parameter gradients, we binarize both the activations and the weights.

**Propagating Gradients Through Discretization.** The derivative of the sign function is zero almost everywhere, making it apparently incompatible with back-propagation, since the exact gradient of the cost with respect to the quantities before the discretization (pre-activations or weights) would be zero. Note that this remains true even if stochastic quantization is used. Bengio (2013) studied the question of estimating or propagating gradients through stochastic discrete neurons. He found in his experiments that the fastest training was obtained when using the "straight-through estimator", previously introduced in Hinton's lectures (Hinton, 2012). We follow a similar approach, but use the version of the straight-through estimator that takes into account the saturation effect, and does use deterministic rather than stochastic sampling of the bit.
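Before getting to the training algorithms below, here is a minimal NumPy sketch of the two binarization functions of Eqs. (1)-(3). The function names are ours, not from the released code, and the fixed seed is only for reproducibility of the example.

```python
import numpy as np

def hard_sigmoid(x):
    # sigma(x) = clip((x + 1) / 2, 0, 1), as in Eq. (3).
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def binarize_deterministic(x):
    # Eq. (1): Sign(x), mapping x >= 0 to +1 and x < 0 to -1.
    return np.where(x >= 0, 1.0, -1.0)

def binarize_stochastic(x, rng=np.random.default_rng(0)):
    # Eq. (2): +1 with probability p = sigma(x), -1 with probability 1 - p.
    p = hard_sigmoid(x)
    return np.where(rng.random(x.shape) < p, 1.0, -1.0)

x = np.array([-1.7, -0.3, 0.0, 0.4, 2.1])
print(binarize_deterministic(x))  # [-1. -1.  1.  1.  1.]
print(binarize_stochastic(x))     # random, but biased towards the sign of x
```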
**Algorithm 1: Training a BNN.** $C$ is the cost function for the minibatch, $\lambda$ the learning rate decay factor and $L$ the number of layers. $\circ$ indicates element-wise multiplication. The function Binarize() specifies how to (stochastically or deterministically) binarize the activations and weights, and Clip() specifies how to clip the weights. BatchNorm() specifies how to batch-normalize the activations, using either batch normalization (Ioffe & Szegedy, 2015) or the shift-based variant we describe in Algorithm 3. BackBatchNorm() specifies how to backpropagate through the normalization. Update() specifies how to update the parameters when their gradients are known, using either ADAM (Kingma & Ba, 2014) or the shift-based AdaMax we describe in Algorithm 2.

Require: a minibatch of inputs and targets $(a_0, a^*)$, previous weights $W$, previous BatchNorm parameters $\theta$, weight initialization coefficients $\gamma$ from (Glorot & Bengio, 2010), and previous learning rate $\eta$.
Ensure: updated weights $W^{t+1}$, updated BatchNorm parameters $\theta^{t+1}$ and updated learning rate $\eta^{t+1}$.

{1. Computing the gradients:}
{1.1. Forward propagation:}
for $k = 1$ to $L$ do
  $W^b_k \leftarrow \mathrm{Binarize}(W_k)$,  $s_k \leftarrow a^b_{k-1} W^b_k$,  $a_k \leftarrow \mathrm{BatchNorm}(s_k, \theta_k)$
  if $k < L$ then $a^b_k \leftarrow \mathrm{Binarize}(a_k)$
{1.2. Backward propagation (please note that the gradients are not binary):}
Compute $g_{a_L} = \frac{\partial C}{\partial a_L}$ knowing $a_L$ and $a^*$
for $k = L$ to $1$ do
  if $k < L$ then $g_{a_k} \leftarrow g_{a^b_k} \circ 1_{|a_k| \le 1}$
  $(g_{s_k}, g_{\theta_k}) \leftarrow \mathrm{BackBatchNorm}(g_{a_k}, s_k, \theta_k)$
  $g_{a^b_{k-1}} \leftarrow g_{s_k} W^b_k$,  $g_{W^b_k} \leftarrow g_{s_k}^\top a^b_{k-1}$
{2. Accumulating the gradients:}
for $k = 1$ to $L$ do
  $\theta^{t+1}_k \leftarrow \mathrm{Update}(\theta_k, \eta^t, g_{\theta_k})$,  $\eta^{t+1} \leftarrow \lambda \eta^t$
  $W^{t+1}_k \leftarrow \mathrm{Clip}(\mathrm{Update}(W_k, \gamma_k \eta^t, g_{W^b_k}), -1, 1)$

**Algorithm 2: Shift-based AdaMax learning rule (adapted from Kingma & Ba, 2014).** $g_t^2$ indicates the element-wise square $g_t \circ g_t$ and $\ll\gg$ stands for both left and right bit-shift. Good default settings are $\alpha = 2^{-10}$, $1-\beta_1 = 2^{-3}$, $1-\beta_2 = 2^{-10}$. All operations on vectors are element-wise. With $\beta_1^t$ and $\beta_2^t$ we denote $\beta_1$ and $\beta_2$ to the power $t$.

Require: previous parameters $\theta_{t-1}$, their gradient $g_t$, and learning rate $\alpha$.
Ensure: updated parameters $\theta_t$.

{Biased 1st and 2nd moment estimates:}
$m_t \leftarrow \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t$
$v_t \leftarrow \max(\beta_2 \cdot v_{t-1}, |g_t|)$
{Updated parameters:}
$\theta_t \leftarrow \theta_{t-1} - (\alpha \ll\gg (1-\beta_1)) \cdot (\hat{m} \ll\gg v_t^{-1})$

**Algorithm 3: Shift-based Batch Normalizing Transform, applied to activation $x$ over a mini-batch.** The approximate power-of-2 is $\mathrm{AP2}(x) = \mathrm{sign}(x) \cdot 2^{\mathrm{round}(\log_2 |x|)}$ (its hardware implementation is as simple as extracting the index of the most significant bit from the number's binary representation), and $\ll\gg$ stands for both left and right binary shift.

Require: values of $x$ over a mini-batch: $B = \{x_{1 \ldots m}\}$; parameters to learn: $\gamma, \beta$.
Ensure: $\{y_i = \mathrm{BN}(x_i, \gamma, \beta)\}$.

{1. Mini-batch mean:} $\mu_B \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i$
{2. Centered input:} $C(x_i) \leftarrow x_i - \mu_B$
{3. Approximate variance:} $\sigma_B^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m} \left( C(x_i) \ll\gg \mathrm{AP2}(C(x_i)) \right)$
{4. Normalize:} $\hat{x}_i \leftarrow C(x_i) \ll\gg \mathrm{AP2}\!\left(\left(\sqrt{\sigma_B^2 + \epsilon}\right)^{-1}\right)$
{5. Scale and shift:} $y_i \leftarrow \mathrm{AP2}(\gamma) \ll\gg \hat{x}_i$

Consider the sign function quantization $q = \mathrm{Sign}(r)$, and assume that an estimator $g_q$ of the gradient $\frac{\partial C}{\partial q}$ has been obtained (with the straight-through estimator when needed). Then, our straight-through estimator of $\frac{\partial C}{\partial r}$ is simply

$$g_r = g_q 1_{|r| \le 1}. \qquad (4)$$

Note that this preserves the gradient's information and cancels the gradient when $r$ is too large. Not cancelling the gradient when $r$ is too large significantly worsens performance. The use of this straight-through estimator is illustrated in Algorithm 1. The derivative $1_{|r| \le 1}$ can also be seen as propagating the gradient through the hard tanh, which is the following piece-wise linear activation function:

$$\mathrm{Htanh}(x) = \mathrm{Clip}(x, -1, 1). \qquad (5)$$
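As a small illustration of Eq. (4), here is a hedged NumPy sketch of the sign quantizer with a straight-through backward pass; the function names and example values are ours.

```python
import numpy as np

def sign_forward(r):
    # q = Sign(r), as in Eq. (1).
    return np.where(r >= 0, 1.0, -1.0)

def sign_backward(r, g_q):
    # Straight-through estimator of Eq. (4): g_r = g_q * 1_{|r| <= 1}.
    # The upstream gradient is passed through where |r| <= 1 and
    # cancelled where |r| > 1 (the saturation region of Htanh).
    return g_q * (np.abs(r) <= 1.0)

r = np.array([-2.0, -0.5, 0.3, 1.5])
g_q = np.array([0.1, 0.2, -0.3, 0.4])   # hypothetical upstream gradient
print(sign_forward(r))        # [-1. -1.  1.  1.]
print(sign_backward(r, g_q))  # [ 0.   0.2 -0.3  0. ]
```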
**Algorithm 4: Running a BNN.** $L$ is the number of layers.

Require: a vector of 8-bit inputs $a_0$, the binary weights $W^b$, and the BatchNorm parameters $\theta$.
Ensure: the MLP output $a_L$.

{1. First layer:}
$a_1 \leftarrow 0$
for $n = 1$ to $8$ do
  $a_1 \leftarrow a_1 + 2^{n-1} \times \mathrm{XnorDotProduct}(a_0^n, W_1^b)$
$a_1^b \leftarrow \mathrm{Sign}(\mathrm{BatchNorm}(a_1, \theta_1))$
{2. Remaining hidden layers:}
for $k = 2$ to $L-1$ do
  $a_k \leftarrow \mathrm{XnorDotProduct}(a_{k-1}^b, W_k^b)$,  $a_k^b \leftarrow \mathrm{Sign}(\mathrm{BatchNorm}(a_k, \theta_k))$
{3. Output layer:}
$a_L \leftarrow \mathrm{XnorDotProduct}(a_{L-1}^b, W_L^b)$,  $a_L \leftarrow \mathrm{BatchNorm}(a_L, \theta_L)$

For hidden units, we use the sign function nonlinearity to obtain binary activations, and for weights we combine two ingredients:

- Constrain each real-valued weight between −1 and 1, by projecting $w^r$ to −1 or 1 when the weight update brings $w^r$ outside of $[-1, 1]$, i.e., clipping the weights during training, as per Algorithm 1. The real-valued weights would otherwise grow very large without any impact on the binary weights.
- When using a weight $w^r$, quantize it using $w^b = \mathrm{Sign}(w^r)$. This is consistent with the gradient cancelling when $|w^r| > 1$, according to Eq. (4).

**Shift-based Batch Normalization.** Batch Normalization (BN) (Ioffe & Szegedy, 2015) accelerates the training and also seems to reduce the overall impact of the weight scale. The normalization noise may also help to regularize the model. However, at train-time, BN requires many multiplications (calculating the standard deviation and dividing by it), namely, dividing by the running variance (the weighted mean of the training set activation variance). Although the number of scaling calculations is the same as the number of neurons, in the case of ConvNets this number is quite large. For example, in the CIFAR-10 dataset (using our architecture), the first convolution layer, consisting of only 128 3×3 filter masks, converts an image of size 3×32×32 to size 3×128×28×28, which is two orders of magnitude larger than the number of weights. To achieve the results that BN would obtain, we use a shift-based batch normalization (SBN) technique, detailed in Algorithm 3. SBN approximates BN almost without multiplications. In the experiments we conducted, we did not observe any accuracy loss when using the shift-based BN algorithm instead of the vanilla BN algorithm.

**Shift-based AdaMax.** The ADAM learning rule (Kingma & Ba, 2014) also seems to reduce the impact of the weight scale. Since ADAM requires many multiplications, we suggest using instead the shift-based AdaMax we detail in Algorithm 2. In the experiments we conducted, we did not observe any accuracy loss when using the shift-based AdaMax algorithm instead of the vanilla ADAM algorithm.

**First Layer.** In a BNN, only the binarized values of the weights and activations are used in all calculations. As the output of one layer is the input of the next, all the layers' inputs are binary, with the exception of the first layer. However, we do not believe this to be a major issue. First, in computer vision, the input representation typically has far fewer channels (e.g., red, green and blue) than internal representations (e.g., 512). As a result, the first layer of a ConvNet is often the smallest convolution layer, both in terms of parameters and computations (Szegedy et al., 2014). Second, it is relatively easy to handle continuous-valued inputs as fixed-point numbers with $m$ bits of precision. For example, in the common case of 8-bit fixed-point inputs:

$$s = x \cdot w^b; \qquad s = \sum_{n=1}^{8} 2^{n-1} (x^n \cdot w^b), \qquad (6)$$

where $x$ is a vector of 1024 8-bit inputs, $x_1^8$ is the most significant bit of the first input, $w^b$ is a vector of 1024 1-bit weights, and $s$ is the resulting weighted sum. This trick is used in Algorithm 4.
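To make Eq. (6) and the first-layer loop of Algorithm 4 concrete, the following hedged NumPy sketch accumulates the weighted sum bit-plane by bit-plane. It uses a plain integer dot product per bit-plane rather than the packed XnorDotProduct of a real kernel, and all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 1024 inputs, 8-bit unsigned fixed-point values.
x = rng.integers(0, 256, size=1024).astype(np.int64)   # 8-bit inputs
w_b = rng.choice([-1, 1], size=1024).astype(np.int64)  # 1-bit (+-1) weights

# Direct weighted sum: s = x . w^b
s_direct = int(x @ w_b)

# Bit-plane accumulation, as in Eq. (6) / step 1 of Algorithm 4:
# s = sum_{n=1}^{8} 2^{n-1} * (x^n . w^b), where x^n is the n-th bit plane.
s_bitwise = 0
for n in range(1, 9):
    x_bit_n = (x >> (n - 1)) & 1          # n-th least significant bit of each input
    s_bitwise += (1 << (n - 1)) * int(x_bit_n @ w_b)

assert s_bitwise == s_direct
print(s_direct, s_bitwise)
```

In dedicated hardware or an optimized kernel, each per-bit-plane dot product would itself be computed with XNOR-popcount on packed bits, which is the point of the trick.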
## 2 Benchmark Results

We conduct two sets of experiments, each based on a different framework, namely Torch7 and Theano. Implementation details are reported in Appendix A, and code for both frameworks is available online. Results are reported in Table 1.

Table 1: Classification test error rates of DNNs trained on MNIST (fully connected architecture), CIFAR-10 and SVHN (ConvNet). No unsupervised pre-training or data augmentation was used.

| Data set | MNIST | SVHN | CIFAR-10 |
|---|---|---|---|
| **Binarized activations+weights, during training and test** | | | |
| BNN (Torch7) | 1.40% | 2.53% | 10.15% |
| BNN (Theano) | 0.96% | 2.80% | 11.40% |
| Committee Machines' Array (Baldassi et al., 2015) | 1.35% | - | - |
| **Binarized weights, during training and test** | | | |
| BinaryConnect (Courbariaux et al., 2015) | 1.29 ± 0.08% | 2.30% | 9.90% |
| **Binarized activations+weights, during test** | | | |
| EBP (Cheng et al., 2015) | 2.2 ± 0.1% | - | - |
| Bitwise DNNs (Kim & Smaragdis, 2016) | 1.33% | - | - |
| **Ternary weights, binary activations, during test** | | | |
| (Hwang & Sung, 2014) | 1.45% | - | - |
| **No binarization (standard results)** | | | |
| No regularization | 1.3 ± 0.2% | 2.44% | 10.94% |
| Gated pooling (Lee et al., 2015) | - | 1.69% | 7.62% |

Figure 1: Training curves for different methods on the CIFAR-10 dataset. The dotted lines represent the training costs (square hinge losses) and the continuous lines the corresponding validation error rates. Although BNNs are slower to train, they are nearly as accurate as 32-bit float DNNs.

**Preliminary Results on ImageNet.** To test the strength of our method, we applied it to the challenging ImageNet classification task. Considerable research has been concerned with compressing ImageNet architectures while preserving high accuracy (e.g., Han et al. (2015a)). Previous approaches include pruning near-zero weights using matrix factorization techniques, quantizing the weights and applying Huffman codes, among others. To the best of our knowledge, so far there are no reports on successfully quantizing the network's activations. Moreover, a recent work (Han et al., 2015a) showed that accuracy significantly deteriorates when trying to quantize convolutional layers' weights below 4 bits (FC layers are more robust to quantization and can operate quite well with only 2 bits). In the present work we attempted the difficult task of binarizing both weights and activations. Employing the well-known AlexNet and GoogleNet architectures, we applied our techniques and achieved 36.1% top-1 and 60.1% top-5 accuracies using AlexNet, and 47.1% top-1 and 69.1% top-5 accuracies using GoogleNet. While this performance leaves room for improvement (relative to full-precision nets), it is by far better than all previous attempts to compress ImageNet architectures using less than 4 bits of precision for the weights. Moreover, this advantage is achieved while also binarizing the neuron activations. Detailed descriptions of these results, as well as full implementation details of our experiments, are reported in the supplementary material (Appendix B).

In our latest work (Hubara et al., 2016) we relaxed the binary constraints and allowed more than 1 bit per weight and activation. The resulting QNNs achieve prediction accuracy comparable to their 32-bit counterparts. For example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves 51% top-1 accuracy, and GoogleNet with 4-bit weights and activations achieves 66.6%. Moreover, we quantize the parameter gradients to 6 bits as well, which enables gradient computation using only bit-wise operations. Full details can be found in (Hubara et al., 2016).
Table 2: Energy consumption of multiply-accumulations in pico-joules (Horowitz, 2014).

| Operation | MUL | ADD |
|---|---|---|
| 8-bit Integer | 0.2 pJ | 0.03 pJ |
| 32-bit Integer | 3.1 pJ | 0.1 pJ |
| 16-bit Floating Point | 1.1 pJ | 0.4 pJ |
| 32-bit Floating Point | 3.7 pJ | 0.9 pJ |

Table 3: Energy consumption of memory accesses in pico-joules (Horowitz, 2014).

| Memory size | 64-bit memory access |
|---|---|
| 8K | 10 pJ |
| 32K | 20 pJ |
| 1M | 100 pJ |
| DRAM | 1.3-2.6 nJ |

## 3 High Power Efficiency during the Forward Pass

Computer hardware, be it general-purpose or specialized, is composed of memories, arithmetic operators and control logic. During the forward pass (both at run-time and train-time), BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which might lead to a great increase in power-efficiency. Moreover, a binarized CNN can lead to binary convolution kernel repetitions, and we argue that dedicated hardware could reduce the time complexity by 60%.

**Memory Size and Accesses.** Improving computing performance has always been and remains a challenge. Over the last decade, power has been the main constraint on performance (Horowitz, 2014). This is why much research effort has been devoted to reducing the energy consumption of neural networks. Horowitz (2014) provides rough numbers for the energy consumed by the computation (the given numbers are for 45 nm technology), as summarized in Tables 2 and 3. Importantly, we can see that memory accesses typically consume more energy than arithmetic operations, and memory access cost increases with memory size. In comparison with 32-bit DNNs, BNNs require 32 times smaller memory size and 32 times fewer memory accesses. This is expected to reduce energy consumption drastically (i.e., by more than a factor of 32).

**XNOR-Count.** Applying a DNN mainly consists of convolutions and matrix multiplications. The key arithmetic operation of deep learning is thus the multiply-accumulate operation. Artificial neurons are basically multiply-accumulators computing weighted sums of their inputs. In BNNs, both the activations and the weights are constrained to either −1 or +1. As a result, most of the 32-bit floating point multiply-accumulations are replaced by 1-bit XNOR-count operations. This could have a big impact on dedicated deep learning hardware. For instance, a 32-bit floating point multiplier costs about 200 Xilinx FPGA slices (Govindu et al., 2004; Beauchamp et al., 2006), whereas a 1-bit XNOR gate only costs a single slice.

**Exploiting Filter Repetitions.** When using a ConvNet architecture with binary weights, the number of unique filters is bounded by the filter size. For example, in our implementation we use filters of size 3×3, so the maximum number of unique 2D filters is $2^9 = 512$. Since we now have binary filters, many 2D filters of size $k \times k$ repeat themselves. By using dedicated hardware/software, we can apply only the unique 2D filters on each feature map and sum the results to obtain each 3D filter's convolutional result. For example, in our ConvNet architecture trained on the CIFAR-10 benchmark, there are only 42% unique filters per layer on average. Hence we can reduce the number of XNOR-popcount operations by a factor of 3.
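As a rough illustration of the filter-repetition argument (not the paper's measurement code), the following NumPy sketch counts how many distinct 3×3 binary filters appear in a randomly binarized convolution layer; the layer sizes are hypothetical, and trained BNN filters repeat far more than random ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer: 256 output channels x 128 input channels of 3x3 binary filters.
filters = rng.choice([-1, 1], size=(256, 128, 3, 3)).astype(np.int8)

# Flatten every 2D filter to a 9-element pattern and count the distinct ones.
patterns = filters.reshape(-1, 9)                  # 256 * 128 = 32768 2D filters
unique_patterns = np.unique(patterns, axis=0)

print(len(patterns), "2D filters,", len(unique_patterns), "unique patterns")
# At most 2^9 = 512 distinct patterns can exist, so dedicated hardware only needs
# to evaluate each unique pattern once per feature map and reuse the results.
```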
## 4 Seven Times Faster on GPU at Run-Time

It is possible to speed up GPU implementations of BNNs by using a method sometimes called SIMD (single instruction, multiple data) within a register (SWAR). The basic idea of SWAR is to concatenate groups of 32 binary variables into 32-bit registers, and thus obtain a 32-fold speed-up on bit-wise operations (e.g., XNOR). Using SWAR, it is possible to evaluate 32 connections with only 3 instructions:

$$a_1 \mathrel{+}= \mathrm{popcount}(\mathrm{xnor}(a_0^{32b}, w_1^{32b})), \qquad (7)$$

where $a_1$ is the resulting weighted sum, and $a_0^{32b}$ and $w_1^{32b}$ are the concatenated inputs and weights. Those 3 instructions (accumulation, popcount, xnor) take $1 + 4 + 1 = 6$ clock cycles on recent Nvidia GPUs (and if they were to become a fused instruction, it would only take a single clock cycle). Consequently, we obtain a theoretical Nvidia GPU speed-up by a factor of $32/6 \approx 5.3$. In practice, this speed-up is quite easy to obtain as the memory bandwidth to computation ratio is also increased by 6 times.

Figure 2: The first three columns represent the time it takes to perform an 8192 × 8192 × 8192 (binary) matrix multiplication on a GTX750 Nvidia GPU, depending on which kernel is used. We can see that our XNOR kernel is 23 times faster than our baseline kernel and 3.4 times faster than cuBLAS. The next three columns represent the time it takes to run the MLP from Section 2 on the full MNIST test set. As MNIST's images are not binary, the first layer's computations are always performed by the baseline kernel. The last three columns show that the MLP accuracy does not depend on which kernel is used.

In order to validate those theoretical results, we programmed two GPU kernels. The first kernel (baseline) is an unoptimized matrix multiplication kernel. The second kernel (XNOR) is nearly identical to the baseline kernel, except that it uses the SWAR method, as in Equation (7). The two GPU kernels return identical outputs when their inputs are constrained to −1 or +1 (but not otherwise). The XNOR kernel is about 23 times faster than the baseline kernel and 3.4 times faster than cuBLAS, as shown in Figure 2. Last but not least, the MLP from Section 2 runs 7 times faster with the XNOR kernel than with the baseline kernel, without suffering any loss in classification accuracy (see Figure 2).
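To illustrate the idea behind Eq. (7) outside the GPU (the actual kernels are CUDA and are not reproduced here), the following hedged Python sketch packs ±1 vectors into 32-bit words and evaluates the dot product with XNOR and popcount; all function names are ours.

```python
import numpy as np

def pack_bits(v):
    # Encode +1 as bit 1 and -1 as bit 0, packing 32 values into one integer.
    bits = (v > 0).astype(np.uint64)
    word = 0
    for i, b in enumerate(bits):
        word |= int(b) << i
    return word

def xnor_dot(a_word, w_word, n=32):
    # Dot product of two +-1 vectors: agreements minus disagreements.
    # XNOR marks agreeing bit positions; popcount counts them.
    mask = (1 << n) - 1
    xnor = ~(a_word ^ w_word) & mask
    agreements = bin(xnor).count("1")   # popcount
    return 2 * agreements - n

rng = np.random.default_rng(0)
a = rng.choice([-1, 1], size=32)
w = rng.choice([-1, 1], size=32)

assert xnor_dot(pack_bits(a), pack_bits(w)) == int(a @ w)
print(int(a @ w))
```

The `2 * popcount - n` adjustment recovers the exact ±1 dot product from the raw popcount; Eq. (7) accumulates popcounts in the same spirit inside the kernel's inner loop.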
## 5 Discussion and Related Work

Until recently, the use of extremely low-precision networks (binary in the extreme case) was believed to be highly destructive to network performance (Courbariaux et al., 2014). Soudry et al. (2014) and Cheng et al. (2015) proved the contrary by showing that good performance could be achieved even if all neurons and weights are binarized to ±1. This was done using Expectation BackPropagation (EBP), a variational Bayesian approach, which infers networks with binary weights and neurons by updating the posterior distributions over the weights. These distributions are updated by differentiating their parameters (e.g., mean values) via the back-propagation (BP) algorithm. Esser et al. (2015) implemented a fully binary network at run-time using a very similar approach to EBP, showing significant improvement in energy efficiency. The drawback of EBP is that the binarized parameters are only used during inference.

The probabilistic idea behind EBP was extended in the BinaryConnect algorithm of Courbariaux et al. (2015). In BinaryConnect, the real-valued version of the weights is saved and used as a key reference for the binarization process. The binarization noise is independent between different weights, either by construction (by using stochastic quantization) or by assumption (a common simplification; see Spang (1962)). The noise would have little effect on the next neuron's input because the input is a summation over many weighted neurons. Thus, the real-valued version could be updated via the back-propagated error by simply ignoring the binarization noise in the update. Using this method, Courbariaux et al. (2015) were the first to binarize weights in CNNs and achieved near state-of-the-art performance on several datasets. They also argued that noisy weights provide a form of regularization, which could help to improve generalization, as previously shown in (Wan et al., 2013). This method binarized weights while still maintaining full-precision neurons.

Lin et al. (2015) carried over the work of Courbariaux et al. (2015) to the back-propagation process by quantizing the representations at each layer of the network, to convert some of the remaining multiplications into bit-shifts by restricting the neurons' values to power-of-two integers. The work of Lin et al. (2015) and ours seem to share similar characteristics. However, their approach continues to use full-precision weights during the test phase. Moreover, Lin et al. (2015) quantize the neurons only during the back-propagation process, and not during forward propagation.

Other research (Baldassi et al., 2015) showed that full binary training and testing is possible in an array of committee machines with randomized input, where only one weight layer is being adjusted. Gong et al. (2014) aimed to compress a fully trained high-precision network by using quantization or matrix factorization methods. These methods require training the network with full-precision weights and neurons, thus requiring the numerous MAC operations the proposed BNN algorithm avoids. Hwang & Sung (2014) focused on a fixed-point neural network design and achieved performance almost identical to that of the floating-point architecture. Kim & Smaragdis (2016) retrained neural networks with binary weights and activations.

So far, to the best of our knowledge, no work has succeeded in binarizing weights and neurons at both the inference phase and the entire training phase of a deep network. This was achieved in the present work. We relied on the idea that binarization can be done stochastically, or be approximated as random noise. This was previously done for the weights by Courbariaux et al. (2015), but our BNNs extend this to the activations. Note that the binary activations are especially important for ConvNets, where there are typically many more neurons than free weights. This allows highly efficient operation of the binarized DNN at run-time, and at the forward-propagation phase during training. Moreover, our training method has almost no multiplications, and therefore might be implemented efficiently in dedicated hardware. However, we have to save the value of the full-precision weights. This is a remaining computational bottleneck during training, since it is an energy-consuming operation.

We have introduced BNNs, which binarize deep neural networks and can lead to dramatic improvements in both power consumption and computation speed. During the forward pass (both at run-time and train-time), BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations. Our estimates indicate that power efficiency can be improved by more than one order of magnitude (see Section 3).
In terms of speed, we programmed a binary matrix multiplication GPU kernel that enabled running our MLP over the MNIST dataset 7 times faster (than with an unoptimized GPU kernel) without suffering any accuracy degradation (see Section 4). We have shown that BNNs can handle MNIST, CIFAR-10 and SVHN while achieving nearly state-of-the-art accuracy. While our preliminary results for the challenging ImageNet are not on par with the best results achievable with full-precision networks, they significantly improve upon all previous attempts to compress ImageNet-capable architectures (see Section 2 and the supplementary material, Appendix B). Moreover, by relaxing the binary constraints and allowing more than 1 bit per weight and activation, we have been able to achieve prediction accuracy comparable to that of the 32-bit counterparts. Full details can be found in our latest work (Hubara et al., 2016).

A major open question is how to further improve our results on ImageNet. Substantial progress in this direction could have a large impact on DNN usability in low-power instruments such as mobile phones.

## Acknowledgments

We would like to express our appreciation to Elad Hoffer for his technical assistance and constructive comments. We thank our fellow MILA lab members who took the time to read the article and give us feedback. We thank the developers of Torch (Collobert et al., 2011), a Lua-based environment, and Theano (Bergstra et al., 2010; Bastien et al., 2012), a Python library which allowed us to easily develop fast and optimized code for GPU. We also thank the developers of Pylearn2 (Goodfellow et al., 2013) and Lasagne (Dieleman et al., 2015), two Deep Learning libraries built on top of Theano. We thank Yuxin Wu for helping us compare our GPU kernels with cuBLAS. We are grateful for funding from NSERC, the Canada Research Chairs, Compute Canada, and CIFAR. We are also grateful for funding from CIFAR, NSERC, IBM, and Samsung. This research was also supported by the Israel Science Foundation (grant No. 1890/14).

## References

Baldassi, C., Ingrosso, A., Lucibello, C., Saglietti, L., and Zecchina, R. Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses. Physical Review Letters, 115(12):1-5, 2015.

Bastien, F., Lamblin, P., Pascanu, R., et al. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.

Beauchamp, M. J., Hauck, S., Underwood, K. D., and Hemmert, K. S. Embedded floating-point units in FPGAs. In Proceedings of the 2006 ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays, pp. 12-20. ACM, 2006.

Bengio, Y. Estimating or propagating gradients through stochastic neurons. Technical Report arXiv:1305.2982, Université de Montréal, 2013.

Bergstra, J., Breuleux, O., Bastien, F., et al. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral presentation.

Chen, T., Du, Z., Sun, N., et al. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine learning. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 269-284. ACM, 2014.

Cheng, Z., Soudry, D., Mao, Z., and Lan, Z. Training binary multilayer neural networks for image classification using expectation backpropagation. arXiv preprint arXiv:1503.03562, 2015.
Coates, A., Huval, B., Wang, T., et al. Deep learning with COTS HPC systems. In Proceedings of the 30th International Conference on Machine Learning, pp. 1337-1345, 2013.

Collobert, R., Kavukcuoglu, K., and Farabet, C. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.

Courbariaux, M., Bengio, Y., and David, J.-P. Training deep neural networks with low precision multiplications. arXiv e-prints, abs/1412.7024, December 2014.

Courbariaux, M., Bengio, Y., and David, J.-P. BinaryConnect: Training deep neural networks with binary weights during propagations. arXiv e-prints, abs/1511.00363, November 2015.

Dieleman, S., Schlüter, J., Raffel, C., et al. Lasagne: First release, August 2015.

Esser, S. K., Appuswamy, R., Merolla, P., Arthur, J. V., and Modha, D. S. Backpropagation for energy-efficient neuromorphic computing. In Advances in Neural Information Processing Systems, pp. 1117-1125, 2015.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In AISTATS 2010, 2010.

Gong, Y., Liu, L., Yang, M., and Bourdev, L. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.

Goodfellow, I. J., Warde-Farley, D., Lamblin, P., et al. Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214, 2013.

Govindu, G., Zhuo, L., Choi, S., and Prasanna, V. Analysis of high-performance floating-point arithmetic on FPGAs. In Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International, pp. 149. IEEE, 2004.

Graves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pp. 2348-2356, 2011.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015a.

Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135-1143, 2015b.

Hinton, G. Neural networks for machine learning. Coursera, video lectures, 2012.

Horowitz, M. Computing's energy problem (and what we can do about it). IEEE International Solid-State Circuits Conference, pp. 10-14, 2014.

Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.

Hwang, K. and Sung, W. Fixed-point feedforward deep neural network design using weights +1, 0, and −1. In Signal Processing Systems (SiPS), 2014 IEEE Workshop on, pp. 1-6. IEEE, 2014.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. 2015.

Kim, M. and Smaragdis, P. Bitwise neural networks. arXiv e-prints, January 2016.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436-444, 2015.

Lee, C.-Y., Gallagher, P. W., and Tu, Z. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. arXiv preprint arXiv:1509.08985, 2015.

Lin, Z., Courbariaux, M., Memisevic, R., and Bengio, Y. Neural networks with few multiplications. arXiv e-prints, abs/1510.03009, October 2015.
Soudry, D., Hubara, I., and Meir, R. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In NIPS 2014, 2014.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929-1958, 2014.

Szegedy, C., Liu, W., Jia, Y., et al. Going deeper with convolutions. Technical report, arXiv:1409.4842, 2014.

Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. Regularization of neural networks using DropConnect. In ICML 2013, 2013.