# LOSS-AWARE WEIGHT QUANTIZATION OF DEEP NETWORKS

Published as a conference paper at ICLR 2018

Lu Hou, James T. Kwok
Department of Computer Science and Engineering
Hong Kong University of Science and Technology
Hong Kong
{lhouab, jamesk}@cse.ust.hk

ABSTRACT

The huge size of deep networks hinders their use in small computing devices. In this paper, we consider compressing the network by weight quantization. We extend a recently proposed loss-aware weight binarization scheme to ternarization, with possibly different scaling parameters for the positive and negative weights, and to $m$-bit (where $m > 2$) quantization. Experiments on feedforward and recurrent neural networks show that the proposed scheme outperforms state-of-the-art weight quantization algorithms, and is as accurate (or even more accurate) than the full-precision network.

1 INTRODUCTION

The last decade has witnessed huge success of deep neural networks in various domains. Examples include computer vision, speech recognition, and natural language processing (LeCun et al., 2015). However, their huge size often hinders deployment to small computing devices such as cell phones and the internet of things.

Many attempts have been made recently to reduce the model size. One common approach is to prune a trained dense network (Han et al., 2015; 2016). However, most of the pruned weights may come from the fully-connected layers where computations are cheap, and the resultant time reduction is insignificant. Li et al. (2017b) and Molchanov et al. (2017) proposed to prune filters in convolutional neural networks based on their magnitudes or their significance to the loss. However, the pruned network has to be retrained, which is again expensive.

Another direction is to use more compact models. GoogLeNet (Szegedy et al., 2015) and ResNet (He et al., 2016) replace the fully-connected layers with simpler global average pooling, but they are also deeper. SqueezeNet (Iandola et al., 2016) reduces the model size by replacing most of the 3×3 filters with 1×1 filters. This is less efficient on smaller networks because the dense 1×1 convolutions are costly. MobileNet (Howard et al., 2017) compresses the model using separable depth-wise convolutions. ShuffleNet (Zhang et al., 2017) utilizes pointwise group convolution and channel shuffle to reduce the computation cost while maintaining accuracy. However, highly optimized group convolution and depth-wise convolution implementations are required. Alternatively, Novikov et al. (2015) compressed the model by using a compact multilinear format to represent the dense weight matrix. The CP and Tucker decompositions have also been used on the kernel tensors in CNNs (Lebedev et al., 2014; Kim et al., 2016). However, they often need expensive fine-tuning.

Another effective approach to compress the network and accelerate training is to quantize each full-precision weight to a small number of bits. This can be further divided into two sub-categories, depending on whether pre-trained models are used (Lin et al., 2016a; Mellempudi et al., 2017) or the quantized model is trained from scratch (Courbariaux et al., 2015; Li et al., 2017a). Some of these also directly learn with low-precision weights, but they usually suffer from severe accuracy deterioration (Li et al., 2017a; Miyashita et al., 2016).
By keeping the full-precision weights during learning, Courbariaux et al. (2015) pioneered the BinaryConnect algorithm, which uses only one bit for each weight while still achieving state-of-the-art classification results. Rastegari et al. (2016) further incorporated weight scaling, and obtained better results. Instead of simply finding the closest binary approximation of the full-precision weights, a loss-aware binarization scheme is proposed in (Hou et al., 2017). Beyond binarization, TernaryConnect (Lin et al., 2016b) quantizes each weight to $\{-1, 0, 1\}$. Li & Liu (2016) and Zhu et al. (2017) added scaling to the ternarized weights, and DoReFa-Net (Zhou et al., 2016) further extended quantization to more than three levels. However, these methods do not consider the effect of quantization on the loss, and rely on heuristics in their procedures (Zhou et al., 2016; Zhu et al., 2017). Recently, a loss-aware low-bit quantized neural network was proposed in (Leng et al., 2017). However, it uses full-precision weights in the forward pass and the extra-gradient method (Vasilyev et al., 2010) for the update, both of which are expensive.

In this paper, we propose an efficient and disciplined ternarization scheme for network compression. Inspired by (Hou et al., 2017), we explicitly consider the effect of ternarization on the loss. This is formulated as an optimization problem, which is then solved efficiently by the proximal Newton algorithm. When the curvature of the loss surface is ignored, the proposed method reduces to that of (Li & Liu, 2016), and is also related to the projection step of (Leng et al., 2017). Next, we extend it to (i) allow the use of different scaling parameters for the positive and negative weights; and (ii) the use of $m$ bits (where $m > 2$) for weight quantization. Experiments on both feedforward and recurrent neural networks show that the proposed quantization scheme outperforms state-of-the-art algorithms.

Notations: For a vector $x$, $\sqrt{x}$ denotes the element-wise square root (i.e., $[\sqrt{x}]_i = \sqrt{x_i}$), $|x|$ is the element-wise absolute value, $\|x\|_p = (\sum_i |x_i|^p)^{1/p}$ is its $p$-norm, and $\mathrm{Diag}(x)$ returns a diagonal matrix with $x$ on the diagonal. For two vectors $x$ and $y$, $x \odot y$ denotes the element-wise multiplication and $x \oslash y$ the element-wise division. Given a threshold $\Delta$, $I_\Delta(x)$ returns a vector such that $[I_\Delta(x)]_i = 1$ if $x_i > \Delta$, $-1$ if $x_i < -\Delta$, and $0$ otherwise. $I^+_\Delta(x)$ considers only the positive threshold, i.e., $[I^+_\Delta(x)]_i = 1$ if $x_i > \Delta$, and $0$ otherwise. Similarly, $[I^-_\Delta(x)]_i = -1$ if $x_i < -\Delta$, and $0$ otherwise. For a matrix $X$, $\mathrm{vec}(X)$ returns a vector by stacking all the columns of $X$, and $\mathrm{diag}(X)$ returns a vector whose entries are from the diagonal of $X$.

2 RELATED WORK

Let the full-precision weights from all $L$ layers be $w = [w_1^\top, w_2^\top, \dots, w_L^\top]^\top$, where $w_l = \mathrm{vec}(W_l)$, and $W_l$ is the weight matrix at layer $l$. The corresponding quantized weights will be denoted $\hat{w} = [\hat{w}_1^\top, \hat{w}_2^\top, \dots, \hat{w}_L^\top]^\top$.

2.1 WEIGHT BINARIZED NETWORKS

In BinaryConnect (Courbariaux et al., 2015), each element of $w_l$ is binarized to $-1$ or $+1$ by using the sign function: $\mathrm{Binarize}(w_l) = \mathrm{sign}(w_l)$. In the Binary-Weight-Network (BWN) (Rastegari et al., 2016), a scaling parameter is also included, i.e., $\mathrm{Binarize}(w_l) = \alpha_l b_l$, where $\alpha_l > 0$, $b_l \in \{-1, +1\}^{n_l}$, and $n_l$ is the number of weights in $w_l$. By minimizing the difference between $w_l$ and $\alpha_l b_l$, the optimal $\alpha_l, b_l$ have the simple form $\alpha_l = \|w_l\|_1 / n_l$ and $b_l = \mathrm{sign}(w_l)$.
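This closed-form solution is easy to state in code. Below is a minimal NumPy sketch of BWN-style scaled binarization (our own illustrative snippet, not the authors' implementation; the helper name `bwn_binarize` is hypothetical):

```python
import numpy as np

def bwn_binarize(w):
    """Scaled binarization in the BWN style: approximate w by alpha * b with
    b in {-1, +1}^n and alpha > 0 minimizing ||w - alpha * b||_2."""
    b = np.sign(w)
    b[b == 0] = 1.0           # break ties: map exact zeros to +1
    alpha = np.abs(w).mean()  # closed form: alpha = ||w||_1 / n
    return alpha, b

# toy usage
w = np.array([0.3, -1.2, 0.05, 0.8])
alpha, b = bwn_binarize(w)
print(alpha, b)               # 0.5875, [ 1. -1.  1.  1.]
```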
Instead of simply finding the best binary approximation of the full-precision weight $w_l^t$ at iteration $t$, the loss-aware binarized network (LAB) directly minimizes the loss w.r.t. the binarized weight $\alpha_l^t b_l^t$ (Hou et al., 2017). Let $d_l^{t-1}$ be a vector containing the diagonal of an approximate Hessian of the loss. It can be shown that $\alpha_l^t = \|d_l^{t-1} \odot w_l^t\|_1 / \|d_l^{t-1}\|_1$ and $b_l^t = \mathrm{sign}(w_l^t)$.

2.2 WEIGHT TERNARIZED NETWORKS

In a weight ternarized network, zero is used as an additional quantized value. In TernaryConnect (Lin et al., 2016b), each weight value is clipped to $[-1, 1]$ before quantization, and a non-negative $[w_l^t]_i$ is then stochastically quantized to $1$ with probability $[w_l^t]_i$ (and $0$ otherwise). When $[w_l^t]_i$ is negative, it is quantized to $-1$ with probability $-[w_l^t]_i$, and $0$ otherwise. In the ternary weight network (TWN) (Li & Liu, 2016), $w_l^t$ is quantized to $\hat{w}_l^t = \alpha_l^t I_{\Delta_l^t}(w_l^t)$, where $\Delta_l^t$ is a threshold (i.e., $[\hat{w}_l^t]_i = \alpha_l^t$ if $[w_l^t]_i > \Delta_l^t$, $-\alpha_l^t$ if $[w_l^t]_i < -\Delta_l^t$, and $0$ otherwise). To obtain $\Delta_l^t$ and $\alpha_l^t$, TWN minimizes the $\ell_2$-distance between the full-precision and ternarized weights, leading to

$$\Delta_l^t = \arg\max_{\Delta > 0} \frac{1}{\|I_\Delta(w_l^t)\|_1} \Big( \sum_{i: |[w_l^t]_i| > \Delta} |[w_l^t]_i| \Big)^2, \qquad \alpha_l^t = \frac{1}{\|I_{\Delta_l^t}(w_l^t)\|_1} \sum_{i: |[w_l^t]_i| > \Delta_l^t} |[w_l^t]_i|. \qquad (1)$$

However, the $\Delta_l^t$ in (1) is difficult to solve for. Instead, TWN simply sets $\Delta_l^t = 0.7\,E(|w_l^t|)$ in practice.

In TWN, one scaling parameter ($\alpha_l^t$) is used for both the positive and negative weights at layer $l$. In the trained ternary quantization (TTQ) network (Zhu et al., 2017), different scaling parameters ($\alpha_l^t$ and $\beta_l^t$) are used, and the weight $w_l^t$ is quantized to $\hat{w}_l^t = \alpha_l^t I^+_{\Delta_l^t}(w_l^t) + \beta_l^t I^-_{\Delta_l^t}(w_l^t)$. The scaling parameters are learned by gradient descent. As for $\Delta_l^t$, two heuristics are used: the first sets $\Delta_l^t$ to a constant fraction of $\max(|w_l^t|)$, while the second sets the $\Delta_l^t$'s such that all layers are equally sparse.

2.3 WEIGHT QUANTIZED NETWORKS

In a weight quantized network, $m$ bits (where $m \geq 2$) are used to represent each weight. Let $\mathcal{Q}$ be a set of $(2k+1)$ quantized values, where $k = 2^{m-1} - 1$. Two popular choices of $\mathcal{Q}$ are $\{-1, -\frac{k-1}{k}, \dots, -\frac{1}{k}, 0, \frac{1}{k}, \dots, \frac{k-1}{k}, 1\}$ (linear quantization) and $\{-1, -\frac{1}{2}, \dots, -\frac{1}{2^{k-1}}, 0, \frac{1}{2^{k-1}}, \dots, \frac{1}{2}, 1\}$ (logarithmic quantization). By limiting the quantized values to powers of two, logarithmic quantization is advantageous in that expensive floating-point operations can be replaced by cheaper bit-shift operations. When $m = 2$, both schemes reduce to $\mathcal{Q} = \{-1, 0, 1\}$.

In DoReFa-Net (Zhou et al., 2016), weight $w_l^t$ is heuristically quantized to $m$ bits, with:¹

$$[\hat{w}_l^t]_i = 2\,\mathrm{quantize}_m\!\left(\frac{\tanh([w_l^t]_i)}{2\max(|\tanh(w_l^t)|)} + \frac{1}{2}\right) - 1 \;\in\; \Big\{-1, -\tfrac{2^m-2}{2^m-1}, \dots, -\tfrac{1}{2^m-1}, \tfrac{1}{2^m-1}, \dots, \tfrac{2^m-2}{2^m-1}, 1\Big\},$$

where $\mathrm{quantize}_m(x) = \frac{1}{2^m-1}\,\mathrm{round}((2^m-1)x)$.

¹Note that the quantized value $0$ is not used in DoReFa-Net.

Similar to loss-aware binarization (Hou et al., 2017), Leng et al. (2017) proposed a loss-aware quantized network called the low-bit neural network (LBNN). The alternating direction method of multipliers (ADMM) (Boyd et al., 2011) is used for optimization. At the $t$th iteration, the full-precision weight $w_l^t$ is first updated by the extra-gradient method (Vasilyev et al., 2010):

$$\tilde{w}_l^t = w_l^{t-1} - \eta^t \nabla_l \mathcal{L}(w_l^{t-1}), \qquad w_l^t = w_l^{t-1} - \eta^t \nabla_l \mathcal{L}(\tilde{w}_l^t), \qquad (2)$$

where $\mathcal{L}$ is the augmented Lagrangian in the ADMM formulation, and $\eta^t$ is the stepsize. Next, $w_l^t$ is projected to the space of $m$-bit quantized weights, so that $\hat{w}_l^t$ is of the form $\alpha_l b_l$, where $\alpha_l > 0$ and $b_l \in \{-1, -\frac{1}{2}, \dots, -\frac{1}{2^{k-1}}, 0, \frac{1}{2^{k-1}}, \dots, \frac{1}{2}, 1\}^{n_l}$.
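To make the ternarization baselines above concrete, the TWN heuristic of Section 2.2 (threshold $\Delta = 0.7\,E(|w|)$, scale $\alpha$ equal to the mean magnitude of the surviving weights) can be sketched in NumPy as follows; this is our own illustrative code, not taken from any of the cited papers:

```python
import numpy as np

def twn_ternarize(w, delta_factor=0.7):
    """Heuristic TWN-style ternarization of a flattened weight vector."""
    delta = delta_factor * np.abs(w).mean()   # Delta = 0.7 * E(|w|)
    b = np.zeros_like(w)
    b[w > delta] = 1.0
    b[w < -delta] = -1.0
    kept = np.abs(w) > delta
    alpha = np.abs(w[kept]).mean() if kept.any() else 0.0
    return alpha, b                           # quantized weight is alpha * b
```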
3 LOSS-AWARE QUANTIZATION

3.1 TERNARIZATION USING THE PROXIMAL NEWTON ALGORITHM

In weight ternarization, TWN simply finds the closest ternary approximation of the full-precision weight at each iteration, while TTQ sets the ternarization threshold heuristically. Inspired by LAB (for binarization), we consider the loss explicitly during quantization, and obtain the quantization thresholds and scaling parameter by solving an optimization problem. As in TWN, the weight $w_l$ is ternarized as $\hat{w}_l = \alpha_l b_l$, where $\alpha_l > 0$ and $b_l \in \{-1, 0, 1\}^{n_l}$. Given a loss function $\ell$, we formulate weight ternarization as the following optimization problem:

$$\min_{\hat{w}} \ \ell(\hat{w}) \quad \text{s.t.} \ \hat{w}_l = \alpha_l b_l, \ \alpha_l > 0, \ b_l \in \mathcal{Q}^{n_l}, \ l = 1, \dots, L, \qquad (3)$$

where $\mathcal{Q}$ is the set of desired quantized values. As in LAB, we will solve this using the proximal Newton method (Lee et al., 2014; Rakotomamonjy et al., 2016). At iteration $t$, the objective is replaced by the second-order expansion

$$\ell(\hat{w}^{t-1}) + \nabla\ell(\hat{w}^{t-1})^\top (\hat{w} - \hat{w}^{t-1}) + \tfrac{1}{2} (\hat{w} - \hat{w}^{t-1})^\top H^{t-1} (\hat{w} - \hat{w}^{t-1}), \qquad (4)$$

where $H^{t-1}$ is an estimate of the Hessian of $\ell$ at $\hat{w}^{t-1}$. We use the diagonal equilibration preconditioner (Dauphin et al., 2015), which is robust in the presence of saddle points and also readily available in popular stochastic deep network optimizers such as Adam (Kingma & Ba, 2015). Let $D_l$ be the approximate diagonal Hessian at layer $l$. We use $D = \mathrm{Diag}([\mathrm{diag}(D_1)^\top, \dots, \mathrm{diag}(D_L)^\top]^\top)$ as an estimate of $H$. Substituting (4) into (3), we solve the following subproblem at the $t$th iteration:

$$\min_{\hat{w}^t} \ \nabla\ell(\hat{w}^{t-1})^\top (\hat{w}^t - \hat{w}^{t-1}) + \tfrac{1}{2} (\hat{w}^t - \hat{w}^{t-1})^\top D^{t-1} (\hat{w}^t - \hat{w}^{t-1}) \qquad (5)$$
$$\text{s.t.} \ \hat{w}^t_l = \alpha^t_l b^t_l, \ \alpha^t_l > 0, \ b^t_l \in \mathcal{Q}^{n_l}, \ l = 1, \dots, L.$$

Proposition 3.1 The objective in (5) can be rewritten as

$$\min_{\hat{w}^t} \ \sum_{l=1}^L \tfrac{1}{2} \big\| \sqrt{d_l^{t-1}} \odot (\hat{w}^t_l - w^t_l) \big\|^2, \qquad (6)$$

where $d_l^{t-1} \equiv \mathrm{diag}(D_l^{t-1})$ and

$$w^t_l \equiv \hat{w}^{t-1}_l - \nabla_l\ell(\hat{w}^{t-1}) \oslash d_l^{t-1}. \qquad (7)$$

Obviously, this objective can be minimized layer by layer. Each proximal Newton iteration thus consists of two steps: (i) obtain $w_l^t$ in (7) by gradient descent along $\nabla_l\ell(\hat{w}^{t-1})$, preconditioned by the adaptive learning rate $1 \oslash d_l^{t-1}$ so that the rescaled dimensions have similar curvatures; and (ii) quantize $w_l^t$ to $\hat{w}_l^t$ by minimizing the scaled difference between $\hat{w}_l^t$ and $w_l^t$ in (6). Intuitively, when the curvature is low ($[d_l^{t-1}]_i$ is small), the loss is not sensitive to the weight, and the ternarization error can be less penalized. When the loss surface is steep, ternarization has to be more accurate. Though the constraint in (6) is more complicated than that in LAB, interestingly the following simple relationship can still be obtained for weight ternarization.

Proposition 3.2 Let $\mathcal{Q} = \{-1, 0, 1\}$, and let the optimal $\hat{w}_l^t$ in (6) be of the form $\alpha b$. For a fixed $b$, $\alpha = \|b \odot d_l^{t-1} \odot w_l^t\|_1 / \|b \odot d_l^{t-1}\|_1$; whereas when $\alpha$ is fixed, $b = I_{\alpha/2}(w_l^t)$. Equivalently, $b$ can be written as $\Pi_\mathcal{Q}(w_l^t / \alpha)$, where $\Pi_\mathcal{Q}(\cdot)$ projects each entry of the input argument to the nearest element in $\mathcal{Q}$.

Further discussions on how to solve for $\alpha_l^t$ will be presented in Sections 3.1.1 and 3.1.2. When the curvature is the same for all dimensions at layer $l$, the following corollary shows that the solution above reduces to that of TWN.

Corollary 3.1 When $D_l^{t-1} = \lambda I$, $\alpha_l^t$ reduces to the TWN solution in (1) with $\Delta_l^t = \alpha_l^t / 2$.

In other words, TWN corresponds to using the proximal gradient algorithm, while the proposed method corresponds to using the proximal Newton algorithm with a diagonal Hessian. In composite optimization, it is known that the proximal Newton algorithm is more efficient than the proximal gradient algorithm (Lee et al., 2014; Rakotomamonjy et al., 2016). Moreover, note that the interesting relationship $\Delta_l^t = \alpha_l^t / 2$ is not observed in TWN, while TTQ completely neglects this relationship.
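The two-step structure of a proximal Newton iteration, together with the alternating closed-form updates of Proposition 3.2 for the quantization step, can be sketched as follows (an illustrative NumPy snippet with hypothetical helper names; `w`, `grad`, and `d` stand for the per-layer full-precision weights, the gradient $\nabla_l\ell(\hat{w}^{t-1})$, and the diagonal Hessian estimate $d_l^{t-1}$, all flattened to 1-D arrays):

```python
import numpy as np

def loss_aware_ternarize(w, d, n_iter=10):
    """Quantization step (ii): minimize 0.5 * sum_i d_i * (alpha*b_i - w_i)^2
    over alpha > 0, b_i in {-1, 0, 1} by alternating the updates of Prop. 3.2."""
    alpha = np.abs(w).mean() + 1e-12          # any positive initialization
    b = np.zeros_like(w)
    for _ in range(n_iter):
        b = np.where(w > alpha / 2, 1.0, np.where(w < -alpha / 2, -1.0, 0.0))
        support = b != 0
        if not support.any():                 # degenerate case: all weights mapped to 0
            break
        alpha = np.sum(d[support] * np.abs(w[support])) / np.sum(d[support])
    return alpha * b

def proximal_newton_step(w, grad, d):
    """One proximal Newton iteration: (i) preconditioned gradient step with the
    adaptive learning rate 1/d, then (ii) loss-aware ternarization w.r.t. d."""
    w_new = w - grad / d                      # step (i), cf. equation (7)
    w_hat = loss_aware_ternarize(w_new, d)    # step (ii), minimize (6)
    return w_new, w_hat
```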
In LBNN (Leng et al., 2017), the projection step uses an objective which is similar to (6), but without the curvature information. Besides, its $w_l^t$ is updated with the extra-gradient in (2), which doubles the number of forward, backward and update steps, and can be costly. Moreover, LBNN uses full-precision weights in the forward pass, while all other quantization methods, including ours, use quantized weights (which eliminates most of the multiplications and thus speeds up training).

When (i) $\ell$ is continuously differentiable with Lipschitz-continuous gradient (i.e., there exists $\beta > 0$ such that $\|\nabla\ell(u) - \nabla\ell(v)\|_2 \leq \beta \|u - v\|_2$ for any $u, v$); (ii) $\ell$ is bounded from below; and (iii) $[d_l^t]_k > \beta$ for all $l, k, t$, it can be shown that the objective of (3) produced by the proximal Newton algorithm (with the solution in Proposition 3.2) converges (Hou et al., 2017).

In practice, it is important to keep the full-precision weights during the update (Courbariaux et al., 2015). Hence, we replace (7) by $w_l^t \leftarrow w_l^{t-1} - \nabla_l\ell(\hat{w}^{t-1}) \oslash d_l^{t-1}$. The whole procedure, which is called Loss-Aware Ternarization (LAT), is shown in Algorithm 3 of Appendix B. It is similar to Algorithm 1 of LAB (Hou et al., 2017), except that $\alpha_l^t$ and $b_l^t$ are computed differently. In step 4, following (Li & Liu, 2016), we first rescale the layer input $x_{l-1}^t$ with $\alpha_l$, so that multiplications in dot products and convolutions become additions. Algorithm 3 can also be easily extended to ternarize weights in recurrent networks. Interested readers are referred to (Hou et al., 2017) for details.

3.1.1 EXACT SOLUTION OF $\alpha_l^t$

To simplify notations, we drop the superscripts and subscripts. From Proposition 3.2,

$$\alpha = \frac{\|b \odot d \odot w\|_1}{\|b \odot d\|_1}, \qquad b = I_{\alpha/2}(w). \qquad (8)$$

We now consider how to solve for $\alpha$. First, we introduce some notations. Given a vector $x = [x_1, x_2, \dots, x_n]$ and an indexing vector $s \in \mathbb{R}^n$ whose entries are a permutation of $\{1, \dots, n\}$, $\mathrm{perm}_s(x)$ returns the vector $[x_{s_1}, x_{s_2}, \dots, x_{s_n}]$, and $\mathrm{cum}(x) = [x_1, \sum_{i=1}^2 x_i, \dots, \sum_{i=1}^n x_i]$ returns the partial sums of the elements in $x$. For example, let $a = [1, -1, 2]$ and $b = [3, 1, 2]$. Then, $\mathrm{perm}_b(a) = [2, 1, -1]$ and $\mathrm{cum}(a) = [1, 0, 2]$.

We sort the elements of $|w|$ in descending order, and let the vector containing the sorted indices be $s$. For example, if $w = [1, 0, 2]$, then $s = [3, 1, 2]$. Since $b = I_{\alpha/2}(w)$ keeps exactly the entries with $|w_i| > \alpha/2$, i.e., the first $j$ entries after sorting by $|w|$ for some index $j$, (8) becomes

$$\alpha = \frac{\|I_{\alpha/2}(w) \odot d \odot w\|_1}{\|I_{\alpha/2}(w) \odot d\|_1} = \frac{[\mathrm{cum}(\mathrm{perm}_s(|d \odot w|))]_j}{[\mathrm{cum}(\mathrm{perm}_s(d))]_j} = 2 c_j, \qquad (9)$$

where $c = \mathrm{cum}(\mathrm{perm}_s(|d \odot w|)) \oslash (2\, \mathrm{cum}(\mathrm{perm}_s(d)))$, and $j$ is an index such that

$$[\mathrm{perm}_s(|w|)]_j > c_j > [\mathrm{perm}_s(|w|)]_{j+1}. \qquad (10)$$

For simplicity of notations, let the dimensionality of $w$ (and thus also of $c$) be $n$, and let the operation $\mathrm{find}(\mathrm{condition}(x))$ return all indices of $x$ that satisfy the condition. It is easy to see that any $j$ satisfying (10) is in

$$S \equiv \mathrm{find}\big( ([\mathrm{perm}_s(|w|)]_{[1:(n-1)]} - c_{[1:(n-1)]}) \odot ([\mathrm{perm}_s(|w|)]_{[2:n]} - c_{[1:(n-1)]}) < 0 \big),$$

where $c_{[1:(n-1)]}$ is the subvector of $c$ with elements in the index range $1$ to $n-1$. The optimal $\alpha$ ($= 2c_j$) is then the one which yields the smallest objective in (6), which can be simplified by Proposition 3.3 below. The procedure is shown in Algorithm 1.

Proposition 3.3 The optimal $\alpha_l^t$ of (6) equals $2 \arg\max_{c_j : j \in S} c_j^2\, [\mathrm{cum}(\mathrm{perm}_s(d_l^{t-1}))]_j$.

Algorithm 1 Exact solver of (6)
1: Input: full-precision weight $w_l^t$, diagonal entries $d_l^{t-1}$ of the approximate Hessian.
2: $s = \arg\mathrm{sort}(|w_l^t|)$ (in descending order);
3: $c = \mathrm{cum}(\mathrm{perm}_s(|d_l^{t-1} \odot w_l^t|)) \oslash (2\, \mathrm{cum}(\mathrm{perm}_s(d_l^{t-1})))$;
4: $S = \mathrm{find}\big( ([\mathrm{perm}_s(|w_l^t|)]_{[1:(n-1)]} - c_{[1:(n-1)]}) \odot ([\mathrm{perm}_s(|w_l^t|)]_{[2:n]} - c_{[1:(n-1)]}) < 0 \big)$;
5: $\alpha_l^t = 2 \arg\max_{c_j : j \in S} c_j^2\, [\mathrm{cum}(\mathrm{perm}_s(d_l^{t-1}))]_j$;
6: $b_l^t = I_{\alpha_l^t / 2}(w_l^t)$;
7: Output: $\hat{w}_l^t = \alpha_l^t b_l^t$.
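Algorithm 1 maps directly onto vectorized sorting and cumulative sums. The following NumPy sketch mirrors steps 2 to 7 (our own illustrative code; the fallback for the degenerate case where no index satisfies (10) is an assumption on our part):

```python
import numpy as np

def exact_ternarize(w, d):
    """Exact loss-aware ternarization: minimize 0.5*sum_i d_i*(alpha*b_i - w_i)^2
    over alpha > 0 and b_i in {-1, 0, 1}, following the steps of Algorithm 1."""
    order = np.argsort(-np.abs(w))              # sort |w| in descending order
    w_sorted = np.abs(w)[order]
    dw_cum = np.cumsum((d * np.abs(w))[order])  # cum(perm_s(|d . w|))
    d_cum = np.cumsum(d[order])                 # cum(perm_s(d))
    c = dw_cum / (2.0 * d_cum)                  # candidate alpha/2 for each prefix
    # indices j with w_sorted[j] > c[j] > w_sorted[j+1], cf. (10)
    cand = np.where((w_sorted[:-1] - c[:-1]) * (w_sorted[1:] - c[:-1]) < 0)[0]
    if cand.size == 0:                          # degenerate case: keep all weights
        cand = np.array([len(w) - 1])
    j = cand[np.argmax(c[cand] ** 2 * d_cum[cand])]   # Proposition 3.3
    alpha = 2.0 * c[j]
    b = np.where(w > alpha / 2, 1.0, np.where(w < -alpha / 2, -1.0, 0.0))
    return alpha, b
```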
3.1.2 APPROXIMATE SOLUTION OF $\alpha_l^t$

In case the sorting operation in step 2 is expensive, $\alpha_l^t$ and $b_l^t$ can be obtained by alternating the two updates in Proposition 3.2 (Algorithm 2). Empirically, this converges very fast, usually in 5 iterations.

Algorithm 2 Approximate solver of (6)
1: Input: $b_l^{t-1}$, full-precision weight $w_l^t$, diagonal entries $d_l^{t-1}$ of the approximate Hessian.
2: Initialize: $\alpha = 1.0$, $\alpha_{\mathrm{old}} = 0.0$, $b = b_l^{t-1}$, $\epsilon = 10^{-6}$;
3: while $|\alpha - \alpha_{\mathrm{old}}| > \epsilon$ do
4: $\quad \alpha_{\mathrm{old}} = \alpha$;
5: $\quad \alpha = \|b \odot d_l^{t-1} \odot w_l^t\|_1 / \|b \odot d_l^{t-1}\|_1$;
6: $\quad b = I_{\alpha/2}(w_l^t)$;
7: end while
8: Output: $\hat{w}_l^t = \alpha b$.

3.2 EXTENSION TO TERNARIZATION WITH TWO SCALING PARAMETERS

As in TTQ (Zhu et al., 2017), we can use different scaling parameters for the positive and negative weights in each layer. The optimization subproblem at the $t$th iteration then becomes:

$$\min_{\hat{w}^t} \ \nabla\ell(\hat{w}^{t-1})^\top (\hat{w}^t - \hat{w}^{t-1}) + \tfrac{1}{2} (\hat{w}^t - \hat{w}^{t-1})^\top D^{t-1} (\hat{w}^t - \hat{w}^{t-1}) \qquad (11)$$
$$\text{s.t.} \ \hat{w}^t_l \in \{-\beta_l^t, 0, \alpha_l^t\}^{n_l}, \ \alpha_l^t > 0, \ \beta_l^t > 0, \ l = 1, \dots, L.$$

Proposition 3.4 The optimal $\hat{w}_l^t$ in (11) is of the form $\hat{w}_l^t = \alpha_l^t p_l^t + \beta_l^t q_l^t$, where $\alpha_l^t = \|p_l^t \odot d_l^{t-1} \odot w_l^t\|_1 / \|p_l^t \odot d_l^{t-1}\|_1$, $p_l^t = I^+_{\alpha_l^t/2}(w_l^t)$, $\beta_l^t = \|q_l^t \odot d_l^{t-1} \odot w_l^t\|_1 / \|q_l^t \odot d_l^{t-1}\|_1$, and $q_l^t = I^-_{\beta_l^t/2}(w_l^t)$.

The exact and approximate solutions for $\alpha_l^t$ and $\beta_l^t$ can be obtained in a similar way as in Sections 3.1.1 and 3.1.2. Details are in Appendix C.

3.3 EXTENSION TO LOW-BIT QUANTIZATION

For $m$-bit quantization, we simply change the set $\mathcal{Q}$ of desired quantized values in (3) to one containing $2k+1$ values (where $k = 2^{m-1} - 1$). The optimization still consists of a gradient descent step with adaptive learning rates as in LAT, and a quantization step which can be solved efficiently by alternating minimization over $(\alpha, b)$ (similar to the procedure in Algorithm 2), using the following proposition.

Proposition 3.5 Let the optimal $\hat{w}_l^t$ in (6) be of the form $\alpha b$. For a fixed $b$, $\alpha = \|b \odot d_l^{t-1} \odot w_l^t\|_1 / \|b \odot b \odot d_l^{t-1}\|_1$; whereas when $\alpha$ is fixed, $b = \Pi_\mathcal{Q}(w_l^t / \alpha)$, where $\mathcal{Q} = \{-1, -\frac{k-1}{k}, \dots, -\frac{1}{k}, 0, \frac{1}{k}, \dots, \frac{k-1}{k}, 1\}$ for linear quantization and $\mathcal{Q} = \{-1, -\frac{1}{2}, \dots, -\frac{1}{2^{k-1}}, 0, \frac{1}{2^{k-1}}, \dots, \frac{1}{2}, 1\}$ for logarithmic quantization.
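A sketch of this $m$-bit quantization step, alternating the two updates of Proposition 3.5, is given below (illustrative NumPy code under our own naming; `quant_levels` builds the set $\mathcal{Q}$ for either scheme, and the weight vector is assumed to be 1-D):

```python
import numpy as np

def quant_levels(m, scheme="log"):
    """Quantized value set Q for m-bit weights, with k = 2^(m-1) - 1 magnitudes."""
    k = 2 ** (m - 1) - 1
    if scheme == "linear":
        mags = np.arange(1, k + 1) / k                 # 1/k, 2/k, ..., 1
    else:                                              # logarithmic: powers of two
        mags = 2.0 ** (-np.arange(k - 1, -1, -1))      # 1/2^(k-1), ..., 1/2, 1
    return np.concatenate([-mags[::-1], [0.0], mags])

def loss_aware_quantize(w, d, m=3, scheme="log", n_iter=10):
    """m-bit quantization step: alternate the closed-form alpha update and
    b = Pi_Q(w / alpha) of Proposition 3.5."""
    Q = quant_levels(m, scheme)
    alpha = np.abs(w).max() + 1e-12
    b = np.zeros_like(w)
    for _ in range(n_iter):
        # Pi_Q: project each rescaled weight to the nearest element of Q
        b = Q[np.argmin(np.abs(w[:, None] / alpha - Q[None, :]), axis=1)]
        denom = np.sum(d * b * b)                      # ||b . b . d||_1 (d > 0)
        if denom == 0:
            break
        alpha = np.sum(d * np.abs(b) * np.abs(w)) / denom
    return alpha * b
```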
4 EXPERIMENTS

In this section, we perform experiments on both feedforward and recurrent neural networks. The following methods are compared: (i) the original full-precision network; (ii) weight-binarized networks, including BinaryConnect (Courbariaux et al., 2015), the Binary-Weight-Network (BWN) (Rastegari et al., 2016), and the Loss-Aware Binarized network (LAB) (Hou et al., 2017); (iii) weight-ternarized networks, including the Ternary Weight Network (TWN) (Li & Liu, 2016), Trained Ternary Quantization (TTQ)² (Zhu et al., 2017), and the proposed Loss-Aware Ternarized network with exact solution (LATe), with approximate solution (LATa), and with two scaling parameters (LAT2e and LAT2a); and (iv) $m$-bit-quantized networks (where $m > 2$), including DoReFa-Net$_m$ (Zhou et al., 2016) and the proposed loss-aware quantized network with linear quantization (LAQ$_m$(linear)) and logarithmic quantization (LAQ$_m$(log)). Since weight quantization can be viewed as a form of regularization (Courbariaux et al., 2015), we do not use other regularizers such as dropout and weight decay.

²For TTQ, we follow the CIFAR-10 setting in (Zhu et al., 2017), and set $\Delta_l^t = 0.005 \max(|w_l^t|)$.

4.1 FEEDFORWARD NETWORKS

In this section, we perform experiments with a multilayer perceptron (on the MNIST data set) and convolutional neural networks (on CIFAR-10, CIFAR-100 and SVHN). For MNIST, CIFAR-10 and SVHN, the setup is similar to that in (Courbariaux et al., 2015; Hou et al., 2017); details can be found in Appendix D. For CIFAR-100, we use 45,000 images for training, another 5,000 for validation, and the remaining 10,000 for testing. The testing errors are shown in Table 1.

Table 1: Testing errors (%) on the feedforward networks. The algorithm with the lowest error in each group is highlighted.

| | Algorithm | MNIST | CIFAR-10 | CIFAR-100 | SVHN |
|---|---|---|---|---|---|
| no binarization | full-precision | 1.11 | 10.38 | 39.06 | 2.28 |
| binarization | BinaryConnect | 1.28 | 9.86 | 46.42 | 2.45 |
| | BWN | 1.31 | 10.51 | 43.62 | 2.54 |
| | LAB | 1.18 | 10.50 | 43.06 | 2.35 |
| ternarization (1 scaling) | TWN | 1.23 | 10.64 | 43.49 | 2.37 |
| | LATe | 1.15 | 10.47 | 39.10 | 2.30 |
| | LATa | 1.14 | 10.38 | 39.19 | 2.30 |
| ternarization (2 scaling) | TTQ | 1.20 | 10.59 | 42.09 | 2.38 |
| | LAT2e | 1.20 | 10.45 | 39.01 | 2.34 |
| | LAT2a | 1.19 | 10.48 | 38.84 | 2.35 |
| 3-bit quantization | DoReFa-Net3 | 1.31 | 10.54 | 45.05 | 2.39 |
| | LAQ3(linear) | 1.20 | 10.67 | 38.70 | 2.34 |
| | LAQ3(log) | 1.16 | 10.52 | 38.50 | 2.29 |

Ternarization: On MNIST, CIFAR-100 and SVHN, the weight-ternarized networks perform better than the weight-binarized networks, and are comparable to the full-precision networks. Among the weight-ternarized networks, the proposed LAT and its variants have the lowest errors. On CIFAR-10, LATa has similar performance as the full-precision network, but is outperformed by BinaryConnect. Figure 1(a) shows the convergence of the training loss for LATa on CIFAR-10, and Figure 1(b) shows the scaling parameter obtained at each CNN layer. As can be seen, the scaling parameters for the first and last layers (conv1 and linear3, respectively) are larger than the others. This agrees with the finding that, to maintain the activation variance and the back-propagated gradient variance during the forward and backward propagations, the variance of the weights between the $l$th and $(l+1)$th layers should roughly follow $2/(n_l + n_{l+1})$ (Glorot & Bengio, 2010). Hence, as the input and output layers are small, larger scaling parameters are needed for their high-variance weights.

Figure 1: Convergence of the training loss (a) and the scaling parameter α (b) by LATa on CIFAR-10.

Using Two Scaling Parameters: Compared to TTQ, the proposed LAT2 always has better performance. However, the extra flexibility of using two scaling parameters does not always translate to lower testing error; it outperforms the algorithms with one scaling parameter only on CIFAR-100. We speculate that this is because the capacities of deep networks are often larger than needed, and so the limited expressiveness of quantized weights may not significantly deteriorate performance. Indeed, as pointed out in (Courbariaux et al., 2015), weight quantization is a form of regularization, and can contribute positively to the performance.

Using More Bits: Among the 3-bit quantization algorithms, the proposed scheme with logarithmic quantization has the best performance. It also outperforms the other quantization algorithms on CIFAR-100 and SVHN. However, as discussed above, more quantization flexibility is useful only when the weight-quantized network does not have enough capacity.
4.2 RECURRENT NETWORKS

In this section, we follow (Hou et al., 2017) and perform character-level language modeling experiments with the long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997). The training objective is the cross-entropy loss over all target sequences. Experiments are performed on three data sets: (i) Leo Tolstoy's War and Peace; (ii) the source code of the Linux Kernel; and (iii) the Penn Treebank Corpus (Taylor et al., 2003). For the first two, we follow the setting in (Karpathy et al., 2016; Hou et al., 2017). For Penn Treebank, we follow the setting in (Mikolov & Zweig, 2012). In the experiment, we tried different initializations for TTQ and report the best. Cross-entropy values on the test set are shown in Table 2.

Table 2: Testing cross-entropy values on the LSTM. The algorithm with the lowest cross-entropy value in each group is highlighted.

| | Algorithm | War and Peace | Linux Kernel | Penn Treebank |
|---|---|---|---|---|
| no binarization | full-precision | 1.268 | 1.326 | 1.083 |
| binarization | BinaryConnect | 2.942 | 3.532 | 1.737 |
| | BWN | 1.313 | 1.307 | 1.078 |
| | LAB | 1.291 | 1.305 | 1.081 |
| ternarization (1 scaling) | TWN | 1.290 | 1.280 | 1.045 |
| | LATe | 1.248 | 1.256 | 1.022 |
| | LATa | 1.253 | 1.264 | 1.024 |
| ternarization (2 scaling) | TTQ | 1.272 | 1.302 | 1.031 |
| | LAT2e | 1.239 | 1.258 | 1.018 |
| | LAT2a | 1.245 | 1.258 | 1.015 |
| 3-bit quantization | DoReFa-Net3 | 1.349 | 1.276 | 1.017 |
| | LAQ3(linear) | 1.282 | 1.327 | 1.017 |
| | LAQ3(log) | 1.268 | 1.273 | 1.009 |
| 4-bit quantization | DoReFa-Net4 | 1.328 | 1.320 | 1.019 |
| | LAQ4(linear) | 1.294 | 1.337 | 1.046 |
| | LAQ4(log) | 1.272 | 1.319 | 1.016 |

Ternarization: As in Section 4.1, the proposed LATe and LATa outperform the other weight-ternarization schemes, and are even better than the full-precision network on all three data sets. Figure 2 shows the convergence of the training and validation losses on War and Peace. Among the ternarization methods, LAT and its variants converge faster than both TWN and TTQ.

Figure 2: Convergence of the training loss (a) and validation loss (b) on War and Peace.

Using Two Scaling Parameters: LAT2e and LAT2a outperform TTQ on all three data sets. They also perform better than using one scaling parameter on War and Peace and Penn Treebank.

Using More Bits: The proposed LAQ always outperforms DoReFa-Net when 3 or 4 bits are used. As noted in Section 4.1, using more bits does not necessarily yield better generalization performance, and ternarization (using 2 bits) yields the lowest validation loss on War and Peace and Linux Kernel. Moreover, logarithmic quantization is better than linear quantization. Figure 3 shows the distributions of the input-to-hidden (full-precision and quantized) weights of the input gate after 20 epochs of training with LAQ3(linear) and LAQ3(log) (results on the other weights are similar). As can be seen, the distributions of the full-precision weights are bell-shaped. Hence, logarithmic quantization can give finer resolution to the many weights with small magnitudes.

Figure 3: Distributions of the full-precision and LAQ3-quantized weights on War and Peace. Left ((a) and (b)): linear quantization; right ((c) and (d)): logarithmic quantization.

Quantized vs Full-Precision Networks: The quantized networks often perform better than the full-precision networks. We speculate that this is because deep networks often have larger-than-needed capacities, and so are less affected by the limited expressiveness of quantized weights.
Moreover, low-bit quantization acts as regularization, and so contributes positively to the performance.

5 CONCLUSION

In this paper, we proposed a loss-aware weight quantization algorithm that directly considers the effect of quantization on the loss. The problem is solved using the proximal Newton algorithm. Each iteration consists of a preconditioned gradient descent step and a quantization step that projects full-precision weights onto a set of quantized values. For ternarization, an exact solution and an efficient approximate solution are provided. The procedure is also extended to the use of different scaling parameters for the positive and negative weights, and to $m$-bit (where $m > 2$) quantization. Experiments on both feedforward and recurrent networks show that the proposed quantization scheme outperforms the current state-of-the-art.

ACKNOWLEDGMENTS

This research was supported in part by the Research Grants Council of the Hong Kong Special Administrative Region (Grant 614513). We thank the developers of Theano (Theano Development Team, 2016), Pylearn2 (Goodfellow et al., 2013) and Lasagne. We also thank NVIDIA for the gift of a GPU card.

REFERENCES

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.

M. Courbariaux, Y. Bengio, and J. P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3105-3113, 2015.

Y. Dauphin, H. de Vries, and Y. Bengio. Equilibrated adaptive learning rates for non-convex optimization. In Advances in Neural Information Processing Systems, pp. 1504-1512, 2015.

X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pp. 249-256, 2010.

I. J. Goodfellow, D. Warde-Farley, P. Lamblin, V. Dumoulin, M. Mirza, R. Pascanu, J. Bergstra, F. Bastien, and Y. Bengio. Pylearn2: a machine learning research library. Preprint, 2013.

S. Han, J. Pool, J. Tran, and W. J. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135-1143, 2015.

S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations, 2016.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In International Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, pp. 1735-1780, 1997.

L. Hou, Q. Yao, and J. T. Kwok. Loss-aware binarization of deep networks. In International Conference on Learning Representations, 2017.

A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. Preprint arXiv:1704.04861, 2017.

F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. Preprint arXiv:1602.07360, 2016.

A. Karpathy, J. Johnson, and F. F. Li. Visualizing and understanding recurrent networks. In International Conference on Learning Representations, 2016.
Y. D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. In International Conference on Learning Representations, 2016.

D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. Preprint arXiv:1412.6553, 2014.

Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436-444, 2015.

J. D. Lee, Y. Sun, and M. A. Saunders. Proximal Newton-type methods for minimizing composite functions. SIAM Journal on Optimization, 24(3):1420-1443, 2014.

C. Leng, H. Li, S. Zhu, and R. Jin. Extremely low bit neural network: Squeeze the last bit out with ADMM. Preprint arXiv:1707.09870, 2017.

F. Li and B. Liu. Ternary weight networks. Preprint arXiv:1605.04711, 2016.

H. Li, S. De, Z. Xu, C. Studer, H. Samet, and T. Goldstein. Training quantized nets: A deeper understanding. In Advances in Neural Information Processing Systems, 2017a.

H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. In International Conference on Learning Representations, 2017b.

D. Lin, S. Talathi, and S. Annapureddy. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pp. 2849-2858, 2016a.

Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio. Neural networks with few multiplications. In International Conference on Learning Representations, 2016b.

N. Mellempudi, A. Kundu, D. Mudigere, D. Das, B. Kaul, and P. Dubey. Ternary neural networks with fine-grained quantization. Preprint arXiv:1705.01462, 2017.

T. Mikolov and G. Zweig. Context dependent recurrent neural network language model. IEEE Spoken Language Technology Workshop, 12:234-239, 2012.

D. Miyashita, E. H. Lee, and B. Murmann. Convolutional neural networks using logarithmic data representation. Preprint arXiv:1603.01025, 2016.

P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz. Pruning convolutional neural networks for resource efficient transfer learning. In International Conference on Learning Representations, 2017.

A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pp. 442-450, 2015.

A. Rakotomamonjy, R. Flamary, and G. Gasso. DC proximal Newton for nonconvex optimization problems. IEEE Transactions on Neural Networks and Learning Systems, 27(3):636-647, 2016.

M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, 2016.

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In International Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.

A. Taylor, M. Marcus, and B. Santorini. The Penn treebank: An overview. In Treebanks, pp. 5-22. Springer, 2003.

Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. Preprint arXiv:1605.02688, 2016.

F. P. Vasilyev, E. V. Khoroshilova, and A. S. Antipin. An extragradient method for finding the saddle point in an optimal control problem. Moscow University Computational Mathematics and Cybernetics, 34(3):113-118, 2010.
X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. Preprint arXiv:1707.01083, 2017.

S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. Preprint arXiv:1606.06160, 2016.

C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. In International Conference on Learning Representations, 2017.

A PROOFS

A.1 PROOF OF PROPOSITION 3.1

With $w_l^t$ in (7), the objective in (5) can be rewritten as

$$\begin{aligned}
&\nabla\ell(\hat{w}^{t-1})^\top (\hat{w}^t - \hat{w}^{t-1}) + \tfrac{1}{2} (\hat{w}^t - \hat{w}^{t-1})^\top D^{t-1} (\hat{w}^t - \hat{w}^{t-1}) \\
&= \sum_{l=1}^L \tfrac{1}{2} \big\| \sqrt{d_l^{t-1}} \odot \big( \hat{w}^t_l - (\hat{w}^{t-1}_l - \nabla_l\ell(\hat{w}^{t-1}) \oslash d_l^{t-1}) \big) \big\|^2 + c_1 \\
&= \sum_{l=1}^L \tfrac{1}{2} \big\| \sqrt{d_l^{t-1}} \odot (\hat{w}^t_l - w^t_l) \big\|^2 + c_1 \\
&= \sum_{l=1}^L \tfrac{1}{2} \big\| \sqrt{d_l^{t-1}} \odot (\alpha^t_l b^t_l - w^t_l) \big\|^2 + c_1 \\
&= \sum_{l=1}^L \tfrac{1}{2} \sum_{i=1}^{n_l} [d_l^{t-1}]_i \big( \alpha^t_l [b^t_l]_i - [w^t_l]_i \big)^2 + c_1,
\end{aligned}$$

where $c_1 = -\sum_{l=1}^L \tfrac{1}{2} \| \sqrt{d_l^{t-1}} \odot (\nabla_l\ell(\hat{w}^{t-1}) \oslash d_l^{t-1}) \|^2$ is independent of $\alpha^t_l$ and $b^t_l$.

A.2 PROOF OF PROPOSITION 3.2

To simplify notations, we drop the subscript and superscript. Considering one particular layer, problem (6) is of the form:

$$\min_{\alpha, b} \ \tfrac{1}{2} \sum_{i=1}^n d_i (\alpha b_i - w_i)^2 \quad \text{s.t.} \ \alpha > 0, \ b_i \in \{-1, 0, 1\}.$$

When $\alpha$ is fixed,

$$b_i = \arg\min_{b_i} \tfrac{1}{2} d_i (\alpha b_i - w_i)^2 = \arg\min_{b_i} \tfrac{1}{2} d_i \alpha^2 (b_i - w_i/\alpha)^2 = [I_{\alpha/2}(w)]_i.$$

When $b$ is fixed,

$$\alpha = \arg\min_{\alpha} \tfrac{1}{2} \sum_{i=1}^n d_i (\alpha b_i - w_i)^2 = \arg\min_{\alpha} \Big( \tfrac{1}{2} \|b \odot b \odot d\|_1 \alpha^2 - \|b \odot d \odot w\|_1 \alpha + c_2 \Big) = \frac{\|b \odot d \odot w\|_1}{\|b \odot b \odot d\|_1} = \frac{\|b \odot d \odot w\|_1}{\|b \odot d\|_1},$$

where $c_2$ is independent of $\alpha$, and the last equality holds because $b_i \in \{-1, 0, 1\}$.

A.3 PROOF OF COROLLARY 3.1

When $D_l^{t-1} = \lambda I$, i.e., the curvature is the same for all dimensions in the $l$th layer, from Proposition 3.2,

$$\alpha_l^t = \frac{\|b \odot d_l^{t-1} \odot w_l^t\|_1}{\|b \odot d_l^{t-1}\|_1} = \frac{\|I_{\alpha_l^t/2}(w_l^t) \odot w_l^t\|_1}{\|I_{\alpha_l^t/2}(w_l^t)\|_1} = \frac{1}{\|I_{\Delta_l^t}(w_l^t)\|_1} \sum_{i: |[w_l^t]_i| > \Delta_l^t} |[w_l^t]_i|$$

with $\Delta_l^t = \alpha_l^t/2$, and this $\Delta_l^t$ attains

$$\arg\max_{\Delta > 0} \frac{1}{\|I_\Delta(w_l^t)\|_1} \Big( \sum_{i: |[w_l^t]_i| > \Delta} |[w_l^t]_i| \Big)^2.$$

This is the same as the TWN solution in (1).

A.4 PROOF OF PROPOSITION 3.3

For simplicity of notations, we drop the subscript and superscript. For each layer, we need to pick, among the candidates $\alpha = 2c_j$ ($j \in S$), the one minimizing

$$\min_{\alpha} \tfrac{1}{2} \|\sqrt{d} \odot (\alpha b - w)\|^2 = \min_{\alpha} \Big( \tfrac{1}{2} \|b \odot b \odot d\|_1 \alpha^2 - \|b \odot d \odot w\|_1 \alpha \Big) + \mathrm{const} = -\frac{\|I_{\alpha/2}(w) \odot d \odot w\|_1^2}{2\,\|I_{\alpha/2}(w) \odot d\|_1} + \mathrm{const},$$

where the last equality uses $b = I_{\alpha/2}(w)$ (so that $\|b \odot b \odot d\|_1 = \|I_{\alpha/2}(w) \odot d\|_1$) together with the optimal $\alpha$ in (8). Minimizing this value is thus equivalent to maximizing $\frac{\|I_{\alpha/2}(w) \odot d \odot w\|_1^2}{2\,\|I_{\alpha/2}(w) \odot d\|_1}$. From (9),

$$\frac{\|I_{\alpha/2}(w) \odot d \odot w\|_1^2}{2\,\|I_{\alpha/2}(w) \odot d\|_1} = \frac{1}{2} \cdot \frac{\|I_{\alpha/2}(w) \odot d \odot w\|_1}{\|I_{\alpha/2}(w) \odot d\|_1} \cdot \frac{\|I_{\alpha/2}(w) \odot d \odot w\|_1}{\|I_{\alpha/2}(w) \odot d\|_1} \cdot \|I_{\alpha/2}(w) \odot d\|_1 = \tfrac{1}{2} \cdot 2c_j \cdot 2c_j \cdot [\mathrm{cum}(\mathrm{perm}_s(d))]_j = 2 c_j^2\, [\mathrm{cum}(\mathrm{perm}_s(d))]_j.$$

Thus, the $\alpha$ that minimizes $\tfrac{1}{2}\|\sqrt{d} \odot (\alpha b - w)\|^2$ is $\alpha = 2 \arg\max_{c_j, j \in S} c_j^2\, [\mathrm{cum}(\mathrm{perm}_s(d))]_j$.

A.5 PROOF OF PROPOSITION 3.4

For simplicity of notations, we drop the subscript and superscript, and consider the optimization problem:

$$\min_{\hat{w}} \ \tfrac{1}{2} \sum_{i=1}^n d_i (\hat{w}_i - w_i)^2 \quad \text{s.t.} \ \hat{w}_i \in \{-\beta, 0, +\alpha\}.$$

Let $f(\hat{w}_i) = (\hat{w}_i - w_i)^2$. Then, $f(\alpha) = (\alpha - w_i)^2$, $f(0) = w_i^2$, and $f(-\beta) = (\beta + w_i)^2$. It is easy to see that (i) if $w_i > \alpha/2$, $f(\alpha)$ is the smallest; (ii) if $w_i < -\beta/2$, $f(-\beta)$ is the smallest; and (iii) if $-\beta/2 \leq w_i \leq \alpha/2$, $f(0)$ is the smallest. In other words, the optimal $\hat{w}_i$ satisfies $\hat{w}_i = \alpha [I^+_{\alpha/2}(w)]_i + \beta [I^-_{\beta/2}(w)]_i$, or equivalently, $\hat{w} = \alpha p + \beta q$, where $p = I^+_{\alpha/2}(w)$ and $q = I^-_{\beta/2}(w)$.

Define $w^+$ and $w^-$ such that $[w^+]_i = w_i$ if $w_i > 0$ and $0$ otherwise, and $[w^-]_i = w_i$ if $w_i < 0$ and $0$ otherwise. Then,

$$\tfrac{1}{2} \sum_{i=1}^n d_i (\hat{w}_i - w_i)^2 = \tfrac{1}{2} \sum_{i=1}^n d_i (\alpha p_i - w^+_i)^2 + \tfrac{1}{2} \sum_{i=1}^n d_i (\beta q_i - w^-_i)^2. \qquad (12)$$

The objective in (12) has two parts, and each part can be viewed as a special case of the ternarization step in Proposition 3.1 (considering only the positive or negative weights). Similar to the proof of Proposition 3.2, we obtain that the optimal $\hat{w} = \alpha p + \beta q$ satisfies

$$\alpha = \frac{\|p \odot d \odot w\|_1}{\|p \odot d\|_1}, \ p = I^+_{\alpha/2}(w), \qquad \beta = \frac{\|q \odot d \odot w\|_1}{\|q \odot d\|_1}, \ q = I^-_{\beta/2}(w).$$

A.6 PROOF OF PROPOSITION 3.5

For simplicity of notations, we drop the subscript and superscript.
Since $\tfrac{1}{2}\|\sqrt{d} \odot (\alpha b - w)\|^2 = \tfrac{1}{2} \sum_{i=1}^n d_i (\alpha b_i - w_i)^2$ for each layer, we simply consider the optimization problem:

$$\min_{\alpha, b} \ \tfrac{1}{2} \sum_{i=1}^n d_i (\alpha b_i - w_i)^2 \quad \text{s.t.} \ \alpha > 0, \ b_i \in \mathcal{Q}.$$

When $\alpha$ is fixed,

$$b_i = \arg\min_{b_i \in \mathcal{Q}} \tfrac{1}{2} d_i (\alpha b_i - w_i)^2 = \arg\min_{b_i \in \mathcal{Q}} \tfrac{1}{2} d_i \alpha^2 (b_i - w_i/\alpha)^2 = \Pi_\mathcal{Q}(w_i/\alpha).$$

When $b$ is fixed,

$$\alpha = \arg\min_{\alpha} \tfrac{1}{2} \sum_{i=1}^n d_i (\alpha b_i - w_i)^2 = \arg\min_{\alpha} \Big( \tfrac{1}{2} \|b \odot b \odot d\|_1 \alpha^2 - \|b \odot d \odot w\|_1 \alpha + c_2 \Big) = \frac{\|b \odot d \odot w\|_1}{\|b \odot b \odot d\|_1},$$

where $c_2$ is independent of $\alpha$.

B LOSS-AWARE TERNARIZATION ALGORITHM (LAT)

The whole procedure of LAT is shown in Algorithm 3.

Algorithm 3 Loss-Aware Ternarization (LAT) for training a feedforward neural network.
Input: minibatch $\{(x_0^t, y^t)\}$, current full-precision weights $\{w_l^t\}$, first moment $\{m_l^{t-1}\}$, second moment $\{v_l^{t-1}\}$, and learning rate $\eta^t$.
1: {Forward propagation}
2: for $l = 1$ to $L$ do
3: $\quad$ compute $\alpha_l^t$ and $b_l^t$ using Algorithm 1 or 2;
4: $\quad$ rescale the layer-$l$ input: $x_{l-1}^t = \alpha_l^t x_{l-1}^t$;
5: $\quad$ compute $z_l^t$ with input $x_{l-1}^t$ and ternary weight $b_l^t$;
6: $\quad$ apply batch normalization and the nonlinear activation to $z_l^t$ to obtain $x_l^t$;
7: end for
8: compute the loss $\ell$ using $x_L^t$ and $y^t$;
9: {Backward propagation}
10: initialize the output layer's activation gradient $\partial\ell / \partial x_L^t$;
11: for $l = L$ to $2$ do
12: $\quad$ compute $\partial\ell / \partial x_{l-1}^t$ using $\partial\ell / \partial x_l^t$, $\alpha_l^t$ and $b_l^t$;
13: end for
14: {Update parameters using Adam}
15: for $l = 1$ to $L$ do
16: $\quad$ compute gradients $\nabla_l\ell(\hat{w}^t)$ using $\partial\ell / \partial x_l^t$ and $x_{l-1}^t$;
17: $\quad$ update first moment $m_l^t = \beta_1 m_l^{t-1} + (1 - \beta_1) \nabla_l\ell(\hat{w}^t)$;
18: $\quad$ update second moment $v_l^t = \beta_2 v_l^{t-1} + (1 - \beta_2) (\nabla_l\ell(\hat{w}^t) \odot \nabla_l\ell(\hat{w}^t))$;
19: $\quad$ compute unbiased first moment $\hat{m}_l^t = m_l^t / (1 - \beta_1^t)$;
20: $\quad$ compute unbiased second moment $\hat{v}_l^t = v_l^t / (1 - \beta_2^t)$;
21: $\quad$ compute the current curvature estimate $d_l^t = \frac{1}{\eta^t} (\epsilon \mathbf{1} + \sqrt{\hat{v}_l^t})$;
22: $\quad$ update full-precision weights $w_l^{t+1} = w_l^t - \hat{m}_l^t \oslash d_l^t$;
23: $\quad$ update learning rate $\eta^{t+1} = \mathrm{UpdateLearningrate}(\eta^t, t+1)$;
24: end for

C EXACT AND APPROXIMATE SOLUTIONS FOR TERNARIZATION WITH TWO SCALING PARAMETERS

Let there be $n_1$ positive elements and $n_2$ negative elements in $w_l$. For an $n$-dimensional vector $x = [x_1, x_2, \dots, x_n]$, define $\mathrm{inverse}(x) = [x_n, x_{n-1}, \dots, x_1]$. As shown in (12), the objective can be separated into two parts, and each part can be viewed as a special case of the ternarization step in Proposition 3.1, dealing only with the positive or negative weights. Thus, the exact and approximate solutions for $\alpha_l^t$ and $\beta_l^t$ can be derived separately, in a similar way as with one scaling parameter. The exact and approximate solutions for $\alpha_l^t$ and $\beta_l^t$ for layer $l$ at the $t$th iteration are shown in Algorithms 4 and 5.

Algorithm 4 Exact solver for $\hat{w}_l^t$ with two scaling parameters.
1: Input: full-precision weight $w_l^t$, diagonal entries $d_l^{t-1}$ of the approximate Hessian.
2: $s_1 = \arg\mathrm{sort}(w_l^t)$ (in descending order);
3: $c_1 = \mathrm{cum}(\mathrm{perm}_{s_1}(|d_l^{t-1} \odot w_l^t|)) \oslash (2\, \mathrm{cum}(\mathrm{perm}_{s_1}(|d_l^{t-1}|)))$;
4: $S_1 = \mathrm{find}\big( ([\mathrm{perm}_{s_1}(w_l^t)]_{[1:(n_1-1)]} - [c_1]_{[1:(n_1-1)]}) \odot ([\mathrm{perm}_{s_1}(w_l^t)]_{[2:n_1]} - [c_1]_{[1:(n_1-1)]}) < 0 \big)$;
5: $\alpha_l^t = 2 \arg\max_{[c_1]_i : i \in S_1} [c_1]_i^2\, [\mathrm{cum}(\mathrm{perm}_{s_1}(|d_l^{t-1}|))]_i$;
6: $p_l^t = I^+_{\alpha_l^t/2}(w_l^t)$;
7: $s_2 = \mathrm{inverse}(s_1)$;
8: $c_2 = \mathrm{cum}(\mathrm{perm}_{s_2}(|d_l^{t-1} \odot w_l^t|)) \oslash (2\, \mathrm{cum}(\mathrm{perm}_{s_2}(|d_l^{t-1}|)))$;
9: $S_2 = \mathrm{find}\big( ([-\mathrm{perm}_{s_2}(w_l^t)]_{[1:(n_2-1)]} - [c_2]_{[1:(n_2-1)]}) \odot ([-\mathrm{perm}_{s_2}(w_l^t)]_{[2:n_2]} - [c_2]_{[1:(n_2-1)]}) < 0 \big)$;
10: $\beta_l^t = 2 \arg\max_{[c_2]_i : i \in S_2} [c_2]_i^2\, [\mathrm{cum}(\mathrm{perm}_{s_2}(|d_l^{t-1}|))]_i$;
11: $q_l^t = I^-_{\beta_l^t/2}(w_l^t)$;
12: Output: $\hat{w}_l^t = \alpha_l^t p_l^t + \beta_l^t q_l^t$.

Algorithm 5 Approximate solver for $\hat{w}_l^t$ with two scaling parameters.
1: Input: $b_l^{t-1}$, full-precision weight $w_l^t$, and diagonal entries $d_l^{t-1}$ of the approximate Hessian.
2: Initialize: $\alpha = 1.0$, $\alpha_{\mathrm{old}} = 0.0$, $\beta = 1.0$, $\beta_{\mathrm{old}} = 0.0$, $b = b_l^{t-1}$, $p = I^+_0(b)$, $q = I^-_0(b)$, $\epsilon = 10^{-6}$;
3: while $|\alpha - \alpha_{\mathrm{old}}| > \epsilon$ and $|\beta - \beta_{\mathrm{old}}| > \epsilon$ do
4: $\quad \alpha_{\mathrm{old}} = \alpha$, $\beta_{\mathrm{old}} = \beta$;
5: $\quad \alpha = \|p \odot d_l^{t-1} \odot w_l^t\|_1 / \|p \odot d_l^{t-1}\|_1$;
6: $\quad p = I^+_{\alpha/2}(w_l^t)$;
7: $\quad \beta = \|q \odot d_l^{t-1} \odot w_l^t\|_1 / \|q \odot d_l^{t-1}\|_1$;
8: $\quad q = I^-_{\beta/2}(w_l^t)$;
9: end while
10: Output: $\hat{w}_l^t = \alpha p + \beta q$.
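As an illustration of Algorithm 5, the alternating updates of Proposition 3.4 can be written compactly in NumPy. In this sketch (our own code, not the authors'), `p` and `q` are kept as 0/1 indicators of the positive and negative supports and are initialized from the signs of $w_l^t$ rather than from $b_l^{t-1}$, so the ternarized weight is returned as `alpha * p - beta * q`, i.e., with values in $\{-\beta, 0, \alpha\}$:

```python
import numpy as np

def ternarize_two_scales(w, d, n_iter=10):
    """Approximate solver with two scaling parameters: alternate the closed-form
    updates of Proposition 3.4 for (alpha, p) and (beta, q)."""
    alpha, beta = 1.0, 1.0
    p = (w > 0).astype(float)                 # 0/1 indicator of the +alpha support
    q = (w < 0).astype(float)                 # 0/1 indicator of the -beta support
    for _ in range(n_iter):
        if p.any():
            alpha = np.sum(d[p > 0] * np.abs(w[p > 0])) / np.sum(d[p > 0])
        p = (w > alpha / 2).astype(float)
        if q.any():
            beta = np.sum(d[q > 0] * np.abs(w[q > 0])) / np.sum(d[q > 0])
        q = (w < -beta / 2).astype(float)
    return alpha * p - beta * q               # values in {-beta, 0, +alpha}
```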
D EXPERIMENTAL DETAILS

D.1 SETUP FOR FEEDFORWARD NETWORKS

The setups for the four data sets are as follows:

1. MNIST: This contains 28×28 gray images from 10 digit classes. We use 50,000 images for training, another 10,000 for validation, and the remaining 10,000 for testing. We use the 4-layer model: 784FC-2048FC-2048FC-2048FC-10SVM, where FC is a fully-connected layer and SVM is an ℓ2-SVM output layer using the square hinge loss. Batch normalization, with a minibatch size of 100, is used to accelerate learning. The maximum number of epochs is 50. The learning rate starts at 0.01, and decays by a factor of 0.1 at epochs 15 and 25.

2. CIFAR-10: This contains 32×32 color images from 10 object classes. We use 45,000 images for training, another 5,000 for validation, and the remaining 10,000 for testing. The images are preprocessed with global contrast normalization and ZCA whitening. We use the VGG-like architecture: (2×128C3)-MP2-(2×256C3)-MP2-(2×512C3)-MP2-(2×1024FC)-10SVM, where C3 is a 3×3 ReLU convolution layer and MP2 is a 2×2 max-pooling layer. Batch normalization, with a minibatch size of 50, is used. The maximum number of epochs is 200. The learning rate starts at 0.03 for the weight-binarized network and at 0.002 for all the other networks, and decays by a factor of 0.5 after every 15 epochs.

3. CIFAR-100: This contains 32×32 color images from 100 object classes. We use 45,000 images for training, another 5,000 for validation, and the remaining 10,000 for testing. The images are preprocessed with global contrast normalization and ZCA whitening. We use the VGG-like architecture: (2×128C3)-MP2-(2×256C3)-MP2-(2×512C3)-MP2-(2×1024FC)-100SVM. Batch normalization, with a minibatch size of 100, is used. The maximum number of epochs is 200. The learning rate starts at 0.0005, and decays by a factor of 0.5 after every 15 epochs.

4. SVHN: This contains 32×32 color images from 10 digit classes. We use 598,388 images for training, another 6,000 for validation, and the remaining 26,032 for testing. The images are preprocessed with global and local contrast normalization. The model used is: (2×64C3)-MP2-(2×128C3)-MP2-(2×256C3)-MP2-(2×1024FC)-10SVM. Batch normalization, with a minibatch size of 50, is used. The maximum number of epochs is 50. The learning rate starts at 0.001 for the weight-binarized network and at 0.0005 for the other networks. It then decays by a factor of 0.1 at epochs 15 and 25.

D.2 SETUP FOR RECURRENT NETWORKS

The setups for the three data sets are as follows:

1. Leo Tolstoy's War and Peace: This consists of 3258K characters of almost entirely English text with minimal markup, and has a vocabulary size of 87. We use the same training/validation/test set split as in (Karpathy et al., 2016; Hou et al., 2017).

2. The source code of the Linux Kernel: This consists of 621K characters and has a vocabulary size of 101. We use the same training/validation/test set split as in (Karpathy et al., 2016; Hou et al., 2017).
3. The Penn Treebank data set (Taylor et al., 2003): This has been frequently used for language modeling. It contains 50 different characters, including English characters, numbers, and punctuation. We follow the setting in (Mikolov & Zweig, 2012), with 5,017K characters for training, 393K for validation, and 442K for testing.

We use a one-layer LSTM with 512 cells. The maximum number of epochs is 200, and the number of time steps is 100. The initial learning rate is 0.002. After 10 epochs, it is decayed by a factor of 0.98 after each epoch. The weights are initialized uniformly in [-0.08, 0.08]. After each iteration, the gradients are clipped to the range [-5, 5]. All the updated weights are clipped to [-1, 1] for the binarization and ternarization methods, but not for the m-bit (where m > 2) quantization methods.
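As a minimal sketch of the clipping protocol just described (our own illustrative code, with a plain gradient step standing in for the Adam update of Algorithm 3):

```python
import numpy as np

def clipped_update(w_full, grad, lr, m_bits=2):
    """Per-iteration clipping in the recurrent experiments: gradients are clipped
    to [-5, 5]; the full-precision weights are clipped to [-1, 1] for binarization
    and ternarization (m <= 2), but left unclipped for m > 2."""
    grad = np.clip(grad, -5.0, 5.0)
    w_full = w_full - lr * grad               # stand-in for the Adam update
    if m_bits <= 2:
        w_full = np.clip(w_full, -1.0, 1.0)
    return w_full
```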