# LOSS-AWARE BINARIZATION OF DEEP NETWORKS

Published as a conference paper at ICLR 2017

Lu Hou, Quanming Yao, James T. Kwok
Department of Computer Science and Engineering
Hong Kong University of Science and Technology
Clear Water Bay, Hong Kong
{lhouab,qyaoaa,jamesk}@cse.ust.hk

## ABSTRACT

Deep neural network models, though very powerful and highly successful, are computationally expensive in terms of space and time. Recently, there have been a number of attempts on binarizing the network weights and activations. This greatly reduces the network size, and replaces the underlying multiplications with additions or even XNOR bit operations. However, existing binarization schemes are based on simple matrix approximations and ignore the effect of binarization on the loss. In this paper, we propose a proximal Newton algorithm with diagonal Hessian approximation that directly minimizes the loss w.r.t. the binarized weights. The underlying proximal step has an efficient closed-form solution, and the second-order information can be efficiently obtained from the second moments already computed by the Adam optimizer. Experiments on both feedforward and recurrent networks show that the proposed loss-aware binarization algorithm outperforms existing binarization schemes, and is also more robust for wide and deep networks.

## 1 INTRODUCTION

Recently, deep neural networks have achieved state-of-the-art performance in various tasks such as speech recognition, visual object recognition, and image classification (LeCun et al., 2015). Though powerful, the large number of network weights leads to space and time inefficiencies in both training and storage. For instance, the popular AlexNet, VGG-16 and ResNet-18 all require hundreds of megabytes of storage, and billions of high-precision operations for classification. This limits their use in embedded systems, smart phones and other portable devices that are now everywhere.

To alleviate this problem, a number of approaches have been recently proposed. One attempt first trains a neural network and then compresses it (Han et al., 2016; Kim et al., 2016). Instead of this two-step approach, it is more desirable to train and compress the network simultaneously. Example approaches include tensorizing (Novikov et al., 2015), parameter quantization (Gong et al., 2014), and binarization (Courbariaux et al., 2015; Hubara et al., 2016; Rastegari et al., 2016). In particular, binarization requires only one bit for each weight value. This can significantly reduce storage, and also eliminates most multiplications during the forward pass.

Courbariaux et al. (2015) pioneered neural network binarization with the BinaryConnect algorithm, which achieves state-of-the-art results on many classification tasks. Besides binarizing the weights, Hubara et al. (2016) further binarized the activations. Rastegari et al. (2016) also learned to scale the binarized weights, and obtained better results. Besides, they proposed the XNOR-network with both weights and activations binarized as in (Hubara et al., 2016). Instead of binarization, ternary-connect quantizes each weight to $\{-1, 0, 1\}$ (Lin et al., 2016). Similarly, the ternary weight network (Li & Liu, 2016) and DoReFa-net (Zhou et al., 2016) quantize weights to three or more levels. However, though using more bits allows more accurate weight approximations, specialized hardware is needed for the underlying non-binary operations.
Besides the huge amount of computation and storage involved, deep networks are difficult to train because of the highly nonconvex objective and inhomogeneous curvature. To alleviate this problem, Hessian-free methods (Martens & Sutskever, 2012) use second-order information via conjugate gradient. A related method is natural gradient descent (Pascanu & Bengio, 2014), which utilizes the geometry of the underlying parameter manifold. Another approach uses an element-wise adaptive learning rate, as in Adagrad (Duchi et al., 2011), Adadelta (Zeiler, 2012), RMSprop (Tieleman & Hinton, 2012), and Adam (Kingma & Ba, 2015). This can also be considered as preconditioning that rescales the gradient so that all dimensions have similar curvatures.

In this paper, instead of directly approximating the weights, we propose to consider the effect of binarization on the loss during binarization. We formulate this as an optimization problem using the proximal Newton algorithm (Lee et al., 2014) with a diagonal Hessian. The crux of proximal algorithms is the proximal step. We show that this step has a closed-form solution, whose form is similar to the use of an element-wise adaptive learning rate. The proposed method also reduces to BinaryConnect (Courbariaux et al., 2015) and the Binary-Weight-Network (Rastegari et al., 2016) when curvature information is dropped. Experiments on both feedforward and recurrent neural network models show that it outperforms existing binarization algorithms. In particular, BinaryConnect fails on deep recurrent networks because of the exploding gradient problem, while the proposed method still demonstrates robust performance.

**Notations**: For a vector $\mathbf{x}$, $\sqrt{\mathbf{x}}$ denotes the element-wise square root, $|\mathbf{x}|$ denotes the element-wise absolute value, $\|\mathbf{x}\|_p = (\sum_i |x_i|^p)^{1/p}$ is the $p$-norm of $\mathbf{x}$, $\mathbf{x} \succ 0$ denotes that all entries of $\mathbf{x}$ are positive, $\mathrm{sign}(\mathbf{x})$ is the vector with $[\mathrm{sign}(\mathbf{x})]_i = 1$ if $x_i \ge 0$ and $-1$ otherwise, and $\mathrm{Diag}(\mathbf{x})$ returns a diagonal matrix with $\mathbf{x}$ on the diagonal. For two vectors $\mathbf{x}$ and $\mathbf{y}$, $\mathbf{x} \odot \mathbf{y}$ denotes the element-wise multiplication and $\mathbf{x} \oslash \mathbf{y}$ denotes the element-wise division. For a matrix $X$, $\mathrm{vec}(X)$ returns the vector obtained by stacking the columns of $X$, and $\mathrm{diag}(X)$ returns the vector extracted from the diagonal of $X$.

## 2 RELATED WORK

### 2.1 WEIGHT BINARIZATION IN DEEP NETWORKS

In a feedforward neural network with $L$ layers, let the weight matrix (or tensor in the case of a convolutional layer) at layer $l$ be $W_l$. We combine the (full-precision) weights from all layers as $\mathbf{w} = [\mathbf{w}_1^\top, \mathbf{w}_2^\top, \ldots, \mathbf{w}_L^\top]^\top$, where $\mathbf{w}_l = \mathrm{vec}(W_l)$. Analogously, the binarized weights are denoted as $\hat{\mathbf{w}} = [\hat{\mathbf{w}}_1^\top, \hat{\mathbf{w}}_2^\top, \ldots, \hat{\mathbf{w}}_L^\top]^\top$. As it is essential to use full-precision weights during updates (Courbariaux et al., 2015), binarized weights are typically used only during the forward and backward propagations, but not in the parameter update. At the $t$th iteration, the (full-precision) weight $\mathbf{w}_l^t$ is updated using the backpropagated gradient $\nabla_l \ell(\hat{\mathbf{w}}^{t-1})$ (where $\ell$ is the loss and $\nabla_l \ell(\hat{\mathbf{w}}^{t-1})$ is the partial derivative of $\ell$ w.r.t. the weights of the $l$th layer). In the next forward propagation, it is then binarized as $\hat{\mathbf{w}}_l^t = \mathrm{Binarize}(\mathbf{w}_l^t)$, where $\mathrm{Binarize}(\cdot)$ is some binarization scheme. The two most popular binarization schemes are BinaryConnect (Courbariaux et al., 2015) and the Binary-Weight-Network (BWN) (Rastegari et al., 2016).
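Before turning to these two schemes, the following is a minimal sketch of the generic training scheme just described: binarize, propagate with the binarized weights, then update the full-precision weights. The helper names (`binarize`, `loss_and_grad`), the plain SGD update and the learning rate are illustrative assumptions, not the paper's Algorithm 1.

```python
import numpy as np

def train_step(weights, binarize, loss_and_grad, lr=0.01):
    """One binarized-training iteration (a sketch under the assumptions above)."""
    # binarize each layer's full-precision weights for this iteration
    binarized = [binarize(w_l) for w_l in weights]
    # forward and backward propagation use the *binarized* weights
    loss, grads = loss_and_grad(binarized)   # grads are w.r.t. the binarized weights
    # the parameter update is applied to the *full-precision* weights
    for w_l, g_l in zip(weights, grads):
        w_l -= lr * g_l                      # in-place SGD step on the full-precision weights
    return loss
```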
In BinaryConnect, binarization is performed by transforming each element of $\mathbf{w}_l^t$ to $-1$ or $+1$ using the sign function:¹

$$\mathrm{Binarize}(\mathbf{w}_l^t) = \mathrm{sign}(\mathbf{w}_l^t). \quad (1)$$

Besides the binarized weight matrix, a scaling parameter is also learned in BWN. In other words, $\mathrm{Binarize}(\mathbf{w}_l^t) = \alpha_l^t \mathbf{b}_l^t$, where $\alpha_l^t > 0$ and $\mathbf{b}_l^t$ is binary. They are obtained by minimizing the difference between $\mathbf{w}_l^t$ and $\alpha_l^t \mathbf{b}_l^t$, and have a simple closed-form solution:

$$\alpha_l^t = \frac{\|\mathbf{w}_l^t\|_1}{n_l}, \quad \mathbf{b}_l^t = \mathrm{sign}(\mathbf{w}_l^t), \quad (2)$$

where $n_l$ is the number of weights in layer $l$. Hubara et al. (2016) further binarized the activations as $\hat{\mathbf{x}}_l^t = \mathrm{sign}(\mathbf{x}_l^t)$, where $\mathbf{x}_l^t$ is the activation of the $l$th layer at iteration $t$.

¹ A stochastic binarization scheme is also proposed in (Courbariaux et al., 2015). However, it is much more computationally expensive than (1) and so will not be considered here.

### 2.2 PROXIMAL NEWTON ALGORITHM

The proximal Newton algorithm (Lee et al., 2014) has been popularly used for solving composite optimization problems of the form

$$\min_{\mathbf{x}} f(\mathbf{x}) + g(\mathbf{x}),$$

where $f$ is convex and smooth, and $g$ is convex but possibly nonsmooth. At iteration $t$, it generates the next iterate as

$$\mathbf{x}_{t+1} = \arg\min_{\mathbf{x}} \nabla f(\mathbf{x}_t)^\top (\mathbf{x} - \mathbf{x}_t) + \frac{1}{2} (\mathbf{x} - \mathbf{x}_t)^\top H (\mathbf{x} - \mathbf{x}_t) + g(\mathbf{x}),$$

where $H$ is an approximate Hessian matrix of $f$ at $\mathbf{x}_t$. With the use of second-order information, the proximal Newton algorithm converges faster than the proximal gradient algorithm (Lee et al., 2014). Recently, by assuming that $f$ and $g$ have difference-of-convex decompositions (Yuille & Rangarajan, 2002), the proximal Newton algorithm has also been extended to the case where $g$ is nonconvex (Rakotomamonjy et al., 2016).

## 3 LOSS-AWARE BINARIZATION

As can be seen, existing weight binarization methods (Courbariaux et al., 2015; Rastegari et al., 2016) simply find the closest binary approximation of $\mathbf{w}$, and ignore its effect on the loss. In this paper, we consider the loss directly during binarization. As in (Rastegari et al., 2016), we also binarize the weight $\mathbf{w}_l$ in each layer as $\hat{\mathbf{w}}_l = \alpha_l \mathbf{b}_l$, where $\alpha_l > 0$ and $\mathbf{b}_l$ is binary. In the following, we make these assumptions on $\ell$:
(A1) $\ell$ is continuously differentiable with Lipschitz-continuous gradient, i.e., there exists $\beta > 0$ such that $\|\nabla \ell(\mathbf{u}) - \nabla \ell(\mathbf{v})\|_2 \le \beta \|\mathbf{u} - \mathbf{v}\|_2$ for any $\mathbf{u}, \mathbf{v}$;
(A2) $\ell$ is bounded from below.

### 3.1 BINARIZATION USING PROXIMAL NEWTON ALGORITHM

We formulate weight binarization as the following optimization problem:

$$\min_{\hat{\mathbf{w}}} \ell(\hat{\mathbf{w}}) \quad (3)$$
$$\text{s.t. } \hat{\mathbf{w}}_l = \alpha_l \mathbf{b}_l, \; \alpha_l > 0, \; \mathbf{b}_l \in \{\pm 1\}^{n_l}, \; l = 1, \ldots, L, \quad (4)$$

where $\ell$ is the loss. Let $C$ be the feasible region in (4), and define its indicator function: $I_C(\hat{\mathbf{w}}) = 0$ if $\hat{\mathbf{w}} \in C$, and $\infty$ otherwise. Problem (3) can then be rewritten as

$$\min_{\hat{\mathbf{w}}} \ell(\hat{\mathbf{w}}) + I_C(\hat{\mathbf{w}}). \quad (5)$$

We solve (5) using the proximal Newton method (Section 2.2). At iteration $t$, the smooth term $\ell(\hat{\mathbf{w}}^t)$ is replaced by the second-order expansion

$$\ell(\hat{\mathbf{w}}^{t-1}) + \nabla \ell(\hat{\mathbf{w}}^{t-1})^\top (\hat{\mathbf{w}}^t - \hat{\mathbf{w}}^{t-1}) + \frac{1}{2} (\hat{\mathbf{w}}^t - \hat{\mathbf{w}}^{t-1})^\top H^{t-1} (\hat{\mathbf{w}}^t - \hat{\mathbf{w}}^{t-1}),$$

where $H^{t-1}$ is an estimate of the Hessian of $\ell$ at $\hat{\mathbf{w}}^{t-1}$. Note that using the Hessian to capture second-order information is essential for efficient neural network training, as $\ell$ is often flat in some directions but highly curved in others. By rescaling the gradient, the loss has similar curvatures along all directions. This is also called preconditioning in the literature (Dauphin et al., 2015a). For neural networks, the exact Hessian is rarely positive semi-definite. This can be problematic as the nonconvex objective leads to indefinite quadratic optimization. Moreover, computing the exact Hessian is both time- and space-inefficient on large networks.
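Before describing how the proposed method handles the Hessian, it may help to have the two closed-form baselines of Section 2.1, eqs. (1) and (2), written out explicitly. The following NumPy sketch is illustrative only (function names are not from the paper's code); these are the schemes that the curvature-aware proximal step below generalizes.

```python
import numpy as np

def binary_connect(w_l):
    # BinaryConnect, eq. (1): element-wise sign, with >= 0 mapped to +1
    return np.where(w_l >= 0, 1.0, -1.0)

def bwn(w_l):
    # BWN, eq. (2): alpha_l = ||w_l||_1 / n_l and b_l = sign(w_l)
    n_l = w_l.size
    alpha_l = np.abs(w_l).sum() / n_l
    b_l = np.where(w_l >= 0, 1.0, -1.0)
    return alpha_l * b_l          # the binarized weight alpha_l * b_l
```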
To alleviate these problems with the exact Hessian, a popular approach is to approximate it by a diagonal positive definite matrix $D$. One popular choice is the efficient Jacobi preconditioner. Though an efficient approximation of the Hessian under certain conditions, it is not competitive for indefinite matrices (Dauphin et al., 2015a). More recently, it has been shown that equilibration provides a more robust preconditioner in the presence of saddle points (Dauphin et al., 2015a). This is also adopted by popular stochastic optimization algorithms such as RMSprop (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015). Specifically, the second moment $\mathbf{v}$ in these algorithms is an estimator of $\mathrm{diag}(H^2)$ (Dauphin et al., 2015b). Here, we use the square root of this $\mathbf{v}$, which is readily available in Adam, to construct $D = \mathrm{Diag}([\mathrm{diag}(D_1)^\top, \ldots, \mathrm{diag}(D_L)^\top]^\top)$, where $D_l$ is the approximate diagonal Hessian at layer $l$. In general, other estimators of $\mathrm{diag}(H)$ can also be used.

At the $t$th iteration of the proximal Newton algorithm, the following subproblem is solved:

$$\min_{\hat{\mathbf{w}}^t} \nabla \ell(\hat{\mathbf{w}}^{t-1})^\top (\hat{\mathbf{w}}^t - \hat{\mathbf{w}}^{t-1}) + \frac{1}{2} (\hat{\mathbf{w}}^t - \hat{\mathbf{w}}^{t-1})^\top D^{t-1} (\hat{\mathbf{w}}^t - \hat{\mathbf{w}}^{t-1}) \quad (6)$$
$$\text{s.t. } \hat{\mathbf{w}}_l^t = \alpha_l^t \mathbf{b}_l^t, \; \alpha_l^t > 0, \; \mathbf{b}_l^t \in \{\pm 1\}^{n_l}, \; l = 1, \ldots, L.$$

**Proposition 3.1** Let $\mathbf{d}_l^{t-1} \equiv \mathrm{diag}(D_l^{t-1})$, and

$$\mathbf{w}_l^t \equiv \hat{\mathbf{w}}_l^{t-1} - \nabla_l \ell(\hat{\mathbf{w}}^{t-1}) \oslash \mathbf{d}_l^{t-1}. \quad (7)$$

The optimal solution of (6) can be obtained in closed form as

$$\alpha_l^t = \frac{\|\mathbf{d}_l^{t-1} \odot \mathbf{w}_l^t\|_1}{\|\mathbf{d}_l^{t-1}\|_1}, \quad \mathbf{b}_l^t = \mathrm{sign}(\mathbf{w}_l^t). \quad (8)$$

**Theorem 3.1** Assume that $[\mathbf{d}_l^t]_k > \beta$ for all $l, k, t$. Then the objective of (5) produced by the proximal Newton algorithm (with the closed-form update of $\hat{\mathbf{w}}^t$ in Proposition 3.1) converges.

Note that both the loss $\ell$ and the indicator function $I_C(\cdot)$ in (5) are not convex. Hence, the convergence analysis of the proximal Newton algorithm in (Lee et al., 2014), which is only for convex problems, cannot be applied. Recently, Rakotomamonjy et al. (2016) proposed a nonconvex proximal Newton extension. However, it assumes a difference-of-convex decomposition, which does not hold here.

**Remark 3.1** When $D_l^{t-1} = \lambda I$, i.e., the curvature is the same for all dimensions in the $l$th layer, (8) reduces to the BWN solution in (2). In other words, BWN corresponds to using the proximal gradient algorithm, while the proposed method corresponds to the proximal Newton algorithm with diagonal Hessian. In composite optimization, it is known that the proximal Newton method is more efficient than the proximal gradient algorithm (Lee et al., 2014; Rakotomamonjy et al., 2016).

**Remark 3.2** When $\alpha_l^t = 1$, (8) reduces to $\mathrm{sign}(\mathbf{w}_l^t)$, which is the BinaryConnect solution in (1).

From (7) and (8), each iteration first performs gradient descent along $\nabla_l \ell(\hat{\mathbf{w}}^{t-1})$ with an adaptive learning rate $1 \oslash \mathbf{d}_l^{t-1}$, and then projects it to a binary solution. As discussed in (Courbariaux et al., 2015), it is important to keep a full-precision weight during training. Hence, we replace (7) by

$$\mathbf{w}_l^t \leftarrow \mathbf{w}_l^{t-1} - \nabla_l \ell(\hat{\mathbf{w}}^{t-1}) \oslash \mathbf{d}_l^{t-1}.$$

The whole procedure, which will be called Loss-Aware Binarization (LAB), is shown in Algorithm 1. In steps 5 and 6, following (Li & Liu, 2016), we first rescale the input $\mathbf{x}_{l-1}^t$ to the $l$th layer with $\alpha_l$, so that multiplications in dot products and convolutions become additions. While binarizing weights changes most multiplications to additions, binarizing both weights and activations saves even more computations, as additions are further changed to XNOR bit operations (Hubara et al., 2016). Our Algorithm 1 can also be easily extended by binarizing the activations with the simple sign function.
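The per-layer LAB step is compact enough to show in a few lines. Below is a minimal NumPy sketch of one such step, following eqs. (7)-(8) with the full-precision replacement of (7) discussed above; the function name, the argument layout, and the `eps` smoothing constant are assumptions for illustration, not the paper's released code.

```python
import numpy as np

def lab_step(w_prev, grad, v, eps=1e-8):
    """One loss-aware binarization step for a single layer (a sketch).
    w_prev: full-precision weights w_l^{t-1} (flattened),
    grad:   gradient of the loss w.r.t. the binarized weights of this layer,
    v:      Adam-style second-moment estimate for this layer,
    eps:    small constant for numerical stability (an assumption, not from the paper)."""
    d = np.sqrt(v) + eps                         # diagonal Hessian estimate d_l^{t-1}
    w = w_prev - grad / d                        # preconditioned gradient step (cf. eq. (7))
    alpha = np.sum(d * np.abs(w)) / np.sum(d)    # eq. (8): curvature-weighted scaling alpha_l^t
    b = np.where(w >= 0, 1.0, -1.0)              # eq. (8): binary component b_l^t
    return w, alpha * b                          # updated full-precision and binarized weights
```

Note how the scaling in (8) weights each coordinate by its estimated curvature, unlike the uniform average in the BWN solution (2).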
### 3.2 EXTENSION TO RECURRENT NEURAL NETWORKS

The proposed method can be easily extended to recurrent neural networks. Let $\mathbf{x}_l$ and $\mathbf{h}_l$ be the input and hidden states, respectively, at time step (or depth) $l$. A typical recurrent neural network has a recurrence of the form $\mathbf{h}_l = W_x \mathbf{x}_l + W_h \sigma(\mathbf{h}_{l-1}) + \mathbf{b}$ (equivalent to the more widely known $\mathbf{h}_l = \sigma(W_x \mathbf{x}_l + W_h \mathbf{h}_{l-1} + \mathbf{b})$ (Pascanu et al., 2013)). We binarize both the input-to-hidden weight $W_x$ and the hidden-to-hidden weight $W_h$. Since weights are shared across time in a recurrent network, we only need to binarize $W_x$ and $W_h$ once in each forward propagation. Besides weights, one can also binarize the activations (of the inputs and hidden states) as in the previous section.

In deep networks, the backpropagated gradient takes the form of a product of Jacobian matrices (Pascanu et al., 2013). In a vanilla recurrent neural network, for activations $\mathbf{h}_p$ and $\mathbf{h}_q$ at depths $p$ and $q$, respectively (where $p > q$), the backpropagated gradient involves the Jacobian $\frac{\partial \mathbf{h}_p}{\partial \mathbf{h}_q} = \prod_{p \ge l > q} \frac{\partial \mathbf{h}_l}{\partial \mathbf{h}_{l-1}}$.
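Returning to the binarized recurrence above, the following NumPy sketch shows a forward pass that binarizes the shared weights $W_x$ and $W_h$ once and reuses them at every time step. All names are illustrative, the `binarize` argument stands for any of the schemes discussed earlier (e.g., the LAB step), and $\sigma$ is taken to be tanh for concreteness.

```python
import numpy as np

def binarized_rnn_forward(Wx, Wh, b, xs, h0, binarize):
    """Forward pass of h_l = Wx x_l + Wh sigma(h_{l-1}) + b with binarized shared weights."""
    Wx_hat = binarize(Wx)                 # binarize once per forward propagation;
    Wh_hat = binarize(Wh)                 # the weights are shared across all time steps
    h, hs = h0, []
    for x in xs:                          # xs: sequence of input vectors x_1, ..., x_T
        h = Wx_hat @ x + Wh_hat @ np.tanh(h) + b
        hs.append(h)
    return hs                             # hidden states h_1, ..., h_T
```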