# Rotated Binary Neural Network

Mingbao Lin 1, Rongrong Ji 1,2,3, Zihan Xu 1, Baochang Zhang 4, Yan Wang 5, Yongjian Wu 6, Feiyue Huang 6, Chia-Wen Lin 7

1 Media Analytics and Computing Lab, Department of Artificial Intelligence, School of Informatics, Xiamen University; 2 Institute of Artificial Intelligence, Xiamen University; 3 Peng Cheng Lab; 4 Beihang University; 5 Pinterest; 6 Tencent Youtu Lab; 7 National Tsing Hua University

Abstract

Binary Neural Network (BNN) shows its predominance in reducing the complexity of deep neural networks. However, it suffers severe performance degradation. One of the major impediments is the large quantization error between the full-precision weight vector and its binary vector. Previous works focus on compensating for the norm gap while leaving the angular bias hardly touched. In this paper, for the first time, we explore the influence of angular bias on the quantization error and then introduce a Rotated Binary Neural Network (RBNN), which considers the angle alignment between the full-precision weight vector and its binarized version. At the beginning of each training epoch, we propose to rotate the full-precision weight vector to its binary vector to reduce the angular bias. To avoid the high complexity of learning a large rotation matrix, we further introduce a bi-rotation formulation that learns two smaller rotation matrices. In the training stage, we devise an adjustable rotated weight vector for binarization to escape the potential local optimum. Our rotation leads to around 50% weight flips, which maximizes the information gain. Finally, we propose a training-aware approximation of the sign function for the gradient backward. Experiments on CIFAR-10 and ImageNet demonstrate the superiority of RBNN over many state-of-the-art methods. Our source code, experimental settings, training logs and binary models are available at https://github.com/lmbxmu/RBNN.

1 Introduction

The community has witnessed the remarkable performance improvements of deep neural networks (DNNs) in computer vision tasks such as image classification [26, 19], object detection [39, 20] and semantic segmentation [35, 33]. However, the cost of massive parameters and computational complexity makes DNNs hard to deploy on resource-constrained and low-power devices. To solve this problem, many compression techniques have been proposed, including network pruning [30, 14, 29], low-rank decomposition [12, 43, 18], efficient architecture design [24, 42, 7], network quantization [28, 2, 21], etc. In particular, network quantization converts the weights and activations of a full-precision network to low-bit representations. In the extreme case, a binary neural network (BNN) restricts its weights and activations to only two possible values (−1 and +1) such that: 1) the network size is 32× smaller than its full-precision counterpart; 2) the multiply-accumulate convolution can be replaced with efficient xnor and bitcount logics.

Though BNN has attracted great interest, it remains a challenge to close the accuracy gap between a full-precision network and its binarized version [38, 6].

Corresponding Author: rrji@xmu.edu.cn

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Figure 1: (a) Early works [8, 9] suffer from a large quantization error caused by both the norm gap and the angular bias between the full-precision weights and their binarized version.
(b) Recent works [38, 37] introduce a scaling factor to reduce the norm gap, but cannot reduce the angular bias, i.e., $\theta$. Therefore, the quantization error $\|\mathbf{w}\sin\theta\|^2$ is still large when $\theta$ is large.

Figure 2: Cosine similarity and quantization error in various layers of ResNet-20. (a) Our RBNN achieves a significantly higher cosine similarity between the full-precision weight and its binarization than XNOR-Net [38] does, implying less angular bias. (b) XNOR-Net suffers a great quantization error while RBNN leads to a much smaller one.

One of the major obstacles is the large quantization error between the full-precision weight vector $\mathbf{w}$ and its binary vector $\mathbf{b}$ [8, 9], as shown in Fig. 1(a). To solve this, state-of-the-art approaches [38, 37] try to lessen the quantization error by introducing a per-channel learnable/optimizable scaling factor $\lambda$ that minimizes the quantization error:

$$\min_{\lambda,\mathbf{b}} \ \|\lambda\mathbf{b} - \mathbf{w}\|^2. \qquad (1)$$

However, the introduction of $\lambda$ only partly mitigates the quantization error: it compensates for the norm gap between the full-precision weight and its binarized version, but cannot reduce the error caused by the angular bias, as shown in Fig. 1(b). Apparently, with a fixed angular bias $\theta$, Eq. (1) reaches its minimum when $\lambda\mathbf{b} - \mathbf{w}$ is orthogonal to $\lambda\mathbf{b}$, and we have

$$\|\mathbf{w}\sin\theta\|^2 \le \|\lambda\mathbf{b} - \mathbf{w}\|^2. \qquad (2)$$

Thus, $\|\mathbf{w}\sin\theta\|^2$ serves as the lower bound of the quantization error and cannot be diminished as long as the angular bias exists. This lower bound can be huge when the angular bias $\theta$ is large. Though the training process updates the weights and may close the angular bias, we experimentally observe that the possibility of this happening is small, as shown in Fig. 2. Thus, it is desirable to reduce this angular error for the sake of further reducing the quantization error.

Moreover, the information of BNN learning is upper-bounded by $2^n$, where $n$ is the total number of weight elements and the base 2 denotes the two possible values in a BNN [32, 37]. A weight flip occurs when a positive weight is binarized to −1, and vice versa. It is easy to see that when the probability of a flip reaches 50%, the information reaches the maximum of $2^n$. However, the scaling factor results in a small ratio of flipped weights, thus leading to little information gain in the training process [21, 37]².
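To make the bound in Eq. (2) concrete, the short NumPy sketch below (an illustration added here, not code from the paper) binarizes a random weight vector with the closed-form optimal scaling factor and checks that the remaining quantization error equals $\|\mathbf{w}\sin\theta\|^2$: once the norm gap is optimally compensated, only the angular bias is left.

```python
import numpy as np

np.random.seed(0)
w = np.random.randn(64)          # a full-precision weight vector
b = np.sign(w)                   # its binarization, b = sign(w)

# Optimal scaling factor for min_lambda ||lambda * b - w||^2
# (closed form: lambda* = b^T w / ||b||^2 = ||w||_1 / n, as in XNOR-Net)
lam = (b @ w) / (b @ b)

# Angular bias between w and its binarization
cos_theta = (b @ w) / (np.linalg.norm(b) * np.linalg.norm(w))
sin_theta = np.sqrt(1.0 - cos_theta ** 2)

quant_err = np.linalg.norm(lam * b - w) ** 2      # ||lambda * b - w||^2
lower_bound = np.linalg.norm(w * sin_theta) ** 2  # ||w sin(theta)||^2

# The two values coincide: the scaling factor removes the norm gap,
# but the error due to the angular bias theta remains.
print(quant_err, lower_bound)
```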
In this paper, we propose a Rotated Binary Neural Network (RBNN) to further mitigate the quantization error arising from this intrinsic angular bias, as illustrated in Fig. 3. To the best of our knowledge, this is the first work that explores and reduces the influence of the angular bias on the quantization error in the field of BNNs. To this end, we devise an angle alignment scheme that learns a rotation matrix to rotate the full-precision weight vector towards its geometrical vertex of the binary hypercube at the beginning of each training epoch. Instead of directly learning a large rotation matrix, we introduce a bi-rotation formulation that learns two smaller matrices with a significantly reduced complexity. A series of optimization steps is then developed to learn the rotation matrices and the binarization alternately, which aligns the angle difference as shown in Fig. 2(a) and significantly reduces the quantization error as illustrated in Fig. 2(b). To get rid of a possible local optimum in the optimization, we dynamically adjust the rotated weights for binarization in the training stage.

We show that the proposed rotation not only reduces the angular bias, which leads to less quantization error, but also achieves around 50% weight flips, thereby attaining the maximum information gain. Finally, we provide a training-aware approximation of the sign function for gradient backpropagation. We show the superiority of RBNN through extensive experiments.

² The binarization in Eq. (1) is obtained by $\mathbf{b} = \mathrm{sign}(\mathbf{w})$, which does not change the coordinate quadrant, as shown in Fig. 1. Thus, only a small number of weight flips occur in the training stage. See Sec. 4.3 for our experimental validation.

Figure 3: Framework of our RBNN. The weight vector is rotated at the beginning of each training epoch such that the angular bias $\phi$ between the rotated weight vector and the geometrical binary vertex is smaller than the original $\theta$. After rotation, the weights are either unflipped (b) or flipped (c), which increases the information gain. During training, the rotated weights are dynamically adjusted such that $\tilde{\mathbf{w}}$ with a much smaller angular bias is obtained, which is then binarized.

2 Related Work

The pioneering BNN work dates back to [9], which binarizes the weights and activations to −1 or +1 with the sign function. The straight-through estimator (STE) [3] was proposed for gradient backpropagation. Following this, abundant works have been devoted to improving the accuracy and to implementations on low-power, resource-constrained platforms. We refer readers to the surveys [40, 36] for a more detailed overview.

XNOR-Net [38] includes all the basic components from [3] but further introduces a per-channel scaling factor to reduce the quantization error. The scaling factor is obtained through the $\ell_1$-norm of both weights and activations before binarization. DoReFa-Net [45] introduces a changeable bit-width for the quantization of weights and activations, and even of the gradients in backpropagation. Its scaling factor is layer-wise and deduced from the network weights, which allows efficient inference since the weights do not change after training. XNOR++ [5] fuses the separate activation and weight scaling factors of [3] into a single one, which is then learned discriminatively via backpropagation.

Besides using a scaling factor to reduce the quantization error, more recent works are engaged in expanding the expressive ability to gain more information in BNN learning, and in devising differentiable approximations to the sign function to enable gradient propagation. ABC-Net [31] proposes multiple parallel binary convolution layers to enhance the model accuracy. Bi-Real Net [32] adds ResNet-like shortcuts to reduce the information loss caused by binarization. Both ABC-Net and Bi-Real Net modify the network structure to strengthen the information of BNN learning. However, they ignore the probability of weight flips, so the actual learning capacities of ABC-Net and Bi-Real Net are smaller than $2^n$, as stressed in Sec. 1. To compensate, [21, 37] strengthen the learning ability of BNNs during network training by increasing the probability of weight flips.

3.1 Binary Neural Networks

Given a CNN model, we denote by $\mathbf{w}^i \in \mathbb{R}^{n_i}$ and $\mathbf{a}^i \in \mathbb{R}^{m_i}$ its weights and feature maps in the $i$-th layer, where $n_i = c^i_{out} \cdot c^i_{in} \cdot w^i_f \cdot h^i_f$ and $m_i = c^i_{out} \cdot w^i_a \cdot h^i_a$. Here, $(c^i_{out}, c^i_{in})$ are the numbers of output and input channels, respectively, while $(w^i_f, h^i_f)$ and $(w^i_a, h^i_a)$ are the width and height of the filters and feature maps, respectively. We then have $\mathbf{a}^i = \mathbf{w}^i \ast \mathbf{a}^{i-1}$, where $\ast$ is the standard convolution and we omit activation layers for simplicity. A BNN aims to convert $\mathbf{w}^i$ and $\mathbf{a}^i$ into $\mathbf{b}^i_w \in \{-1,+1\}^{n_i}$ and $\mathbf{b}^i_a \in \{-1,+1\}^{m_i}$, such that the convolution can be carried out with the efficient xnor and bitcount logics. Following [23, 5], we binarize the activations with the sign function: $\mathbf{b}^i_a = \mathrm{sign}(\mathbf{b}^i_w \odot \mathbf{b}^{i-1}_a)$, where $\odot$ represents the xnor and bitcount operations, and $\mathrm{sign}(\cdot)$ returns +1 if its input is larger than zero and −1 otherwise. Similar to [23, 45, 5], $\mathbf{b}^i_w$ can be obtained by $\mathbf{b}^i_w = \mathrm{sign}(\mathbf{w}^i)$, and a scaling factor can be applied to compensate for the norm difference. However, the existence of an angular bias between $\mathbf{w}^i$ and $\mathbf{b}^i_w$ can lead to a large quantization error, as analyzed in Sec. 1. Besides, it results in consistent signs between $\mathbf{w}^i$ and $\mathbf{b}^i_w$, which lessens the information gain (see footnote 2). We aim to minimize this angular bias to reduce the quantization error, and meanwhile to increase the probability of weight flips so as to increase the information gain.
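As a side illustration of the xnor and bitcount logics mentioned above (a minimal sketch under the usual bit encoding, not the paper's kernel implementation): for vectors in $\{-1,+1\}^n$ stored as bits, the dot product required by a binary convolution can be recovered from an xnor followed by a popcount.

```python
import numpy as np

np.random.seed(0)
n = 16
bw = np.where(np.random.randn(n) >= 0, 1, -1)   # binary weights in {-1, +1}
ba = np.where(np.random.randn(n) >= 0, 1, -1)   # binary activations in {-1, +1}

dot_ref = int(bw @ ba)            # reference {-1, +1} dot product

# Bit encoding: -1 -> 0, +1 -> 1
w_bits = (bw > 0).astype(np.uint8)
a_bits = (ba > 0).astype(np.uint8)

# xnor gives 1 where the signs agree; bitcount counts the agreements,
# and dot = (#agreements) - (#disagreements) = 2 * #agreements - n
xnor = 1 - (w_bits ^ a_bits)
dot_xnor = int(2 * xnor.sum() - n)

print(dot_ref, dot_xnor)          # identical values
```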
3.2 Rotated Binary Neural Networks

As shown in Fig. 3, we consider applying a rotation matrix $R^i \in \mathbb{R}^{n_i \times n_i}$ to $\mathbf{w}^i$ at the beginning of each training epoch, such that the angle $\phi^i$ between the rotated weight vector $(R^i)^T\mathbf{w}^i$ and its binary vector $\mathrm{sign}\big((R^i)^T\mathbf{w}^i\big)$ is minimized. To this end, we derive the following formulation:

$$\cos(\phi^i) = \frac{\mathrm{sign}\big((R^i)^T\mathbf{w}^i\big)^T (R^i)^T\mathbf{w}^i}{\big\|\mathrm{sign}\big((R^i)^T\mathbf{w}^i\big)\big\|_2 \big\|(R^i)^T\mathbf{w}^i\big\|_2}, \quad \text{s.t.} \ (R^i)^T R^i = I_{n_i}, \qquad (3)$$

where $I_{n_i} \in \mathbb{R}^{n_i \times n_i}$ is the $n_i$-th order identity matrix. It is easy to see that $\|\mathrm{sign}((R^i)^T\mathbf{w}^i)\|_2 = \sqrt{n_i}$ and $\|(R^i)^T\mathbf{w}^i\|_2 = \|\mathbf{w}^i\|_2$, both of which are constant³. Thus, Eq. (3) can be further simplified as

$$\cos(\phi^i) = \eta^i\, \mathrm{sign}\big((R^i)^T\mathbf{w}^i\big)^T (R^i)^T\mathbf{w}^i = \eta^i\, \mathrm{tr}\big(\mathbf{b}^i_w (\mathbf{w}^i)^T R^i\big), \quad \text{s.t.} \ (R^i)^T R^i = I_{n_i}, \qquad (4)$$

where $\eta^i = 1/\big(\|\mathrm{sign}((R^i)^T\mathbf{w}^i)\|_2 \|(R^i)^T\mathbf{w}^i\|_2\big) = 1/\big(\sqrt{n_i}\,\|\mathbf{w}^i\|_2\big)$, $\mathrm{tr}(\cdot)$ returns the trace of the input matrix, and $\mathbf{b}^i_w = \mathrm{sign}\big((R^i)^T\mathbf{w}^i\big)$.

However, Eq. (4) involves a large rotation matrix: $n_i$ can be up to millions in a neural network. Direct optimization of $R^i$ would consume massive memory and computation; performing such a large rotation incurs $\mathcal{O}(n_i^2)$ complexity in both space and time. To deal with this, inspired by the properties of the Kronecker product [27], we introduce a bi-rotation scenario in which two smaller rotation matrices $R^i_1 \in \mathbb{R}^{n^i_1 \times n^i_1}$ and $R^i_2 \in \mathbb{R}^{n^i_2 \times n^i_2}$ are used to reconstruct the large rotation matrix $R^i \in \mathbb{R}^{n_i \times n_i}$ with $n_i = n^i_1 n^i_2$. One basic property of the Kronecker product [27] is that if two matrices $R^i_1 \in \mathbb{R}^{n^i_1 \times n^i_1}$ and $R^i_2 \in \mathbb{R}^{n^i_2 \times n^i_2}$ are orthogonal, then $R^i_1 \otimes R^i_2 \in \mathbb{R}^{n^i_1 n^i_2 \times n^i_1 n^i_2}$ is orthogonal as well, where $\otimes$ denotes the Kronecker product. Another basic property is

$$(\mathbf{w}^i)^T (R^i_1 \otimes R^i_2) = \mathrm{Vec}\big((R^i_2)^T (W^i)^T R^i_1\big)^T, \qquad (5)$$

where $\mathrm{Vec}(\cdot)$ vectorizes its input and $\mathrm{Vec}(W^i) = \mathbf{w}^i$. Thus, applying the bi-rotation to $W^i$ is equivalent to applying the rotation $R^i = R^i_1 \otimes R^i_2 \in \mathbb{R}^{n^i_1 n^i_2 \times n^i_1 n^i_2}$ to $\mathbf{w}^i$: learning the two smaller matrices $R^i_1$ and $R^i_2$ can well reconstruct the large rotation matrix $R^i$. Moreover, performing the bi-rotation consumes only $\mathcal{O}\big((n^i_1)^2 + (n^i_2)^2\big)$ space and $\mathcal{O}\big((n^i_1)^2 n^i_2 + n^i_1 (n^i_2)^2\big)$ time, leading to a significant complexity reduction compared to the large rotation⁴. Accordingly, Eq. (4) can be reformulated as

$$\cos(\phi^i) = \eta^i\, \mathrm{tr}\Big(\mathbf{b}^i_w\, \mathrm{Vec}\big((R^i_2)^T (W^i)^T R^i_1\big)^T\Big) = \eta^i\, \mathrm{tr}\big(B^i_W (R^i_2)^T (W^i)^T R^i_1\big), \quad \text{s.t.} \ (R^i_1)^T R^i_1 = I_{n^i_1}, \ (R^i_2)^T R^i_2 = I_{n^i_2}, \qquad (6)$$

where $B^i_W = \mathrm{sign}\big((R^i_1)^T W^i R^i_2\big)$. Finally, we rewrite our optimization objective as

$$\arg\max_{B^i_W, R^i_1, R^i_2} \mathrm{tr}\big(B^i_W (R^i_2)^T (W^i)^T R^i_1\big), \quad \text{s.t.} \ B^i_W \in \{-1,+1\}^{n^i_1 \times n^i_2}, \ (R^i_1)^T R^i_1 = I_{n^i_1}, \ (R^i_2)^T R^i_2 = I_{n^i_2}. \qquad (7)$$

³ To stress, the rotation is applied at the beginning of each training epoch instead of during the training stage. Thus, $\|\mathbf{w}^i\|_2$ can be regarded as a constant.

⁴ With $n^i_1 n^i_2 = n_i$, our RBNN achieves the least complexity when $n^i_1 = n^i_2 = \sqrt{n_i}$, which is also our experimental setting.
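The identity in Eq. (5) is easy to sanity-check numerically. The sketch below is only an illustration: it assumes a column-major $\mathrm{Vec}(\cdot)$ and stores the reshaped weights as an $n^i_2 \times n^i_1$ matrix, so that the code's `W` plays the role of the paper's $(W^i)^T$; under these assumptions it confirms that the two small rotations reproduce the large rotation $R^i_1 \otimes R^i_2$.

```python
import numpy as np

np.random.seed(0)
n1, n2 = 6, 8                                   # n = n1 * n2
R1 = np.linalg.qr(np.random.randn(n1, n1))[0]   # small orthogonal rotation (n1 x n1)
R2 = np.linalg.qr(np.random.randn(n2, n2))[0]   # small orthogonal rotation (n2 x n2)

W = np.random.randn(n2, n1)      # plays the role of (W^i)^T in Eq. (5)
w = W.flatten(order="F")         # column-major Vec(.), so Vec(W) = w

R = np.kron(R1, R2)              # the full n x n rotation R1 (x) R2
print(np.allclose(R.T @ R, np.eye(n1 * n2)))  # Kronecker of orthogonal matrices is orthogonal

lhs = R.T @ w                                   # rotating the long vector directly
rhs = (R2.T @ W @ R1).flatten(order="F")        # two small rotations, then Vec(.)
print(np.allclose(lhs, rhs))                    # True: the bi-rotation reproduces R
```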
3.3 Alternating Optimization

Eq. (7) is non-convex w.r.t. $B^i_W$, $R^i_1$ and $R^i_2$. To find a feasible solution, we adopt an alternating optimization approach, i.e., updating one variable with the other two fixed until convergence.

1) $B^i_W$-step: Fix $R^i_1$ and $R^i_2$, then learn the binarization $B^i_W$. The sub-problem of Eq. (7) becomes

$$\arg\max_{B^i_W} \mathrm{tr}\big(B^i_W (R^i_2)^T (W^i)^T R^i_1\big), \quad \text{s.t.} \ B^i_W \in \{-1,+1\}^{n^i_1 \times n^i_2}, \qquad (8)$$

which is solved by $B^i_W = \mathrm{sign}\big((R^i_1)^T W^i R^i_2\big)$.

2) $R^i_1$-step: Fix $B^i_W$ and $R^i_2$, then update $R^i_1$. The corresponding sub-problem is

$$\arg\max_{R^i_1} \mathrm{tr}\big(G^i_1 R^i_1\big), \quad \text{s.t.} \ (R^i_1)^T R^i_1 = I_{n^i_1}, \qquad (9)$$

where $G^i_1 = B^i_W (R^i_2)^T (W^i)^T$. The maximum is attained via the polar decomposition [34]: $R^i_1 = V^i_1 (U^i_1)^T$, where $G^i_1 = U^i_1 S^i_1 (V^i_1)^T$ is the SVD of $G^i_1$.

3) $R^i_2$-step: Fix $B^i_W$ and $R^i_1$, then update $R^i_2$. The corresponding sub-problem becomes

$$\arg\max_{R^i_2} \mathrm{tr}\big((R^i_2)^T G^i_2\big), \quad \text{s.t.} \ (R^i_2)^T R^i_2 = I_{n^i_2}, \qquad (10)$$

where $G^i_2 = (W^i)^T R^i_1 B^i_W$. Similar to the update rule for $R^i_1$, the update rule for $R^i_2$ is $R^i_2 = U^i_2 (V^i_2)^T$, where $G^i_2 = U^i_2 S^i_2 (V^i_2)^T$ is the SVD of $G^i_2$.

In the experiments, we iteratively update $B^i_W$, $R^i_1$ and $R^i_2$, which reaches convergence after three cycles of updating. Therefore, the weight rotation can be implemented efficiently.

3.4 Adjustable Rotated Weight Vector

Figure 4: Illustration of (a) overshooting and (b) undershooting of the rotated weight vector, and (c) the update of $\alpha^i$ during training (best viewed with zooming in).

We narrow the angular bias between the full-precision weights and the binarization using our bi-rotation at the beginning of each training epoch. We can then set $\tilde{\mathbf{w}}^i = (R^i)^T \mathbf{w}^i$, which is fed to the sign function and follows the standard gradient update in the neural network. However, the alternating optimization may get trapped in a local optimum that either overshoots (Fig. 4(a)) or undershoots (Fig. 4(b)) the binarization $\mathrm{sign}\big((R^i)^T\mathbf{w}^i\big)$. To deal with this, we further propose to self-adjust the rotated weight vector as

$$\tilde{\mathbf{w}}^i = \mathbf{w}^i + \big((R^i)^T\mathbf{w}^i - \mathbf{w}^i\big)\alpha^i, \qquad (11)$$

where $\alpha^i = \mathrm{abs}\big(\sin(\beta^i)\big) \in [0, 1]$ and $\beta^i \in \mathbb{R}$. As can be seen from Fig. 4, Eq. (11) constrains the final weight vector to move along the residual direction $(R^i)^T\mathbf{w}^i - \mathbf{w}^i$ with $\alpha^i \ge 0$. Intuitively, when overshooting occurs, $\alpha^i < 1$ is needed; when undershooting occurs, $\alpha^i > 1$ is needed. We empirically observe that overshooting dominates. Thus, we simply constrain $\alpha^i \in [0, 1]$ to shrink its feasible region, which we find further reduces the quantization error and boosts the performance, as demonstrated in Table 4. The final value of $\alpha^i$ varies across layers. In Fig. 4(c), we show a toy example of how $\alpha^i$ updates during training in ResNet-20 (layer2.2.conv2).

At the beginning of each training epoch, with fixed $\mathbf{w}^i$, we learn the rotation matrix $R^i$ ($R^i_1$ and $R^i_2$ in practice). In the training stage, with fixed $R^i$, we feed the sign function with Eq. (11) in the forward pass, and update $\mathbf{w}^i$ and $\beta^i$ in the backward pass.
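The whole per-layer procedure of Secs. 3.3 and 3.4 is summarized by the NumPy sketch below. It is an illustration of the update rules under an assumed $n^i_1 \times n^i_2$ reshaping of the weight vector, not the released RBNN implementation; in particular, the fixed $\beta$ stands in for a value that is actually learned by backpropagation.

```python
import numpy as np

def binarize(x):
    """sign(.) mapped to {-1, +1}, treating zero as +1."""
    return np.where(x >= 0, 1.0, -1.0)

def learn_bi_rotation(W, n_cycles=3):
    """Alternating optimization of Eq. (7) for one layer.
    W is an n1 x n2 reshape of the full-precision weight vector (layout assumed);
    the paper reports convergence within about three cycles."""
    n1, n2 = W.shape
    R1, R2 = np.eye(n1), np.eye(n2)
    for _ in range(n_cycles):
        # B_W-step, Eq. (8): closed-form binarization of the rotated weights
        B = binarize(R1.T @ W @ R2)
        # R1-step, Eq. (9): polar decomposition of G1 = B_W R2^T W^T, R1 = V1 U1^T
        U1, _, V1t = np.linalg.svd(B @ R2.T @ W.T)
        R1 = V1t.T @ U1.T
        # R2-step, Eq. (10): polar decomposition of G2 = W^T R1 B_W, R2 = U2 V2^T
        U2, _, V2t = np.linalg.svd(W.T @ R1 @ B)
        R2 = U2 @ V2t
    return R1, R2

np.random.seed(0)
W = np.random.randn(16, 16)      # toy layer with n1 = n2 = sqrt(n) = 16
R1, R2 = learn_bi_rotation(W)    # done once at the beginning of each epoch

# Adjustable rotated weights, Eq. (11): R1^T W R2 is the matrix form of (R^i)^T w^i
# (up to the vectorization convention), and alpha = |sin(beta)| with beta learned
# during the epoch; a fixed toy value is used here.
alpha = np.abs(np.sin(0.3))
W_tilde = W + (R1.T @ W @ R2 - W) * alpha
B_final = binarize(W_tilde)      # fed to the sign approximation during training
```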
3.5 Gradient Approximation

The derivative of the sign function is almost zero everywhere, which makes the training unstable and degrades the accuracy. To solve this, various gradient approximations have been proposed in the literature to enable gradient updating, e.g., straight-through estimation [3], the piecewise polynomial function [32], the annealing hyperbolic tangent function [1], the error decay estimator [37], and so on. Instead of simply using these existing approximations, in this paper we devise the following training-aware approximation function to replace the sign function:

$$F(x) = \begin{cases} k\big(-\mathrm{sign}(x)\,\frac{t^2 x^2}{2} + \sqrt{2}\,t x\big), & |x| < \frac{\sqrt{2}}{t}, \\ \mathrm{sign}(x)\,k, & \text{otherwise}, \end{cases} \qquad (12)$$

with $t = 10^{\,T_{min} + \frac{e}{E}(T_{max} - T_{min})}$ and $k = \max\big(\frac{1}{t}, 1\big)$, where $T_{min} = -2$ and $T_{max} = 1$ in our implementation, $E$ is the number of training epochs, and $e$ represents the current epoch. As can be seen, the shape of $F(x)$ relies on the value of $\frac{e}{E}$, which indicates the training progress. The gradient of $F(x)$ w.r.t. the input $x$ is then

$$F'(x) = \frac{\partial F(x)}{\partial x} = \max\big(k(\sqrt{2}\,t - t^2|x|),\ 0\big). \qquad (13)$$

Figure 5: Visualization of our approximation function and its derivative w.r.t. different values of $\frac{e}{E}$.

In Fig. 5, we visualize Eq. (12) and Eq. (13) for varying values of $\frac{e}{E}$. In the early training stage, the gradient exists almost everywhere, which overcomes the drawback of the sign function and enables the updating of the whole network. As training proceeds, our approximation gradually becomes a sign-like function, which preserves the binary property. Thus, our approximation is training-aware.

So far, we have the gradients of the loss function $L$ w.r.t. the activation $\mathbf{a}^i$ and the weight $\mathbf{w}^i$:

$$\sigma_{\mathbf{a}^i} = \frac{\partial L}{\partial F(\mathbf{a}^i)} \frac{\partial F(\mathbf{a}^i)}{\partial \mathbf{a}^i}, \qquad (14)$$

$$\sigma_{\mathbf{w}^i} = \frac{\partial L}{\partial F(\tilde{\mathbf{w}}^i)} \frac{\partial F(\tilde{\mathbf{w}}^i)}{\partial \tilde{\mathbf{w}}^i} \frac{\partial \tilde{\mathbf{w}}^i}{\partial \mathbf{w}^i}, \quad \text{with} \quad \frac{\partial \tilde{\mathbf{w}}^i}{\partial \mathbf{w}^i} = (1-\alpha^i)\, I_{n_i} + \alpha^i (R^i)^T. \qquad (15)$$

Besides, the gradient of $\alpha^i$ in Eq. (11) can be obtained by summing $\big[\sigma_{\tilde{\mathbf{w}}^i}\big]_j \big[(R^i)^T\mathbf{w}^i - \mathbf{w}^i\big]_j$ over $j$, where $[\cdot]_j$ denotes the $j$-th element of its input vector and $\sigma_{\tilde{\mathbf{w}}^i}$ is the gradient of $L$ w.r.t. $\tilde{\mathbf{w}}^i$.

Note that the error decay estimator in [37] can also be regarded as a training-aware approximation. However, our design in Eq. (12) is fundamentally different from that in [37], and its superiority is validated in Sec. 4.2.
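A small self-contained sketch of the training-aware approximation, following the form of Eqs. (12)-(13) as reconstructed above (in particular, $k=\max(1/t,1)$ with $T_{min}=-2$ and $T_{max}=1$ are taken from that reconstruction); it is an illustration rather than the released training code, where this would back a custom autograd function.

```python
import numpy as np

T_MIN, T_MAX = -2.0, 1.0   # temperature schedule endpoints

def schedule(e, E):
    """Training-aware parameters: t grows from 1e-2 towards 10 over training."""
    t = 10.0 ** (T_MIN + (e / E) * (T_MAX - T_MIN))
    k = max(1.0 / t, 1.0)
    return t, k

def forward_approx(x, e, E):
    """F(x) of Eq. (12): nearly linear early in training, sign-like at the end."""
    t, k = schedule(e, E)
    inner = k * (-np.sign(x) * (t ** 2) * x ** 2 / 2.0 + np.sqrt(2.0) * t * x)
    return np.where(np.abs(x) < np.sqrt(2.0) / t, inner, np.sign(x) * k)

def backward_approx(x, e, E):
    """F'(x) of Eq. (13): max(k (sqrt(2) t - t^2 |x|), 0)."""
    t, k = schedule(e, E)
    return np.maximum(k * (np.sqrt(2.0) * t - (t ** 2) * np.abs(x)), 0.0)

x = np.linspace(-3, 3, 7)
print(forward_approx(x, e=5, E=400))     # early epoch: smooth, wide support
print(forward_approx(x, e=390, E=400))   # late epoch: close to sign(x)
print(backward_approx(x, e=390, E=400))  # gradient concentrates around zero
```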
4 Experiments

In this section, we evaluate our RBNN on CIFAR-10 [25] using ResNet-18/20 [19] and VGG-small [44], and on ImageNet [11] using ResNet-18/34 [19]. Following the compared methods, all convolutional and fully-connected layers except the first and last ones are binarized. We implement RBNN in PyTorch and adopt SGD as the optimizer. For a fair comparison, we apply only the classification loss during training.

Table 1: Performance comparison with SOTAs on CIFAR-10. W/A denotes the bit length of weights and activations. The superscript † denotes the network with the Bi-Real structure [32].

| Network | Method | W/A | Acc. |
|---|---|---|---|
| ResNet-18 | FP | 32/32 | 93.0% |
| | RAD [13] | 1/1 | 90.5% |
| | IR-Net [37] | 1/1 | 91.5% |
| | RBNN (Ours) | 1/1 | 92.2% |
| ResNet-20 | FP | 32/32 | 91.7% |
| | DoReFa [45] | 1/1 | 79.3% |
| | DSQ [15] | 1/1 | 84.1% |
| | IR-Net [37] | 1/1 | 85.4% |
| | RBNN (Ours) | 1/1 | 86.5% |
| | IR-Net† [37] | 1/1 | 86.5% |
| | RBNN† (Ours) | 1/1 | 87.8% |
| VGG-small | FP | 32/32 | 91.7% |
| | LAB [22] | 1/1 | 87.7% |
| | XNOR-Net [38] | 1/1 | 89.8% |
| | BNN [23] | 1/1 | 89.9% |
| | RAD [13] | 1/1 | 90.0% |
| | IR-Net [37] | 1/1 | 90.4% |
| | RBNN (Ours) | 1/1 | 91.3% |

Table 2: Performance comparison with SOTAs on ImageNet. W/A denotes the bit length of weights and activations. We report the top-1 and top-5 accuracy.

| Network | Method | W/A | Top-1 | Top-5 |
|---|---|---|---|---|
| ResNet-18 | FP | 32/32 | 69.6% | 89.2% |
| | ABC-Net [31] | 1/1 | 42.7% | 67.6% |
| | XNOR-Net [38] | 1/1 | 51.2% | 73.2% |
| | BNN+ [10] | 1/1 | 53.0% | 72.6% |
| | DoReFa [45] | 1/2 | 53.4% | - |
| | Bi-Real [32] | 1/1 | 56.4% | 79.5% |
| | XNOR++ [5] | 1/1 | 57.1% | 79.9% |
| | IR-Net [37] | 1/1 | 58.1% | 80.0% |
| | RBNN (Ours) | 1/1 | 59.9% | 81.9% |
| ResNet-34 | FP | 32/32 | 73.3% | 91.3% |
| | ABC-Net [31] | 1/1 | 52.4% | 76.5% |
| | Bi-Real [32] | 1/1 | 62.2% | 83.9% |
| | IR-Net [37] | 1/1 | 62.9% | 84.1% |
| | RBNN (Ours) | 1/1 | 63.1% | 84.4% |

4.1 Experimental Results

4.1.1 CIFAR-10

On CIFAR-10, we compare our RBNN with several SOTAs. For ResNet-18, we compare with RAD [13] and IR-Net [37]. For ResNet-20, we compare with DoReFa [45], DSQ [15], and IR-Net [37]. For VGG-small, we compare with LAB [22], XNOR-Net [38], BNN [23], RAD [13], and IR-Net [37]. We list the experimental results in Table 1. As can be seen, RBNN consistently outperforms the SOTAs. Compared with the best baseline [37], RBNN achieves 0.7%, 1.1% and 0.9% accuracy improvements with ResNet-18, ResNet-20 with the normal structure [19], and VGG-small, respectively. Furthermore, binarizing the network with the Bi-Real structure [32] achieves better accuracy than the normal structure, as shown with ResNet-20: with the Bi-Real structure, IR-Net obtains a 1.1% accuracy improvement while RBNN gains 1.3%. Other variants of network structure proposed in [4, 46, 16] and training losses in [22, 13, 41, 17] could be combined to further improve the final accuracy. Nevertheless, under the same structure, our RBNN performs best (87.8% for RBNN vs. 86.5% for IR-Net on ResNet-20 with the Bi-Real structure). Hence, the benefit of the angle alignment is evident.

4.1.2 ImageNet

We further show the experimental results on ImageNet in Table 2. For ResNet-18, we compare RBNN with ABC-Net [31], XNOR-Net [38], BNN+ [10], DoReFa [45], Bi-Real [32], XNOR++ [5], and IR-Net [37]. For ResNet-34, ABC-Net [31], Bi-Real [32], and IR-Net [37] are compared. As shown in Table 2, RBNN beats all the compared binary models in both top-1 and top-5 accuracy. In more detail, with ResNet-18, RBNN achieves 59.9% top-1 and 81.9% top-5 accuracy, improving over IR-Net by 1.8% and 1.9%, respectively. With ResNet-34, it achieves a top-1 accuracy of 63.1% and a top-5 accuracy of 84.4%, improving over IR-Net by 0.2% and 0.3%, respectively.

4.2 Performance Study

In this section, we first show the benefit of our training-aware approximation over other recent advances [3, 32, 37], and then show the effect of the different components of RBNN. All experiments are conducted on top of ResNet-20 with the Bi-Real structure on CIFAR-10.

Figure 6: Weight histograms (before binarization) of XNOR-Net and RBNN in ResNet-20: (a) layer1.0.conv2, (b) layer2.0.conv1, (c) layer2.1.conv2.

Table 3: Gradient approximation analysis.

| Method | W/A | Acc. |
|---|---|---|
| FP | 32/32 | 91.7% |
| STE [3] | 1/1 | 84.9% |
| PPF [32] | 1/1 | 86.9% |
| EDE [37] | 1/1 | 86.0% |
| Ours | 1/1 | 87.8% |
Table 4: Ablation study of RBNN. B, T, R and A respectively denote the binarization using XNOR-Net, the training-aware approximation, the weight rotation and the adjustable scheme.

| Method | W/A | Acc. |
|---|---|---|
| FP | 32/32 | 91.7% |
| B | 1/1 | 83.7% |
| B + R | 1/1 | 86.4% |
| B + T | 1/1 | 86.6% |
| B + T + R | 1/1 | 87.1% |
| B + T + R + A (RBNN) | 1/1 | 87.8% |

Table 3 compares the performance of RBNN built on various gradient approximations, including straight-through estimation (denoted by STE) [3], the piecewise polynomial function (denoted by PPF) [32] and the error decay estimator (denoted by EDE) [37]. As can be seen, STE shows the lowest accuracy. Though EDE also studies an approximation with dynamic changes, its performance is even worse than PPF, which is a fixed approximation. In contrast, our training-aware approximation achieves a 1.8% improvement over EDE, which validates its effectiveness.

To further understand the effect of each component of RBNN, we conduct an ablation study by starting with the binarization of XNOR-Net [38] (denoted by B) and then gradually adding the training-aware approximation (denoted by T), the weight rotation (denoted by R) and the adjustable scheme (denoted by A). As shown in Table 4, the binarization of XNOR-Net suffers a large performance degradation of 8.0% compared with the full-precision model. Adding our weight rotation or our training-aware approximation increases the accuracy to 86.4% or 86.6%, respectively. The joint effect of weight rotation and training-aware approximation further raises it to 87.1%. Lastly, with the adjustable weight vector in the training process, our RBNN achieves the highest accuracy of 87.8%. Therefore, each part of RBNN plays its unique role in improving the performance.

4.3 Weight Distribution and Flips

Figure 7: Weight flip rates of our RBNN and XNOR-Net in different layers of ResNet-20 (RBNN stays around 0.47-0.50 in every layer).

Fig. 6 shows the histograms of the weights (before binarization) for XNOR-Net and our RBNN. The weight values of XNOR-Net are mixed up tightly around the zero center and their magnitude remains far less than 1, which causes a large quantization error when they are pushed to the binary values of −1 and +1. On the contrary, our RBNN results in a two-mode distribution, with each mode centered around −1/+1. Besides, few weights lie around zero, which creates a clear boundary between the two modes. Thus, through the weight rotation, our RBNN effectively reduces the quantization error, as explained in Fig. 2.

As discussed in Sec. 2, the capacity of BNN learning is up to $2^n$, where $n$ is the total number of weight elements. When the probability of each element being −1 or +1 is equal during training, it reaches the maximum of $2^n$. We compare the initialization weights with the binary weights, and show the weight flip rates of our RBNN and XNOR-Net across the layers of ResNet-20 in Fig. 7. As can be observed, XNOR-Net leads to a small flip rate, i.e., most positive weights are directly quantized to +1, and vice versa. In contrast, RBNN leads to around 50% weight flips in each layer due to the introduced weight rotation, as illustrated in Fig. 3, which maximizes the information gain during the training stage.
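The flip rate itself is simple to measure; the toy sketch below (a hypothetical helper, not the paper's evaluation script) compares the signs of the initial full-precision weights with the final binary weights.

```python
import numpy as np

def flip_rate(w_init, b_final):
    """Fraction of weights whose binary value disagrees with the sign of the
    initial full-precision weight, i.e., positive weights quantized to -1 and
    vice versa. A rate around 0.5 maximizes the information gain discussed above."""
    return float(np.mean(np.sign(w_init) != np.sign(b_final)))

np.random.seed(0)
w_init = np.random.randn(10_000)
print(flip_rate(w_init, np.sign(w_init)))                   # 0.0: plain sign(w), XNOR-style
print(flip_rate(w_init, np.sign(np.random.randn(10_000))))  # ~0.5: rotation-like sign pattern
```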
5 Conclusion

In this paper, we analyzed the influence of angular bias on the quantization error in binary neural networks and proposed a Rotated Binary Neural Network (RBNN) that achieves angle alignment between the rotated weight vector and the binary vector at the beginning of each training epoch. We also introduced a bi-rotation scheme involving two smaller rotation matrices to reduce the complexity of learning a large rotation matrix. In the training stage, our method dynamically adjusts the rotated weight vector via backward gradient updating to overcome the potential sub-optimality of the bi-rotation optimization. Our rotation maximizes the information gain of BNN learning by achieving around 50% weight flips. To enable gradient propagation, we devised a training-aware approximation of the sign function. Extensive experiments have demonstrated the efficacy of RBNN in reducing the quantization error and its superiority over several SOTAs.

Acknowledgement

This work is supported by the National Natural Science Foundation of China (No. U1705262, No. 61772443, No. 61572410, No. 61802324 and No. 61702136), the National Key R&D Program (No. 2017YFC0113000 and No. 2016YFB1001503), the Key R&D Program of Jiangxi Province (No. 20171ACH80022), the Natural Science Foundation of Guangdong Province in China (No. 2019B1515120049) and the National Key R&D Plan Project (No. 2018YFC0830105 and No. 2018YFC0830100).

Broader Impact

Benefit: The binary neural network community may benefit from our research. The proposed Rotated Binary Neural Network (RBNN) provides a novel perspective for lessening the quantization error by reducing the angular bias, which was ignored by previous works. With the code publicly available, our work will also help researchers quantize DNNs so that deep models can be deployed on devices with limited resources such as mobile phones.

Disadvantage: The angular bias between the activation and its binarization remains an open problem. It may not be appropriate to apply our rotation to the activation vector, since that would add computation at inference time.

Consequence: A failure of the network quantization will not bring serious consequences, as our RBNN causes smaller accuracy drops than other SOTAs.

Data Biases: The proposed RBNN is irrelevant to data selection, so it does not have a data bias problem.

References

[1] Thalaiyasingam Ajanthan, Kartik Gupta, Philip HS Torr, Richard Hartley, and Puneet K Dokania. Mirror descent view for neural network quantization. arXiv preprint arXiv:1910.08237, 2019.

[2] Ron Banner, Itay Hubara, Elad Hoffer, and Daniel Soudry. Scalable methods for 8-bit training of neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 5145–5153, 2018.

[3] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

[4] Joseph Bethge, Christian Bartz, Haojin Yang, Ying Chen, and Christoph Meinel. Meliusnet: Can binary neural networks achieve mobilenet-level accuracy? arXiv preprint arXiv:2001.05936, 2020.

[5] Adrian Bulat and Georgios Tzimiropoulos. Xnor-net++: Improved binary neural networks. In Proceedings of the British Machine Vision Conference (BMVC), 2019.

[6] Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos. Deep learning with low precision by half-wave gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5918–5926, 2017.

[7] Hanting Chen, Yunhe Wang, Chunjing Xu, Boxin Shi, Chao Xu, Qi Tian, and Chang Xu. Addernet: Do we really need multiplications in deep learning? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1468–1477, 2020.
[8] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 3123–3131, 2015.

[9] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.

[10] Sajad Darabi, Mouloud Belbahri, Matthieu Courbariaux, and Vahid Partovi Nia. Bnn+: Improved binary network training. arXiv preprint arXiv:1812.11800, 2018.

[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.

[12] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 1269–1277, 2014.

[13] Ruizhou Ding, Ting-Wu Chin, Zeye Liu, and Diana Marculescu. Regularizing activation distribution for training binarized deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11408–11417, 2019.

[14] Xiaohan Ding, Xiangxin Zhou, Yuchen Guo, Jungong Han, Ji Liu, et al. Global sparse momentum sgd for pruning very deep neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 6379–6391, 2019.

[15] Ruihao Gong, Xianglong Liu, Shenghu Jiang, Tianxiang Li, Peng Hu, Jiazhen Lin, Fengwei Yu, and Junjie Yan. Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4852–4861, 2019.

[16] Jiaxin Gu, Ce Li, Baochang Zhang, Jungong Han, Xianbin Cao, Jianzhuang Liu, and David Doermann. Projection convolutional neural networks for 1-bit cnns via discrete back propagation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 33, pages 8344–8351, 2019.

[17] Jiaxin Gu, Junhe Zhao, Xiaolong Jiang, Baochang Zhang, Jianzhuang Liu, Guodong Guo, and Rongrong Ji. Bayesian optimized 1-bit cnns. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4909–4917, 2019.

[18] Kohei Hayashi, Taiki Yamaguchi, Yohei Sugawara, and Shin-ichi Maeda. Exploring unexplored tensor network decompositions for convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 5553–5563, 2019.

[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[20] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2961–2969, 2017.

[21] Koen Helwegen, James Widdicombe, Lukas Geiger, Zechun Liu, Kwang-Ting Cheng, and Roeland Nusselder. Latent weights do not exist: Rethinking binarized neural network optimization. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 7531–7542, 2019.
[22] Lu Hou, Quanming Yao, and James T Kwok. Loss-aware binarization of deep networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.

[23] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 4107–4115, 2016.

[24] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.

[25] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 1097–1105, 2012.

[27] Alan J Laub. Matrix analysis for scientists and engineers, volume 91. SIAM, 2005.

[28] Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. Fixed point quantization of deep convolutional networks. In Proceedings of the International Conference on Machine Learning (ICML), pages 2849–2858, 2016.

[29] Mingbao Lin, Rongrong Ji, Yan Wang, Yichen Zhang, Baochang Zhang, Yonghong Tian, and Ling Shao. Hrank: Filter pruning using high-rank feature map. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1529–1538, 2020.

[30] Mingbao Lin, Rongrong Ji, Yuxin Zhang, Baochang Zhang, Yongjian Wu, and Yonghong Tian. Channel pruning via automatic structure search. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 673–679, 2020.

[31] Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 345–353, 2017.

[32] Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng. Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm. In Proceedings of the European Conference on Computer Vision (ECCV), pages 722–737, 2018.

[33] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015.

[34] Shih-Yau Lu and Russell A Chipman. Interpretation of mueller matrices based on polar decomposition. Journal of the Optical Society of America (JOSA A), 13(5):1106–1113, 1996.

[35] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1520–1528, 2015.

[36] Haotong Qin, Ruihao Gong, Xianglong Liu, Xiao Bai, Jingkuan Song, and Nicu Sebe. Binary neural networks: A survey. Pattern Recognition (PR), page 107281, 2020.

[37] Haotong Qin, Ruihao Gong, Xianglong Liu, Mingzhu Shen, Ziran Wei, Fengwei Yu, and Jingkuan Song. Forward and backward information retention for accurate binary neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2250–2259, 2020.
[38] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 525–542, 2016.

[39] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.

[40] Taylor Simons and Dah-Jye Lee. A review of binarized neural networks. Electronics, 8(6):661, 2019.

[41] Ziwei Wang, Jiwen Lu, Chenxin Tao, Jie Zhou, and Qi Tian. Learning channel-wise interactions for binary convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 568–577, 2019.

[42] Bichen Wu, Alvin Wan, Xiangyu Yue, Peter Jin, Sicheng Zhao, Noah Golmant, Amir Gholaminejad, Joseph Gonzalez, and Kurt Keutzer. Shift: A zero flop, zero parameter alternative to spatial convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9127–9135, 2018.

[43] Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao. On compressing deep models by low rank and sparse decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7370–7379, 2017.

[44] Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. Lq-nets: Learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 365–382, 2018.

[45] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.

[46] Shilin Zhu, Xin Dong, and Hao Su. Binary ensemble neural network: More bits per network or more networks per bit? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4923–4932, 2019.