# Resilient Binary Neural Network

Sheng Xu1\*, Yanjing Li1\*, Teli Ma3\*, Mingbao Lin4, Hao Dong5, Baochang Zhang1,2, Peng Gao3, Jinhu Lu1,2

1 Beihang University, Beijing, P.R. China; 2 Zhongguancun Laboratory, Beijing, P.R. China; 3 Shanghai AI Laboratory, Shanghai, P.R. China; 4 Tencent, P.R. China; 5 Peking University, Beijing, P.R. China

shengxu@buaa.edu.cn, yanjingli@buaa.edu.cn, mateli@pjlab.org.cn, linmb001@outlook.com, hao.dong@pku.edu.cn, bczhang@buaa.edu.cn, gaopeng@pjlab.org.cn, lvjinhu@buaa.edu.cn

\*These authors contributed equally. Corresponding author. Copyright 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Binary neural networks (BNNs) have received ever-increasing popularity for their great capability of reducing the storage burden as well as speeding up inference. However, they suffer a severe performance drop compared with real-valued networks, due to intrinsic frequent weight oscillation during training. In this paper, we introduce a Resilient Binary Neural Network (ReBNN) to mitigate the frequent oscillation for better BNN training. We identify that the weight oscillation mainly stems from the non-parametric scaling factor. To address this issue, we propose to parameterize the scaling factor and introduce a weighted reconstruction loss to build an adaptive training objective. For the first time, we show that the weight oscillation is controlled by the balanced parameter attached to the reconstruction loss, which provides a theoretical foundation for parameterizing it in back propagation. Based on this, we learn our ReBNN by calculating the balanced parameter from the maximum gradient magnitude, which effectively mitigates the weight oscillation and yields a resilient training process. Extensive experiments are conducted on various network models, such as ResNet and Faster-RCNN for computer vision, as well as BERT for natural language processing. The results demonstrate the overwhelming performance of our ReBNN over prior arts. For example, our ReBNN achieves 66.9% Top-1 accuracy with a ResNet-18 backbone on the ImageNet dataset, surpassing existing state-of-the-arts by a significant margin. Our code is open-sourced at https://github.com/SteveTsui/ReBNN.

## Introduction

Deep neural networks (DNNs) have dominated the recent advances of artificial intelligence, from computer vision (CV) (Krizhevsky, Sutskever, and Hinton 2012; Russakovsky et al. 2015) to natural language processing (NLP) (Wang et al. 2018; Qin et al. 2019) and many areas beyond. In particular, large pre-trained models, e.g., ResNet (He et al. 2016) and BERT (Devlin et al. 2018), have continuously broken records of state-of-the-art performance. However, this achievement also comes with tremendous demands for memory and computation resources. These demands pose a huge challenge to the computing ability of many devices, especially resource-limited platforms such as mobile phones and electronic gadgets. In light of this, substantial research efforts are being invested in saving memory usage and computational power for efficient online inference (He et al. 2018; Rastegari et al. 2016; Qin et al. 2022; Li et al. 2022). Among these studies, network quantization is particularly suitable for model deployment on resource-limited platforms for its great reduction in parameter bit-width and the practical speedups supported by general hardware devices.

Binarization, an extreme form of quantization, represents the weights and activations of CNNs with a single bit, which decreases the storage requirement by 32× and the computation cost by up to 58× (Rastegari et al. 2016). Consequently, binarized neural networks (BNNs) are widely deployed on various tasks such as image classification (Rastegari et al. 2016; Liu et al. 2018; Lin et al. 2022) and object detection (Wang et al. 2020; Xu et al. 2021a, 2022b), and have the potential to be deployed directly on next-generation AI chips. However, the performance of BNNs remains far below that of their real-valued counterparts, due primarily to degraded representation capability and trainability.

Conventional BNNs (Rastegari et al. 2016; Liu et al. 2020) are often sub-optimally trained, due to their intrinsic frequent weight oscillation during training. We first identify that the weight oscillation mainly stems from the non-parametric scaling factor. Fig. 1(a) shows the epoch-wise oscillation¹ of ReActNet, where weight oscillation persists even when the network has converged. As shown in Fig. 1(b), the conventional ReActNet (Liu et al. 2020) possesses a channel-wise tri-modal distribution in its 1-bit convolution layers, whose peaks center around −1, 0, and +1, respectively. Such a distribution magnifies the scaling factor α, so the quantized weights ±α are much larger in magnitude than the small latent weights around 0, which might cause the weight oscillation. As illustrated in Fig. 1(c), in BNNs the real-valued latent tensor is binarized by the sign function and scaled by the scaling factor (the orange dot) in the forward propagation. In the backward propagation, the gradient is computed at the quantized value ±α (indicated by the yellow dotted line). However, the gradient of small latent weights is misleading when the scaling factor is magnified by the weights around ±1, as in ReActNet (Fig. 1(a)). The update is then applied to the latent value (the black dot), which leads to the oscillation of the latent weight. With extremely limited representation states, such latent weights with small magnitudes frequently oscillate during the non-convex optimization.

¹ A toy example of weight oscillation: from iteration t to t+1, a misleading weight update causes an oscillation from −1 to +1, and from iteration t+1 to t+2 an oscillation from +1 back to −1.

Figure 1: (a) The epoch-wise weight oscillation of ReActNet. (b) We randomly select 2 channels of the first 1-bit layer in ReActNet (Liu et al. 2020). The distribution clearly has 3 peaks centering around {−1, 0, +1}, which magnifies the non-parametric scaling factor (red line). (c) Illustration of the weight oscillation caused by such an inappropriate scale calculation, where w and L denote the latent weight and the network loss function (blue line), respectively.

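To make the mechanism of the footnoted toy example concrete, the following minimal numeric sketch (our own illustration with made-up values, not code or data from the paper) shows a near-zero latent weight flipping its sign every iteration once its channel's scaling factor is inflated by the weights near ±1:

```python
# Toy illustration of weight oscillation under a magnified scaling factor.
# One channel holds a small latent weight w[0] plus three weights near +/-1,
# so the channel-wise absolute mean (CAM) alpha is large; the STE gradient
# step then overshoots w[0] and flips its sign back and forth.
import numpy as np

w = np.array([0.02, 0.9, -0.95, 0.88])   # latent weights of one channel
lr = 0.1
target = np.array([0.0, 0.9, -0.95, 0.88])  # made-up "ideal" values, target[0] = 0

for t in range(4):
    alpha = np.abs(w).mean()              # non-parametric scaling factor, ~0.7 here
    w_hat = alpha * np.sign(w)            # binarized, scaled weight used in forward pass
    grad_w_hat = w_hat - target           # illustrative task gradient w.r.t. w_hat
    grad_w = grad_w_hat * (np.abs(w) <= 1)  # straight-through estimator
    w = w - lr * grad_w
    print(f"iter {t}: sign(w[0]) = {int(np.sign(w[0]))}, w[0] = {w[0]:+.3f}, alpha = {alpha:.2f}")
```

Running this prints a sign for the small weight that alternates between −1 and +1 while the three large weights barely move, which is exactly the behaviour sketched in Fig. 1(c).
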
To address the aforementioned problem, we introduce a Resilient Binary Neural Network (ReBNN). The intuition of our work is to re-learn the channel-wise scaling factor as well as the latent weights in a unified framework. Accordingly, we propose to parameterize the scaling factor and introduce a weighted reconstruction loss to build an adaptive training objective. We further show that the oscillation is in fact controlled by the balanced parameter attached to the reconstruction loss, which provides a theoretical foundation for parameterizing it in back propagation. The oscillation only happens when the gradient has a magnitude large enough to change the sign of the latent weight. Consequently, we calculate the balanced parameter based on the maximum magnitude of the weight gradient in each iteration, leading to resilient gradients and effectively mitigating the weight oscillation. Our main contributions are summarized as follows:

- We propose a new resilient gradient for learning binary neural networks (ReBNN), which mitigates the frequent oscillation to better train BNNs. We parameterize the scaling factor and introduce a weighted reconstruction loss to build an adaptive learning objective.
- We prove that the occurrence of oscillation is controlled by the balanced parameter attached to the reconstruction loss. Therefore, we utilize resilient weight gradients to learn our ReBNN and effectively mitigate the weight oscillation.
- Extensive experiments demonstrate the superiority of our ReBNN over prior state-of-the-arts. For example, our ReBNN achieves 66.9% Top-1 accuracy on the ImageNet dataset, surpassing the prior ReActNet by 1.0% with no extra parameters. In particular, our ReBNN also achieves state-of-the-art results on fully binarized BERT models, demonstrating its generality.

## Related Work

BinaryNet, based on BinaryConnect, was proposed to train CNNs with binary weights; the binary activations are used at run-time while the parameters are computed during training. Following this line of research, local binary convolution (LBC) layers are introduced in (Juefei-Xu, Naresh Boddeti, and Savvides 2017) to binarize the non-linear activations. XNOR-Net (Rastegari et al. 2016) improves convolution efficiency by binarizing the weights and inputs of the convolution kernels. More recently, Bi-Real Net (Liu et al. 2018) explores a new variant of residual structure to preserve the information of the real-valued activations before the sign function, together with a tight approximation to the derivative of the non-differentiable sign function. Real-to-binary (Martinez et al. 2020) re-scales the feature maps channel-wise according to the input before the binarized operations and adds an SE-Net-like (Hu, Shen, and Sun 2018) gating module. ReActNet (Liu et al. 2020) replaces the conventional PReLU (He et al. 2015) and the sign function of BNNs with RPReLU and RSign, which have learnable thresholds, thus improving the performance of BNNs. RBONN (Xu et al. 2022a) introduces a recurrent bilinear optimization to address the asynchronous convergence problem of BNNs, which further improves their performance. However, most of the aforementioned methods suffer from weight oscillation that mainly stems from the non-parametric scaling factor. Unlike prior works, our ReBNN parameterizes the scaling factor and introduces a weighted reconstruction loss to build an adaptive training objective. We further prove that the oscillation is controlled by the balanced parameter. Based on this analysis, we introduce a resilient weight gradient to effectively address the oscillation problem.

## Methodology

### Preliminaries

Given an $N$-layer CNN model, we denote its weight set as $\mathcal{W} = \{\mathbf{w}^n\}_{n=1}^{N}$ and its input feature map set as $\{\mathbf{a}^n_{in}\}_{n=1}^{N}$.
Here $\mathbf{w}^n \in \mathbb{R}^{C^n_{out}\times C^n_{in}\times K^n\times K^n}$ and $\mathbf{a}^n_{in} \in \mathbb{R}^{C^n_{in}\times W^n_{in}\times H^n_{in}}$ are the convolutional weight and the input feature map of the $n$-th layer, where $C^n_{in}$, $C^n_{out}$ and $K^n$ respectively stand for the input channel number, the output channel number and the kernel size. Also, $W^n_{in}$ and $H^n_{in}$ are the width and height of the feature maps. The convolutional output $\mathbf{a}^n_{out}$ can then be formulated as

$$\mathbf{a}^n_{out} = \mathbf{w}^n \otimes \mathbf{a}^n_{in}, \tag{1}$$

where $\otimes$ represents the convolution operation. Herein, we omit the non-linear function for simplicity. A binary neural network represents $\mathbf{w}^n$ and $\mathbf{a}^n_{in}$ in a 1-bit format as $\mathbf{b}^{\mathbf{w}^n} \in \{-1,+1\}^{C^n_{out}\times C^n_{in}\times K^n\times K^n}$ and $\mathbf{b}^{\mathbf{a}^n_{in}} \in \{-1,+1\}^{C^n_{in}\times W^n_{in}\times H^n_{in}}$, such that the floating-point convolutional output can be approximated by the efficient XNOR and Bit-count instructions as

$$\mathbf{a}^n_{out} \approx \boldsymbol{\alpha}^n \circ \big(\mathbf{b}^{\mathbf{w}^n} \circledast \mathbf{b}^{\mathbf{a}^n_{in}}\big), \tag{2}$$

where $\circ$ represents channel-wise multiplication, $\circledast$ denotes the XNOR and Bit-count instructions, and $\boldsymbol{\alpha}^n = \{\alpha^n_1, \alpha^n_2, \ldots, \alpha^n_{C^n_{out}}\} \in \mathbb{R}^{C^n_{out}}_{+}$ is the channel-wise scaling factor vector (Rastegari et al. 2016) that mitigates the output gap between Eq. (1) and its approximation in Eq. (2). We denote $\mathcal{A} = \{\boldsymbol{\alpha}^n\}_{n=1}^{N}$. Most existing implementations simply follow earlier studies (Rastegari et al. 2016; Liu et al. 2018) and optimize $\mathcal{A}$ and the latent weights $\mathcal{W}$ with a non-parametric bi-level optimization:

$$\mathcal{W}^{*} = \arg\min_{\mathcal{W}} \; \mathcal{L}(\mathcal{W}; \mathcal{A}^{*}), \tag{3}$$

$$\text{s.t.}\quad \boldsymbol{\alpha}^{n*} = \arg\min_{\boldsymbol{\alpha}^n} \; \big\|\mathbf{w}^n - \boldsymbol{\alpha}^n \circ \mathbf{b}^{\mathbf{w}^n}\big\|_2^2, \tag{4}$$

where $\mathcal{L}(\cdot)$ represents the training loss. Consequently, a closed-form solution of $\boldsymbol{\alpha}^n$ can be derived via the channel-wise absolute mean (CAM) as $\alpha^n_i = \frac{\|\mathbf{w}^n_{i,:,:,:}\|_1}{M^n}$ with $M^n = C^n_{in}\times K^n\times K^n$. For ease of representation, we use $\mathbf{w}^n_i$ as an alternative to $\mathbf{w}^n_{i,:,:,:}$ in what follows. The latent weight $\mathbf{w}^n$ is updated via standard gradient back-propagation, and its gradient is calculated as

$$\frac{\partial\mathcal{L}}{\partial\mathbf{w}^n_i} = \frac{\partial\mathcal{L}}{\partial\hat{\mathbf{w}}^n_i}\,\frac{\partial\hat{\mathbf{w}}^n_i}{\partial\mathbf{w}^n_i} = \alpha^n_i\,\frac{\partial\mathcal{L}}{\partial\hat{\mathbf{w}}^n_i}\odot\mathbf{1}_{|\mathbf{w}^n_i|\le1}, \tag{5}$$

where $\odot$ denotes the Hadamard product and $\hat{\mathbf{w}}^n = \boldsymbol{\alpha}^n \circ \mathbf{b}^{\mathbf{w}^n}$.

Discussion. Eq. (5) shows that the weight gradient is mainly determined by the non-parametric $\alpha^n_i$ and the gradient $\frac{\partial\mathcal{L}}{\partial\hat{\mathbf{w}}^n_i}$. The latter is automatically computed in back propagation and becomes smaller as the network converges; $\alpha^n_i$, however, is often magnified by the tri-modal distribution (Liu et al. 2020). Therefore, the weight oscillation mainly stems from $\alpha^n_i$. For a single weight $w^n_{i,j}$ ($1 \le j \le M^n$) centering around zero, the gradient $\frac{\partial\mathcal{L}}{\partial w^n_{i,j}}$ is misleading, due to the significant gap between $w^n_{i,j}$ and $\alpha^n_i b^{w^n_{i,j}}$. Consequently, the bi-level optimization leads to frequent weight oscillation. To address this issue, we reformulate the traditional bi-level optimization using a Lagrange multiplier and show that a learnable scaling factor is a natural training stabilizer.

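As a concrete reference for Eqs. (1)-(5), the following minimal PyTorch sketch (our own illustration, not the paper's released code; the class names are hypothetical) implements a baseline 1-bit convolution with CAM scaling factors and a straight-through estimator:

```python
# Baseline 1-bit convolution: alpha * sign(w) in the forward pass,
# alpha * dL/d(w_hat) * 1_{|w|<=1} in the backward pass, as in Eq. (5).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeWithSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # pass the gradient through only where |w| <= 1
        return grad_out * (w.abs() <= 1).float()

class BaselineBinaryConv2d(nn.Conv2d):
    def forward(self, x):
        # CAM scaling factor: one value per output channel, treated as non-parametric
        alpha = self.weight.abs().mean(dim=(1, 2, 3), keepdim=True).detach()
        w_hat = alpha * BinarizeWithSTE.apply(self.weight)   # alpha * sign(w)
        # activations are assumed to be binarized by a preceding layer
        return F.conv2d(x, w_hat, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# usage: layer = BaselineBinaryConv2d(64, 64, 3, padding=1); y = layer(torch.randn(1, 64, 8, 8))
```

Because `alpha` is detached, the backward pass reproduces exactly the gradient form of Eq. (5), which is the non-parametric behaviour the next subsection replaces.
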
### Resilient Binary Neural Network

We first give the learning objective of this paper as

$$\arg\min_{\mathcal{W},\mathcal{A}} \; \mathcal{L}(\mathcal{W},\mathcal{A}) + \mathcal{L}_R(\mathcal{W},\mathcal{A}), \tag{6}$$

where $\mathcal{L}_R(\mathcal{W},\mathcal{A})$ is a weighted reconstruction loss defined as

$$\mathcal{L}_R(\mathcal{W},\mathcal{A}) = \frac{1}{2}\sum_{n=1}^{N}\sum_{i=1}^{C^n_{out}} \gamma^n_i \,\big\|\mathbf{w}^n_i - \alpha^n_i \mathbf{b}^{\mathbf{w}^n_i}\big\|_2^2, \tag{7}$$

in which $\gamma^n_i$ is a balanced parameter. Based on this objective, the weight gradient in Eq. (5) becomes

$$\delta_{\mathbf{w}^n_i} = \frac{\partial\mathcal{L}}{\partial\mathbf{w}^n_i} + \gamma^n_i\big(\mathbf{w}^n_i - \alpha^n_i\mathbf{b}^{\mathbf{w}^n_i}\big) = \alpha^n_i\Big(\frac{\partial\mathcal{L}}{\partial\hat{\mathbf{w}}^n_i}\odot\mathbf{1}_{|\mathbf{w}^n_i|\le1} - \gamma^n_i\mathbf{b}^{\mathbf{w}^n_i}\Big) + \gamma^n_i\mathbf{w}^n_i. \tag{8}$$

Here $S^n_i(\alpha^n_i, \mathbf{w}^n_i) = \gamma^n_i(\mathbf{w}^n_i - \alpha^n_i\mathbf{b}^{\mathbf{w}^n_i})$ is an additional term added to the back-propagation process. We add this term because a too small $\alpha^n_i$ diminishes the gradient $\delta_{\mathbf{w}^n_i}$ and leads to a constant weight $\mathbf{w}^n_i$. In what follows, we state and prove the proposition that $\delta_{w^n_{i,j}}$ is a resilient gradient for a single weight $w^n_{i,j}$. We sometimes omit the subscripts $i, j$ and the superscript $n$ for ease of representation.

Proposition 1. The additional term $S(\alpha, w) = \gamma(w - \alpha b^{w})$ achieves a resilient training process by suppressing frequent weight oscillation. Its balanced factor $\gamma$ can be considered as the parameter controlling the occurrence of the weight oscillation.

Proof: We prove the proposition by contradiction. For a single weight $w$ centering around zero, the straight-through estimator gives $\mathbf{1}_{|w|\le1} = 1$, so we omit it in the following. Based on Eq. (8), with a learning rate $\eta$, the weight update is

$$w^{t+1} = w^{t} - \eta\,\delta_{w^{t}} = w^{t} - \eta\Big[\alpha^{t}\Big(\frac{\partial\mathcal{L}}{\partial\hat{w}^{t}} - \gamma b^{w^{t}}\Big) + \gamma w^{t}\Big] = (1-\eta\gamma)\,w^{t} - \eta\,\alpha^{t}\Big(\frac{\partial\mathcal{L}}{\partial\hat{w}^{t}} - \gamma b^{w^{t}}\Big), \tag{9}$$

where $t$ denotes the $t$-th training iteration and $\eta$ the learning rate. Different weights have different distances to the quantization levels $\pm 1$; therefore, their gradients should be modified in compliance with their scaling factors and the current learning rate. We first assume the initial state $b^{w^{t}} = -1$; the analysis also applies to the case $b^{w^{t}} = +1$. The oscillation probability from iteration $t$ to $t+1$ is

$$P\big(b^{w^{t}} \neq b^{w^{t+1}}\big)\big|_{b^{w^{t}}=-1} = P\Big(\Big|\frac{\partial\mathcal{L}}{\partial\hat{w}^{t}}\Big| \ge \gamma\Big). \tag{10}$$

Similarly, the oscillation probability from iteration $t+1$ to $t+2$ is

$$P\big(b^{w^{t+1}} \neq b^{w^{t+2}}\big)\big|_{b^{w^{t+1}}=+1} = P\Big(\Big|\frac{\partial\mathcal{L}}{\partial\hat{w}^{t+1}}\Big| \ge \gamma\Big). \tag{11}$$

Thus, the sequential oscillation probability from iteration $t$ to $t+2$ is

$$P\Big(\big(b^{w^{t}}\neq b^{w^{t+1}}\big)\wedge\big(b^{w^{t+1}}\neq b^{w^{t+2}}\big)\Big)\Big|_{b^{w^{t}}=-1} = P\Big(\Big(\Big|\frac{\partial\mathcal{L}}{\partial\hat{w}^{t}}\Big|\ge\gamma\Big)\wedge\Big(\Big|\frac{\partial\mathcal{L}}{\partial\hat{w}^{t+1}}\Big|\ge\gamma\Big)\Big), \tag{12}$$

which means the weight oscillation happens only if the magnitudes of $\frac{\partial\mathcal{L}}{\partial\hat{w}^{t}}$ and $\frac{\partial\mathcal{L}}{\partial\hat{w}^{t+1}}$ are both larger than $\gamma$. As a result, the attached factor $\gamma$ can be considered as a parameter that controls the occurrence of the weight oscillation. However, if the conditions in Eq. (12) are met then, combined with Eq. (9), the two sign flips are driven by gradients of opposite signs, so

$$\Big|\frac{\partial\mathcal{L}}{\partial\hat{w}^{t+1}} - \frac{\partial\mathcal{L}}{\partial\hat{w}^{t}}\Big| \ge \gamma + \gamma = 2\gamma. \tag{13}$$

Note that $\eta$ and $\gamma$ are two positive variables; thus the second-order gradient $\frac{\partial^2\mathcal{L}}{\partial(\hat{w}^{t})^2} < 0$ always holds. Consequently, $\mathcal{L}(\hat{w}^{t+1})$ can only be a local maximum instead of a minimum, which contradicts convergence of the training process. Such a contradiction indicates that the training algorithm converges only when no oscillation occurs, owing to the additional term $S(\alpha, w)$. This completes the proof.

Our proposition and proof reveal that the balanced parameter $\gamma$ is in fact a threshold. A very small threshold fails to mitigate the frequent oscillation effectively, while a too large one suppresses necessary sign inversions and hinders the gradient descent process. To solve this, we devise the learning rule of $\gamma$ as

$$\gamma^{n,t+1}_i = \frac{\big\|\mathbf{b}^{\mathbf{w}^{n,t}_i} \odot \mathbf{b}^{\mathbf{w}^{n,t+1}_i} - \mathbf{1}\big\|_0}{M^n}\;\max_{1\le j\le M^n}\Big(\Big|\frac{\partial\mathcal{L}}{\partial\hat{w}^{n,t}_{i,j}}\Big|\Big), \tag{14}$$

where the first term, $\frac{1}{M^n}\|\mathbf{b}^{\mathbf{w}^{n,t}_i} \odot \mathbf{b}^{\mathbf{w}^{n,t+1}_i} - \mathbf{1}\|_0$, denotes the proportion of weights whose sign changed, and the second term, $\max_{1\le j\le M^n}(|\frac{\partial\mathcal{L}}{\partial\hat{w}^{n,t}_{i,j}}|)$, is derived from Eq. (12) and denotes the gradient with the greatest magnitude in the $t$-th iteration. In this way, we suppress the frequent weight oscillation with a resilient gradient. We further optimize the scaling factor as

$$\alpha^{n,t+1}_i = \alpha^{n,t}_i - \eta\Big(\frac{\partial\mathcal{L}}{\partial\alpha^{n}_i} + \frac{\partial\mathcal{L}_R}{\partial\alpha^{n}_i}\Big). \tag{15}$$

The gradient derived from the softmax loss, $\frac{\partial\mathcal{L}}{\partial\alpha^n_i}$, is easily calculated via back propagation. Based on Eq. (7), it is easy to derive

$$\frac{\partial\mathcal{L}_R}{\partial\alpha^n_i} = -\gamma^n_i\big(\mathbf{w}^n_i - \alpha^n_i\mathbf{b}^{\mathbf{w}^n_i}\big)^{\top}\mathbf{b}^{\mathbf{w}^n_i}. \tag{16}$$

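Putting Eqs. (8), (14) and (16) together, one resilient update for a layer might look like the following minimal PyTorch sketch (our own illustration under the stated assumptions, not the released ReBNN code; the helper name `rebnn_step` and the tensor shapes are hypothetical):

```python
# One resilient ReBNN-style update: learnable per-channel alpha, the extra
# term gamma * (w - alpha * b_w) in the weight gradient, and gamma refreshed
# from the sign-flip ratio times the largest gradient magnitude.
import torch

def rebnn_step(w, alpha, gamma, grad_w_hat, lr=1e-3):
    """w, grad_w_hat: (C_out, M); alpha, gamma: (C_out, 1)."""
    b_w = torch.sign(w)
    mask = (w.abs() <= 1).float()                       # straight-through estimator
    # Eq. (8): resilient weight gradient
    grad_w = alpha * grad_w_hat * mask + gamma * (w - alpha * b_w)
    # Eq. (16): reconstruction-loss gradient w.r.t. alpha (task-loss part omitted here)
    grad_alpha = -(gamma * (w - alpha * b_w) * b_w).sum(dim=1, keepdim=True)

    w_new = w - lr * grad_w
    alpha_new = alpha - lr * grad_alpha

    # Eq. (14): new gamma = (fraction of sign flips) * max |dL/d(w_hat)| per channel
    flip_ratio = (torch.sign(w_new) != b_w).float().mean(dim=1, keepdim=True)
    gamma_new = flip_ratio * grad_w_hat.abs().amax(dim=1, keepdim=True)
    return w_new, alpha_new, gamma_new

# toy usage with made-up shapes: 4 output channels, 9 weights each
w = torch.randn(4, 9) * 0.1
alpha = w.abs().mean(dim=1, keepdim=True)
gamma = torch.full((4, 1), 1e-4)
grad = torch.randn(4, 9) * 1e-3
w, alpha, gamma = rebnn_step(w, alpha, gamma, grad)
```

The design point is that when no sign flips occur in a channel, γ drops toward zero and the update reverts to the ordinary binarized gradient, while frequent flips raise γ and damp further oscillation.
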
## Experiments

Our ReBNNs are evaluated first on image classification and object detection tasks for visual recognition. Then, we evaluate ReBNN on the GLUE (Wang et al. 2018) benchmark with diverse NLP tasks. In this section, we first introduce the implementation details of ReBNN. We then validate the effectiveness of the balanced parameter in an ablation study. Finally, we compare our method with state-of-the-art BNNs on various tasks to demonstrate the superiority of ReBNN.

### Datasets and Implementation Details

Datasets: Owing to its large scale and diversity, the ImageNet object classification dataset (Russakovsky et al. 2015) is the most demanding benchmark we use; it has 1000 classes, 1.2 million training images, and 50k validation images. The COCO dataset includes images from 80 categories. All our COCO experiments are conducted on the COCO 2014 (Lin et al. 2014) object detection track: we train on the combination of the 80k images from COCO train2014 and 35k images sampled from COCO val2014 (i.e., COCO trainval35k), and test on the remaining 5k images from COCO minival. We report the average precision (AP) for IoUs in [0.5 : 0.05 : 0.95], denoted as mAP@[.5,.95], using COCO's standard evaluation metric. For further analysis, we also report AP50, AP75, APs, APm, and APl. The GLUE benchmark contains multiple natural language understanding tasks. We follow (Wang et al. 2018) to evaluate performance: Matthews correlation for CoLA, Spearman correlation for STS-B, and accuracy for the remaining tasks: RTE, MRPC, SST-2, QQP, MNLI-m (matched), and MNLI-mm (mismatched). For machine reading comprehension on SQuAD, we report the EM (exact match) and F1 scores.

Implementation details: PyTorch (Paszke et al. 2017) is used to implement ReBNN. We run the experiments on 4 NVIDIA Tesla A100 GPUs with 80 GB memory. Following (Liu et al. 2018), we keep the weights of the first layer, the shortcuts, and the last layer of each network real-valued. For the image classification task, ResNets (He et al. 2016) are employed as backbones to build our ReBNNs. We offer two training setups for fair comparison. First, we use one-stage training on ResNets with SGD as the optimizer, a momentum of 0.9, and a weight decay of 1e-4 following (Xu et al. 2021c); η is set to 0.1, the learning rate follows an annealing cosine schedule, and the number of epochs is set to 200. Second, we employ two-stage training following (Liu et al. 2020), where each stage counts 256 epochs. In this setup Adam is selected as the optimizer, the network is supervised by a real-valued ResNet-34 teacher, and the weight decay is set to 0 following (Liu et al. 2020). The learning rate η is set to 1e-3 and annealed to 0 by linear descent.

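For reference, the two classification setups above map roughly onto the following standard PyTorch optimizer and scheduler configuration (a sketch under our own assumptions; the tiny model is a placeholder and the teacher distillation of the two-stage setup is omitted):

```python
# Optimizer/scheduler configuration mirroring the one-stage and two-stage
# setups described in the text; the model below is a stand-in, not ReBNN.
import torch
from torch import nn, optim

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))

# One-stage: SGD, momentum 0.9, weight decay 1e-4, eta = 0.1,
# cosine-annealed over 200 epochs.
sgd = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
cosine = optim.lr_scheduler.CosineAnnealingLR(sgd, T_max=200)

# Two-stage: Adam, weight decay 0, eta = 1e-3 decayed linearly to 0
# over 256 epochs per stage.
adam = optim.Adam(model.parameters(), lr=1e-3, weight_decay=0)
linear = optim.lr_scheduler.LambdaLR(adam, lambda epoch: max(0.0, 1 - epoch / 256))
```
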
For object detection, we use Faster-RCNN (Ren et al. 2016) and SSD (Liu et al. 2016), which are based on the ResNet-18 (He et al. 2016) and VGG-16 (Simonyan and Zisserman 2015) backbones, respectively, and fine-tune the detectors on the detection dataset. The batch size is set to 16 for SSD and 8 for Faster-RCNN, using the SGD optimizer with η equal to 0.008. We use the same structure and training settings as BiDet (Wang et al. 2020) on the SSD framework. The input resolution is 1000×600 for Faster-RCNN and 300×300 for SSD.

For the natural language processing task, we conduct experiments on the BERT$_{\text{BASE}}$ (Devlin et al. 2018) architecture (with 12 hidden layers) following BiBERT (Qin et al. 2022); the detailed training setups are the same as BiBERT. We extend ReBNN to multi-layer perceptrons (MLPs) and use the Bi-Attention scheme following BiBERT (Qin et al. 2022).

### Ablation Study

Since no extra hyper-parameter is introduced, we first evaluate different ways of calculating γ. We then show how our ReBNN achieves a resilient training process. In the ablation study, we use a ResNet-18 backbone initialized from the first-stage training with W32A1 following (Liu et al. 2020).

Calculation of γ: We compare different calculations of γ in this part. As shown in Tab. 1, performance first increases and then decreases as the constant value of γ grows. Considering that the gradient magnitude varies layer-wise and channel-wise, a suitable γ can hardly be set manually as a global value. We further compare gradient-based calculations. To avoid extreme values, we set the upper bound to 2e-4 and the lower bound to 1e-5. As shown in the bottom rows of Tab. 1, we first use $\max_{1\le j\le M^n}(|\frac{\partial\mathcal{L}}{\partial\hat{w}^{n,t}_{i,j}}|)$, the maximum intra-channel gradient of the last iteration, which performs similarly to the constant 1e-4. This indicates that using only the maximum intra-channel gradient may also suppress necessary sign flips and thus hinder training. Inspired by this, we use Eq. (14) to calculate γ and improve the performance by 0.6%, showing that accounting for the weight oscillation proportion allows the necessary sign flips and leads to more effective training. We also show the training loss curves in Fig. 3(b). As plotted, the curves of L largely reflect the degree of training sufficiency, and we conclude that ReBNN with γ calculated by Eq. (14) achieves the lowest training loss as well as an efficient training process. Note that the loss may not be minimal at every training iteration, but our method is by nature a reasonable variant of gradient descent, which can be used to solve the optimization problem in general. We empirically demonstrate ReBNN's capability of mitigating the weight oscillation, leading to better convergence.

| Value of γ | Top-1 | Top-5 |
|---|---|---|
| 0 | 65.8 | 86.3 |
| 1e-5 | 66.2 | 86.7 |
| 1e-4 | 66.4 | 86.7 |
| 1e-3 | 66.3 | 86.8 |
| 1e-2 | 65.9 | 86.5 |
| max intra-channel gradient only (second term of Eq. (14)) | 66.3 | 86.2 |
| Eq. (14) | 66.9 | 87.1 |

Table 1: Comparison of different calculation methods for γ, including constants ranging from 0 to 1e-2 and gradient-based calculations.

Figure 3: (a) The epoch-wise weight oscillation ratio of ReActNet (solid), ReCU (dotted), and ReBNN (dashed). (b) Comparison of the loss curves of ReActNet and our ReBNN with different calculations of γ.

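The bounding of γ mentioned above amounts to a simple per-channel clamp; the helper below is a small sketch of ours (the function name and tensor shapes are assumptions), using the bounds quoted in the text:

```python
# Clamp the gamma of Eq. (14) to the bounds used in the ablation study.
import torch

def bounded_gamma(flip_ratio, max_grad, lower=1e-5, upper=2e-4):
    """flip_ratio, max_grad: per-channel tensors as in Eq. (14)."""
    gamma = flip_ratio * max_grad
    return gamma.clamp(min=lower, max=upper)

print(bounded_gamma(torch.tensor([0.0, 0.3, 0.9]), torch.tensor([1e-3, 1e-3, 1e-3])))
# tensor([1.0000e-05, 2.0000e-04, 2.0000e-04])
```
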
Resilient training process: We first show the evolution of the latent weight distribution. We plot the distribution of the first channel of the first binary convolution layer every 32 epochs in Fig. 2. As seen, our ReBNN efficiently redistributes the BNN towards resilience. The conventional ReActNet (Liu et al. 2020) possesses a tri-modal distribution, which is unstable due to scaling factors with large magnitudes. In contrast, our ReBNN is constrained by the balanced parameter γ during training, leading to a resilient bi-modal distribution with fewer weights centering around zero. We also plot the ratios of sequential weight oscillation of ReBNN and ReActNet for the 1st, 8th, and 16th binary convolution layers of ResNet-18. As shown in Fig. 3(a), the dashed lines have much lower magnitudes than the solid (ReActNet) and dotted (ReCU (Xu et al. 2021c)) lines of the same color, validating the effectiveness of our ReBNN in suppressing consecutive weight oscillation. Besides, the sequential weight oscillation ratios of ReBNN gradually decrease to 0 as training converges.

Figure 2: The evolution of the latent weight distribution of (a) ReActNet and (b) ReBNN. We select the first channel of the first binary convolution layer to show the evolution. The model is initialized from the first-stage training with W32A1 following (Liu et al. 2020). We plot the distribution every 32 epochs.

### Image Classification

We first show the experimental results on ImageNet with ResNet-18 and ResNet-34 (He et al. 2016) backbones in Tab. 2. We compare ReBNN with BNN (Courbariaux, Bengio, and David 2015), XNOR-Net (Rastegari et al. 2016), Bi-Real Net (Liu et al. 2018), RBNN (Lin et al. 2020), and ReCU (Xu et al. 2021c) for the one-stage training strategy. For the two-stage training strategy, we compare with ReActNet (Liu et al. 2020), FDA-BNN (Xu et al. 2021b), and ReCU (Xu et al. 2021c). With one-stage training following (Xu et al. 2021c), ReBNN outperforms all of the compared binary models in both Top-1 and Top-5 accuracy, as shown in Tab. 2: ReBNN-based ResNet-18 achieves 61.6% Top-1 and 83.4% Top-5 accuracy, a 0.6% Top-1 increase over the state-of-the-art ReCU, and ReBNN further outperforms all compared methods with the ResNet-34 backbone, achieving 65.8% Top-1 accuracy. We further evaluate ReBNN with two-stage training following (Liu et al. 2020), where it surpasses ReActNet by 1.0%/0.6% Top-1 accuracy with the ResNet-18/34 backbones, respectively. In this paper, we report memory usage and OPs following (Liu et al. 2018) for this and the other tasks for further reference. As shown in Tab. 2, ReBNN theoretically accelerates ResNet-18/34 by 11.17× and 19.04×, which is significant for real-time applications.

| Network | Method | #Bits | Size (MB) | OPs (×10⁸) | Top-1 | Top-5 |
|---|---|---|---|---|---|---|
| ResNet-18 | Real-valued | 32-32 | 46.76 | 18.21 | 69.6 | 89.2 |
| | BNN | 1-1 | 4.15 | 1.63 | 42.2 | 67.1 |
| | XNOR-Net | | | | 51.2 | 73.2 |
| | Bi-Real Net | | | | 56.4 | 79.5 |
| | RBNN | | | | 59.6 | 81.6 |
| | ReCU | | | | 61.0 | 82.6 |
| | ReBNN1 | | | | 61.6 | 83.4 |
| | ReActNet | 1-1 | 4.15 | 1.63 | 65.9 | 86.2 |
| | FDA-BNN | | | | 65.8 | 86.4 |
| | ReCU | | | | 66.4 | 86.5 |
| | ReBNN2 | | | | 66.9 | 87.1 |
| ResNet-34 | Real-valued | 32-32 | 87.19 | 36.74 | 73.3 | 91.3 |
| | Bi-Real Net | 1-1 | 5.41 | 1.93 | 62.2 | 83.9 |
| | RBNN | | | | 63.1 | 84.4 |
| | ReCU | | | | 65.1 | 85.8 |
| | ReBNN1 | | | | 65.8 | 86.2 |
| | ReActNet | 1-1 | 5.41 | 1.93 | 69.3 | 88.6 |
| | ReBNN2 | | | | 69.9 | 88.9 |

Table 2: A performance comparison with SOTAs on ImageNet with different training strategies. #Bits denotes the bit width of weights and activations. We report the Top-1 (%) and Top-5 (%) accuracy. ReBNN1 and ReBNN2 denote our ReBNN learned with one-stage and two-stage training, respectively. Blank cells share the value of the row above.

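The Size and OPs columns in Tab. 2 follow the accounting that is standard in the BNN literature (1-bit weights cost 1/32 of float storage, and binary operations are counted as 1/64 of a floating-point operation, as popularized by Bi-Real Net). The sketch below is our own illustration with invented parameter and FLOP counts, not the paper's measurement script:

```python
# Rough BNN cost accounting: storage in MB and effective OPs.
def bnn_cost(fp_params, bin_params, fp_flops, bin_flops):
    size_mb = (fp_params * 32 + bin_params * 1) / 8 / 1024 ** 2  # bits -> bytes -> MB
    ops = fp_flops + bin_flops / 64                               # effective OPs
    return size_mb, ops

# hypothetical model: 1.5M real-valued and 10M binary parameters
print(bnn_cost(fp_params=1.5e6, bin_params=10e6, fp_flops=1.4e8, bin_flops=1.6e9))
```

Under this accounting, the theoretical acceleration quoted above is simply the ratio of real-valued FLOPs to effective OPs.
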
### Object Detection

On COCO, the proposed ReBNN is compared against state-of-the-art 1-bit neural networks, including XNOR-Net (Rastegari et al. 2016), Bi-Real Net (Liu et al. 2018), and BiDet (Wang et al. 2020); we also report the 4-bit DoReFa-Net (Zhou et al. 2016) for reference. As shown in Tab. 3, compared with the state-of-the-art XNOR-Net, Bi-Real Net, and BiDet, our method improves mAP@[.5,.95] by 9.2%, 5.2%, and 3.9% using the Faster-RCNN framework with the ResNet-18 backbone, and on the APs at other IoU thresholds our ReBNN also clearly beats the others. Compared to DoReFa-Net, a quantized neural network with 4-bit weights and activations, our ReBNN obtains only 3.3% lower mAP. Our method yields a 1-bit detector whose performance is only 6.4% mAP lower than that of the best-performing real-valued counterpart (19.6% vs. 26.0%). Similarly, using the SSD300 framework with the VGG-16 backbone, our method achieves 18.1% mAP@[.5,.95], outperforming XNOR-Net, Bi-Real Net, and BiDet by 10.0%, 6.9%, and 4.9% mAP, respectively. Our ReBNN also yields highly efficient models, theoretically accelerating Faster-RCNN and SSD by 50.62× and 14.76×.

| Framework | Backbone | Method | #Bits | Size (MB) | OPs (G) | mAP@[.5,.95] | AP50 | AP75 | APs | APm | APl |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Faster-RCNN | ResNet-18 | Real-valued | 32-32 | 47.48 | 434.39 | 26.0 | 44.8 | 27.2 | 10.0 | 28.9 | 39.7 |
| | | DoReFa-Net | 4-4 | 6.73 | 55.90 | 22.9 | 38.6 | 23.7 | 8.0 | 24.9 | 36.3 |
| | | XNOR-Net | 1-1 | 2.39 | 8.58 | 10.4 | 21.6 | 8.8 | 2.7 | 11.8 | 15.9 |
| | | Bi-Real Net | | | | 14.4 | 29.0 | 13.4 | 3.7 | 15.4 | 24.1 |
| | | BiDet | | | | 15.7 | 31.0 | 14.4 | 4.9 | 16.7 | 25.4 |
| | | ReBNN | | | | 19.6 | 37.6 | 20.4 | 7.0 | 20.1 | 33.1 |
| SSD | VGG-16 | Real-valued | 32-32 | 105.16 | 31.44 | 23.2 | 41.2 | 23.4 | 5.3 | 23.2 | 39.6 |
| | | DoReFa-Net | 4-4 | 29.58 | 6.67 | 19.5 | 35.0 | 19.6 | 5.1 | 20.5 | 32.8 |
| | | XNOR-Net | 1-1 | 21.88 | 2.13 | 8.1 | 19.5 | 5.6 | 2.6 | 8.3 | 13.3 |
| | | Bi-Real Net | | | | 11.2 | 26.0 | 8.3 | 3.1 | 12.0 | 18.3 |
| | | BiDet | | | | 13.2 | 28.3 | 10.5 | 5.1 | 14.3 | 20.5 |
| | | ReBNN | | | | 18.1 | 33.9 | 17.5 | 4.2 | 17.9 | 25.9 |

Table 3: Comparison of mAP@[.5,.95] (%), AP (%) at different IoU thresholds, and AP for objects of various sizes with state-of-the-art binarized object detectors on COCO minival. #Bits denotes the bit width of weights and activations. Blank cells share the value of the row above.

### Natural Language Processing

In Tab. 4, we show experiments on the BERT$_{\text{BASE}}$ architecture and the GLUE benchmark without data augmentation following BiBERT (Qin et al. 2022). The experiments show that ReBNN outperforms other methods on the development set of the GLUE benchmark, including TernaryBERT (Zhang et al. 2020), BinaryBERT (Bai et al. 2020), Q-BERT (Shen et al. 2020), Q2BERT (Shen et al. 2020), and BiBERT (Qin et al. 2022). Our ReBNN surpasses existing methods on the BERT$_{\text{BASE}}$ architecture by a clear margin in average accuracy. For example, our ReBNN surpasses BiBERT by 6.6% accuracy on the QNLI dataset, which is significant for natural language processing. We observe that ReBNN brings improvements on 7 out of the 8 datasets, leading to a 2.6% average accuracy improvement. Our ReBNN also achieves a highly efficient model, theoretically accelerating the BERT$_{\text{BASE}}$ architecture by 56.25×.

| Method | #Bits | Size (MB) | OPs (G) | MNLI-m/mm | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Real-valued | 32-32-32 | 418 | 22.5 | 84.9/85.5 | 91.4 | 92.1 | 93.2 | 59.7 | 90.1 | 86.3 | 72.2 | 83.9 |
| Q-BERT | 2-8-8 | 43.0 | 6.5 | 76.6/77.0 | - | - | 84.6 | - | - | 68.3 | 52.7 | - |
| Q2BERT | 2-8-8 | 43.0 | 6.5 | 47.2/47.3 | 67.0 | 61.3 | 80.6 | 0 | 4.4 | 68.4 | 52.7 | 47.7 |
| TernaryBERT | 2-2-2 | 28.0 | 1.5 | 40.3/40.0 | 63.1 | 50.0 | 80.7 | 0 | 12.4 | 68.3 | 54.5 | 45.5 |
| BinaryBERT | 1-1-1 | 16.5 | 0.4 | 35.6/35.3 | 66.2 | 51.5 | 53.2 | 0 | 6.1 | 68.3 | 52.7 | 41.0 |
| BiBERT | | | | 66.1/67.5 | 84.8 | 72.6 | 88.7 | 25.4 | 33.6 | 72.5 | 57.4 | 63.2 |
| ReBNN | | | | 69.9/71.3 | 85.2 | 79.2 | 89.3 | 28.8 | 38.7 | 72.6 | 56.9 | 65.8 |

Table 4: Comparison of BERT quantization methods without data augmentation. #Bits denotes the bit width of weights, word embeddings, and activations. Avg. denotes the average result.

### Deployment Efficiency

We deploy the 1-bit models obtained by our ReBNN on an ODROID C4, which has a 2.016 GHz 64-bit quad-core ARM Cortex-A55. By evaluating the real speed in practice, we verify the efficiency of ReBNN when deployed on real-world mobile devices. We leverage the SIMD instruction SSHL on ARM NEON to make the inference framework BOLT (Feng 2021) compatible with ReBNN.

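The speedups reported below come from replacing multiply-accumulate with packed XNOR and bit-count arithmetic; the toy Python sketch below (purely illustrative, not BOLT or NEON code) shows how the dot product of two ±1 vectors is recovered from a popcount:

```python
# Dot product of two {-1,+1} vectors via bit packing, XNOR and popcount.
def pack(bits):                       # list of +1/-1 -> integer bitmask (+1 -> bit set)
    word = 0
    for i, b in enumerate(bits):
        if b == +1:
            word |= 1 << i
    return word

def binary_dot(x_bits, w_bits):
    n = len(x_bits)
    xnor = ~(pack(x_bits) ^ pack(w_bits)) & ((1 << n) - 1)   # 1 where signs agree
    agree = bin(xnor).count("1")
    return 2 * agree - n              # dot product of the +/-1 vectors

x = [+1, -1, -1, +1, +1, -1, +1, +1]
w = [+1, +1, -1, -1, +1, -1, -1, +1]
assert binary_dot(x, w) == sum(a * b for a, b in zip(x, w))
print(binary_dot(x, w))
```

On hardware, the packing is done once per tensor and the XNOR/popcount pair processes 64 (or more, with SIMD) weight-activation pairs per instruction, which is where the measured acceleration comes from.
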
We compare ReBNN to the real-valued backbones in Tab. 5. ReBNN's inference is substantially faster with the highly efficient BOLT framework: the acceleration rate reaches about 8.64× on ResNet-18, which is slightly lower than the theoretical acceleration rate, and for the ResNet-34 backbone ReBNN achieves a 9.03× acceleration rate on hardware, which is significant for computer vision on real-world edge devices.

| Network | Method | #Bits | Size (MB) | Memory Saving | Latency (ms) | Acceleration |
|---|---|---|---|---|---|---|
| ResNet-18 | Real-valued | 32-32 | 46.76 | - | 583.1 | - |
| | ReBNN | 1-1 | 4.15 | 11.26× | 67.5 | 8.64× |
| ResNet-34 | Real-valued | 32-32 | 87.19 | - | 1025.6 | - |
| | ReBNN | 1-1 | 5.41 | 16.12× | 113.6 | 9.03× |

Table 5: Comparing ReBNN with real-valued models on hardware (single thread).

## Conclusion

In this paper, we analyze the influence of frequent weight oscillation in binary neural networks and propose a Resilient Binary Neural Network (ReBNN) that provides resilient gradients for updating the latent weights. Specifically, we parameterize the scaling factor and introduce a weighted reconstruction loss to build an adaptive training objective. We further show that the balanced parameter serves as an indicator of the frequency of weight oscillation during back propagation. ReBNN reaches resilience by learning the balanced parameter, leading to a great reduction of weight oscillation. ReBNN generalizes well and gains impressive performance on various tasks such as image classification, object detection, and natural language processing, demonstrating its superiority over state-of-the-art BNNs.

## Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grants 62076016, 62206272, 62141604, 61972016, and 62032016, Beijing Natural Science Foundation L223024, the National Key R&D Program of China (No. 2022ZD0160100), and in part by the Shanghai Committee of Science and Technology (Grant No. 21DZ1100100).

## References

Bai, H.; Zhang, W.; Hou, L.; Shang, L.; Jin, J.; Jiang, X.; Liu, Q.; Lyu, M.; and King, I. 2020. BinaryBERT: Pushing the limit of BERT quantization. In Proc. of ACL, 4334–4348.

Courbariaux, M.; Bengio, Y.; and David, J.-P. 2015. BinaryConnect: Training deep neural networks with binary weights during propagations. In Proc. of NeurIPS, 3123–3131.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Feng, J. 2021. Bolt. https://github.com/huawei-noah/bolt. Accessed: 2020-03-10.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proc. of ICCV, 1026–1034.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proc. of CVPR, 770–778.

He, Y.; Kang, G.; Dong, X.; Fu, Y.; and Yang, Y. 2018. Soft filter pruning for accelerating deep convolutional neural networks. In Proc. of IJCAI, 2234–2240.

Hu, J.; Shen, L.; and Sun, G. 2018. Squeeze-and-excitation networks. In Proc. of CVPR, 7132–7141.

Juefei-Xu, F.; Naresh Boddeti, V.; and Savvides, M. 2017. Local binary convolutional neural networks. In Proc. of CVPR, 19–28.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Proc. of NeurIPS, 1097–1105.

Li, Y.; Xu, S.; Zhang, B.; Cao, X.; Gao, P.; and Guo, G. 2022. Q-ViT: Accurate and fully quantized low-bit vision transformer. arXiv preprint arXiv:2210.06707.

Lin, M.; Ji, R.; Xu, Z.; Zhang, B.; Chao, F.; Lin, C.-W.; and Shao, L. 2022. SiMaN: Sign-to-magnitude network binarization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–12.

Lin, M.; Ji, R.; Xu, Z.; Zhang, B.; Wang, Y.; Wu, Y.; Huang, F.; and Lin, C.-W. 2020. Rotated Binary Neural Network. In Proc. of NeurIPS, 1–9.

Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In Proc. of ECCV, 740–755.

Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; and Berg, A. C. 2016. SSD: Single shot multibox detector. In Proc. of ECCV, 21–37.

Liu, Z.; Shen, Z.; Savvides, M.; and Cheng, K.-T. 2020. ReActNet: Towards precise binary neural network with generalized activation functions. In Proc. of ECCV, 143–159.

Liu, Z.; Wu, B.; Luo, W.; Yang, X.; Liu, W.; and Cheng, K.-T. 2018. Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In Proc. of ECCV, 722–737.

Martinez, B.; Yang, J.; Bulat, A.; and Tzimiropoulos, G. 2020. Training binary neural networks with real-to-binary convolutions. In Proc. of ICLR, 1–11.

Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch. In Proc. of NeurIPS Workshop, 1–4.

Qin, H.; Ding, Y.; Zhang, M.; Yan, Q.; Liu, A.; Dang, Q.; Liu, Z.; and Liu, X. 2022. BiBERT: Accurate fully binarized BERT. In Proc. of ICLR, 1–24.

Qin, L.; Che, W.; Li, Y.; Wen, H.; and Liu, T. 2019. A stack-propagation framework with token-level intent detection for spoken language understanding. In Proc. of EMNLP, 2078–2087.

Rastegari, M.; Ordonez, V.; Redmon, J.; and Farhadi, A. 2016. XNOR-Net: ImageNet classification using binary convolutional neural networks. In Proc. of ECCV, 525–542.

Ren, S.; He, K.; Girshick, R.; and Sun, J. 2016. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137–1149.

Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; and Fei-Fei, L. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3): 211–252.

Shen, S.; Dong, Z.; Ye, J.; Ma, L.; Yao, Z.; Gholami, A.; Mahoney, M. W.; and Keutzer, K. 2020. Q-BERT: Hessian based ultra low precision quantization of BERT. In Proc. of AAAI, 8815–8821.

Simonyan, K.; and Zisserman, A. 2015. Very deep convolutional networks for large-scale image recognition. In Proc. of ICLR, 1–13.

Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. R. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proc. of ICLR, 1–20.

Wang, Z.; Wu, Z.; Lu, J.; and Zhou, J. 2020. BiDet: An efficient binarized object detector. In Proc. of CVPR, 2049–2058.

Xu, S.; Li, Y.; Wang, T.; Ma, T.; Zhang, B.; Gao, P.; Qiao, Y.; Lu, J.; and Guo, G. 2022a. Recurrent bilinear optimization for binary neural networks. In Proc. of ECCV, 19–35.

Xu, S.; Li, Y.; Zeng, B.; Ma, T.; Zhang, B.; Cao, X.; Gao, P.; and Lu, J. 2022b. IDa-Det: An information discrepancy-aware distillation for 1-bit detectors. In Proc. of ECCV, 346–361.

Xu, S.; Zhao, J.; Lu, J.; Zhang, B.; Han, S.; and Doermann, D. 2021a. Layer-wise searching for 1-bit detectors. In Proc. of CVPR, 5682–5691.

Xu, Y.; Han, K.; Xu, C.; Tang, Y.; Xu, C.; and Wang, Y. 2021b. Learning frequency domain approximation for binary neural networks. In Proc. of NeurIPS, 25553–25565.

Xu, Z.; Lin, M.; Liu, J.; Chen, J.; Shao, L.; Gao, Y.; Tian, Y.; and Ji, R. 2021c. ReCU: Reviving the dead weights in binary neural networks. In Proc. of ICCV, 5198–5208.

Zhang, W.; Hou, L.; Yin, Y.; Shang, L.; Chen, X.; Jiang, X.; and Liu, Q. 2020. TernaryBERT: Distillation-aware ultra-low bit BERT. In Proc. of EMNLP, 509–521.

Zhou, S.; Wu, Y.; Ni, Z.; Zhou, X.; Wen, H.; and Zou, Y. 2016. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.