# ShiftAddNet: A Hardware-Inspired Deep Network

Haoran You, Xiaohan Chen, Yongan Zhang, Chaojian Li, Sicheng Li, Zihao Liu, Zhangyang Wang, and Yingyan Lin

Department of Electrical and Computer Engineering, Rice University; Department of Electrical and Computer Engineering, The University of Texas at Austin; Alibaba DAMO Academy

{hy34, yz87, cl114, yingyan.lin}@rice.edu, {xiaohan.chen, atlaswang}@utexas.edu, {sicheng.li, zihao.liu}@alibaba-inc.com

Multiplication (e.g., convolution) is arguably a cornerstone of modern deep neural networks (DNNs). However, intensive multiplications incur expensive resource costs that challenge DNN deployment on resource-constrained edge devices, driving several attempts at multiplication-less deep networks. This paper presents ShiftAddNet, whose main inspiration is drawn from a common practice in energy-efficient hardware implementation: multiplication can instead be performed with additions and logical bit-shifts. We leverage this idea to explicitly parameterize deep networks accordingly, yielding a new type of deep network that involves only bit-shift and additive weight layers. This hardware-inspired ShiftAddNet immediately leads to both energy-efficient inference and training, without compromising expressive capacity compared to standard DNNs. The two complementary operation types (bit-shift and add) additionally enable finer-grained control of the model's learning capacity, leading to a more flexible trade-off between accuracy and (training) efficiency, as well as improved robustness to quantization and pruning. We conduct extensive experiments and ablation studies, all backed up by our FPGA-based ShiftAddNet implementation and energy measurements. Compared to existing DNNs or other multiplication-less models, ShiftAddNet aggressively reduces over 80% of the hardware-quantified energy cost of DNN training and inference, while offering comparable or better accuracies. Codes and pre-trained models are available at https://github.com/RICE-EIC/ShiftAddNet.

## 1 Introduction

Powerful deep neural networks (DNNs) come at the price of prohibitive resource costs during both DNN inference and training, limiting the application feasibility and scope of DNNs on resource-constrained devices for more pervasive intelligence. DNNs are largely composed of multiplication operations for both forward and backward propagation, which are much more computationally costly than additions [1]. The above roadblock has driven several attempts to design new types of hardware-friendly deep networks that rely less on heavy multiplications in order to achieve higher energy efficiency. ShiftNet [2, 3] adopted spatial shift operations paired with pointwise convolutions to replace a large portion of convolutions. DeepShift [4] employed an alternative of bit-wise shifts, which are equivalent to multiplying the input with powers of 2. Lately, AdderNet [5] pioneered the demonstration of the feasibility and promise of replacing all convolutions with merely addition operations.

This paper takes one step further along this direction of multiplication-less deep networks, by drawing on a very fundamental idea in hardware-design practice, computer processors, and even digital signal processing. It has long been known that multiplications can be performed with additions and logical bit-shifts [6, 7], whose hardware implementations are very simple and much faster [8], without compromising the result quality or precision.
Also, on currently available processors, a bit-shift instruction is faster than a multiply instruction and can be leveraged to multiply (shift left) and divide (shift right) by powers of two. Multiplication (or division) by a constant is then implemented using a sequence of shifts and adds (or subtracts). This clever shortcut saves arithmetic operations, and can readily be applied to accelerating the hardware implementation of any machine learning algorithm involving multiplication (either scalar, vector, or matrix). But our curiosity goes well beyond this: can we learn from this hardware-level "shortcut" to design efficient learning algorithms?

The above uniquely motivates our work: in order to be more hardware-friendly, we strive to re-design our model to be hardware-inspired, leveraging successful experience directly from the efficient hardware design community. Specifically, we explicitly re-parameterize our deep networks by replacing all convolutional and fully-connected layers (both built on multiplications) with two multiplication-free layers: bit-shift and add. Our new type of deep model, named ShiftAddNet, immediately leads to both energy-efficient inference and training algorithms.

We note that ShiftAddNet seamlessly integrates bit-shift and addition together, with strong motivations that address several pitfalls in prior arts [2, 3, 4, 5]. Compared to utilizing spatial or bit-shifts alone [2, 4], ShiftAddNet can be fully expressive as standard DNNs, while [2, 4] only approximate the original expressive capacity since shift operations cannot span the entire continuous space of multiplicative mappings (e.g., bit-shifts can only represent the subset of power-of-2 multiplications). Compared to the fully additive model [5], we note that while repeated additions can in principle replace any multiplicative mapping, they do so in a very inefficient way. In contrast, by also exploiting bit-shifts, ShiftAddNet is expected to be more parameter-efficient than [5], which relies on adding templates. As a bonus, we notice that the bit-shift and add operations naturally correspond to coarse- and fine-grained input manipulations. We can exploit this property to more flexibly trade off between training efficiency and achievable accuracy, e.g., by freezing all bit-shifts and only training the add layers. ShiftAddNet with fixed shift layers can achieve up to 90% and 82.8% energy savings over fully additive models [5] and shift models [4] under floating-point or fixed-point precision training, while leading to comparable or better accuracies (-3.7% ~ +31.2% and 3.5% ~ 23.6%), respectively. Our contributions can be summarized as follows:

- Uniquely motivated by hardware design expertise, we combine two multiplication-less and complementary operations (bit-shift and add) to develop a hardware-inspired network called ShiftAddNet that is fully expressive and ultra-efficient.
- We develop training and inference algorithms for ShiftAddNet. Leveraging the two operations' distinct granularity levels, we also investigate ShiftAddNet's trade-offs between training efficiency and achievable accuracy, e.g., by freezing all the bit-shift layers.
- We conduct extensive experiments to compare ShiftAddNet with existing DNNs or multiplication-less models. Results on multiple benchmarks demonstrate its superior compactness, accuracy, training efficiency, and robustness.
Specifically, we implement ShiftAddNet on a ZYNQ-7 ZC706 FPGA board [9] and collect all real energy measurements for benchmarking.

## 2 Related works

Multiplication-less DNNs. Shrinking the cost-dominant multiplications has been widely considered in many DNN designs for reducing computational complexity [10, 11]: [10] decomposes the convolutions into separate depthwise and pointwise modules which require fewer multiplications; and [12, 13, 14] binarize the weights or activations to construct DNNs consisting of sign changes paired with much fewer multiplications. Another trend is to replace multiplication operations with other, cheaper operations. Specifically, [3, 2] leverage spatial shift operations to shift feature maps, which need to be paired with pointwise convolutions to aggregate spatial information; [4] fully replaces multiplications with both bit-wise shift operations and sign changes; and [5, 15, 16] trade multiplications for cheaper additions and develop a special backpropagation scheme for effectively training the add-only networks.

Hardware costs of basic operations. Compared to shift and add, multipliers can be very inefficient in hardware as they incur high costs in terms of consumed energy/time and chip area. Shift and add operations can substitute for such multipliers. For example, they have been adopted to save computing resources and can be easily and efficiently performed by a digital processor [17]. This hardware idea has been adopted to accelerate multilayer perceptrons (MLPs) in digital processors [8]. We are here motivated by such hardware expertise to fully replace multiplications in modern DNNs with merely shifts and adds, aiming to solve the drawbacks of existing shift-only or add-only replacement methods and to boost network efficiency over multiplication-based DNNs.

Relevant observations in DNN training. It has been shown that DNN training contains redundancy in various aspects [18, 19, 20, 21, 22, 23]. For example, [24] explores an orthogonal weight training algorithm which over-parameterizes the networks with the multiplication between a learnable orthogonal matrix and fixed, randomly initialized weights, and argues that fixing weights during training and only learning a proper coordinate system can yield good generalization for over-parameterized networks; and [25] separates the convolution into spatial and pointwise convolutions, while freezing the binary spatial convolution filters (called anchor weights) and only learning the pointwise convolutions. These works inspire the ablation study of fixing shift parameters in our ShiftAddNet.

## 3 The proposed model: ShiftAddNet

In this section, we present our proposed ShiftAddNet. First, we introduce the motivation and hypothesis behind ShiftAddNet, and then discuss ShiftAddNet's component layers (i.e., shift and add layers) from both hardware cost and algorithmic perspectives, providing high-level background and justification for ShiftAddNet. Finally, we discuss a more efficient variant of ShiftAddNet.

### 3.1 Motivation and hypothesis

Drawing from the long-standing tradition in the field of energy-efficient hardware implementation of replacing expensive multiplications with lower-cost bit-shifts and adds, we re-design DNNs by pipelining the shift and add layers.
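As a toy illustration of this hardware shortcut (not part of ShiftAddNet itself; the function name and code are ours), the following Python snippet multiplies an integer by a constant using only bit-shifts and adds, i.e., by decomposing the constant into powers of 2:

```python
def shift_add_mul(x: int, c: int) -> int:
    """Multiply x by a non-negative constant c using only bit-shifts and adds.

    The constant is decomposed into powers of two (its binary expansion),
    so c * x = sum over the set bits i of c of (x << i).
    """
    acc, shift = 0, 0
    while c:
        if c & 1:            # bit i of c is set -> accumulate x shifted left by i
            acc += x << shift
        c >>= 1
        shift += 1
    return acc

assert shift_add_mul(7, 10) == 70     # 10*x = (x << 3) + (x << 1)
assert shift_add_mul(-5, 13) == -65   # 13*x = (x << 3) + (x << 2) + x
```

The same sign-and-power-of-2 idea underlies the shift-layer weights w_s = s · 2^p introduced in Sec. 3.2 and Sec. 3.4.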
We hypothesize that (1) while DNNs with merely either shift or add layers are in general less capable than their multiplication-based DNN counterparts, integrating these two weak players can lead to networks with much improved expressive capacity, while maintaining their hardware-efficiency advantages; and (2) thanks to the coarse- and fine-grained input manipulations resulting from the complementary shift and add layers, such a new network pipeline may even lead to models that are comparable with multiplication-based DNNs in terms of task accuracy, while offering superior hardware efficiency.

### 3.2 ShiftAddNet: shift layers

This subsection discusses the shift layers adopted in our proposed ShiftAddNet in terms of hardware efficiency and algorithmic capacity.

Table 1: Unit energy comparisons using ASIC & FPGA.

| Operation | Format | ASIC (45nm) Energy (pJ) | ASIC Improv. | FPGA (ZYNQ-7 ZC706) Energy (pJ) | FPGA Improv. |
|---|---|---|---|---|---|
| Mult. | FP32 | 3.7 | - | 18.8 | - |
| Mult. | FIX32 | 3.1 | - | 19.6 | - |
| Mult. | FIX8 | 0.2 | - | 0.2 | - |
| Add | FP32 | 0.9 | 4.1x | 0.4 | 47x |
| Add | FIX32 | 0.1 | 31x | 0.1 | 196x |
| Add | FIX8 | 0.03 | 6.7x | 0.1 | 2x |
| Shift | FIX32 | 0.13 | 24x | 0.1 | 196x |
| Shift | FIX8 | 0.024 | 8.3x | 0.025 | 8x |

Hardware perspective. The shift operation is a well-known efficient hardware primitive, motivating the recent development of various shift-based efficient DNNs [4, 2, 3]. Specifically, the shift layers in [4] reduce DNNs' computation and energy costs by replacing the regular cost-dominant multiplication-based convolution and linear operations (a.k.a. fully-connected layers) with bit-shift-based convolution and linear operations, respectively. Mathematically, such bit-shift operations are equivalent to multiplying by powers of 2. As summarized in Tab. 1, such shift operations can be extremely efficient compared to their corresponding multiplications. In particular, bit-shifts can save as much as 196x and 24x energy cost over their multiplication counterparts, when implemented on a SOTA FPGA [26] and in a 45nm CMOS technology, respectively. In addition, for a 16-bit design, it has been estimated that the average power and area of multipliers are at least 9.7x and 1.45x, respectively, those of bit-shifts [4].

Algorithmic perspective. Despite its promising hardware efficiency, a network constructed with bit-shifts can compare unfavorably with its multiplication-based counterpart in terms of expressive efficiency. Formally, the expressive efficiency of architecture A is higher than that of architecture B if any function realized by B can be replicated by A, but there exist functions realized by A which cannot be replicated by B unless its size grows significantly larger [27]. For example, it is commonly accepted that DNNs are exponentially more efficient than shallow networks, because a shallow network must grow exponentially large to approximate the functions represented by a DNN of polynomial size. For ease of discussion, we refer to [28] and use a loosely defined metric of expressiveness called expressive capacity (accuracy) in this paper without loss of generality. Specifically, expressive capacity refers to the achieved accuracy of networks under the same or similar hardware cost, i.e., network A is deemed to have a better expressive capacity than network B if the former achieves a higher accuracy at the cost of the same or even fewer FLOPs (or energy cost).
From this perspective, networks with bit-shift layers (without full-precision latent weights) are observed to be inferior to networks with add layers or multiplication-based convolution layers, as shown in prior arts [5, 10] and validated in our experiments (see Sec. 4) under various settings and datasets.

### 3.3 ShiftAddNet: add layers

Similar to the aforementioned subsection, here we discuss the add layers adopted in our proposed ShiftAddNet in terms of hardware efficiency and algorithmic capacity.

Hardware perspective. Addition is another well-known efficient hardware primitive. This has motivated the design of efficient DNNs mostly using additions [5], and there are many works trying to trade multiplications for additions in order to speed up DNNs [29, 5, 30]. In particular, [5] investigates the feasibility of replacing multiplications with additions in DNNs, and presents AdderNets which trade the massive multiplications in DNNs for much cheaper additions to reduce computational costs. As a concrete example in Tab. 1, additions can save up to 196x and 31x energy cost over multiplications in fixed-point formats, and 47x and 4.1x in (more expensive) floating-point formats, when implemented on a SOTA FPGA [26] and in a 45nm CMOS technology, respectively. Note that the pioneering work [5], which investigates addition-dominant DNNs, presents its networks in floating-point formats.

Algorithmic perspective. While there have been no prior works studying add layers in terms of expressive efficiency or capacity, the results in SOTA bit-shift-based networks [4] and add-based networks [5], as well as our experiments under various settings, show that add-based networks in general have better expressive capacity than their bit-shift-based counterparts. In particular, AdderNet [5] achieves a 1.37% higher accuracy than DeepShift [4] at a cost of similar or even lower FLOPs on ResNet-18 with the ImageNet dataset. Furthermore, when it comes to DNNs, the diversity of learned feature granularity is another factor important to the achieved accuracy [31]. In this regard, shift layers are deemed to extract large-grained features as compared to the small-grained features learned by add layers.

### 3.4 ShiftAddNet implementation

#### 3.4.1 Overview of the structure

Figure 1: Illustrating the overall structure of ShiftAddNet (feature extraction followed by classification).

To better validate our aforementioned Hypothesis (1), i.e., that integrating the two weak players (shift and add) into one can lead to networks with much improved task accuracy and hardware efficiency as compared to networks with merely one of the two, we adopt SOTA bit-shift-based and add-based network designs to implement the shift and add layers of our ShiftAddNet in this paper. In this way, we can better establish that the resulting design's improved performance comes merely from the integration of the two, ruling out potential impact due to a different design. The overall structure of ShiftAddNet is illustrated in Fig. 1, and can be formulated as follows:

$$O = k_a\big(k_s(I,\, s \cdot 2^p),\, w_a\big), \quad \text{where } k_s(x, w) = \sum x^T w \ \text{ and } \ k_a(x, w) = -\sum \|x - w\|_1, \tag{1}$$

where I and O denote the input and output activations, respectively; k_s(·,·) and k_a(·,·) are kernel functions performing the inner products and subtractions of a convolution; w_a denotes the weights in the add layers; and w_s = s · 2^p represents the weights in the shift layers, in which s are sign flip operators, s ∈ {-1, 0, 1}, and the power-of-2 parameters p represent the bit-wise shifts.
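To make Eq. (1) concrete, below is a minimal PyTorch-style sketch of one shift layer followed by one add layer. This is our own illustrative re-implementation (the module names `ShiftConv2d`/`AddConv2d` and the initialization are ours, not the released code): the shift layer is an ordinary convolution whose effective weights are constrained to s · 2^p, and the add layer replaces inner products with negative ℓ1 distances over unfolded patches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShiftConv2d(nn.Module):
    """Convolution whose effective weights are w_s = s * 2^p (Eq. (1))."""
    def __init__(self, c_in, c_out, k, stride=1, padding=0):
        super().__init__()
        self.sign = nn.Parameter(torch.sign(torch.randn(c_out, c_in, k, k)))      # s in {-1, +1}
        self.p = nn.Parameter(torch.randint(-4, 0, (c_out, c_in, k, k)).float())  # power-of-2 exponents
        self.stride, self.padding = stride, padding

    def forward(self, x):
        w_s = self.sign * torch.pow(2.0, self.p)   # sign * power-of-2 weights
        return F.conv2d(x, w_s, stride=self.stride, padding=self.padding)

class AddConv2d(nn.Module):
    """'Convolution' that replaces inner products with -||x - w||_1 (Eq. (1))."""
    def __init__(self, c_in, c_out, k):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(c_out, c_in, k, k) * 0.1)
        self.k = k

    def forward(self, x):                               # assumes square inputs
        n, _, h, w = x.shape
        cols = F.unfold(x, self.k)                      # (N, C_in*k*k, L)
        w_flat = self.weight.flatten(1)                 # (C_out, C_in*k*k)
        # negative l1 distance between every patch and every filter
        out = -(cols.unsqueeze(1) - w_flat[None, :, :, None]).abs().sum(dim=2)
        e = h - self.k + 1
        return out.view(n, -1, e, e)                    # (N, C_out, E, F)

# A ShiftAddNet "layer": shift (coarse-grained) followed by add (fine-grained).
x = torch.randn(2, 8, 16, 16)
y = AddConv2d(16, 32, 3)(ShiftConv2d(8, 16, 3, padding=1)(x))
print(y.shape)  # torch.Size([2, 32, 14, 14])
```

As in the paper, the shift layer carries the stride of the original ConvNet while the add layer uses stride 1; here p is kept continuous for simplicity, whereas a true bit-shift implementation would round it to integers.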
Dimensions of shift and add layers. A shift layer in ShiftAddNet adopts the same strides and weight dimensions as those in the corresponding multiplication-based DNN (e.g., ConvNet), followed by an add layer which adapts its kernel sizes and input channels to match the reduced feature maps. Although in this way ShiftAddNet contains slightly more weights than ConvNet/AdderNet (e.g., 1.3 MB vs. 1.03 MB for ConvNet/AdderNet (FP32) on ResNet-20), it consumes less energy to achieve similar accuracies because data movement is the cost bottleneck in hardware acceleration [32, 33, 34]. ShiftAddNet can be further quantized to 0.4 MB (FIX8) without hurting the accuracy, as demonstrated by the experiments in Sec. 4.2.

#### 3.4.2 Backpropagation in ShiftAddNet

ShiftAddNet adopts the SOTA bit-shift-based and add-based network designs during backpropagation. Here we explicitly formulate both the inference and backpropagation of the shift and add layers. The add layers during inference can be expressed as:

$$O_a[c_o][e][f] = -\sum_{c_i=0}^{C_I-1} \sum_{r=0}^{R-1} \sum_{s=0}^{S-1} \big\| x_a[c_i][e+r][f+s] - w_a[c_o][c_i][r][s] \big\|_1, \tag{2}$$

where 0 ≤ c_o < C_O, 0 ≤ e < E, 0 ≤ f < F; C_I and C_O denote the numbers of input and output channels, E and F the height and width of the output feature maps, and R and S the size of the weight filters; and O_a, x_a, and w_a denote the output activations, input activations, and weights, respectively. Based on the above notation, we formulate the add layers' backpropagation in the following equations:

$$\frac{\partial O_a[c_o][e][f]}{\partial w_a[c_o][c_i][r][s]} = x_a[c_i][e+r][f+s] - w_a[c_o][c_i][r][s], \tag{3}$$

$$\frac{\partial O_a[c_o][e][f]}{\partial x_a[c_i][e+r][f+s]} = \mathrm{HT}\big( x_a[c_i][e+r][f+s] - w_a[c_o][c_i][r][s] \big), \tag{4}$$

where HT denotes the HardTanh function, following AdderNet [5], to prevent gradients from exploding. Note that the difference from AdderNet is that the strides of ShiftAddNet's add layers are always equal to one, while its shift layers share the same strides as the corresponding ConvNet. Next, we use the above notation to introduce both the inference and backpropagation design of ShiftAddNet's shift layers, with one additional symbol U denoting the stride:

$$O_s[c_o][e][f] = \sum x_s^T w_s = \sum_{c_i=0}^{C_I-1} \sum_{r=0}^{R-1} \sum_{s=0}^{S-1} x_s[c_i][e \cdot U + r][f \cdot U + s] \cdot s[c_o][c_i][r][s] \cdot 2^{\,p[c_o][c_i][r][s]}, \tag{5}$$

$$\frac{\partial O_s}{\partial p} = \frac{\partial O_s}{\partial w_s} \cdot w_s \cdot \ln 2, \qquad \frac{\partial O_a}{\partial x_s} = \frac{\partial O_a}{\partial O_s} \cdot w_s^T, \tag{6}$$

where O_s, x_s, and w_s denote the output activations, input activations, and weights of the shift layers, respectively; ∂O_a/∂O_s follows Eq. (4) to perform backpropagation, since a shift layer is followed by an add layer in ShiftAddNet.

#### 3.4.3 ShiftAddNet variant: fixing the shift layers

Inspired by the recent success of freezing anchor weights [25, 24] in over-parameterized networks, we hypothesize that freezing the over-parameterized shift layers (large-grained anchor weights) in ShiftAddNet can potentially lead to good generalization ability, motivating us to develop a variant of ShiftAddNet with fixed shift layers. In particular, ShiftAddNet with fixed shift layers simply means the shift weight filters s and p in Eq. (1) remain the same after initialization. Training ShiftAddNet with fixed shift layers is straightforward because the shift weight filters (i.e., s and p in Eq. (1)) do not need to be updated (i.e., the corresponding gradient calculations are skipped), while the error is backpropagated through the fixed shift layers in the same way as through learnable shift layers (see Eq. (6)). Moreover, we further prune the fixed shift layers to reserve only the necessary large-grained anchor weights, yielding an even more energy-efficient ShiftAddNet.
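For reference, the modified gradients of Eqs. (3)-(4) can be realized as a custom autograd function over unfolded patches, as in the following sketch (our own illustrative code, not the released implementation; it omits padding, bias, and AdderNet's adaptive gradient scaling):

```python
import torch
import torch.nn.functional as F

class AdderKernel(torch.autograd.Function):
    """Negative-L1 'convolution' kernel with the gradients of Eqs. (3)-(4)."""

    @staticmethod
    def forward(ctx, cols, w_flat):
        # cols: (N, L, K) unfolded input patches; w_flat: (C_out, K) flattened filters
        diff = cols.unsqueeze(1) - w_flat[None, :, None, :]   # (N, C_out, L, K)
        ctx.save_for_backward(diff)
        return -diff.abs().sum(dim=-1)                        # (N, C_out, L)

    @staticmethod
    def backward(ctx, grad_out):
        (diff,) = ctx.saved_tensors
        g = grad_out.unsqueeze(-1)                            # (N, C_out, L, 1)
        # Eq. (3): full-precision gradient w.r.t. the adder weights
        grad_w = (g * diff).sum(dim=(0, 2))                   # (C_out, K)
        # Eq. (4): HardTanh-clipped gradient w.r.t. the inputs (avoids explosion)
        grad_cols = (g * F.hardtanh(diff)).sum(dim=1)         # (N, L, K)
        return grad_cols, grad_w

# Usage on one 3x3 add layer (stride 1, as in ShiftAddNet's add layers):
x = torch.randn(2, 8, 16, 16, requires_grad=True)
w = torch.randn(16, 8, 3, 3, requires_grad=True)
cols = F.unfold(x, 3).transpose(1, 2)          # (N, L, C_in*3*3)
out = AdderKernel.apply(cols, w.flatten(1))    # (N, C_out, L)
out.sum().backward()                           # populates x.grad and w.grad
```

For the fixed-shift variant of Sec. 3.4.3, the shift parameters s and p would simply be created with `requires_grad=False`, so their gradient computation is skipped while errors still propagate through the frozen layers via Eq. (6).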
## 4 Experiment results

In this section, we first describe our experiment setup, then benchmark ShiftAddNet over SOTA DNNs. After that, we evaluate ShiftAddNet in the context of domain adaptation. Finally, we present ablation studies of ShiftAddNet's shift and add layers.

### 4.1 Experiment setup

Models and datasets. We consider two DNN models (i.e., ResNet-20 [35] and VGG19-small [36]) on six datasets: two classification datasets (i.e., CIFAR-10/100) and four IoT datasets (MHEALTH [37], FlatCam Face [38], USC-HAD [39], and Head-pose detection [40]). Specifically, the Head-pose dataset contains 2,760 images, of which we randomly sample 80% for training and use the remaining 20% for testing the correctness of three outputs: front, left, and right [41]; the FlatCam Face dataset contains 23,838 face images captured using a FlatCam lensless imaging system [38], which are resized to 76×76 before being used.

Training settings. For the CIFAR-10/100 and Head-pose datasets, training takes a total of 160 epochs with a batch size of 256, where the initial learning rate is set to 0.1 and then divided by 10 at the 80-th and 120-th epochs, respectively, and an SGD solver is adopted with a momentum of 0.9 and a weight decay of 1e-4, following [42]. For the FlatCam Face dataset, we follow [43] to pre-train the network on the VGGFace2 dataset for 20 epochs before adapting to the FlatCam Face images. For the trainable shift layers, we follow [4] to adopt 50% sparsity by default; and for the MHEALTH and USC-HAD datasets, we follow [44] to use a DCNN model and train it for 40 epochs.

Baselines and evaluation metrics. Baselines: We evaluate the proposed ShiftAddNet against two SOTA multiplication-less networks, AdderNet [5] and DeepShift (we use DeepShift (PS) by default) [4], and also compare it to the multiplication-based ConvNet [45] under a comparable energy cost (about 30% more than AdderNet (FP32)). Evaluation metrics: For evaluating real hardware efficiency, we measure the energy cost of all DNNs on a SOTA FPGA platform, the ZYNQ-7 ZC706 [9]. Note that our energy measurements in all experiments include the DRAM access costs.

### 4.2 ShiftAddNet over SOTA DNNs on standard training

Experiment settings. For this set of experiments, we consider the general ShiftAddNet with learnable shift layers. Among the two SOTA multiplication-less baselines, AdderNet [5] and DeepShift [4], the latter quantizes its activations to 16-bit fixed-point for shifting purposes while its backpropagation uses floating-point precision. As floating-point additions are more expensive than multiplications [46], we refer to SOTA quantization techniques [47, 48] to quantize both the forward (weights and activations) and backward (errors and gradients) parameters to 32/8-bit fixed-point (FIX32/8), for evaluating the potential energy savings of both ShiftAddNet and AdderNet.
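For reference, the training hyper-parameters listed in Sec. 4.1 for the CIFAR-10/100 and Head-pose experiments correspond to a standard PyTorch optimizer and schedule along the lines of the sketch below (`model` is a placeholder; data loading, the batch size of 256, and the quantized training path are omitted):

```python
import torch

def make_optimizer_and_schedule(model):
    # SGD with momentum 0.9 and weight decay 1e-4, initial lr 0.1 (Sec. 4.1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    # lr divided by 10 at the 80-th and 120-th of the 160 total epochs
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[80, 120],
                                                     gamma=0.1)
    return optimizer, scheduler
```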
ShiftAddNet over SOTA on classification. The results on four datasets and two DNNs in Fig. 2 (a), (b), (e), and (d) show that ShiftAddNet can consistently outperform all competitors in terms of the measured energy cost, while improving the task accuracies. Specifically, with full-precision floating-point (FP32) parameters, ShiftAddNet even surpasses both the multiplication-based ConvNet and AdderNet: when training ResNet-20 on CIFAR-10, ShiftAddNet reduces 33.7% and 44.6% of the training energy costs as compared to AdderNet and ConvNet [45], respectively, outperforming the SOTA multiplication-based ConvNet and thus validating our Hypothesis (2) in Section 3.1. ShiftAddNet also demonstrates notably improved robustness to quantization as compared to AdderNet: a quantized ShiftAddNet with 8-bit fixed-point representation reduces 65.1% ~ 75.0% of the energy costs over the reported AdderNet (with floating-point precision, denoted as FP32) while offering comparable accuracies (-1.79% ~ +0.18%), and achieves a much higher accuracy (7.2% ~ 37.1%) over the quantized AdderNet (FIX32/8) while consuming comparable or even less energy (-25.2% ~ 25.2%). Meanwhile, ShiftAddNet achieves 2.41% ~ 16.1% higher accuracies while requiring 34.1% ~ 70.9% less energy, as compared to DeepShift [4]. This set of results also verifies our Hypothesis (1) in Section 3.1 that integrating the weak shift and add players can lead to improved network expressive capacity with negligible or even lower hardware costs.

We also compare ShiftAddNet with the baselines in an apples-to-apples manner based on the same quantization format (e.g., FIX32). For example, when evaluated on VGG-19 with CIFAR-10 (see Fig. 2 (c)), ShiftAddNet consistently (1) improves the accuracies by 11.6%, 10.6%, and 37.1% as compared to AdderNet in FIX32/16/8 formats, at comparable energy costs (-25.2% ~ 15.7%); and (2) improves the accuracies by 26.8%, 26.2%, and 24.2% as compared to DeepShift (PS) in FIX32/16/8 formats, with comparable or slightly higher energy overheads.

Figure 2: Testing accuracy vs. energy cost of ShiftAddNet over AdderNet [5] (add only), DeepShift [4] (shift only), and the multiplication-based ConvNet [45], using ResNet-20 and VGG19-small models on CIFAR-10/100 and two IoT datasets.

Figure 3: Visualization of the testing accuracy trajectories of ShiftAddNet, AdderNet [5], and DeepShift [4] versus both training epochs and energy costs, when evaluated on ResNet-20 with CIFAR-10.

To further analyze ShiftAddNet's improved robustness to quantization, we compare the discriminative power of AdderNet and ShiftAddNet by visualizing the class divergences using the t-SNE algorithm [49], as shown in the supplement.

ShiftAddNet over SOTA on IoT applications. We further evaluate ShiftAddNet against the SOTA baselines on the two IoT datasets to evaluate its effectiveness on real-world IoT tasks. As shown in Fig. 2 (c) and (f), ShiftAddNet again consistently outperforms the baselines under all settings in terms of efficiency-accuracy trade-offs. Specifically, compared with AdderNet, ShiftAddNet achieves 34.1% ~ 80.9% energy cost reductions while offering 1.08% ~ 3.18% higher accuracies; and compared with DeepShift (PS), ShiftAddNet achieves 34.1% ~ 50.9% energy savings while improving accuracies by 5.5% ~ 6.9%. This set of experiments shows that ShiftAddNet's effectiveness and superiority extend to real-world IoT applications.
We also observe similar improved efficiency-accuracy trade-offs on the MHEALTH [37] and USC-HAD [39] datasets and report the performance in the supplement.

ShiftAddNet over SOTA on training trajectories. Fig. 3 (a) and (b) visualize the testing accuracy trajectories of ShiftAddNet and the two baselines versus both the training epochs and the energy cost, respectively, on ResNet-20 with CIFAR-10. We can see that ShiftAddNet achieves a comparable or higher accuracy with fewer epochs and less energy, indicating its better generalization capability.

### 4.3 ShiftAddNet over SOTA on domain adaptation and fine-tuning

To further evaluate the potential capability of ShiftAddNet for on-device learning [50], we consider the training settings of adaptation and fine-tuning:

Figure 4: Testing accuracy vs. energy cost of ShiftAddNet with fixed shift layers over AdderNet [5] (add only), DeepShift [4] (shift only), and the multiplication-based ConvNet [45], using the ResNet-20 and VGG19-small models on the CIFAR-10/100 and two IoT datasets.

Table 2: Adaptation and fine-tuning results comparison using ResNet-20 trained on CIFAR-10.

| Method | Adaptation Acc. (%) | Fine-tuning Acc. (%) | Energy Cost (MJ) |
|---|---|---|---|
| DeepShift | 58.41 | 51.31 | 9.88 |
| AdderNet | 79.79 | 84.23 | 25.41 |
| ShiftAddNet | 81.50 | 84.61 | 16.82 |
| ShiftAddNet (Fixed) | 85.10 | 84.88 | 11.04 |

Adaptation. We split CIFAR-10 into two non-overlapping subsets. We first pre-train the model on one subset and then retrain it on the other to see how accurately and efficiently it can adapt to the new task. The same split is applied to the test set.

Fine-tuning. Similarly, we randomly split CIFAR-10 into two non-overlapping subsets; the difference is that each subset contains all classes. After pre-training on the first subset, we fine-tune the model on the other, expecting to see continuous growth in performance.

Tab. 2 compares the testing accuracies and training energy costs of ShiftAddNet and the baselines. We can see that ShiftAddNet always achieves a better accuracy than the two SOTA multiplication-less networks. First, compared to AdderNet, ShiftAddNet boosts the accuracy by 5.31% and 0.65% on the adaptation and fine-tuning scenarios, respectively, while reducing the energy cost by 56.6%; second, compared to DeepShift, ShiftAddNet notably improves the accuracy by 26.69% and 33.57% on the adaptation and fine-tuning scenarios, respectively, with only marginally increased energy (10.5%).

### 4.4 Ablation studies of ShiftAddNet

We next study ShiftAddNet's shift and add layers to better understand this new network.

#### 4.4.1 ShiftAddNet: fixing the shift layers or not

ShiftAddNet with fixed shift layers. In this set of experiments, we study ShiftAddNet with the shift layers fixed or learnable.
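Before turning to the results, the following sketch shows how such an ablation could be set up, i.e., freezing and (optionally) randomly pruning the shift parameters; it assumes shift layers expose `sign` and `p` parameters as in the illustrative `ShiftConv2d` sketch of Sec. 3.4.1 and is not the released code:

```python
import torch

def fix_and_prune_shift_layers(model, prune_ratio=0.5):
    """Freeze all shift-layer parameters (s, p) and randomly zero a fraction of them.

    Errors still flow through the frozen shift layers during backpropagation;
    only the parameter updates (gradient computations) are skipped, cf. Sec. 3.4.3.
    """
    for m in model.modules():
        if hasattr(m, "sign") and hasattr(m, "p"):
            # random pruning mask over the fixed, large-grained anchor weights
            mask = (torch.rand_like(m.sign) > prune_ratio).float()
            with torch.no_grad():
                m.sign.mul_(mask)          # s = 0 disables the corresponding shift
            m.sign.requires_grad_(False)   # fixed shift layers: no weight updates
            m.p.requires_grad_(False)
    return model
```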
As shown in Fig. 4, we can see that (1) overall, ShiftAddNet with fixed shift layers can achieve up to 90.0% and 82.8% energy savings over AdderNet (with floating-point or fixed-point precision) and DeepShift, while leading to comparable or better accuracies (-3.74% ~ +31.2% and 3.5% ~ 23.6%), respectively; and (2) interestingly, ShiftAddNet with fixed shift layers also surpasses the generic ShiftAddNet in two aspects: first, it always demands less energy (25.2% ~ 40.9% savings) to achieve a comparable or even better accuracy; and second, it can even achieve a better accuracy and better robustness to quantization (up to 10.8% improvement for 8-bit fixed-point training) than the generic ShiftAddNet with learnable shift layers, when evaluated with VGG19-small on CIFAR-100.

Figure 5: Testing accuracy vs. training epochs for AdderNet [5] and the pruned ShiftAddNets on ResNet-20 with CIFAR-10.

ShiftAddNet with its fixed shift layers pruned. As it has become a common practice to prune multiplication-based DNNs before deploying them onto resource-constrained devices, we are curious whether this can be extended to our ShiftAddNet. To do so, we randomly prune the shift layers by 30%, 50%, 70%, and 90%, and compare the testing accuracy versus the training epochs for both the pruned ShiftAddNets and the corresponding AdderNet. Fig. 5 shows that ShiftAddNet maintains its fast-convergence benefit even when the shift layers are largely pruned (e.g., up to 70%).

#### 4.4.2 ShiftAddNet: sparsifying the add layers or not

Figure 6: Left: Histograms of the weights in the 11-th add layer of ResNet-20 trained on CIFAR-10, for (a) AdderNet and (b) ShiftAddNet. Right (c): Comparison of the test accuracies of ShiftAddNet and AdderNet under different pruning ratios (30%, 50%, 70%, 90%).

Sparsifying the add layers allows us to further reduce the number of parameters and save training costs. Similar to quantization, we observe that even slightly pruning AdderNet incurs an accuracy drop. As shown in Fig. 6, we visualize the distribution of weights in the 11-th add layer when using ResNet-20 as the backbone under different pruning ratios. Note that only non-zero weights are shown in the histograms for better visualization. We can see that networks with only adder layers, i.e., AdderNet, fail to provide a wide dynamic range for the weights (which collapse to narrow distributions) at high pruning ratios, while ShiftAddNet preserves a consistently wide dynamic range of weights. That explains the improved robustness of ShiftAddNet to sparsification. The test accuracy comparisons in Fig. 6 (c) demonstrate that when pruning 50% of the parameters in the add layers, ShiftAddNet can still achieve 80.42% test accuracy while the accuracy of AdderNet collapses to 51.47%.

## 5 Conclusion

We propose a multiplication-free ShiftAddNet for efficient DNN training and inference, inspired by the well-known shift-and-add hardware expertise, and show that ShiftAddNet achieves improved expressiveness and parameter efficiency, solving the drawbacks of networks with merely shift or add operations. Moreover, ShiftAddNet enables more flexible control of different levels of granularity in network training than ConvNet.
Interestingly, we find that fixing ShiftAddNet's shift layers even leads to a comparable or better accuracy for over-parameterized networks on our considered IoT applications. Extensive experiments and ablation studies demonstrate the superior energy efficiency, convergence, and robustness of ShiftAddNet over its add-only or shift-only counterparts. Many promising problems remain open for our proposed new network; an immediate future work is to explore the theoretical grounds of such fixed-layer regularization.

## Broader impact

Efficient DNN training goal. Recent DNN breakthroughs rely on massive data and computational power. Modern DNN training also requires massive yet inefficient multiplications in the convolutions, making DNN training very challenging and limiting practical applications on resource-constrained mobile devices. First, training DNNs causes prohibitive computational costs. For example, training a medium-scale DNN, ResNet-50, requires 10^18 floating-point operations (FLOPs) [51]. Second, DNN training has raised pressing environmental concerns. For instance, the carbon emission of training one DNN can be as high as the life-long emission of one American car [52, 50]. Therefore, efficient DNN training has become a very important research problem.

Generic hardware-inspired algorithm. To achieve the efficient training goal, this paper takes one step further along the direction of multiplication-less deep networks, by drawing on a very fundamental idea in hardware-design practice, computer processors, and even digital signal processing. It has long been known that multiplications can be performed with additions and logical bit-shifts [6], whose hardware implementation is very simple and much faster [8], without compromising the result quality or precision. This clever shortcut saves arithmetic operations, and can readily be applied to accelerating the hardware implementation of any machine learning algorithm involving multiplication (either scalar, vector, or matrix). But our curiosity goes well beyond this: we set out to learn from this hardware-level "shortcut" in designing efficient learning algorithms.

Societal consequences. Success of this project enables both efficient online training and inference of state-of-the-art DNNs on pervasive resource-constrained platforms and applications. As machine-learning-powered edge devices have penetrated all walks of life, the project is expected to generate tremendous impacts on societies and economies. Progress on this paper will enable ubiquitous DNN-powered intelligent functions in edge devices, across numerous camera-based Internet-of-Things (IoT) applications such as traffic monitoring, self-driving and smart cars, personal digital assistants, surveillance and security, and augmented reality. We believe the hardware-inspired ShiftAddNet is a significant efficient network training method that would make an impact on society.

## References

[1] Mark Horowitz. Energy table for 45nm process. Stanford VLSI wiki, 2014.

[2] Bichen Wu, Alvin Wan, Xiangyu Yue, Peter Jin, Sicheng Zhao, Noah Golmant, Amir Gholaminejad, Joseph Gonzalez, and Kurt Keutzer. Shift: A zero FLOP, zero parameter alternative to spatial convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9127-9135, 2018.

[3] Weijie Chen, Di Xie, Yuan Zhang, and Shiliang Pu.
All you need is a few shifts: Designing efficient convolutional neural networks for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7241-7250, 2019.

[4] Mostafa Elhoushi, Farhan Shafiq, Ye Tian, Joey Yiwei Li, and Zihao Chen. DeepShift: Towards multiplication-less neural networks. arXiv preprint arXiv:1905.13298, 2019.

[5] Hanting Chen, Yunhe Wang, Chunjing Xu, Boxin Shi, Chao Xu, Qi Tian, and Chang Xu. AdderNet: Do we really need multiplications in deep learning? In The IEEE Conference on Computer Vision and Pattern Recognition, 2020.

[6] Ping Xue and Bede Liu. Adaptive equalizer using finite-bit power-of-two quantizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(6):1603-1611, 1986.

[7] Y. Lin, S. Zhang, and N. R. Shanbhag. Variation-tolerant architectures for convolutional neural networks in the near threshold voltage regime. In 2016 IEEE International Workshop on Signal Processing Systems (SiPS), pages 17-22, 2016.

[8] Michele Marchesi, Gianni Orlandi, Francesco Piazza, and Aurelio Uncini. Fast neural networks without multipliers. IEEE Transactions on Neural Networks, 4(1):53-62, 1993.

[9] Getting Started Guide. Zynq-7000 all programmable SoC ZC706 evaluation kit (ISE design suite 14.7). 2012.

[10] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[11] Y. Lin, C. Sakr, Y. Kim, and N. Shanbhag. PredictiveNet: An energy-efficient convolutional neural network via zero prediction. In 2017 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1-4, 2017.

[12] Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications. arXiv preprint arXiv:1510.03009, 2015.

[13] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.

[14] Koen Helwegen, James Widdicombe, Lukas Geiger, Zechun Liu, Kwang-Ting Cheng, and Roeland Nusselder. Latent weights do not exist: Rethinking binarized neural network optimization. In Advances in Neural Information Processing Systems 32, pages 7533-7544, 2019.

[15] Dehua Song, Yunhe Wang, Hanting Chen, Chang Xu, Chunjing Xu, and Dacheng Tao. AdderSR: Towards energy efficient image super-resolution. arXiv preprint arXiv:2009.08891, 2020.

[16] Yixing Xu, Chang Xu, Xinghao Chen, Wei Zhang, Chunjing Xu, and Yunhe Wang. Kernel based progressive distillation for adder neural networks. arXiv preprint arXiv:2009.13044, 2020.

[17] Jose-Luis Sanchez-Romero, Antonio Jimeno-Morenilla, Rafael Molina-Carmona, and Jose Perez-Martinez. An approach to the application of shift-and-add algorithms on engineering and industrial processes. Mathematical and Computer Modelling, 57(7-8):1800-1806, 2013.

[18] Yue Wang, Ziyu Jiang, Xiaohan Chen, Pengfei Xu, Yang Zhao, Yingyan Lin, and Zhangyang Wang. E2-Train: Training state-of-the-art CNNs with over 80% energy savings. In Advances in Neural Information Processing Systems, pages 5138-5150, 2019.

[19] Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Richard G Baraniuk, Zhangyang Wang, and Yingyan Lin. Drawing early-bird tickets: Toward more efficient training of deep networks.
In International Conference on Learning Representations, 2019.

[20] Zhaohui Yang, Yunhe Wang, Chuanjian Liu, Hanting Chen, Chunjing Xu, Boxin Shi, Chao Xu, and Chang Xu. LegoNet: Efficient convolutional neural networks with lego filters. In International Conference on Machine Learning, pages 7005-7014, 2019.

[21] Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. GhostNet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1580-1589, 2020.

[22] Denny Zhou, Mao Ye, Chen Chen, Tianjian Meng, Mingxing Tan, Xiaodan Song, Quoc Le, Qiang Liu, and Dale Schuurmans. Go wide, then narrow: Efficient training of deep thin networks. arXiv preprint arXiv:2007.00811, 2020.

[23] Hanting Chen, Yunhe Wang, Han Shu, Yehui Tang, Chunjing Xu, Boxin Shi, Chao Xu, Qi Tian, and Chang Xu. Frequency domain compact 3D convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1641-1650, 2020.

[24] Weiyang Liu, Rongmei Lin, Zhen Liu, James M Rehg, Li Xiong, and Le Song. Orthogonal over-parameterized training. arXiv preprint arXiv:2004.04690, 2020.

[25] Felix Juefei-Xu, Vishnu Naresh Boddeti, and Marios Savvides. Local binary convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 19-28, 2017.

[26] Xilinx Inc. Xilinx Zynq-7000 SoC ZC706 evaluation kit. https://www.xilinx.com/products/boards-and-kits/ek-z7-zc706-g.html. (Accessed on 09/30/2020).

[27] Or Sharir and Amnon Shashua. On the expressive power of overlapping architectures of deep learning. In International Conference on Learning Representations, 2018.

[28] Huasong Zhong, Xianggen Liu, Yihui He, and Yuchun Ma. Shift-based primitives for efficient convolutional neural networks. arXiv preprint arXiv:1809.08458, 2018.

[29] Arman Afrasiyabi, Diaa Badawi, Baris Nasir, Ozan Yildi, Fatios T Yarman Vural, and A Enis Çetin. Non-euclidean vector product for neural networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6862-6866. IEEE, 2018.

[30] Chen Wang, Jianfei Yang, Lihua Xie, and Junsong Yuan. Kervolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

[31] Haobin Dou and Xihong Wu. Coarse-to-fine trained multi-scale convolutional neural networks for image classification. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1-7, 2015.

[32] Yang Zhao, Xiaohan Chen, Yue Wang, Chaojian Li, Haoran You, Yonggan Fu, Yuan Xie, Zhangyang Wang, and Yingyan Lin. SmartExchange: Trading higher-cost memory storage/access for lower-cost computation. arXiv preprint arXiv:2005.03403, 2020.

[33] Weitao Li, Pengfei Xu, Yang Zhao, Haitong Li, Yuan Xie, and Yingyan Lin. TIMELY: Pushing data movements and interfaces in PIM accelerators towards local and in time domain, 2020.

[34] B. Murmann. Mixed-signal computing for deep neural network inference. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, pages 1-11, 2020.

[35] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[36] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[37] Oresti Banos, Rafael Garcia, Juan A Holgado-Terriza, Miguel Damas, Hector Pomares, Ignacio Rojas, Alejandro Saez, and Claudia Villalonga. mHealthDroid: a novel framework for agile development of mobile health applications. In International Workshop on Ambient Assisted Living, pages 91-98. Springer, 2014.

[38] J. Tan, L. Niu, J. K. Adams, V. Boominathan, J. T. Robinson, R. G. Baraniuk, and A. Veeraraghavan. Face detection and verification using lensless cameras. IEEE Transactions on Computational Imaging, 5(2):180-194, June 2019.

[39] Mi Zhang and Alexander A Sawchuk. USC-HAD: a daily activity dataset for ubiquitous activity recognition using wearable sensors. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing, pages 1036-1043, 2012.

[40] Nicolas Gourier, Daniela Hall, and James L Crowley. Estimating face orientation from robust detection of salient facial structures. In FG Net Workshop on Visual Observation of Deictic Gestures, volume 6, page 7. FGnet (IST 2000 26434) Cambridge, UK, 2004.

[41] Vivek Boominathan, Jesse Adams, Jacob Robinson, and Ashok Veeraraghavan. PhlatCam: Designed phase-mask based thin lensless camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

[42] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations, 2019.

[43] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In International Conference on Automatic Face and Gesture Recognition, 2018.

[44] Wenchao Jiang and Zhaozheng Yin. Human activity recognition using wearable sensors by deep convolutional neural networks. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 1307-1310, 2015.

[45] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019.

[46] A. Beaumont-Smith, N. Burgess, S. Lefrere, and C. C. Lim. Reduced latency IEEE floating-point standard adder architectures. In Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336), pages 35-42, 1999.

[47] Yukuan Yang, Lei Deng, Shuang Wu, Tianyi Yan, Yuan Xie, and Guoqi Li. Training high-performance and large-scale deep neural networks with full 8-bit integers. Neural Networks, 2020.

[48] Ron Banner, Itay Hubara, Elad Hoffer, and Daniel Soudry. Scalable methods for 8-bit training of neural networks. In Advances in Neural Information Processing Systems, pages 5145-5153, 2018.

[49] Laurens van der Maaten. t-distributed stochastic neighbor embedding (t-SNE), 2014.

[50] Chaojian Li, Tianlong Chen, Haoran You, Zhangyang Wang, and Yingyan Lin. HALO: Hardware-aware learning to optimize. In The 16th European Conference on Computer Vision (ECCV 2020), 2020.

[51] Yang You, Zhao Zhang, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. ImageNet training in 24 minutes. arXiv preprint arXiv:1709.05011, 2017.

[52] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243, 2019.