# Winograd Algorithm for AdderNet

Wenshuo Li¹, Hanting Chen¹ ², Mingqiang Huang³, Xinghao Chen¹, Chunjing Xu¹, Yunhe Wang¹

¹Noah's Ark Lab, Huawei Technologies. ²Peking University. ³Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences. Correspondence to: Wenshuo Li, Yunhe Wang.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

## Abstract

Adder neural network (AdderNet) is a new kind of deep model that replaces the massive multiplications in convolutions by additions while preserving high performance. Since the hardware complexity of an addition is much lower than that of a multiplication, the overall energy consumption is reduced significantly. To further optimize the hardware overhead of AdderNet, this paper studies the Winograd algorithm, a widely used fast algorithm for accelerating convolution and saving computational costs. Unfortunately, the conventional Winograd algorithm cannot be applied to AdderNet directly, since the distributive law used in its derivation does not hold for the ℓ1-norm. We therefore replace the element-wise multiplication in the Winograd equation by additions and develop a new set of transform matrices that enhance the representation ability of the output features to maintain performance. Moreover, we propose an ℓ2-to-ℓ1 training strategy to mitigate the negative impact caused by the formal inconsistency. Experimental results on both FPGA and benchmarks show that the new method further reduces the energy consumption without affecting the accuracy of the original AdderNet.

Figure 1. Comparison of relative power consumption between CNN, Winograd CNN, AdderNet and Winograd AdderNet. All data are obtained with 8-bit fixed-point numbers. *: The relative power of Winograd CNN is estimated by theoretical analysis.

| Method | Relative Power |
|---|---|
| Winograd CNN* | 2.71 |
| AdderNet | 2.1 |
| Winograd AdderNet | 1 |

## 1. Introduction

The effectiveness of deep neural networks has been well demonstrated in a large variety of machine learning problems. With the rapid development of accessible datasets, learning theory and algorithms, and computing hardware, the performance of many computer vision tasks has been improved by these neural networks, especially convolutional neural networks (CNNs). (Krizhevsky et al., 2012) first applies a deep CNN to large-scale image classification, and a series of subsequent network architectures are explored for boosting accuracy, such as ResNet (He et al., 2016), EfficientNet (Tan & Le, 2019), and GhostNet (Han et al., 2020). In addition, a great number of networks have been presented for other tasks, including object detection (Tan et al., 2020), segmentation (Tao et al., 2020), and low-level computer vision tasks (Guo et al., 2019; Zhang & Patel, 2018; Ren et al., 2019). Although these models obtain state-of-the-art performance, most of them require massive computations and cannot be easily deployed on portable devices such as mobile phones, robots and self-driving cars. Reducing the computational cost of pre-trained deep neural networks without affecting their performance is therefore an important problem, and a series of works have explored removing network redundancy, such as pruning (Han et al., 2015), distillation (Hinton et al., 2015), and neural architecture search (Liu et al., 2018a).
Besides, according to the investigation in (Dally, 2015), energy consumption varies greatly with different operations and numeric precisions (e.g., the energy cost of a 32-bit multiplication is roughly 100× that of an 8-bit addition). Therefore, quantization has become the most common scheme for deploying deep neural networks on resource-limited devices. (Qiu et al., 2016) finds that 8-bit fixed-point numbers are sufficient for CNNs to achieve promising accuracy, and 8-bit fixed-point inference soon became common practice. Furthermore, (Courbariaux et al., 2016) proposes binary networks that quantize weights and activations to binary values (i.e., +1 and -1) for an extreme simplification of deep networks. (Rastegari et al., 2016) inserts a scale factor after each binarized layer to enhance the representational capability. (Lin et al., 2017) proposes ABC-Net, which uses a linear combination of binary bases to approximate floating-point weights and activations. (Liu et al., 2020) proposes RSign and RPReLU to learn distribution reshaping and shifting for enhancing performance. However, the main disadvantage of binary quantization is still a considerable loss of accuracy; e.g., the performance of recent binary networks is about 5% lower than that of baseline CNNs with the same architecture on the ImageNet benchmark.

Recently, (Chen et al., 2020) proposes the adder neural network (AdderNet), which uses the ℓ1-norm to calculate the output features. Since an addition is much cheaper than a multiplication (e.g., an 8-bit addition is about 7 times cheaper than an 8-bit multiplication), AdderNet can significantly reduce the energy consumption of a given CNN model with comparable performance (Xu et al., 2020).

Additionally, fast calculation algorithms, including the FFT (Mathieu et al., 2013) and the Winograd algorithm (Winograd, 1980), are widely used for improving the efficiency and reducing the computational complexity of deep neural networks. The Winograd algorithm is the most popular and effective method for accelerating CNNs (Lavin & Gray, 2016), since it performs best on 3×3 layers, which are the most commonly used layers in modern neural architectures. Follow-up work focuses on further optimizing the application of the Winograd algorithm to CNNs. To combine the Winograd algorithm with network pruning, training in the Winograd domain has been proposed, with little loss of accuracy (Liu et al., 2018b).

Although AdderNet can significantly reduce the overall energy cost of the resulting neural network, the energy consumption of convolutional layers can be further reduced by the Winograd algorithm, as shown in Figure 1. We are thus motivated to explore a fast calculation algorithm for adder layers to further reduce the energy cost of deep neural networks. However, because the distributive law is not applicable to the operation used in AdderNet (i.e., the sum of absolute values), the conventional Winograd algorithm cannot be used directly. Therefore, we first thoroughly analyze the difficulties of applying the Winograd algorithm to AdderNet and explore a new paradigm for optimizing the inference of adder layers. The main contributions of this paper are summarized as follows:

- We propose to inherit the original framework of the Winograd algorithm for optimizing AdderNet, and replace the original element-wise multiplication by the adder operation, i.e., the ℓ1-norm that uses only additions.
- We then analyze the unbalance of feature maps in the Winograd form of AdderNet, and investigate the optimal transform matrices that maximally enhance the representation ability of the new output features.
- We present an ℓ2-to-ℓ1 training strategy to adapt to the Winograd AdderNet paradigm and avoid a decline in network performance.
- Experiments conducted on benchmark datasets show that the performance of Winograd AdderNet is comparable to that of the baseline model, while achieving about 2.1× lower energy consumption on a Field Programmable Gate Array (FPGA).

## 2. Preliminaries

We briefly review AdderNet and the Winograd algorithm.

### 2.1. AdderNet

Different from convolutional neural networks, AdderNet (Chen et al., 2020) uses the ℓ1-norm to conduct the feed-forward process for extracting features. This method replaces multiplications with additions, which brings benefits in energy consumption and circuit area. The inference process is formulated as

$$Y(m,n,t) = -\sum_{i,j,k} \left| F(i,j,k,t) - X(m+i,n+j,k) \right|, \qquad (1)$$

where Y denotes the output features, F the weights and X the input features. In the backward process, the gradients with respect to the weights F and the feature maps X are approximated with the ℓ2-norm and the HardTanh function instead of the sign function, respectively:

$$\frac{\partial Y(m,n,t)}{\partial F(i,j,k,t)} = X(m+i,n+j,k) - F(i,j,k,t), \qquad (2)$$

$$\frac{\partial Y(m,n,t)}{\partial X(m+i,n+j,k)} = \mathrm{HT}\big(F(i,j,k,t) - X(m+i,n+j,k)\big), \qquad (3)$$

where HT(·) is short for the HardTanh function

$$\mathrm{HT}(x) = \begin{cases} x, & -1 < x < 1, \\ -1, & x \le -1, \\ 1, & x \ge 1. \end{cases}$$

Since the norms of the gradients in AdderNet are smaller than those in CNNs, the authors propose an adaptive learning rate for different layers. The learning rate for each layer l can be formulated as

$$\Delta F_l = \gamma \times \alpha_l \times \Delta L(F_l), \qquad (4)$$

$$\alpha_l = \frac{\eta \sqrt{k}}{\left\| \Delta L(F_l) \right\|_2}, \qquad (5)$$

where γ is the global learning rate, η is a hyper-parameter and k is the number of elements in F_l. Follow-up work expands the application scope, e.g., to super-resolution (Song et al., 2020). AdderNet has shown its potential to replace CNNs in many computer vision tasks and has attracted a lot of attention.

### 2.2. Winograd Algorithm

The Winograd algorithm (Winograd, 1980) is a widely used fast calculation method that accelerates convolution in the signal processing area. (Lavin & Gray, 2016) applies the Winograd algorithm to convolutional neural networks and largely reduces the computational cost of CNNs. Denote the length of the filter as r, the output length as m, and the corresponding Winograd algorithm as F(m, r). The Winograd algorithm for F(2, 3), in its 2-D nested form F(2×2, 3×3), can be formulated as

$$Y = A^T \left[ \left[ G g G^T \right] \odot \left[ B^T d B \right] \right] A, \qquad (6)$$

$$B^T = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}, \quad
G = \begin{bmatrix} 1 & 0 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 0 & 1 \end{bmatrix}, \quad
A^T = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{bmatrix}, \qquad (7)$$

where g is the 3×3 convolution filter, d is a 4×4 tile of the input feature map, and ⊙ denotes element-wise multiplication. A, G and B are the transform matrices for the output, the weights and the input, respectively. In the forward process, ĝ = GgG^T can be calculated in advance to reduce the calculation overhead, since g does not change. Therefore, Equation (6) can be reformulated as

$$Y = A^T \left[ \hat{g} \odot \left[ B^T d B \right] \right] A. \qquad (8)$$

Considering the complexity of the transformations and numeric precision issues, F(2×2, 3×3) is the most commonly used form in practice (Liu et al., 2018b; Yan et al., 2020). Moreover, with m > 2 or r > 3, the transform matrices are no longer composed of only 0 and ±1, making them harder to apply to AdderNet. Therefore, we focus on F(2×2, 3×3) in the following sections.
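To make the F(2×2, 3×3) form of Equations (6)-(8) concrete, the following is a small numeric sketch (ours, not from the paper; the function names are illustrative): it applies the transform matrices above to a single 4×4 input tile and checks the result against a direct 3×3 "valid" cross-correlation.

```python
import numpy as np

# Transform matrices of F(2x2, 3x3), as in Equation (7).
B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]], dtype=float)
G = np.array([[1, 0, 0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0, 0, 1]], dtype=float)
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)

def winograd_tile(d, g):
    """Equation (8): Y = A^T [ (G g G^T) * (B^T d B) ] A for one 4x4 tile."""
    g_hat = G @ g @ G.T              # pre-transformed 4x4 filter
    d_hat = B_T @ d @ B_T.T          # transformed 4x4 input tile
    return A_T @ (g_hat * d_hat) @ A_T.T   # 2x2 output tile

def direct_corr(d, g):
    """Direct 'valid' cross-correlation of a 4x4 tile with a 3x3 filter."""
    return np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                     for i in range(2)])

rng = np.random.default_rng(0)
d, g = rng.standard_normal((4, 4)), rng.standard_normal((3, 3))
assert np.allclose(winograd_tile(d, g), direct_corr(d, g))
```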
### 3.1. Winograd Algorithm for AdderNet

As AdderNet and the Winograd algorithm can both improve the efficiency of neural networks, we explore combining the two techniques to further reduce the computational cost. In this section, we introduce the vanilla form of the Winograd algorithm on AdderNet.

As shown in Equation (8), the Winograd algorithm consists of several parts: the pre-transformation of the filters GgG^T, the pre-transformation of the inputs B^TdB, the element-wise multiplications, and the output transformation using matrix A. Since the input pre-transformation and the output transformation only contain additions, we do not need to modify them. Therefore, we only replace the element-wise multiplications with the ℓ1-distance, and the calculation can be reformulated as

$$Y = A^T \left[ -\left| \hat{g} \ominus \left[ B^T d B \right] \right| \right] A, \qquad (9)$$

where A, G and B are the same as those introduced in Section 2.2, ⊖ denotes element-wise subtraction (subtraction and addition are essentially the same operation, since a subtraction can be implemented by adding the complement), and |·| denotes the element-wise absolute value.

Here we give a brief analysis of the complexity of the Winograd algorithm for AdderNet. The calculation consists of four parts: the weight pre-transformation, the input pre-transformation, the element-wise subtraction and absolute value, and the output transformation. The transformation of the weights can be calculated before deployment, so we do not take this part into account. Denote the shape of the input features as (N, C_in, X_h, X_w) and the shape of the weights as (C_out, C_in, K_h, K_w). The input features can be divided into N·C_in·(X_h/2)·(X_w/2) tiles for applying the Winograd algorithm. Each tile requires 3 additions, since each column and row of matrix B has two non-zero values (+1 or −1), which means each element of B^TdB is the sum of four values. For the element-wise subtraction and absolute value of weights and features, each tile requires 16 additions, and there are N·C_out·C_in·(X_h/2)·(X_w/2) tiles in total. Since the results of the subtraction and absolute value also need to be accumulated, the number of additions is doubled, resulting in N·C_out·C_in·(X_h/2)·(X_w/2)·16·2 additions. The output features can be divided into N·C_out·(X_h/2)·(X_w/2) tiles and each tile needs 8 additions, since matrix A has 3 non-zero values in each column. The total number of additions of the three parts is

$$N \cdot \frac{X_h}{2} \cdot \frac{X_w}{2} \cdot \left( C_{out} C_{in} \cdot 16 \cdot 2 + C_{in} \cdot 3 + C_{out} \cdot 8 \right). \qquad (10)$$

Since the values of C_in and C_out are generally dozens or hundreds, the last two terms can be ignored, and formula (10) becomes

$$N \cdot X_h \cdot X_w \cdot C_{out} \cdot C_{in} \cdot 8. \qquad (11)$$

The total number of additions of the original AdderNet is

$$N \cdot X_h \cdot X_w \cdot C_{in} \cdot C_{out} \cdot 9 \cdot 2, \qquad (12)$$

since each output element requires 9 subtraction-and-absolute operations and 9 accumulations per input channel. Thus, the Winograd algorithm for AdderNet only requires about 4/9 of the additions of the original AdderNet.

However, since the absolute value is used in the calculation of AdderNet, the distributive law is not valid, so the Winograd form in Equation (9) is not equal to the original adder operation in Equation (1). Moreover, the accumulated absolute values put forward higher requirements for the output transform matrix A. Let X = -|ĝ ⊖ [B^TdB]|, and denote the elements of X and Y as

$$X = \begin{bmatrix} x_0 & x_1 & x_2 & x_3 \\ x_4 & x_5 & x_6 & x_7 \\ x_8 & x_9 & x_{10} & x_{11} \\ x_{12} & x_{13} & x_{14} & x_{15} \end{bmatrix}, \quad Y = \begin{bmatrix} y_0 & y_1 \\ y_2 & y_3 \end{bmatrix}.$$

Expanding the equation Y = A^T X A, we get

$$\begin{aligned}
y_0 &= x_0 + x_1 + x_2 + x_4 + x_5 + x_6 + x_8 + x_9 + x_{10}, \\
y_1 &= x_1 - x_2 - x_3 + x_5 - x_6 - x_7 + x_9 - x_{10} - x_{11}, \\
y_2 &= x_4 + x_5 + x_6 - x_8 - x_9 - x_{10} - x_{12} - x_{13} - x_{14}, \\
y_3 &= x_5 - x_6 - x_7 - x_9 + x_{10} + x_{11} - x_{13} + x_{14} + x_{15}.
\end{aligned}$$
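As a companion sketch (again ours, not the authors' code), the following evaluates the vanilla Winograd adder form of Equation (9) on one 4×4 tile and compares it with the original adder output of Equation (1) on the same tile; the two generally differ, which illustrates the gap caused by the invalid distributive law.

```python
import numpy as np

B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=float)
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

def winograd_adder_tile(d, g):
    """Equation (9) on one tile: Y = A^T [ -|g_hat - B^T d B| ] A."""
    g_hat = G @ g @ G.T
    X = -np.abs(g_hat - B_T @ d @ B_T.T)   # every entry of X is non-positive
    return A_T @ X @ A_T.T

def adder_tile(d, g):
    """Equation (1) on the same tile: Y(i, j) = -sum |g - d_window|."""
    return np.array([[-np.sum(np.abs(g - d[i:i+3, j:j+3])) for j in range(2)]
                     for i in range(2)])

rng = np.random.default_rng(0)
d, g = rng.standard_normal((4, 4)), rng.standard_normal((3, 3))
print(winograd_adder_tile(d, g))
print(adder_tile(d, g))   # generally different: the distributive law does not hold for |.|
```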
From the expansion of y0 to y3 above, we can find that the number of additions and the number of subtractions in each equation are not the same, while the magnitudes of the individual x's are usually similar. Since all elements of X are non-positive, the magnitude of the output feature Y is not consistent across the positions y_i. This unbalance between different positions obviously affects the performance of the network. In the next section, we introduce our method to mitigate the influence of these two problems.

### 3.2. Optimal Transform Matrix

As discussed above, applying the Winograd algorithm to AdderNet reduces the overall computational complexity of inference. In this section, we explore new transform matrices to solve the feature-unbalance problem. There are two requirements on the transform matrices:

- The modified transform matrices should balance the magnitudes of all positions of the output feature Y.
- The output of the Winograd algorithm transformed by the modified matrices should be equal to that of the original form in CNNs.

The first requirement ensures that the outputs of a Winograd AdderNet layer have similar magnitudes, which can be properly handled by the following layers (BatchNorm and ReLU). The second requirement retains the basic characteristic of the conventional Winograd algorithm and keeps it applicable to CNNs.

**Theorem 1** The general solution of the Winograd form F(2, 3), Y = A^T[[GgG^T] ⊙ [B^TdB]]A, is given by the matrices

$$A = \begin{bmatrix} \alpha_0 & -\alpha_0 c_0 \\ \beta_0 & -\beta_0 c_1 \\ \gamma_0 & -\gamma_0 c_2 \\ 0 & \delta_0 \end{bmatrix}, \quad
G = \begin{bmatrix} \frac{\alpha_1}{(c_1-c_0)(c_2-c_0)} & \frac{-\alpha_1 c_0}{(c_1-c_0)(c_2-c_0)} & \frac{\alpha_1 c_0^2}{(c_1-c_0)(c_2-c_0)} \\ \frac{\beta_1}{(c_0-c_1)(c_2-c_1)} & \frac{-\beta_1 c_1}{(c_0-c_1)(c_2-c_1)} & \frac{\beta_1 c_1^2}{(c_0-c_1)(c_2-c_1)} \\ \frac{\gamma_1}{(c_0-c_2)(c_1-c_2)} & \frac{-\gamma_1 c_2}{(c_0-c_2)(c_1-c_2)} & \frac{\gamma_1 c_2^2}{(c_0-c_2)(c_1-c_2)} \\ 0 & 0 & \delta_1 \end{bmatrix},$$

$$B = \begin{bmatrix} \frac{c_1 c_2}{\alpha_0\alpha_1} & \frac{c_0 c_2}{\beta_0\beta_1} & \frac{c_0 c_1}{\gamma_0\gamma_1} & \frac{c_0 c_1 c_2}{\delta_0\delta_1} \\ \frac{c_1+c_2}{\alpha_0\alpha_1} & \frac{c_0+c_2}{\beta_0\beta_1} & \frac{c_0+c_1}{\gamma_0\gamma_1} & \frac{c_0c_1+c_0c_2+c_1c_2}{\delta_0\delta_1} \\ \frac{1}{\alpha_0\alpha_1} & \frac{1}{\beta_0\beta_1} & \frac{1}{\gamma_0\gamma_1} & \frac{c_0+c_1+c_2}{\delta_0\delta_1} \\ 0 & 0 & 0 & \frac{1}{\delta_0\delta_1} \end{bmatrix},$$

in which c_0, c_1 and c_2 are arbitrary distinct rational numbers, and α_i, β_i, γ_i and δ_i, i = 0, 1, are arbitrary non-zero real numbers.
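Before the proof, here is a small numeric sanity check of Theorem 1 that we add for illustration (the function name theorem1_matrices and the particular parameter values are ours): it instantiates the general A, G and B for an arbitrary choice of points and coefficients and confirms that the resulting Winograd form reproduces a direct 3×3 correlation on a 4×4 tile.

```python
import numpy as np

def theorem1_matrices(c, alpha=(1.0, 1.0), beta=(1.0, 1.0), gamma=(1.0, 1.0), delta=(1.0, 1.0)):
    """Build the general F(2, 3) transform matrices of Theorem 1 for distinct points c = (c0, c1, c2)."""
    c0, c1, c2 = c
    a0, a1 = alpha; b0, b1 = beta; g0, g1 = gamma; d0, d1 = delta
    A = np.array([[a0, -a0 * c0],
                  [b0, -b0 * c1],
                  [g0, -g0 * c2],
                  [0.0, d0]])
    G = np.array([[a1, -a1 * c0, a1 * c0 ** 2],
                  [b1, -b1 * c1, b1 * c1 ** 2],
                  [g1, -g1 * c2, g1 * c2 ** 2],
                  [0.0, 0.0, d1]])
    G[0] /= (c1 - c0) * (c2 - c0)
    G[1] /= (c0 - c1) * (c2 - c1)
    G[2] /= (c0 - c2) * (c1 - c2)
    B = np.array([[c1 * c2 / (a0 * a1), c0 * c2 / (b0 * b1), c0 * c1 / (g0 * g1), c0 * c1 * c2 / (d0 * d1)],
                  [(c1 + c2) / (a0 * a1), (c0 + c2) / (b0 * b1), (c0 + c1) / (g0 * g1),
                   (c0 * c1 + c0 * c2 + c1 * c2) / (d0 * d1)],
                  [1 / (a0 * a1), 1 / (b0 * b1), 1 / (g0 * g1), (c0 + c1 + c2) / (d0 * d1)],
                  [0.0, 0.0, 0.0, 1 / (d0 * d1)]])
    return A, G, B

rng = np.random.default_rng(1)
A, G, B = theorem1_matrices(c=(0.0, -1.0, 2.0), alpha=(1.0, -0.5), delta=(-2.0, 1.0))
d, g = rng.standard_normal((4, 4)), rng.standard_normal((3, 3))
Y = A.T @ ((G @ g @ G.T) * (B.T @ d @ B)) @ A
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)] for i in range(2)])
assert np.allclose(Y, direct)
```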
**Proof** For the Winograd form F(2, 3), denote the input sequence of length two as [y_0, y_1] and the filter sequence of length three as [g_0, g_1, g_2]. The result of the convolution operation, [d_0, d_1, d_2, d_3], is

$$d_\tau = (y * g)(\tau) = \sum_{t} y_t\, g_{\tau - t},$$

$$d_0 = y_0 g_0, \quad d_1 = y_0 g_1 + y_1 g_0, \quad d_2 = y_0 g_2 + y_1 g_1, \quad d_3 = y_1 g_2.$$

The result of the convolution can also be derived from the product of the two polynomials y(n) and g(n):

$$y(n) = y_0 + y_1 n, \qquad g(n) = g_0 + g_1 n + g_2 n^2,$$

$$d(n) = y(n)\,g(n) = y_0 g_0 + (y_0 g_1 + y_1 g_0)\,n + (y_0 g_2 + y_1 g_1)\,n^2 + y_1 g_2\,n^3.$$

The coefficient of the n^i term of d(n), i = 0, ..., 3, is exactly the i-th term of the convolution, so we can obtain the convolution result by computing the polynomial coefficients. In order to solve for the polynomial coefficients, we construct an equivalent transformation of d(n): we divide d(n) into two parts, a product of mutually prime polynomials M(n) and a remainder d'(n). The degree of M(n) is the same as that of d(n), so the quotient of d(n) by M(n) is a constant. The problem of solving the polynomial coefficients is thereby converted into the problem of solving remainders. Denote three relatively prime polynomials as

$$m_0(n) = a_0 n + b_0, \quad m_1(n) = a_1 n + b_1, \quad m_2(n) = a_2 n + b_2,$$

in which a_0, a_1, a_2 are arbitrary non-zero integers and b_0, b_1, b_2 are arbitrary integers. Then

$$M(n) = (a_0 n + b_0)(a_1 n + b_1)(a_2 n + b_2) = a_0 a_1 a_2 (n + c_0)(n + c_1)(n + c_2), \qquad (13, 14)$$

$$c_0 = b_0/a_0, \quad c_1 = b_1/a_1, \quad c_2 = b_2/a_2, \qquad (15)$$

where c_0, c_1 and c_2 are arbitrary rational numbers, and

$$d(n) = t\,M(n) + d'(n), \qquad t = y_1 g_2. \qquad (16)$$

We use the Extended Euclidean algorithm to solve for the inverse elements $\left[\left(\frac{M(n)}{m_i(n)}\right)^{-1}\right]_{m_i(n)}$. The results are

$$\frac{1}{(c_1-c_0)(c_2-c_0)}, \quad \frac{1}{(c_0-c_1)(c_2-c_1)}, \quad \frac{1}{(c_0-c_2)(c_1-c_2)}$$

for m_0(n), m_1(n), m_2(n), respectively. According to the Chinese remainder theorem,

$$d'(n) = \sum_{i=0}^{2} d'_i(n)\, \frac{M(n)}{m_i(n)} \left[ \left( \frac{M(n)}{m_i(n)} \right)^{-1} \right]_{m_i(n)}. \qquad (17)$$

Substituting into Equation (16), we get

$$d'_0(n) = \frac{(y_0 - c_0 y_1)(c_0^2 g_2 - c_0 g_1 + g_0)}{(c_1 - c_0)(c_2 - c_0)}, \qquad (18)$$

$$d'_1(n) = \frac{(y_0 - c_1 y_1)(c_1^2 g_2 - c_1 g_1 + g_0)}{(c_0 - c_1)(c_2 - c_1)}, \qquad (19)$$

$$d'_2(n) = \frac{(y_0 - c_2 y_1)(c_2^2 g_2 - c_2 g_1 + g_0)}{(c_1 - c_2)(c_0 - c_2)}, \qquad (20)$$

$$t = y_1 g_2, \qquad (21)$$

and substituting these back gives

$$\begin{aligned}
d(n) = \; & t\,n^3 + \left[ d'_0(n) + d'_1(n) + d'_2(n) + (c_0 + c_1 + c_2)\,t \right] n^2 \\
& + \left[ (c_1 + c_2)\,d'_0(n) + (c_0 + c_2)\,d'_1(n) + (c_0 + c_1)\,d'_2(n) + (c_0 c_1 + c_0 c_2 + c_1 c_2)\,t \right] n \\
& + \left[ c_1 c_2\, d'_0(n) + c_0 c_2\, d'_1(n) + c_0 c_1\, d'_2(n) + c_0 c_1 c_2\, t \right].
\end{aligned}$$

From Equations (18)-(21), we have

$$\begin{bmatrix} \alpha_0\alpha_1 d'_0(n) \\ \beta_0\beta_1 d'_1(n) \\ \gamma_0\gamma_1 d'_2(n) \\ \delta_0\delta_1 t \end{bmatrix}
= \left( A \begin{bmatrix} y_0 \\ y_1 \end{bmatrix} \right) \odot \left( G \begin{bmatrix} g_0 \\ g_1 \\ g_2 \end{bmatrix} \right),$$

with A and G as given in Theorem 1. Since the division of d'_0(n), d'_1(n), d'_2(n) and t between the two factors is arbitrary, we add the coefficients α_i, β_i, γ_i and δ_i, i = 0, 1, to maintain the generality of the solution. Moreover, the order of the rows of matrix A and matrix G can be swapped simultaneously, as long as the corresponding columns of matrix B are also swapped. The coefficients of the polynomial d(n) can then be represented as

$$\begin{bmatrix} d_0 \\ d_1 \\ d_2 \\ d_3 \end{bmatrix} = B \begin{bmatrix} \alpha_0\alpha_1 d'_0(n) \\ \beta_0\beta_1 d'_1(n) \\ \gamma_0\gamma_1 d'_2(n) \\ \delta_0\delta_1 t \end{bmatrix},$$

with B as given in Theorem 1. We thus obtain the convolution result d = B[(Gg) ⊙ (Ay)]. For the correlation operation needed in CNNs, the roles of the two sides are exchanged: d becomes the input and y the output, so y = A^T[(Gg) ⊙ (B^T d)]. Nesting the 1-D result with itself yields the 2-D result Y = A^T[(GgG^T) ⊙ (B^T d B)]A. ∎

In order to reduce the amount of calculation during inference, the elements of matrix A should be chosen from {0, +1, −1} to avoid shift or multiplication operations, so c_0, c_1 and c_2 can only be chosen from 0, +1 and −1, one each. Without loss of generality, setting α_1 = −1, δ_0 = −1 and all other coefficients to 1 recovers the standard Winograd algorithm for convolution as in Equations (6)-(7).

Denote the number of +1 entries in column i of matrix A as p_i and the number of −1 entries as k − p_i, where k is the total number of non-zero elements in that column. For the matrices above, k = 3 in every column of A.

**Theorem 2** For any i, j, m, n, the numbers of additions and of subtractions in the calculations of the output features Y_{i,j} and Y_{m,n} are respectively equal if and only if p_i = p_j = p_m = p_n.

**Proof** For any i, j, m, n, if the number of additions in Y_{i,j} equals that in Y_{m,n}, then

$$p_i p_j + (k - p_i)(k - p_j) = p_m p_n + (k - p_m)(k - p_n),$$

$$k\left[ (p_i + p_j) - (p_m + p_n) \right] = 2\left( p_i p_j - p_m p_n \right).$$

Since i, j, m, n are all arbitrary, for the equation to always hold we need (p_i + p_j) − (p_m + p_n) = 0 and p_i p_j − p_m p_n = 0, that is, p_i = p_m, p_j = p_n or p_i = p_n, p_j = p_m. Because i, j, m, n are arbitrary, this implies p_i = p_j = p_m = p_n. ∎

Based on the conclusions deduced above, we can modify the coefficients α_i, β_i, γ_i and δ_i, i = 0, 1, so that the numbers of +1 and −1 entries are the same in every column of matrix A. It is easy to verify that only four matrices A_i, i = 0, ..., 3, meet this requirement:

$$A_0^T = \begin{bmatrix} 1 & 1 & -1 & 0 \\ 0 & 1 & 1 & -1 \end{bmatrix}, \quad
A_1^T = \begin{bmatrix} 1 & -1 & -1 & 0 \\ 0 & -1 & 1 & -1 \end{bmatrix}, \quad
A_2^T = \begin{bmatrix} -1 & 1 & 1 & 0 \\ 0 & 1 & -1 & 1 \end{bmatrix}, \quad
A_3^T = \begin{bmatrix} -1 & -1 & 1 & 0 \\ 0 & -1 & -1 & 1 \end{bmatrix}.$$

Correspondingly, we obtain the matrices G_i, i = 0, ..., 3.
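As a quick check of Theorem 2 (our sketch; the labeling of A_0 follows the reconstruction above), the following counts the + and − terms received by each output position y_ij under the standard A and under a balanced A_i; with the balanced matrix every position receives 5 positive and 4 negative terms.

```python
import numpy as np

def sign_counts(A_T):
    """(+, -) term counts of y_ij = sum_{k,l} A[k,i] A[l,j] x_{kl} for each output position."""
    A = A_T.T                                  # 4x2 output transform matrix
    counts = {}
    for i in range(2):
        for j in range(2):
            s = np.outer(A[:, i], A[:, j])     # signs of the (at most 9 non-zero) terms
            counts[(i, j)] = (int((s > 0).sum()), int((s < 0).sum()))
    return counts

A_std = np.array([[1, 1, 1, 0], [0, 1, -1, -1]])  # standard A^T of Equation (7)
A_bal = np.array([[1, 1, -1, 0], [0, 1, 1, -1]])  # a balanced choice (our A_0^T above)

print(sign_counts(A_std))  # y00 gets (9, 0), y01 and y10 get (3, 6), y11 gets (5, 4)
print(sign_counts(A_bal))  # every position gets (5, 4)
```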
### 3.3. Training with ℓ2-to-ℓ1 Distance

Although the feature-unbalance problem can be solved by the optimal transform matrices proposed in the previous section, the Winograd form of AdderNet is still not equal to its original form. Here we introduce a training method to mitigate the gap between Winograd AdderNet and the original AdderNet.

According to the training strategy described in AdderNet (Chen et al., 2020), the output of an ℓ2-AdderNet can be regarded as a linear transformation of that of a conventional CNN. For the Winograd algorithm, the ℓ2-norm also provides a better approximation of multiplication, since it is composed of squares and multiplications:

$$Y = A^T \left[ -\left( \left[ G g G^T \right] - \left[ B^T d B \right] \right)^2 \right] A
  = A^T \left[ 2\left[ G g G^T \right] \odot \left[ B^T d B \right] - \left[ G g G^T \right]^2 - \left[ B^T d B \right]^2 \right] A,$$

where the squares are taken element-wise. However, the ℓ1-norm is more hardware friendly than the ℓ2-norm since it requires no multiplications. To improve the network accuracy while maintaining the hardware efficiency, we propose to train AdderNet after applying the Winograd algorithm with an ℓ2-to-ℓ1 distance paradigm. Denoting the element-wise difference in the adder layer as

$$e = F(i,j,k,t) - X(m+i,n+j,k), \qquad (22)$$

the forward process during training is formulated as

$$Y(m,n,t) = -\sum_{i,j,k} |e|^p, \qquad (23)$$

and the backward process as

$$\frac{\partial Y(m,n,t)}{\partial X(m+i,n+j,k)} = p\,|e|^{p-1}\,\mathrm{sign}(e), \qquad (24)$$

$$\frac{\partial Y(m,n,t)}{\partial F(i,j,k,t)} = p\,|e|^{p-1}\,\mathrm{sign}(-e). \qquad (25)$$

During the training process, we gradually reduce the exponent p from 2 to 1, so that the forward and backward processes finally become

$$Y(m,n,t) = -\sum_{i,j,k} |e|, \qquad (26)$$

$$\frac{\partial Y(m,n,t)}{\partial X(m+i,n+j,k)} = \mathrm{sign}(e), \qquad (27)$$

$$\frac{\partial Y(m,n,t)}{\partial F(i,j,k,t)} = \mathrm{sign}(-e). \qquad (28)$$

To ensure the continuity of the approximation, we do not apply the ℓ2 gradient for F and the HardTanh gradient for X as in AdderNet. The adaptive learning rate of Equation (5) proposed in AdderNet is adopted to stabilize the training process.

There are several strategies for reducing the exponent p; we denote the reduction step as s:

- Training until convergence and then reducing p: train the network with a cosine-annealing learning rate until the learning rate is close to 0, then reduce p by a certain step s and restart the training process.
- Reducing p during convergence: reduce p every k epochs of the training process, with the step s set to k epochs.

We give a detailed analysis of the different strategies in the experimental results. Since the kernel transformation ĝ = GgG^T of the Winograd algorithm is not equivalent for AdderNet, we do not perform the weight transformation during the training process. Instead, we directly train the weights in the Winograd domain. We also compare different ways of dealing with the weights in the ablation study.
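A minimal PyTorch sketch of the element-wise ℓ2-to-ℓ1 operation (our illustration, not the authors' implementation): it computes −|e|^p on Winograd-domain tensors and backpropagates with Equations (24)-(25); the p_schedule shown is only one plausible reading of the "reduce p every k epochs" strategy.

```python
import torch

class AdderPNorm(torch.autograd.Function):
    """Element-wise -|e|^p of Eq. (23) on Winograd-domain tensors, with the gradients of Eqs. (24)-(25)."""

    @staticmethod
    def forward(ctx, d_hat, g_hat, p):
        e = g_hat - d_hat                      # e = F - X of Eq. (22), here in the transformed domain
        ctx.save_for_backward(e)
        ctx.p = p
        return -e.abs().pow(p)

    @staticmethod
    def backward(ctx, grad_out):
        (e,) = ctx.saved_tensors
        mag = e.abs().clamp_min(1e-12).pow(ctx.p - 1)     # |e|^(p-1), guarded at zero
        grad_d = grad_out * ctx.p * mag * torch.sign(e)   # Eq. (24): gradient w.r.t. the input
        grad_g = grad_out * ctx.p * mag * torch.sign(-e)  # Eq. (25): gradient w.r.t. the weights
        return grad_d, grad_g, None

def p_schedule(epoch, period=35, step=0.05):
    """One plausible (assumed) schedule: start at p = 2 and reduce by `step` every `period` epochs, floored at 1."""
    return max(1.0, 2.0 - step * (epoch // period))

# usage on one pair of transformed tiles: X = AdderPNorm.apply(d_hat, g_hat, p_schedule(epoch))
```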
## 4. Experiments

We now conduct experiments to show the effectiveness of the proposed Winograd algorithm for AdderNet. The experiments are performed on several commonly used datasets, including MNIST, CIFAR and ImageNet. We also provide ablation studies and feature visualizations to give insight into our methods. All experiments are implemented in PyTorch and run on NVIDIA Tesla V100 GPUs. In all experiments we use the transform matrices A_0 and G_0; the other A_i and G_i matrices achieve similar results.

### 4.1. Classification

**Experiments on MNIST.** First, we evaluate our method on the MNIST dataset. To facilitate the Winograd algorithm, we replace the 5×5 layers with 3×3 layers in the original LeNet-5-BN (LeCun et al., 1998). The detailed network structure is shown in the supplemental material. The learning rate is set to 0.1 at the beginning and decays with a cosine schedule over the following 100 epochs. We use the SGD optimizer with momentum 0.9, and the batch size is set to 256. The original AdderNet achieves 99.28% accuracy, while the Winograd AdderNet achieves 99.19%. Meanwhile, the Winograd AdderNet requires only 401.1M additions instead of the 746.8M additions required by the original AdderNet.

**Experiments on CIFAR.** We also evaluate Winograd AdderNet on the CIFAR datasets, including CIFAR-10 and CIFAR-100. The data settings are the same as in (He et al., 2016). The initial learning rate is set to 0.1 and then decays with a cosine learning rate schedule. The model is trained for 800 epochs with a training batch size of 256. The hyper-parameter η in Equation (5) is set to 0.1. For a fair comparison, we follow the settings in (Chen et al., 2020) and keep the first and last layers as full-precision convolutional layers. The results are shown in Table 1; the results of CNN and AdderNet are from (Chen et al., 2020). We only count the additions of the adder part instead of the whole neural network. At the cost of a small accuracy loss, the number of additions is reduced by more than 50%.

Table 1. Results on the CIFAR-10 and CIFAR-100 datasets

| Model | Method | #Mul | #Add | CIFAR-10 Accuracy | CIFAR-100 Accuracy |
|---|---|---|---|---|---|
| ResNet-20 | Winograd CNN | 19.40M | 19.84M | 92.25% | 68.14% |
| ResNet-20 | AdderNet | - | 80.74M | 91.84% | 67.60% |
| ResNet-20 | Winograd AdderNet | - | 39.24M | 91.56% | 67.96% |
| ResNet-32 | Winograd CNN | 31.98M | 32.74M | 93.29% | 69.74% |
| ResNet-32 | AdderNet | - | 137.36M | 93.01% | 69.02% |
| ResNet-32 | Winograd AdderNet | - | 64.72M | 92.34% | 69.87% |

**Experiments on ImageNet.** ImageNet is a large-scale vision dataset consisting of 224×224-pixel RGB images. We train ResNet-18 following the original data settings in (He et al., 2016). We train Winograd AdderNet on 8 GPUs with batch size 512, and the total number of training epochs is 150. The weight decay is set to 0.0001 and the momentum to 0.9. The hyper-parameter η is set to 0.05 for Winograd AdderNet. Experimental results are shown in Figure 2, and we use the AdderNet baseline from the original paper (Chen et al., 2020). Winograd AdderNet achieves 66.2% top-1 accuracy and 86.8% top-5 accuracy with ResNet-18, which is slightly lower than AdderNet (67.0% top-1 / 87.6% top-5). If we extend the training to 250 epochs, Winograd AdderNet achieves 66.5% top-1 accuracy while AdderNet gets no improvement. Besides, Winograd AdderNet uses only 1.72G adder operations compared with 3.39G in AdderNet.

Figure 2. Training accuracy and training loss of Winograd AdderNet ResNet-18 on ImageNet.
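For concreteness, here is a schematic of the CIFAR training setup described above (our sketch; the model object and the data loop are placeholders), combining SGD with momentum, cosine annealing, and the adaptive gradient scaling of Equations (4)-(5).

```python
import torch

# Placeholders standing in for a Winograd AdderNet model and its adder-layer weights.
model = torch.nn.Linear(32 * 32 * 3, 10)
adder_weights = [p for p in model.parameters() if p.dim() > 1]
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=800)  # cosine decay over 800 epochs
eta = 0.1                                         # hyper-parameter of Equation (5), 0.1 for CIFAR

def scale_adder_gradients(weights, eta):
    """Adaptive per-layer gradient scaling of Equations (4)-(5)."""
    for w in weights:
        if w.grad is not None:
            k = w.grad.numel()                    # number of elements in the layer's weights
            w.grad.mul_(eta * k ** 0.5 / w.grad.norm(p=2).clamp_min(1e-12))

for epoch in range(800):                          # batch size 256 in the text
    # ... forward/backward over the training set, then:
    scale_adder_gradients(adder_weights, eta)
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```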
### 4.2. Ablation Study

In this part, we evaluate the effectiveness of our proposed transform matrix and ℓ2-to-ℓ1 training method. All experiments in this section are conducted with a ResNet-18 network on the CIFAR-10 dataset.

First, we compare different methods for reducing the parameter p in the ℓ2-to-ℓ1 training strategy. The total number of training epochs is set to 800 for a fair comparison. The evaluation results are shown in Table 3. We find that reducing p during the convergence process with p = 35 works best. We provide the curves of training loss (green lines) and accuracy (blue lines) in the upper part of Figure 5. When p = 140, the training process is unstable and suffers an accuracy drop (−0.12%, as shown in Table 3). In addition, the lower part of Figure 5 shows the values of the weights during norm reduction with different settings. When p = 35, the curve of the weight norm is the closest to that of using only the ℓ2-norm, i.e., the network trained with our method successfully approximates the ℓ2-norm Winograd AdderNet. Thus, we set p = 35 in our experiments and obtain better performance.

Table 3. Ablation study on the reduction method of p

| Method | Accuracy |
|---|---|
| Training until converge | 89.24 |
| Reducing during converge with p = 1 | 90.94 |
| Reducing during converge with p = 35 | 91.56 |
| Reducing during converge with p = 140 | 91.44 |

Figure 5. Upper: training loss and accuracy curves. Lower: absolute mean value of the weights during training under the ℓ1-norm, p = 1, p = 35, p = 140 and ℓ2-norm settings.

We also compare three ways of dealing with the kernel transformation. The first is the same as the Winograd algorithm for convolutional layers: we apply the kernel transformation (KT) to the weights during every inference pass and update the original 3×3 kernel. The second and third ways train the network in the Winograd domain: we initialize the weights with a normal distribution either directly for the 4×4 Winograd kernel or for the 3×3 original adder kernel, respectively. The results are shown in Table 4. Training with the kernel transformation has the worst performance, since the inconsistent transform makes training harder. The other two ways give similar results, so we recommend directly initializing the Winograd kernel for convenience.

Table 4. Ablation study on the kernel transformation

| Method | Accuracy |
|---|---|
| Training w/ KT | 89.19 |
| Training w/o KT, init Winograd kernel | 91.56 |
| Training w/o KT, init adder kernel and transform | 91.28 |

Next, we evaluate the effectiveness of our proposed methods. The results are shown in Table 5. Without any modification, the original Winograd algorithm only achieves 83.87%. The modified transform matrix and our ℓ2-to-ℓ1 training strategy bring 4.38% and 5.38% accuracy improvements, respectively. Finally, the combination achieves 91.56%, which is comparable with the original AdderNet.

Table 5. Ablation study on our proposed methods

| Mod A | ℓ2-to-ℓ1 | CIFAR-10 | CIFAR-100 |
|---|---|---|---|
|  |  | 83.87% | 54.72% |
| ✓ |  | 88.25% | 62.00% |
|  | ✓ | 89.25% | 62.83% |
| ✓ | ✓ | 91.56% | 67.96% |

### 4.3. Visualization

To intuitively perceive the effectiveness of Winograd for AdderNet, we visualize the features. We take the features of the last adder layer in our modified LeNet-5-BN and use the t-SNE method to reduce their dimension to 2. The result for Winograd for AdderNet is shown on the left of Figure 3 and that of the original AdderNet on the right. The two are extremely close, which means that Winograd for AdderNet extracts features similar to those of the original AdderNet.

Figure 3. Dimension-reduction results of the features in Winograd for AdderNet (left) and the original AdderNet (right). The visualization results are very close, which means that Winograd for AdderNet extracts features similar to those of the original AdderNet.

Besides, we visualize the output features of the Winograd adder layer and the original adder layer to intuitively display the effect of modifying matrix A. From Figure 4, we can see that the output features with the original A show a distinct grid pattern, while the output features with the modified A do not show this phenomenon.

Figure 4. Comparison of the feature heatmaps under different matrices A: (a) input features, (b) output features with the modified A, (c) output features with the original A. Without the modified matrix A, there is an obvious grid pattern in the heatmap shown in Figure 4(c).
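The visualization step can be reproduced with a few lines (our sketch; the feature and label arrays are placeholders standing in for the recorded last-adder-layer activations):

```python
import numpy as np
from sklearn.manifold import TSNE

# `features` stands for the outputs of the last adder layer on a batch of MNIST images
# (flattened), `labels` for the corresponding digit classes.
features = np.random.randn(1000, 120)
labels = np.random.randint(0, 10, 1000)

embedded = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
# scatter-plot `embedded` colored by `labels`, once for AdderNet and once for Winograd AdderNet
```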
### 4.4. FPGA Simulation

To evaluate the runtime energy efficiency of our method, we implement both the Winograd algorithm for AdderNet and the original AdderNet on an FPGA. The designed parallelism of the calculation is 256, which means that 16 input channels and 16 output channels are calculated simultaneously. We take a single layer with input shape (N, C_in, X_h, X_w) = (1, 16, 28, 28) and kernel shape (C_out, C_in, K_w, K_h) = (16, 16, 3, 3) as an example. The comparison between Winograd AdderNet and the original AdderNet is shown in Table 2. Winograd AdderNet requires only 24.0/50.4 ≈ 47.6% of the energy consumption of the original AdderNet. As analyzed in Section 3.1, the theoretical cost of Winograd AdderNet is 45.4% of that of the original AdderNet with C_in = 16 and C_out = 16, so our implementation validates the theoretical result. Moreover, with the pipeline technique, Winograd AdderNet may achieve about 50% latency reduction (estimated).

Table 2. FPGA simulation results of the original AdderNet and Winograd AdderNet

| Method | Module | #Cycles | Hardware Resource | Total Energy Consumption (Equivalent)¹ |
|---|---|---|---|---|
| Original AdderNet | total | 7062 | 7130 | 50.4M |
| Winograd AdderNet | padding | 900 | 31 | 0.03M |
| Winograd AdderNet | input transform | 3136 | 433 | 1.36M |
| Winograd AdderNet | calculation | 3140 | 6900 | 21.7M |
| Winograd AdderNet | output transform | 3136 | 309 | 0.97M |
| Winograd AdderNet | total | - | 7673 | 24.0M |

¹ Since the ratio of hardware resource usage is close to 100%, we use the hardware resource overhead to approximate the equivalent power consumption.

## 5. Conclusion

In this paper, we propose the Winograd algorithm for AdderNet. We replace the element-wise multiplications in the Winograd equation with additions to further reduce the energy cost of CNNs. To mitigate the accuracy loss brought by this replacement, we develop a set of new transform matrices to balance the output features and introduce an ℓ2-to-ℓ1 training method for the Winograd AdderNet paradigm. As a result, the proposed method reduces energy consumption by about 52.4% in our FPGA simulation while achieving performance similar to that of the original AdderNet, which gives it excellent prospects for future hardware design.

## References

Chen, H., Wang, Y., Xu, C., Shi, B., Xu, C., Tian, Q., and Xu, C. AdderNet: Do we really need multiplications in deep learning? In CVPR, pp. 1468-1477, 2020.

Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.

Dally, W. High-performance hardware for machine learning. NIPS Tutorial, 2, 2015.

Guo, S., Yan, Z., Zhang, K., Zuo, W., and Zhang, L. Toward convolutional blind denoising of real photographs. In CVPR, pp. 1712-1722, 2019.

Han, K., Wang, Y., Tian, Q., Guo, J., Xu, C., and Xu, C. GhostNet: More features from cheap operations. In CVPR, pp. 1580-1589, 2020.

Han, S., Pool, J., Tran, J., and Dally, W. J. Learning both weights and connections for efficient neural networks. arXiv preprint arXiv:1506.02626, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097-1105, 2012.

Lavin, A. and Gray, S. Fast algorithms for convolutional neural networks. In CVPR, pp. 4013-4021, 2016.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Lin, X., Zhao, C., and Pan, W. Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pp. 345-353, 2017.
Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018a.

Liu, X., Pool, J., Han, S., and Dally, W. J. Efficient sparse-Winograd convolutional neural networks. arXiv preprint arXiv:1802.06367, 2018b.

Liu, Z., Shen, Z., Savvides, M., and Cheng, K.-T. ReActNet: Towards precise binary neural network with generalized activation functions. arXiv preprint arXiv:2003.03488, 2020.

Mathieu, M., Henaff, M., and LeCun, Y. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851, 2013.

Qiu, J., Wang, J., Yao, S., Guo, K., Li, B., Zhou, E., Yu, J., Tang, T., Xu, N., Song, S., et al. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 26-35, 2016.

Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525-542. Springer, 2016.

Ren, D., Zuo, W., Hu, Q., Zhu, P., and Meng, D. Progressive image deraining networks: A better and simpler baseline. In CVPR, pp. 3937-3946, 2019.

Song, D., Wang, Y., Chen, H., Xu, C., Xu, C., and Tao, D. AdderSR: Towards energy efficient image super-resolution. arXiv preprint arXiv:2009.08891, 2020.

Tan, M. and Le, Q. V. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.

Tan, M., Pang, R., and Le, Q. V. EfficientDet: Scalable and efficient object detection. In CVPR, pp. 10781-10790, 2020.

Tao, A., Sapra, K., and Catanzaro, B. Hierarchical multi-scale attention for semantic segmentation. arXiv preprint arXiv:2005.10821, 2020.

Winograd, S. Arithmetic Complexity of Computations, volume 33. SIAM, 1980.

Xu, Y., Xu, C., Chen, X., Zhang, W., Xu, C., and Wang, Y. Kernel based progressive distillation for adder neural networks. arXiv preprint arXiv:2009.13044, 2020.

Yan, D., Wang, W., and Chu, X. Optimizing batched Winograd convolution on GPUs. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 32-44, 2020.

Zhang, H. and Patel, V. M. Densely connected pyramid dehazing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3194-3203, 2018.