# Composite Binary Decomposition Networks

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

You Qiaoben,1 Zheng Wang,2 Jianguo Li,3 Yinpeng Dong,1 Yu-Gang Jiang,2 Jun Zhu1
1Dept. of Comp. Sci. & Tech., State Key Lab for Intell. Tech. & Sys., Institute for AI, Tsinghua University
2School of Computer Science, Fudan University
3Intel Labs China
qby222@126.com, {zhengwang17,ygj}@fudan.edu.cn, jianguo.li@intel.com, {dyp17@mails., dcszj@}tsinghua.edu.cn

This work was done when You Qiaoben and Zheng Wang were interns at Intel Labs. Jianguo Li is the corresponding author.

Binary neural networks have great resource and computing efficiency, but suffer from long training procedures and non-negligible accuracy drops compared to their full-precision counterparts. In this paper, we propose composite binary decomposition networks (CBDNet), which first compose the real-valued tensor of each layer from a limited number of binary tensors, and then decompose certain conditioned binary tensors into two low-rank binary tensors, so that the number of parameters and operations is greatly reduced compared to the original ones. Experiments demonstrate the effectiveness of the proposed method: CBDNet can approximate the image classification networks ResNet-18 using 5.25 bits, VGG-16 using 5.47 bits, and DenseNet-121 using 5.72 bits, the object detection network SSD300 using 4.38 bits, and the semantic segmentation network SegNet using 5.18 bits, all with minor accuracy drops.

1 Introduction

With the remarkable improvements of Convolutional Neural Networks (CNNs), excellent performance has been achieved in a wide range of pattern recognition tasks, such as image classification (Krizhevsky et al. 2012; Szegedy et al. 2015; He et al. 2016; Huang et al. 2017), object detection (Girshick et al. 2014; Ren et al. 2015; Shen et al. 2017) and semantic segmentation (Long et al. 2015; Badrinarayanan et al. 2017). A well-performing CNN-based system usually needs considerable storage and computing power to store and calculate millions of parameters across tens or even hundreds of CNN layers. This hinders the deployment of CNNs in resource-limited scenarios, especially on low-power embedded devices in the emerging Internet-of-Things (IoT) domain.

Many efforts have been devoted to optimizing the inference resource requirements of CNNs, which can be roughly divided into three categories according to the life cycle of deep models. First, design-time network optimization considers designing efficient network structures from scratch, either in a handcrafted way such as MobileNet (Howard et al. 2017) and interlacing/shuffle networks (Zhang et al. 2017; 2018), or in an automatic search way such as NASNet (Zoph and Le 2016) and PNASNet (Liu et al. 2017a). Second, training-time network optimization tries to simplify pre-defined network structures in terms of neural connections (Han et al. 2015; 2016), filter structures (Wen et al. 2016; Li et al. 2017; Liu et al. 2017b), and even weight precisions (Chen et al. 2015; Courbariaux et al. 2016; Rastegari et al. 2016) through regularized retraining, fine-tuning, or knowledge distilling (Hinton et al. 2015).
Third, deploy-time network optimization tries to replace heavy or redundant components and structures in pre-trained CNN models with efficient, lightweight ones in a training-free way. Typical works include low-rank decomposition (Denton et al. 2014), spatial decomposition (Jaderberg et al. 2014), channel decomposition (Zhang et al. 2016) and network decoupling (Guo et al. 2018).

To produce the desired outputs, the first two categories of methods obviously require a time-consuming training procedure with the full training set available, while methods of the third category may not require a training set at all, or in some cases require only a small dataset (e.g., 5000 images) to calibrate some parameters. The optimization process can typically be done within dozens of minutes. Therefore, in cases where customers cannot provide training data due to privacy or confidentiality issues, deploy-time optimization is of great value when software/hardware vendors help their customers optimize CNN-based solutions. It also opens the possibility of on-device learning to compress, and of online learning with newly arriving data. In consequence, there is a strong demand for modern deep learning frameworks or hardware (GPU/ASIC/FPGA) vendors to provide deploy-time model optimization tools.

However, current deploy-time optimization methods can only provide very limited optimization (2-4x in compression/speedup) over the original models. Meanwhile, binary neural networks (Courbariaux et al. 2015; 2016), which aim to train CNNs with binary weights or even binary activations, attract much attention due to their high compression rate and computing efficiency. However, binary networks generally suffer from a long training procedure and non-negligible accuracy drops compared to their full-precision (FP32) counterparts. Many efforts have been spent to alleviate this problem in training-time optimization (Rastegari et al. 2016; Zhou et al. 2016). This paper considers the problem from a different perspective and raises the question: is it possible to directly transfer full-precision networks into binary networks at deploy-time in a training-free way? We study this problem and give a positive answer by proposing a solution named composite binary decomposition networks (CBDNet). Figure 1 illustrates the overall framework of the proposed method.

Figure 1: Overall framework illustration of CBDNet (normalization, tensor flattening, binary composition, binary spatial decomposition).

The main contributions of this paper are summarized below:
- We show that full-precision CNN models can be directly transferred into highly parameter- and computing-efficient multi-bit binary network models in a training-free way by the proposed CBDNet.
- We propose an algorithm that first expands the full-precision tensor of each conv-layer with a limited number of binary tensors, and then decomposes some conditioned binary tensors into two low-rank binary tensors. To our best knowledge, we are the first to study network sparsity and low-rank decomposition in the binary space.
- We demonstrate the effectiveness of CBDNet on different classification networks including VGGNet, ResNet and DenseNet, as well as the detection network SSD300 and the semantic segmentation network SegNet. This verifies that CBDNet is widely applicable.

Related Work

Binary Neural Networks

Binary neural networks (Courbariaux et al. 2015; 2016; Rastegari et al. 2016), with their high compression rate and great computing efficiency, have progressively attracted attention owing to their great inference performance.
Particularly, BinaryConnect (BNN) (Courbariaux et al. 2015) binarizes weights to +1 and -1 and substitutes multiplications with additions and subtractions to speed up computation. Binary weight networks (BWN) (Rastegari et al. 2016) binarize weight values together with one scaling factor per filter channel, and the same work extends this to XNOR-Net, in which both weights and activations are binarized. DoReFa-Net (Zhou et al. 2016) binarizes not merely weights and activations, but also gradients for the purpose of fast training. However, binary networks face the challenge that accuracy may drop non-negligibly, especially for very deep models (e.g., ResNet). Although (Hou et al. 2017) directly considers the loss during binarization to mitigate possible accuracy drops, which gives more accurate results than BWN and XNOR-Net, it still has a gap to the full-precision counterparts. A novel training procedure named stochastic quantization (Dong et al. 2017) was introduced to narrow down such gaps. In summary, all these works belong to the training-time optimization category.

Deploy-time Network Optimization

Deploy-time network optimization tries to replace heavy CNN structures in pre-trained CNN models with efficient ones in a training-free way. Low-rank decomposition (Denton et al. 2014) exploits the low-rank nature of CNN layers, and shows that fully-connected (FC) layers can be efficiently compressed and accelerated with low-rank approximations, while conv-layers cannot. Spatial decomposition (Jaderberg et al. 2014) factorizes $k_h \times k_w$ convolutional filters into a linear combination of horizontal $1 \times k_w$ filters and vertical $k_h \times 1$ filters. Channel decomposition (Zhang et al. 2016) decomposes one conv-layer into two layers, where the first layer has the same filter size but fewer channels, and the second layer uses a $1 \times 1$ convolution to mix the output of the first one. Network decoupling (Guo et al. 2018) decomposes the regular convolution into a successive combination of depthwise convolution and pointwise convolution. Due to its simplicity, deploy-time optimization has many potential applications for software/hardware vendors, as mentioned above. However, it suffers from relatively limited optimization gains (2-4x in compression/speedup) over the original full-precision models.

Binary Network Decomposition

Few existing works, like ours, consider transferring full-precision networks into multi-bit binary networks in a training-free way. Binary weighted decomposition (BWD) (Kamiya et al. 2017) takes each filter as a basic unit, as BWN does, and expands each filter into a linear combination of binary filters and an FP32 scalar. ABC-Net (Lin et al. 2017) approximates a full-precision tensor with a linear combination of multiple binary tensors and FP32 scalar weights during the training procedure to obtain multi-bit binary networks. Our method is quite different from these two works: we further consider the redundancy and sparsity in the expanded binary tensors, and try to decompose the binary tensors. The decomposition is similar to spatial decomposition (Jaderberg et al. 2014) but operates in the binary space. Hence, our binary decomposition step can also be viewed as binary spatial decomposition.

Method

As is known, the parameters of each conv-layer in a CNN can be represented as a 4-dimensional (4D) tensor, and we take this tensor as our study target. We first present how to expand full-precision tensors into a limited number of binary tensors.
Then we show that binary tensors fulfilling certain conditions can be decomposed into two low-rank binary tensors, and propose an algorithm for that purpose.

Tensor Binary Composition

Suppose the weight parameters of a conv-layer are represented by a 4D tensor $\mathbf{W}_t \in \mathbb{R}^{n \times k \times k \times m}$, where $n$ is the number of input channels, $m$ is the number of output channels, and $k \times k$ is the convolution kernel size. For each element $w \in \mathbf{W}_t$, we first normalize it with

$$\tilde{w} = w / w_{\max}, \quad (1)$$

where $w_{\max} = \max_{w_i \in \mathbf{W}_t} \{|w_i|\}$. The normalized tensor is denoted as $\tilde{\mathbf{W}}_t$. The normalization makes every element $\tilde{w}$ lie within the range $[-1, 1]$.

Figure 2: Performance on ImageNet for different networks (ResNet-18, DenseNet-121, VGG-16) with binary tensor expansion using different numbers of bits $J$. The dashed line indicates FP32 accuracy. Left: top-1 accuracy; right: top-5 accuracy.

Figure 3: Weight distribution of all conv-layers of ResNet-18. (a) Normalized weights; (b) sparse ratio of each binary quantized weight.

For simplicity, we denote the magnitude of $\tilde{w}$ as $\hat{w}$, i.e., $\tilde{w} = \mathrm{sign}(\tilde{w}) \cdot \hat{w}$, where $\mathrm{sign}(\cdot)$ is the sign function, which equals $-1$ when $\tilde{w} < 0$ and $1$ otherwise. Then $\hat{w} \in [0, 1]$ can be expressed by a composite of a series of fixed-point bases as

$$\hat{w} \approx \sum_{i=0}^{J-2} a_i \cdot 2^{-i}, \quad (2)$$

where $a_i \in \{0, 1\}$ is a binary coefficient indicating whether a certain power-of-two term is activated or not, and $i \in \{0, \dots, J-2\}$ means that in total $J-1$ bits are needed for the representation. When taking the sign bit into consideration, $\tilde{w}$ requires $J$ bits to represent.

Denote the magnitude of the normalized tensor as $\hat{\mathbf{W}}_t$. Tensor binary composition is a kind of tensor expansion in which each element of the tensor is binary expanded and expressed with the same bit rate $J$ as

$$\hat{\mathbf{W}}_t \approx \sum_{i=0}^{J-2} \mathbf{A}_i \cdot 2^{-i}, \quad (3)$$

where $\mathbf{A}_i \in \{0, 1\}^{n \times k \times k \times m}$ is a 4D binary tensor. $J$ impacts the approximation accuracy: larger $J$ gives more accurate results. We empirically study three different ImageNet CNN models; Figure 2 shows that $J = 7$ is already sufficient to keep a balance between the accuracy and the efficiency of the expansion.

Different $\mathbf{A}_i$ may have different sparsity, which can be further utilized to compress the binary tensors. Figure 3a illustrates the distribution of normalized weights $\tilde{w}$ over all the layers of ResNet-18, which looks like a Laplacian distribution: most weight values concentrate in the range $(-0.5, 0.5)$. This suggests that 1s are very rare in the binary tensors $\mathbf{A}_i$ with smaller $i$, since smaller $i$ corresponds to bigger values in the power-of-two expansion. Figure 3b further shows the average sparsity of each binary tensor $\mathbf{A}_i$, which also verifies that $\mathbf{A}_i$ with smaller $i$ is much more sparse. Due to the sparsity of $\mathbf{A}_i$, we next perform binary tensor decomposition to further reduce the computation complexity, as introduced in the next section.
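To make the composition step concrete, the following is a minimal NumPy sketch of the expansion in Eqs. (1)-(3) for the $\alpha = 1$ case; the function names (`binary_compose`, `binary_reconstruct`) and the greedy bit-extraction loop are our own illustration under those equations, not code from the paper.

```python
import numpy as np

def binary_compose(W, J=7):
    """Expand a real-valued tensor into a sign tensor plus J-1 binary bit-planes A_i,
    so that W / w_max is approximately sign * sum_i A_i * 2**(-i)  (Eqs. (1)-(3))."""
    w_max = np.abs(W).max()
    W_hat = np.abs(W) / w_max              # normalized magnitudes in [0, 1]
    sign = np.where(W < 0, -1.0, 1.0)      # one extra sign bit
    bit_planes = []
    residual = W_hat.copy()
    for i in range(J - 1):                 # i = 0, ..., J-2
        A_i = (residual >= 2.0 ** (-i)).astype(np.uint8)
        residual -= A_i * 2.0 ** (-i)
        bit_planes.append(A_i)
    return sign, bit_planes, w_max

def binary_reconstruct(sign, bit_planes, w_max):
    """Recompose the approximation of the original tensor from its bit-planes."""
    W_hat = sum(A_i * 2.0 ** (-i) for i, A_i in enumerate(bit_planes))
    return sign * W_hat * w_max

# usage on a toy 3x3 conv tensor with 64 input and 64 output channels
W = np.random.randn(64, 3, 3, 64).astype(np.float32)
sign, planes, w_max = binary_compose(W, J=7)
err = np.abs(binary_reconstruct(sign, planes, w_max) - W).max()
print(f"max abs error: {err:.4f} (bounded by w_max * 2**-(J-2))")
```

The sparsity pattern discussed above is easy to check with such a sketch: for Laplacian- or Gaussian-like weights, `planes[0]` and `planes[1]` contain very few 1s, while later bit-planes are much denser, mirroring Figure 3b.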
Binary expansion with an α scaling factor

The non-saturating direct expansion from FP32 to low bits yields non-negligible accuracy loss, as shown in (Migacz 2017; Krishnamoorthi 2018). A scaling factor is usually introduced and learnt to minimize this loss through an additional calibration procedure (Migacz 2017; Krishnamoorthi 2018). Similarly, we impose a scaling factor $\alpha$ on Eq. (1) as

$$\tilde{w} = \alpha \cdot w / w_{\max}, \quad (4)$$

where $\alpha \geq 1$ is a parameter that controls the range $\tilde{w} \in [-\alpha, \alpha]$.

When the scaling factor $\alpha$ is used, the normalized weight magnitude $\hat{w} \in [0, \alpha]$ can be expressed with a composite of power-of-two terms as

$$\hat{w} \approx \sum_{i=-q}^{J-q-2} a_i \cdot 2^{-i}, \quad (5)$$

where $q = \lceil \log_2 \alpha \rceil$ and $J$ again denotes the number of bits of the weight, including $J-1$ bits for the magnitude and one sign bit. The corresponding tensor form can be written as

$$\hat{\mathbf{W}}_t^{\alpha} \approx \sum_{i=-q}^{J-q-2} \mathbf{A}_i \cdot 2^{-i}. \quad (6)$$

Note that the scaling factor $\alpha$ shifts the power-of-two bases from $\{2^0, 2^{-1}, \dots, 2^{-J+2}\}$ in Eq. (3) to $\{2^{q}, \dots, 2^{0}, \dots, 2^{-J+q+2}\}$ in Eq. (6). When $\alpha = 1$, we have $q = \lceil \log_2 \alpha \rceil = 0$, which reduces Eq. (6) to the case without a scaling factor, as in Eq. (3).

Binary Tensor Decomposition

We have shown that some binary tensors $\mathbf{A}_i$ are sparse. As sparse operations require specific hardware/software accelerators, they are not preferred in many practical usages. In deploy-time network optimization, research shows that a full-precision tensor can be factorized into two much smaller and more efficient tensors (Jaderberg et al. 2014; Zhang et al. 2016). Here, we attempt to extend spatial decomposition (Jaderberg et al. 2014) to the binary case.

For simplicity of analysis, we flatten the 4D tensor $\hat{\mathbf{W}}_t \in \mathbb{R}^{n \times k \times k \times m}$ into the weight matrix $\mathbf{W} \in \mathbb{R}^{(nk) \times (km)}$, and do the same for each $\mathbf{A}_i$. Here the matrix height and width are $nk$ and $km$ respectively. We then factorize a sparse binary matrix $\mathbf{A}$ into two smaller matrices as

$$\mathbf{A} = \mathbf{B} \cdot \mathbf{C}, \quad (7)$$

where $\mathbf{B} \in \{0, 1\}^{(nk) \times c}$ and $\mathbf{C} \in \{0, 1\}^{c \times (km)}$. Note that our method is significantly different from the vector decomposition method (Kamiya et al. 2017), which keeps $\mathbf{B}$ binary and the other factor full-precision; on the contrary, we keep both $\mathbf{B}$ and $\mathbf{C}$ binary. This decomposition has a special meaning for conv-layers: it decomposes a conv-layer with $k \times k$ spatial filters into two layers, one with $k \times 1$ spatial filters and the other with $1 \times k$ spatial filters. Suppose the feature map size is $h \times w$; then the number of operations is $n \cdot m \cdot k^2 \cdot h \cdot w$ for matrix $\mathbf{A}$, while it reduces to $(m + n) \cdot c \cdot k \cdot h \cdot w$ for $\mathbf{B} \cdot \mathbf{C}$. We have the following lemma regarding the difference before and after binary decomposition.

Lemma 1 (1) The computing cost ratio of $\mathbf{A}$ over $\mathbf{B} \cdot \mathbf{C}$ is $\frac{nmk}{c(m+n)}$. (2) The bit-rate compression ratio from $\mathbf{A}$ to $\mathbf{B} \cdot \mathbf{C}$ is also $\frac{nmk}{c(m+n)}$. (3) $c < \frac{nmk}{m+n}$ yields a real reduction in parameters and computing operations.
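As a quick sanity check of Lemma 1, the helper below computes the operation and bit-rate ratio together with the usefulness condition; the function name `decomposition_gain` and the example layer sizes are illustrative assumptions, not values from the paper.

```python
# Operation / bit-rate ratio of A over B*C from Lemma 1, for an assumed k x k conv-layer
# with n input channels, m output channels and bottleneck size c.
def decomposition_gain(n, m, k, c):
    ops_A = n * m * k * k                   # per spatial position, times h*w overall
    ops_BC = (m + n) * c * k                # after the k x 1 / 1 x k binary split
    ratio = ops_A / ops_BC                  # = n*m*k / (c*(m+n))
    worthwhile = c < n * m * k / (m + n)    # condition (3) of Lemma 1
    return ratio, worthwhile

# e.g. a 3x3 layer with 256 input/output channels and bottleneck c = 192
print(decomposition_gain(256, 256, 3, 192))  # (2.0, True): 2x fewer operations and bits
```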
Binary Matrix Decomposition

We first review a basic property of matrix rank:

$$\mathrm{rank}(\mathbf{B} \mathbf{C}) \le \min\{\mathrm{rank}(\mathbf{B}), \mathrm{rank}(\mathbf{C})\}. \quad (8)$$

In contrast to binary matrix factorization methods (Zhang et al. 2007; Miettinen 2010), which minimize a certain kind of loss such as $|\mathbf{A} - \mathbf{B} \cdot \mathbf{C}|$ and find matrices $\mathbf{B}$ and $\mathbf{C}$ iteratively, we attempt to decompose $\mathbf{A}$ into matrices $\mathbf{B}$ and $\mathbf{C}$ without any loss when $c \ge \mathrm{rank}(\mathbf{A})$ is satisfied.

Theorem 1 If $c \ge \mathrm{rank}(\mathbf{A})$, a binary matrix $\mathbf{A} \in \{0, 1\}^{(nk) \times (km)}$ can be losslessly factorized into binary matrices $\mathbf{B} \in \{0, 1\}^{(nk) \times c}$ and $\mathbf{C} \in \{0, 1\}^{c \times (km)}$.

Proof According to the Gaussian elimination method, matrix $\mathbf{A}$ can be converted into an upper triangular matrix $\mathbf{D}$. Our intuition is to construct matrices $\mathbf{B}$ and $\mathbf{C}$ through the process of Gaussian elimination. Assume $n \le m$, and let $\mathbf{P}_i$ be the transform matrix representing the $i$-th elementary transformation; then matrix $\mathbf{D}$ can be expressed as

$$\mathbf{D} = \prod_{i=0}^{k} \mathbf{P}_{k-i} \odot \mathbf{A}, \quad (9)$$

where $\odot$ is the binary matrix multiplication operator such that $\mathbf{A} \odot \mathbf{B} = (\mathbf{A}\mathbf{B}) \bmod 2$, with the modulo taken element-wise. For simplicity, we write plain products instead of $\odot$ in the following. As $\mathbf{P}_i \in \{0, 1\}^{(nk) \times (nk)}$ is an elementary transform matrix, its inverse $\mathbf{P}_i^{-1}$ exists. Therefore, $\mathbf{A}$ can be decomposed into the following form:

$$\mathbf{A} = \prod_{i=0}^{k} \mathbf{P}_i^{-1} \cdot \mathbf{D}. \quad (10)$$

Since $\mathbf{D}$ contains 1s only in its first $r$ rows, where $r = \mathrm{rank}(\mathbf{A})$, $\mathbf{D}$ can be decomposed into two matrices:

$$\mathbf{D} = \begin{pmatrix} \mathbf{I} \\ \mathbf{0} \end{pmatrix} \cdot \mathbf{D}_1, \quad (11)$$

where $\mathbf{D}_1 \in \{0, 1\}^{r \times (km)}$ is the first $r$ rows of matrix $\mathbf{D}$ and $\mathbf{I}$ is an $r \times r$ identity matrix. Then matrix $\mathbf{A}$ can be written as

$$\mathbf{A} = \prod_{i=0}^{k} \mathbf{P}_i^{-1} \cdot \begin{pmatrix} \mathbf{I} \\ \mathbf{0} \end{pmatrix} \cdot \mathbf{D}_1. \quad (12)$$

We then obtain the $(nk) \times r$ matrix $\mathbf{B} = \prod_{i=0}^{k} \mathbf{P}_i^{-1} \cdot \binom{\mathbf{I}}{\mathbf{0}}$ and the $r \times (km)$ matrix $\mathbf{C} = \mathbf{D}_1$ exactly, without any loss. This procedure also indicates that the minimum bottleneck parameter $c$ is $\mathrm{rank}(\mathbf{A})$, i.e., $c \ge \mathrm{rank}(\mathbf{A})$. This ends the proof.

Based on this proof, we outline the binary matrix decomposition procedure in Algorithm 1.

Algorithm 1 Binary matrix decomposition
Input: binary matrix A of size h_A × w_A
Output: matrix rank r, matrix B, matrix C (A = B · C)
1: function BinaryMatDecomposition(A)
2:   if h_A ≥ w_A then
3:     A ← A^T; transpose ← True
4:   end if
5:   r ← 0
6:   B ← h_A × h_A identity matrix
7:   for c ← 1 to w_A do
8:     l ← first row satisfying A[l, c] = 1 and l ≥ r + 1; if no such row exists, continue with the next c
9:     swap rows l and r + 1 of A; P ← the corresponding transform matrix
10:    r ← r + 1
11:    B ← B · P^{-1}
12:    for row ← r + 1 to h_A do
13:      if A[row, c] > 0 then
14:        A[row, :] ← (A[r, :] + A[row, :]) mod 2
15:        P ← the corresponding transform matrix
16:        B ← B · P^{-1}
17:      end if
18:    end for
19:  end for
20:  C ← first r rows of A
21:  P ← h_A × r matrix whose first r rows form an identity matrix and whose remaining h_A − r rows are zeros
22:  B ← B · P
23:  if transpose then
24:    return r, C^T, B^T
25:  else
26:    return r, B, C
27:  end if
28: end function

From the proof, we should also point out that the proposed binary decomposition is suitable for both conv-layers and FC-layers. Note that for a binary permutation matrix $\mathbf{P}$ that swaps two rows, its inverse equals itself, i.e., $\mathbf{P}^{-1} = \mathbf{P}$. Suppose $h_A$ and $w_A$ are the height and width of matrix $\mathbf{A}$; the computing complexity of $\mathbf{B} \cdot \mathbf{P}$ is just $O(h_A)$ since $\mathbf{P}$ is a permutation matrix. The overall computing complexity of Algorithm 1 is $O(w_A \cdot h_A^2)$.
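To make the procedure concrete, here is a minimal NumPy sketch of lossless binary matrix decomposition over GF(2) in the spirit of Algorithm 1; it tracks the accumulated inverse row operations directly instead of explicit P matrices and omits the transpose step, and the name `binary_mat_decompose` is ours, not code released with the paper.

```python
import numpy as np

def binary_mat_decompose(A):
    """Factor a binary matrix A (h x w) so that (B @ C) % 2 == A,
    with B: h x r, C: r x w, and r the rank of A over GF(2)."""
    M = (A.copy() % 2).astype(np.uint8)
    h, w = M.shape
    Binv = np.eye(h, dtype=np.uint8)        # accumulates the inverse of the row operations
    r = 0                                   # current rank / next pivot row
    for col in range(w):
        pivot = next((row for row in range(r, h) if M[row, col]), None)
        if pivot is None:
            continue                        # no pivot in this column
        if pivot != r:                      # row swap on M  <->  column swap on Binv
            M[[r, pivot]] = M[[pivot, r]]
            Binv[:, [r, pivot]] = Binv[:, [pivot, r]]
        for row in range(r + 1, h):         # eliminate entries below the pivot (mod 2)
            if M[row, col]:
                M[row] ^= M[r]
                Binv[:, r] ^= Binv[:, row]
        r += 1
    return r, Binv[:, :r], M[:r, :]         # B = Binv[:, :r], C = first r rows of M

# sanity check on a random sparse binary matrix (a flattened A_i would be used in practice)
A = (np.random.rand(24, 36) < 0.1).astype(np.uint8)
r, B, C = binary_mat_decompose(A)
assert np.array_equal((B.astype(int) @ C.astype(int)) % 2, A)
print("GF(2) rank:", r, "B:", B.shape, "C:", C.shape)
```

The invariant maintained throughout is (Binv · M) mod 2 = A; when elimination finishes, only the first r rows of M are nonzero, so the first r columns of Binv suffice, matching Eq. (12).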
Which $\mathbf{A}_i$ Are Losslessly Compressible?

Theorem 1 shows that only when $c \ge \mathrm{rank}(\mathbf{A})$ can our method produce a lossless binary decomposition, while Lemma 1 shows that only $c < \frac{nmk}{m+n}$ yields a practical reduction in parameters and computing operations. We have the following corollary:

Corollary 1 Binary matrix $\mathbf{A}$ is losslessly compressible based on Theorem 1 when $\mathrm{rank}(\mathbf{A}) \le c < \frac{nmk}{m+n}$.

However, it is unknown in advance which $\mathbf{A}_i$ in Eq. (3) are losslessly compressible. The brute-force way is to try to decompose each $\mathbf{A}_i$ as in Eq. (7) and then keep those satisfying Corollary 1, which is obviously inefficient and impracticable. Alternatively, we may use a heuristic cue to estimate which subset of $\{\mathbf{A}_i\}$ could be losslessly compressible.

Figure 4: Maximum rank ratio per layer of ResNet-18; (a) for $\mathbf{I}_{\bar{w} \ge 2}$, (b) for $\mathbf{I}_{\bar{w} \ge 3}$.

According to Figure 3, $\mathbf{A}_i$ with smaller $i$ is more sparse than that with bigger $i$, and empirically, more sparsity corresponds to smaller $\mathrm{rank}(\mathbf{A}_i)$. Based on Theorem 1, this pushes us to seek a watershed value $j$ such that the $\mathbf{A}_i$ with $i \le j$ are losslessly compressible while the other $\mathbf{A}_i$ ($i > j$) are not. This requires introducing a variable into the definition of $\mathbf{A}_i$, so we choose the $\mathbf{A}_i$ defined by Eq. (6) rather than Eq. (3). As is known, the $\mathbf{A}_i$ sequence defined by Eq. (6) is $\{\mathbf{A}_{-q}, \dots, \mathbf{A}_0, \dots, \mathbf{A}_{J-q-2}\}$, where $q = \lceil \log_2 \alpha \rceil$. Here, we seek the optimal $\alpha$ so that $j = 0$ is the watershed, i.e., $\{\mathbf{A}_{-q}, \dots, \mathbf{A}_0\}$ are losslessly compressible.

For simplicity, we still denote the flattened matrix of the tensor $\hat{\mathbf{W}}_t^{\alpha}$ in Eq. (6) as $\mathbf{W} \in \mathbb{R}^{(nk) \times (km)}$. We propose to use the indicator matrix described below for easier analysis.

Definition 1 The indicator matrix of $\mathbf{W}$ is defined as $\mathbf{I}_{w > \beta} \in \{0, 1\}^{(nk) \times (km)}$, in which the value at position $(x, y)$ is $\mathbf{I}_{w > \beta}[x, y] = \mathrm{If}(\mathbf{W}[x, y] > \beta)$, where $\beta \in [0, \alpha]$ is a parameter and $\mathrm{If}(\cdot)$ is an element-wise indicator function, which equals 1 when the predicate is true and 0 otherwise.

Based on this definition, $\mathbf{A}_i$ in Eq. (6) can be written as

$$\mathbf{A}_i[x, y] = \mathrm{If}\big( \lfloor (\mathbf{W}[x, y] + \Delta w) \cdot 2^{i} \rfloor \bmod 2 = 1 \big), \quad (13)$$

where $\Delta w = 2^{-J+q+1}$ is the largest thrown-away power-of-two term in Eq. (6). Define $\bar{w} = \hat{w} + \Delta w$. According to Eq. (6), the rank of matrix $\mathbf{A}_0$ can be expressed as

$$\mathrm{rank}(\mathbf{A}_0) = \mathrm{rank}\Big( \sum_{i=0}^{\lceil n_I/2 \rceil - 1} \mathbf{I}_{(2i+1) \le \bar{w} < (2i+2)} \Big),$$

where $n_I = \lceil \alpha \rceil$. Based on the matrix rank property

$$\mathrm{rank}(\mathbf{A} + \mathbf{B}) \le \mathrm{rank}(\mathbf{A}) + \mathrm{rank}(\mathbf{B}), \quad (14)$$

and noting that each interval indicator can be written as $\mathbf{I}_{(2i+1) \le \bar{w} < (2i+2)} = \mathbf{I}_{\bar{w} \ge 2i+1} - \mathbf{I}_{\bar{w} \ge 2i+2}$, we derive the upper bound of $\mathrm{rank}(\mathbf{A}_0)$ as

$$\mathrm{rank}(\mathbf{A}_0) \le \sum_{i=1}^{n_I} \mathrm{rank}(\mathbf{I}_{\bar{w} \ge i}). \quad (15)$$

The empirical results show that $\mathrm{rank}(\mathbf{I}_{\bar{w} \ge 1}) \le 0.5 \cdot \min\{nk, mk\}$, and that the ranks of the indicator matrices satisfy the following constraints:

$$(1)\ \frac{\mathrm{rank}(\mathbf{I}_{\bar{w} \ge 2})}{\mathrm{rank}(\mathbf{I}_{\bar{w} \ge 1})} \le C_0, \qquad (2)\ \max_{3 \le i \le n_I} \frac{\mathrm{rank}(\mathbf{I}_{\bar{w} \ge i})}{\mathrm{rank}(\mathbf{I}_{\bar{w} \ge 1})} \le C_1. \quad (16)$$

With the bound (15), we can bound the rank by the indicator matrix $\mathbf{I}_{\bar{w} \ge 1}$:

$$\mathrm{rank}(\mathbf{A}_0) \le \big( (\alpha - 2)^{+} \cdot C_1 + C_0 + 1 \big) \cdot \mathrm{rank}(\mathbf{I}_{\bar{w} \ge 1}), \quad (17)$$

where $(x)^{+}$ equals $x$ if $x > 0$ and 0 otherwise. An empirical study on ResNet-18 finds that $C_0 \le 0.2$ in most of the layers (Figure 4a), $C_1 \le 0.05$ in most of the layers (Figure 4b), and $\alpha \le 8$ in all the layers. Hence, we have a simpler way to estimate $\mathrm{rank}(\mathbf{A}_0)$:

$$\mathrm{rank}(\mathbf{A}_0) \le (4 \cdot 0.05 + 0.2 + 1) \cdot \mathrm{rank}(\mathbf{I}_{\bar{w} \ge 1}) = 1.4 \cdot \mathrm{rank}(\mathbf{I}_{\bar{w} \ge 1}).$$

Therefore, we transfer the optimization of $\alpha$ to the optimization of $\mathrm{rank}(\mathbf{I}_{\bar{w} \ge 1})$.

A binary search algorithm for the scaling factor α

In practice, we may give an expected upper bound $c$ for $\mathrm{rank}(\mathbf{I}_{\bar{w} \ge 1})$, and search for the optimal $\alpha$ satisfying

$$\mathrm{rank}(\mathbf{I}_{\bar{w} \ge 1}) \le c. \quad (18)$$

As $\bar{w} \in [0, \alpha]$ and $\alpha > 1$, the interval $\bar{w} \in [0, 1]$ corresponds to only a $1/\alpha$ portion of the whole range of $\bar{w}$. In general, the weight matrix $\mathbf{W}$ would have to be sorted and traversed to compute $\mathrm{rank}(\mathbf{I}_{\bar{w} \ge 1})$. Instead of this time-consuming approach, we propose an efficient solution based on binary search, outlined in Algorithm 2.

Algorithm 2 Binary search for α to satisfy the rank condition
Input: weight matrix W, expected rank c
Output: scalar value α
1: function ScalarValueSearch(W, c)
2:   min ← 0, max ← maximum number of 1-valued entries in a full-rank matrix
3:   sort W in descending order into a vector v
4:   while min ≤ max do
5:     center ← (min + max)/2
6:     α ← 1/v[center]
7:     compute the indicator matrix I_{w̄ ≥ 1}
8:     compute the rank r of I_{w̄ ≥ 1}
9:     if r > c then
10:      max ← center − 1
11:    else if r < c then
12:      min ← center + 1
13:    else
14:      return α
15:    end if
16:  end while
17:  return α = 1/v[max]
18: end function

Suppose that every element in the weight matrix $\mathbf{W}$ has a unique value. We sort $\mathbf{W}$ into a vector $v$ of length $N = n \cdot m \cdot k^2$ in descending order, so that $v[1]$ is the largest element. Assume index $i$ satisfies the constraint $v[i] \ge 1 > v[i+1]$; then the indicator matrix can be expressed as $\mathbf{I}_{\bar{w} \ge 1} = \mathbf{I}_{p(w)}$
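For illustration, the sketch below follows the spirit of Algorithm 2, under the assumption that the indicator rank is the GF(2) rank used by the binary decomposition; the helpers `gf2_rank` and `search_alpha`, the feasibility bookkeeping, and the toy weight matrix are ours, not code from the paper.

```python
import numpy as np

def gf2_rank(M):
    """Rank of a binary matrix over GF(2) via plain Gaussian elimination."""
    M = M.copy().astype(np.uint8)
    h, w = M.shape
    r = 0
    for col in range(w):
        pivot = next((row for row in range(r, h) if M[row, col]), None)
        if pivot is None:
            continue
        M[[r, pivot]] = M[[pivot, r]]
        for row in range(r + 1, h):
            if M[row, col]:
                M[row] ^= M[r]
        r += 1
    return r

def search_alpha(W, c):
    """Binary-search alpha >= 1 so that rank(I_{alpha*|w|/w_max >= 1}) <= c  (Eq. (18))."""
    mags = np.abs(W) / np.abs(W).max()
    v = np.sort(mags.ravel())[::-1]          # descending normalized magnitudes, v[0] = 1
    lo, hi, best = 0, len(v) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        alpha = 1.0 / v[mid]                 # entries with magnitude >= v[mid] are activated
        indicator = (mags * alpha >= 1).astype(np.uint8)
        if gf2_rank(indicator) > c:
            hi = mid - 1                     # rank too large -> activate fewer entries
        else:
            best, lo = mid, mid + 1          # feasible -> try a larger alpha
    return 1.0 / v[best]

# usage: a toy (n*k) x (k*m) flattened weight matrix and an expected rank bound c = 8
W = np.random.randn(48, 48).astype(np.float32)
print("alpha =", search_alpha(W, c=8))
```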