# Composite Binary Decomposition Networks

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

You Qiaoben,1 Zheng Wang,2 Jianguo Li,3 Yinpeng Dong,1 Yu-Gang Jiang,2 Jun Zhu1
1Dept. of Comp. Sci. & Tech., State Key Lab for Intell. Tech. & Sys., Institute for AI, Tsinghua University
2School of Computer Science, Fudan University
3Intel Labs China
qby222@126.com, {zhengwang17,ygj}@fudan.edu.cn, jianguo.li@intel.com, {dyp17@mails., dcszj@}tsinghua.edu.cn

This work was done when You Qiaoben and Zheng Wang were interns at Intel Labs. Jianguo Li is the corresponding author.

Binary neural networks have great resource and computing efficiency, but suffer from long training procedures and non-negligible accuracy drops compared to their full-precision counterparts. In this paper, we propose composite binary decomposition networks (CBDNet), which first compose the real-valued tensor of each layer from a limited number of binary tensors, and then decompose certain conditioned binary tensors into two low-rank binary tensors, so that the number of parameters and operations is greatly reduced compared to the original ones. Experiments demonstrate the effectiveness of the proposed method: CBDNet can approximate the image classification networks ResNet-18 using 5.25 bits, VGG-16 using 5.47 bits, and DenseNet-121 using 5.72 bits, the object detection network SSD300 using 4.38 bits, and the semantic segmentation network SegNet using 5.18 bits, all with minor accuracy drops.

1 Introduction

With the remarkable improvements of Convolutional Neural Networks (CNNs), excellent performance has been achieved in a wide range of pattern recognition tasks, such as image classification (Krizhevsky et al. 2012; Szegedy et al. 2015; He et al. 2016; Huang et al. 2017), object detection (Girshick et al. 2014; Ren et al. 2015; Shen et al. 2017) and semantic segmentation (Long et al. 2015; Badrinarayanan et al. 2017). A well-performing CNN-based system usually needs considerable storage and computing power to store and calculate millions of parameters across tens or even hundreds of CNN layers. This hinders the deployment of CNNs in resource-limited scenarios, especially on low-power embedded devices in the emerging Internet-of-Things (IoT) domain.

Many efforts have been devoted to optimizing the inference resource requirements of CNNs, which can be roughly divided into three categories according to the life cycle of deep models. First, design-time network optimization considers designing efficient network structures from scratch, either in a handcrafted way such as MobileNet (Howard et al. 2017) and interlacing/shuffle networks (Zhang et al. 2017; 2018), or in an automatic search way such as NASNet (Zoph and Le 2016) and PNASNet (Liu et al. 2017a). Second, training-time network optimization tries to simplify pre-defined network structures in terms of neural connections (Han et al. 2015; 2016), filter structures (Wen et al. 2016; Li et al. 2017; Liu et al. 2017b), and even weight precisions (Chen et al. 2015; Courbariaux et al. 2016; Rastegari et al. 2016) through regularized retraining, fine-tuning, or knowledge distilling (Hinton et al. 2015).
Third, deploy-time network optimization tries to replace heavy or redundant components and structures in pre-trained CNN models with efficient, lightweight ones in a training-free way. Typical works include low-rank decomposition (Denton et al. 2014), spatial decomposition (Jaderberg et al. 2014), channel decomposition (Zhang et al. 2016) and network decoupling (Guo et al. 2018).

To produce the desired outputs, the first two categories of methods obviously require a time-consuming training procedure with the full training set available, while methods of the third category may not require a training set at all, or in some cases require only a small dataset (e.g., 5000 images) to calibrate some parameters. The optimization process can typically be done within dozens of minutes. Therefore, in cases where customers cannot provide training data due to privacy or confidentiality issues, deploy-time optimization is of great value when software/hardware vendors help their customers optimize CNN-based solutions. It also opens the possibility of on-device learning to compress, and of online learning with newly arriving data. In consequence, there is a strong demand for modern deep learning frameworks or hardware (GPU/ASIC/FPGA) vendors to provide deploy-time model optimization tools.

However, current deploy-time optimization methods can only provide very limited optimization (2-4x in compression/speedup) over the original models. Meanwhile, binary neural networks (Courbariaux et al. 2015; 2016), which aim to train CNNs with binary weights or even binary activations, attract much attention due to their high compression rate and computing efficiency. However, binary networks generally suffer from a long training procedure and non-negligible accuracy drops compared to their full-precision (FP32) counterparts. Many efforts have been spent to alleviate this problem in training-time optimization (Rastegari et al. 2016; Zhou et al. 2016). This paper considers the problem from a different perspective and raises the question: is it possible to directly transfer full-precision networks into binary networks at deploy-time in a training-free way? We study this problem and give a positive answer by proposing a solution named composite binary decomposition networks (CBDNet). Figure 1 illustrates the overall framework of the proposed method.

Figure 1: Overall framework illustration of CBDNet (normalization, tensor flattening, binary composition, binary spatial decomposition).

The main contributions of this paper are summarized below:
- We show that full-precision CNN models can be directly transferred into highly parameter- and computing-efficient multi-bit binary network models in a training-free way by the proposed CBDNet.
- We propose an algorithm that first expands the full-precision tensor of each conv-layer with a limited number of binary tensors, and then decomposes some conditioned binary tensors into two low-rank binary tensors. To our best knowledge, we are the first to study network sparsity and low-rank decomposition in the binary space.
- We demonstrate the effectiveness of CBDNet on different classification networks including VGGNet, ResNet and DenseNet, as well as the detection network SSD300 and the semantic segmentation network SegNet. This verifies that CBDNet is widely applicable.

Related Work

Binary Neural Networks

Binary neural networks (Courbariaux et al. 2015; 2016; Rastegari et al. 2016), with their high compression rate and great computing efficiency, have progressively attracted attention owing to their great inference performance.
Particularly, BinaryConnect (BNN) (Courbariaux et al. 2015) binarizes weights to +1 and -1 and substitutes multiplications with additions and subtractions to speed up computation. Binary weight networks (BWN) (Rastegari et al. 2016) binarize weight values together with one scaling factor per filter channel, and the same work extends this to XNOR-Net, in which both weights and activations are binarized. DoReFa-Net (Zhou et al. 2016) binarizes not merely weights and activations, but also gradients for the purpose of fast training. However, binary networks face the challenge that accuracy may drop non-negligibly, especially for very deep models (e.g., ResNet). Although (Hou et al. 2017) directly considers the loss during binarization to mitigate possible accuracy drops, which gives more accurate results than BWN and XNOR-Net, it still has a gap to the full-precision counterparts. A novel training procedure named stochastic quantization (Dong et al. 2017) was introduced to narrow down such gaps. In summary, all these works belong to the training-time optimization category.

Deploy-time Network Optimization

Deploy-time network optimization tries to replace heavy CNN structures in pre-trained CNN models with efficient ones in a training-free way. Low-rank decomposition (Denton et al. 2014) exploits the low-rank nature of CNN layers, and shows that fully-connected (FC) layers can be efficiently compressed and accelerated with low-rank approximations, while conv-layers cannot. Spatial decomposition (Jaderberg et al. 2014) factorizes $k_h \times k_w$ convolutional filters into a linear combination of horizontal $1 \times k_w$ filters and vertical $k_h \times 1$ filters. Channel decomposition (Zhang et al. 2016) decomposes one conv-layer into two layers, where the first layer has the same filter size but fewer channels, and the second layer uses a $1 \times 1$ convolution to mix the output of the first one. Network decoupling (Guo et al. 2018) decomposes the regular convolution into a successive combination of depthwise convolution and pointwise convolution. Due to its simplicity, deploy-time optimization has many potential applications for software/hardware vendors, as mentioned above. However, it suffers from relatively limited optimization gains (2-4x in compression/speedup) over the original full-precision models.

Binary Network Decomposition

Few existing works, like ours, consider transferring full-precision networks into multi-bit binary networks in a training-free way. Binary weighted decomposition (BWD) (Kamiya et al. 2017) takes each filter as a basic unit, as BWN does, and expands each filter into a linear combination of binary filters and an FP32 scalar. ABC-Net (Lin et al. 2017) approximates a full-precision tensor with a linear combination of multiple binary tensors and FP32 scalar weights during the training procedure to obtain multi-bit binary networks. Our method is quite different from these two works: we further consider the redundancy and sparsity in the expanded binary tensors, and try to decompose the binary tensors. The decomposition is similar to spatial decomposition (Jaderberg et al. 2014) but operates in the binary space. Hence, our binary decomposition step can also be viewed as binary spatial decomposition.

Method

As is known, the parameters of each conv-layer in a CNN can be represented as a 4-dimensional (4D) tensor, and we take this tensor as our study target. We first present how to expand full-precision tensors into a limited number of binary tensors.
Then we show that binary tensors fulfilling certain conditions can be decomposed into two low-rank binary tensors, and propose an algorithm for that purpose.

Tensor Binary Composition

Suppose the weight parameters of a conv-layer are represented by a 4D tensor $\mathbf{W}_t \in \mathbb{R}^{n \times k \times k \times m}$, where $n$ is the number of input channels, $m$ is the number of output channels, and $k \times k$ is the convolution kernel size. For each element $w \in \mathbf{W}_t$, we first normalize it with

$$\tilde{w} = w / w_{\max}, \quad (1)$$

where $w_{\max} = \max_{w_i \in \mathbf{W}_t} \{|w_i|\}$. The normalized tensor is denoted as $\tilde{\mathbf{W}}_t$. The normalization makes every element $\tilde{w}$ lie within the range $[-1, 1]$.

Figure 2: Performance on ImageNet for different networks (ResNet-18, DenseNet-121, VGG-16) with binary tensor expansion using different numbers of bits $J$. The dashed line indicates FP32 accuracy. Left: top-1 accuracy; right: top-5 accuracy.

Figure 3: Weight distribution of all conv-layers of ResNet-18. (a) Normalized weights; (b) sparse ratio of each binary quantized weight.

For simplicity, we denote the magnitude of $\tilde{w}$ as $\hat{w}$, i.e., $\tilde{w} = \mathrm{sign}(\tilde{w}) \cdot \hat{w}$, where $\mathrm{sign}(\cdot)$ is the sign function, which equals $-1$ when $\tilde{w} < 0$ and $1$ otherwise. Then $\hat{w} \in [0, 1]$ can be expressed by a composite of a series of fixed-point bases as

$$\hat{w} \approx \sum_{i=0}^{J-2} a_i \cdot 2^{-i}, \quad (2)$$

where $a_i \in \{0, 1\}$ is a binary coefficient indicating whether a certain power-of-two term is activated or not, and $i \in \{0, \dots, J-2\}$ means that in total $J-1$ bits are needed for the representation. When taking the sign bit into consideration, $\tilde{w}$ requires $J$ bits to represent.

Denote the magnitude of the normalized tensor as $\hat{\mathbf{W}}_t$. Tensor binary composition is a kind of tensor expansion in which each element of the tensor is binary expanded and expressed with the same bit rate $J$ as

$$\hat{\mathbf{W}}_t \approx \sum_{i=0}^{J-2} \mathbf{A}_i \cdot 2^{-i}, \quad (3)$$

where $\mathbf{A}_i \in \{0, 1\}^{n \times k \times k \times m}$ is a 4D binary tensor. $J$ impacts the approximation accuracy: larger $J$ gives more accurate results. We empirically study three different ImageNet CNN models; Figure 2 shows that $J = 7$ is already sufficient to keep a balance between the accuracy and the efficiency of the expansion.

Different $\mathbf{A}_i$ may have different sparsity, which can be further utilized to compress the binary tensors. Figure 3a illustrates the distribution of normalized weights $\tilde{w}$ over all the layers of ResNet-18, which looks like a Laplacian distribution: most weight values concentrate in the range $(-0.5, 0.5)$. This suggests that 1s are very rare in the binary tensors $\mathbf{A}_i$ with smaller $i$, since smaller $i$ corresponds to bigger values in the power-of-two expansion. Figure 3b further shows the average sparsity of each binary tensor $\mathbf{A}_i$, which also verifies that $\mathbf{A}_i$ with smaller $i$ is much more sparse. Due to the sparsity of $\mathbf{A}_i$, we next perform binary tensor decomposition to further reduce the computation complexity, as introduced in the next section.
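To make the composition step concrete, the following is a minimal NumPy sketch of the expansion in Eqs. (1)-(3) for the $\alpha = 1$ case; the function names (`binary_compose`, `binary_reconstruct`) and the greedy bit-extraction loop are our own illustration under those equations, not code from the paper.

```python
import numpy as np

def binary_compose(W, J=7):
    """Expand a real-valued tensor into a sign tensor plus J-1 binary bit-planes A_i,
    so that W / w_max is approximately sign * sum_i A_i * 2**(-i)  (Eqs. (1)-(3))."""
    w_max = np.abs(W).max()
    W_hat = np.abs(W) / w_max              # normalized magnitudes in [0, 1]
    sign = np.where(W < 0, -1.0, 1.0)      # one extra sign bit
    bit_planes = []
    residual = W_hat.copy()
    for i in range(J - 1):                 # i = 0, ..., J-2
        A_i = (residual >= 2.0 ** (-i)).astype(np.uint8)
        residual -= A_i * 2.0 ** (-i)
        bit_planes.append(A_i)
    return sign, bit_planes, w_max

def binary_reconstruct(sign, bit_planes, w_max):
    """Recompose the approximation of the original tensor from its bit-planes."""
    W_hat = sum(A_i * 2.0 ** (-i) for i, A_i in enumerate(bit_planes))
    return sign * W_hat * w_max

# usage on a toy 3x3 conv tensor with 64 input and 64 output channels
W = np.random.randn(64, 3, 3, 64).astype(np.float32)
sign, planes, w_max = binary_compose(W, J=7)
err = np.abs(binary_reconstruct(sign, planes, w_max) - W).max()
print(f"max abs error: {err:.4f} (bounded by w_max * 2**-(J-2))")
```

The sparsity pattern discussed above is easy to check with such a sketch: for Laplacian- or Gaussian-like weights, `planes[0]` and `planes[1]` contain very few 1s, while later bit-planes are much denser, mirroring Figure 3b.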
Binary expansion with an α scaling factor

The non-saturating direct expansion from FP32 to low bits yields non-negligible accuracy loss, as shown in (Migacz 2017; Krishnamoorthi 2018). A scaling factor is usually introduced and learnt to minimize this loss through an additional calibration procedure (Migacz 2017; Krishnamoorthi 2018). Similarly, we impose a scaling factor $\alpha$ on Eq. (1) as

$$\tilde{w} = \alpha \cdot w / w_{\max}, \quad (4)$$

where $\alpha \geq 1$ is a parameter that controls the range $\tilde{w} \in [-\alpha, \alpha]$.

When the scaling factor $\alpha$ is used, the normalized weight magnitude $\hat{w} \in [0, \alpha]$ can be expressed with a composite of power-of-two terms as

$$\hat{w} \approx \sum_{i=-q}^{J-q-2} a_i \cdot 2^{-i}, \quad (5)$$

where $q = \lceil \log_2 \alpha \rceil$ and $J$ again denotes the number of bits of the weight, including $J-1$ bits for the magnitude and one sign bit. The corresponding tensor form can be written as

$$\hat{\mathbf{W}}_t^{\alpha} \approx \sum_{i=-q}^{J-q-2} \mathbf{A}_i \cdot 2^{-i}. \quad (6)$$

Note that the scaling factor $\alpha$ shifts the power-of-two bases from $\{2^0, 2^{-1}, \dots, 2^{-J+2}\}$ in Eq. (3) to $\{2^{q}, \dots, 2^{0}, \dots, 2^{-J+q+2}\}$ in Eq. (6). When $\alpha = 1$, we have $q = \lceil \log_2 \alpha \rceil = 0$, which reduces Eq. (6) to the case without a scaling factor, as in Eq. (3).

Binary Tensor Decomposition

We have shown that some binary tensors $\mathbf{A}_i$ are sparse. As sparse operations require specific hardware/software accelerators, they are not preferred in many practical usages. In deploy-time network optimization, research shows that a full-precision tensor can be factorized into two much smaller and more efficient tensors (Jaderberg et al. 2014; Zhang et al. 2016). Here, we attempt to extend spatial decomposition (Jaderberg et al. 2014) to the binary case.

For simplicity of analysis, we flatten the 4D tensor $\hat{\mathbf{W}}_t \in \mathbb{R}^{n \times k \times k \times m}$ into the weight matrix $\mathbf{W} \in \mathbb{R}^{(nk) \times (km)}$, and do the same for each $\mathbf{A}_i$. Here the matrix height and width are $nk$ and $km$ respectively. We then factorize a sparse binary matrix $\mathbf{A}$ into two smaller matrices as

$$\mathbf{A} = \mathbf{B} \cdot \mathbf{C}, \quad (7)$$

where $\mathbf{B} \in \{0, 1\}^{(nk) \times c}$ and $\mathbf{C} \in \{0, 1\}^{c \times (km)}$. Note that our method is significantly different from the vector decomposition method (Kamiya et al. 2017), which keeps $\mathbf{B}$ binary and the other factor full-precision; on the contrary, we keep both $\mathbf{B}$ and $\mathbf{C}$ binary. This decomposition has a special meaning for conv-layers: it decomposes a conv-layer with $k \times k$ spatial filters into two layers, one with $k \times 1$ spatial filters and the other with $1 \times k$ spatial filters. Suppose the feature map size is $h \times w$; then the number of operations is $n \cdot m \cdot k^2 \cdot h \cdot w$ for matrix $\mathbf{A}$, while it reduces to $(m + n) \cdot c \cdot k \cdot h \cdot w$ for $\mathbf{B} \cdot \mathbf{C}$. We have the following lemma regarding the difference before and after binary decomposition.

Lemma 1 (1) The computing cost ratio of $\mathbf{A}$ over $\mathbf{B} \cdot \mathbf{C}$ is $\frac{nmk}{c(m+n)}$. (2) The bit-rate compression ratio from $\mathbf{A}$ to $\mathbf{B} \cdot \mathbf{C}$ is also $\frac{nmk}{c(m+n)}$. (3) $c < \frac{nmk}{m+n}$ yields a real reduction in parameters and computing operations.
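As a quick sanity check of Lemma 1, the helper below computes the operation and bit-rate ratio together with the usefulness condition; the function name `decomposition_gain` and the example layer sizes are illustrative assumptions, not values from the paper.

```python
# Operation / bit-rate ratio of A over B*C from Lemma 1, for an assumed k x k conv-layer
# with n input channels, m output channels and bottleneck size c.
def decomposition_gain(n, m, k, c):
    ops_A = n * m * k * k                   # per spatial position, times h*w overall
    ops_BC = (m + n) * c * k                # after the k x 1 / 1 x k binary split
    ratio = ops_A / ops_BC                  # = n*m*k / (c*(m+n))
    worthwhile = c < n * m * k / (m + n)    # condition (3) of Lemma 1
    return ratio, worthwhile

# e.g. a 3x3 layer with 256 input/output channels and bottleneck c = 192
print(decomposition_gain(256, 256, 3, 192))  # (2.0, True): 2x fewer operations and bits
```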
Binary Matrix Decomposition

We first review a basic property of matrix rank:

$$\mathrm{rank}(\mathbf{B} \mathbf{C}) \le \min\{\mathrm{rank}(\mathbf{B}), \mathrm{rank}(\mathbf{C})\}. \quad (8)$$

In contrast to binary matrix factorization methods (Zhang et al. 2007; Miettinen 2010), which minimize a certain kind of loss such as $|\mathbf{A} - \mathbf{B} \cdot \mathbf{C}|$ and find matrices $\mathbf{B}$ and $\mathbf{C}$ iteratively, we attempt to decompose $\mathbf{A}$ into matrices $\mathbf{B}$ and $\mathbf{C}$ without any loss when $c \ge \mathrm{rank}(\mathbf{A})$ is satisfied.

Theorem 1 If $c \ge \mathrm{rank}(\mathbf{A})$, a binary matrix $\mathbf{A} \in \{0, 1\}^{(nk) \times (km)}$ can be losslessly factorized into binary matrices $\mathbf{B} \in \{0, 1\}^{(nk) \times c}$ and $\mathbf{C} \in \{0, 1\}^{c \times (km)}$.

Proof According to the Gaussian elimination method, matrix $\mathbf{A}$ can be converted into an upper triangular matrix $\mathbf{D}$. Our intuition is to construct matrices $\mathbf{B}$ and $\mathbf{C}$ through the process of Gaussian elimination. Assume $n \le m$, and let $\mathbf{P}_i$ be the transform matrix representing the $i$-th elementary transformation; then matrix $\mathbf{D}$ can be expressed as

$$\mathbf{D} = \prod_{i=0}^{k} \mathbf{P}_{k-i} \odot \mathbf{A}, \quad (9)$$

where $\odot$ is the binary matrix multiplication operator such that $\mathbf{A} \odot \mathbf{B} = (\mathbf{A}\mathbf{B}) \bmod 2$, with the modulo taken element-wise. For simplicity, we write plain products instead of $\odot$ in the following. As $\mathbf{P}_i \in \{0, 1\}^{(nk) \times (nk)}$ is an elementary transform matrix, its inverse $\mathbf{P}_i^{-1}$ exists. Therefore, $\mathbf{A}$ can be decomposed into the following form:

$$\mathbf{A} = \prod_{i=0}^{k} \mathbf{P}_i^{-1} \cdot \mathbf{D}. \quad (10)$$

Since $\mathbf{D}$ contains 1s only in its first $r$ rows, where $r = \mathrm{rank}(\mathbf{A})$, $\mathbf{D}$ can be decomposed into two matrices:

$$\mathbf{D} = \begin{pmatrix} \mathbf{I} \\ \mathbf{0} \end{pmatrix} \cdot \mathbf{D}_1, \quad (11)$$

where $\mathbf{D}_1 \in \{0, 1\}^{r \times (km)}$ is the first $r$ rows of matrix $\mathbf{D}$ and $\mathbf{I}$ is an $r \times r$ identity matrix. Then matrix $\mathbf{A}$ can be written as

$$\mathbf{A} = \prod_{i=0}^{k} \mathbf{P}_i^{-1} \cdot \begin{pmatrix} \mathbf{I} \\ \mathbf{0} \end{pmatrix} \cdot \mathbf{D}_1. \quad (12)$$

We then obtain the $(nk) \times r$ matrix $\mathbf{B} = \prod_{i=0}^{k} \mathbf{P}_i^{-1} \cdot \binom{\mathbf{I}}{\mathbf{0}}$ and the $r \times (km)$ matrix $\mathbf{C} = \mathbf{D}_1$ exactly, without any loss. This procedure also indicates that the minimum bottleneck parameter $c$ is $\mathrm{rank}(\mathbf{A})$, i.e., $c \ge \mathrm{rank}(\mathbf{A})$. This ends the proof.

Based on this proof, we outline the binary matrix decomposition procedure in Algorithm 1.

Algorithm 1 Binary matrix decomposition
Input: binary matrix A of size h_A × w_A
Output: matrix rank r, matrix B, matrix C (A = B · C)
1: function BinaryMatDecomposition(A)
2:   if h_A ≥ w_A then
3:     A ← A^T; transpose ← True
4:   end if
5:   r ← 0
6:   B ← h_A × h_A identity matrix
7:   for c ← 1 to w_A do
8:     l ← first row satisfying A[l, c] = 1 and l ≥ r + 1; if no such row exists, continue with the next c
9:     swap rows l and r + 1 of A; P ← the corresponding transform matrix
10:    r ← r + 1
11:    B ← B · P^{-1}
12:    for row ← r + 1 to h_A do
13:      if A[row, c] > 0 then
14:        A[row, :] ← (A[r, :] + A[row, :]) mod 2
15:        P ← the corresponding transform matrix
16:        B ← B · P^{-1}
17:      end if
18:    end for
19:  end for
20:  C ← first r rows of A
21:  P ← h_A × r matrix whose first r rows form an identity matrix and whose remaining h_A − r rows are zeros
22:  B ← B · P
23:  if transpose then
24:    return r, C^T, B^T
25:  else
26:    return r, B, C
27:  end if
28: end function

From the proof, we should also point out that the proposed binary decomposition is suitable for both conv-layers and FC-layers. Note that for a binary permutation matrix $\mathbf{P}$ that swaps two rows, its inverse equals itself, i.e., $\mathbf{P}^{-1} = \mathbf{P}$. Suppose $h_A$ and $w_A$ are the height and width of matrix $\mathbf{A}$; the computing complexity of $\mathbf{B} \cdot \mathbf{P}$ is just $O(h_A)$ since $\mathbf{P}$ is a permutation matrix. The overall computing complexity of Algorithm 1 is $O(w_A \cdot h_A^2)$.
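To make the procedure concrete, here is a minimal NumPy sketch of lossless binary matrix decomposition over GF(2) in the spirit of Algorithm 1; it tracks the accumulated inverse row operations directly instead of explicit P matrices and omits the transpose step, and the name `binary_mat_decompose` is ours, not code released with the paper.

```python
import numpy as np

def binary_mat_decompose(A):
    """Factor a binary matrix A (h x w) so that (B @ C) % 2 == A,
    with B: h x r, C: r x w, and r the rank of A over GF(2)."""
    M = (A.copy() % 2).astype(np.uint8)
    h, w = M.shape
    Binv = np.eye(h, dtype=np.uint8)        # accumulates the inverse of the row operations
    r = 0                                   # current rank / next pivot row
    for col in range(w):
        pivot = next((row for row in range(r, h) if M[row, col]), None)
        if pivot is None:
            continue                        # no pivot in this column
        if pivot != r:                      # row swap on M  <->  column swap on Binv
            M[[r, pivot]] = M[[pivot, r]]
            Binv[:, [r, pivot]] = Binv[:, [pivot, r]]
        for row in range(r + 1, h):         # eliminate entries below the pivot (mod 2)
            if M[row, col]:
                M[row] ^= M[r]
                Binv[:, r] ^= Binv[:, row]
        r += 1
    return r, Binv[:, :r], M[:r, :]         # B = Binv[:, :r], C = first r rows of M

# sanity check on a random sparse binary matrix (a flattened A_i would be used in practice)
A = (np.random.rand(24, 36) < 0.1).astype(np.uint8)
r, B, C = binary_mat_decompose(A)
assert np.array_equal((B.astype(int) @ C.astype(int)) % 2, A)
print("GF(2) rank:", r, "B:", B.shape, "C:", C.shape)
```

The invariant maintained throughout is (Binv · M) mod 2 = A; when elimination finishes, only the first r rows of M are nonzero, so the first r columns of Binv suffice, matching Eq. (12).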
Which $\mathbf{A}_i$ Are Losslessly Compressible?

Theorem 1 shows that only when $c \ge \mathrm{rank}(\mathbf{A})$ can our method produce a lossless binary decomposition, while Lemma 1 shows that only $c < \frac{nmk}{m+n}$ yields a practical reduction in parameters and computing operations. We have the following corollary:

Corollary 1 Binary matrix $\mathbf{A}$ is losslessly compressible based on Theorem 1 when $\mathrm{rank}(\mathbf{A}) \le c < \frac{nmk}{m+n}$.

However, it is unknown in advance which $\mathbf{A}_i$ in Eq. (3) are losslessly compressible. The brute-force way is to try to decompose each $\mathbf{A}_i$ as in Eq. (7) and then keep those satisfying Corollary 1, which is obviously inefficient and impracticable. Alternatively, we may use a heuristic cue to estimate which subset of $\{\mathbf{A}_i\}$ could be losslessly compressible.

Figure 4: Maximum rank ratio per layer of ResNet-18; (a) for $\mathbf{I}_{\bar{w} \ge 2}$, (b) for $\mathbf{I}_{\bar{w} \ge 3}$.

According to Figure 3, $\mathbf{A}_i$ with smaller $i$ is more sparse than that with bigger $i$, and empirically, more sparsity corresponds to smaller $\mathrm{rank}(\mathbf{A}_i)$. Based on Theorem 1, this pushes us to seek a watershed value $j$ such that the $\mathbf{A}_i$ with $i \le j$ are losslessly compressible while the other $\mathbf{A}_i$ ($i > j$) are not. This requires introducing a variable into the definition of $\mathbf{A}_i$, so we choose the $\mathbf{A}_i$ defined by Eq. (6) rather than Eq. (3). As is known, the $\mathbf{A}_i$ sequence defined by Eq. (6) is $\{\mathbf{A}_{-q}, \dots, \mathbf{A}_0, \dots, \mathbf{A}_{J-q-2}\}$, where $q = \lceil \log_2 \alpha \rceil$. Here, we seek the optimal $\alpha$ so that $j = 0$ is the watershed, i.e., $\{\mathbf{A}_{-q}, \dots, \mathbf{A}_0\}$ are losslessly compressible.

For simplicity, we still denote the flattened matrix of the tensor $\hat{\mathbf{W}}_t^{\alpha}$ in Eq. (6) as $\mathbf{W} \in \mathbb{R}^{(nk) \times (km)}$. We propose to use the indicator matrix described below for easier analysis.

Definition 1 The indicator matrix of $\mathbf{W}$ is defined as $\mathbf{I}_{w > \beta} \in \{0, 1\}^{(nk) \times (km)}$, in which the value at position $(x, y)$ is $\mathbf{I}_{w > \beta}[x, y] = \mathrm{If}(\mathbf{W}[x, y] > \beta)$, where $\beta \in [0, \alpha]$ is a parameter and $\mathrm{If}(\cdot)$ is an element-wise indicator function, which equals 1 when the predicate is true and 0 otherwise.

Based on this definition, $\mathbf{A}_i$ in Eq. (6) can be written as

$$\mathbf{A}_i[x, y] = \mathrm{If}\big( \lfloor (\mathbf{W}[x, y] + \Delta w) \cdot 2^{i} \rfloor \bmod 2 = 1 \big), \quad (13)$$

where $\Delta w = 2^{-J+q+1}$ is the largest thrown-away power-of-two term in Eq. (6). Define $\bar{w} = \hat{w} + \Delta w$. According to Eq. (6), the rank of matrix $\mathbf{A}_0$ can be expressed as

$$\mathrm{rank}(\mathbf{A}_0) = \mathrm{rank}\Big( \sum_{i=0}^{\lceil n_I/2 \rceil - 1} \mathbf{I}_{(2i+1) \le \bar{w} < (2i+2)} \Big),$$

where $n_I = \lceil \alpha \rceil$. Based on the matrix rank property

$$\mathrm{rank}(\mathbf{A} + \mathbf{B}) \le \mathrm{rank}(\mathbf{A}) + \mathrm{rank}(\mathbf{B}), \quad (14)$$

and noting that each interval indicator can be written as $\mathbf{I}_{(2i+1) \le \bar{w} < (2i+2)} = \mathbf{I}_{\bar{w} \ge 2i+1} - \mathbf{I}_{\bar{w} \ge 2i+2}$, we derive the upper bound of $\mathrm{rank}(\mathbf{A}_0)$ as

$$\mathrm{rank}(\mathbf{A}_0) \le \sum_{i=1}^{n_I} \mathrm{rank}(\mathbf{I}_{\bar{w} \ge i}). \quad (15)$$

The empirical results show that $\mathrm{rank}(\mathbf{I}_{\bar{w} \ge 1}) \le 0.5 \cdot \min\{nk, mk\}$, and that the ranks of the indicator matrices satisfy the following constraints:

$$(1)\ \frac{\mathrm{rank}(\mathbf{I}_{\bar{w} \ge 2})}{\mathrm{rank}(\mathbf{I}_{\bar{w} \ge 1})} \le C_0, \qquad (2)\ \max_{3 \le i \le n_I} \frac{\mathrm{rank}(\mathbf{I}_{\bar{w} \ge i})}{\mathrm{rank}(\mathbf{I}_{\bar{w} \ge 1})} \le C_1. \quad (16)$$

With the bound (15), we can bound the rank by the indicator matrix $\mathbf{I}_{\bar{w} \ge 1}$:

$$\mathrm{rank}(\mathbf{A}_0) \le \big( (\alpha - 2)^{+} \cdot C_1 + C_0 + 1 \big) \cdot \mathrm{rank}(\mathbf{I}_{\bar{w} \ge 1}), \quad (17)$$

where $(x)^{+}$ equals $x$ if $x > 0$ and 0 otherwise. An empirical study on ResNet-18 finds that $C_0 \le 0.2$ in most of the layers (Figure 4a), $C_1 \le 0.05$ in most of the layers (Figure 4b), and $\alpha \le 8$ in all the layers. Hence, we have a simpler way to estimate $\mathrm{rank}(\mathbf{A}_0)$:

$$\mathrm{rank}(\mathbf{A}_0) \le (4 \cdot 0.05 + 0.2 + 1) \cdot \mathrm{rank}(\mathbf{I}_{\bar{w} \ge 1}) = 1.4 \cdot \mathrm{rank}(\mathbf{I}_{\bar{w} \ge 1}).$$

Therefore, we transfer the optimization of $\alpha$ to the optimization of $\mathrm{rank}(\mathbf{I}_{\bar{w} \ge 1})$.

A binary search algorithm for the scaling factor α

In practice, we may give an expected upper bound $c$ for $\mathrm{rank}(\mathbf{I}_{\bar{w} \ge 1})$, and search for the optimal $\alpha$ satisfying

$$\mathrm{rank}(\mathbf{I}_{\bar{w} \ge 1}) \le c. \quad (18)$$

As $\bar{w} \in [0, \alpha]$ and $\alpha > 1$, the interval $\bar{w} \in [0, 1]$ corresponds to only a $1/\alpha$ portion of the whole range of $\bar{w}$. In general, the weight matrix $\mathbf{W}$ would have to be sorted and traversed to compute $\mathrm{rank}(\mathbf{I}_{\bar{w} \ge 1})$. Instead of this time-consuming approach, we propose an efficient solution based on binary search, outlined in Algorithm 2.

Algorithm 2 Binary search for α to satisfy the rank condition
Input: weight matrix W, expected rank c
Output: scalar value α
1: function ScalarValueSearch(W, c)
2:   min ← 0, max ← maximum number of 1-valued entries in a full-rank matrix
3:   sort W in descending order into a vector v
4:   while min ≤ max do
5:     center ← (min + max)/2
6:     α ← 1/v[center]
7:     compute the indicator matrix I_{w̄ ≥ 1}
8:     compute the rank r of I_{w̄ ≥ 1}
9:     if r > c then
10:      max ← center − 1
11:    else if r < c then
12:      min ← center + 1
13:    else
14:      return α
15:    end if
16:  end while
17:  return α = 1/v[max]
18: end function

Suppose that every element in the weight matrix $\mathbf{W}$ has a unique value. We sort $\mathbf{W}$ into a vector $v$ of length $N = n \cdot m \cdot k^2$ in descending order, so that $v[1]$ is the largest element. Assume index $i$ satisfies the constraint $v[i] \ge 1 > v[i+1]$; then the indicator matrix can be expressed as $\mathbf{I}_{\bar{w} \ge 1} = \mathbf{I}_{p(w)}$
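For illustration, the sketch below follows the spirit of Algorithm 2, under the assumption that the indicator rank is the GF(2) rank used by the binary decomposition; the helpers `gf2_rank` and `search_alpha`, the feasibility bookkeeping, and the toy weight matrix are ours, not code from the paper.

```python
import numpy as np

def gf2_rank(M):
    """Rank of a binary matrix over GF(2) via plain Gaussian elimination."""
    M = M.copy().astype(np.uint8)
    h, w = M.shape
    r = 0
    for col in range(w):
        pivot = next((row for row in range(r, h) if M[row, col]), None)
        if pivot is None:
            continue
        M[[r, pivot]] = M[[pivot, r]]
        for row in range(r + 1, h):
            if M[row, col]:
                M[row] ^= M[r]
        r += 1
    return r

def search_alpha(W, c):
    """Binary-search alpha >= 1 so that rank(I_{alpha*|w|/w_max >= 1}) <= c  (Eq. (18))."""
    mags = np.abs(W) / np.abs(W).max()
    v = np.sort(mags.ravel())[::-1]          # descending normalized magnitudes, v[0] = 1
    lo, hi, best = 0, len(v) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        alpha = 1.0 / v[mid]                 # entries with magnitude >= v[mid] are activated
        indicator = (mags * alpha >= 1).astype(np.uint8)
        if gf2_rank(indicator) > c:
            hi = mid - 1                     # rank too large -> activate fewer entries
        else:
            best, lo = mid, mid + 1          # feasible -> try a larger alpha
    return 1.0 / v[best]

# usage: a toy (n*k) x (k*m) flattened weight matrix and an expected rank bound c = 8
W = np.random.randn(48, 48).astype(np.float32)
print("alpha =", search_alpha(W, c=8))
```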