# Decoupled Convolutions for CNNs

Guotian Xie,1,2 Ting Zhang,4 Kuiyuan Yang,3 Jianhuang Lai,1,2 Jingdong Wang4

1School of Data and Computer Science, Sun Yat-Sen University; 2Guangdong Province Key Laboratory of Information Security; 3DeepMotion; 4Microsoft Research

xieguotian1990@gmail.com, {Ting.Zhang, jingdw}@microsoft.com, kuiyuanyang@deepmotion.ai, stsljh@mail.sysu.edu.cn

(This work was done when Guotian Xie was an intern at Microsoft Research, Beijing, P.R. China. Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.)

## Abstract

In this paper, we are interested in designing small CNNs by decoupling the convolution along the spatial and channel domains. Most existing decoupling techniques focus on approximating the filter matrix through decomposition. In contrast, we provide a two-step interpretation of the standard convolution, from the filter at a single location to all locations, which is exactly equivalent to the standard convolution. Motivated by the observations in our decoupling view, we propose an effective approach that relaxes the sparsity of the filter in spatial aggregation by learning a spatial configuration, and reduces the redundancy by reducing the number of intermediate channels. Our approach achieves classification performance comparable to the standard uncoupled convolution, but with a smaller model size, on CIFAR-100, CIFAR-10 and ImageNet.

## Introduction

Since AlexNet (Krizhevsky, Sutskever, and Hinton 2012) successfully applied a Convolutional Neural Network (CNN) to ImageNet and won the challenge by a large margin in 2012, CNNs have become the most widely used models for image classification (He et al. 2016), object detection (Ren et al. 2015; Redmon and Farhadi 2016), image segmentation (Long, Shelhamer, and Darrell 2015; Kolesnikov and Lampert 2016), and so on. CNNs have become deeper and deeper (Simonyan and Zisserman 2014; Szegedy et al. 2015; He et al. 2015; 2016; Huang et al. 2016), ranging from tens of layers to thousands of layers in pursuit of better performance, and have become wider and wider as well, e.g., Wide Residual Networks (Zagoruyko and Komodakis 2016).

Another research direction is designing more effective filters. There have been many works on filter design, and most of them fall into two categories. One is to decompose the filter matrix into several low-rank matrices (Ioannou et al. 2015; Denton et al. 2014; Zhang et al. 2015; Kim et al. 2015; Tai et al. 2015; Jaderberg, Vedaldi, and Zisserman 2014; Mamalet and Garcia 2012). The other is to view the filter as a sparse matrix: some works sparsify the channel extent, e.g., group convolution (Ioannou et al. 2016; Zhang et al. 2017) and channel-wise convolution or separable filters (Chollet 2016), while other works sparsify the spatial extent with smaller filters, e.g., 3×3, 1×3 and 3×1 (Szegedy et al. 2016).

In this paper, in contrast to designing the filters, we are interested in decoupling the convolution along the spatial and channel domains, and we propose an effective approach based on the decoupled interpretation. We start by analyzing the process of convolution on the input and decompose this process into two steps. First, each location in the input is projected across the channel domain; in this way, the projection along the channel domain is not related to the spatial information of the input.
Second, we accumulate the projections of the locations across the spatial domain, and this process is only related to the spatial relationship. We reformulate the decoupled two steps in a convolution form: first conducting a 1×1 across channel-domain convolution, and then conducting an across spatial-domain convolution with a spatial configuration. We denote this process as decoupling spatial convolution. From this decoupling view, we find that the decoupled structure of the standard spatial convolution is unbalanced: the 1×1 across channel-domain convolution operates in a high-dimensional space, which might lead to redundancy, whereas the across spatial-domain convolution is a structured sparse group convolution. To address this, we propose a balance decoupling spatial convolution (BDSC) that relaxes the sparsity of the across spatial-domain convolution by learning a spatial configuration, and reduces the redundancy of the across channel-domain convolution by reducing the number of intermediate output channels. In our experiments we found that the performance of models using our decoupling convolution drops only slightly compared with the standard spatial convolution, while the model size is smaller than that of models with the standard spatial convolution.

Our contributions in this paper are:

1. We decouple the standard spatial convolution of a CNN into two parts, an across channel-domain convolution and an across spatial-domain convolution.
2. We propose the balance decoupling spatial convolution to relax the sparsity of the filter in spatial aggregation by learning a spatial configuration, and to reduce the redundancy of the 1×1 across channel-domain convolution by reducing the number of intermediate output channels.
3. Our experiments on CIFAR-100, CIFAR-10 and ImageNet demonstrate that models using the proposed balance decoupling spatial convolution drop slightly in performance compared with models using the standard spatial convolution, but have a smaller model size.

Figure 1: Illustrating the decoupled convolution by (a) decomposing filter and (b) decoupling convolution. In (a), an entry is obtained by first the projection across channel-domain and then the aggregation across spatial-domain. In (b), the convolution is decoupled into an across channel-domain convolution and an across spatial-domain convolution.

## Decoupling Spatial Convolution

A convolutional layer maps a three-dimensional tensor, denoted as input $X \in \mathbb{R}^{C_{in}\times H\times W}$, to a three-dimensional tensor, denoted as output $Y \in \mathbb{R}^{C_{out}\times H\times W}$, where $H\times W$ is the spatial size of the feature maps (the spatial sizes of the input and output feature maps are assumed to be the same), $C_{in}$ is the number of channels in the input and $C_{out}$ is the number of channels in the output. The filters in a convolutional layer are parameterized by a four-dimensional tensor $W \in \mathbb{R}^{C_{out}\times C_{in}\times K_h\times K_w}$, where $K_h\times K_w$ is the spatial size of the filter and $W(o,\cdot,\cdot,\cdot)$ is the $o$-th filter, denoted as $W_o \in \mathbb{R}^{C_{in}\times K_h\times K_w}$, $o = 1,\dots,C_{out}$. We will first show the process of filter decomposition, and then reformulate this process into convolution decoupling. All vectors in this paper are column vectors.
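For concreteness, these shapes can be checked with a few lines of code. This is an illustrative sketch only (PyTorch is our choice, not the paper's Caffe implementation); the batch dimension and `padding=1` are our additions so that `conv2d` keeps the spatial size $H\times W$ unchanged, and the sizes follow the small example of Figure 1.

```python
import torch
import torch.nn.functional as F

C_in, C_out, H, W_sp, K_h, K_w = 2, 1, 6, 6, 3, 3   # the Figure 1 example

X = torch.randn(1, C_in, H, W_sp)                   # input  X: C_in x H x W (batch dim added)
W = torch.randn(C_out, C_in, K_h, K_w)              # filters W: C_out x C_in x K_h x K_w

Y = F.conv2d(X, W, padding=1)                       # output Y: C_out x H x W
print(Y.shape)                                      # torch.Size([1, 1, 6, 6])
```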
### Decomposing Filter

Let $Y_o \in \mathbb{R}^{H\times W}$ be the $o$-th feature map of the output; we have $Y_o = W_o * X$, where $*$ denotes the convolution operation. An entry $y_o$ in $Y_o$ is obtained through first the projection across the channel domain and then the aggregation across the spatial domain. Let the corresponding input patch be denoted as $X^{cor} \in \mathbb{R}^{C_{in}\times K_h\times K_w}$.

1. Across channel-domain projection. Decomposing $W_o$ along the spatial domain, we obtain $\{\mathbf{w}^o_{u,v}\}_{u=1,\dots,K_h,\, v=1,\dots,K_w}$, where $\mathbf{w}^o_{u,v} = W_o(\cdot,u,v) \in \mathbb{R}^{C_{in}}$ is the column of $W_o$ at position $(u,v)$. Accordingly, we decompose $X^{cor}$ as $\{\mathbf{x}^{cor}_{u,v}\}$ corresponding to $\{\mathbf{w}^o_{u,v}\}$. Then the output of the across channel-domain projection is obtained as

$$\mathbf{z}_o = \begin{bmatrix} {\mathbf{w}^o_{1,1}}^T \mathbf{x}^{cor}_{1,1} \\ {\mathbf{w}^o_{1,2}}^T \mathbf{x}^{cor}_{1,2} \\ \vdots \\ {\mathbf{w}^o_{K_h,K_w}}^T \mathbf{x}^{cor}_{K_h,K_w} \end{bmatrix}. \qquad (1)$$

Here, $S = K_h K_w$ and $\mathbf{z}_o \in \mathbb{R}^{S}$ is the intermediate output. This process is illustrated in Figure 1 (a), step 1.

2. Across spatial-domain aggregation. The spatial-domain aggregation is performed on the intermediate outputs using a spatial mask $M = \mathbf{1}_{K_h\times K_w}$, and we denote $\mathbf{m} = \mathrm{vec}(M)$, where $\mathrm{vec}(\cdot)$ is a function that vectorizes a tensor. The output of the aggregation across the spatial domain is

$$y_o = \mathbf{m}^T \mathbf{z}_o, \qquad (2)$$

which is equivalent to the response of the $o$-th filter on the input $X^{cor}$. This process is shown in Figure 1 (a), step 2.

Figure 1 (a) shows an example of this decomposition process, in which $C_{in} = 2$ and the spatial size of the filter $W_o$ is 3×3. The first step decomposes $W_o$ along the spatial domain into 9 single columns, and each is multiplied with the corresponding column of $X^{cor}$ to get the intermediate feature, which has 9 responses corresponding to the $9 = 3\times 3$ different spatial locations. Then across spatial-domain aggregation is conducted on the 9 responses, using the mask $\mathbf{m} = [1\ 1\ 1\ 1\ 1\ 1\ 1\ 1\ 1]^T$.

The above analysis is for one filter at a single location. When there are $C_{out}$ filters, the intermediate output of the first step becomes $\mathbf{z} = [\mathbf{z}_1^T, \mathbf{z}_2^T, \dots, \mathbf{z}_{C_{out}}^T]^T$, and the final output of the second step is

$$\mathbf{y} = \hat{M}\mathbf{z} \qquad (3)$$

$$= \begin{bmatrix} \mathbf{m}^T & 0 & \cdots & 0 \\ 0 & \mathbf{m}^T & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \mathbf{m}^T \end{bmatrix} \begin{bmatrix} \mathbf{z}_1 \\ \mathbf{z}_2 \\ \vdots \\ \mathbf{z}_{C_{out}} \end{bmatrix}, \qquad (4)$$

where $\mathbf{z} \in \mathbb{R}^{S}$, $\hat{M} \in \{0,1\}^{C_{out}\times S}$, $S = C_{out} K_h K_w$, and $\mathbf{y} \in \mathbb{R}^{C_{out}}$.

Collecting all locations of the output maps together, we show that each step can actually be formed as a convolution. As a result, the convolution is decoupled into an across channel-domain convolution and an across spatial-domain convolution.

### Decoupling Convolution

In this section, we give a mathematical formulation to show that each step can be regarded as a convolution.

Across channel-domain convolution. It is easy to see that the projection across the channel domain is equivalent to a 1×1 convolution with the filters $W$ reshaped to $\tilde{W} \in \mathbb{R}^{S\times C_{in}\times 1\times 1}$, whose rows stack the per-position columns of the filters,

$$\tilde{W} = \begin{bmatrix} {\mathbf{w}^1_{1,1}}^T \\ {\mathbf{w}^1_{1,2}}^T \\ \vdots \\ {\mathbf{w}^{C_{out}}_{K_h,K_w}}^T \end{bmatrix}. \qquad (5)$$

Therefore the across channel-domain convolution is given as

$$Z = \tilde{W} * X, \qquad (6)$$

where $Z$ is a three-dimensional tensor of size $S\times H\times W$.

Across spatial-domain convolution. We first expand the spatial mask $M$ into a mask tensor similar to the one shown in Figure 1 (b). The resulting mask tensor, denoted as $M_e \in \{0,1\}^{(K_h K_w)\times K_h\times K_w}$, satisfies that there is only one entry valued 1 in $M_e(k,\cdot,\cdot)$ ($k = 1,\dots,K_h K_w$) and all the entries valued 1 are at different locations of the spatial map of size $K_h\times K_w$. As a result, we can see that the across spatial-domain aggregation is equivalent to a $K_h\times K_w$ convolution with filters $\tilde{M} \in \{0,1\}^{C_{out}\times S\times K_h\times K_w}$,

$$Y = \tilde{M} * Z, \qquad (7)$$

where each filter of $\tilde{M}$ is non-zero only on its own group of $K_h K_w$ intermediate channels; flattening the last three dimensions of $\tilde{M}$ gives the block-diagonal form

$$\begin{bmatrix} \tilde{\mathbf{m}}^T & 0 & \cdots & 0 \\ 0 & \tilde{\mathbf{m}}^T & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \tilde{\mathbf{m}}^T \end{bmatrix}, \qquad (8)$$

$$\tilde{\mathbf{m}} = \mathrm{vec}(M_e). \qquad (9)$$
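The claim that the 1×1 convolution of Eq. (6) followed by the sparse spatial convolution of Eq. (7) reproduces the original convolution can be checked numerically. The following is a minimal sketch (PyTorch, purely illustrative; the tensor names, the added batch dimension, and the particular flattening order of $s$ over $(o, u, v)$ are our assumptions, chosen to be mutually consistent):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C_in, C_out, K, H, W_sp = 2, 4, 3, 6, 6
S = C_out * K * K

X = torch.randn(1, C_in, H, W_sp)
W = torch.randn(C_out, C_in, K, K)

# Standard convolution (padding=1 keeps the spatial size).
Y_std = F.conv2d(X, W, padding=1)

# Step 1: across channel-domain 1x1 convolution with the reshaped filters W_tilde.
# Row s of W_tilde is the column w^o_{u,v} = W[o, :, u, v], with s indexing (o, u, v).
W_tilde = W.permute(0, 2, 3, 1).reshape(S, C_in, 1, 1)
Z = F.conv2d(X, W_tilde)                        # intermediate output Z: S x H x W

# Step 2: across spatial-domain KxK convolution with the fixed sparse mask M_tilde.
# Output filter o sums, over (u, v), the intermediate channel (o, u, v) shifted by (u, v).
M_tilde = torch.zeros(C_out, S, K, K)
for o in range(C_out):
    for u in range(K):
        for v in range(K):
            s = (o * K + u) * K + v
            M_tilde[o, s, u, v] = 1.0
Y_dec = F.conv2d(Z, M_tilde, padding=1)

print(torch.allclose(Y_std, Y_dec, atol=1e-5))  # True: the two-step form matches exactly
```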
In summary, the convolution can be decoupled as

$$Y = \tilde{M} * (\tilde{W} * X), \qquad (10)$$

where the input is first fed into the across channel-domain operation, a 1×1 convolution that maps the input to a very high-dimensional space, and then fed into the across spatial-domain operation, a $K_h\times K_w$ convolution that handles the spatial information and meanwhile performs the dimension reduction. Figure 1 (b) shows the decoupling convolution process and the comparison with the standard convolution. We can see that for each convolution $Y = W * X$, there exist a 1×1 convolution tensor $\tilde{W}$ and a $K_h\times K_w$ convolution tensor $\tilde{M}$ such that $Y = W * X = \tilde{M} * (\tilde{W} * X)$. That is, the two-step interpretation is exactly equivalent to the standard convolution.

After the decoupling, $\tilde{W}$ is no longer related to the spatial domain; it maps the input from the current feature space to another feature space. On the other hand, $\tilde{M}$ encodes the spatial relationship: an entry of $\tilde{M}$ valued 1 means that the corresponding feature is related, and 0 means it is not. We therefore call $\tilde{M}$ the spatial configuration.

## Balance Decoupling Spatial Convolution

From the decoupled spatial convolution shown in Equation (10), we observe that: (i) the spatial configuration $\tilde{M}$ corresponds to a structured sparse group convolution; and (ii) the 1×1 convolution with filters $\tilde{W}$ maps the features from a low-dimensional space into a high-dimensional space (from an input with $C_{in}$ dimensions to an output with $S = C_{out} K_h K_w$ dimensions). This is an unbalanced structure: we think that the intermediate representation contains too many channels and that the spatial configuration $\tilde{M}$ is too sparse. Motivated by these observations, we propose the balance decoupling spatial convolution (BDSC), with a learned spatial configuration and an unaggressive 1×1 convolution obtained by setting $S = C_{in}$.

### Relax the Sparsity of Spatial Configuration

The across spatial-domain convolution is a 3×3 fixed sparse group convolution. In fact, we can learn a spatial configuration $M_l$ to relax the fixed sparse constraint. In the training of a standard convolutional neural network, it is not easy to learn the spatial configuration directly. Instead, we add a floating-point tensor $Q$ corresponding to the spatial configuration $M_l$, and update this floating-point tensor $Q$. When performing forward propagation, we constrain $M_l = \mathrm{sign}(Q)$. The approximated gradients, however, are not smooth. So we adopt the technique of XNOR-Net (Rastegari et al. 2016) to learn $M_l$ by introducing a vector $\alpha$ and approximating $Q$ by $\hat{Q}$,

$$\hat{Q}(o,\cdot,\cdot,\cdot) = \alpha(o)\, M_l(o,\cdot,\cdot,\cdot). \qquad (11)$$

According to XNOR-Net, the best $M_l$ and $\alpha$ to approximate $Q$ by $\hat{Q}$ are

$$M_l = \mathrm{sign}(Q), \qquad \alpha(o) = \frac{1}{n}\|Q(o,\cdot,\cdot,\cdot)\|_1, \qquad (12)$$

where $n = S K_h K_w$. More details about the derivation can be found in the paper (Rastegari et al. 2016). We then approximate the gradient with respect to $Q$, $g_Q$, by the gradient with respect to $\hat{Q}$, $g_{\hat{Q}}$, i.e., we use $g_{\hat{Q}}$ to update $Q$. The training process of the spatial configuration $M_l$ is shown in Algorithm 1. Note that the spatial configuration $M_l$ learned using Algorithm 1 is valued as $-1$ or $1$, which can easily be converted to values 0 or 1 by $M = \frac{1}{2}(M_l + 1)$.
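This procedure can be sketched as a self-contained module. The code below is our own illustrative PyTorch reading of the BDSC block (1×1 projection with $S = C_{in}$ followed by a 3×3 convolution whose weights are binarized and scaled as in Eq. (12)); the class name, the straight-through trick used to route gradients to $Q$, and all hyperparameters are assumptions rather than the authors' Caffe implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BDSC(nn.Module):
    """Sketch of a balance decoupling spatial convolution block (our reading of
    Algorithm 1; names and the straight-through gradient trick are ours)."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        s = c_in                                                   # S = C_in: the "unaggressive" 1x1 projection
        self.proj = nn.Conv2d(c_in, s, kernel_size=1, bias=False)  # across channel-domain convolution
        self.Q = nn.Parameter(torch.randn(c_out, s, k, k) * 0.1)   # float proxy of the spatial configuration
        self.k = k

    def forward(self, x):
        z = self.proj(x)                                           # intermediate output, S x H x W
        with torch.no_grad():
            self.Q.clamp_(-1, 1)                                   # keep Q in [-1, 1] as in Algorithm 1
        alpha = self.Q.abs().mean(dim=(1, 2, 3), keepdim=True)     # per-filter scale, Eq. (12)
        m_l = torch.sign(self.Q)                                   # binary spatial configuration in {-1, +1}
        q_hat = alpha * m_l
        # Straight-through estimator: the forward pass uses q_hat, the backward pass updates Q.
        q_ste = self.Q + (q_hat - self.Q).detach()
        return F.conv2d(z, q_ste, padding=self.k // 2)             # across spatial-domain convolution

# Usage: drop-in replacement for a 3x3 convolution with equal input/output channels.
y = BDSC(64, 64)(torch.randn(1, 64, 32, 32))
print(y.shape)   # torch.Size([1, 64, 32, 32])
```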
Algorithm 1: The training process of $M_l$.

Data: input $X$; the float tensor $Q$ corresponding to the spatial configuration; the gradient $\partial L/\partial Y$ from the backward pass.
Result: feature maps $Y$; the updated $Q$; the spatial configuration $M_l$; the scales $\alpha$ for the approximation.

1. Clamp $Q$ to the range $[-1, 1]$.
2. $M_l = \mathrm{sign}(Q)$.
3. For the $o$-th filter in this layer: $\alpha(o) = \frac{1}{n}\|Q(o,\cdot,\cdot,\cdot)\|_1$ and $\hat{Q}(o,\cdot,\cdot,\cdot) = \alpha(o)\, M_l(o,\cdot,\cdot,\cdot)$.
4. $Y = \mathrm{ConvolutionForward}(X, \hat{Q})$.
5. $\partial L/\partial \hat{Q} = \mathrm{ConvolutionBackward}(\partial L/\partial Y, X)$.
6. $\mathrm{Update}(Q, \partial L/\partial \hat{Q})$.

### Reduce the Redundancy of 1×1 Convolution

The filters $\tilde{W} \in \mathbb{R}^{S\times C_{in}\times 1\times 1}$ in the across channel-domain projection map the features to a high-dimensional space (from $C_{in}$ input channels to $S = C_{out} K_h K_w$ output channels). Usually we have $S > C_{in}$. For example, a convolution layer in ResNet-18 with $C_{out} = 512$ and $K_h = K_w = 3$ results in $S = 512\times 3\times 3 > C_{in} = 512$. That is, the across channel-domain projection of the standard convolution is a mapping from a low-dimensional space to a high-dimensional space, which may cause redundancy. To reduce the redundancy, we set $S = C_{in}$, which is the smallest projection dimension that provides a lossless projection.

### Analysis

We denote the proposed scheme as balance decoupling spatial convolution (BDSC), with an unaggressive 1×1 convolution $\tilde{W} \in \mathbb{R}^{S\times C_{in}\times 1\times 1}$ obtained by setting $S = C_{in}$, followed by a 3×3 convolution with a learned spatial configuration $M \in \{0,1\}^{C_{out}\times S\times K_h\times K_w}$. In the following, we compare the number of parameters and the FLOPs of BDSC with those of the standard convolution, where we assume the numbers of input and output channels are the same, i.e., $C_{out} = C_{in} = C$.

#Params. The number of parameters of a convolution layer with filters $W \in \mathbb{R}^{C\times C\times K_h\times K_w}$ is $C\cdot C\cdot K_h K_w$ in float type. The balance decoupling spatial convolution layer in our network contains the projection filters $\tilde{W} \in \mathbb{R}^{C\times C\times 1\times 1}$ and the spatial configuration $M \in \{0,1\}^{C\times C\times K_h\times K_w}$, where the number of parameters of $\tilde{W}$ is $C\cdot C$ in float type and the number of parameters of $M$ is $C\cdot C\cdot K_h K_w$ in binary values $\{0,1\}$, which amounts to $\frac{1}{32} C\cdot C\cdot K_h K_w$ with respect to float type. Thus the compression rate is

$$r_p = \frac{C\cdot C\cdot K_h K_w}{\frac{1}{32} C\cdot C\cdot K_h K_w + C\cdot C}. \qquad (13)$$

With the typical setting $K_h = K_w = 3$, the compression rate is $r_p = \frac{1}{\frac{1}{32} + \frac{1}{9}} \approx 7$.

FLOPs. For a standard convolution layer, the FLOPs are $H\cdot W\cdot C\cdot C\cdot K_h K_w$, with $H\times W$ being the spatial size of the output. For our network, $M$ is a tensor with values $\{0,1\}$, and an entry is valued 1 with probability $q$. The convolution with a spatial configuration valued 0 and 1 contains only additions and no multiplications (each entry valued 1 contributes an addition, counted as half a multiply-add), so the FLOPs of the across spatial-domain convolution with spatial configuration $M$ are $\frac{q}{2}$ of the FLOPs of the standard convolution. The FLOPs of the across channel-domain convolution and the across spatial-domain convolution together are

$$H\cdot W\cdot \left(C\cdot C + \frac{q}{2}\, C\cdot C\cdot K_h K_w\right). \qquad (14)$$

In summary, the speed-up rate is

$$r_f = \frac{H\cdot W\cdot C\cdot C\cdot K_h K_w}{H\cdot W\cdot \left(C\cdot C + \frac{q}{2}\, C\cdot C\cdot K_h K_w\right)}. \qquad (15)$$

With the typical setting $K_h = K_w = 3$ and $q = \frac{1}{2}$ (in the experiments, usually $q < \frac{1}{2}$), the speed-up rate is $r_f = \frac{1}{\frac{1}{9} + \frac{1}{4}} \approx 2.8$.

## Experiments

### Datasets

We use three datasets to evaluate our network. The first is the benchmark ImageNet dataset (ILSVRC 2012) (Russakovsky et al. 2015), which consists of 1,000 classes. ImageNet contains over 1.2 million training images and 50,000 validation images. For testing, we report the top-1 accuracy of the center crop on the ImageNet validation set. The reported results are the best performance of the model during training.
The other two are the CIFAR-100 dataset, which contains 50,000 training images and 10,000 test images labeled with 100 classes, and the CIFAR-10 dataset, which also consists of 50,000 training images and 10,000 test images, labeled with 10 classes. For training, we randomly resize each 32×32 image to a scale within the range [32, 40] and randomly crop a 32×32 patch with random horizontal mirroring. We then test on the 10,000 test images at size 32×32.

### Setup

We implement our model based on Caffe (Jia et al. 2014). For the 1,000-class classification task on ImageNet, we train all models for 500,000 iterations with batch size 256. For CIFAR-100 and CIFAR-10, we train for 180,000 iterations with batch size 64. The weight decay is set to 0.0001 and the momentum to 0.9. We set the initial learning rate to 0.1 and divide it by 10 every 150,000 iterations on ImageNet and every 50,000 iterations on CIFAR. On ImageNet, we use multi-scale data augmentation (randomly resizing the image to a scale within the range [256, 480]) and random cropping with random horizontal mirroring. We initialize the weights with the MSRA initialization technique introduced in (He et al. 2015) and train the models from scratch. We train all models by SGD with Nesterov momentum. We use a factor α to multiply the width of the network, e.g., ResNet32-α, where α means that we widen each layer of ResNet32 by the factor α. In all models for ImageNet and CIFAR, we keep the first layer and the last layer unchanged.

Table 1: Comparison between standard convolution and BDSC-p with $S = p\, C_{in}$ based on ResNet32-α over CIFAR-100 (accuracy / model size). BDSC-p achieves better performance with a larger p. Compared with the standard convolution, BDSC-3 achieves better performance with a smaller model size.

| Model | ResNet32-1 | ResNet32-2 | ResNet32-3 | ResNet32-4 |
| --- | --- | --- | --- | --- |
| Standard convolution | 0.6839 / 1.83MB | 0.7283 / 7.17MB | 0.7450 / 16.0MB | 0.7568 / 28.4MB |
| BDSC-1 | 0.6648 / 0.36MB | 0.7199 / 1.20MB | 0.7392 / 2.58MB | 0.7502 / 4.49MB |
| BDSC-2 | 0.6824 / 0.63MB | 0.7326 / 2.25MB | 0.7458 / 4.91MB | 0.7566 / 8.61MB |
| BDSC-3 | 0.6969 / 0.90MB | 0.7357 / 3.30MB | 0.7499 / 7.24MB | 0.7587 / 15.4MB |

Figure 2 (accuracy vs. model size in MB): Illustrating the results of BDSC-p with $S = p\, C_{in}$ on CIFAR-100. When p decreases, the curve of BDSC-p moves to the left, which shows that we can reduce the redundancy by decreasing p while maintaining the same level of accuracy.

### Empirical Study

The Effect of Intermediate Feature Width. In the across channel-domain convolution, we set $S = C_{in}$ to reduce the redundancy and guarantee a lossless projection. In this experiment, we explore how the dimension $S$ reflects the redundancy of the standard convolution, using ResNet32-α on the CIFAR-100 dataset. We view p in $S = p\, C_{in}$ as a variable and investigate how the accuracy-against-model-size curve changes as p becomes larger. The results are shown in Figure 2 for different widths $S = p\, C_{in}$ with p = 1, 2, 3.
We denote these models as BDSC-p with $S = p\, C_{in}$; e.g., BDSC-1 is ResNet32-α with p = 1, BDSC-2 is ResNet32-α with p = 2, and so on. The previous analysis shows that ResNet32-α with the standard convolution is equivalent to a decoupled model with a fixed sparse spatial configuration in the across spatial-domain convolution and p = 9 in the across channel-domain convolution. Thus ResNet32-α with the standard convolution can be viewed as an extreme case of our BDSC model with the most redundancy. From Figure 2, where α changes from 1 to 4 (from left to right on each curve), we find that the accuracy of ResNet32-α with the standard convolution grows slowly as the model size increases, while the accuracy of BDSC-p grows faster. For example, the curve of BDSC-2 reaches its highest accuracy at the point (8.61, 0.756) (these numbers are shown in Table 1), while the curve of the standard convolution reaches its highest accuracy at the point (28.4, 0.7568), which indicates that about 28.4 - 8.6 = 19.8MB of the parameters of ResNet32-4 with the standard convolution are wasted. This shows that there is large redundancy in models using the standard convolution. When p changes from 3 to 1, the curve of BDSC-p gradually moves to the left. This phenomenon shows that the redundancy is gradually reduced as p becomes smaller. BDSC-1 shows a good trade-off between model size and accuracy; as a result, setting p = 1 is a suggested choice when designing a model to reduce the redundancy of the standard convolution.

The Effect of the Type of Spatial Configuration. In BDSC, the spatial configuration $M_l$ is forced to be a tensor with values 0 or 1. To verify whether this setting is efficient, we compare two types of $M_l$: one with float type (32 bits), and the other with values {0, 1} (1 bit). We conduct experiments on ResNet32-2 and ResNet32-4 on CIFAR-100, and the comparison of model size and accuracy is given in Table 2. We can see that models with {0,1}-type spatial filters perform worse than models with float-type spatial filters, and the gaps are 1.55% and 0.95% on ResNet32-2 and ResNet32-4 respectively. These gaps are acceptable, as models with {0,1}-type spatial filters achieve a smaller model size, saving 7.15MB and 28.51MB respectively. This shows that the spatial configuration in our network is reasonable.

Table 2: Comparison between models using float-type spatial filters (denoted as BDSC-float) and our proposed BDSC, which uses {0,1}-type spatial filters. Acc. denotes accuracy. Our model achieves slightly inferior performance, but largely reduces the model size.

| Model | ResNet32-2 Acc. | ResNet32-2 Model size | ResNet32-4 Acc. | ResNet32-4 Model size |
| --- | --- | --- | --- | --- |
| BDSC-float | 0.7354 | 8.35MB | 0.7597 | 33.0MB |
| BDSC | 0.7199 | 1.20MB | 0.7502 | 4.49MB |

### Results

The experiments cover three aspects. First, we compare BDSC with the standard convolution based on ResNet (He et al. 2016). Then we show the advantage of BDSC over depthwise separable convolution (Chollet 2016). Finally, we demonstrate the effectiveness of BDSC on densely connected networks (Huang et al. 2016).

#### Comparison with Standard Convolution

We use ResNet (He et al. 2016) as the baseline, and our models replace all the convolution layers with BDSC except the first convolution layer.

ImageNet.
Table 3 shows the results of models with the standard convolution and our models with BDSC. We implement the baseline ResNet18 and ResNet34 ourselves, and their performance is comparable to the results in the original paper (He et al. 2016). It can be seen that, by reducing the redundancy, our BDSC models achieve about a 5× compression rate. By learning a flexible spatial configuration, our model ResNet18-BDSC obtains a top-1 accuracy of 0.6898 and ResNet34-BDSC obtains a top-1 accuracy of 0.7219, which is comparable to 0.6944 for ResNet18 and 0.7294 for ResNet34. On both models, the top-1 accuracy drops by less than 0.75% while the model sizes are reduced by about 5×. This demonstrates that our models make better use of the parameters while maintaining high accuracy. The empirical rate is about 5×, smaller than the theoretical rate of about 7×; this might be caused by the size of the classifier, the first convolution layer, and possibly other costs.

Table 3: Comparison between standard convolution, depthwise separable convolution (denoted as Depthwise) and BDSC over ImageNet. Our BDSC, with a smaller model size, performs slightly worse than standard convolution. Compared with depthwise separable convolution, our model achieves better performance with a similar model size.

| Model | Standard convolution | Depthwise | BDSC |
| --- | --- | --- | --- |
| ResNet18 Accuracy | 0.6944 | 0.6570 | 0.6898 |
| ResNet18 Model size | 44.6MB | 8.89MB | 9.00MB |
| ResNet34 Accuracy | 0.7294 | 0.6868 | 0.7219 |
| ResNet34 Model size | 83.2MB | 13.4MB | 14.6MB |

CIFAR. Similar to ImageNet, our models are formed by replacing the standard spatial convolution layers with BDSC in ResNet32-2 and ResNet74-2. The results are shown in Table 4. Compared with standard convolution, the top-1 accuracy of ResNet32-2 with BDSC-1 is close to the accuracy of ResNet32-2 with standard convolution. For example, on CIFAR-100, ResNet32-2 with BDSC-1 has a top-1 accuracy of 0.7199, while ResNet32-2 with standard convolution has 0.7283, about a 0.8% drop in accuracy, while the model size of our network is 1.20MB, about 6× smaller than the network with standard convolution. We also show the accuracy-against-model-size curve on CIFAR-100 by varying α in ResNet32-α in Figure 3. It can be seen that our model with BDSC can largely boost the accuracy as the model size increases. Note that the accuracy of ResNet32-1 with BDSC is 0.6648, about a 1.9% drop compared to 0.6839 for ResNet32-1 with the standard convolution. The reason might be that the intermediate output dimension of the 1×1 convolution in BDSC is too small here, given that the output channels of ResNet32-1 are [16, 32, 64] for the three stages, so the intermediate representation is not sufficient to express enough information. In such a case, where the number of intermediate channels of a model is small, we suggest setting p in $S = p\, C_{in}$ a bit larger, e.g., p > 1. From Table 1, we can see that setting p = 2 on ResNet32-1 leads to nearly no drop in top-1 accuracy, while the model size of BDSC is still smaller than that of the standard convolution model (0.63MB for BDSC vs. 1.83MB for the standard convolution model).

#### Comparison with Depthwise Separable Convolution

Depthwise separable convolution (Chollet 2016) decouples the standard convolution into a 3×3 depthwise convolution followed by a 1×1 convolution. Here we conduct experiments to compare BDSC with depthwise separable convolution.

ImageNet. We replace the standard convolution layers in ResNet with the corresponding depthwise separable convolutions and set the weight decay on the depthwise 3×3 convolution to 0 during training. The comparison results are shown in Table 3, from which we can see that the BDSC model has a similar model size to the model with depthwise separable convolution, but achieves higher accuracy, about 3% better on both ResNet18 and ResNet34.

CIFAR. The comparison on the CIFAR datasets is shown in Table 4 and Figure 3. We can see in Table 4 that our model achieves higher accuracy than depthwise separable convolution.
Figure 3 shows the accuracy-against-model-size curves of BDSC and depthwise separable convolution on ResNet32-α. Our model with BDSC achieves higher accuracy than depthwise separable convolution at the same level of model size. For example, the top-1 accuracy of ResNet32-1 with BDSC is about 1.8% higher than that of depthwise separable convolution, which shows the advantage of BDSC over depthwise separable convolution. We think that the superior performance of BDSC over depthwise separable convolution stems from the difference in spatial-relationship encoding ability. The 3×3 depthwise convolution is a channel-wise convolution, encoding the spatial relationship within one channel. BDSC uses the learned spatial configuration, which encodes the spatial relationship not only within one channel, but also across channels.

Table 4: Comparison between standard convolution, depthwise separable convolution (denoted as Depthwise) and BDSC over CIFAR (accuracy / model size). BDSC-3, with a smaller model size, achieves performance comparable to standard convolution. BDSC-1 achieves better performance at a similar level of model size compared with depthwise separable convolution.

| Model | CIFAR-100 ResNet32-2 | CIFAR-100 ResNet74-2 | CIFAR-10 ResNet32-2 | CIFAR-10 ResNet74-2 |
| --- | --- | --- | --- | --- |
| Standard convolution | 0.7283 / 7.17MB | 0.7476 / 17.5MB | 0.9369 / 7.12MB | 0.9430 / 17.5MB |
| BDSC-3 | 0.7357 / 3.30MB | 0.7515 / 7.96MB | 0.9379 / 3.25MB | 0.9394 / 7.91MB |
| Depthwise | 0.6937 / 1.08MB | 0.7201 / 2.48MB | 0.9225 / 1.04MB | 0.9309 / 2.44MB |
| BDSC-1 | 0.7199 / 1.20MB | 0.7332 / 2.83MB | 0.9334 / 1.16MB | 0.9380 / 2.78MB |

Figure 3 (accuracy vs. model size in MB): Comparison between standard convolution, depthwise separable convolution (denoted as depthwise) and BDSC based on ResNet32-α over CIFAR-100. Our model achieves better performance than both standard convolution and depthwise separable convolution at the same model size.

Analysis. We compare the number of parameters and the FLOPs of our BDSC and depthwise separable convolution. For a depthwise separable convolution with input and output channels both being $C$, the number of parameters is

$$C\cdot K_h K_w + C\cdot C, \qquad (16)$$

and the FLOPs are

$$H\cdot W\cdot (C\cdot K_h K_w + C\cdot C). \qquad (17)$$

Therefore, the ratios of #params and FLOPs of BDSC to those of depthwise separable convolution, denoted as $r_p$ and $r_f$ respectively, are (taking $q = \frac{1}{2}$ as before)

$$r_p = \frac{\frac{1}{32}\, C\cdot C\cdot K_h K_w + C\cdot C}{C\cdot K_h K_w + C\cdot C}, \qquad (18)$$

$$r_f = \frac{H\cdot W\cdot \left(\frac{1}{4}\, C\cdot C\cdot K_h K_w + C\cdot C\right)}{H\cdot W\cdot (C\cdot K_h K_w + C\cdot C)}. \qquad (19)$$

When $C$ is in the range [32, 2048], $r_p$ is in the range [1, 1.27] and $r_f$ is in the range [2.54, 3.22]. This suggests that depthwise separable convolution is faster than BDSC with a similar number of parameters. However, from Figure 3, one can see that at the same level of model size our BDSC achieves better performance. Accuracy, model size and speed are the three factors we aim to balance, so BDSC offers an alternative trade-off among speed, model size and accuracy, and it is a good choice when accuracy is the main concern together with a small model size and a moderate speedup.
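The counts behind Eqs. (13)-(19) can be reproduced with a few lines of arithmetic. This is a sketch under the same assumptions as the text, i.e., $K_h = K_w = 3$ and $q = \frac{1}{2}$; the spatial size $H\times W$ cancels in every ratio, so its value is arbitrary.

```python
# Parameter and FLOP counts for standard convolution, BDSC, and depthwise
# separable convolution (our arithmetic, following Eqs. (13)-(19)).
def counts(C, H=14, W=14, Kh=3, Kw=3, q=0.5):
    std_params  = C * C * Kh * Kw
    std_flops   = H * W * C * C * Kh * Kw
    bdsc_params = C * C * Kh * Kw / 32 + C * C                 # 1-bit mask + float 1x1 filters
    bdsc_flops  = H * W * (C * C + (q / 2) * C * C * Kh * Kw)  # only additions in the masked conv
    dw_params   = C * Kh * Kw + C * C                          # depthwise separable convolution
    dw_flops    = H * W * (C * Kh * Kw + C * C)
    return std_params, std_flops, bdsc_params, bdsc_flops, dw_params, dw_flops

sp, sf, bp, bf, dp, df = counts(C=512)
print(f"vs standard : r_p = {sp / bp:.2f}, r_f = {sf / bf:.2f}")   # ~7.0 and ~2.8, as in Eqs. (13)-(15)
for C in (32, 2048):
    sp, sf, bp, bf, dp, df = counts(C=C)
    # close to the [1, 1.27] and [2.54, 3.22] ranges quoted above
    print(f"C={C}: vs depthwise r_p = {bp / dp:.2f}, r_f = {bf / df:.2f}")
```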
#### Comparison over Densely Connected Networks

We also show the effectiveness of our BDSC on densely connected networks (Huang et al. 2016). DenseNet-40 (k = 12) is adopted for experiments on CIFAR-100 and CIFAR-10. We use the same data augmentation as (Huang et al. 2016) and train for 400 epochs; the results are shown in Table 5. We also report the numbers from the paper, denoted as DenseNet. We can see that, although our own implementation DenseNet* performs worse than the numbers from the paper, our models with BDSC blocks still achieve better performance than DenseNet with a smaller model size.

Table 5: Comparison between standard convolution and BDSC based on DenseNet. We implement DenseNet ourselves and report the results denoted as DenseNet*; we also report the results from the paper, denoted as DenseNet. We set $S = 6\, C_{out}$ in BDSC-6. Acc. denotes accuracy.

| Model | CIFAR-10 Acc. | CIFAR-10 Model size | CIFAR-100 Acc. | CIFAR-100 Model size |
| --- | --- | --- | --- | --- |
| DenseNet | 0.9476 | 3.98MB | 0.7558 | 4.13MB |
| DenseNet* | 0.9433 | 3.98MB | 0.7450 | 4.13MB |
| BDSC-6 | 0.9494 | 2.95MB | 0.7629 | 3.10MB |

## Conclusion

In this paper, we present a novel two-step interpretation of convolution by decoupling it into an across channel-domain convolution and an across spatial-domain convolution. Based on this interpretation, we propose an effective approach that relaxes the sparsity of the fixed sparse filter in the across spatial-domain convolution and reduces the redundancy of the 1×1 convolution. Empirical results on the ImageNet and CIFAR datasets demonstrate that our proposed balance decoupling spatial convolution can achieve a model with a small size that still performs comparably to standard convolution.

## Acknowledgments

This project was supported by the NSFC (U1611461, 61573387).

## References

Chollet, F. 2016. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357.

Denton, E. L.; Zaremba, W.; Bruna, J.; LeCun, Y.; and Fergus, R. 2014. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, 1269–1277.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, 1026–1034.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Huang, G.; Liu, Z.; Weinberger, K. Q.; and van der Maaten, L. 2016. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993.

Ioannou, Y.; Robertson, D.; Shotton, J.; Cipolla, R.; and Criminisi, A. 2015. Training CNNs with low-rank filters for efficient image classification. arXiv preprint arXiv:1511.06744.

Ioannou, Y.; Robertson, D.; Cipolla, R.; and Criminisi, A. 2016. Deep roots: Improving CNN efficiency with hierarchical filter groups. arXiv preprint arXiv:1605.06489.

Jaderberg, M.; Vedaldi, A.; and Zisserman, A. 2014. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866.

Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; and Darrell, T. 2014. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.

Kim, Y.-D.; Park, E.; Yoo, S.; Choi, T.; Yang, L.; and Shin, D. 2015. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530.

Kolesnikov, A., and Lampert, C. H. 2016. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In European Conference on Computer Vision, 695–711. Springer.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.

Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3431–3440.
Mamalet, F., and Garcia, C. 2012. Simplifying ConvNets for fast learning. In Artificial Neural Networks and Machine Learning, ICANN 2012, 58–65.

Rastegari, M.; Ordonez, V.; Redmon, J.; and Farhadi, A. 2016. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, 525–542. Springer.

Redmon, J., and Farhadi, A. 2016. YOLO9000: Better, faster, stronger. arXiv preprint arXiv:1612.08242.

Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 91–99.

Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; and Fei-Fei, L. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115(3):211–252.

Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9.

Szegedy, C.; Ioffe, S.; Vanhoucke, V.; and Alemi, A. 2016. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261.

Tai, C.; Xiao, T.; Zhang, Y.; Wang, X.; et al. 2015. Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067.

Zagoruyko, S., and Komodakis, N. 2016. Wide residual networks. arXiv preprint arXiv:1605.07146.

Zhang, X.; Zou, J.; Ming, X.; He, K.; and Sun, J. 2015. Efficient and accurate approximations of nonlinear convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1984–1992.

Zhang, T.; Qi, G.-J.; Xiao, B.; and Wang, J. 2017. Interleaved group convolutions. In International Conference on Computer Vision.