# CircConv: A Structured Convolution with Low Complexity

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Siyu Liao, Bo Yuan
Department of Electrical and Computer Engineering, Rutgers University
siyu.liao@rutgers.edu, bo.yuan@soe.rutgers.edu

## Abstract

Deep neural networks (DNNs), especially deep convolutional neural networks (CNNs), have emerged as a powerful technique in various machine learning applications. However, the large model sizes of DNNs place high demands on computation resources and weight storage, thereby limiting the practical deployment of DNNs. To overcome these limitations, this paper proposes to impose a circulant structure on the construction of convolutional layers, which leads to circulant convolutional layers (CircConvs) and circulant CNNs. The circulant models can be either trained from scratch or re-trained from a pre-trained non-circulant model, making the approach flexible for different training environments. Extensive experiments show that this strong structure-imposing approach substantially reduces the number of parameters of convolutional layers and enables significant savings in computational cost by using the fast multiplication of the circulant tensor.

## Introduction

Large-scale deep neural networks (DNNs), especially deep convolutional neural networks (CNNs), have achieved extraordinary success in various artificial intelligence applications such as image recognition and video analysis (Krizhevsky, Sutskever, and Hinton 2012; He et al. 2016; Karpathy et al. 2014). However, the large model sizes of DNNs make them both computation-intensive and memory-intensive, thereby potentially hindering the expected widespread deployment of DNNs in many latency-sensitive, resource-constrained applications.

To address these limitations, many approaches (Han, Mao, and Dally 2015; Gong et al. 2014; Wen et al. 2016; Feng and Darrell 2015) have been proposed to reduce the computational cost and/or memory footprint of DNNs. In general, the existing efforts can be roughly categorized into two types: fully-connected layer-oriented reduction, such as connection pruning (Han, Mao, and Dally 2015), which can also reduce convolutional layers to some degree but achieves most of its parameter reduction on fully-connected layers, and weight clustering (Gong et al. 2014); and convolutional layer-oriented reduction, such as low-rank approximation (Jaderberg, Vedaldi, and Zisserman 2014) and sparsity regularization (Wen et al. 2016; Feng and Darrell 2015). Nowadays, considering that 1) convolutional layers consume most of the computational processing in DNNs and 2) many state-of-the-art DNNs, such as ResNet (He et al. 2016) and Inception (Szegedy et al. 2015), use very few fully-connected layers that contain only a small portion of the parameters of the entire model (e.g., the fully-connected layers of ResNet-152 contain less than 5% of its parameters), reducing the computational cost and number of parameters of convolutional layers becomes essential.

Technical preview and advantages. In this paper we propose to impose the circulant structure on the construction of convolutional layers, as shown in Figure 1, yielding low-computation-complexity, low-space-cost circulant convolutional layers (CircConvs) and the corresponding circulant CNNs.
Different from prior convolutional layer-oriented compression approaches, which are based on unstructured tensors, the model-size reduction in this paper results from the use of circulant tensors (Rezghi and Eldén 2011): the weight tensors of convolutional layers, which were originally unstructured, are now constructed in circulant format, thereby leading to a substantial reduction in computational cost and number of parameters. In short, the proposed approach brings the following advantages:

1) It reduces the space cost of the convolutional layers because of the inherent spatial regularity of circulant tensors, thereby resulting in high compression ratios for the overall network model sizes.
2) It saves computation in the convolutional layers by leveraging the fast circulant tensor-specific multiplication algorithm, and hence greatly reduces the computational cost of the entire network.
3) It enables improved accuracy for the corresponding circulant CNNs compared with similar-size non-circulant CNNs. In other words, the benefits of model-size reduction from using circulant convolutional layers can translate into increased accuracy.
4) The circulant structure can be imposed either by training from scratch or by re-training from a pre-trained non-circulant model, thereby making circulant CNNs very flexible for different training environments.

Figure 1: Illustration of a circulant weight tensor. Blocks of the same color in the middle share the same set of kernel weights (on the right). This significantly reduces the total number of parameters needed to represent this tensor. In addition, the placement of blocks displays a circulant structure, facilitating FFT-based fast algorithms.

We conduct extensive experiments on circulant CNNs, and the results show that the proposed circulant structure-imposing approach can effectively reduce the model sizes and floating-point operations (FLOPs) with negligible accuracy drop. In addition, experiments on Wide ResNets show that the proposed approach leads to better accuracy than non-circulant ResNet models with similar numbers of parameters. Furthermore, we compare the FLOPs of the proposed circulant CNN models with those of non-circulant CNN models, and the experimental results show that our method reduces FLOPs for inference.

## Related Work

Weight pruning/clustering. (Gong et al. 2014) proposes to cluster the weights to reduce the model sizes of DNNs. In that work, various types of weight clustering, including scalar quantization, product quantization, and residual quantization, are investigated. (Han, Mao, and Dally 2015) proposes a multi-step compression pipeline comprising weight pruning, clustering, and quantization to achieve high compression ratios for the entire network. Notice that, as indicated in (Wen et al. 2016), because most of the parameter reduction in (Gong et al. 2014) and (Han, Mao, and Dally 2015) is achieved on fully-connected layers, the reduction in the computational cost of convolutional layers is not significant. In addition, it has also been found that weight parameters in the frequency domain can be pruned (Wang et al. 2016) or clustered via a hashing function (Chen et al. 2016).

Low-rank approximation (LRA). LRA is an efficient approach to compress DNNs (Jaderberg, Vedaldi, and Zisserman 2014; Sainath et al. 2013; Zhao, Li, and Gong 2016).
In (Jaderberg, Vedaldi, and Zisserman 2014), various types of LRA-based solutions are proposed to reduce the number of parameters and the computational cost of convolutional layers. However, model-size reduction using LRA usually requires costly reiterations of decomposition and fine-tuning to minimize the approximation error and retain accuracy.

Sparsity regularization. Increasing the sparsity of a network by performing regularization is another popular technique to reduce the model sizes of DNNs. (Feng and Darrell 2015), (Girosi, Jones, and Poggio 1995), and (Wen et al. 2016) propose several sparsity-introducing techniques by leveraging different types of regularization, such as the L1-norm, group lasso, etc. Though sparsity regularization essentially provides a stable reduction in computational cost, the resulting reduction in model size is not significant.

Structured transform. By using structured matrices, the structured transform can enable very high compression ratios for fully-connected (FC) layers. In (Sindhwani, Sainath, and Kumar 2015; Cheng et al. 2015; Moczulski et al. 2015), weight matrices are constructed in the format of structured matrices to achieve significant reduction in model sizes. (Zhao et al. 2017) further proves that low displacement rank-based neural networks, which are a generalization of these structured networks, still exhibit the universal approximation property. However, the FC layer-specific approaches in (Sindhwani, Sainath, and Kumar 2015; Cheng et al. 2015; Moczulski et al. 2015) cannot be directly applied to the popular and important convolutional layers. Instead, considering that an FC layer can be viewed as a special type of convolutional layer, our proposed circulant convolution-imposing approach has more generality and is very useful for practical applications. In particular, compared with (Cheng et al. 2015) and (Sindhwani, Sainath, and Kumar 2015), we generalize the structure from the regular 2-D weight matrix of a fully-connected layer to the 4-D weight tensor of a convolutional layer, where the underlying computation is totally different. Moreover, this paper differs significantly from two related works (Ding et al. 2017; Wang et al. 2018) in the following aspects: 1) we propose an approach that operates directly on the 4-D circulant tensor with all the algorithm-level details, whereas the prior works require an extra processing step to convert the tensor to a matrix; 2) we perform comprehensive experiments on circulant convolution and present accuracy results and analysis on different datasets; 3) we propose a novel algorithm to convert a non-circulant tensor into a circulant tensor, thereby making it possible to obtain circulant convolutions from pre-trained non-circulant models, whereas the prior works can only train the structure from scratch.

## Imposing Circulant Structure to Convolutional Layers

### Circulant Convolutional Layer

Conventional convolutional layer. In general, a convolutional layer maps a 3-dimensional input tensor $\mathcal{X} \in \mathbb{R}^{W_0 \times H_0 \times C_0}$ into a 3-dimensional output tensor $\mathcal{Y} \in \mathbb{R}^{W_2 \times H_2 \times C_2}$ through convolution with a 4-dimensional kernel tensor $\mathcal{W} \in \mathbb{R}^{W_1 \times H_1 \times C_0 \times C_2}$. Here $W_i$ and $H_i$ for $i = 0, 1, 2$ are the spatial width and height of the input, kernel, and output tensor, respectively; $C_0$ and $C_2$ are the numbers of input channels and output channels. The convolution operation is expressed as follows:

$$\mathcal{Y}(w_2, h_2, c_2) = \sum_{w_1, h_1, c_0} \mathcal{X}(w_2 - w_1, h_2 - h_1, c_0)\, \mathcal{W}(w_1, h_1, c_0, c_2). \tag{1}$$

Although a stride can be set for the convolution, we consider the case of stride equal to 1 to give a clearer understanding of circulant convolution.
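To make the tensor shapes and index convention of Eq. 1 concrete, here is a minimal NumPy sketch (an illustration under our own naming, not the authors' code). It uses the cross-correlation indexing common in deep-learning frameworks, which differs from the flipped-kernel form written in Eq. 1 only by a reversal of the kernel's spatial axes.

```python
import numpy as np

def conv_layer(X, W):
    """Plain convolutional layer (stride 1, no padding).

    X: input tensor of shape (W0, H0, C0)
    W: kernel tensor of shape (W1, H1, C0, C2)
    Returns Y of shape (W2, H2, C2), where W2 = W0 - W1 + 1 and H2 = H0 - H1 + 1.
    """
    W0, H0, C0 = X.shape
    W1, H1, _, C2 = W.shape
    W2, H2 = W0 - W1 + 1, H0 - H1 + 1
    Y = np.zeros((W2, H2, C2))
    for w2 in range(W2):
        for h2 in range(H2):
            for w1 in range(W1):
                for h1 in range(H1):
                    # input fiber (length C0) times kernel slice (C0 x C2) -> length C2,
                    # i.e., the fiber-by-slice form that Eq. 2 below makes explicit
                    Y[w2, h2, :] += X[w2 + w1, h2 + h1, :] @ W[w1, h1, :, :]
    return Y
```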
It should be noted that the stride does not affect our convolution algorithm and design. Moreover, we can express Eq. 1 in the form of a fiber multiplied by a slice as below:

$$\mathcal{Y}(w_2, h_2, :) = \sum_{w_1, h_1} \mathcal{X}(w_2 - w_1, h_2 - h_1, :) \cdot \mathcal{W}(w_1, h_1, :, :), \tag{2}$$

where $\cdot$ and $:$ denote matrix-vector multiplication and the full range of indices, respectively.

Circulant convolutional layer. Different from a conventional convolutional layer, a circulant convolutional layer has a weight tensor $\mathcal{W}$ that exhibits circulant structure. In other words, the $\mathcal{W}$ of a circulant convolutional layer is a 4-D circulant tensor (Rezghi and Eldén 2011). In general, a circulant tensor can exhibit circulant structure along any pair of its dimensions. However, since $W_1$ and $H_1$ are usually much smaller than $C_0$ and $C_2$ for the tensor $\mathcal{W}$, we impose the circulant structure along the input-channel and output-channel dimensions to achieve a high model-size compression ratio. Note that in practice we need to partition the tensor $\mathcal{W}$ into circulant sub-tensors of size $W_1 \times H_1 \times N \times N$. This is necessary because the circulant structure requires the two corresponding dimensions to be equal, while $C_0$ and $C_2$ are usually not the same. A larger $N$ means a larger compression ratio, but it can hurt model performance to some degree. By adjusting the partition size $N$ we can balance the trade-off between compression ratio and model accuracy.

More specifically, let $N$ be the partition size with $C_0 = RN$ and $C_2 = SN$ (zero-padding is applied when $N$ does not divide $C_0$ or $C_2$); then $\mathcal{W}$ can be defined by a 4-dimensional base tensor $\mathcal{W}' \in \mathbb{R}^{W_1 \times H_1 \times RN \times S}$:

$$\mathcal{W}(w_1, h_1, c_0, c_2) = \mathcal{W}'(w_1, h_1, p, q), \tag{3}$$

where $p, q$ are indices satisfying $\lceil c_0/N \rceil = \lceil p/N \rceil$, $\lceil c_2/N \rceil = q$, and $c_0 - c_2 \equiv p \pmod{N}$.

Fig. 1 illustrates the circulant structure of the weight tensor $\mathcal{W}$. From this figure, it can be seen that the circulant structure is imposed on $\mathcal{W}$ along the input/output channel dimensions. The block-circulant weight tensor consists of six circulant weight sub-tensors, where different colors represent different circulant weight sub-tensors. Each circulant weight sub-tensor consists of sixteen kernel filters, represented in different colors such as green and yellow.
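As an illustration of the index mapping in Eq. 3, the following NumPy sketch expands a base tensor into the full block-circulant weight tensor and shows the factor-of-$N$ parameter reduction (a minimal sketch with 0-based indices; `expand_circulant` is our own hypothetical helper, not the authors' implementation).

```python
import numpy as np

def expand_circulant(W_base, N):
    """Expand a base tensor W' of shape (W1, H1, R*N, S) into the full
    block-circulant weight tensor W of shape (W1, H1, R*N, S*N), per Eq. 3.

    With 0-based indices, W[:, :, c0, c2] is taken from W'[:, :, p, q], where
    p lies in the same input-channel block as c0, q is the output-channel
    block of c2, and p is congruent to (c0 - c2) modulo N.
    """
    W1, H1, RN, S = W_base.shape
    R = RN // N
    W_full = np.empty((W1, H1, R * N, S * N))
    for c0 in range(R * N):
        for c2 in range(S * N):
            r, q = c0 // N, c2 // N        # block indices along input/output channels
            p = r * N + (c0 - c2) % N      # offset within the input-channel block
            W_full[:, :, c0, c2] = W_base[:, :, p, q]
    return W_full

# Parameter count: the base tensor stores W1*H1*R*N*S values versus
# W1*H1*(R*N)*(S*N) for an unstructured kernel, i.e., a factor-of-N reduction.
W_base = np.random.randn(3, 3, 2 * 4, 3)   # W1 = H1 = 3, R = 2, N = 4, S = 3
W_full = expand_circulant(W_base, N=4)
print(W_base.size, W_full.size)            # 216 vs. 864 (ratio N = 4)
```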
### Fast Forward and Backward Propagation Schemes on Circulant Convolutional Layers

Eq. 3 shows that the weight tensor $\mathcal{W}$ of a circulant convolutional layer exhibits circulant structure and has a reduced number of independent parameters. Besides, according to tensor theory (Rezghi and Eldén 2011), a circulant tensor also has the advantage of fast multiplication. Since multiplication is the kernel computation in neural network training and inference, the existence of fast multiplication for circulant tensors enables an immediate reduction in computational cost. Next, we describe the fast forward and backward propagation schemes that leverage the fast multiplication of the circulant weight tensor.

Fast forward propagation. We first present the fast forward propagation scheme. Recall that Eq. 2 is the forward propagation scheme for a general convolutional layer. To ease the notation, define $\mathcal{N}_k = ((k-1)N + 1, \ldots, kN)$ for $k = 1, \ldots, \max(R, S)$, and rewrite Eq. 2 as below:

$$\mathcal{Y}(w_2, h_2, \mathcal{N}_i) = \sum_{w_1, h_1} \sum_{j=1}^{R} \mathcal{X}(w_2 - w_1, h_2 - h_1, \mathcal{N}_j) \cdot \mathcal{W}(w_1, h_1, \mathcal{N}_j, \mathcal{N}_i), \tag{4}$$

where $i \in \{1, \ldots, S\}$. According to (Rezghi and Eldén 2011; Pan 2012), the Fast Fourier Transform (FFT) can be used to accelerate the multiplication of a fiber and a slice of a circulant tensor, reducing the time complexity from $O(N^2)$ to $O(N \log N)$. Therefore, when $\mathcal{W}$ is a circulant tensor, Eq. 4 can be reformulated using the FFT as below:

$$\mathcal{Y}(w_2, h_2, \mathcal{N}_i) = \mathrm{ifft}\Big( \sum_{w_1, h_1} \sum_{j=1}^{R} \mathrm{fft}\big(\mathcal{X}(w_2 - w_1, h_2 - h_1, \mathcal{N}_j)\big) \circ \mathrm{fft}\big(\mathcal{W}'(w_1, h_1, \mathcal{N}_j, i)\big) \Big). \tag{5}$$

Here $\circ$ is element-wise multiplication.

Fast backward propagation. Now consider backward propagation. Given a loss function $L$, it is well known that the goal of the backpropagation algorithm (LeCun et al. 1998) is to compute the gradients of $L$ with respect to each weight and input. Hence, according to the chain rule, the gradient computation for a circulant convolutional layer can be derived from Eq. 3 and Eq. 4 as below:

$$\frac{\partial L}{\partial \mathcal{W}'(w_1, h_1, p, q)} = \sum_{w_2, h_2} \sum_{c_2 = (q-1)N+1}^{qN} \frac{\partial L}{\partial \mathcal{Y}(w_2, h_2, c_2)} \frac{\partial \mathcal{Y}(w_2, h_2, c_2)}{\partial \mathcal{W}'(w_1, h_1, p, q)}, \tag{6}$$

$$\frac{\partial L}{\partial \mathcal{X}(x, y, c_0)} = \sum_{w_1, h_1} \sum_{c_2 \equiv c_0 \,(\mathrm{mod}\, N)} \frac{\partial L}{\partial \mathcal{Y}(w_1 + x, h_1 + y, c_2)} \frac{\partial \mathcal{Y}(w_1 + x, h_1 + y, c_2)}{\partial \mathcal{X}(x, y, c_0)}. \tag{7}$$

Again, according to (Rezghi and Eldén 2011), when $\mathcal{W}$ is a circulant tensor, Eq. 6 and Eq. 7 can also be accelerated using the FFT as below:

$$\frac{\partial L}{\partial \mathcal{W}'(w_1, h_1, \mathcal{N}_j, i)} = \sum_{w_2, h_2} \mathrm{ifft}\Big( \mathrm{fft}\Big(\frac{\partial L}{\partial \mathcal{Y}(w_2, h_2, \mathcal{N}_i)}\Big) \circ \mathrm{fft}(\mathbf{x}'_j) \Big), \tag{8}$$

$$\frac{\partial L}{\partial \mathcal{X}(x, y, \mathcal{N}_j)} = \sum_{w_1, h_1} \sum_{i=1}^{S} \mathrm{ifft}\Big( \mathrm{fft}\Big(\frac{\partial L}{\partial \mathcal{Y}(w_1 + x, h_1 + y, \mathcal{N}_i)}\Big) \circ \mathrm{fft}(\mathbf{w}'_{j,i}) \Big), \tag{9}$$

where $\mathbf{x}'_j$ and $\mathbf{w}'_{j,i}$ are the fibers $\mathcal{X}(w_1, h_1, \mathcal{T})$ and $\mathcal{W}'(x, y, (j-1)N + \mathcal{T}, i)$ with $\mathcal{T} = (1, N, N-1, \ldots, 2)$.

Capability of training circulant CNNs from scratch. It should be noted that the gradient computations described in Eq. 8 and Eq. 9 are actually based on $\mathcal{W}'$. Since we can always construct the circulant tensor $\mathcal{W}$ from the base tensor $\mathcal{W}'$ using Eq. 3, Eqs. 8 and 9 imply that the circulant structure of the weight tensor $\mathcal{W}$ is always kept during the training phase. In other words, if we initialize $\mathcal{W}$ as a circulant tensor at the initialization stage of training, then during the training procedure Eqs. 8 and 9 guarantee that $\mathcal{W}$ always exhibits circulant structure. Therefore, a circulant CNN can be completely trained from scratch.

### Conversion from Non-circulant Tensor to Circulant Tensor

The forward and backward propagation schemes above indicate that a circulant convolutional layer can be trained from scratch. In this subsection, we also present a conversion technique that can directly convert a non-circulant weight tensor into a circulant one. Such a conversion is very useful when a pre-trained model is already available and needs to be imposed with the circulant structure.

Specifically, the proposed conversion technique is based on the circulant approximation approach (Chu and Plemmons 2003) used for circulant matrices. In matrix theory, let $Z_1 \in \mathbb{R}^{N \times N}$ denote the permutation matrix

$$Z_1 = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ 1 & 0 & 0 & \cdots & 0 \end{bmatrix}. \tag{10}$$

Then a circulant matrix $W_{\mathrm{circ}} \in \mathbb{R}^{N \times N}$ with first row $\mathbf{w} = (w_0, w_1, \ldots, w_{N-1})$ can be represented in polynomial form in $Z_1$ as follows:

$$W_{\mathrm{circ}} = \sum_{i=0}^{N-1} w_i Z_1^i. \tag{11}$$

According to (Chu and Plemmons 2003), for a non-circulant matrix $W_{\mathrm{non\text{-}circ}} \in \mathbb{R}^{N \times N}$, its nearest circulant matrix $W_{\mathrm{circ}}$ (measured in the Frobenius norm) is given by the projection

$$\mathbf{w} = \mathrm{proj}_N(W_{\mathrm{non\text{-}circ}}), \quad w_i = \frac{1}{N} \langle W_{\mathrm{non\text{-}circ}}, Z_1^i \rangle_F, \tag{12}$$

where $\langle \cdot, \cdot \rangle_F$ is the Frobenius inner product. Note that the 4-D weight tensor of a convolutional layer can be viewed as a $W_1 \times H_1$ matrix whose entries are matrices of size $C_0 \times C_2$.
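The matrix-level projection of Eqs. 10-12 can be sketched in a few lines of NumPy (an illustrative sketch; `nearest_circulant` is our own hypothetical helper, not code from the paper):

```python
import numpy as np

def nearest_circulant(M):
    """Project a square matrix M onto its nearest circulant matrix in the
    Frobenius norm, following Eqs. 11-12: w_i = <M, Z1^i>_F / N and
    W_circ = sum_i w_i * Z1^i, with Z1 the cyclic-shift matrix of Eq. 10."""
    N = M.shape[0]
    Z1 = np.roll(np.eye(N), 1, axis=1)  # ones on the superdiagonal and in the bottom-left corner
    Zi = np.eye(N)                      # Z1^0
    w = np.empty(N)
    W_circ = np.zeros((N, N))
    for i in range(N):
        w[i] = np.sum(M * Zi) / N       # Frobenius inner product <M, Z1^i>_F, scaled by 1/N
        W_circ += w[i] * Zi
        Zi = Zi @ Z1                    # advance to Z1^(i+1)
    return w, W_circ

# Sanity check: a matrix that is already circulant is returned unchanged.
w_true = np.array([1.0, 2.0, 3.0, 4.0])
C = sum(wi * np.linalg.matrix_power(np.roll(np.eye(4), 1, axis=1), i)
        for i, wi in enumerate(w_true))
w_hat, C_proj = nearest_circulant(C)
assert np.allclose(C, C_proj) and np.allclose(w_hat, w_true)
```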
Therefore, by using Eq. 12, the conversion from a non-circulant tensor $\mathcal{W}_{\mathrm{non\text{-}circ}}$ to a circulant tensor $\mathcal{W}_{\mathrm{circ}}$ can be achieved by performing the projection as follows:

$$\mathcal{W}'(w_1, h_1, \mathcal{N}_j, i) = \mathrm{proj}_N\big(\mathcal{W}_{\mathrm{non\text{-}circ}}(w_1, h_1, \mathcal{N}_j, \mathcal{N}_i)\big), \tag{13}$$

where $\mathcal{W}'$ is the base tensor that defines the circulant tensor $\mathcal{W}_{\mathrm{circ}}$, and the mapping from $\mathcal{W}'$ to $\mathcal{W}_{\mathrm{circ}}$ is given in Eq. 3.

Capability of training circulant CNNs from a pre-trained model. Based on the conversion scheme in Eq. 13, any non-circulant convolutional layer of a pre-trained model can be directly converted to a circulant convolutional layer. Typically, such direct conversion brings a non-negligible accuracy drop caused by the approximation error. In order to recover the accuracy, further re-training of the converted model is needed, following the backward propagation scheme in Eqs. 8 and 9. Consequently, a non-circulant pre-trained model can be imposed with circulant structure by using the proposed circulant conversion and re-training schemes while preserving high accuracy.

### Efficiency on Space and Computation

Table 1 summarizes the space and time complexity of circulant convolutional layers. It can be seen that the proposed circulant structure-imposing approach enables simultaneous improvement in both space efficiency and computation efficiency. A larger $N$ results in a larger FFT size and lower space and time complexity. Also, compared with the 2-D FFT-based fast convolution in (Mathieu, Henaff, and LeCun 2013), our 1-D FFT-based approach has much lower space and time complexity, since $N$ is typically much larger than $R$ and $S$.

Table 1: Comparison with (Mathieu, Henaff, and LeCun 2013) in terms of time complexity, space complexity, and FFT type, where $N = C_0 = C_2$.

| Approach | Time Complexity | Space Complexity | FFT Type |
|---|---|---|---|
| Original | $O(W_2 H_2 W_1 H_1 N^2)$ | $O(W_1 H_1 N^2)$ | N/A |
| This Work | $O(W_2 H_2 W_1 H_1 N \log N)$ | $O(W_1 H_1 R N S)$ | 1-D |
| (Mathieu, Henaff, and LeCun 2013) | $O(N^2 W_0 H_0 \log W_0 H_0)$ | $O(W_1 H_1 N^2)$ | 2-D |
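To make the complexity comparison in Table 1 concrete, the small self-contained NumPy check below (our own illustration, not from the paper) verifies that the product of an input fiber with an $N \times N$ circulant slice, the building block of Eq. 5, can be computed with length-$N$ FFTs instead of an explicit $O(N^2)$ matrix-vector product.

```python
import numpy as np

# For one kernel position (w1, h1) and one block pair (j, i), the N x N
# circulant slice of W is fully described by a length-N base fiber c
# (cf. Eq. 3), and its product with a length-N input fiber x can be computed
# either directly in O(N^2) or with length-N FFTs in O(N log N), as in Eq. 5.
N = 8
rng = np.random.default_rng(0)
c = rng.standard_normal(N)    # base fiber, e.g. W'(w1, h1, N_j, i)
x = rng.standard_normal(N)    # input fiber, e.g. X(w2 - w1, h2 - h1, N_j)

# Direct product with the explicit circulant slice C[a, b] = c[(a - b) mod N]
C = np.array([[c[(a - b) % N] for b in range(N)] for a in range(N)])
y_direct = C @ x

# FFT-based product: ifft(fft(c) * fft(x))
y_fft = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real

assert np.allclose(y_direct, y_fft)
```

Repeating this for every kernel position and block pair yields the complexities listed in Table 1 (stated there for $N = C_0 = C_2$, i.e., $R = S = 1$).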
## Experiments

Dataset, baseline & experiment environment. We evaluate our circulant structure-imposing approach on two typical image classification datasets: CIFAR-10 (Krizhevsky and Hinton 2009) and ImageNet ILSVRC-2012 (Deng et al. 2009). For each dataset, we take classical network models (ResNet (He et al. 2016) for CIFAR-10 and AlexNet (Krizhevsky, Sutskever, and Hinton 2012) for ImageNet) as the baseline models. The compressed circulant CNN models are generated by replacing the convolutional layers of the baseline models with circulant convolutional layers. All models in this paper are trained using NVIDIA GeForce GTX 1080 GPUs and an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz.

Selection of training strategy. As presented in the previous section, a circulant CNN model can be trained either from scratch or re-trained from a pre-trained non-circulant model. In our experiments we evaluate these two training strategies on different datasets. Experimental results show that, with the same compression configuration, the compressed circulant CNN models generated by the two training strategies have very similar test accuracies. Therefore, in this paper we only report the results using the training-from-scratch strategy.

### ResNet on CIFAR-10

In this experiment, ResNet-32 is selected as the baseline model due to its high accuracy and ease of training. The training data is augmented following the method in (Simonyan and Zisserman 2014): first pad each side of the image with four pixels and then apply 32x32 random crops with horizontal flipping. The compressed ResNet-32 models are trained using the stochastic gradient descent (SGD) optimizer with learning rate 0.1, momentum 0.9, batch size 64, and weight decay 0.0001.

Model setting. ResNet-32 consists of 15 convolutional blocks, where each convolutional block contains two or three convolutional layers. Considering that the number of possible compression configurations over the different convolutional layers is very large, we make the layers in the same block share the same compression ratio; in other words, a block-wise compression strategy is adopted. Notice that because the first few convolutional layers of ResNet are very sensitive to compression (He et al. 2016), in this experiment we do not impose the circulant structure on the convolutional layers in the 1st and 2nd blocks of ResNet-32. Besides, the 6th and 11th blocks are not compressed due to their small weight tensors. Table 2 shows the detailed compression configurations for the different convolutional blocks of ResNet-32. We explore 7 compression configurations and thereby obtain 7 compressed models; for each compressed model, the compression ratio of each of its convolutional blocks is listed in the corresponding column of Table 2. A value of i indicates a compression ratio of i for the convolutional layers in that block; a value of 1 means the corresponding block is not compressed. Notice that, due to the sensitivity of the front blocks, for all 7 models in Table 2 the compression ratios of the front blocks are typically smaller than those of the later blocks.

Table 2: Compression configurations. For a convolutional block with compression ratio i, all convolutional layers in that block have the same compression ratio i. Baseline (He et al. 2016): ResNet-32 without partitioning.

| Block ID | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 | Model 7 |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 1 | 1 | 2 | 2 | 2 | 4 | 4 |
| 4 | 1 | 1 | 2 | 2 | 2 | 4 | 4 |
| 5 | 1 | 1 | 2 | 2 | 2 | 4 | 4 |
| 6 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 7 | 1 | 1 | 2 | 2 | 4 | 4 | 8 |
| 8 | 1 | 1 | 2 | 2 | 4 | 4 | 8 |
| 9 | 1 | 1 | 2 | 2 | 4 | 4 | 8 |
| 10 | 1 | 1 | 2 | 2 | 4 | 4 | 8 |
| 11 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 12 | 1 | 2 | 2 | 4 | 4 | 4 | 16 |
| 13 | 1 | 2 | 2 | 4 | 4 | 4 | 16 |
| 14 | 2 | 2 | 2 | 4 | 4 | 4 | 16 |
| 15 | 2 | 2 | 2 | 4 | 4 | 4 | 16 |

Trade-off between accuracy and model size. Figure 2 shows the test error of the 7 compressed models. It can be seen that model 1 even achieves slightly better accuracy with a smaller model size than the baseline. Moreover, models 2, 3, and 4 achieve around 50% reduction in model size with negligible accuracy drop. When more aggressive compression configurations are selected (such as models 5, 6, and 7), further reduction in model size can be achieved with a slight increase in test error.

Figure 2: ResNet-32 test error (%) versus number of parameters (K). Using circulant convolutional layers halves the number of parameters with negligible test-error increase.

Trade-off between accuracy and FLOPs. Our experiments also show that the use of circulant convolutional layers helps reduce the computational cost significantly. As shown in Figure 3, compressed models 1 and 2 achieve fewer FLOPs than the baseline with the same or even lower test error. Model 3 achieves a 50% reduction in FLOPs with negligible test-error increase. An interesting observation is that although models 4, 5, and 6 have more aggressive compression configurations than model 3, their reductions in FLOPs are smaller than that of model 3. This is because the convolutional layers in model 3 are mainly compressed with a factor of 2, which corresponds to 2-point FFT computation (the 2-point FFT of $(x_0, x_1)$ is simply $(x_0 + x_1, x_0 - x_1)$) that only needs real-number operations.

Figure 3: ResNet-32 test error (%) versus FLOPs percentage (%). Using circulant convolutional layers halves the FLOPs with negligible test-error increase.
### Wide ResNet on CIFAR-10

We also conduct experiments on the CIFAR-10 dataset using Wide ResNet (Zagoruyko and Komodakis 2016), which outperforms the conventional ResNet in terms of test accuracy. In this experiment, the compressed Wide ResNet models are trained using SGD with learning rate 0.01, momentum 0.9, batch size 64, and weight decay 0.0005.

Model settings. To construct the baseline Wide ResNet models, we take the same basic convolutional block structure as in (Zagoruyko and Komodakis 2016) and set different numbers of convolutional blocks and widening parameters for the different models. To achieve better performance, we add two more blocks to the convolutional blocks that are wider than 16k, where k is the inherent widening parameter of each block. Different from the ResNet-32 experiment above, this experiment on Wide ResNet adopts a very aggressive compression strategy: for a convolutional layer, if the number of input channels (C0) equals the number of output channels (C2), then the compression ratio for that layer is i = C0 = C2; otherwise the convolutional layer is not compressed. We apply this compression strategy to five different Wide ResNet baselines and obtain five compressed Wide ResNet models. Each compressed model is labeled with two numbers ("d-k"), where d and k denote the number of convolutional layers ("depth") and the widening parameter ("width"), respectively. These compressed models are compared with their corresponding baseline models as well as ResNet-110, which achieves the best performance on CIFAR-10 in (He et al. 2016).

Figure 4: Wide ResNet model size reduction. Compared with the baseline models, the compressed models achieve a model size similar to ResNet-110. A compressed model named, e.g., 48-4 has 48 convolutional layers and widening parameter 4.

Model size reduction. Figure 4 shows the number of parameters of the Wide ResNet baselines and the corresponding compressed models after imposing the circulant structure. It can be seen that the compressed models greatly reduce the model size. In particular, model 60-4 achieves an 8.35x reduction in model size. Also, the numbers of parameters of the Wide ResNet models become similar to the size of ResNet-110 after applying circulant convolutional layers. For instance, model 48-4 has around 1.6M parameters, which is fewer than the 1.7M of ResNet-110.

Test error analysis. Figure 5 shows the test errors of the baseline Wide ResNet models and the corresponding compressed models using circulant convolutional layers. It can be seen that all compressed models have a slight test-error increase of less than 1%. In addition, compared with the state-of-the-art ResNet-110, all of the compressed models have around 1% lower test error.

Figure 5: Wide ResNet test error. The baseline models are the original Wide ResNets; they are compared with the corresponding compressed models and ResNet-110.

Comparison with ResNet-110. From Figure 5 we can see that all compressed Wide ResNet models have lower test error than ResNet-110.
Meanwhile, Figure 4 shows that these compressed models have numbers of parameters similar to ResNet-110, and model 48-4 has even fewer parameters. These results demonstrate that the circulant structure-imposing approach is useful for reducing model redundancy and achieving lower test error while maintaining a similar model size.

FLOPs reduction. As shown in Figure 6, we measure the overall FLOPs reduction of Wide ResNet. The compressed Wide ResNet models achieve a significant reduction in FLOPs: all of them require only around 36% of the FLOPs of the corresponding baseline models. In addition, the FLOPs reduction for the compressed blocks is very significant: from Figure 6 it can be seen that the FLOPs of the compressed blocks in all compressed models are less than 6% of those of the corresponding uncompressed blocks in the original Wide ResNet baselines.

Figure 6: Wide ResNet FLOPs. The overall FLOPs measure the FLOPs percentage of the compressed models over the corresponding baselines. We also list the FLOPs percentage of the compressed convolutional blocks over the original blocks.

### AlexNet on ImageNet

To test the effectiveness of the proposed circulant structure-imposing approach on large-scale datasets, we evaluate the performance of circulant CNNs on ImageNet (ILSVRC-2012). Here the baseline model is AlexNet (Krizhevsky, Sutskever, and Hinton 2012). All training images are randomly distorted as suggested in (Szegedy et al. 2016). We train our AlexNet models using RMSprop (Tieleman and Hinton 2012) with learning rate 0.01, momentum 0.9, batch size 32, and decay 0.9.

Model settings. We explore three different compression configurations for the five convolutional layers of AlexNet. Table 3 lists the detailed compression configurations using the notation "a-b-c-d-e". For instance, "1-2-2-2-2" means the first convolutional layer is not compressed and the remaining four layers are compressed with a factor of 2. Using these configurations, three compressed AlexNet models are generated and compared with the original AlexNet baseline model. Also, since SSL (Wen et al. 2016) is the state-of-the-art work exploring the relationship between accuracy and compressed model size for AlexNet, we also compare our circulant convolutional layer-based compressed AlexNet models with three SSL regularization-based compressed AlexNet models from (Wen et al. 2016).

Table 3: Comparison among AlexNet models.

| AlexNet Model | Compression Configuration | Test Error (%) | Parameters (%) | FLOPs (%) |
|---|---|---|---|---|
| Original (Baseline) | N/A | 42.9 | 100 | 100 |
| CircConv | 1-2-2-2-2 | 42.75 | 50.36 | 31.3 |
| CircConv | 1-2-2-4-2 | 42.99 | 40.01 | 31.3 |
| CircConv | 1-2-4-2-2 | 43.13 | 45.19 | 31.3 |
| (Wen et al. 2016) | N/A | 42.75 | 51.20 | 39.0 |
| (Wen et al. 2016) | N/A | 43.00 | 44.40 | 43.0 |
| (Wen et al. 2016) | N/A | 43.25 | 42.30 | 45.0 |

Test error analysis. Table 3 shows the test errors of the compressed AlexNet models obtained with the circulant structure-imposing and SSL approaches. It can be seen that both approaches yield compressed models with test errors similar to the original AlexNet model. Among them, the circulant model with the 1-2-2-2-2 compression configuration achieves the lowest test error.

Model size reduction. Table 3 shows the percentage of the number of parameters of each model relative to the original AlexNet model. It can be seen that the circulant convolution-based models have numbers of parameters similar to the SSL-based models. Among them, the circulant convolution-based model with the 1-2-2-4-2 compression configuration has the fewest parameters.

FLOPs reduction. Table 3 shows the percentage of FLOPs of each model relative to the original AlexNet model. It can be seen that all circulant convolution-based models require fewer FLOPs than the SSL regularization-based models.
All circulant convolution-based models require only around 31% of the FLOPs of the original uncompressed AlexNet baseline.

Overall comparison. As shown in Table 3, the circulant convolution-based models have accuracy similar to the state-of-the-art SSL models while maintaining a similar number of parameters. Meanwhile, Table 3 shows that the circulant convolution-based models require fewer FLOPs than the SSL models when targeting similar accuracy. Therefore, imposing circulant structure on convolutional layers is a very promising accuracy-retaining approach to reducing both the space and computational costs.

## Conclusion

In this paper, we propose to impose the circulant structure on convolutional neural networks. This structure-imposing approach leads to significant reductions in model size and FLOPs with negligible accuracy drop. Complexity analysis and experiments on different datasets and different network models demonstrate the effectiveness of the proposed approach.

## Acknowledgements

This paper is collaborative work with Zhe Li, Liang Zhao, Qinru Qiu, and Yanzhi Wang. Their names do not appear in the author list due to a mistake during submission. The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. This work is supported by National Science Foundation Award CCF-1854742.

## References

Chen, W.; Wilson, J.; Tyree, S.; Weinberger, K. Q.; and Chen, Y. 2016. Compressing convolutional neural networks in the frequency domain. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1475–1484. ACM.
Cheng, Y.; Yu, F. X.; Feris, R. S.; Kumar, S.; Choudhary, A.; and Chang, S.-F. 2015. An exploration of parameter redundancy in deep networks with circulant projections. In Proceedings of the IEEE International Conference on Computer Vision, 2857–2865.
Chu, M. T., and Plemmons, R. J. 2003. Real-valued low rank circulant approximation. SIAM Journal on Matrix Analysis and Applications 24(3):645–659.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), 248–255. IEEE.
Ding, C.; Liao, S.; Wang, Y.; Li, Z.; Liu, N.; Zhuo, Y.; Wang, C.; Qian, X.; Bai, Y.; Yuan, G.; et al. 2017. CirCNN: Accelerating and compressing deep neural networks using block-circulant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 395–408. ACM.
Feng, J., and Darrell, T. 2015. Learning the structure of deep convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 2749–2757.
Girosi, F.; Jones, M.; and Poggio, T. 1995. Regularization theory and neural networks architectures. Neural Computation 7(2):219–269.
Gong, Y.; Liu, L.; Yang, M.; and Bourdev, L. 2014. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115.
Han, S.; Mao, H.; and Dally, W. J. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Jaderberg, M.; Vedaldi, A.; and Zisserman, A. 2014. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866.
Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; and Fei-Fei, L. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1725–1732.
Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images.
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.
LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
Mathieu, M.; Henaff, M.; and LeCun, Y. 2013. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851.
Moczulski, M.; Denil, M.; Appleyard, J.; and de Freitas, N. 2015. ACDC: A structured efficient linear layer. arXiv preprint arXiv:1511.05946.
Pan, V. 2012. Structured Matrices and Polynomials: Unified Superfast Algorithms. Springer Science & Business Media.
Rezghi, M., and Eldén, L. 2011. Diagonalization of tensors with circulant structure. Linear Algebra and its Applications 435(3):422–447.
Sainath, T. N.; Kingsbury, B.; Sindhwani, V.; Arisoy, E.; and Ramabhadran, B. 2013. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2013), 6655–6659. IEEE.
Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Sindhwani, V.; Sainath, T.; and Kumar, S. 2015. Structured transforms for small-footprint deep learning. In Advances in Neural Information Processing Systems, 3088–3096.
Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9.
Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826.
Tieleman, T., and Hinton, G. 2012. Lecture 6.5-RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4(2):26–31.
Wang, Y.; Xu, C.; You, S.; Tao, D.; and Xu, C. 2016. CNNpack: Packing convolutional neural networks in the frequency domain. In Advances in Neural Information Processing Systems, 253–261.
Wang, Y.; Ding, C.; Li, Z.; Yuan, G.; Liao, S.; Ma, X.; Yuan, B.; Qian, X.; Tang, J.; Qiu, Q.; et al. 2018. Towards ultra-high performance and energy efficiency of deep learning systems: An algorithm-hardware co-optimization framework. In Association for the Advancement of Artificial Intelligence.
Wen, W.; Wu, C.; Wang, Y.; Chen, Y.; and Li, H. 2016. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, 2074–2082.
Zagoruyko, S., and Komodakis, N. 2016. Wide residual networks. arXiv preprint arXiv:1605.07146.
Zhao, L.; Liao, S.; Wang, Y.; Tang, J.; and Yuan, B. 2017. Theoretical properties for neural networks with weight matrices of low displacement rank. arXiv preprint arXiv:1703.00144.
Zhao, Y.; Li, J.; and Gong, Y. 2016. Low-rank plus diagonal adaptation for deep neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2016), 5005–5009. IEEE.