# Doubly Convolutional Neural Networks

Shuangfei Zhai, Binghamton University, Vestal, NY 13902, USA (szhai2@binghamton.edu)
Yu Cheng, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA (chengyu@us.ibm.com)
Weining Lu, Tsinghua University, Beijing 100084, China (luwn14@mails.tsinghua.edu.cn)
Zhongfei (Mark) Zhang, Binghamton University, Vestal, NY 13902, USA (zhongfei@cs.binghamton.edu)

Abstract

Building large models with parameter sharing accounts for much of the success of deep convolutional neural networks (CNNs). In this paper, we propose doubly convolutional neural networks (DCNNs), which significantly improve the performance of CNNs by further exploiting this idea. Instead of allocating a set of convolutional filters that are learned independently, a DCNN maintains groups of filters in which the filters within each group are translated versions of each other. Practically, a DCNN can be easily implemented as a two-step convolution procedure, which is supported by most modern deep learning libraries. We perform extensive experiments on three image classification benchmarks, CIFAR-10, CIFAR-100, and ImageNet, and show that DCNNs consistently outperform the competing architectures. We also verify that replacing a convolutional layer with a doubly convolutional layer at any depth of a CNN improves its performance. Moreover, we examine various design choices of DCNNs and show that a DCNN can serve the dual purpose of building more accurate models and/or reducing the memory footprint without sacrificing accuracy.

1 Introduction

In recent years, convolutional neural networks (CNNs) have achieved great success on many problems in machine learning and computer vision. CNNs are extremely parameter efficient because they exploit the translation-invariant property of images, which is the key to training very deep models without severe overfitting. While considerable progress has been made by aggressively exploring deeper architectures [1, 2, 3, 4] or novel regularization techniques [5, 6] on top of the standard "convolution + pooling" recipe, we contribute from a different direction by providing an alternative to the default convolution module, which can lead to models with better generalization ability and/or parameter efficiency.

Our intuition originates from observing well-trained CNNs, in which many of the learned filters are slightly translated versions of each other. To quantify this more formally, we define the k-translation correlation between two convolutional filters W_i, W_j within the same layer as

ρ_k(W_i, W_j) = max_{x,y ∈ {−k,...,k}, (x,y) ≠ (0,0)} ⟨W_i, T(W_j, x, y)⟩_f / (‖W_i‖_2 ‖W_j‖_2),   (1)

where T(·, x, y) denotes the translation of the first operand by (x, y) along its spatial dimensions, with proper zero padding at the borders to maintain the shape; ⟨·, ·⟩_f denotes the flattened inner product, in which the two operands are flattened into column vectors before taking the standard inner product; and ‖·‖_2 denotes the ℓ2 norm of its flattened operand.

Figure 1: Visualization of the 11×11 first-layer filters learned by AlexNet [1]. Each column shows a filter in the first row along with its three most 3-translation-correlated filters. Only the first 32 filters are shown for brevity.

Figure 2: The averaged maximum 1-translation correlation, together with the standard deviation, of each convolutional layer of AlexNet [1] (left) and the 19-layer VGGNet [2] (right). For comparison, for each convolutional layer in each network, we generate a filter set of the same shape from the standard Gaussian distribution (the blue bars). For both networks, all the convolutional layers have averaged maximum 1-translation correlations that are significantly larger than those of their random counterparts.
In other words, the k-translation correlation between a pair of filters indicates the maximum correlation achieved by translating one filter by up to k steps along any spatial dimension. As a concrete example, Figure 1 shows the 3-translation correlations of the first-layer filters learned by AlexNet [1], with the weights obtained from the Caffe model zoo [7]. In each column, we show a filter in the first row and its three most 3-translation-correlated filters (that is, the filters with the highest 3-translation correlation to it) in the second to fourth rows. Only the first 32 filters are shown for brevity. Interestingly, for most filters there exist several other filters that are roughly translated versions of them.

In addition to this convenient visualization of the first layer, we further study this property at higher layers and in deeper models. To this end, we define the averaged maximum k-translation correlation of a layer W as

ρ̄_k(W) = (1/N) Σ_{i=1}^{N} max_{j ∈ [1,N], j ≠ i} ρ_k(W_i, W_j),

where N is the number of filters. Intuitively, ρ̄_k characterizes the average level of translation correlation among the filters within a layer. We then load the weights of all the convolutional layers of AlexNet as well as the 19-layer VGGNet [2] from the Caffe model zoo, and report the averaged maximum 1-translation correlation of each layer in Figure 2. In each graph, the height of the red bars indicates ρ̄_1 computed with the weights of the corresponding layer. As a comparison, for each layer we also generate a filter bank of the same shape filled with standard Gaussian samples, whose ρ̄_1 values are shown as the blue bars. We clearly see that all the layers in both models exhibit averaged maximum translation correlations that are significantly higher than those of their random counterparts. In addition, lower convolutional layers generally have higher translation correlations, although this does not strictly hold (e.g., conv3_4 in VGGNet).
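For reference, these statistics can be computed directly from a layer's weight tensor. The following is a minimal NumPy sketch (the helper names are ours, not from the authors' code) that follows Equation 1 literally; it loops over all filter pairs and shifts, so it is intended for small filter banks such as the first-layer filters visualized in Figure 1.

```python
import numpy as np

def translate(filt, dx, dy):
    """Shift a (c, z, z) filter by (dx, dy) along its spatial axes,
    zero-padding at the borders so the shape is preserved (the operator T)."""
    c, h, w = filt.shape
    out = np.zeros_like(filt)
    rs, rd = slice(max(0, -dx), min(h, h - dx)), slice(max(0, dx), min(h, h + dx))
    cs, cd = slice(max(0, -dy), min(w, w - dy)), slice(max(0, dy), min(w, w + dy))
    out[:, rd, cd] = filt[:, rs, cs]
    return out

def k_translation_correlation(wi, wj, k=1):
    """rho_k(W_i, W_j): the maximum normalized flattened inner product over
    all shifts (x, y) in {-k, ..., k}^2 excluding (0, 0)."""
    denom = np.linalg.norm(wi) * np.linalg.norm(wj)
    best = -np.inf
    for dx in range(-k, k + 1):
        for dy in range(-k, k + 1):
            if (dx, dy) == (0, 0):
                continue
            best = max(best, np.dot(wi.ravel(), translate(wj, dx, dy).ravel()) / denom)
    return best

def averaged_max_translation_correlation(weights, k=1):
    """rho_bar_k(W): for each of the N filters in a (N, c, z, z) weight tensor,
    take its highest rho_k against any other filter, then average over filters."""
    n = weights.shape[0]
    return float(np.mean([
        max(k_translation_correlation(weights[i], weights[j], k)
            for j in range(n) if j != i)
        for i in range(n)
    ]))
```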
Motivated by the evidence above, we propose the doubly convolutional layer (with the associated double convolution operation), which can be plugged into a CNN in place of a convolutional layer, yielding doubly convolutional neural networks (DCNNs). The idea of double convolution is to learn groups of filters in which the filters within each group are translated versions of each other. To achieve this, a doubly convolutional layer allocates a set of meta filters whose spatial size is larger than the effective filter size. Effective filters can then be extracted from each meta filter, which corresponds to convolving the meta filters with an identity kernel. All the extracted filters are concatenated and convolved with the input. Optionally, one can also pool along the activations produced by filters extracted from the same meta filter, in a spirit similar to maxout networks [8]. We also show that double convolution can be easily implemented with available deep learning libraries by reusing their efficient convolution kernels. In our experiments, we show that the additional level of parameter sharing introduced by double convolution allows one to build DCNNs that yield excellent performance on several popular image classification benchmarks, consistently outperforming all the competing architectures by a margin. We also confirm that replacing a convolutional layer with a doubly convolutional layer consistently improves the performance, regardless of the depth of the layer. Last but not least, we show that one can balance the trade-off between performance and parameter efficiency by adjusting the architecture of a DCNN.

Figure 3: The architecture of a convolutional layer (left) and a doubly convolutional layer (right). A doubly convolutional layer maintains meta filters whose spatial size z′×z′ is larger than the effective filter size z×z. By pooling and flattening the convolution output, a doubly convolutional layer produces ((z′−z+1)/s)² times more channels for the output image, with s×s being the pooling size.

2.1 Convolution

We define an image I ∈ R^{c×w×h} as a real-valued 3D tensor, where c is the number of channels and w, h are the width and height, respectively. We define the convolution operation, denoted by I^{ℓ+1} = I^ℓ ∗ W^ℓ, as

I^{ℓ+1}_{k,i,j} = Σ_{c′ ∈ [1,c_ℓ]} Σ_{i′ ∈ [1,z]} Σ_{j′ ∈ [1,z]} W^ℓ_{k,c′,i′,j′} I^ℓ_{c′, i+i′−1, j+j′−1},
for k ∈ [1, c_{ℓ+1}], i ∈ [1, w_{ℓ+1}], j ∈ [1, h_{ℓ+1}].   (2)

Here I^ℓ ∈ R^{c_ℓ×w_ℓ×h_ℓ} is the input image; W^ℓ ∈ R^{c_{ℓ+1}×c_ℓ×z×z} is a set of c_{ℓ+1} filters, each of shape c_ℓ×z×z; and I^{ℓ+1} ∈ R^{c_{ℓ+1}×w_{ℓ+1}×h_{ℓ+1}} is the output image. The spatial dimensions of the output image are by default w_{ℓ+1} = w_ℓ − z + 1 and h_{ℓ+1} = h_ℓ − z + 1 (i.e., valid convolution), but one can also pad zeros at the borders of I^ℓ to achieve different output spatial dimensions (e.g., keeping them unchanged). In this paper, we use a slightly loose notation by allowing both the left and right operands of ∗ to be either a single image (filter) or a set of images (filters), with proper convolution along the non-spatial dimensions. A convolutional layer can thus be implemented as a convolution operation followed by a nonlinearity such as ReLU, and a convolutional neural network (CNN) is constructed by interleaving several convolutional and spatial pooling layers.

2.2 Double convolution

We next introduce the double convolution operation, denoted by I^{ℓ+1} = I^ℓ ⊛ W^ℓ and defined as

O^{ℓ+1}_{i,j,k} = W^ℓ_k ∗ I^ℓ_{:, i:(i+z−1), j:(j+z−1)},
I^{ℓ+1}_{(nk+1):n(k+1), i, j} = pool_s(O^{ℓ+1}_{i,j,k}),   with n = ((z′−z+1)/s)²,
for k ∈ [1, c_{ℓ+1}], i ∈ [1, w_{ℓ+1}], j ∈ [1, h_{ℓ+1}].   (3)

Here I^ℓ ∈ R^{c_ℓ×w_ℓ×h_ℓ} and I^{ℓ+1} ∈ R^{nc_{ℓ+1}×w_{ℓ+1}×h_{ℓ+1}} are the input and output images, respectively; W^ℓ ∈ R^{c_{ℓ+1}×c_ℓ×z′×z′} is a set of c_{ℓ+1} meta filters with filter size z′×z′, z′ > z; O^{ℓ+1}_{i,j,k} ∈ R^{(z′−z+1)×(z′−z+1)} is the intermediate output of the double convolution; pool_s(·) is a spatial pooling function with pooling size s×s (optionally reshaping its output to a column vector, as inferred from the context); and ∗ is the convolution operator defined in Equation 2. In words, a double convolution applies a set of c_{ℓ+1} meta filters with spatial dimensions z′×z′, which are larger than the effective filter size z×z. The image patch of size z×z at each location (i, j) of the input image, denoted I^ℓ_{:, i:(i+z−1), j:(j+z−1)}, is convolved with each meta filter, resulting in an output of size (z′−z+1)×(z′−z+1) for each (i, j). A spatial pooling of size s×s is then applied over this output map, and the pooled result is flattened into a column vector, which yields an output feature map with nc_{ℓ+1} channels.
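To make the indexing in Equation 3 concrete, here is a deliberately naive NumPy sketch of a single double convolution (valid convolution, max pooling, no padding or nonlinearity, and nested loops); it is meant only to spell out the definition, not to be efficient, and the variable names are ours.

```python
import numpy as np

def double_convolution_naive(image, meta_filters, z, s):
    """Literal implementation of Equation 3.

    image:        (c_l, w_l, h_l) input image
    meta_filters: (c_out, c_l, zp, zp) meta filters, with zp = z' > z
    z, s:         effective filter size and pooling size (s must divide z' - z + 1)
    Returns an output of shape (n * c_out, w_l - z + 1, h_l - z + 1),
    where n = ((z' - z + 1) // s) ** 2.
    """
    c_l, w_l, h_l = image.shape
    c_out, _, zp, _ = meta_filters.shape
    m = zp - z + 1                      # spatial size of the intermediate map O
    n = (m // s) ** 2                   # output channels contributed per meta filter
    w_out, h_out = w_l - z + 1, h_l - z + 1
    out = np.zeros((n * c_out, w_out, h_out))

    for i in range(w_out):
        for j in range(h_out):
            patch = image[:, i:i + z, j:j + z]          # the z x z patch at (i, j)
            for k in range(c_out):
                # Correlate the patch with every z x z sub-window of meta filter k,
                # giving the m x m intermediate map O_{i,j,k}.
                O = np.zeros((m, m))
                for u in range(m):
                    for v in range(m):
                        O[u, v] = np.sum(meta_filters[k, :, u:u + z, v:v + z] * patch)
                # s x s max pooling of O, flattened into n output channels.
                pooled = O.reshape(m // s, s, m // s, s).max(axis=(1, 3))
                out[k * n:(k + 1) * n, i, j] = pooled.ravel()
    return out
```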
The above procedure can be viewed as a two-step convolution, where image patches are first convolved with the meta filters, and the meta filters then slide across and convolve with the image; hence the name double convolution. A doubly convolutional layer is accordingly defined as a double convolution followed by a nonlinearity, and substituting the convolutional layers of a CNN with doubly convolutional layers yields a doubly convolutional neural network (DCNN). Figure 3 illustrates the difference between a convolutional layer and a doubly convolutional layer.

It is possible to vary the combination of z′, z, s for each doubly convolutional layer of a DCNN to yield different variants, among which three extreme cases are:

(1) CNN: Setting z′ = z recovers the standard CNN; DCNN is therefore a generalization of CNN.

(2) ConcatDCNN: Setting s = 1 produces the most parameter-efficient DCNN variant. This corresponds to extracting all sub-regions of size z×z from a z′×z′ meta filter, which are then stacked to form a set of (z′−z+1)² filters of size z×z. With the same number of parameters, this produces (z′−z+1)²z² / (z′)² times more output channels for a single layer.

(3) MaxoutDCNN: Setting s = z′−z+1, i.e., applying global pooling to O^{ℓ+1}, produces a DCNN variant whose number of output channels equals the number of meta filters. Interestingly, this yields a parameter-efficient implementation of the maxout network [8]. To be concrete, the maxout units in a maxout network pool along the channel (feature) dimension, where each channel corresponds to a distinct filter. MaxoutDCNN, on the other hand, pools along channels produced by filters that are translated versions of each other. Besides the obvious advantage of requiring fewer parameters, this also acts as an effective regularizer, which is verified by the experiments in Section 4.

Implementing a double convolution is readily supported by mainstream GPU-compatible deep learning libraries (e.g., Theano, which is used in our experiments), as summarized in Algorithm 1. In particular, a double convolution can be performed with two convolution steps, corresponding to line 4 and line 6 of Algorithm 1, together with proper reshaping and pooling operations. The first convolution extracts the overlapping patches of size z×z from the meta filters, and these patches are then convolved with the input image in the second convolution. Although it is possible to further reduce the time complexity by designing a specialized double convolution module, we find that Algorithm 1 scales well to deep DCNNs and to large datasets such as ImageNet.

Algorithm 1: Implementation of double convolution with convolution.
Input: input image I^ℓ ∈ R^{c_ℓ×w_ℓ×h_ℓ}; meta filters W^ℓ ∈ R^{c_{ℓ+1}×c_ℓ×z′×z′}; effective filter size z×z; pooling size s×s.
Output: output image I^{ℓ+1} ∈ R^{nc_{ℓ+1}×w_{ℓ+1}×h_{ℓ+1}}, with n = ((z′−z+1)/s)².
2: Ĩ ← IdentityMatrix(c_ℓ z²)
3: reorganize Ĩ to shape c_ℓz² × c_ℓ × z × z
4: W̃^ℓ ← W^ℓ ∗ Ĩ            /* output shape: c_{ℓ+1} × c_ℓz² × (z′−z+1) × (z′−z+1) */
5: reorganize W̃^ℓ to shape c_{ℓ+1}(z′−z+1)² × c_ℓ × z × z
6: O^{ℓ+1} ← I^ℓ ∗ W̃^ℓ       /* output shape: c_{ℓ+1}(z′−z+1)² × w_{ℓ+1} × h_{ℓ+1} */
7: reorganize O^{ℓ+1} to shape c_{ℓ+1}w_{ℓ+1}h_{ℓ+1} × (z′−z+1) × (z′−z+1)
8: I^{ℓ+1} ← pool_s(O^{ℓ+1})  /* output shape: c_{ℓ+1}w_{ℓ+1}h_{ℓ+1} × (z′−z+1)/s × (z′−z+1)/s */
9: reorganize I^{ℓ+1} to shape c_{ℓ+1}((z′−z+1)/s)² × w_{ℓ+1} × h_{ℓ+1}
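For readers who prefer runnable code to pseudocode, the sketch below mirrors the two-step structure of Algorithm 1 in PyTorch (the paper's own implementation is in Theano, so this is our approximation rather than the authors' code). Step 1 extracts the effective filters from the meta filters, here with unfold instead of a convolution with an identity kernel (cf. line 4); step 2 is the ordinary convolution of line 6, followed by pooling over the sibling channels; 'same' padding is used, assuming an odd effective filter size z.

```python
import torch
import torch.nn.functional as F

def double_conv2d(x, meta_filters, z, s):
    """Two-step double convolution in the spirit of Algorithm 1.

    x:            (B, c_in, H, W) input batch
    meta_filters: (c_out, c_in, zp, zp) meta filters, with zp = z' > z
    z, s:         effective filter size and pooling size (s must divide z' - z + 1)
    Returns a (B, c_out * n, H, W) tensor with n = ((z' - z + 1) // s) ** 2.
    """
    c_out, c_in, zp, _ = meta_filters.shape
    m = zp - z + 1                                       # offsets per spatial axis
    # Step 1 (cf. line 4): extract every z x z effective filter from each meta
    # filter; unfold gives (c_out, c_in*z*z, m*m), rearranged into a filter bank.
    eff = F.unfold(meta_filters, kernel_size=z)
    eff = eff.permute(0, 2, 1).reshape(c_out * m * m, c_in, z, z)
    # Step 2 (cf. line 6): a single ordinary convolution with the enlarged bank.
    o = F.conv2d(x, eff, padding=z // 2)                 # (B, c_out*m*m, H, W)
    B, _, H, W = o.shape
    # Pool over the m x m sibling responses that come from the same meta filter
    # (cf. lines 7-9), then regroup the pooled values as output channels.
    o = o.view(B, c_out, m, m, H, W).permute(0, 1, 4, 5, 2, 3)
    o = o.reshape(B * c_out * H * W, 1, m, m)
    o = F.max_pool2d(o, kernel_size=s)                   # (..., 1, m/s, m/s)
    n = (m // s) ** 2
    o = o.reshape(B, c_out, H, W, n).permute(0, 1, 4, 2, 3)
    return o.reshape(B, c_out * n, H, W)
```

With the DC-128-4-3-2 layers used later in the experiments (z′ = 4, z = 3, s = 2), m = 2 and n = 1, so this sketch reduces to the MaxoutDCNN variant with as many output channels as meta filters.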
3 Related work

The spirit of DCNNs is to push further the idea of parameter sharing in convolutional layers, a goal shared by several recent efforts. [9] exploits the rotational symmetry of certain classes of images and proposes to rotate each filter (or, alternatively, the input) by multiples of 90°, which produces four times as many filters from the same number of parameters in a single layer. [10] observes that the filters learned by ReLU CNNs often contain pairs with opposite phases in the lower layers; the authors accordingly propose the concatenated ReLU, where the linear activations are concatenated with their negations before being passed to ReLU, which effectively doubles the number of filters. [11] proposes dilated convolutions, where additional filters with larger sizes are generated by dilating the base convolutional filters, which is shown to be effective in dense prediction tasks such as image segmentation. [12] proposes a multi-bias activation scheme in which k (k ≥ 1) bias terms are learned for each filter, producing k times as many channels in the convolution output. Additionally, [13, 14] investigate combinations of several filter transformations, such as rotation, flipping, and distortion. Note that all the aforementioned approaches are orthogonal to DCNNs and could, in principle, be combined in a single model. The need for correlated filters in CNNs is also studied in [15], where similar filters are explicitly learned and grouped with a group sparsity penalty.

While DCNNs are designed with better performance and generalization ability in mind, they are also closely related to the line of work on parameter reduction in deep neural networks. Sindhwani et al. [16] address the problem of compressing deep networks by applying structured transforms. [17] exploits the redundancy in the parametrization of deep architectures by imposing a circulant structure on the projection matrix, while allowing the use of the FFT for faster computation. [18] compresses the fully connected layers of AlexNet-type networks with the Fastfood method. Novikov et al. [19] use a multi-linear transform (the Tensor-Train decomposition) to reduce the number of parameters in the linear layers of CNNs. These works differ from DCNNs in that they mostly focus on the fully connected layers, which often account for most of the memory consumption. DCNNs, on the other hand, apply directly to the convolutional layers, which provides a complementary view of the same problem.

4 Experiments

4.1 Datasets

We conduct several sets of experiments with DCNNs on three image classification benchmarks: CIFAR-10, CIFAR-100, and ImageNet. CIFAR-10 and CIFAR-100 each contain 50,000 training and 10,000 test 32×32 RGB images, evenly drawn from 10 and 100 classes, respectively. ImageNet is the dataset used in the ILSVRC-2012 challenge, which consists of about 1.2 million training images and 50,000 validation images, sampled from 1,000 classes.

4.2 Is DCNN an effective architecture?

4.2.1 Model specifications

In the first set of experiments, we study the effectiveness of DCNN compared with two different CNN designs. The three types of architectures under evaluation are:

(1) CNN: This corresponds to models using standard convolutional layers.
A convolutional layer is denoted C-c-z, where c and z are the number of filters and the filter size, respectively.

(2) Maxout CNN: This corresponds to the maxout convolutional network [8], which uses the maxout unit to pool along the channel (feature) dimension with a stride k. A maxout convolutional layer is denoted MC-c-z-k, where c, z, and k are the number of filters, the filter size, and the feature pooling stride, respectively.

(3) DCNN: This corresponds to models using doubly convolutional layers. We denote a doubly convolutional layer with c filters as DC-c-z′-z-s, where z′, z, and s are the meta filter size, the effective filter size, and the pooling size, respectively, as in Equation 3. In this set of experiments we use the MaxoutDCNN variant, whose layers are readily represented as DC-c-z′-z-s with s = z′−z+1.

We denote a spatial max pooling layer as P-s, with s the pooling size. For all the models, we apply batch normalization [6] immediately after each convolution, after which ReLU is used as the nonlinearity (including for Maxout CNN, which makes our implementation slightly different from [8]). Our model design is similar to VGGNet [2] in that 3×3 filter sizes are used, and to Network in Network [20] in that fully connected layers are completely eliminated. Zero padding is applied before each convolutional layer to keep the spatial dimensions unchanged after convolution. Dropout is applied after each pooling layer. Global average pooling is applied on top of the last convolutional layer, and its output is fed to a softmax layer with the appropriate number of outputs. All three models on each dataset share the same architecture with respect to the number of layers and the number of units per layer; the only difference lies in the choice of the convolutional layers. Note that the architecture we use on ImageNet resembles the 16-layer VGGNet [2], but without the fully connected layers. The full specification of the model architectures is given in Table 1.

Table 1: The configurations of the models used in Section 4.2. The architectures on CIFAR-10 and CIFAR-100 are identical except for the top softmax layer (left). The architectures on ImageNet are variants of the 16-layer VGGNet [2] (right). See Section 4.2.1 for the naming convention.

CIFAR-10 / CIFAR-100:
  CNN       DCNN            Maxout CNN
  C-128-3   DC-128-4-3-2    MC-512-3-4
  C-128-3   DC-128-4-3-2    MC-512-3-4
  C-128-3   DC-128-4-3-2    MC-512-3-4
  C-128-3   DC-128-4-3-2    MC-512-3-4
  C-128-3   DC-128-4-3-2    MC-512-3-4
  C-128-3   DC-128-4-3-2    MC-512-3-4
  C-128-3   DC-128-4-3-2    MC-512-3-4
  C-128-3   DC-128-4-3-2    MC-512-3-4
  Global Average Pooling

ImageNet:
  CNN       DCNN            Maxout CNN
  C-64-3    DC-64-4-3-2     MC-256-3-4
  C-64-3    DC-64-4-3-2     MC-256-3-4
  C-128-3   DC-128-4-3-2    MC-512-3-4
  C-128-3   DC-128-4-3-2    MC-512-3-4
  C-256-3   DC-256-4-3-2    MC-1024-3-4
  C-256-3   DC-256-4-3-2    MC-1024-3-4
  C-256-3   DC-256-4-3-2    MC-1024-3-4
  C-512-3   DC-512-4-3-2    MC-2048-3-4
  C-512-3   DC-512-4-3-2    MC-2048-3-4
  C-512-3   DC-512-4-3-2    MC-2048-3-4
  C-512-3   DC-512-4-3-2    MC-2048-3-4
  C-512-3   DC-512-4-3-2    MC-2048-3-4
  C-512-3   DC-512-4-3-2    MC-2048-3-4
  Global Average Pooling
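As a sanity check on the relative "# Parameters" figures reported in Section 4.2.3, the per-layer filter-weight counts of the three designs can be worked out directly from the notation above. The short computation below is our own arithmetic (ignoring biases and batch-normalization parameters) and reproduces the 1x / 4x / 1.78x ratios for a layer with 128 input channels.

```python
# Filter weights of one 128-input-channel layer from Table 1 (biases ignored).
c_in = 128
cnn    = 128 * c_in * 3 * 3      # C-128-3:      128 filters of size 3x3
maxout = 512 * c_in * 3 * 3      # MC-512-3-4:   512 filters, pooled by 4 -> 128 channels
dcnn   = 128 * c_in * 4 * 4      # DC-128-4-3-2: 128 meta filters of size 4x4 -> 128 channels

print(maxout / cnn, dcnn / cnn)  # 4.0 and ~1.78, the ratios reported in Table 2
```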
4.2.2 Training protocols

We preprocess all the datasets by subtracting the per-pixel, per-channel mean computed on the training sets. All models are trained with Adadelta [21] on NVIDIA K40 GPUs. The batch size is set to 200 for CIFAR-10 and CIFAR-100, and to 128 for ImageNet. Data augmentation is also explored. On CIFAR-10 and CIFAR-100, we follow the simple data augmentation of [2]: for training, 4 pixels are padded on each side of the images, from which 32×32 crops are sampled with random horizontal flipping; for testing, only the original 32×32 images are used. On ImageNet, 224×224 crops are sampled with random horizontal flipping; the standard color augmentation and 10-crop testing are also applied, as in AlexNet [1].

4.2.3 Results

The test errors are summarized in Table 2 and Table 3, where the numbers of parameters of DCNN and Maxout CNN relative to the standard CNN are also shown. On the moderately sized CIFAR-10 and CIFAR-100 datasets, DCNN achieves the best results of the three controlled settings, both with and without data augmentation. Notably, DCNN consistently improves over the standard CNN by a margin. More remarkably, DCNN also consistently outperforms Maxout CNN while using 2.25 times fewer parameters. This on the one hand shows that doubly convolutional layers greatly improve the model capacity, and on the other hand verifies our hypothesis that the parameter sharing introduced by double convolution indeed acts as a very effective regularizer. The results achieved by DCNN on these two datasets are also among the best published results [20, 22, 23, 24]. Furthermore, DCNN has no difficulty scaling up to a large dataset such as ImageNet, where consistent gains over the other baseline architectures are again observed. Compared with the 16-layer VGGNet [2] with multi-scale evaluation, our DCNN implementation achieves comparable results with significantly fewer parameters.

Table 2: Test errors on CIFAR-10 and CIFAR-100 with and without data augmentation, together with the number of parameters relative to the standard CNN.

  Model        # Parameters   CIFAR-10 (no aug.)   CIFAR-100 (no aug.)   CIFAR-10 (aug.)   CIFAR-100 (aug.)
  CNN          1              9.85%                34.26%                9.59%             33.04%
  Maxout CNN   4              9.56%                33.52%                9.23%             32.37%
  DCNN         1.78           8.58%                30.35%                7.24%             26.53%
  NIN [20]     0.92           10.41%               35.68%                8.81%             -
  DSN [22]     -              9.78%                34.57%                8.22%             -
  APL [23]     -              9.59%                34.40%                7.51%             30.83%
  ELU [24]     -              -                    -                     6.55%             24.28%

Table 3: Test errors on ImageNet, evaluated on the validation set, together with the number of parameters relative to the standard CNN.

  Model            Top-5 Error   Top-1 Error   # Parameters
  CNN              10.59%        29.42%        1
  Maxout CNN       9.82%         28.4%         4
  DCNN             8.23%         26.27%        1.78
  VGG-16 [2]       7.5%          24.8%         9.3
  ResNet-152 [4]   5.71%         21.43%        4.1
  GoogLeNet [3]    7.9%          -             0.47

4.3 Does double convolution contribute to every layer?

In the next set of experiments, we study the effect of applying double convolution to layers at various depths. To this end, we replace the convolutional layers at each level of the standard CNN defined in Section 4.2.1 with their doubly convolutional counterparts (e.g., replacing a C-128-3 layer with a DC-128-4-3-2 layer). We define DCNN[i-j] as the network obtained by replacing the i-th through j-th convolutional layers of the CNN with doubly convolutional layers, and train DCNN[1-2], DCNN[3-4], DCNN[5-6], and DCNN[7-8] on CIFAR-10 and CIFAR-100 following the protocol of Section 4.2.2. The results are shown in Table 4. Interestingly, the doubly convolutional layer consistently improves the performance over the standard CNN regardless of the depth at which it is plugged in. It also appears that applying double convolution at lower layers contributes more to the performance, which is consistent with the trend of translation correlation observed in Figure 2.

Table 4: Inserting doubly convolutional layers at different depths of the network.

  Model        CIFAR-10   CIFAR-100
  CNN          9.85%      34.26%
  DCNN[1-2]    9.12%      32.91%
  DCNN[3-4]    9.23%      33.27%
  DCNN[5-6]    9.45%      33.58%
  DCNN[7-8]    9.57%      33.72%
  DCNN[1-8]    8.58%      30.35%
4.4 Performance vs. parameter efficiency

In the last set of experiments, we study the behavior of DCNNs under various combinations of the hyperparameters z′, z, s. To this end, we train three additional DCNNs on CIFAR-10 and CIFAR-100, namely DCNN-32-6-3-2, DCNN-16-6-3-1, and DCNN-4-10-3-1. Here we overload the notation for a doubly convolutional layer to denote a DCNN whose doubly convolutional layers all have the corresponding shape (the DCNN in Table 1 thus corresponds to DCNN-128-4-3-2). In particular, DCNN-32-6-3-2 produces a DCNN with exactly the same layer shapes and number of parameters as the reference CNN, while DCNN-16-6-3-1 and DCNN-4-10-3-1 are two ConcatDCNN instances from Section 2.2, which produce larger layers with the same or fewer parameters. The results, together with the effective layer size and the relative number of parameters, are listed in Table 5. All the DCNN variants consistently outperform the standard CNN, even when fewer parameters are used (DCNN-4-10-3-1). This verifies that DCNN is a flexible framework that allows one either to maximize the performance under a fixed memory budget or to minimize the memory footprint without sacrificing accuracy; the most suitable DCNN architecture can be chosen by balancing this trade-off.

Table 5: Different architecture configurations of DCNNs.

  Model            CIFAR-10   CIFAR-100   Layer size   # Parameters
  CNN              9.85%      34.26%      128          1
  DCNN-32-6-3-2    9.05%      32.28%      128          1
  DCNN-16-6-3-1    9.16%      32.54%      256          1
  DCNN-4-10-3-1    9.65%      33.57%      256          0.69
  DCNN-128-4-3-2   8.58%      30.35%      128          1.78

5 Conclusion

We have proposed doubly convolutional neural networks (DCNNs), which use a novel double convolution operation to provide an additional level of parameter sharing over CNNs. We show that DCNNs generalize standard CNNs and relate them to several recent proposals that explore parameter redundancy in CNNs. A DCNN can be easily implemented with modern deep learning libraries by reusing the efficient convolution module. DCNNs serve the dual purpose of (1) improving the classification accuracy, acting as a regularized version of maxout networks, and (2) being parameter efficient through flexible variation of their architectures. In extensive experiments on CIFAR-10, CIFAR-100, and ImageNet, DCNNs significantly improve over their architectural counterparts, and introducing doubly convolutional layers at any depth of a CNN improves its performance. We have also experimented with various configurations of DCNNs, all of which outperform the CNN counterpart with the same or fewer parameters.

References

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[2] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[3] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[5] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958, 2014.
[6] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[7] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675-678. ACM, 2014.
[8] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.
[9] Sander Dieleman, Jeffrey De Fauw, and Koray Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. arXiv preprint arXiv:1602.02660, 2016.
[10] Wenling Shang, Kihyuk Sohn, Diogo Almeida, and Honglak Lee. Understanding and improving convolutional neural networks via concatenated rectified linear units. arXiv preprint arXiv:1603.05201, 2016.
[11] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[12] Hongyang Li, Wanli Ouyang, and Xiaogang Wang. Multi-bias non-linear activation in deep neural networks. arXiv preprint arXiv:1604.00676, 2016.
[13] Robert Gens and Pedro M. Domingos. Deep symmetry networks. In Advances in Neural Information Processing Systems, pages 2537-2545, 2014.
[14] Taco S. Cohen and Max Welling. Group equivariant convolutional networks. arXiv preprint arXiv:1602.07576, 2016.
[15] Koray Kavukcuoglu, Rob Fergus, Yann LeCun, et al. Learning invariant features through topographic filter maps. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1605-1612. IEEE, 2009.
[16] Vikas Sindhwani, Tara Sainath, and Sanjiv Kumar. Structured transforms for small-footprint deep learning. In Advances in Neural Information Processing Systems 28, pages 3088-3096. Curran Associates, Inc., 2015.
[17] Yu Cheng, Felix X. Yu, Rogerio Feris, Sanjiv Kumar, and Shih-Fu Chang. An exploration of parameter redundancy in deep networks with circulant projections. In International Conference on Computer Vision (ICCV), 2015.
[18] Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song, and Ziyu Wang. Deep fried convnets. In International Conference on Computer Vision (ICCV), 2015.
[19] Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems 28 (NIPS), 2015.
[20] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[21] Matthew D. Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
[22] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. arXiv preprint arXiv:1409.5185, 2014.
[23] Forest Agostinelli, Matthew Hoffman, Peter J. Sadowski, and Pierre Baldi. Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830, 2014.
[24] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.