# DIANet: Dense-and-Implicit Attention Network

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Zhongzhan Huang,¹ Senwei Liang,² Mingfu Liang,³ Haizhao Yang²,⁴
¹Tsinghua University, ²Purdue University, ³Northwestern University, ⁴National University of Singapore
hzz dedekinds@foxmail.com, {liang339, haizhao}@purdue.edu, mingfuliang2020@u.northwestern.edu
(Equal contribution. Part of the work was done in Singapore.)

**Abstract.** Attention networks have successfully boosted performance in various vision problems. Previous works lay emphasis on designing a new attention module and plugging it into each layer of a network individually. Our paper proposes a novel-and-simple framework that shares an attention module throughout different network layers to encourage the integration of layer-wise information; this parameter-sharing module is referred to as a Dense-and-Implicit-Attention (DIA) unit. Many choices of modules can be used in the DIA unit. Since Long Short-Term Memory (LSTM) has a capacity for capturing long-distance dependency, we focus on the case where the DIA unit is a modified LSTM (called DIA-LSTM). Experiments on benchmark datasets show that the DIA-LSTM unit is capable of emphasizing layer-wise feature interrelation and leads to significant improvement of image classification accuracy. We further empirically show that DIA-LSTM has a strong regularization ability for stabilizing the training of deep networks, demonstrated by experiments with the removal of skip connections (He et al. 2016a) or Batch Normalization (Ioffe and Szegedy 2015) in the whole residual network.

## Introduction

Attention, a cognitive process that selectively focuses on a small part of information while neglecting other perceivable information (Anderson 2005), has been used to effectively ease neural networks from learning large information contexts from sentences (Vaswani et al. 2017; Britz et al. 2017; Cheng, Dong, and Lapata 2016), images (Xu et al. 2015; Luong, Pham, and Manning 2015), and videos (Miech, Laptev, and Sivic 2017). Especially in computer vision, deep neural networks (DNNs) incorporated with special operators that mimic the attention mechanism can process informative regions in an image efficiently. These operators are modularized and plugged into networks as attention modules (Hu, Shen, and Sun 2018; Woo et al. 2018; Park et al. 2018; Wang et al. 2018; Hu et al. 2018; Cao et al. 2019).

Figure 1: Left. Explicit structure of the network with a DIA unit. Right. Implicit connections caused by the DIA unit.

Previous works lay emphasis on designing a new attention module and plugging it into networks individually. Generally, an attention module can be divided into three parts: extraction, processing, and recalibration. First, the plug-in module extracts internal features of a network, which can be squeezed channel-wise information (Hu, Shen, and Sun 2018; Li et al. 2019) or spatial information (Wang et al. 2018; Woo et al. 2018; Park et al. 2018). Next, the module processes the extraction and generates a mask to measure the importance of the features via a fully connected layer (Hu, Shen, and Sun 2018) or a convolution layer (Wang et al. 2018). Last, the mask is applied to recalibrate the features.
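To make the three-part decomposition concrete, the following is a minimal PyTorch sketch of the conventional per-layer channel-attention design described above, in the spirit of SENet (Hu, Shen, and Sun 2018). The class name and the reduction ratio are illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class PerLayerChannelAttention(nn.Module):
    """One attention module owned by a single layer: extract, process, recalibrate."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Processing: a bottleneck MLP that turns pooled statistics into a mask in [0, 1].
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        squeezed = x.mean(dim=(2, 3))            # Extraction: global average pooling, (B, C)
        mask = self.fc(squeezed).view(b, c, 1, 1)  # Processing: per-channel importance mask
        return x * mask                           # Recalibration: channel-wise rescaling
```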
Previous works focus on designing effective ways to process the extracted features. There is one obvious common ground: the attention modules are individually plugged into each layer throughout the DNN (Hu, Shen, and Sun 2018; Woo et al. 2018; Park et al. 2018; Wang et al. 2018).

**Our Framework.** Differently, we propose a novel-and-simple framework that shares an attention module throughout different network layers to encourage the integration of layer-wise information; this parameter-sharing module is referred to as a Dense-and-Implicit-Attention (DIA) unit. The structure and computation flow of a DIA unit are visualized in Figure 2. A DIA unit also has three parts: extraction (1), processing (2), and recalibration (3). Part (2) is the main module in the DIA unit that models network attention, and it is the key innovation of the proposed method: the parameters of the attention module are shared across layers.

**Characteristics and Advantages.** (1) As shown in Figure 2, the DIA unit is placed parallel to the network backbone, and it is shared by all the layers in the same stage (the collection of successive layers with the same spatial size, as defined in (He et al. 2016a)) to improve the interaction of layers at different depths. (2) Since the DIA unit is shared, the parameter increment from the DIA unit remains roughly constant as the depth of the network increases.

Figure 2: DIA units in the residual network. F_ext denotes the operation for extracting different scales of features, and F_emp denotes the operation for emphasizing or recalibrating features.

We show the feasibility of our framework by applying the SE module (Hu, Shen, and Sun 2018) in the DIA unit. The SE module, a representative attention mechanism, is used for each block individually in its original design. In our framework, we share the same SE module (referred to as DIA-SE) throughout all layers in the same stage. DIA-SE clearly has the same computational cost as SE, but Table 1 shows that DIA-SE achieves better generalization with a smaller parameter increment.

| model | #P(M) | top-1 acc. |
| --- | --- | --- |
| Org | 1.73 | 73.43 (±0.43) |
| SE | 1.93 | 75.03 (±0.33) |
| DIA-SE | 1.74 | 75.74 (±0.41) |

Table 1: Testing accuracy (mean ± std, %) on CIFAR100 with ResNet164 and different attention modules. Org means the original ResNet164 backbone. #P(M) means the number of parameters (million).

**Implicit and Dense Connection.** We illustrate how the DIA unit connects all layers in the same stage implicitly and densely. Consider a stage consisting of many layers, as in Figure 1 (Left). It is an explicit structure with a DIA unit, and one layer seems not to connect to the other layers except through the network backbone. In fact, the different layers use the parameter-sharing attention module, and the layer-wise information jointly influences the update of the learnable parameters in the module, which causes implicit connections between layers with the help of the shared DIA unit, as in Figure 1 (Right). Since there is communication between every pair of layers, the connections over all layers are dense. The idea of parameter sharing is also used in Recurrent Neural Networks (RNNs) to capture contextual information, so we consider applying an RNN in our framework to model the layer-wise interrelation. Since Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) has a strong capacity for capturing long-distance dependency, we mainly focus on the case where we use an LSTM in the DIA unit (DIA-LSTM), and the remainder of our paper studies DIA-LSTM.
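The sharing idea described above (e.g., DIA-SE) can be sketched as follows, reusing the PerLayerChannelAttention sketch from the introduction: one attention instance is constructed per stage and reused by every residual block in that stage, so the parameter increment does not grow with depth. The block/stage classes are illustrative stand-ins, not the authors' code.

```python
import torch
import torch.nn as nn

class SharedAttentionStage(nn.Module):
    """One stage of a residual network whose blocks all reuse the same attention module."""
    def __init__(self, blocks: nn.ModuleList, channels: int, reduction: int = 16):
        super().__init__()
        self.blocks = blocks                                     # residual mappings of the stage
        self.shared_attn = PerLayerChannelAttention(channels, reduction)  # a single shared instance

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            a = block(x)                  # residual-branch output of the current layer
            a = self.shared_attn(a)       # the same module recalibrates every layer in the stage
            x = x + a                     # skip connection
        return x
```

Because `self.shared_attn` is created once per stage, its parameters receive gradients from every layer of the stage, which is exactly the implicit coupling between layers discussed above.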
Figure 3 is the showcase of DIA-LSTM. A global average pooling (GAP) operator (part (1) in Figure 2) is used to extract global information from the current layer. An LSTM module (part (2) in Figure 2) is used to integrate multi-scale information, and three inputs are passed to the LSTM: the extracted global information from the current raw feature map, the hidden state vector $h_{t-1}$, and the cell state vector $c_{t-1}$ from the previous layers. The LSTM then outputs the new hidden state vector $h_t$ and the new cell state vector $c_t$. The cell state vector $c_t$ stores the information from the $t$-th layer and its preceding layers. The new hidden state vector $h_t$ (dubbed the attention vector in our work) is then applied back to the raw feature map by channel-wise multiplication (part (3) in Figure 2) to recalibrate the feature.

The LSTM in the DIA unit plays the role of bridging the current layer and the preceding layers such that the DIA unit can adaptively learn the non-linear relationship between features in two different dimensions. The first dimension is the internal information of the current layer. The second dimension represents the outer information, regarded as layer-wise information, from the preceding layers. The non-linear relationship between these two dimensions can benefit attention modeling for the current layer. This multi-dimensional modeling enables DIA-LSTM to have a regularization effect.

## Our Contribution

We summarize our contributions as follows:
1. We propose a novel-and-simple framework that shares an attention module throughout different network layers to encourage the integration of layer-wise information.
2. We propose incorporating an LSTM in the DIA unit (DIA-LSTM) and show the effectiveness of DIA-LSTM for image classification by conducting experiments on benchmark datasets and popular networks.

## Related Works

**Attention Mechanism in Computer Vision.** Previous works use the attention mechanism in image classification by utilizing a recurrent neural network to select and process local regions at high resolution sequentially (Mnih et al. 2014; Zhao et al. 2017). Concurrent attention-based methods tend to construct operation modules to capture non-local information in an image (Wang et al. 2018; Cao et al. 2019) and to model the interrelationship between channel-wise features (Hu, Shen, and Sun 2018; Hu et al. 2018). The combination of multi-level attention is also widely studied (Park et al. 2018; Woo et al. 2018; Wang et al. 2019; 2017). Prior works (Wang et al. 2018; Cao et al. 2019; Hu, Shen, and Sun 2018; Hu et al. 2018; Park et al. 2018; Woo et al. 2018; Wang et al. 2019) usually insert an attention module in each layer individually. In this work, the DIA unit is innovatively shared by all the layers in the same stage of the network, and existing attention modules can be composited into the DIA unit readily. Besides, we adopt a global average pooling in part (1) to extract global information and a channel-wise multiplication in part (3) to recalibrate features, which is similar to SENet (Hu, Shen, and Sun 2018).

Figure 3: The showcase of DIA-LSTM. In the LSTM cell, $c_t$ is the cell state vector and $h_t$ is the hidden state vector. GAP means global average pooling over channels, and $\odot$ means channel-wise multiplication.

**Dense Network Topology.** DenseNet, proposed in (Huang et al. 2017), connects all pairs of layers directly with an identity map.
Through reusing features, DenseNet has the advantages of higher parameter efficiency, a better capacity for generalization, and more accessible training than alternative architectures (Lin, Chen, and Yan 2013; He et al. 2016a; Srivastava, Greff, and Schmidhuber 2015b). Instead of explicit dense connections, the DIA unit implicitly links layers at different depths via a shared module, which leads to dense connections.

**Multi-Dimension Feature Integration.** (Wolf and Bileschi 2006) experimentally shows that even the simple aggregation of low-level visual features sampled from a wide inception field can be efficient and robust for context representation, which inspires (Hu, Shen, and Sun 2018; Hu et al. 2018) to incorporate multi-level features to improve the network representation. (Li, Ouyang, and Wang 2016) also demonstrates that by biasing the feature response in each convolutional layer using different activation functions, deeper layers can achieve a better capacity for capturing abstract patterns in a DNN. In the DIA unit, the highly non-linear relationship between multi-dimensional features is learned and integrated via the LSTM module.

## Dense-and-Implicit Attention Network

In this section, we formally introduce the DIA-LSTM unit, which incorporates the modified LSTM module in the DIA unit. Afterward, a DIANet refers to a network built with DIA-LSTM units.

### Formulation of DIA-LSTM Unit

Figure 3 shows a DIA-LSTM unit built within a residual network (He et al. 2016a). The input of the $t$-th layer is $x_t \in \mathbb{R}^{W \times H \times N}$, where $W$, $H$, and $N$ denote width, height, and the number of channels, respectively. $f(\cdot\,; \theta_1^{(t)})$ is the residual mapping at the $t$-th layer with parameters $\theta_1^{(t)}$, as introduced in (He et al. 2016a). Let $a_t = f(x_t; \theta_1^{(t)}) \in \mathbb{R}^{W \times H \times N}$. Next, a global average pooling, denoted $\mathrm{GAP}(\cdot)$, is applied to $a_t$ to extract global information from the features in the current layer. Then $\mathrm{GAP}(a_t) \in \mathbb{R}^N$ is passed to the LSTM along with a hidden state vector $h_{t-1}$ and a cell state vector $c_{t-1}$ ($h_0$ and $c_0$ are initialized as zero vectors). The LSTM finally generates a current hidden state vector $h_t \in \mathbb{R}^N$ and a cell state vector $c_t \in \mathbb{R}^N$ as

$$(h_t, c_t) = \mathrm{LSTM}(\mathrm{GAP}(a_t),\, h_{t-1},\, c_{t-1};\, \theta_2), \qquad (1)$$

where $\theta_2$ is the trainable parameter within the LSTM. In our model, the hidden state vector $h_t$ is regarded as an attention vector to adaptively recalibrate feature maps. We apply channel-wise multiplication to enhance the importance of features, i.e., $a_t \odot h_t$, and obtain $x_{t+1}$ after the skip connection, i.e., $x_{t+1} = x_t + a_t \odot h_t$.

Table 2 shows the formulations of ResNet, SENet, and DIANet; part (b) is the main difference between them. The LSTM module is used repeatedly and shared across different layers in parallel to the network backbone. Therefore the number of parameters $\theta_2$ in the LSTM does not depend on the number of layers $t$ in the backbone. SENet utilizes an attention module consisting of fully connected layers to model the channel-wise dependency for each layer individually (Hu, Shen, and Sun 2018); the total number of parameters brought by its added-in modules depends on, and increases with, the number of layers in the backbone.

| | ResNet | SENet | DIANet (ours) |
| --- | --- | --- | --- |
| (a) | $a_t = f(x_t;\theta_1^{(t)})$ | $a_t = f(x_t;\theta_1^{(t)})$ | $a_t = f(x_t;\theta_1^{(t)})$ |
| (b) | – | $h_t = \mathrm{FC}(\mathrm{GAP}(a_t);\theta_2^{(t)})$ | $(h_t, c_t) = \mathrm{LSTM}(\mathrm{GAP}(a_t), h_{t-1}, c_{t-1};\theta_2)$ |
| (c) | $x_{t+1} = x_t + a_t$ | $x_{t+1} = x_t + a_t \odot h_t$ | $x_{t+1} = x_t + a_t \odot h_t$ |

Table 2: Formulations for the structures of ResNet, SENet, and DIANet. $f$ is the convolution layer. $\mathrm{GAP}(\cdot)$ means global average pooling. FC means a fully connected layer, and $\theta_2^{(t)}$ is the trainable parameter within the SE module at the $t$-th layer.
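The DIANet column of Table 2 can be sketched directly as a stage-level forward loop. The snippet below is a minimal PyTorch reading of Equation (1): a single LSTM cell shared by all blocks of a stage consumes $\mathrm{GAP}(a_t)$ and the recurrent state, and its hidden state recalibrates $a_t$ channel-wise. A standard `nn.LSTMCell` is used here for brevity; the paper's modified cell (dimension reduction and sigmoid output) is sketched in the next section.

```python
import torch
import torch.nn as nn

class DIALSTMStage(nn.Module):
    """One residual stage whose blocks share a single LSTM-based DIA unit."""
    def __init__(self, blocks: nn.ModuleList, channels: int):
        super().__init__()
        self.blocks = blocks                          # residual mappings f(.; theta_1^(t))
        self.lstm = nn.LSTMCell(channels, channels)   # shared parameters theta_2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c = x.size(0), x.size(1)
        h = x.new_zeros(b, c)                         # h_0 = 0
        cell = x.new_zeros(b, c)                      # c_0 = 0
        for block in self.blocks:
            a = block(x)                              # a_t = f(x_t; theta_1^(t))
            y = a.mean(dim=(2, 3))                    # GAP(a_t), shape (B, N)
            h, cell = self.lstm(y, (h, cell))         # (h_t, c_t) = LSTM(GAP(a_t), h_{t-1}, c_{t-1})
            x = x + a * h.view(b, c, 1, 1)            # x_{t+1} = x_t + a_t (channel-wise *) h_t
        return x
```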
### Modified LSTM Module

Now we introduce the modified LSTM module used in Figure 3. The design of an attention module usually requires the values of the attention vector to lie in the range $[0, 1]$ and also requires a small parameter increment. We therefore make some modifications to the LSTM module used in DIA-LSTM. As shown in Figure 4, compared to the standard LSTM module (Hochreiter and Schmidhuber 1997), there are two modifications in our proposed LSTM: (1) a shared linear transformation to reduce the input dimension of the LSTM; (2) a carefully selected activation function for better performance.

**(1) Parameter Reduction.** A standard LSTM consists of four linear transformation layers, as shown in Figure 4 (Top). Since $y_t$, $h_{t-1}$, and $h_t$ have the same dimension $N$, the standard LSTM causes an $8N^2$ parameter increment, as shown in the last section (Analysis). When the number of channels is large, e.g., $N = 2^{10}$, the parameter increment of the added-in LSTM module in the DIA unit would be over 8 million, which can hardly be tolerated. As shown in Figure 4 (Top), $h_{t-1}$ and $y_t$ in the standard LSTM are passed to four linear transformation layers with the same input and output dimension $N$. In DIA-LSTM, a linear transformation layer (denoted Linear1 in Figure 4 (Bottom)) with a smaller output dimension is applied to $h_{t-1}$ and $y_t$. We use a reduction ratio $r$ in Linear1. Specifically, we reduce the dimension of the input from $N$ to $N/r$ and then apply the ReLU activation function to increase the non-linearity of this module. The dimension of the output is changed back to $N$ when it is passed to the four linear transformation functions. This modification can enhance the relationship between the inputs of different parts in DIA-LSTM and also effectively reduce the number of parameters by sharing a linear transformation for dimension reduction. The parameter increment is reduced from $8N^2$ to $10N^2/r$, as shown in the last section (Analysis). By choosing an appropriate reduction ratio $r$, we can make a better trade-off between parameter reduction and the performance of DIANet. Further experimental results are discussed in the ablation study.

**(2) Activation Function.** The sigmoid function ($\sigma(z) = 1/(1 + e^{-z})$) is used in many attention-based methods such as SENet (Hu, Shen, and Sun 2018) and CBAM (Woo et al. 2018) to generate attention maps as a gate mechanism. As shown in Figure 4 (Bottom), we change the activation function of the output layer from tanh to sigmoid. Further discussion is presented in the ablation study.

Figure 4: Top. The standard LSTM cell. Bottom. The modified LSTM cell in the DIA-LSTM unit. We highlight the modified components in the modified LSTM. $\sigma$ means the sigmoid activation, and Linear means a linear transformation.
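A minimal sketch of the modified cell is given below: each input stream ($y_t$ and $h_{t-1}$) is first reduced from $N$ to $N/r$ and passed through ReLU, then projected back to $N$ for the four gates, and the output activation is a sigmoid so that $h_t$ lies in $[0, 1]$. The exact wiring (one reduction per stream, so that the count matches the $10N^2/r$ derived later) is our reading of the text and figure, not the authors' released code.

```python
import torch
import torch.nn as nn

class DIALSTMCell(nn.Module):
    """Modified LSTM cell: bottleneck reduction per input stream, sigmoid output gate activation."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = channels // reduction
        # Dimension reduction shared by the four gates of each input stream (Linear1 + ReLU).
        self.reduce_y = nn.Sequential(nn.Linear(channels, hidden, bias=False), nn.ReLU(inplace=True))
        self.reduce_h = nn.Sequential(nn.Linear(channels, hidden, bias=False), nn.ReLU(inplace=True))
        # Four gate projections (input, forget, cell, output) back to N, per stream.
        self.gates_y = nn.Linear(hidden, 4 * channels, bias=True)
        self.gates_h = nn.Linear(hidden, 4 * channels, bias=False)

    def forward(self, y: torch.Tensor, state):
        h_prev, c_prev = state
        z = self.gates_y(self.reduce_y(y)) + self.gates_h(self.reduce_h(h_prev))
        i, f, g, o = z.chunk(4, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        # Output activation changed from tanh to sigmoid so the attention vector stays in [0, 1].
        h = torch.sigmoid(o) * torch.sigmoid(c)
        return h, c
```

Dropping this cell in place of `nn.LSTMCell` in the DIALSTMStage sketch above reproduces the DIA-LSTM unit as we understand it from the paper.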
## Experiments

In this section, we evaluate the performance of the DIA-LSTM unit in image classification and empirically demonstrate its effectiveness. We conduct experiments on popular networks and benchmark datasets. Since SENet (Hu, Shen, and Sun 2018) is also a channel-specific attention model, we compare DIANet with SENet. For a fair comparison, we adjust the reduction ratio such that the number of parameters of DIANet is similar to that of SENet.

**Dataset and Model.** We conduct experiments on CIFAR10, CIFAR100 (Krizhevsky and Hinton 2009), and ImageNet 2012 (Russakovsky et al. 2015) using ResNet (He et al. 2016a), PreResNet (He et al. 2016b), WRN (Zagoruyko and Komodakis 2016), and ResNeXt (Xie et al. 2017). CIFAR10 and CIFAR100 each have 50k training images and 10k test images of size 32 by 32, with 10 and 100 classes respectively. ImageNet 2012 (Russakovsky et al. 2015) comprises 1.28 million training and 50k validation images from 1000 classes, and random cropping of size 224 by 224 is used in our experiments. The details can be found in the Appendix.

| Model | Dataset | original #P(M) | original top-1 | SENet #P(M) | SENet top-1 | DIANet #P(M) | DIANet top-1 | r |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet164 | CIFAR100 | 1.73 | 73.43 | 1.93 | 75.03 | 1.95 | 76.67 | 4 |
| PreResNet164 | CIFAR100 | 1.73 | 76.53 | 1.92 | 77.41 | 1.96 | 78.20 | 4 |
| WRN52-4 | CIFAR100 | 12.07 | 79.75 | 12.42 | 80.35 | 12.30 | 80.99 | 4 |
| ResNeXt101,8x32 | CIFAR100 | 32.14 | 81.18 | 34.03 | 82.45 | 33.01 | 82.46 | 4 |
| ResNet164 | CIFAR10 | 1.70 | 93.54 | 1.91 | 94.27 | 1.92 | 94.58 | 4 |
| PreResNet164 | CIFAR10 | 1.70 | 95.01 | 1.90 | 95.18 | 1.94 | 95.23 | 4 |
| WRN52-4 | CIFAR10 | 12.05 | 95.96 | 12.40 | 95.95 | 12.28 | 96.17 | 4 |
| ResNeXt101,8x32 | CIFAR10 | 32.09 | 95.73 | 33.98 | 96.09 | 32.96 | 96.24 | 4 |
| ResNet34 | ImageNet | 21.81 | 73.93 | 21.97 | 74.39 | 21.98 | 74.60 | 20 |
| ResNet50 | ImageNet | 25.58 | 76.01 | 28.09 | 76.61 | 28.38 | 77.24 | 20 |
| ResNet152 | ImageNet | 60.27 | 77.58 | 66.82 | 78.36 | 65.85 | 78.87 | 10 |
| ResNeXt50,32x4 | ImageNet | 25.03 | 77.19 | 27.56 | 78.04 | 27.83 | 78.32 | 20 |

Table 3: Testing accuracy (%) on CIFAR10, CIFAR100, and ImageNet 2012. #P(M) means the number of parameters (million). The rightmost column r indicates the reduction ratio of DIANet.

**Image Classification.** As shown in Table 3, DIANet improves testing accuracy significantly over the original networks and consistently outperforms SENet across different datasets. In particular, the performance improvement of ResNet with the DIA unit is most remarkable. Given the popularity of ResNet, the DIA unit may also be applied in other computer vision tasks.

## Ablation Study

In this section, we conduct ablation experiments to explore how to better plug DIA-LSTM units into different neural network structures and to gain a deeper understanding of the role of the components in the unit. All experiments are performed on CIFAR100 with ResNet. For simplicity, DIANet164 denotes a 164-layer ResNet built with DIA-LSTM units.

**Reduction Ratio.** The reduction ratio is the only hyperparameter in DIANet. The main advantage of our model is improving generalization ability with a light parameter increment; a smaller reduction ratio causes a higher parameter increment and model complexity. This part investigates the trade-off between model complexity and performance. As shown in Table 4, the number of parameters of DIANet decreases with increasing reduction ratio, but the testing accuracy declines only slightly, which suggests that model performance is not sensitive to the reduction ratio. In the case of r = 16, DIANet164 has a 0.05M parameter increment compared to the original ResNet164, yet its testing accuracy is 76.50% while that of ResNet164 is 73.43%.

**The Depth of Networks.** Generally, in practice, DNNs with a larger number of parameters do not guarantee sufficient performance improvement; deeper networks may contain extreme feature and parameter redundancy (Huang et al. 2017). Therefore, designing new structures of deep neural networks is necessary (He et al. 2016a; Huang et al. 2017; Srivastava, Greff, and Schmidhuber 2015a; Hu, Shen, and Sun 2018; Hu et al. 2018; Wang et al. 2018).
| Ratio r | #P(M) | top-1 acc. |
| --- | --- | --- |
| 1 | 2.59 (+0.86) | 76.88 |
| 4 | 1.95 (+0.22) | 76.67 |
| 8 | 1.84 (+0.11) | 76.42 |
| 16 | 1.78 (+0.05) | 76.50 |

Table 4: Test accuracy (%) with different reduction ratios on CIFAR100 with ResNet164. The value in brackets is the parameter increment compared with the original ResNet164 (1.73M).

Since DIA units change the topology of DNN backbones, evaluating the effectiveness of the DIANet structure is important. Here we investigate how the depth of DNNs influences DIANet in two aspects: (1) the performance of DIANet compared to SENet at various depths; (2) the parameter increment of DIANet. The results in Table 5 show that as the depth increases from 83 to 407 layers, DIANet achieves higher classification accuracy than SENet with a smaller number of parameters. Moreover, DIANet83 achieves a performance similar to SENet245, and DIANet164 outperforms all the SENets with at least 1.13% and at most 58.8% parameter reduction. These results imply that DIANet has higher parameter efficiency than SENet. They also suggest that, as shown in Figure 3, the DIA-LSTM unit passes through more layers recurrently as the depth grows; DIA-LSTM can handle the interrelationship between the information of different layers in much deeper DNNs and capture the long-distance dependency between layers.

| CIFAR-100 | SENet #P(M) | SENet acc. | DIANet (r=4) #P(M) | DIANet (r=4) acc. |
| --- | --- | --- | --- | --- |
| ResNet83 | 0.99 | 74.67 | 1.11 (+0.12) | 75.02 |
| ResNet164 | 1.93 | 75.03 | 1.95 (+0.02) | 76.67 |
| ResNet245 | 2.87 | 75.03 | 2.78 (−0.09) | 76.79 |
| ResNet407 | 4.74 | 75.54 | 4.45 (−0.29) | 76.98 |

Table 5: Test accuracy (%) with ResNets of different depth on CIFAR100. The value in brackets is the parameter difference of DIANet relative to SENet.

Figure 5: Visualization of feature integration for each stage by random forest. Each row presents the importance of source layers $h_n$, $1 \le n < t$, contributing to the target layer $h_t$.

**Activation Function and the Number of Stacking Cells.** We choose different activation functions in the output layer of the LSTM in Figure 4 (Bottom) and different numbers of stacked LSTM cells to explore the effects of these two factors. In Table 6, we find that the performance is significantly improved after replacing tanh in the standard LSTM with sigmoid. As shown in Figure 4 (Bottom), this activation function is located in the output layer and directly changes the effect of the memory unit $c_t$ on the output of the output gate. In fact, sigmoid is used in many attention-based methods such as SENet as a gate mechanism. The test accuracies of different choices of LSTM activation functions in Table 6 show that sigmoid better helps the LSTM act as a gate to rescale channel features. Table 12 in the SENet paper (Hu, Shen, and Sun 2018) reports the performance of different activation functions as sigmoid > tanh > ReLU (bigger is better), which coincides with our reported results. When we use sigmoid in the output layer of the LSTM, increasing the number of stacked LSTM cells does not necessarily lead to performance improvement and may even lead to degradation. However, when we choose tanh, the situation is different. This suggests that stacking LSTM cells changes the scale of the information flow among them, which may affect the performance.

| #P(M) | Activation | #LSTM cells | top-1 acc. |
| --- | --- | --- | --- |
| 1.95 | sigmoid | 1 | 76.67 |
| 1.95 | tanh | 1 | 75.24 |
| 1.95 | ReLU | 1 | 74.62 |
| 3.33 | sigmoid | 3 | 75.20 |
| 3.33 | tanh | 3 | 76.47 |

Table 6: Test accuracy (%) on CIFAR100 with DIANet164 using different activation functions at the output layer of the modified LSTM and different numbers of stacked LSTM cells.

## Analysis

In this section, we study some properties of DIANet, including feature integration and the regularization effect on stabilizing training.
Firstly, the layers are connected by the DIA-LSTM unit in DIANet, and we use the random forest model (Gregorutti, Michel, and Saint-Pierre 2017) to visualize how the current layer depends on the preceding layers. Secondly, we study the stabilizing effect of DIANet on training by removing all the Batch Normalization (Ioffe and Szegedy 2015) or skip connections in the residual network.

### Feature Integration

Here we try to understand the dense connection from a numerical perspective. As shown in Figures 3 and 1, the DIA-LSTM bridges the connections between layers by propagating information forward through $h_t$ and $c_t$. Moreover, $h_t$ at each layer also integrates with $h_{t'}$, $1 \le t' < t$, in DIA-LSTM. Notably, $h_t$ is applied directly to the features in the network at each layer $t$. Therefore the relationship between the $h_t$ at different layers reflects the degree of layer-wise connection to some extent. We explore the non-linear relationship between the hidden state $h_t$ of DIA-LSTM and the preceding hidden states $h_{t-1}, h_{t-2}, \ldots, h_1$, and visualize how the information coming from $h_{t-1}, h_{t-2}, \ldots, h_1$ contributes to $h_t$. To reveal this relationship, we use a random forest to visualize variable importance. The random forest can return the contribution of each input variable to the output separately in the form of an importance measure, e.g., Gini importance (Gregorutti, Michel, and Saint-Pierre 2017). The computation details of the Gini importance are given in Algorithm 1. Taking $h_n$, $1 \le n < t$, as input variables and $h_t$ as the output variable, we obtain the Gini importance of each variable $h_n$, $1 \le n < t$. ResNet164 contains three stages, and each stage consists of 18 layers. We conduct the Gini importance computation for each stage separately. As shown in Figure 5, each row presents the importance of source layers $h_n$, $1 \le n < t$, contributing to the target layer $h_t$.

Algorithm 1: Calculate feature integration using Gini importance by random forest

    Input:  H, composed of h_1, h_2, ..., h_fz from Stage i.
            (H has size bz x fz x cz, where bz is the number of training samples,
             cz the number of channels in Stage i, and fz the number of layers in Stage i.)
    Output: the heatmap G of feature integration for Stage i.
     1: initialize G <- empty
     2: for t = 2 to fz do
     3:     x <- [h_1, h_2, ..., h_{t-1}], reshaped to (bz, (t-1) * cz)
     4:     y <- h_t
     5:     RF <- RandomForestRegressor()
     6:     RF.fit(x, y)
     7:     GiniImportance <- RF.feature_importances_      # length (t-1) * cz
     8:     res <- empty
     9:     for n = 1 to t-1 do
    10:         res.add(sum of the cz entries of GiniImportance belonging to h_n)
    11:     end for
    12:     G.add(res / max(res))
    13: end for
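The following is a runnable sketch of Algorithm 1 using scikit-learn. The `hidden_states` array, of shape (samples, layers, channels) for one stage, is an assumed input collected from a trained DIANet; this snippet only performs the regression-and-aggregation step.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def layerwise_gini_importance(hidden_states: np.ndarray) -> list:
    """hidden_states: array of shape (bz, fz, cz) holding h_1..h_fz for one stage."""
    bz, fz, cz = hidden_states.shape
    heatmap = []
    for t in range(1, fz):                                   # target layer h_{t+1} (0-based index t)
        x = hidden_states[:, :t, :].reshape(bz, t * cz)      # predictors: all earlier hidden states
        y = hidden_states[:, t, :]                           # multi-output target: current hidden state
        rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
        rf.fit(x, y)
        imp = rf.feature_importances_.reshape(t, cz).sum(axis=1)  # aggregate Gini importance per source layer
        heatmap.append(imp / imp.max())                      # one row of the Figure 5 heatmap
    return heatmap
```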
In each sub-graph of Figure 5, the diversity of the variable importance distribution indicates that the current layer uses information from the preceding layers. The interaction between shallow and deep layers in the same stage reveals the effect of the implicitly dense connection. In particular, taking $h_{17}$ in stage 1 (the last row) as an example, $h_{16}$ or $h_{15}$ does not provide the most information for $h_{17}$ as one might intuitively expect, but $h_5$ does. The DIA unit can adaptively integrate information across multiple layers. Moreover, in Figure 5 (stage 3), the information interaction with previous layers in stage 3 is more intense and frequent than in the first two stages. Correspondingly, as shown in Table 7, when we remove the DIA-LSTM unit in stage 3, the classification accuracy decreases from 76.67 to 75.40. However, when it is removed in stage 1 or 2, the performance degradation is very similar, falling to 76.27 and 76.25 respectively. Also note that for DIANet, the parameter increment in stage 2 is larger than that of stage 1. This implies that the significant performance degradation after the removal in stage 3 may be due not only to the reduced number of parameters but also to the lack of dense feature integration.

| remove | #P(M) | Δ#P(M) | top-1 acc. | Δacc. |
| --- | --- | --- | --- | --- |
| stage1 | 1.94 | −0.01 | 76.27 | −0.40 |
| stage2 | 1.90 | −0.05 | 76.25 | −0.42 |
| stage3 | 1.78 | −0.17 | 75.40 | −1.27 |

Table 7: Test accuracy (%) of DIANet164 with the removal of the DIA-LSTM unit in different stages. Δ columns give the change relative to the full DIANet164 (1.95M, 76.67%).

### The Effect on Stabilizing Training

**Removal of Batch Normalization.** Small changes in shallower hidden layers may be amplified as the information propagates through a deep architecture and sometimes result in numerical explosion. Batch Normalization (BN) (Ioffe and Szegedy 2015) is widely used in deep networks since it stabilizes training by normalizing the input of each layer. The DIA-LSTM unit recalibrates the feature maps by channel-wise multiplication, which plays a scaling role similar to BN. Table 8 shows the performance of models of different depth trained on CIFAR100 with all BN layers removed. The experiments are conducted on a single GPU with batch size 128 and initial learning rate 0.1. Both the original ResNet and SENet face numerical explosion without BN, while DIANet can be trained with depth up to 245. In Table 8, at the same depth, SENet has a larger number of parameters than DIANet but still suffers numerical explosion without BN, which suggests that the number of parameters is not what stabilizes training; the sharing mechanism we propose may be. Besides, compared with Table 5, the testing accuracy of DIANet without BN still stays above 70%. The scaling learned by DIANet integrates the information from preceding layers and enables the network to choose a better scaling for the features of the current layer.

| | original #P(M) | original top-1 | SENet #P(M) | SENet top-1 | DIANet (r=16) #P(M) | DIANet (r=16) top-1 |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet83 | 0.88 | nan | 0.98 | nan | 0.94 | 70.58 |
| ResNet164 | 1.70 | nan | 1.91 | nan | 1.76 | 72.36 |
| ResNet245 | 2.53 | nan | 2.83 | nan | 2.58 | 72.35 |
| ResNet326 | 3.35 | nan | 3.75 | nan | 3.41 | nan |

Table 8: Testing accuracy (%). We train models of different depth without BN on CIFAR-100. nan indicates numerical explosion.

**Removal of Skip Connection.** The skip connection has become a necessary structure for training DNNs (He et al. 2016b). Without skip connections, a DNN is hard to train for reasons such as gradient vanishing (Bengio et al. 1994; Glorot and Bengio 2010; Srivastava, Greff, and Schmidhuber 2015a). We conduct an experiment in which all skip connections are removed from ResNet56 and measure the absolute value of the gradient at the output tensor of each stage. As shown in Figure 6, which presents the gradient distribution with all skip connections removed, DIANet (blue) clearly enlarges the mean and variance of the gradient distribution, which enables larger absolute values and greater diversity of the gradient and relieves gradient degradation to some extent.

Figure 6: The distribution of the gradient in each stage of ResNet56 without any skip connections.
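One way the per-stage gradient statistics in Figure 6 could be collected is sketched below, using PyTorch tensor hooks on the output of each stage during a single backward pass. The attribute names `model.stem`, `model.stage1/2/3`, and `model.head` are illustrative assumptions, not the authors' code.

```python
import torch

def collect_stage_gradients(model, images, labels, criterion):
    """Record |grad| at the output tensor of each stage for one mini-batch."""
    grads = {}

    def make_hook(name):
        def hook(grad):
            grads[name] = grad.detach().abs().flatten().cpu()  # fires when grad w.r.t. this tensor is computed
        return hook

    x = model.stem(images)                                     # assumed initial conv/stem
    for name in ("stage1", "stage2", "stage3"):                # assumed stage attributes
        x = getattr(model, name)(x)
        x.register_hook(make_hook(name))
    loss = criterion(model.head(x), labels)                    # assumed classifier head
    loss.backward()
    return grads                                               # e.g. plot histograms of grads["stage1"], ...
```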
**Without Data Augmentation.** Explicit dense connections may bring more efficient usage of parameters, which makes the neural network less prone to overfitting (Huang et al. 2017). Although the dense connections in DIA-LSTM are implicit, DIANet still shows the ability to reduce overfitting. To verify this, we train the models without data augmentation to reduce the regularization influence of data augmentation. As shown in Table 9, DIANet achieves a lower testing error than ResNet164 and SENet. To some extent, the implicit and dense structure of DIANet may have a regularization effect.

| Models | CIFAR-10 | CIFAR-100 |
| --- | --- | --- |
| ResNet164 | 87.32 | 60.92 |
| SENet | 88.30 | 62.91 |
| DIANet | 89.25 | 66.73 |

Table 9: Test accuracy (%) of the models without data augmentation with ResNet164.

## Number of Parameters in LSTM

This section gives the parameter costs of the standard LSTM and the modified LSTM with reduction ratio $r$. The input $y_t$, the hidden state vector $h_{t-1}$, and the output in Figure 4 all have the same size $N$, which is equal to the number of channels.

**Standard LSTM.** There are four linear transformations in the standard LSTM, as shown in Figure 4 (Top), to control the information flow for the inputs $y_t$ and $h_{t-1}$ respectively. To simplify the calculation, the bias is omitted. For $y_t$, the number of parameters of the four linear transformations equals $4N^2$. Similarly, the number of parameters of the four linear transformations for the input $h_{t-1}$ equals $4N^2$. The total number equals $8N^2$.

**DIA-LSTM.** As shown in Figure 4 (Bottom), there is a linear transformation at the beginning that reduces the dimension of the input $y_t$ from $N$ to $N/r$; its number of parameters equals $N^2/r$. The output is then passed to four linear transformations, as in the standard LSTM, whose number of parameters equals $4N^2/r$. Therefore, for the input $y_t$ and reduction ratio $r$, the total number of parameters is $5N^2/r$. Similarly, the number of parameters for the input $h_{t-1}$ is the same as for $y_t$. The total number of parameters is $10N^2/r$.
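A quick numeric check of these counts (biases ignored): a standard LSTM costs $8N^2$ parameters, the modified cell $10N^2/r$. For $N = 1024$ and $r = 4$ this is about 8.4M versus 2.6M parameters per shared unit. The small script below just evaluates these formulas.

```python
def lstm_params(n: int) -> int:
    # Four N x N transforms for y_t plus four for h_{t-1}.
    return 8 * n * n

def dia_lstm_params(n: int, r: int) -> int:
    reduce = n * (n // r)           # N -> N/r reduction, one per input stream
    gates = 4 * (n // r) * n        # four (N/r) -> N gate projections per stream
    return 2 * (reduce + gates)     # = 10 * N^2 / r

print(lstm_params(1024), dia_lstm_params(1024, 4))   # 8388608 2621440
```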
## Conclusion

In this paper, we propose a novel-and-simple framework that shares an attention module throughout different network layers to encourage the integration of layer-wise information. The parameter-sharing module is called the Dense-and-Implicit Attention (DIA) unit. We propose incorporating an LSTM in the DIA unit (DIA-LSTM) and show the effectiveness of DIA-LSTM for image classification by conducting experiments on benchmark datasets and popular networks. We further empirically show that DIA-LSTM has a strong regularization ability for stabilizing the training of deep networks, demonstrated by experiments with the removal of skip connections or Batch Normalization (Ioffe and Szegedy 2015) in the whole residual network.

## Acknowledgments

S. Liang and H. Yang gratefully acknowledge the support of the National Supercomputing Center (NSCC) Singapore and High Performance Computing (HPC) of the National University of Singapore for providing computational resources, and the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. We sincerely thank Xin Wang from Tsinghua University for providing personal computing resources. H. Yang thanks the support of the startup grant by the Department of Mathematics at the National University of Singapore and the Ministry of Education in Singapore for the grant MOE2018-T2-2-147.

## References

Anderson, J. R. 2005. Cognitive psychology and its implications. Macmillan.
Bengio, Y.; Simard, P.; Frasconi, P.; et al. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2):157–166.
Britz, D.; Goldie, A.; Luong, M.-T.; and Le, Q. 2017. Massive exploration of neural machine translation architectures. arXiv preprint arXiv:1703.03906.
Cao, Y.; Xu, J.; Lin, S.; Wei, F.; and Hu, H. 2019. GCNet: Non-local networks meet squeeze-excitation networks and beyond. arXiv preprint arXiv:1904.11492.
Cheng, J.; Dong, L.; and Lapata, M. 2016. Long short-term memory-networks for machine reading. EMNLP 2016.
Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249–256.
Gregorutti, B.; Michel, B.; and Saint-Pierre, P. 2017. Correlation and variable importance in random forests. Statistics and Computing 27(3):659–678.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016a. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016b. Identity mappings in deep residual networks. In European Conference on Computer Vision.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
Hu, J.; Shen, L.; Albanie, S.; Sun, G.; and Vedaldi, A. 2018. Gather-excite: Exploiting feature context in convolutional neural networks. In Advances in Neural Information Processing Systems, 9401–9411.
Hu, J.; Shen, L.; and Sun, G. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7132–7141.
Huang, G.; Liu, Z.; Weinberger, K. Q.; and van der Maaten, L. 2017. Densely connected convolutional networks. In Computer Vision and Pattern Recognition.
Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, 448–456. JMLR.org.
Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.
Li, X.; Wang, W.; Hu, X.; and Yang, J. 2019. Selective kernel networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 510–519.
Li, H.; Ouyang, W.; and Wang, X. 2016. Multi-bias nonlinear activation in deep neural networks. In International Conference on Machine Learning, 221–229.
Lin, M.; Chen, Q.; and Yan, S. 2013. Network in network. arXiv preprint arXiv:1312.4400.
Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
Miech, A.; Laptev, I.; and Sivic, J. 2017. Learnable pooling with context gating for video classification. arXiv preprint arXiv:1706.06905.
Mnih, V.; Heess, N.; Graves, A.; et al. 2014. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, 2204–2212.
Park, J.; Woo, S.; Lee, J.-Y.; and Kweon, I. S. 2018. BAM: Bottleneck attention module. arXiv preprint arXiv:1807.06514.
Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; and Fei-Fei, L. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3):211–252.
Srivastava, R. K.; Greff, K.; and Schmidhuber, J. 2015a. Training very deep networks. In Advances in Neural Information Processing Systems, 2377–2385.
Srivastava, R. K.; Greff, K.; and Schmidhuber, J. 2015b. Highway networks. arXiv preprint arXiv:1505.00387.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; and Tang, X. 2017. Residual attention network for image classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Wang, X.; Girshick, R.; Gupta, A.; and He, K. 2018. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7794–7803.
Wang, X.; Cai, Z.; Gao, D.; and Vasconcelos, N. 2019. Towards universal object detection by domain attention. CoRR abs/1904.04402.
Wolf, L., and Bileschi, S. 2006. A critical view of context. International Journal of Computer Vision 69(2):251–261.
Woo, S.; Park, J.; Lee, J.-Y.; and So Kweon, I. 2018. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), 3–19.
Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; and He, K. 2017. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition.
Xu, K.; Ba, J. L.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R. S.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, 2048–2057. JMLR.org.
Zagoruyko, S., and Komodakis, N. 2016. Wide residual networks. In BMVC.
Zhao, B.; Wu, X.; Feng, J.; Peng, Q.; and Yan, S. 2017. Diversified visual attention networks for fine-grained object classification. IEEE Transactions on Multimedia 19(6):1245–1256.