# Auto Learning Attention

Benteng Ma1,2, Jing Zhang2, Yong Xia1,3, Dacheng Tao2
1Northwestern Polytechnical University, China
2The University of Sydney, Australia
3Research & Development Institute of Northwestern Polytechnical University, Shenzhen
{mabenteng@mail,yxia@}.nwpu.edu.cn, {jing.zhang1, dacheng.tao}@sydney.edu.au

Abstract

Attention modules have been demonstrated to be effective in strengthening the representation ability of a neural network by reweighting spatial or channel features, or by stacking both operations sequentially. However, designing the structures of different attention operations requires substantial computation and extensive expertise. In this paper, we devise an Auto Learning Attention (AutoLA) method, which is the first attempt at automatic attention design. Specifically, we define a novel attention module named high order group attention (HOGA) as a directed acyclic graph (DAG) where each group represents a node and each edge represents an operation of heterogeneous attentions. A typical HOGA architecture can be searched automatically via the differentiable AutoLA method within 1 GPU day using the ResNet-20 backbone on CIFAR10. Furthermore, the searched attention module generalizes to various backbones as a plug-and-play component and outperforms popular manually designed channel and spatial attentions on many vision tasks, including image classification on CIFAR100 and ImageNet, as well as object detection and human keypoint detection on the COCO dataset. Code is available at https://github.com/btma48/AutoLA.

1 Introduction

Attention learning has been increasingly incorporated into convolutional neural networks (CNNs) [1], aiming to compact the image representation and strengthen its discriminatory power [2, 3, 4, 5]. It has been widely recognized that attention learning is beneficial for many computer vision tasks, such as image classification, segmentation, and object detection. There are two types of typical attention mechanisms. Channel attention reinforces the informative channels and suppresses irrelevant channels of feature maps [2], while spatial attention enables CNNs to dynamically concentrate processing resources at the locations of interest, resulting in better and more effective processing of information [4].

Let either channel attention or spatial attention be treated as first-order attention. The combination of channel attention and spatial attention constitutes second-order attention, which has been shown on benchmarks to produce better performance than either first-order attention alone by modulating the feature maps in both the channel and spatial dimensions [4]. Accordingly, we propose to extend attention modules from the first or second order to a higher order, i.e., arranging more basic attention units structurally. However, considering the highly variable structures and hyperparameters of basic attention units, exhaustively searching the architecture of a high order attention module leads to an exponential explosion in complexity.

Recent years have witnessed the unprecedented success of neural architecture search (NAS) in the automated design of neural network architectures, surpassing human designs on various tasks [6, 7, 8, 9, 10, 11, 12]. We advocate the use of NAS to search the optimal architecture of a high order attention module, which, however, is challenging for several reasons.
First, there is no explicit off-the-shelf definition of the search space for attention modules, where various attention operations may be included [13]. Second, the sequential structure for arranging different attention operations should be computationally efficient so that it can be searched within an affordable computational budget. Third, how to search the attention module, e.g., given a fixed backbone or jointly with the backbone cells, remains unclear. Fourth, the searched attention module is expected to generalize well to various backbones and tasks.

In this paper, we propose an Auto Learning Attention (AutoLA) method for automatically searching efficient and effective plug-and-play attention modules for various well-established backbone networks. Specifically, we first define a novel concept of attention module, i.e., high order group attention (HOGA), by exploring a split-reciprocate-aggregate strategy. Technically, each HOGA block receives the feature tensor from each block in the backbone as input, which is divided into K groups along the channel dimension to reduce the computational complexity. Then, a directed acyclic graph (DAG) [14] is constructed, where each node is associated with a split group and each edge represents a specific attention operation. The sequential connections between different nodes can represent different combinations of basic attention operations, resulting in various first-order to Kth-order attention modules, which together constitute the search space of HOGA. By customizing DARTS [8] for our problem, the explicit HOGA structure can be searched efficiently within 1 GPU day on a modern GPU given a fixed backbone network (e.g., ResNet-20) on CIFAR10. Extensive experiments demonstrate that the obtained HOGA generalizes well to various backbones and outperforms previous hand-crafted attentions on many vision tasks, including image classification on the CIFAR100 and ImageNet datasets, as well as object detection and human keypoint detection on the COCO dataset.

To summarize, the contribution of our paper is three-fold. First, to the best of our knowledge, AutoLA is the first attempt to extend NAS to searching plug-and-play attention modules beyond the backbone architecture. Second, we define a novel concept of attention module named HOGA that can represent high order attentions; the previous channel attention and spatial attention can be treated as its special cases. Third, we utilize a differentiable search method to find the optimal HOGA module efficiently, which generalizes well to various backbones and outperforms previous attention modules on many vision tasks.

2 Related work

Attention mechanism. The attention mechanism was originally introduced in neural machine translation to handle long-range dependencies [15], enabling the model to adaptively attend to important regions within a context. Self-attention was added to CNNs by either using channel attention or non-local relationships across the image [2, 3, 16, 17, 18]. As different feature channels encode different semantic concepts, the squeeze-and-excitation (SE) attention captures channel correlations by selectively modulating the scale of channels [2, 19]. Spatial attention is explored together with channel attention in [4], resulting in a second-order attention module called CBAM that achieves superior performance. In [19, 20], the attention is extended to multiple independent branches, which achieves better performance than the original single-branch design. A minimal sketch of the SE-style channel attention is shown below for reference.
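The following is a minimal PyTorch sketch of the SE-style channel attention summarized above [2]. It is an illustration only: the module name `SEAttention`, the reduction ratio of 16, and the tensor shapes are our own choices and are not details taken from this paper.

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Minimal SE-style channel attention: squeeze (global average pool),
    excite (two-layer bottleneck MLP), and rescale the channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: B x C x H x W -> B x C x 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                    # per-channel gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                         # reweight informative channels

# usage: gate a feature map from some backbone stage
feat = torch.randn(2, 64, 32, 32)
print(SEAttention(64)(feat).shape)           # torch.Size([2, 64, 32, 32])
```

CBAM [4] follows the same pattern but applies a spatial gate after the channel gate, which is why it is referred to as a second-order attention in this paper.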
In contrast to these handcrafted attention modules, we define the high order group attention and construct the search space accordingly, where SE [2] and CBAM [4] are special instances. Consequently, a more effective attention module can be searched automatically, outperforming both SE [2] and CBAM [4] on various vision tasks.

Neural Architecture Search. Among NAS methods, reinforcement learning [9, 21, 22], sequential optimization [10, 23], evolutionary algorithms [24, 25, 26], random search [27, 28], and performance predictors [29, 30] tend to demand immense computational resources, which makes them unsuitable for efficient search. Recent NAS methods reduce the search time significantly via weight-sharing [14, 31, 32] and continuous relaxation of the search space [8, 33]. DARTS [8] and its variants [34, 35, 36] only need to train a single neural network with repeated cells during the search process [37], providing elegant differentiable solutions for optimizing network weights and architecture simultaneously. Besides, DARTS is computationally efficient, being only slightly slower than training one architecture in the search space. Instead of searching basic cells and stacking them sequentially to form the backbone network, we propose to search an efficient and plug-and-play attention module given a fixed backbone, aiming to enhance the representation capacity of the backbone. The searched attention module shows good generalization ability across various backbones and downstream tasks, implying that the proposed method could be complementary to existing NAS-based searches of backbone architectures. Note that the architectures of the backbone and the attention module could also be searched alternately or synchronously in a unified framework, which we leave as future work.

3 Auto Learning Attention

3.1 High Order Group Attention

Figure 1: Attention Order. The panels illustrate first-order attention, second-order attention, and high order attention (order = K), together with the group split of F and the concatenation that forms F̂. The typical channel attention, spatial attention, and normalization attention are all first-order attentions. CBAM is a second-order attention. Solid lines represent specific attention operations, and dotted lines in the high order attention represent candidate attention operations that will be searched automatically.

From the viewpoint of computational flow, an attention module represents a function that transforms the input tensor F into an enhanced representation F̂ through a series of attention operations. This can be formalized as a computational graph, where the operations are represented as a directed acyclic graph with a set of nodes U. Each node U_k represents a tensor (we use the same symbol for the node and the tensor at the node without causing ambiguity). An attention operation o ∈ O is defined on the edge between U_k and its parent nodes P_k. In the typical first-order attention module, each node has a single parent, i.e., |P_k| = 1. Denoting the parent feature tensor P_k ∈ ℝ^{C×H×W} as input, the above attention operation can be defined as:

U_k = o(P_k).  (1)

Obviously, increasing the order of attention may increase the computational complexity. To generate an efficient high order attention module, we divide the input tensor F into K groups along the channel dimension, where K is a cardinality hyper-parameter. In this case, we get F = {F_0, F_1, ..., F_{K−1}}, which is illustrated in Figure 1 (a minimal sketch of this channel-wise split is given below).
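As an illustration of the split step only, the following sketch chunks a feature tensor into K groups along the channel dimension. The function name, K = 4, and the tensor shape are arbitrary choices for the example and are not prescribed by the paper.

```python
import torch

def group_split(feature: torch.Tensor, k: int):
    """Split a B x C x H x W feature tensor into K groups along the channel axis."""
    assert feature.size(1) % k == 0, "C must be divisible by the cardinality K"
    return torch.chunk(feature, k, dim=1)  # tuple (F_0, ..., F_{K-1}), each B x C/K x H x W

F = torch.randn(2, 64, 32, 32)             # example feature map from a backbone block
groups = group_split(F, k=4)
print(len(groups), groups[0].shape)        # 4 torch.Size([2, 16, 32, 32])
```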
Then, a series of operations o ∈ O (where O is the search space, which will be further explained in Section 3.2) is applied to the split group features {F_i}_{i=0}^{K−1}. This process generates K intermediate features as shown in Figure 1, where the kth intermediate tensor is calculated as:

U_k = o_{k,k}(F_k) + Σ_{j<k} o_{j,k}(U_j).  (2)
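To make the DAG computation concrete, here is a minimal, self-contained sketch of Eq. (2) followed by the concatenation that forms F̂. The concrete candidate operation set O and the way searched operations are assigned to edges belong to Section 3.2 and are not reproduced here; the `ChannelGate` placeholder, the class names, and K = 4 are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    """Placeholder edge operation: an SE-like channel gate standing in for a searched attention op."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x).unsqueeze(-1).unsqueeze(-1)

class HOGASketch(nn.Module):
    """Toy HOGA block: split F into K groups, compute intermediate nodes with
    U_k = o_{k,k}(F_k) + sum_{j<k} o_{j,k}(U_j), then concatenate the nodes to form F_hat."""
    def __init__(self, channels: int, k: int = 4):
        super().__init__()
        assert channels % k == 0
        self.k, c = k, channels // k
        # one placeholder op per DAG edge; in AutoLA these ops are selected from O by the search
        self.edge_ops = nn.ModuleDict({
            f"e{j}_{i}": ChannelGate(c) for i in range(k) for j in range(i + 1)
        })

    def forward(self, f):
        groups = torch.chunk(f, self.k, dim=1)           # (F_0, ..., F_{K-1})
        nodes = []
        for i in range(self.k):
            u = self.edge_ops[f"e{i}_{i}"](groups[i])    # o_{k,k}(F_k)
            for j in range(i):                           # + sum_{j<k} o_{j,k}(U_j)
                u = u + self.edge_ops[f"e{j}_{i}"](nodes[j])
            nodes.append(u)
        return torch.cat(nodes, dim=1)                   # concatenate to form F_hat

x = torch.randn(2, 64, 32, 32)
print(HOGASketch(64, k=4)(x).shape)                      # torch.Size([2, 64, 32, 32])
```

In the actual method, each edge would carry an operation chosen from the search space O by the differentiable search rather than a fixed placeholder gate.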