# Auto Learning Attention

Benteng Ma1,2, Jing Zhang2, Yong Xia1,3, Dacheng Tao2
1Northwestern Polytechnical University, China
2The University of Sydney, Australia
3Research & Development Institute of Northwestern Polytechnical University, Shenzhen
{mabenteng@mail,yxia@}.nwpu.edu.cn, {jing.zhang1, dacheng.tao}@sydney.edu.au

Abstract

Attention modules have been demonstrated to be effective in strengthening the representation ability of a neural network by reweighting spatial or channel features, or by stacking both operations sequentially. However, designing the structures of different attention operations requires substantial computation and extensive expertise. In this paper, we devise an Auto Learning Attention (AutoLA) method, which is the first attempt at automatic attention design. Specifically, we define a novel attention module named high order group attention (HOGA) as a directed acyclic graph (DAG) where each group represents a node and each edge represents an operation of heterogeneous attentions. A typical HOGA architecture can be searched automatically via the differentiable AutoLA method within 1 GPU day using the ResNet-20 backbone on CIFAR10. Furthermore, the searched attention module generalizes to various backbones as a plug-and-play component and outperforms popular manually designed channel and spatial attentions on many vision tasks, including image classification on CIFAR100 and ImageNet, as well as object detection and human keypoint detection on the COCO dataset. Code is available at https://github.com/btma48/AutoLA.

1 Introduction

Attention learning has been increasingly incorporated into convolutional neural networks (CNNs) [1], aiming to compact the image representation and strengthen its discriminatory power [2, 3, 4, 5]. It has been widely recognized that attention learning is beneficial for many computer vision tasks, such as image classification, segmentation, and object detection. There are two types of typical attention mechanisms. Channel attention reinforces the informative channels and suppresses irrelevant channels of feature maps [2], while spatial attention enables CNNs to dynamically concentrate processing resources at the locations of interest, resulting in better and more effective processing of information [4].

Let either channel attention or spatial attention be treated as first-order attention. The combination of channel attention and spatial attention constitutes second-order attention, which has been shown on benchmarks to produce better performance than either first-order attention alone by modulating the feature maps in both the channel and spatial dimensions [4]. Accordingly, we propose to extend attention modules from the first or second order to a higher order, i.e., arranging more basic attention units structurally. However, considering the highly variable structures and hyperparameters of basic attention units, exhaustively searching the architecture of a high order attention module leads to an exponential explosion in complexity.

Recent years have witnessed the unprecedented success of neural architecture search (NAS) in the automated design of neural network architectures, surpassing human designs on various tasks [6, 7, 8, 9, 10, 11, 12]. We advocate the use of NAS to search the optimal architecture of a high order attention module, which, however, is challenging for several reasons.
First, there is no explicit off-the-shelf definition of the search space for attention modules, where various attention operations may be included [13]. Second, the sequential structure for arranging different attention operations should be computationally efficient so that it can be searched within an affordable computational budget. Third, how to search the attention module, e.g., given a fixed backbone or jointly with the backbone cells, remains unclear. Fourth, the searched attention module is expected to generalize well to various backbones and tasks.

In this paper, we propose an Auto Learning Attention (AutoLA) method for automatically searching efficient and effective plug-and-play attention modules for various well-established backbone networks. Specifically, we first define a novel concept of attention module, i.e., high order group attention (HOGA), by exploring a split-reciprocate-aggregate strategy. Technically, each HOGA block receives the feature tensor from each block in the backbone as input, which is divided into K groups along the channel dimension to reduce the computational complexity. Then, a directed acyclic graph (DAG) [14] is constructed, where each node is associated with a split group and each edge represents a specific attention operation. The sequential connections between different nodes can represent different combinations of basic attention operations, resulting in various first-order to Kth-order attention modules, which together constitute the search space of HOGA. By customizing DARTS [8] for our problem, the explicit HOGA structure can be searched efficiently within 1 GPU day on a modern GPU given a fixed backbone network (e.g., ResNet-20) on CIFAR10. Extensive experiments demonstrate that the obtained HOGA generalizes well to various backbones and outperforms previous hand-crafted attentions on many vision tasks, including image classification on the CIFAR100 and ImageNet datasets, as well as object detection and human keypoint detection on the COCO dataset.

To summarize, the contribution of our paper is three-fold. First, to the best of our knowledge, AutoLA is the first attempt to extend NAS to searching plug-and-play attention modules beyond the backbone architecture. Second, we define a novel concept of attention module named HOGA that can represent high order attentions; the previous channel attention and spatial attention can be treated as its special cases. Third, we utilize a differentiable search method to find the optimal HOGA module efficiently, which generalizes well to various backbones and outperforms previous attention modules on many vision tasks.

2 Related work

Attention mechanism. The attention mechanism was originally introduced in neural machine translation to handle long-range dependencies [15], enabling the model to adaptively attend to important regions within a context. Self-attention was added to CNNs by either using channel attention or non-local relationships across the image [2, 3, 16, 17, 18]. As different feature channels encode different semantic concepts, the squeeze-and-excitation (SE) attention captures channel correlations by selectively modulating the scale of channels [2, 19]. Spatial attention is explored together with channel attention in [4], resulting in a second-order attention module called CBAM that achieves superior performance. In [19, 20], the attention is extended to multiple independent branches, which achieves better performance than the original single-branch design. A minimal sketch of the SE-style channel attention is shown below for reference.
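The following is a minimal PyTorch sketch of the SE-style channel attention summarized above [2]. It is an illustration only: the module name `SEAttention`, the reduction ratio of 16, and the tensor shapes are our own choices and are not details taken from this paper.

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Minimal SE-style channel attention: squeeze (global average pool),
    excite (two-layer bottleneck MLP), and rescale the channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: B x C x H x W -> B x C x 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                    # per-channel gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                         # reweight informative channels

# usage: gate a feature map from some backbone stage
feat = torch.randn(2, 64, 32, 32)
print(SEAttention(64)(feat).shape)           # torch.Size([2, 64, 32, 32])
```

CBAM [4] follows the same pattern but applies a spatial gate after the channel gate, which is why it is referred to as a second-order attention in this paper.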
In contrast to these handcrafted attention modules, we define the high order group attention and construct the search space accordingly, where SE [2] and CBAM [4] are special instances. Consequently, a more effective attention module can be searched automatically, outperforming both SE [2] and CBAM [4] on various vision tasks.

Neural Architecture Search. Among NAS methods, reinforcement learning [9, 21, 22], sequential optimization [10, 23], evolutionary algorithms [24, 25, 26], random search [27, 28], and performance predictors [29, 30] tend to demand immense computational resources, which makes them unsuitable for efficient search. Recent NAS methods reduce the search time significantly via weight-sharing [14, 31, 32] and continuous relaxation of the search space [8, 33]. DARTS [8] and its variants [34, 35, 36] only need to train a single neural network with repeated cells during the search process [37], providing elegant differentiable solutions for optimizing network weights and architecture simultaneously. Besides, DARTS is computationally efficient, being only slightly slower than training one architecture in the search space. Instead of searching basic cells and stacking them sequentially to form the backbone network, we propose to search an efficient and plug-and-play attention module given a fixed backbone, aiming to enhance the representation capacity of the backbone. The searched attention module shows good generalization ability across various backbones and downstream tasks, implying that the proposed method could be complementary to existing NAS-based searches of backbone architectures. Note that the architectures of the backbone and the attention module could also be searched alternately or synchronously in a unified framework, which we leave as future work.

3 Auto Learning Attention

3.1 High Order Group Attention

Figure 1: Attention Order. The panels illustrate first-order attention, second-order attention, and high order attention (order = K), together with the group split of F and the concatenation that forms F̂. The typical channel attention, spatial attention, and normalization attention are all first-order attentions. CBAM is a second-order attention. Solid lines represent specific attention operations, and dotted lines in the high order attention represent candidate attention operations that will be searched automatically.

From the viewpoint of computational flow, an attention module represents a function that transforms the input tensor F into an enhanced representation F̂ through a series of attention operations. This can be formalized as a computational graph, where the operations are represented as a directed acyclic graph with a set of nodes U. Each node U_k represents a tensor (we use the same symbol for the node and the tensor at the node without causing ambiguity). An attention operation o ∈ O is defined on the edge between U_k and its parent nodes P_k. In the typical first-order attention module, each node has a single parent, i.e., |P_k| = 1. Denoting the parent feature tensor P_k ∈ ℝ^{C×H×W} as input, the above attention operation can be defined as:

U_k = o(P_k).  (1)

Obviously, increasing the order of attention may increase the computational complexity. To generate an efficient high order attention module, we divide the input tensor F into K groups along the channel dimension, where K is a cardinality hyper-parameter. In this case, we get F = {F_0, F_1, ..., F_{K−1}}, which is illustrated in Figure 1 (a minimal sketch of this channel-wise split is given below).
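As an illustration of the split step only, the following sketch chunks a feature tensor into K groups along the channel dimension. The function name, K = 4, and the tensor shape are arbitrary choices for the example and are not prescribed by the paper.

```python
import torch

def group_split(feature: torch.Tensor, k: int):
    """Split a B x C x H x W feature tensor into K groups along the channel axis."""
    assert feature.size(1) % k == 0, "C must be divisible by the cardinality K"
    return torch.chunk(feature, k, dim=1)  # tuple (F_0, ..., F_{K-1}), each B x C/K x H x W

F = torch.randn(2, 64, 32, 32)             # example feature map from a backbone block
groups = group_split(F, k=4)
print(len(groups), groups[0].shape)        # 4 torch.Size([2, 16, 32, 32])
```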
Then, a series of operations o ∈ O (where O is the search space, which will be further explained in Section 3.2) is applied to the split group features {F_i}_{i=0}^{K−1}. This process generates K intermediate features as shown in Figure 1, where the kth intermediate tensor is calculated as:

U_k = o_{k,k}(F_k) + Σ_{j<k} o_{j,k}(U_j).  (2)
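To make the DAG computation concrete, here is a minimal, self-contained sketch of Eq. (2) followed by the concatenation that forms F̂. The concrete candidate operation set O and the way searched operations are assigned to edges belong to Section 3.2 and are not reproduced here; the `ChannelGate` placeholder, the class names, and K = 4 are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    """Placeholder edge operation: an SE-like channel gate standing in for a searched attention op."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x).unsqueeze(-1).unsqueeze(-1)

class HOGASketch(nn.Module):
    """Toy HOGA block: split F into K groups, compute intermediate nodes with
    U_k = o_{k,k}(F_k) + sum_{j<k} o_{j,k}(U_j), then concatenate the nodes to form F_hat."""
    def __init__(self, channels: int, k: int = 4):
        super().__init__()
        assert channels % k == 0
        self.k, c = k, channels // k
        # one placeholder op per DAG edge; in AutoLA these ops are selected from O by the search
        self.edge_ops = nn.ModuleDict({
            f"e{j}_{i}": ChannelGate(c) for i in range(k) for j in range(i + 1)
        })

    def forward(self, f):
        groups = torch.chunk(f, self.k, dim=1)           # (F_0, ..., F_{K-1})
        nodes = []
        for i in range(self.k):
            u = self.edge_ops[f"e{i}_{i}"](groups[i])    # o_{k,k}(F_k)
            for j in range(i):                           # + sum_{j<k} o_{j,k}(U_j)
                u = u + self.edge_ops[f"e{j}_{i}"](nodes[j])
            nodes.append(u)
        return torch.cat(nodes, dim=1)                   # concatenate to form F_hat

x = torch.randn(2, 64, 32, 32)
print(HOGASketch(64, k=4)(x).shape)                      # torch.Size([2, 64, 32, 32])
```

In the actual method, each edge would carry an operation chosen from the search space O by the differentiable search rather than a fixed placeholder gate.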