# Semantic-Aware Transformation-Invariant RoI Align

Guo-Ye Yang1, George Kiyohiro Nakayama2, Zi-Kai Xiao1, Tai-Jiang Mu1*, Xiaolei Huang3, Shi-Min Hu1
1 BNRist, Department of Computer Science and Technology, Tsinghua University
2 Stanford University
3 College of Information Sciences and Technology, Pennsylvania State University
yanggy19@mails.tsinghua.edu.cn, w4756677@stanford.edu, xzk23@mails.tsinghua.edu.cn, taijiang@tsinghua.edu.cn, suh972@psu.edu, shimin@tsinghua.edu.cn

*Corresponding author: Tai-Jiang Mu. arXiv version with appendices: https://arxiv.org/abs/2312.09609

Copyright 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## Abstract

Great progress has been made in learning-based object detection methods in the last decade. Two-stage detectors often have higher detection accuracy than one-stage detectors, due to their use of region-of-interest (RoI) feature extractors which extract transformation-invariant RoI features for different RoI proposals, making refinement of bounding boxes and prediction of object categories more robust and accurate. However, previous RoI feature extractors can only extract invariant features under limited transformations. In this paper, we propose a novel RoI feature extractor, termed Semantic RoI Align (SRA), which is capable of extracting invariant RoI features under a variety of transformations for two-stage detectors. Specifically, we propose a semantic attention module to adaptively determine different sampling areas by leveraging the global and local semantic relationship within the RoI. We also propose a Dynamic Feature Sampler, which dynamically samples features based on the RoI aspect ratio to enhance the efficiency of SRA, and a new position embedding, i.e., Area Embedding, which provides more accurate position information for SRA through an improved sampling-area representation. Experiments show that our model significantly outperforms baseline models with slight computational overhead. In addition, it shows excellent generalization ability and can be used to improve performance with various state-of-the-art backbones and detection methods. The code is available at https://github.com/cxjyxxme/SemanticRoIAlign.

## Introduction

As a fundamental computer vision task, object detection aims to locate and recognize objects of interest in input images. In the last decade, great progress has been made in learning-based object detection methods, making them widely useful in daily applications such as face recognition, text detection, and pedestrian detection, among others. Most existing detection methods can be grouped into two categories, i.e., one-stage detectors (Liu et al. 2016; Redmon et al. 2016) and two-stage detectors (Ren et al. 2015; He et al. 2017). One-stage detectors directly predict objects with a single neural network in an end-to-end manner.
In contrast, two-stage detectors first propose a list of object proposals, and then predict the proposals' labels and refine their bounding boxes using RoI features extracted for each proposal. RoI features provide better transformation-invariance for different proposal regions, and using them thus allows the second stage of a two-stage detector to better refine bounding boxes and predict the category of each proposal. In this paper, we mainly focus on improving the RoI feature extractor for two-stage detectors.

Similar objects may show great appearance differences in images due to different environmental conditions, object poses, etc., making it difficult for detection to generalize under various transformations (Girshick 2015). Therefore, many RoI feature extractors aim to extract transformation-invariant object features. RoI Pooling (Girshick 2015) is the pioneering work for RoI feature extraction; it pools features in a fixed number of sub-regions of the RoI and obtains scale-invariant features. RoI Align (He et al. 2017) further improves the positional accuracy of RoI Pooling via bilinear interpolation. RoI Transformer (Ding et al. 2019) extracts rotation-invariant features by rotating sampling positions with a regressed rotation angle. However, previous methods cannot extract invariant features for more complex transformations like perspective transformation and object pose transformation. Although there exist works like Deformable RoI Pooling (DRoIPooling) (Dai et al. 2017; Zhu et al. 2019) that can extract invariant features under some complex transformations by adaptively adding a regressed offset to each sampling position, experiments show that DRoIPooling only achieves invariance under scaling and cannot easily be extended to handle other transformations such as rotation. This is because the sampling-position offsets are regressed with convolutional networks, which need to be trained with transformed data (Krizhevsky, Sutskever, and Hinton 2012), and different transformations would require different kernels, since the same position of a convolutional kernel may correspond to different object regions when the object undergoes different transformations.

In this paper, we regard different transformations, such as perspective and pose transformations, as being composed of spatial transformations of different semantic parts, and we observe that high-level features of semantic regions are more stable under varying transformations. From this perspective, we propose Semantic RoI Align (SRA) to extract transformation-invariant RoI features by sampling features from different semantic regions. RoI Pooling samples features at specific locations in an RoI. A limitation of such sampling can be seen from the example shown in Figure 1 (top row): the 4th sampling location extracts features of the background in the red RoI, whereas the same location extracts features of a player's leg in the blue RoI. Such sampling loses invariance under pose transformations. In contrast, our proposed SRA uses a semantic attention module to obtain different semantic regions by leveraging the global and local semantic relationship within the RoI. We then sample features from the semantic regions and concatenate the sampled features into the RoI feature, which is semantic-aware and transformation-invariant.

Figure 1: Previous RoI feature extractor versus the proposed Semantic RoI Align (SRA). Top: RoI Pooling samples each feature at specific positions, making extracted RoI features sensitive to object poses. Bottom: SRA samples each feature from different semantic regions, making it capable of extracting invariant RoI features under various transformations, including object pose transformation.
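To make the contrast in Figure 1 concrete, here is a toy sketch (in PyTorch, not the authors' Jittor code) that compares grid-tied sampling with mask-weighted semantic sampling; the tensor shapes, the random inputs, and the way grid slots are picked are purely illustrative.

```python
import torch

# Hypothetical RoI crop: C channels on an h x w grid, plus N spatial weight
# maps (semantic masks) that each sum to 1 over positions.
C, h, w, N = 256, 8, 8, 4
feat = torch.randn(C, h, w)
masks = torch.softmax(torch.randn(N, h * w), dim=-1).view(N, h, w)

# RoI-Pooling-style: each output slot is tied to a fixed spatial cell, so the
# same slot may cover background in one pose and a limb in another pose.
grid_feature = feat.reshape(C, h * w)[:, :: (h * w) // N].T       # (N, C)

# SRA-style: each output slot is a weighted sum over a semantic region, so the
# slot follows that region wherever the transformation moves it.
semantic_feature = torch.einsum('nhw,chw->nc', masks, feat)       # (N, C)
```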
Since the computational efficiency of SRA is determined by the feature sampling resolution, we propose a Dynamic Feature Sampler that dynamically samples features according to the aspect ratio of each RoI, which speeds up SRA while minimizing the impact on accuracy. Furthermore, previous positional embedding methods (Zhao, Jia, and Koltun 2020) only encode information about the sampling center, which cannot accurately represent regional information. We thus propose a new positional embedding, namely Area Embedding, which embeds the positions in a sampling area into a fixed-length vector, providing more accurate position information.

SRA can replace the RoI feature extractor in most two-stage detectors and brings higher detection accuracy with a slight overhead in network parameters and computation. Using SRA as the RoI feature extractor for Faster R-CNN (Ren et al. 2015), our method achieves 1.7% higher mAP on the COCO object detection task with only 0.2M additional parameters and 1.1% more FLOPs compared to the baseline model. Meanwhile, it also exceeds other RoI feature extractors with less computational overhead. To verify the generalizability of SRA, we equip it to various state-of-the-art backbones and detection methods. Results show that SRA consistently boosts their detection accuracy.

In summary, our contributions are:
- a novel RoI feature extractor, i.e., Semantic RoI Align, which is able to extract transformation-invariant RoI features and can be plugged into most two-stage detectors to improve detection accuracy with little extra cost;
- a Dynamic Feature Sampler, which makes the SRA implementation efficient, and an Area Embedding, which provides more comprehensive and accurate information about sampled positions;
- extensive experiments that demonstrate the superiority of SRA and its great generalizability to various state-of-the-art backbones and detection methods.

## Related Work

### Object Detection and RoI Feature Extractors

In recent years, deep learning techniques have become dominant in object detection. Most deep object detection methods can be categorized into two types: one-stage detectors (Redmon et al. 2016) and two-stage detectors (Girshick 2015). Faster R-CNN (Ren et al. 2015) is a two-stage network with a Region Proposal Network (RPN) predicting multiple RoI proposals; RoI features are then extracted by an RoI feature extractor to predict object bounding boxes and categories in the second-stage network. Mask R-CNN (He et al. 2017) proposed a general framework for object instance segmentation tasks. Dynamic Head (Dai et al. 2021) proposed to use scale-, spatial-, and task-aware attention mechanisms to improve detection accuracy.

RoI feature extractors are used to extract transformation-invariant features in two-stage detectors, so that the second-stage network can refine the bounding boxes and predict object categories more accurately. RoI Pooling (Girshick 2015) performs scale-invariant feature extraction by dividing the RoI into a fixed number of bins, pooling the features in each bin, and concatenating them into a vector of fixed size. RoI Align (He et al. 2017) uses bilinear interpolation to extract features more accurately. RoI Transformer (Ding et al. 2019) extracts rotation-invariant features by correcting the extracted features with a learned rigid transformation supervised by ground-truth oriented bounding boxes. However, these methods only model features invariant to rigid transformations, while ignoring non-rigid transformations. Deformable RoI Pooling (Dai et al. 2017; Zhu et al. 2019) extracts features by adding a regressed offset to each sampling position of RoI Pooling. Our experiments show that it can only extract invariant features under scale transformation, which could be because it learns the regressed offsets by convolution, making it hard to generalize to other transformations. RoIAttn (Liang and Song 2022) proposes to enhance the RoI features by passing them through multiple self-attention layers. However, simply doing so is limited in its ability to obtain invariant RoI features for two reasons: 1) performing self-attention on RoI Align-extracted features is less flexible than sampling on the original feature map, and 2) the regression ability of typical self-attention is insufficient for identifying specific semantic regions under different transformations. Our SRA obtains different semantic regions with a novel semantic attention structure by leveraging the global and local semantic relationship within the RoI. We then sample the RoI features from the semantic regions, which makes it easy to achieve invariance under more diverse transformations and thus obtain higher detection accuracy than existing methods.

### Attention Mechanism

In computer vision, attention can be regarded as an adaptive process that mimics the human visual system's ability to focus on important regions. RAM (Mnih et al. 2014) is the pioneering work introducing the attention concept into computer vision. Since then, a number of works (Hu, Shen, and Sun 2018; Zhao, Jia, and Koltun 2020) have explored the use of attention mechanisms for different computer vision tasks. Recently, transformer networks, which have achieved great success in natural language processing (Vaswani et al. 2017), have been explored in computer vision and have shown great potential. ViT (Dosovitskiy et al. 2020) is the first work to bring transformers into computer vision by regarding a 16×16 pixel region as a word and an image as a sentence. Due to the strong modeling capability of visual transformer networks, they have been applied to various vision tasks such as image recognition (Liu et al. 2021; Guo et al. 2022) and object detection (Carion et al. 2020; Yang et al. 2021). In this paper, we introduce a novel semantic attention mechanism to capture invariant RoI features under a wider variety of transformations.

## Methodology

In this section, we first introduce the general architecture of the proposed Semantic RoI Align (SRA). Next, we detail how the semantic masks of SRA are obtained. We then present a dynamic feature sampling method that samples features for SRA according to different RoI aspect ratios, which improves model accuracy and efficiency. Finally, the proposed Area Embedding is introduced to replace the previous position embedding, so as to provide more accurate position information for the model.

### Semantic RoI Align

The pipeline of the proposed Semantic RoI Align (SRA) is shown in Figure 2.

Figure 2: The network architecture of our Semantic RoI Align (feature map F [C, H, W]; sub-sampled feature f_i [C, h_i, w_i]; RoI descriptor d_i [K]; semantic features s_i [K, h_i, w_i]; area embeddings p_i [P, h_i, w_i]; sampling masks m_{i,n} [h_i, w_i]; output y_i [N, C]). In the figure, ⊗ denotes matrix multiplication and ⊕ denotes concatenation.
The RoI feature that SRA extracts for an object consists of N sub-features, each sampled in a specific semantic region, making the sampling positions adaptive to image transformations and thereby improving transformation-invariance. In Figure 3, we visualize some of the semantic masks (left five columns) produced by our SRA for three RoI proposals (one per row). The semantic samplings of SRA sample the same semantic parts of the object under different perspective transformations such as rotation (top and middle rows) and under object pose transformations (top and bottom rows), giving the extracted RoI feature better transformation-invariance, which benefits bounding-box regression and semantic-label prediction in the second-stage network.

Figure 3: Semantic sampling masks of SRA (columns 1 to 5), sampling locations of DRoIPooling (Zhu et al. 2019) (column 6), and the sampling mask obtained by directly passing features extracted by RoI Align through a standard self-attention layer (column 7). Each row represents an RoI in an image. The object in the middle row is rotated by 30 degrees relative to the top row, and the bottom row shows another object of the same class as the top row. Red and yellow sampling masks are overlaid on the images, and the sampling locations of DRoIPooling are indicated with different colors.

The inputs of SRA are a feature map F with shape (C, H, W), where C, H, and W are the number of feature channels, the height, and the width of the feature map, respectively, and a list of RoI proposals R = {R_i}, where R_i = {x_{i,0}, y_{i,0}, x_{i,1}, y_{i,1}} indicates a bounding box in the feature map with (x_{i,0}, y_{i,0}) and (x_{i,1}, y_{i,1}) being the coordinates of its top-left and bottom-right corners. For each RoI proposal R_i, SRA first exploits the Dynamic Feature Sampler to sample a feature map f_i from the input feature map F within the bounding box R_i. We then obtain N semantic masks m_i = {m_{i,n}}, 1 ≤ n ≤ N, which have the same spatial size as f_i. The output transformation-invariant RoI feature y_i of our SRA is finally obtained by sampling f_i with the semantic masks m_i. More specifically, y_i is calculated as the weighted sum of the elements of f_i, using the elements of m_i as weights:

$$y_i(n, c) = \sum_{j=1}^{h_i} \sum_{k=1}^{w_i} f_i(c, j, k)\, m_{i,n}(j, k), \quad \forall\, n \in \{1, \dots, N\},\ c \in \{1, \dots, C\}, \tag{1}$$

where h_i and w_i are the height and width of the feature map f_i.
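The aggregation in Eq. 1 is simply a mask-weighted sum over spatial positions. As a minimal sketch, written in PyTorch rather than the authors' Jittor implementation, it reduces to a single einsum; the function name and the toy inputs below are ours.

```python
import torch

def sra_aggregate(f_i: torch.Tensor, m_i: torch.Tensor) -> torch.Tensor:
    """Eq. (1): mask-weighted aggregation of RoI features.

    f_i: (C, h_i, w_i) sub-sampled RoI feature map.
    m_i: (N, h_i, w_i) semantic masks, each summing to 1 over positions.
    Returns y_i: (N, C), one C-dimensional sub-feature per semantic mask.
    """
    return torch.einsum('chw,nhw->nc', f_i, m_i)

# Toy usage: 256-channel features on a 7x5 grid with N = 49 semantic masks.
f = torch.randn(256, 7, 5)
m = torch.softmax(torch.randn(49, 7 * 5), dim=-1).view(49, 7, 5)
print(sra_aggregate(f, m).shape)  # torch.Size([49, 256])
```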
Next, we introduce how the semantic masks m_i are estimated in SRA.

### Obtaining Semantic Masks

The pipeline for obtaining the semantic masks of SRA is also shown in Figure 2. The goal is to generate N separable semantic part masks m_i for the input RoI proposal R_i and the sampled feature map f_i. To achieve this, we want the value of m_{i,n}(j, k) to be positively correlated with the likelihood that position (j, k) in R_i belongs to the n-th semantic part of the object in R_i. Let us denote that likelihood as m'_{i,n}(j, k). The likelihood is related to two factors: 1) what is in R_i, and 2) what is at position (j, k) of R_i. The former is expressed by a K-dimensional RoI descriptor d_i representing the overall features of R_i, and the latter is characterized by a semantic feature map s_i with shape (K, h_i, w_i) describing the semantic feature at each position in R_i. To make the final RoI feature computed with Eq. 1 transformation-invariant, m_i should transform accordingly when the object transforms. Therefore, we obtain the likelihood m'_i by applying the same regressors at every position (j, k):

$$m'_{i,n}(j, k) = \xi_n([d_i, s_i(:, j, k)]), \quad \forall\, n \in \{1, \dots, N\},\ j \in \{1, \dots, h_i\},\ k \in \{1, \dots, w_i\}, \tag{2}$$

where the ξ_n are N learnable sub-mask regressors, each composed of two Norm-ReLU-Linear blocks, [·, ·] denotes concatenation, and s_i(:, j, k) is the semantic feature at position (j, k). By doing so, if some transformation of the object causes the feature at position (j, k) to move to position (j', k'), the transformed mask value m̂_{i,n}(j', k') = ξ_n([d̂_i, ŝ_i(:, j', k')]) will be similar to m_{i,n}(j, k), since the transformed d̂_i and ŝ_i(:, j', k') are similar to d_i and s_i(:, j, k), respectively. This means the semantic masks transform accordingly with the transformation.

We obtain s_i by applying a 1×1 convolution to f_i, and we explored various forms of the RoI Descriptor Regressor to obtain d_i:

$$d_i = \psi(\mathrm{Flatten}(f_i)) \ \text{(Concatenation)}, \quad d_i = \psi\Big(\max_{j=1}^{h_i} \max_{k=1}^{w_i} f_i(:, j, k)\Big) \ \text{(Maximum)}, \quad d_i = \psi\Big(\frac{1}{h_i w_i} \sum_{j=1}^{h_i} \sum_{k=1}^{w_i} f_i(:, j, k)\Big) \ \text{(Average)}, \tag{3}$$

where ψ is a linear layer with K output channels.

Sampling features based on semantic masks may cause the model to lose position information, which is important for the object detection task. We therefore use a position embedding p_i (see Zhao, Jia, and Koltun 2020) to provide position information to the model, and use the position-embedded regressor m'_{i,n}(j, k) = ξ_n([d_i, s_i(:, j, k), p_i(:, j, k)]) instead of Eq. 2. The embedding p_i is obtained by applying a 1×1 convolution with P output channels to p'_i, where p'_i, with shape (2, h_i, w_i), holds the relative position of each location in the RoI, normalized to [-1, 1]:

$$p'_i(1, j, k) = \frac{2j}{h_i} - 1, \qquad p'_i(2, j, k) = \frac{2k}{w_i} - 1. \tag{4}$$

The semantic masks m_i are then obtained by m_i = softmax(m'_i · γ), where γ is an amplification factor that amplifies the back-propagation response of the masks, and the softmax acts over the last two dimensions so that each mask sums to 1. Finally, the output RoI feature y_i is obtained by summing the elements of f_i weighted by the N semantic masks m_i, as in Eq. 1.
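To make the mask-regression pipeline of Eqs. 2-4 concrete, the following sketch wires the pieces together for a single RoI in PyTorch. It is an illustration under assumptions rather than the released Jittor implementation: the module name, the widths C, K, and P, the hidden sizes inside the sub-mask regressors ξ_n, and the use of LayerNorm for the "Norm" step are our choices, and the plain relative-position embedding of Eq. 4 stands in for Area Embedding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticMaskHead(nn.Module):
    """Illustrative sketch of the semantic-mask branch of SRA (Eqs. 2-4).

    Input f_i is (C, h_i, w_i); the head produces N masks of shape
    (N, h_i, w_i) that each sum to 1 over spatial positions.
    """

    def __init__(self, C=256, K=64, P=32, N=49, gamma=50.0):
        super().__init__()
        self.gamma = gamma                                   # amplification factor
        self.to_semantic = nn.Conv2d(C, K, kernel_size=1)    # s_i: 1x1 conv on f_i
        self.to_descriptor = nn.Linear(C, K)                 # psi: d_i ("Average" variant of Eq. 3)
        self.to_pos = nn.Conv2d(2, P, kernel_size=1)         # p_i from relative coordinates
        # N sub-mask regressors xi_n, each two Norm-ReLU-Linear blocks ending in one logit.
        in_dim = 2 * K + P
        self.regressors = nn.ModuleList([
            nn.Sequential(
                nn.LayerNorm(in_dim), nn.ReLU(), nn.Linear(in_dim, in_dim),
                nn.LayerNorm(in_dim), nn.ReLU(), nn.Linear(in_dim, 1),
            )
            for _ in range(N)
        ])

    def forward(self, f_i: torch.Tensor) -> torch.Tensor:
        _, h, w = f_i.shape
        s_i = self.to_semantic(f_i.unsqueeze(0)).squeeze(0)          # (K, h, w)
        d_i = self.to_descriptor(f_i.mean(dim=(1, 2)))               # (K,)
        # Relative positions in [-1, 1] (Eq. 4); Area Embedding would replace this block.
        ys = torch.linspace(-1, 1, h, device=f_i.device).view(1, h, 1).expand(1, h, w)
        xs = torch.linspace(-1, 1, w, device=f_i.device).view(1, 1, w).expand(1, h, w)
        p_i = self.to_pos(torch.cat([ys, xs], dim=0).unsqueeze(0)).squeeze(0)  # (P, h, w)

        # Per-position input [d_i, s_i(:, j, k), p_i(:, j, k)] -> (h*w, 2K+P)
        per_pos = torch.cat([
            d_i.view(-1, 1, 1).expand(-1, h, w), s_i, p_i
        ], dim=0).flatten(1).T

        logits = torch.cat([xi(per_pos) for xi in self.regressors], dim=1)  # (h*w, N)
        masks = F.softmax(logits.T * self.gamma, dim=-1)  # softmax over spatial positions
        return masks.view(-1, h, w)                       # (N, h, w)
```

Combined with the aggregation sketched after Eq. 1, the SRA output for one RoI would then be y_i = sra_aggregate(f_i, SemanticMaskHead()(f_i)).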
### Dynamic Feature Sampler

In SRA, the semantic masks m_i are estimated from the sub-sampled feature map f_i, so the computational overhead of SRA is proportional to the size of f_i, i.e., h_i × w_i. The size of f_i can be set to different values for different RoIs. A straightforward solution is to use the original resolution of R_i; however, this leads to a large computational cost for large RoIs. Another option is to use a fixed size; however, as shown in Figure 4(b), this may cause the aspect ratio of the region represented by each feature of f_i to be inconsistent across RoIs, and experiments show that this leads to a loss of accuracy.

Figure 4: Different methods for determining the size of the sub-sampled feature map. (a) Using the same resolution as the original feature map is costly, as it results in too many samples. (b) Using a fixed size may cause the aspect ratio of the region represented by each sub-sampled feature to be inconsistent across RoIs, which we believe is harmful to the model, e.g., a ratio of approximately 2 for the upper RoI and 0.5 for the lower RoI. (c) Our Dynamic Feature Sampler overcomes the limitations of both, yielding samples that are both consistent and limited in number.

To balance sampling quality and computational efficiency, we propose a Dynamic Feature Sampler that selects the size of f_i for each proposal such that the aspect ratio of each sub-sampled region stays close to 1 while the total number of samples remains limited. Specifically, for each RoI R_i, we pick the size whose aspect ratio is closest to that of R_i while not exceeding a maximal area M:

$$h_i, w_i = \mathop{\arg\min}_{(h', w') \in \mathbb{Z}_+^2,\ h' w' \le M} \left| \frac{h'}{w'} - \frac{y_{i,1} - y_{i,0}}{x_{i,1} - x_{i,0}} \right|. \tag{5}$$

The sub-sampled feature map f_i is then obtained by dividing the R_i region of F into h_i × w_i blocks and averaging the feature values within each block. With the Dynamic Feature Sampler, our SRA achieves good performance with a small computational overhead.

### Area Embedding

In Eq. 4, we use the position embedding of each grid center in the mask to provide sampling-position information to the model. However, as shown in Figure 5(a), since the size of the sub-sampled feature map is determined dynamically, the same center position may represent different sampling areas. We therefore propose Area Embedding, which encodes the sampled area of each point in the output feature with two fixed-length vectors representing the position and coverage along the vertical and horizontal axes, respectively. We set the length of each vector to M, the maximal number of samples per axis. For each point (j, k) ∈ Z² sampled by SRA, we calculate p'_i as:

$$p'_i(1{:}M, j, k) = \mathrm{Upsample}(\mathrm{OneHot}(j; h_i); M), \qquad p'_i((M{+}1){:}2M, j, k) = \mathrm{Upsample}(\mathrm{OneHot}(k; w_i); M), \tag{6}$$

where OneHot(b; a) takes an integer b ≤ a and produces the one-hot embedding of b as a vector of length a, and Upsample(v; M) upsamples the vector v to length M. The upsampling method can vary: in Figure 5 we use nearest-neighbor sampling for ease of illustration, while in our experiments we use linear sampling for higher accuracy. The Area Embedding provides the model with more accurate sampling-position information, and experiments show that it improves the accuracy of the model.

Figure 5: Schematic comparison of the traditional position embedding and the proposed Area Embedding. (a) Position embedding only embeds the coordinates of the sampling center, while the same center position may represent different sampling areas. (b) Our Area Embedding embeds the entire sampling area.
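A small sketch of these two components, again in PyTorch and under assumptions: dynamic_sample_size searches the integer grids of area at most M for the one whose aspect ratio best matches the RoI, which is one reasonable reading of Eq. 5, and area_embedding builds the 2M-dimensional per-cell code of Eq. 6 using linear upsampling of one-hot row and column indicators. Both function names and the brute-force search are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def dynamic_sample_size(roi_h: float, roi_w: float, M: int = 128) -> tuple:
    """Pick the grid (h', w') with h' * w' <= M whose aspect ratio is closest
    to the RoI's, so each sub-sampled cell stays roughly square (cf. Eq. 5)."""
    target = roi_h / roi_w
    best, best_err = (1, 1), float('inf')
    for h in range(1, M + 1):
        for w in range(1, M // h + 1):          # enforce h * w <= M
            err = abs(h / w - target)
            if err < best_err:
                best, best_err = (h, w), err
    return best

def area_embedding(h_i: int, w_i: int, M: int = 128) -> torch.Tensor:
    """Eq. (6): each grid cell (j, k) gets a 2M-dim code made of its one-hot
    row index and one-hot column index, each linearly upsampled to length M,
    so the code reflects the covered area rather than only a centre point."""
    row = F.interpolate(torch.eye(h_i).unsqueeze(1), size=M, mode='linear',
                        align_corners=False).squeeze(1)          # (h_i, M)
    col = F.interpolate(torch.eye(w_i).unsqueeze(1), size=M, mode='linear',
                        align_corners=False).squeeze(1)          # (w_i, M)
    emb = torch.cat([
        row.unsqueeze(1).expand(h_i, w_i, M),                    # vertical code per cell
        col.unsqueeze(0).expand(h_i, w_i, M),                    # horizontal code per cell
    ], dim=-1)                                                   # (h_i, w_i, 2M)
    return emb.permute(2, 0, 1)                                  # (2M, h_i, w_i)
```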
## Experiments

We conduct our experiments on the MS COCO dataset (Lin et al. 2014), using train2017 for training and val2017 and test2017 for testing. We report the standard COCO evaluation metrics, including mean Average Precision (AP) under different Intersection-over-Union (IoU) thresholds and at different object scales, denoted AP for the object detection task and AP^m for the instance segmentation task. Our model is implemented with Jittor (Hu et al. 2020) and the JDet library (https://github.com/Jittor/JDet). The implementation details of our model are given in the supplementary material.

### Ablation Studies

We first conduct a series of ablation experiments to verify the effectiveness of each part of the proposed model. The ablation experiments are conducted on the object detection task using the MS COCO validation set. We couple our proposed feature extractor with Faster R-CNN using ResNet-50 as the backbone. RoI Align (He et al. 2017) is used as the baseline model unless otherwise mentioned.

| Method | AP | AP50 | AP75 | APS | APM | APL | Params | FLOPs |
|---|---|---|---|---|---|---|---|---|
| RoI Align | 37.5 | 58.2 | 40.8 | 21.8 | 41.1 | 48.1 | 41.8M | 340.9G |
| DRoIPooling | 37.9 | 59.4 | 41.8 | 22.4 | 41.4 | 49.3 | 149.4M | 349.0G |
| w/ Conv. | 36.4 | 57.4 | 39.6 | 21.5 | 40.0 | 46.9 | 71.9M | 350.1G |
| w/ SA | 37.6 | 58.0 | 40.8 | 20.9 | 41.1 | 48.7 | 42.7M | 348.1G |
| SRA (Ours) | 39.2 | 59.6 | 42.6 | 22.5 | 42.6 | 51.9 | 42.0M | 344.2G |

Table 1: Results of comparative and ablation experiments between our SRA and baseline models.

**The Effectiveness of SRA.** To verify the effectiveness of the proposed SRA, we compare it with RoI Align and Modulated Deformable RoI Pooling (DRoIPooling) (Zhu et al. 2019). The results in Table 1 show that our model outperforms the baseline model by 1.7% AP with a minor cost in computation and parameters, and also outperforms DRoIPooling by 1.3% AP with a much smaller model. We also compared SRA with two further baselines obtained by applying simple operations to the features extracted by RoI Align: applying a convolutional layer to the features to obtain sampling masks and re-sampling the features with these masks (denoted "w/ Conv."), and directly passing the features through a standard multi-head self-attention layer (Vaswani et al. 2017) (denoted "w/ SA"). The results in Table 1 show that our model achieves gains of 2.8% and 1.6% in AP, respectively, with fewer parameters and FLOPs. This improvement can be attributed to SRA's enhanced capability of identifying consistent semantic parts across diverse transformations, leading to better transformation-invariance.

We also visualize some samplings of SRA, DRoIPooling, and w/ SA in Figure 3. Rows 1 and 2 together show how the samplings respond to different transformations of the same object, while rows 1 and 3 together show how they respond to different objects of the same class. The sampling masks of SRA (columns 1 to 5) can be divided into two classes. The first class samples on different semantic parts; for example, columns 1-4 show samplings on the human's feet, head, and body, and around the human, respectively. The second class of sampling is for positioning and is only activated at certain positions; for example, the 5th column is only activated at the bottom of the RoI. Our semantic samplings can sample the same semantic parts of an object under different transformations, which gives the extracted RoI feature better transformation-invariance. In comparison, the sampling locations of DRoIPooling (column 6 of Figure 3) remain mostly distributed inside the object as the object transforms; however, they do not vary accordingly with object pose transformations. Taking three samplings in column 6 as an example, between the top and middle rows the circles numbered 1 to 3 do not rotate with the object, which means DRoIPooling cannot always achieve transformation-invariance under complex transformations like rotation. Also, simply passing features through a self-attention layer (w/ SA, column 7 of Figure 3) cannot ensure sampling on the same semantic parts, thus failing to obtain transformation-invariant RoI features.

| D | S | A | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|---|
|  |  |  | 31.4 | 52.2 | 32.7 | 17.8 | 35.1 | 40.0 |
|  |  |  | 36.2 | 57.5 | 38.6 | 21.0 | 39.8 | 46.6 |
|  |  |  | 37.3 | 58.3 | 40.7 | 21.5 | 41.0 | 48.1 |
|  |  |  | 38.9 | 59.3 | 42.3 | 22.5 | 42.5 | 51.3 |
| ✓ |  | ✓ | 37.4 | 58.4 | 40.4 | 22.1 | 41.0 | 48.4 |
| ✓ | ✓ |  | 36.4 | 57.3 | 39.0 | 20.9 | 39.9 | 46.7 |
| ✓ | ✓ | PE | 38.8 | 59.2 | 42.5 | 22.6 | 42.3 | 50.7 |
| ✓ | ✓ | ✓ | 39.2 | 59.6 | 42.6 | 22.5 | 42.6 | 51.9 |

Table 2: Ablation study on the effectiveness of each module in our SRA. D, S, and A denote the RoI descriptor, the semantic feature map, and Area Embedding, respectively.

**Structure of SRA and Area Embedding.** We also conduct experiments to verify the effectiveness of the different components of SRA by controlling whether the RoI descriptor (D), the semantic feature map (S), and the Area Embedding (A) are concatenated when regressing the masks. The results are listed in Table 2. Comparing the last row with the 5th row, our model with the semantic feature map obtains a gain of 1.8% in AP, as our model determines the sampling masks based on semantic features, which makes the sampled features invariant under a variety of transformations and thus achieves better performance. We also tested our model with and without Area Embedding (8th and 6th rows, respectively) and with the Area Embedding replaced by the position embedding (denoted PE, 7th row). The results show that the model with AE obtains 2.8% higher AP than without AE, and 0.4% higher AP than with PE, which demonstrates that AE describes the sampling information of the Dynamic Feature Sampler more accurately and provides better position information for the model.

| Setting | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|
| DR=Con. | 38.9 | 59.3 | 41.9 | 22.5 | 42.3 | 51.1 |
| DR=Max. | 39.0 | 59.3 | 42.9 | 22.5 | 42.6 | 51.1 |
| DR=Avg. | 39.2 | 59.6 | 42.6 | 22.5 | 42.6 | 51.9 |
| γ = 1 | 38.3 | 58.7 | 41.8 | 22.2 | 41.8 | 50.2 |
| γ = 5 | 38.7 | 59.2 | 42.0 | 22.1 | 42.5 | 51.0 |
| γ = 50 | 39.2 | 59.6 | 42.6 | 22.5 | 42.6 | 51.9 |
| γ = 500 | 36.8 | 57.6 | 39.9 | 21.2 | 40.2 | 48.1 |

Table 3: Experiments on different module settings. DR denotes the RoI Descriptor Regressor, and γ denotes the amplification factor.

**Choices of the RoI Descriptor Regressor.** We tested the various choices of the RoI Descriptor Regressor in Eq. 3, denoted DR=Con., DR=Max., and DR=Avg., respectively, where the concatenation choice is tested with the 8×8 fixed-size feature sampler, as it cannot be adapted to the Dynamic Feature Sampler. The results are shown in Table 3.
Though DR=Con. shows slightly better performance than the 8×8 fixed-size feature sampler with the average choice (38.8% AP in Table 5), it cannot be adapted to the Dynamic Feature Sampler and leads to more computation and parameters, so we finally use the average RoI Descriptor Regressor in our model.

**Parameters Setting.** The number of masks N determines the size of the RoI features extracted by our model. We tested different settings of N (denoted "N = x") and compared them with the baseline model using different RoI Align output sizes (denoted "size = x"). The results in Table 4 show that, for the same RoI feature size (comparing the 1st row with the 5th row, the 2nd row with the 6th row, etc.), our model achieves 1.4%-1.9% higher AP, which shows that the transformation-invariant features extracted by our model contain richer information for the same feature length and are more conducive to object detection. Balancing model parameters against accuracy, we set N = 49. We also tested different settings of the amplification factor γ, denoted "γ = x" in Table 3. The results show that setting it to an appropriate value is beneficial to the regression of the semantic masks, so we set γ = 50 according to the experiment.

| Setting | AP | AP50 | AP75 | APS | APM | APL | Params | FLOPs |
|---|---|---|---|---|---|---|---|---|
| N = 9 | 38.0 | 58.2 | 41.3 | 21.8 | 41.3 | 49.9 | 31.5M | 340.7G |
| N = 25 | 38.6 | 59.0 | 41.9 | 22.8 | 42.3 | 50.5 | 35.7M | 342.1G |
| N = 49 | 39.2 | 59.6 | 42.6 | 22.5 | 42.6 | 51.9 | 42.0M | 344.2G |
| N = 100 | 39.4 | 59.9 | 42.9 | 22.9 | 42.6 | 51.7 | 55.4M | 348.7G |
| size = 3² | 36.3 | 57.3 | 39.0 | 20.6 | 39.9 | 46.4 | 31.3M | 337.8G |
| size = 5² | 37.2 | 58.0 | 40.7 | 21.4 | 40.8 | 47.8 | 35.5M | 339.0G |
| size = 7² | 37.3 | 58.3 | 40.7 | 21.5 | 41.0 | 48.1 | 41.8M | 340.9G |
| size = 10² | 37.6 | 58.4 | 40.8 | 22.0 | 41.3 | 48.8 | 55.1M | 344.9G |

Table 4: Comparison between SRA with different numbers of masks (N) and the baseline model with comparable RoI sizes.

**Dynamic Feature Sampler.** To evaluate the effectiveness of the Dynamic Feature Sampler, we compared the 8×8 fixed-size feature sampler (denoted "fixed") with the Dynamic Feature Sampler (M = 128), which has the same average number of samplings and similar FLOPs. As shown in the 1st and 4th rows of Table 5, the Dynamic Feature Sampler obtains better results, as the region represented by each of its sub-sampled features has a more consistent aspect ratio. We also tested different dynamic feature map size limits M. A larger M brings a higher feature sampling resolution; in general, accuracy increases with resolution, but the improvement is limited by the resolution of the original feature map and gradually tends to zero. We therefore choose M = 128 based on the experiment.

| Setting | AP | AP50 | AP75 | Avg. samples | Params | FLOPs |
|---|---|---|---|---|---|---|
| fixed | 38.8 | 59.2 | 42.4 | 64 | 42.0M | 344.1G |
| M = 32 | 38.3 | 59.1 | 42.0 | 16 | 42.0M | 341.7G |
| M = 64 | 38.8 | 59.5 | 42.5 | 32 | 42.0M | 342.5G |
| M = 128 | 39.2 | 59.6 | 42.6 | 64 | 42.0M | 344.2G |
| M = 256 | 39.1 | 59.5 | 42.5 | 128 | 42.0M | 347.9G |

Table 5: Experiments on the Dynamic Feature Sampler. M denotes the dynamic feature map size limit.

### Comparison with Other Methods

We also compared our model with other methods on the COCO test set. We first compared it with different RoI feature extractors: RoI Pooling (Girshick 2015), RoI Align (He et al. 2017), Adaptive RoI Align (Jung et al. 2018), Precise RoI Align (Jiang et al. 2018), DRoIPooling (Zhu et al. 2019), and RoIAttn (Liang and Song 2022), all on Faster R-CNN with ResNet-50 as the backbone, trained for 12 epochs. The results are shown in Table 6. Our model achieves the best performance. In particular, compared to RoIAttn, which applies several self-attention layers to the RoI Align-extracted features, our model obtains 1.2% higher AP. The results demonstrate the advantage of our SRA method in extracting RoI features that are invariant to more types of transformations. To verify the generalizability of our method, we also examined the performance gain of SRA across different detection methods and backbones; please refer to the supplementary material for details. Our model improves the accuracy of various detection methods and backbone networks, demonstrating its generalizability.

| Method | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|
| RoI Pooling | 37.2 | 58.9 | 40.3 | 21.5 | 40.2 | 46.0 |
| RoI Align | 37.7 | 58.9 | 40.6 | 21.9 | 40.7 | 46.4 |
| DRoIPooling | 38.1 | 60.0 | 42.0 | 22.0 | 41.2 | 47.2 |
| Ada. RoI Align | 37.7 | 58.8 | 40.7 | 21.8 | 40.7 | 46.5 |
| Pr. RoI Align | 37.8 | 58.9 | 40.9 | 22.1 | 40.8 | 46.7 |
| RoIAttn | 38.0 | 59.3 | 40.9 | 22.4 | 41.1 | 46.9 |
| SRA | 39.2 | 59.8 | 42.6 | 22.6 | 42.1 | 49.0 |

Table 6: Comparison with different RoI extractors on the MS COCO detection test-dev set.

## Conclusions

In this paper, we proposed SRA, a transformation-invariant RoI feature extractor.
It regresses semantic masks with a novel semantic attention structure and obtains RoI features by sampling the feature map with these semantic masks, making the features invariant under more diverse transformations. We further proposed the Dynamic Feature Sampler to speed up the process while minimizing the impact on accuracy, and Area Embedding to provide more accurate sampling-area information. Benefiting from the capability and generalizability of SRA, experiments show that it brings significant performance improvements to various baseline and state-of-the-art models with a small computational and parameter overhead.

## Acknowledgments

This work was supported by the National Key Research and Development Program of China under Grant 2021ZD0112902, the National Natural Science Foundation of China under Grant 62220106003, and the Research Grant of Beijing Higher Institution Engineering Research Center and Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology.

## References

Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; and Zagoruyko, S. 2020. End-to-End Object Detection with Transformers.

Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; and Wei, Y. 2017. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 764-773.

Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; and Zhang, L. 2021. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7373-7382.

Ding, J.; Xue, N.; Long, Y.; Xia, G.-S.; and Lu, Q. 2019. Learning RoI Transformer for oriented object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2849-2858.

Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; Uszkoreit, J.; and Houlsby, N. 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.

Girshick, R. 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, 1440-1448.

Guo, M.-H.; Lu, C.-Z.; Liu, Z.-N.; Cheng, M.-M.; and Hu, S.-M. 2022. Visual attention network. arXiv preprint arXiv:2202.09741.

He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, 2961-2969.

Hu, J.; Shen, L.; and Sun, G. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7132-7141.

Hu, S.-M.; Liang, D.; Yang, G.-Y.; Yang, G.-W.; and Zhou, W.-Y. 2020. Jittor: a novel deep learning framework with meta-operators and unified graph execution. Science China Information Sciences, 63: 1-21.

Jiang, B.; Luo, R.; Mao, J.; Xiao, T.; and Jiang, Y. 2018. Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision (ECCV), 784-799.

Jung, I.; Son, J.; Baek, M.; and Han, B. 2018. Real-time MDNet. In Proceedings of the European Conference on Computer Vision (ECCV), 83-98.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25: 1097-1105.

Liang, X.; and Song, P. 2022. Excavating RoI attention for underwater object detection. In 2022 IEEE International Conference on Image Processing (ICIP), 2651-2655. IEEE.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 740-755. Springer.

Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; and Berg, A. C. 2016. SSD: Single shot multibox detector. In European Conference on Computer Vision, 21-37. Springer.

Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; and Guo, B. 2021. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012-10022.

Mnih, V.; Heess, N.; Graves, A.; et al. 2014. Recurrent models of visual attention. Advances in Neural Information Processing Systems, 27.

Redmon, J.; Divvala, S.; Girshick, R.; and Farhadi, A. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 779-788.

Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Yang, G.-Y.; Li, X.-L.; Martin, R. R.; and Hu, S.-M. 2021. Sampling equivariant self-attention networks for object detection in aerial images. arXiv preprint arXiv:2111.03420.

Zhao, H.; Jia, J.; and Koltun, V. 2020. Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10076-10085.

Zhu, X.; Hu, H.; Lin, S.; and Dai, J. 2019. Deformable ConvNets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9308-9316.