# Medical Image Segmentation Using Squeeze-and-Expansion Transformers

Shaohua Li¹, Xiuchao Sui¹, Xiangde Luo², Xinxing Xu¹, Yong Liu¹, Rick Goh¹
¹Institute of High Performance Computing, A*STAR, Singapore
²University of Electronic Science and Technology of China, Chengdu, China
{shaohua, xiuchao.sui}@gmail.com, xiangde.luo@std.uestc.edu.cn, {xuxinx, liuyong, gohsm}@ihpc.a-star.edu.sg

Medical image segmentation is important for computer-aided diagnosis. Good segmentation demands the model to see the big picture and fine details simultaneously, i.e., to learn image features that incorporate large context while keeping high spatial resolutions. To approach this goal, the most widely used methods, U-Net and its variants, extract and fuse multi-scale features. However, the fused features still have small effective receptive fields with a focus on local image cues, limiting their performance. In this work, we propose Segtran, an alternative segmentation framework based on transformers, which have unlimited effective receptive fields even at high feature resolutions. The core of Segtran is a novel Squeeze-and-Expansion transformer: a squeezed attention block regularizes the self-attention of transformers, and an expansion block learns diversified representations. Additionally, we propose a new positional encoding scheme for transformers, imposing a continuity inductive bias for images. Experiments were performed on 2D and 3D medical image segmentation tasks: optic disc/cup segmentation in fundus images (REFUGE20 challenge), polyp segmentation in colonoscopy images, and brain tumor segmentation in MRI scans (BraTS'19 challenge). Compared with representative existing methods, Segtran consistently achieved the highest segmentation accuracy, and exhibited good cross-domain generalization capabilities. The source code of Segtran is released at https://github.com/askerlee/segtran.

1 Introduction

Automated medical image segmentation, i.e., automated delineation of anatomical structures and other regions of interest (ROIs), is an important step in computer-aided diagnosis; for example, it is used to quantify tissue volumes, extract key quantitative measurements, and localize pathology [Schlemper et al., 2019; Orlando et al., 2020]. Good segmentation demands the model to see the big picture and fine details at the same time, i.e., to learn image features that incorporate large context while keeping high spatial resolutions, so as to output fine-grained segmentation masks. However, these two demands pose a dilemma for CNNs, as CNNs often incorporate larger context at the cost of reduced feature resolution. A good measure of how large a model sees is the effective receptive field (effective RF) [Luo et al., 2016], i.e., the input areas which have non-negligible impacts on the model output.

Since the advent of U-Net [Ronneberger et al., 2015], it has shown excellent performance across medical image segmentation tasks. A U-Net consists of an encoder and a decoder: the encoder progressively downsamples the features and generates coarse features that focus on contextual patterns, and the decoder progressively upsamples the contextual features and fuses them with fine-grained local visual features. The integration of multi-scale features enlarges the RF of U-Net, accounting for its good performance. However, as the convolutional layers deepen, the impact from far-away pixels decays quickly.
As a result, the effective RF of a U-Net is much smaller than its theoretical RF. As shown in Fig. 2, the effective RFs of a standard U-Net and DeepLabV3+ are merely around 90 pixels. This implies that they make decisions mainly based on individual small patches, and have difficulty modeling larger context. However, in many tasks, the heights/widths of the ROIs are greater than 200 pixels, far beyond their effective RFs. Without seeing the bigger picture, U-Net and other models may be misled by local visual cues and make segmentation errors.

Many improvements of U-Net have been proposed. A few typical variants include: U-Net++ [Zhou et al., 2018] and UNet 3+ [Huang et al., 2020], in which more complicated skip connections are added to better utilize multi-scale contextual features; Attention U-Net [Schlemper et al., 2019], which employs attention gates to focus on foreground regions; 3D U-Net [Çiçek et al., 2016] and V-Net [Milletari et al., 2016], which extend U-Net to 3D images, such as MRI volumes; and Eff-UNet [Baheti et al., 2020], which replaces the encoder of U-Net with a pretrained EfficientNet [Tan and Le, 2019].

Transformers [Vaswani et al., 2017] are increasingly popular in computer vision tasks. A transformer calculates the pairwise interactions (self-attention) between all input units, combines their features, and generates contextualized features. The contextualization brought by a transformer is analogous to the upsampling path in a U-Net, except that a transformer has an unlimited effective receptive field and is good at capturing long-range correlations. Thus, it is natural to adopt transformers for image segmentation.

Figure 1: Segtran architecture. It extracts visual features with a CNN backbone, combines them with positional encodings of the pixel coordinates, and flattens them into a sequence of local feature vectors. The local features are contextualized by a few Squeeze-and-Expansion transformer layers. To increase spatial resolution, an input FPN and an output FPN upsample the features before and after the transformers.

Figure 2 [(a) Fundus image, (b) Ground truth, (c) U-Net, (d) DeepLabV3+, (e) Segtran]: Effective receptive fields of 3 models, indicated by non-negligible gradients in blue blobs and light-colored dots. Gradients are back-propagated from the center of the image. Segtran has non-negligible gradients dispersed across the whole image (light-colored dots), whereas U-Net and DeepLabV3+ have concentrated gradients. Input image: 576×576.

In this work, we present Segtran, an alternative segmentation architecture based on transformers. A straightforward incorporation of transformers into segmentation only yields moderate performance. As transformers were originally designed for Natural Language Processing (NLP) tasks, several aspects can be improved to better suit image applications. To this end, we propose a novel transformer design, the Squeeze-and-Expansion Transformer, in which a squeezed attention block helps regularize the huge attention matrix, and an expansion block learns diversified representations.
In addition, we propose a learnable sinusoidal positional encoding that imposes a continuity inductive bias for the transformer. Experiments demonstrate that these designs lead to improved segmentation performance.

We evaluated Segtran on two 2D medical image segmentation tasks: optic disc/cup segmentation in fundus images of the REFUGE20 challenge, and polyp segmentation in colonoscopy images. Additionally, we evaluated it on a 3D segmentation task: brain tumor segmentation in MRI scans of the BraTS'19 challenge. Segtran consistently showed better performance than U-Net and its variants (U-Net++, UNet 3+, PraNet, and nnU-Net), as well as DeepLabV3+ [Chen et al., 2018].

2 Related Work

Our work is largely inspired by DETR [Carion et al., 2020]. DETR uses transformer layers to generate contextualized features that represent objects, and learns a set of object queries to extract the positions and classes of objects in an image. Although DETR has also been explored for panoptic segmentation [Kirillov et al., 2019], it adopts a two-stage approach which is not applicable to medical image segmentation. A follow-up work of DETR, Cell-DETR [Prangemeier et al., 2020], also employs transformers for biomedical image segmentation, but its architecture is a simplified DETR, lacking novel components like our Squeeze-and-Expansion transformer.

Most recently, SETR [Zheng et al., 2021] and TransUNet [Chen et al., 2021] were released concurrently with or after our paper submission. Both of them employ a Vision Transformer (ViT) [Dosovitskiy et al., 2021] as the encoder to extract image features, which already contain global contextual information, and a few convolutional layers as the decoder to generate the segmentation mask. In contrast, in Segtran, the transformer layers build global context on top of the local image features extracted from a CNN backbone, and a Feature Pyramid Network (FPN) generates the segmentation mask.

[Murase et al., 2020] extends CNNs with positional encoding channels and evaluates them on segmentation tasks, with mixed results. In contrast, we verified through an ablation study that positional encodings indeed help Segtran to do segmentation to a certain degree.

Receptive fields of U-Nets may be enlarged by adding more downsampling layers. However, this increases the number of parameters and adds the risk of overfitting. Another way of increasing receptive fields is using larger stride sizes in the convolutions of the downsampling path. Doing so, however, sacrifices the spatial precision of feature maps, which is often disadvantageous for segmentation [Liu and Guo, 2020].

3 Squeeze-and-Expansion Transformer

The core concept in a transformer is self-attention, which can be understood as computing an affinity matrix between different units, and using it to aggregate features:

$$\text{Att\_weight}(X, X) = f\big(K(X), Q(X)\big) \in \mathbb{R}^{N \times N}, \tag{1}$$
$$\text{Attention}(X) = \text{Att\_weight}(X, X) \cdot V(X), \tag{2}$$
$$X_{\text{out}} = \text{FFN}\big(\text{Attention}(X)\big), \tag{3}$$

where $K, Q, V$ are the key, query, and value projections, respectively, and $f$ is softmax after dot product. $\text{Att\_weight}(X, X)$ is the pairwise attention matrix between input units, whose $(i, j)$-th element defines how much the features of unit $j$ contribute to the fused (contextualized) features of unit $i$. FFN is a feed-forward network that further transforms the fused features.
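To make Eqs. (1)-(3) concrete, here is a minimal PyTorch sketch of a single-head self-attention layer. It is not the authors' released Segtran code; the class name, the two-layer FFN, and the $1/\sqrt{C}$ scaling are our own illustrative choices.

```python
import torch
import torch.nn as nn

class SelfAttentionLayer(nn.Module):
    """Minimal single-head self-attention following Eqs. (1)-(3).
    x: (B, N, C) -- B images, N flattened units, C channels."""
    def __init__(self, dim):
        super().__init__()
        self.key   = nn.Linear(dim, dim)   # K(.)
        self.query = nn.Linear(dim, dim)   # Q(.)
        self.value = nn.Linear(dim, dim)   # V(.)
        # FFN of Eq. (3); a two-layer MLP is a common choice (our assumption).
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, x):
        k, q, v = self.key(x), self.query(x), self.value(x)
        # Eq. (1): softmax after dot product -> (B, N, N) attention matrix
        # (a standard 1/sqrt(C) scaling is added for numerical stability).
        att = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        fused = att @ v                    # Eq. (2): combine projected features
        return self.ffn(fused)             # Eq. (3)

# Example: one image, 44*44 = 1936 units, 256 channels.
# out = SelfAttentionLayer(256)(torch.randn(1, 1936, 256))
```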
The basic transformer above is extended to multi-head attention (MHA) [Vaswani et al., 2017; Voita et al., 2019], aiming to capture different types of associations between input units. Each of the $N_h$ heads computes individual attention weights and output features ($C/N_h$-dimensional), and their output features are concatenated along the channel dimension into $C$-dimensional features. Different heads operate in exclusively different feature subspaces.

We argue that transformers can be improved in four aspects to make them better suited for images:

1. In Eq. (2), the intermediate features $\text{Attention}(X)$ are obtained by linearly combining the projected input features, where the attention matrix specifies the combination coefficients. As the attention matrix is huge ($N \times N$, with typically $N > 1000$), it is inherently vulnerable to noise and overfitting. Reducing the attention matrix to lower-rank matrices may help.

2. In traditional transformers, the output features are monomorphic: there is only one set of feature transformations (the multi-head transformer also has one set of transformations after concatenation), which may not have enough capacity to fully model data variations. Just as a mixture of Gaussians almost always depicts a data distribution better than a single Gaussian, data variations can be better captured using a mixture of $k$ transformers.

3. In traditional transformers, the key and query projections are independently learned, enabling them to capture asymmetric relationships between tokens in natural language. However, the relationships between image units are often symmetric, such as whether two pixels belong to the same segmentation class.

4. Pixels in an image have strong locality and semantic continuity. The two mainstream positional encoding schemes [Carion et al., 2020; Dosovitskiy et al., 2021] do not fully impose such an inductive bias. This bias could be imposed by an improved positional encoding.

The Squeeze-and-Expansion Transformer aims to improve in all four aspects. The Squeezed Attention Block computes attention between the input and $M$ inducing points [Lee et al., 2019], and compresses the attention matrices to $N \times M$. The Expanded Attention Block is a mixture-of-experts model with $N_m$ modes ("experts"). In both blocks, the query projections and key projections are tied to make the attention symmetric, for better modeling of the symmetric relationships between image units. In addition, a Learnable Sinusoidal Positional Encoding helps the model capture spatial relationships.

3.1 Squeezed Attention Block

[Lee et al., 2019] proposes the Induced Set Attention Block (ISAB), which brings inducing points into the transformer. It was originally designed to learn good features of a set of unordered objects. Here we employ this design to squeeze the bloated attention matrix, so as to reduce noise and overfitting in image tasks. We rename ISAB as the Squeezed Attention Block (SAB) to highlight its new role in this context (we clarify that our contribution is a novel transformer architecture that combines SAB with an Expanded Attention Block).

In SAB, the inducing points are a set of $M$ learned embeddings $c_1, \dots, c_M$ in an external discrete codebook, where usually $M \ll N$, the number of input units. The inducing points are first transformed into new embeddings $C' = \{c'_1, \dots, c'_M\}$ after attending with the input. The combination of these embeddings forms the output features $X_{\text{out}} = \{x'_1, \dots, x'_N\}$ (Fig. 3):

$$C' = \text{Single-Head}(X, C), \tag{4}$$
$$X_{\text{out}} = \text{EAB}(C', X), \tag{5}$$

where $\text{Single-Head}(\cdot, \cdot)$ is a single-head transformer, and $\text{EAB}(\cdot, \cdot)$ is an Expanded Attention Block presented in the next subsection. In each of the two steps, the attention matrix is of size $N \times M$ or $M \times N$, much more compact than in vanilla transformers.

Figure 3: (a) Full self-attention ($N \times N$) vs. (b) Squeezed Attention Block (SAB). In SAB, the input units $x_1, \dots, x_N$ first attend with a codebook $c_1, \dots, c_M$, yielding projected codebook features $c'_1, \dots, c'_M$, which then attend back with the input $x_1, \dots, x_N$. The two attention matrices are $N \times M$ and $M \times N$, respectively.

SAB is conceptually similar to the codebook used for discrete representation learning in [Esser et al., 2020], except that the discretized features are further processed by a transformer. SAB can trace its lineage back to low-rank matrix factorization, i.e., approximating a data matrix $X_{n \times n} \approx P_{n \times d} Q_{d \times n}$, which is a traditional regularization technique against data noise and overfitting. As confirmed by an ablation study, SAB helps fight against noise and overfitting as well.
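The squeeze step of Eqs. (4)-(5) can be sketched as two cross-attention passes through a small learned codebook. This is only an illustration under our own naming: the second pass is a plain single-head block here, whereas the paper uses an Expanded Attention Block (next subsection), and hyperparameters such as the number of inducing points are placeholders.

```python
import torch
import torch.nn as nn

class TiedCrossAttention(nn.Module):
    """Single-head attention with queries from `q_in` and keys/values from
    `kv_in`. The key and query projections are tied (shared), mirroring the
    symmetric-attention design described in Section 3."""
    def __init__(self, dim):
        super().__init__()
        self.kq    = nn.Linear(dim, dim)          # shared key/query projection
        self.value = nn.Linear(dim, dim)
        self.ffn   = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                   nn.Linear(dim * 4, dim))

    def forward(self, q_in, kv_in):
        q, k, v = self.kq(q_in), self.kq(kv_in), self.value(kv_in)
        att = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        return self.ffn(att @ v)

class SqueezedAttentionBlock(nn.Module):
    """Eqs. (4)-(5): the N input units attend through M inducing points
    (M << N), so the attention matrices are N x M / M x N instead of N x N."""
    def __init__(self, dim, num_inducing=256):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(1, num_inducing, dim))  # c_1..c_M
        self.squeeze  = TiedCrossAttention(dim)  # Eq. (4)
        self.expand   = TiedCrossAttention(dim)  # Eq. (5); the paper uses an EAB here

    def forward(self, x):                              # x: (B, N, C)
        codebook = self.codebook.expand(x.shape[0], -1, -1)
        c_proj = self.squeeze(codebook, x)   # codebook attends to the input units
        return self.expand(x, c_proj)        # input units attend back to the codebook
```

Because both attention matrices involve only the $M$ inducing points, the memory cost and the risk of overfitting the full $N \times N$ attention map shrink accordingly.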
3.2 Expanded Attention Block

The Expanded Attention Block (EAB) consists of $N_m$ modes, each an individual single-head transformer. They output $N_m$ sets of contextualized features, which are then aggregated into one set using dynamic mode attention:

$$X^{(k)}_{\text{out}} = \text{Mode}^{(k)}(X), \tag{6}$$
$$B^{(k)} = \text{Linear}^{(k)}\big(X^{(k)}_{\text{out}}\big), \quad k \in \{1, \dots, N_m\}, \tag{7}$$
$$G = \text{softmax}\big(\big[B^{(1)}, \dots, B^{(N_m)}\big]\big), \tag{8}$$
$$X_{\text{out}} = \big[X^{(1)}_{\text{out}}, \dots, X^{(N_m)}_{\text{out}}\big] \cdot G, \tag{9}$$

where the mode attention $G \in \mathbb{R}^{N_u \times N_m}$ is obtained by applying a linear transformation to each mode's features and taking softmax over all the modes. Eq. (9) takes a weighted sum over the modes to get the final output features $X_{\text{out}}$. This dynamic attention is inspired by the Split Attention of the ResNeSt model [Zhang et al., 2020b].

Figure 4: (a) Multi-head attention (MHA) vs. (b) Expanded Attention Block (EAB). In MHA, each head outputs an exclusive feature subset. In contrast, EAB outputs $N_m$ sets of complete features from $N_m$ modes, and aggregates them with dynamic mode attention.

EAB is a type of Mixture-of-Experts [Shazeer et al., 2017], an effective way to increase model capacity. Although there is resemblance between multi-head attention (MHA) and EAB, they are essentially different, as shown in Fig. 4. In MHA, each head resides in an exclusive feature subspace and provides unique features. In contrast, different modes in EAB share the same feature space, and the representation power largely remains after removing any single mode. The modes join forces to offer more capacity to model diverse data, as shown in an ablation study. In addition, EAB is also different from the Mixture of Softmaxes (MoS) transformer [Zhang et al., 2020a]: although the MoS transformer also uses $k$ sets of queries and keys, it shares one set of value transformations.
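Below is a compact sketch of the mode-aggregation step in Eqs. (6)-(9). For brevity, each mode defaults to a per-unit linear map here, whereas in the paper each mode is a full single-head transformer (and inside SAB the EAB attends between the codebook features and the input); class names, arguments, and defaults are ours.

```python
import torch
import torch.nn as nn

class ExpandedAttentionBlock(nn.Module):
    """Mode aggregation of Eqs. (6)-(9): N_m modes each produce a complete
    feature set, which are fused per-unit by dynamic mode attention."""
    def __init__(self, dim, num_modes=4, make_mode=None):
        super().__init__()
        # Each mode should be a single-head transformer layer (Eq. (6)); a
        # per-unit linear map is used as a lightweight stand-in by default.
        make_mode = make_mode or (lambda: nn.Linear(dim, dim))
        self.modes   = nn.ModuleList([make_mode() for _ in range(num_modes)])
        # Linear^(k) of Eq. (7): one scalar score per unit for each mode.
        self.scorers = nn.ModuleList([nn.Linear(dim, 1) for _ in range(num_modes)])

    def forward(self, x):                                   # x: (B, N_u, C)
        outs   = [mode(x) for mode in self.modes]                      # Eq. (6)
        scores = torch.cat([s(o) for s, o in zip(self.scorers, outs)], dim=-1)
        gate   = torch.softmax(scores, dim=-1)              # Eq. (8): (B, N_u, N_m)
        stacked = torch.stack(outs, dim=-1)                 # (B, N_u, C, N_m)
        return (stacked * gate.unsqueeze(2)).sum(dim=-1)    # Eq. (9)
```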
3.3 Learnable Sinusoidal Positional Encoding

A crucial inductive bias for images is pixel locality and semantic continuity, which is naturally encoded by convolutional kernels. As the input to transformers is flattened into 1-D sequences, positional encoding (PE) is the only source of information about spatial relationships. On the one hand, this makes transformers flexible enough to model arbitrary shapes of input. On the other hand, the continuity bias of images is non-trivial to fully incorporate. This is a limitation of the two mainstream PE schemes, Fixed Sinusoidal Encoding and Discretely Learned Encoding [Carion et al., 2020; Dosovitskiy et al., 2021]. The former is spatially continuous but lacks adaptability, as the code is predefined. The latter learns a discrete code for each coordinate without enforcing spatial continuity.

We propose the Learnable Sinusoidal Positional Encoding, aiming to bring in the continuity bias with adaptability. Given a pixel coordinate $(x, y)$, our positional encoding vector $\text{pos}(x, y)$ is a combination of sine and cosine functions of linear transformations of $(x, y)$:

$$\text{pos}_i(x, y) = \begin{cases} \sin(a_i x + b_i y + c_i) & \text{if } i < C/2, \\ \cos(a_i x + b_i y + c_i) & \text{if } i \ge C/2, \end{cases} \tag{10}$$

where $i$ indexes the elements in $\text{pos}$, $\{a_i, b_i, c_i\}$ are learnable weights of a linear layer, and $C$ is the dimensionality of the image features. To make the PE behave consistently across different image sizes, we normalize $(x, y)$ into $[0, 1]^2$. When the input image is 3D, Eq. (10) is trivially extended by using 3D coordinates $(x, y, z)$.

The encoding in Eq. (10) changes smoothly with pixel coordinates, and thus nearby units receive similar positional encodings, pushing the attention weights between them towards larger values, which is the spirit of the continuity bias. The learnable weights and sinusoidal activation functions make the code both adaptable and nonlinear, so as to model complex spatial relationships [Tancik et al., 2020].

4 Segtran Architecture

As a context-dependent pixel-wise classification task, segmentation faces a conflict between larger context (lower resolution) and localization accuracy (higher resolution). Segtran partly circumvents this conflict by doing pairwise feature contextualization without sacrificing spatial resolution. There are five main components in Segtran (Fig. 1): 1) a CNN backbone to extract image features, 2) input/output feature pyramids to do upsampling, 3) the learnable sinusoidal positional encoding, 4) Squeeze-and-Expansion transformer layers to contextualize features, and 5) a segmentation head.

4.1 CNN Backbone

We employ a pretrained CNN backbone to extract feature maps with rich semantics. Suppose the input image is $X_0 \in \mathbb{R}^{H_0 \times W_0 \times D_0}$, where for a 2D image, $D_0 = 1$ or $3$ is the number of color channels, and for a 3D image, $D_0 > 3$ is the number of slices in the depth dimension. For 2D and 3D images, the extracted features are $\text{CNN}(X_0) \in \mathbb{R}^{C \times H \times W}$ and $\text{CNN}(X_0) \in \mathbb{R}^{C \times H \times W \times D}$, respectively. On 2D images, typically ResNet-101 or EfficientNet-B4 is used as the backbone. For increased spatial resolution, we change the stride of the first convolution from 2 to 1; then $H, W = H_0/16, W_0/16$. On 3D images, 3D backbones like I3D [Carreira and Zisserman, 2017] could be adopted.

4.2 Transformer Layers

Before being fed into the transformer, the visual features and positional encodings of each unit are added up: $X_{\text{spatial}} = X_{\text{visual}} + \text{pos}(\text{coordinates}(X))$. $X_{\text{spatial}}$ is then flattened across the spatial dimensions into a 1-D sequence in $\mathbb{R}^{N_u \times C}$, where $N_u$ is the total number of image units, i.e., points in the feature maps. The transformer consists of a few stacked transformer layers. Each layer takes input features $X$, computes the pairwise interactions between input units, and outputs contextualized features $X_{\text{out}}$ for the same number of units. The transformer layers used are our novel design, the Squeeze-and-Expansion Transformer (Section 3).
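The positional encoding of Eq. (10) and the add-and-flatten step of Section 4.2 can be sketched as follows, assuming 2D feature maps and coordinates normalized to $[0, 1]^2$; the module and function names are illustrative, not the released API.

```python
import torch
import torch.nn as nn

class LearnableSinusoidalPE(nn.Module):
    """Eq. (10): pos_i(x, y) = sin/cos(a_i*x + b_i*y + c_i) with learnable
    (a_i, b_i, c_i); coordinates are normalized to [0, 1]^2."""
    def __init__(self, dim):
        super().__init__()
        assert dim % 2 == 0
        self.linear = nn.Linear(2, dim)   # rows hold (a_i, b_i); the bias holds c_i

    def forward(self, h, w):
        dev = self.linear.weight.device
        xs = torch.linspace(0, 1, w, device=dev).repeat(h)              # x varies fastest
        ys = torch.linspace(0, 1, h, device=dev).repeat_interleave(w)
        proj = self.linear(torch.stack([xs, ys], dim=-1))               # (H*W, C)
        half = proj.shape[-1] // 2
        # Channels i < C/2 use sine, channels i >= C/2 use cosine.
        return torch.cat([torch.sin(proj[:, :half]),
                          torch.cos(proj[:, half:])], dim=-1)           # (H*W, C)

# Section 4.2: add the PE to the visual features, then flatten to a 1-D sequence.
def flatten_with_pe(feat, pe):
    """feat: (B, C, H, W) backbone features -> (B, H*W, C) transformer input."""
    b, c, h, w = feat.shape
    tokens = feat.flatten(2).transpose(1, 2)     # row-major flattening: (B, H*W, C)
    return tokens + pe(h, w).unsqueeze(0)        # X_spatial = X_visual + pos(...)
```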
4.3 Feature Pyramids and Segmentation Head

Although the spatial resolution of the features is not reduced by the transformer layers, for richer semantics, the input features to the transformer are usually high-level features from the backbone, which have low spatial resolution. Hence, we increase their spatial resolution with an input Feature Pyramid Network (FPN) [Liu et al., 2018] and an output FPN, which upsample the feature maps at the transformer input end and output end, respectively.

Without loss of generality, let us assume EfficientNet is the backbone. Stages 3, 4, 6, and 9 of the network are commonly used to extract multi-scale feature maps; we denote the corresponding feature maps as $f_1, f_2, f_3, f_4$, respectively. Their shapes are $f_i \in \mathbb{R}^{C_i \times H_i \times W_i}$, with $H_i = H_0/2^i$ and $W_i = W_0/2^i$. As described above, $f(X_0) = f_4$ is 1/16 of the original image resolution, which is too coarse for accurate segmentation. Hence, we upsample it with an input FPN and obtain upsampled feature maps $f_{34}$:

$$f_{34} = \text{upsample}_{\times 2}(f_4) + \text{conv}_{34}(f_3), \tag{11}$$

where $\text{conv}_{34}$ is a $1 \times 1$ convolution that aligns the channels of $f_3$ to $f_4$, and $\text{upsample}_{\times 2}(\cdot)$ is bilinear interpolation. $f_{34}$ is 1/8 of the original image resolution, and is used as the input features to the transformer layers. As the transformer layers keep the spatial resolution unchanged from input to output feature maps, the output feature maps $g_{34}$ are also 1/8 of the input image. This spatial resolution is still too low for segmentation. Therefore, we adopt an output FPN to upsample the feature maps by a factor of 4 (i.e., to 1/2 of the original image). The output FPN consists of two upsampling steps:

$$f_{12} = \text{upsample}_{\times 2}(f_2) + \text{conv}_{12}(f_1), \qquad g_{1234} = \text{upsample}_{\times 4}(g_{34}) + \text{conv}_{24}(f_{12}), \tag{12}$$

where $\text{conv}_{12}$ and $\text{conv}_{24}$ are $1 \times 1$ convolutional layers that align the channels of $f_1$ to $f_2$, and $f_2$ to $f_4$, respectively. This FPN scheme is the bottom-up FPN proposed in [Liu et al., 2018]. Empirically, it performs better than the original top-down FPN [Lin et al., 2017], as richer semantics in the top layers are better preserved. The segmentation head is simply a $1 \times 1$ convolutional layer, outputting confidence scores of each class in the mask.
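A condensed sketch of Eqs. (11)-(12) around the transformer is given below, with bilinear upsampling and 1×1 channel-alignment convolutions. Here `transformer` is assumed to wrap the flatten → transformer layers → reshape step, and all module names and channel arguments are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegtranFPN(nn.Module):
    """Bottom-up feature pyramid around the transformer, per Eqs. (11)-(12)."""
    def __init__(self, c1, c2, c3, c4, num_classes):
        super().__init__()
        self.conv34 = nn.Conv2d(c3, c4, kernel_size=1)   # align f3 -> f4 channels
        self.conv12 = nn.Conv2d(c1, c2, kernel_size=1)   # align f1 -> f2 channels
        self.conv24 = nn.Conv2d(c2, c4, kernel_size=1)   # align f12 -> f4 channels
        self.seg_head = nn.Conv2d(c4, num_classes, kernel_size=1)  # 1x1 conv head

    @staticmethod
    def up(x, scale):
        return F.interpolate(x, scale_factor=scale, mode='bilinear',
                             align_corners=False)

    def forward(self, f1, f2, f3, f4, transformer):
        # Eq. (11): input FPN, 1/16 -> 1/8 of the original resolution.
        f34 = self.up(f4, 2) + self.conv34(f3)
        g34 = transformer(f34)                        # keeps spatial resolution (1/8)
        # Eq. (12): output FPN, fuse with the finer backbone features.
        f12 = self.up(f2, 2) + self.conv12(f1)
        g1234 = self.up(g34, 4) + self.conv24(f12)    # 1/2 of the original image
        return self.seg_head(g1234)                   # per-class confidence scores
```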
5 Experiments

Three tasks were evaluated in our experiments:

REFUGE20: Optic Disc/Cup Segmentation in Fundus Images. This task segments the optic disc and cup in fundus images, which are 2D images of the rear of the eye (Fig. 5). It is a subtask of the REFUGE Challenge (https://refuge.grand-challenge.org/Home2020/) [Orlando et al., 2020], MICCAI 2020. 1200 images were provided for training, and 400 for validation. We also used two extra datasets, the Drishti-GS dataset [Sivaswamy et al., 2015] and RIM-ONE v3 [Fumero et al., 2011], when training all models. The disc/cup dice scores on the validation images were obtained from the official evaluation server.

Polyp: Polyp Segmentation in Colonoscopy Images. Polyps are fleshy growths in the colon lining that may become cancerous. This task does polyp segmentation in colonoscopy images (Fig. 5). Two image datasets [Fan et al., 2020] were used: CVC612 (CVC in short; 612 images) and Kvasir (1000 images). Each was randomly split into 80% training and 20% validation, and the training images were merged into one set.

Figure 5: Top: Optic disc/cup segmentation in fundus images into 3 classes: disc (grey), cup (white), and background (black). Bottom: Polyp segmentation in colonoscopy images into 2 classes: polyp (white) and background (black).

BraTS19: Tumor Segmentation in MRI Images. This task focuses on the segmentation of gliomas, a common brain tumor, in MRI scans. It was a subtask of the BraTS'19 challenge (https://www.med.upenn.edu/cbica/brats-2019/) [Menze et al., 2015; Bakas et al., 2017], MICCAI 2019. It involves four classes: the whole tumor (WT), the tumor core (TC), the enhancing tumor (ET), and background. Among them, the tumor core consists of the necrotic regions and non-enhancing tumors, as well as the enhancing tumor. 335 scans were provided for training, and 125 for validation. The dice scores of ET, WT and TC on the validation scans were obtained from the official evaluation server.

5.1 Ablation Studies

Two ablation studies were performed on REFUGE20 to compare: 1) the Squeeze-and-Expansion transformer versus the multi-head transformer; and 2) the learnable sinusoidal positional encoding versus the two mainstream schemes, as well as not using PE. All settings were variants of the standard one, which used three layers of the Squeeze-and-Expansion transformer with four modes ($N_m = 4$), along with the learnable sinusoidal positional encoding. Both ResNet-101 and EfficientNet-B4 were evaluated as the backbone to reduce random effects from the choice of backbone. We only reported the cup dice scores, as the disc segmentation task was relatively easy, with dice scores varying by only 0.005 across most settings.

Type of Transformer Layers. Table 1 shows that the Squeeze-and-Expansion transformer outperformed the traditional multi-head transformer. Moreover, both the squeezed attention block and the expanded attention block contributed to the improved performance.

| Transformer Type | ResNet-101 | Eff-B4 |
|---|---|---|
| Cell-DETR ($N_h$ = 4) | 0.846 | 0.857 |
| Multi-Head ($N_h$ = 4) | 0.858 | 0.862 |
| No Squeeze + Expansion ($N_m$ = 4) | 0.840 | 0.872 |
| Squeeze + Single-Mode | 0.859 | 0.868 |
| Squeeze + Expansion ($N_m$ = 4) | 0.862 | 0.872 |

Table 1: REFUGE20 Fundus Optic Cup dice scores as the type of transformer layers varies. "Single-Mode" implies no expansion. Cell-DETR uses a multi-head transformer and discretely learned PE. $N_h$: number of attention heads in MHA. $N_m$: number of modes in a Squeeze-and-Expansion transformer.

Positional Encoding. Table 2 compares the learnable sinusoidal positional encoding with the two mainstream PE schemes and no PE. Surprisingly, without PE, the performance of Segtran only dropped 1~2%. A possible explanation is that the transformer may manage to extract positional information from the CNN backbone features [Islam et al., 2020].

| Positional Encoding | ResNet-101 | Eff-B4 |
|---|---|---|
| None | 0.857 | 0.853 |
| Discretely learned | 0.852 | 0.860 |
| Fixed Sinusoidal | 0.857 | 0.849 |
| Learnable Sinusoidal | 0.862 | 0.872 |

Table 2: REFUGE20 Fundus Optic Cup dice scores as the type of positional encoding (PE) scheme varies.

Number of Transformer Layers. Table 3 shows that as the number of transformer layers increased from 1 to 3, the performance improved gradually. However, one more layer caused a performance drop, indicating possible overfitting.

| Number of layers | ResNet-101 | Eff-B4 |
|---|---|---|
| 1 | 0.856 | 0.854 |
| 2 | **0.862** | 0.857 |
| 3 | **0.862** | **0.872** |
| 4 | 0.855 | 0.869 |

Table 3: REFUGE20 Optic Cup dice scores as the number of transformer layers varies. Best performers with each backbone are highlighted.

5.2 Comparison with Baselines

Ten methods were evaluated on the 2D segmentation tasks:

U-Net [Ronneberger et al., 2015]: The implementation in the popular library Segmentation Models PyTorch (SMP, https://github.com/qubvel/segmentation_models.pytorch/) was used. The pretrained ResNet-101 was chosen as the encoder.
In addition, the U-Net implemented in U-Net++ (below) was evaluated as training from scratch.

U-Net++ [Zhou et al., 2018]: A popular PyTorch implementation (https://github.com/4uiiurz1/pytorch-nested-unet). It does not provide options to use pretrained encoders, and thus was only trained from scratch.

UNet 3+ [Huang et al., 2020]: The official PyTorch implementation (https://github.com/ZJUGiveLab/UNet-Version). It does not provide options to use pretrained encoders.

PraNet [Fan et al., 2020]: The official PyTorch implementation (https://github.com/DengPingFan/PraNet). The pretrained Res2Net-50 [Gao et al., 2020] was recommended as the encoder.

DeepLabV3+ [Chen et al., 2018]: A popular PyTorch implementation (https://github.com/VainF/DeepLabV3Plus-Pytorch), with a pretrained ResNet-101 as the encoder.

Attention-based U-Nets [Oktay et al., 2018]: Attention U-Net (Att U-Net) and Att R2U-Net (a combination of Att U-Net and Recurrent Residual U-Net) were evaluated (https://github.com/LeeJunHyun/Image_Segmentation). They learn to focus on important areas by computing element-wise attention weights (as opposed to the pairwise attention of transformers).

nnU-Net [Isensee et al., 2021]: nnU-Net generates a custom U-Net configuration for each dataset based on its statistics. It is primarily designed for 3D tasks, but can also handle 2D images after converting them to pseudo-3D. The original pipeline is time-consuming, so we extracted the generated U-Net configuration and instantiated it in our pipeline for training and testing.

Deformable U-Net [Jin et al., 2019]: Deformable U-Net (DUNet) uses deformable convolution in place of ordinary convolution. The official implementation of DUNet (https://github.com/RanSuLab/DUNet-retinal-vessel-detection) was evaluated.

SETR [Zheng et al., 2021]: SETR uses ViT as the encoder, and a few convolutional layers as the decoder. The SETR-PUP model in the official implementation (https://github.com/fudan-zvg/SETR/) was evaluated, by fine-tuning the pretrained ViT.

TransUNet [Chen et al., 2021]: TransUNet uses a hybrid of ResNet and ViT as the encoder, and a U-Net-style decoder. The official implementation (https://github.com/Beckschen/TransUNet) was evaluated, by fine-tuning their pretrained weights.

Segtran: Trained with either a pretrained ResNet-101 or EfficientNet-B4 as the backbone.

| Method | REFUGE20 Cup | REFUGE20 Disc | Polyp Kvasir | Polyp CVC | Avg. |
|---|---|---|---|---|---|
| U-Net | 0.730 | 0.946 | 0.787 | 0.771 | 0.809 |
| U-Net (R101) | 0.837 | 0.950 | 0.868 | 0.844 | 0.875 |
| U-Net++ | 0.781 | 0.940 | 0.753 | 0.740 | 0.804 |
| UNet 3+ | 0.819 | 0.943 | 0.708 | 0.680 | 0.788 |
| PraNet (Res2Net-50) | 0.781 | 0.946 | 0.898 | 0.899 | 0.881 |
| DeepLabV3+ (R101) | 0.839 | 0.950 | 0.805 | 0.795 | 0.847 |
| Att U-Net | 0.846 | 0.952 | 0.744 | 0.749 | 0.823 |
| Att R2U-Net | 0.818 | 0.944 | 0.686 | 0.632 | 0.770 |
| DUNet | 0.826 | 0.945 | 0.748 | 0.754 | 0.818 |
| nnU-Net | 0.829 | 0.953 | 0.857 | 0.864 | 0.876 |
| SETR (ViT) | 0.859 | 0.952 | 0.894 | 0.916 | 0.905 |
| TransUNet (R50+ViT) | 0.835 | 0.958 | 0.895 | 0.916 | 0.901 |
| Segtran (R101) | 0.862 | 0.956 | 0.888 | 0.929 | 0.909 |
| Segtran (eff-B4) | 0.872 | 0.961 | 0.903 | 0.931 | 0.917 |

Table 4: Dice scores on the REFUGE20 and Polyp validation sets. R101: ResNet-101; R50: ResNet-50; eff-B4: EfficientNet-B4.

Three methods were evaluated on the 3D segmentation task:

Extension of nnU-Net [Wang et al., 2019]: An extension of nnU-Net (https://github.com/woodywff/brats_2019) with two sampling strategies.
Bag of tricks (2nd-place solution of the BraTS'19 challenge) [Zhao et al., 2019]: This entry used an ensemble of five models. For fairness, we quote the best single-model results ("BL+warmup").

Segtran-3D: I3D [Carreira and Zisserman, 2017] was used as the backbone.

5.3 Training Protocols

All models were trained on a 24GB Titan RTX GPU with the AdamW optimizer. The learning rate for the three transformer-based models was 0.0002, and 0.001 for the other models. On REFUGE20, all models were trained with a batch size of 4 for 10,000 iterations (27 epochs); on Polyp, the total number of iterations was 14,000 (31 epochs). On BraTS19, Segtran was trained with a batch size of 4 for 8000 iterations. The training loss was the average of the pixel-wise cross-entropy loss and the dice loss. Segtran used 3 transformer layers on 2D images, and 1 layer on 3D images to save RAM. The number of modes in each transformer layer was 4.

5.4 Results

Tables 4 and 5 present the evaluation results on the 2D and 3D tasks, respectively.

| Method | ET | WT | TC | Avg. |
|---|---|---|---|---|
| Extension of nnU-Net | 0.737 | 0.894 | 0.807 | 0.813 |
| Bag of tricks | 0.729 | 0.904 | 0.802 | 0.812 |
| Segtran (I3D) | 0.740 | 0.895 | 0.817 | 0.817 |

Table 5: Dice scores on the BraTS19 validation set. Only single-model performance is reported.

Overall, the three transformer-based methods, i.e., SETR, TransUNet and Segtran, achieved the best performance across all tasks. With ResNet-101 as the backbone, Segtran performed slightly better than SETR and TransUNet. With EfficientNet-B4, Segtran exhibited greater advantages. It is worth noting that Segtran (eff-B4) was among the top 5 teams in the semifinal and final leaderboards of the REFUGE20 challenge. Among the REFUGE20 and BraTS'19 challenge participants, although there were several methods that performed slightly better than Segtran, they usually employed ad-hoc tricks and designs [Orlando et al., 2020; Wang et al., 2019; Zhao et al., 2019]. In contrast, Segtran achieved competitive performance with the same architecture and minimal hyperparameter tuning, free of domain-specific strategies.

Figure 6 [(a) RIM-One image, (b) Ground truth, (c) U-Net, (d) DeepLabV3+, (e) Segtran]: Soft segmentation masks produced by different methods on a RIM-One image. The mask by Segtran has the fewest artifacts.

5.5 Cross-Domain Generalization

To explore how well different methods generalize to new domains, we trained three representative methods, U-Net, DeepLabV3+ and Segtran, on the 1200 training images of REFUGE20. All the methods used a pretrained ResNet-101 as the encoder/backbone. The trained models were evaluated on both the REFUGE20 training images and the RIM-One dataset [Fumero et al., 2011]. As RIM-One images have drastically different characteristics from REFUGE20, all models suffered a severe performance drop, as shown in Table 6. Nevertheless, Segtran had the least performance degradation, showing the best cross-domain generalization. Fig. 6 shows a RIM-One image and the corresponding soft segmentation masks (before thresholding) produced by different methods. The mask produced by Segtran contains the fewest artifacts.

| Method | REFUGE | RIM-One | Drop |
|---|---|---|---|
| U-Net | 0.862 | 0.680 | -0.182 |
| DeepLabV3+ | 0.846 | 0.653 | -0.193 |
| Segtran | 0.938 | 0.796 | -0.142 |

Table 6: Generalization of three methods, measured by the drop of Optic Cup dice scores from the REFUGE20 training images to a new test domain, RIM-One. The smaller the drop, the better. All used ResNet-101 as the encoder/backbone.

5.6 Computational Efficiency

Table 7 presents the number of parameters and FLOPs of a few representative methods. In general, transformer-based methods consume more computation and GPU RAM than conventional methods.
Our profiling showed that the number of parameters/FLOPs of Segtran is dominated by the output FPN, which varies drastically across different backbones. As the bottom-up FPNs we adopt are somewhat similar to EfficientDet [Tan et al., 2020], the model size/FLOPs are optimal when using EfficientNets. With ResNets as the backbone, Segtran has a significantly higher model size/FLOPs, and hence this choice of backbone is not recommended for efficiency-sensitive scenarios.

| Method | Params (M) | FLOPs (G) |
|---|---|---|
| nnU-Net | 41.2 | 16.3 |
| Att U-Net | 34.9 | 51.0 |
| SETR (ViT) | 307.1 | 91.1 |
| TransUNet (R50+ViT) | 93.2 | 32.2 |
| Segtran (R101) | 166.7 | 152.8 |
| Segtran (eff-B4) | 93.1 | 71.3 |

Table 7: Number of parameters / FLOPs on a 256×256 input image.

5.7 Impact of Pretraining

Models for medical image tasks usually benefit from initialization with weights pretrained on natural images (e.g., ImageNet [Deng et al., 2009]), as medical image datasets are typically small. To quantitatively study the impact of pretraining, Table 8 compares the performance of using pretrained weights vs. training from scratch for a few methods. Pretraining brought a ~2.5% increase in average dice scores to the two transformer-based models, and 1% to U-Net (ResNet-101).

| Method | REFUGE20 Cup | REFUGE20 Disc | Polyp Kvasir | Polyp CVC | Avg. |
|---|---|---|---|---|---|
| U-Net (R101 scratch) | 0.827 | 0.953 | 0.847 | 0.835 | 0.865 |
| U-Net (R101 pretrained) | 0.837 | 0.950 | 0.868 | 0.844 | 0.875 |
| TransUNet (R50+ViT scratch) | 0.817 | 0.943 | 0.869 | 0.872 | 0.875 |
| TransUNet (R50+ViT pretrained) | 0.835 | 0.958 | 0.895 | 0.916 | 0.901 |
| Segtran (R101 scratch) | 0.852 | 0.939 | 0.858 | 0.851 | 0.875 |
| Segtran (R101 pretrained) | 0.862 | 0.956 | 0.888 | 0.929 | 0.909 |

Table 8: Impact of using pretrained encoder weights.

6 Conclusions

In this work, we presented Segtran, a transformer-based medical image segmentation framework. It leverages the unlimited effective receptive fields of transformers to contextualize features. Moreover, the transformer is an improved Squeeze-and-Expansion transformer that better fits image tasks. Segtran sees both the global picture and fine details, which lends it good segmentation performance. On two 2D and one 3D medical image segmentation tasks, Segtran consistently outperformed existing methods and generalized well to new domains.

Acknowledgements

We are grateful for the help and support of Wei Jing. This research is supported by A*STAR under its Career Development Award (Grant No. C210112016), and its Human-Robot Collaborative AI for Advanced Manufacturing and Engineering (AME) programme (Grant No. A18A2b0046).

References

[Baheti et al., 2020] B. Baheti, S. Innani, S. Gajre, and S. Talbar. Eff-UNet: A novel architecture for semantic segmentation in unstructured environment. In CVPR Workshops, 2020.
[Bakas et al., 2017] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. S. Kirby, J. B. Freymann, K. Farahani, and C. Davatzikos. Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features. Nature Scientific Data, 4, 2017.
[Carion et al., 2020] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
[Carreira and Zisserman, 2017] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
[Chen et al., 2018] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
[Chen et al., 2021] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L. Yuille, and Yuyin Zhou. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.
[Çiçek et al., 2016] Özgün Çiçek, Ahmed Abdulkadir, Soeren S. Lienkamp, Thomas Brox, and Olaf Ronneberger. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In Sebastien Ourselin, Leo Joskowicz, Mert R. Sabuncu, Gozde Unal, and William Wells, editors, MICCAI, 2016.
[Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[Dosovitskiy et al., 2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[Esser et al., 2020] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. arXiv:2012.09841, 2020.
[Fan et al., 2020] Deng-Ping Fan, Ge-Peng Ji, Tao Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling Shao. PraNet: Parallel reverse attention network for polyp segmentation. In MICCAI, 2020.
[Fumero et al., 2011] F. Fumero, S. Alayon, J. L. Sanchez, J. Sigut, and M. Gonzalez-Hernandez. RIM-ONE: An open retinal image database for optic nerve evaluation. In 24th International Symposium on CBMS, 2011.
[Gao et al., 2020] Shang-Hua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and Philip Torr. Res2Net: A new multi-scale backbone architecture. IEEE TPAMI, 43, 2020.
[Huang et al., 2020] H. Huang, L. Lin, R. Tong, H. Hu, Q. Zhang, Y. Iwamoto, X. Han, Y. Chen, and J. Wu. UNet 3+: A full-scale connected UNet for medical image segmentation. In ICASSP, 2020.
[Isensee et al., 2021] Fabian Isensee, Paul F. Jaeger, Simon A. A. Kohl, Jens Petersen, and Klaus H. Maier-Hein. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18, 2021.
[Islam et al., 2020] Md Amirul Islam, Sen Jia, and Neil D. B. Bruce. How much position information do convolutional neural networks encode? In ICLR, 2020.
[Jin et al., 2019] Qiangguo Jin, Zhaopeng Meng, Tuan D. Pham, Qi Chen, Leyi Wei, and Ran Su. DUNet: A deformable network for retinal vessel segmentation. Knowledge-Based Systems, 2019.
[Kirillov et al., 2019] Alexander Kirillov, Kaiming He, Ross B. Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In CVPR, 2019.
[Lee et al., 2019] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set Transformer: A framework for attention-based permutation-invariant neural networks. In ICML, 2019.
[Lin et al., 2017] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[Liu and Guo, 2020] Sun'ao Liu and Xiaonan Guo. Improving brain tumor segmentation with multi-direction fusion and fine class prediction. In BrainLes Workshop, MICCAI, 2020.
[Liu et al., 2018] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance segmentation. In CVPR, 2018.
[Luo et al., 2016] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. In NeurIPS, 2016.
[Menze et al., 2015] Bjoern H. Menze, Andras Jakab, Stefan Bauer, Jayashree Kalpathy-Cramer, Keyvan Farahani, Justin Kirby, et al. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE TMI, 34, 2015.
[Milletari et al., 2016] F. Milletari, N. Navab, and S. Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 3DV, 2016.
[Murase et al., 2020] Rito Murase, Masanori Suganuma, and Takayuki Okatani. How can CNNs use image position for segmentation? arXiv:2005.03463, 2020.
[Oktay et al., 2018] Ozan Oktay, Jo Schlemper, Loïc Le Folgoc, Matthew C. H. Lee, Mattias P. Heinrich, Kazunari Misawa, Kensaku Mori, et al. Attention U-Net: Learning where to look for the pancreas. In MIDL, 2018.
[Orlando et al., 2020] José Ignacio Orlando, Huazhu Fu, João Barbosa Breda, Karel van Keer, Deepti R. Bathula, Andrés Diaz-Pinto, et al. REFUGE Challenge: A unified framework for evaluating automated methods for glaucoma assessment from fundus photographs. Medical Image Analysis, 59, 2020.
[Prangemeier et al., 2020] Tim Prangemeier, Christoph Reich, and Heinz Koeppl. Attention-based transformers for instance segmentation of cells in microstructures. In IEEE International Conference on Bioinformatics and Biomedicine, 2020.
[Ronneberger et al., 2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
[Schlemper et al., 2019] Jo Schlemper, Ozan Oktay, Michiel Schaap, Mattias Heinrich, Bernhard Kainz, Ben Glocker, and Daniel Rueckert. Attention gated networks: Learning to leverage salient regions in medical images. Medical Image Analysis, 53, 2019.
[Shazeer et al., 2017] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR, 2017.
[Sivaswamy et al., 2015] Jayanthi Sivaswamy, Subbaiah Krishnadas, Arunava Chakravarty, Gopal Joshi, and Ujjwal. A comprehensive retinal image dataset for the assessment of glaucoma from the optic nerve head analysis. JSM Biomedical Imaging Data Papers, 2, 2015.
[Tan and Le, 2019] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
[Tan et al., 2020] Mingxing Tan, Ruoming Pang, and Quoc V. Le. EfficientDet: Scalable and efficient object detection. In CVPR, 2020.
[Tancik et al., 2020] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In NeurIPS, 2020.
[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[Voita et al., 2019] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In ACL, 2019.
[Wang et al., 2019] Feifan Wang, Runzhou Jiang, Liqin Zheng, Chun Meng, and Bharat Biswal. 3D U-Net based brain tumor segmentation and survival days prediction. In BrainLes Workshop, MICCAI, 2019.
[Zhang et al., 2020a] Dong Zhang, Hanwang Zhang, Jinhui Tang, Meng Wang, Xiansheng Hua, and Qianru Sun. Feature pyramid transformer. In ECCV, 2020.
[Zhang et al., 2020b] Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Muller, R. Manmatha, Mu Li, and Alexander Smola. ResNeSt: Split-attention networks. arXiv:2004.08955, 2020.
[Zhao et al., 2019] Yuan-Xing Zhao, Yan-Ming Zhang, and Cheng-Lin Liu. Bag of tricks for 3D MRI brain tumor segmentation. In BrainLes Workshop, MICCAI, 2019.
[Zheng et al., 2021] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H. S. Torr, and Li Zhang. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021.
[Zhou et al., 2018] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. UNet++: A nested U-Net architecture for medical image segmentation. In DLMIA Workshop (MICCAI), 2018.