# Gated Fully Fusion for Semantic Segmentation

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Xiangtai Li¹, Houlong Zhao², Lei Han³, Yunhai Tong¹, Shaohua Tan¹, Kuiyuan Yang²
¹School of EECS, Peking University  ²DeepMotion  ³Tencent AI Lab
{lxtpku, yhtong}@pku.edu.cn, tan@cis.pku.edu.cn, {houlongzhao, kuiyuanyang}@deepmotion.ai, leihan.cs@gmail.com

## Abstract

Semantic segmentation generates a comprehensive understanding of scenes by densely predicting the category of each pixel. High-level features from deep convolutional neural networks have demonstrated their effectiveness in semantic segmentation tasks; however, the coarse resolution of high-level features often leads to inferior results for small/thin objects, where detailed information is important. It is natural to consider importing low-level features to compensate for the detailed information lost in high-level features. Unfortunately, simply combining multi-level features suffers from the semantic gap among them. In this paper, we propose a new architecture, named Gated Fully Fusion (GFF), to selectively fuse features from multiple levels using gates in a fully connected way. Specifically, features at each level are enhanced by higher-level features with stronger semantics and lower-level features with more details, and gates are used to control the propagation of useful information, which significantly reduces the noise during fusion. We achieve state-of-the-art results on four challenging scene parsing datasets, including Cityscapes, Pascal Context, COCO-stuff and ADE20K.

## Introduction

Semantic segmentation densely predicts the semantic category of every pixel in an image. Such comprehensive image understanding is valuable for many vision-based applications such as medical image analysis (Ronneberger, Fischer, and Brox 2015), remote sensing (Kampffmeyer, Salberg, and Jenssen 2016) and autonomous driving (Xu et al. 2017). However, precisely predicting the label of every pixel is challenging, as illustrated in Fig. 1, since pixels can belong to tiny or large objects, far or near objects, and can lie inside an object or on an object boundary.

As a semantic prediction problem, the basic task of semantic segmentation is to generate a high-level representation for each pixel, i.e., a high-level and high-resolution feature map. Given the ability of ConvNets to learn high-level representations from data, semantic segmentation has made much progress by leveraging such representations. However, high-level representations in a ConvNet are generated while lowering the resolution, so high-resolution and high-level feature maps are distributed at the two ends of a ConvNet.

Figure 1: Illustration of challenges in semantic segmentation. (a) Input image. (b) Ground truth. (c) PSPNet result. (d) Our result. Our method performs much better on small patterns such as distant poles and traffic lights.

To get a feature map that is both high-resolution and high-level, which is not readily available in a ConvNet, it is natural to consider fusing high-level feature maps from top layers and high-resolution feature maps from bottom layers.
These feature maps have different properties: a high-level feature map can correctly predict most pixels of large patterns in a coarse manner, which is why it is widely used in current semantic segmentation approaches, while low-level feature maps can only predict a few pixels on small patterns. Thus, simply combining high-level and high-resolution feature maps drowns the useful information in massive useless information and does not yield an informative high-level and high-resolution feature map. Therefore, an advanced fusion mechanism is required to collect information selectively from the different feature maps.

To achieve this, we propose Gated Fully Fusion (GFF), which uses a gating mechanism, a kind of operation commonly used for information extraction from time series, to measure the usefulness of each feature vector pixel-wise and to control information propagation through gates accordingly. The gate at each layer is designed either to send useful information out to other layers or to receive information from other layers when the information in the current layer is useless. By using gates to control information propagation, redundancy can also be effectively minimized in the network, allowing us to fuse multi-level feature maps in a fully connected manner. Fig. 1 compares the results of GFF and PSPNet (Zhao et al. 2017), where GFF handles fine-level details such as poles and traffic lights much better.

In addition, contextual information from a large receptive field is also very important for semantic segmentation, as shown by PSPNet (Zhao et al. 2017), ASPP (Chen et al. 2018a) and DenseASPP (Yang et al. 2018). Therefore, we also model contextual information after GFF to further improve performance. Specifically, we propose a Dense Feature Pyramid (DFP) module to encode context information into each feature map. DFP reuses the contextual information at each feature level and aims to enhance the context modeling part, while GFF operates on the backbone network to capture more detailed information. Combining both components in a single end-to-end network, we achieve state-of-the-art results on four scene parsing datasets.

The main contributions of our work can be summarized in three points. Firstly, we propose Gated Fully Fusion to generate a high-resolution and high-level feature map from multi-level feature maps, and Dense Feature Pyramid to enhance the semantic representation of multi-level feature maps. Secondly, detailed analysis with visualization of the gates learned at different layers intuitively shows the information regulation mechanism in GFF. Finally, the proposed method is extensively verified on four standard semantic segmentation benchmarks including Cityscapes, Pascal Context, COCO-stuff and ADE20K, where it achieves state-of-the-art performance on all four tasks. In particular, our model achieves 82.3% mIoU on the Cityscapes test set trained only on the fine labeled data with ResNet101 as backbone.

## Related Work

### Context Modeling

Though high-level feature maps in ConvNets have shown promising results on semantic segmentation (Long, Shelhamer, and Darrell 2015), their receptive field sizes are still not large enough to capture contextual information for large objects and regions. Thus, context modeling has become a practical direction in semantic segmentation. PSPNet (Zhao et al. 2017) uses spatial pyramid pooling to aggregate multi-scale contextual information.
The DeepLab series (Chen et al. 2015; 2018a; 2017) develops atrous spatial pyramid pooling (ASPP) to capture multi-scale contextual information with dilated convolutional layers of different dilation rates. Instead of the parallel aggregation adopted in PSPNet and DeepLab, Yang et al. (Yang et al. 2018) and Bilinski et al. (Bilinski and Prisacariu 2018) follow the idea of dense connections (Huang et al. 2017) to encode contextual information in a dense way. In (Peng et al. 2017), factorized large filters are used directly to increase the receptive field size for context modeling. SVCNet (Ding et al. 2019) generates a scale- and shape-variant semantic mask for each pixel to confine its contextual region. In PSANet (Zhao et al. 2018), contextual information is collected from all positions according to similarities defined in a projected feature space. Similarly, DANet (Fu et al. 2018) and CCNet (Huang et al. 2018) use non-local style operators (Wang et al. 2018) to aggregate information from the whole image based on similarities.

### Multi-level Feature Fusion

In addition to its limited contextual information, the top-layer feature map also lacks fine detailed information. To address this issue, FCN (Long, Shelhamer, and Darrell 2015) uses predictions from middle layers to improve segmentation of detailed structures, while hypercolumns (Hariharan et al. 2015) directly combine features from multiple layers for prediction. U-Net (Ronneberger, Fischer, and Brox 2015) adds skip connections between the encoder and decoder to reuse low-level features, and (Zhang et al. 2018b) improves U-Net by fusing high-level features into low-level features. Feature Pyramid Network (FPN) (Lin et al. 2017b) uses the U-Net structure with predictions made at each level of the feature pyramid. DeepLabv3+ (Chen et al. 2018b) refines the decoder of its previous version by combining low-level features. In (Lin et al. 2018) and (Ding et al. 2018), every two adjacent feature maps in the feature pyramid are locally fused into one feature map until only one feature map is left. These fusion methods operate locally in the feature pyramid without awareness of the usefulness of all the feature maps to be fused, which limits the propagation of useful features.

### Gating Mechanism

In deep neural networks, especially recurrent networks, gates are commonly utilized to control information propagation. For example, LSTM (Hochreiter and Schmidhuber 1997) and GRU (Cho et al. 2014) are two typical cases that use different gates to handle long-term memory and dependencies. The highway network (Srivastava, Greff, and Schmidhuber 2015) uses gates to make training very deep networks possible. To improve multi-task learning for scene parsing and depth estimation, PAD-Net (Xu et al. 2018) uses gates to fuse multi-modal features trained from multiple auxiliary tasks. DepthSeg (Kong and Fowlkes 2018) proposes a depth-aware gating module that uses depth estimates to adaptively modify the pooling field size in the high-level feature map. Our method is related to and inspired by the above methods, and differs from them in that we propose to fuse multi-level feature maps simultaneously through a gating mechanism; the resulting method surpasses the state-of-the-art approaches.

## Method

In this section, we first overview the basic setting of multi-level feature fusion and three baseline fusion strategies. Then, we introduce the proposed multi-level fusion module (GFF) and the whole network with the context modeling module (DFP).
### Multi-level Feature Fusion

Given $L$ feature maps $\{X_i \in \mathbb{R}^{H_i \times W_i \times C_i}\}_{i=1}^{L}$ extracted from a backbone network such as ResNet (He et al. 2016), where the feature maps are ordered by their depth in the network with increasing semantics but decreasing detail, $H_i$, $W_i$ and $C_i$ are the height, width and number of channels of the $i$-th feature map, respectively. Feature maps at higher levels have lower resolution due to the downsampling operations, i.e., $H_{i+1} \le H_i$ and $W_{i+1} \le W_i$.

Figure 2: The proposed Gated Fully Fusion module, where $G_l$ is the gate map generated from $X_l$; features at positions with high gate values are allowed to be sent out, and positions with low gate values are allowed to receive information.

In semantic segmentation, the top feature map $X_L$, at 1/8 of the resolution of the raw input image, is mostly used for its rich semantics. The major limitation of $X_L$ is its low spatial resolution without detailed information, since the output needs to have the same resolution as the input image. In contrast, low-level feature maps from shallow layers have high resolution but limited semantics. Intuitively, combining the complementary strengths of feature maps at multiple levels would achieve the goal of both high resolution and rich semantics, and this process can be abstracted as a fusion process $f$:

$$\{X_1, X_2, \ldots, X_L\} \xrightarrow{f} \{\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_L\}, \tag{1}$$

where $\tilde{X}_l$ is the fused feature map for the $l$-th level. To simplify the notation in the following equations, the bilinear sampling and $1 \times 1$ convolutions used to reshape the feature maps on the right-hand side are omitted, so that the fused feature maps have the same size as those on the left-hand side.

Concatenation is a straightforward operation to aggregate all the information in multiple feature maps, but it mixes the useful information with a large amount of non-informative features. Addition is another simple way to combine feature maps by adding the features at each position, but it suffers from a similar problem. FPN (Lin et al. 2017b) conducts the fusion through a top-down pathway with lateral connections. The three fusion strategies can be formulated as

$$\text{Concat:}\quad \tilde{X}_l = \mathrm{concat}(X_1, \ldots, X_L), \tag{2}$$

$$\text{Addition:}\quad \tilde{X}_l = \sum_{i=1}^{L} X_i, \tag{3}$$

$$\text{FPN:}\quad \tilde{X}_l = \tilde{X}_{l+1} + X_l, \quad \text{where } \tilde{X}_L = X_L. \tag{4}$$

Figure 3: Illustration of the overall architecture. (a) Backbone network (e.g., ResNet (He et al. 2016)) with a pyramid pooling module (PPM) (Zhao et al. 2017) on top; the backbone provides a pyramid of features at different levels. (b) Feature pyramid produced by Gated Fully Fusion (GFF) modules; the GFF module is illustrated in detail in Fig. 2. (c) The final features containing context information are obtained from a Dense Feature Pyramid (DFP) module. Best viewed in color and zoomed in.

The problem with these basic fusion strategies is that feature maps are fused together without measuring the usefulness of each feature vector, so massive useless features are mixed with useful features during fusion.
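To make the three baseline strategies concrete, the following PyTorch-style sketch illustrates them under our own simplifying assumptions (the module and argument names, the shared 256-channel projection, and the use of bilinear upsampling with 1x1 lateral convolutions for resizing are illustrative choices, not code from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BaselineFusion(nn.Module):
    """Illustrative concat / addition / FPN-style fusion over a feature pyramid.

    `in_channels` lists the channel counts C_1..C_L of the backbone feature maps.
    All levels are first projected to a common channel dimension with 1x1 convs
    and resized with bilinear upsampling, mirroring the resizing that
    Eqs. (1)-(4) leave implicit.
    """

    def __init__(self, in_channels, out_channels=256, mode="fpn"):
        super().__init__()
        self.mode = mode
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels]
        )

    def forward(self, feats):
        # feats: [X_1, ..., X_L], ordered from high resolution to low resolution.
        feats = [lat(x) for lat, x in zip(self.lateral, feats)]
        target_size = feats[0].shape[-2:]  # fuse at the highest resolution

        if self.mode == "concat":  # Eq. (2)
            up = [F.interpolate(x, size=target_size, mode="bilinear",
                                align_corners=False) for x in feats]
            return torch.cat(up, dim=1)

        if self.mode == "add":     # Eq. (3)
            up = [F.interpolate(x, size=target_size, mode="bilinear",
                                align_corners=False) for x in feats]
            return sum(up)

        # Eq. (4): top-down pathway, X~_L = X_L, then X~_l = X~_{l+1} + X_l.
        fused = feats[-1]
        outputs = [fused]
        for x in reversed(feats[:-1]):
            fused = x + F.interpolate(fused, size=x.shape[-2:],
                                      mode="bilinear", align_corners=False)
            outputs.insert(0, fused)
        return outputs  # one fused map per level, FPN-style
```

Note that in the FPN ablation reported later, five feature maps are fused: the four backbone stages plus the PPM output.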
### Gated Fully Fusion

**GFF module design.** The basic task in multi-level feature fusion is to aggregate useful information under the interference of massive useless information. Gating is a mature mechanism for measuring the usefulness of each feature vector in a feature map and aggregating information accordingly. In this paper, Gated Fully Fusion (GFF) is designed on top of the simple addition-based fusion by controlling the information flow with gates. Specifically, each level $l$ is associated with a gate map $G_l \in [0, 1]^{H_l \times W_l}$. With these gate maps, the gated addition-based fusion is formally defined as

$$\tilde{X}_l = (1 + G_l) \odot X_l + (1 - G_l) \odot \sum_{i=1, i \neq l}^{L} G_i \odot X_i, \tag{5}$$

where $\odot$ denotes element-wise multiplication with broadcasting over the channel dimension, and each gate map $G_l = \mathrm{sigmoid}(w_l \ast X_l)$ is estimated by a convolutional layer parameterized by $w_l \in \mathbb{R}^{1 \times 1 \times C_l}$. There are $L$ gate maps in total, where $L$ equals the number of feature maps. The detailed operation is illustrated in Fig. 2.

**GFF involves a duplex gating mechanism.** A feature vector at position $(x, y)$ from level $i$ (where $i \neq l$) is fused into level $l$ only when $G_i(x, y)$ is large and $G_l(x, y)$ is small, i.e., information is sent when level $i$ has useful information that level $l$ is missing. Besides regulating useful information to the right place through gates, useless information can also be effectively suppressed on both the sender and receiver sides, and information redundancy is avoided because information is only received when the current position holds useless features. More visualization examples are given in the experiments.

**Comparison with other gate modules.** The work of (Ding et al. 2018) also uses gates for information control, but only between adjacent layers. GFF differs in using gates to fully fuse features from every level instead of only adjacent levels, and the rich information in all levels, with its large variance in usability, motivates the duplex gating mechanism, which filters out useless information more effectively with gates on both the sender and receiver sides. Experimental results in the experiment section demonstrate the advantage of the proposed method.

### Dense Feature Pyramid

Context modeling aims to encode more global information, and it is orthogonal to the proposed GFF because GFF operates at the backbone level. Therefore, we further design a module to encode more contextual information from the outputs of both PSPNet (Zhao et al. 2017) and GFF. Motivated by the observation that dense connections can strengthen feature propagation (Huang et al. 2017), we densely connect the feature maps in a top-down manner starting from the feature map output by the PSPNet, so that high-level feature maps are reused multiple times to add more contextual information to low levels, which we found important in our experiments for correctly segmenting large patterns inside objects. The process is

$$y_i = H_i([y_0, \tilde{X}_1, \ldots, \tilde{X}_{i-1}]), \tag{6}$$

so the $i$-th level of the feature pyramid receives the feature maps of all preceding levels, $y_0, \tilde{X}_1, \ldots, \tilde{X}_{i-1}$, as input and outputs the current level $y_i$, where $y_0$ is the output of the PSPNet and $\tilde{X}_i$ is the output of the $i$-th GFF module. The fusion function $H_i$ is implemented by a single convolution layer. Since the feature pyramid is densely connected, we denote this module as the Dense Feature Pyramid (DFP). The collection of DFP's outputs $y_i$ is used for the final prediction. Both GFF and DFP can be plugged into existing FCNs for end-to-end training with only slightly extra computation cost.
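The following PyTorch-style sketch shows one way Eq. (5) and Eq. (6) could be realized. It is a minimal illustration under our own assumptions (class names, a shared 256-channel dimension for all levels and for the PPM output $y_0$, and bilinear resizing between levels); it is not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFullyFusion(nn.Module):
    """Gated Fully Fusion (Eq. 5): every level exchanges features with every
    other level, modulated by per-level, per-position gate maps in [0, 1]."""

    def __init__(self, num_levels, channels=256):
        super().__init__()
        # One 1x1 conv per level producing a single-channel gate map G_l.
        self.gate_convs = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(num_levels)]
        )

    def forward(self, feats):
        # feats: [X_1, ..., X_L], all with `channels` channels, different sizes.
        gates = [torch.sigmoid(g(x)) for g, x in zip(self.gate_convs, feats)]
        fused = []
        for l, (x_l, g_l) in enumerate(zip(feats, gates)):
            size = x_l.shape[-2:]
            # Sum of gated contributions from all other levels, resized to level l.
            others = 0
            for i, (x_i, g_i) in enumerate(zip(feats, gates)):
                if i == l:
                    continue
                others = others + F.interpolate(g_i * x_i, size=size,
                                                mode="bilinear",
                                                align_corners=False)
            fused.append((1 + g_l) * x_l + (1 - g_l) * others)
        return fused


class DenseFeaturePyramid(nn.Module):
    """Dense Feature Pyramid (Eq. 6): each output y_i is computed from the PPM
    output y_0 and all preceding GFF outputs via a single conv layer H_i."""

    def __init__(self, num_levels, channels=256):
        super().__init__()
        self.fuse_convs = nn.ModuleList(
            [nn.Conv2d((i + 1) * channels, channels, kernel_size=3, padding=1)
             for i in range(num_levels)]
        )

    def forward(self, y0, gff_feats):
        outputs = []
        for i, conv in enumerate(self.fuse_convs):
            size = gff_feats[i].shape[-2:]
            inputs = [y0] + list(gff_feats[:i])  # [y_0, X~_1, ..., X~_{i-1}]
            inputs = [F.interpolate(t, size=size, mode="bilinear",
                                    align_corners=False) for t in inputs]
            outputs.append(conv(torch.cat(inputs, dim=1)))
        return outputs
```

In the full network (Fig. 3), these modules sit between the PSPNet backbone/PPM and the final classifier, and the resizing shown explicitly here is the bookkeeping that Eqs. (5) and (6) leave implicit.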
### Network Architecture and Implementation

Our network architecture is designed based on the previous state-of-the-art PSPNet (Zhao et al. 2017) with ResNet (He et al. 2016) as the backbone for basic feature extraction; the last two stages of ResNet are modified with dilated convolutions so that both strides become 1 and spatial information is preserved. Fig. 3 shows the overall framework including both GFF and DFP. PSPNet forms the bottom-up pathway with the backbone network and the pyramid pooling module (PPM), where the PPM sits on top to encode contextual information.

Feature maps from the last residual block of each backbone stage are used as input to the GFF module, and all feature maps are reduced to 256 channels with 1x1 convolutional layers. The output feature maps from GFF are further fused with two 3x3 convolutional layers at each level before being fed into the DFP module. All convolutional layers are followed by batch normalization (Ioffe and Szegedy 2015) and the ReLU activation function. After DFP, all feature maps are concatenated for the final semantic segmentation. Compared with the basic PSPNet, the proposed method only slightly increases the number of parameters and computations. The entire network is trained end-to-end, driven by the cross-entropy loss defined on the segmentation benchmarks. To facilitate training, an auxiliary loss is used together with the main loss to help optimization, following (Lee et al. 2015), where the main loss is defined on the final output of the network and the auxiliary loss is defined on the output feature map of stage 3 of ResNet with a weight of 0.4 (Zhao et al. 2017).

Figure 4: Visualization of segmentation results of two images using GFF and PSPNet. The first column shows the two input images, zoomed into the regions marked with red dashed rectangles. The second column shows results of PSPNet, the third column shows results using GFF, and the fourth column lists the ground truth. The last column shows the parts refined by GFF. GFF handles missing distant objects such as poles, traffic lights and object boundaries. Best viewed in color.

## Experiments

In this section, we analyze the proposed method on the Cityscapes dataset (Cordts et al. 2016) and report results on the other datasets.

### Implementation Details

Our implementation is based on PyTorch (Paszke et al. 2017). The weight decay is set to 1e-4. Standard SGD is used for optimization, and a "poly" learning rate policy is used to adjust the learning rate, where the initial learning rate is set to 1e-3 and decayed by $(1 - \frac{\mathrm{iter}}{\mathrm{total\_iter}})^{\mathrm{power}}$ with power = 0.9. Synchronized batch normalization (Zhang et al. 2018a) is used. For Cityscapes, a crop size of 864x864 is used and training runs for 100K iterations with a mini-batch size of 8. For ADE20K, COCO-stuff and Pascal Context, a crop size of 512x512 is used (images with a side smaller than the crop size are padded with zeros) and training runs for 150K iterations with a mini-batch size of 16. As a common practice to avoid overfitting, data augmentation including random horizontal flipping, random cropping, random color jittering within the range of [-10, 10], and random scaling in the range of [0.75, 2] is used during training.
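As a small illustration of the training schedule described above, the snippet below sketches the "poly" learning rate decay with the stated hyperparameters (initial learning rate 1e-3, power 0.9, weight decay 1e-4, 100K iterations for Cityscapes). The optimizer momentum, the placeholder model and the loop structure are our own assumptions, not code from the paper.

```python
import torch

def poly_lr(base_lr, cur_iter, total_iter, power=0.9):
    """'Poly' schedule: lr = base_lr * (1 - cur_iter / total_iter) ** power."""
    return base_lr * (1 - cur_iter / total_iter) ** power

# Hypothetical usage; the real model would be the full GFF network.
model = torch.nn.Conv2d(3, 19, kernel_size=1)  # placeholder module
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9,        # momentum value is our assumption
                            weight_decay=1e-4)   # weight decay as stated in the paper

total_iter = 100_000  # Cityscapes setting
for cur_iter in range(total_iter):
    for group in optimizer.param_groups:
        group["lr"] = poly_lr(1e-3, cur_iter, total_iter)
    # ... forward pass, cross-entropy + 0.4 * auxiliary loss, backward, step ...
```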
Figure 5: DFP enhances segmentation results on large-scale objects and generates more consistent results. Best viewed in color and zoomed in.

### Experiments on Cityscapes Dataset

Cityscapes is a large-scale dataset for semantic urban scene understanding. It contains 5000 finely annotated, pixel-level labeled images, divided into 2975, 500 and 1525 images for training, validation and testing respectively, where the labels of the training and validation sets are publicly released and the labels of the test set are held out for online evaluation. It also provides 20000 coarsely annotated images. 30 classes are annotated, 19 of which are used for the pixel-level semantic labeling task. Images are of high resolution, 1024x2048. The evaluation metric for this dataset is the mean Intersection over Union (mIoU).

**Strong baseline.** We choose PSPNet (Zhao et al. 2017) as our baseline model, which achieved state-of-the-art performance for semantic segmentation. We re-implement PSPNet on Cityscapes and achieve similar performance, with an mIoU of 78.6% on the validation set. All results are reported using sliding-window crop prediction.

**Ablation study on feature fusion methods.** First, we compare the fusion strategies introduced in the Method section. To speed up training, we use weights from the trained PSPNet to initialize the parameters of each fusion method. We use the train-fine data for training and report performance on the validation set. For a fair comparison with concatenation and addition, we also reduce the channel dimension of the feature maps to 256 and use two 3x3 convolutional layers to refine the fused feature map. For FPN, we implement the original FPN for semantic segmentation following (Kirillov et al. 2017) and add it to PSPNet. Note that FPN based on PSPNet fuses 5 feature maps, one of which is the context feature map from the pyramid pooling module while the others come from the backbone. All results are shown in Table 1. As expected, concatenation and addition only slightly improve the baseline, and FPN achieves the best performance among the three basic fusion methods, while the proposed GFF obtains an even larger improvement, with an mIoU of 80.4%. Since GFF is a gated version of addition-based fusion, these results demonstrate the effectiveness of the gating mechanism. For further comparison, we also add the proposed gating mechanism into the top-down pathway of FPN and observe only a slight improvement, which is reasonable since most high-level features are useful for low levels. This demonstrates the advantage of fully fusing multi-level feature maps, and the importance of the gating mechanism especially when fusing low-level features into high levels. Fig. 4 shows results after using GFF, where the accuracy of predictions for both distant objects and object boundaries is significantly improved.

| Method | mIoU (%) |
| --- | --- |
| PSPNet (baseline) | 78.6 |
| PSPNet + Concat | 78.8 (0.2↑) |
| PSPNet + Addition | 78.7 (0.1↑) |
| PSPNet + FPN | 79.3 (0.7↑) |
| PSPNet + Gated FPN | 79.4 (0.8↑) |
| PSPNet + GFF | 80.4 (1.8↑) |

Table 1: Comparison experiments on the Cityscapes validation set, where PSPNet serves as the baseline method.

**Ablation study on improvement strategies.** We apply two strategies to further boost the performance of our model: (1) DFP: the Dense Feature Pyramid is applied to the outputs of the GFF modules; and (2) MS: multi-scale inference, where the final segmentation map is averaged from the segmentation probability maps at scales {0.75, 1, 1.25, 1.5, 1.75}. Experimental results are shown in Table 2; DFP further improves the performance by 0.8% mIoU. Fig. 5 shows several visual comparisons, where DFP generates more consistent segmentation inside large objects, demonstrating the effectiveness of using contextual information to resolve local ambiguities. With multi-scale inference, our model achieves 81.8% mIoU, which significantly outperforms the previous state-of-the-art model DeepLabv3+ (79.55% on the Cityscapes validation set) by 2.25%.

| Method | mIoU (%) |
| --- | --- |
| PSPNet (baseline) | 78.6 |
| PSPNet + GFF | 80.4 (1.8↑) |
| PSPNet + GFF + DFP | 81.2 (2.6↑) |
| PSPNet + GFF + DFP + MS | 81.8 (3.2↑) |

Table 2: Comparison experiments on the Cityscapes validation set, where PSPNet serves as the baseline method.
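The multi-scale (MS) inference strategy can be sketched as follows. This is a simplified illustration under our own assumptions (whole-image rather than sliding-window prediction, and softmax averaging); the function name and signature are hypothetical.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multi_scale_inference(model, image, scales=(0.75, 1.0, 1.25, 1.5, 1.75)):
    """Average class-probability maps over rescaled copies of the input.

    image: tensor of shape (1, 3, H, W); `model` returns per-pixel class logits.
    """
    _, _, h, w = image.shape
    avg_prob = 0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear",
                               align_corners=False)
        logits = model(scaled)
        # Resize logits back to the original resolution before averaging.
        logits = F.interpolate(logits, size=(h, w), mode="bilinear",
                               align_corners=False)
        avg_prob = avg_prob + torch.softmax(logits, dim=1)
    return (avg_prob / len(scales)).argmax(dim=1)  # final label map
```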
**Ablation study on other architectures.** We also perform experiments with two other architectures: another strong baseline, DenseASPP (Yang et al. 2018), and PSPNet with a lightweight ResNet18 backbone. Results are shown in Table 4. Both GFF and DFP generalize well and improve the results of both models. In particular, the ResNet18-based PSPNet improves by 5.9 points over its baseline.

| Method | Backbone | mIoU (%) |
| --- | --- | --- |
| DenseASPP | DenseNet121 | 78.9 |
| DenseASPP + GFF | DenseNet121 | 80.1 (1.2↑) |
| DenseASPP + GFF + DFP | DenseNet121 | 80.9 (2.0↑) |
| PSPNet | ResNet18 | 73.0 |
| PSPNet + GFF | ResNet18 | 76.6 (3.6↑) |
| PSPNet + GFF + DFP | ResNet18 | 78.9 (5.9↑) |

Table 4: Ablation study on two different models, where mIoU is evaluated on the Cityscapes validation set.

**Computation cost.** In Table 3, we also study the computational cost of our modules: our method spends 7.7% more computation and uses 6.3% more parameters than the baseline PSPNet.

| Method | mIoU (%) | FLOPS (G) | Params (M) |
| --- | --- | --- | --- |
| PSPNet (baseline) | 78.6 | 580.1 | 65.6 |
| PSPNet + GFF | 80.4 | 600.1 | 69.7 |
| PSPNet + GFF + DFP | 81.2 | 625.5 | 70.5 |

Table 3: Computational cost comparison, where PSPNet serves as the baseline with an input image of size 512x512.

**Comparison to the state of the art.** As a common practice toward best performance, we average the predictions of multi-scaled images for inference. For a fair comparison, all methods are trained only on the fine annotated data and evaluated on the test set by the evaluation server. Table 5 summarizes the comparison: our method achieves 80.9% mIoU using only the train-fine data and outperforms PSANet (Zhao et al. 2018) by 2.3%. By fine-tuning the model on both the train-fine and val-fine data, our method achieves the best mIoU of 82.3%. Detailed per-category comparisons are reported in Table 7, where our method achieves the highest IoU on 15 out of 19 categories, with large improvements on small/thin categories such as pole, traffic light, traffic sign, person and rider. We do not use the coarse data. A more detailed analysis is given below via gate visualization.

| Method | Backbone | mIoU (%) |
| --- | --- | --- |
| PSPNet (Zhao et al. 2017) | ResNet101 | 78.4 |
| PSANet (Zhao et al. 2018) | ResNet101 | 78.6 |
| GFFNet (ours) | ResNet101 | 80.9 |
| AAF (Ke et al. 2018) | ResNet101 | 79.1 |
| PSANet (Zhao et al. 2018) | ResNet101 | 80.1 |
| DFN (Yu et al. 2018) | ResNet101 | 79.3 |
| DepthSeg (Kong and Fowlkes 2018) | ResNet101 | 78.2 |
| DenseASPP (Yang et al. 2018) | DenseNet161 | 80.6 |
| SVCNet (Ding et al. 2019) | ResNet101 | 81.0 |
| DANet (Fu et al. 2018) | ResNet101 | 81.5 |
| GFFNet (ours) | ResNet101 | 82.3 |

Table 5: State-of-the-art comparison on the Cityscapes test set. All methods are trained with fine annotations only; our 80.9% entry uses only the train-fine data, while our 82.3% entry is fine-tuned on both the train-fine and val-fine data.

| Method | Backbone | mIoU (%) | Pixel Acc. (%) |
| --- | --- | --- | --- |
| RefineNet (Lin et al. 2017a) | ResNet101 | 40.20 | - |
| PSPNet (Zhao et al. 2017) | ResNet101 | 43.29 | 81.39 |
| PSANet (Zhao et al. 2018) | ResNet101 | 43.77 | 81.51 |
| EncNet (Zhang et al. 2018a) | ResNet101 | 44.65 | 81.69 |
| GCUNet (Li and Gupta 2018) | ResNet101 | 44.81 | 81.19 |
| GFFNet (ours) | ResNet101 | 45.33 | 82.01 |

Table 6: State-of-the-art comparison on the ADE20K validation set. Our model achieves top performance measured by both mIoU and pixel accuracy.

Figure 6: Visualization of learned gate maps on the ADE20K dataset. $G_i$ denotes the gate map of the $i$-th level. Best viewed in color and zoomed in for details.

### Visualization of Gates

In this section, we visualize what the gates have learned and analyze how they control information propagation. Fig. 6 shows gates learned on ADE20K and Fig. 7(a) shows gates learned on Cityscapes. For each input image, we show the learned gate map of each level.
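As an illustration of how such gate maps could be exported for inspection, the sketch below recomputes the single-channel sigmoid gates of a GFF-style module (such as the GatedFullyFusion sketch above), upsamples them to the input resolution and saves them as grayscale heatmaps. The extraction procedure and file naming are our own assumptions, not the authors' visualization code.

```python
import torch
import torch.nn.functional as F
from torchvision.utils import save_image

@torch.no_grad()
def save_gate_maps(gff_module, feats, image_size, prefix="gate"):
    """Recompute the per-level gate maps G_l = sigmoid(w_l * X_l) of a
    GatedFullyFusion-style module and save them as grayscale images.

    feats: list of backbone feature maps [X_1, ..., X_L] (already reduced to
           the shared channel dimension expected by the gate convolutions).
    image_size: (H, W) of the input image, used to upsample the gate maps.
    """
    for level, (conv, x) in enumerate(zip(gff_module.gate_convs, feats), start=1):
        gate = torch.sigmoid(conv(x))              # (N, 1, H_l, W_l), values in [0, 1]
        gate = F.interpolate(gate, size=image_size,
                             mode="bilinear", align_corners=False)
        save_image(gate, f"{prefix}_G{level}.png")  # one heatmap per level
```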
As expected, we find that the higher-level features (e.g., $G_3$, $G_4$) are mostly useful for large structures with explicit semantics, while the lower-level features (e.g., $G_1$ and $G_2$) are mainly useful for local details and boundaries. Functionally, we find that the higher-level features almost always spread information to other layers and receive only sparse feature signals. For example, the gate from stage 4 ($G_4$ in Fig. 7) shows that almost all pixels have high confidence. Higher-level features cover a large receptive field with fewer details, and they provide a broad view of the main semantics. In contrast, the lower-level layers prefer to receive information and only spread a few sparse signals. This verifies that lower-level representations generally vary frequently along the spatial dimensions and require additional features as a semantic supplement, while their benefit is that they provide precise information about details and object boundaries ($G_2$ in Fig. 6 and $G_1$ in Fig. 7(a)).

To further verify the effectiveness of the learned gates, we set the value of each gate $G_i$ to zero and compare the segmentation results with those obtained using the learned gate values. Fig. 7(b) shows the comparison, where pixels that become wrongly predicted after setting $G_i$ to zero are highlighted. Information passing through $G_1$ and $G_2$ mainly helps object boundaries, while information passing through $G_3$ and $G_4$ mainly helps large patterns such as cars. Additional visualization examples of the gates can be found in the supplementary materials.

Figure 7: (a) Visualization of learned gate maps on the Cityscapes dataset, where $G_i$ denotes the gate map of the $i$-th level. (b) Pixels wrongly classified after setting $G_i$ to 0, compared with using the original gate values, are highlighted. Best viewed in color and zoomed in for details.

| Method | road | swalk | build | wall | fence | pole | tlight | sign | veg. | terrain | sky | person | rider | car | truck | bus | train | mbike | bike | mIoU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PSPNet (Zhao et al. 2017) | 98.6 | 86.2 | 92.9 | 50.8 | 58.8 | 64.0 | 75.6 | 79.0 | 93.4 | 72.3 | 95.4 | 86.5 | 71.3 | 95.9 | 68.2 | 79.5 | 73.8 | 69.5 | 77.2 | 78.4 |
| AAF (Ke et al. 2018) | 98.5 | 85.6 | 93.0 | 53.8 | 58.9 | 65.9 | 75.0 | 78.4 | 93.7 | 72.4 | 95.6 | 86.4 | 70.5 | 95.9 | 73.9 | 82.7 | 76.9 | 68.7 | 76.4 | 79.1 |
| DenseASPP (Yang et al. 2018) | 98.7 | 87.1 | 93.4 | 60.7 | 62.7 | 65.6 | 74.6 | 78.5 | 93.6 | 72.5 | 95.4 | 86.2 | 71.9 | 96.0 | 78.0 | 90.3 | 80.7 | 69.7 | 76.8 | 80.6 |
| DANet (Fu et al. 2018) | 98.6 | 87.1 | 93.5 | 56.1 | 63.3 | 69.7 | 77.3 | 81.3 | 93.9 | 72.9 | 95.7 | 87.3 | 72.9 | 96.2 | 76.8 | 89.4 | 86.5 | 72.2 | 78.2 | 81.5 |
| GFFNet (ours) | 98.7 | 87.2 | 93.9 | 59.6 | 64.3 | 71.5 | 78.3 | 82.2 | 94.0 | 72.6 | 95.9 | 88.2 | 73.9 | 96.5 | 79.8 | 92.2 | 84.7 | 71.5 | 78.8 | 82.3 |

Table 7: Per-category results on the Cityscapes test set. Note that all models are trained with only fine annotated data. Our method outperforms existing approaches on 15 out of 19 categories and achieves 82.3% mIoU.

### Results on Other Datasets

**ADE20K** is a challenging scene parsing dataset annotated with 150 classes, and it contains 20K/2K images for training and validation. Images in this dataset come from more diverse scenes, contain more small-scale objects, and vary widely in size, with the longer side sometimes larger than 2000 pixels and the shorter side sometimes smaller than 100 pixels.
Following the standard protocol, both mIoU and pixel accuracy evaluated on the validation set are used as the performance metrics. As shown in Table 6, with a ResNet101 backbone our method outperforms state-of-the-art methods by a considerable margin in terms of both mIoU and pixel accuracy. Several visual comparisons are shown in Fig. 8, where our method performs much better on details and object boundaries.

Figure 8: Visualization results on the ADE20K validation set (ResNet101 backbone). Compared with PSPNet, our method captures more detailed information, finds missing small objects (e.g., the lights in the first two examples) and generates smoother object boundaries (e.g., the figures on the wall in the last example). Best viewed in color.

**Pascal Context** (Mottaghi et al. 2014) provides pixel-wise segmentation annotations for 59 classes. There are 4998 training images and 5105 testing images. The results are shown in Table 8. Our method achieves state-of-the-art results with both the ResNet50 and ResNet101 backbones and outperforms the existing methods by a large margin.

| Method | Backbone | mIoU (%) |
| --- | --- | --- |
| EncNet (Zhang et al. 2018a) | ResNet50 | 49.0 |
| DANet (Fu et al. 2018) | ResNet50 | 50.1 |
| GFFNet (ours) | ResNet50 | 51.0 |
| PSPNet (Zhao et al. 2017) | ResNet101 | 47.8 |
| EncNet (Zhang et al. 2018a) | ResNet101 | 51.7 |
| CCLNet (Ding et al. 2018) | ResNet101 | 51.6 |
| DANet (Fu et al. 2018) | ResNet101 | 52.6 |
| SVCNet (Ding et al. 2019) | ResNet101 | 53.2 |
| GFFNet (ours) | ResNet101 | 54.2 |

Table 8: Results on the Pascal Context testing set.

**COCO Stuff** (Caesar, Uijlings, and Ferrari 2018) contains 10000 images from the Microsoft COCO dataset (Lin et al. 2014), of which 9000 are for training and 1000 for testing. The dataset contains 171 categories, including objects and stuff, annotated for each pixel. The results on COCO Stuff are shown in Table 9. Our method outperforms the existing methods and achieves top performance.

| Method | Backbone | mIoU (%) |
| --- | --- | --- |
| RefineNet (Lin et al. 2017a) | ResNet101 | 33.6 |
| DSSPN (Liang, Zhou, and Xing 2018) | ResNet101 | 36.2 |
| CCLNet (Ding et al. 2018) | ResNet101 | 35.7 |
| GFFNet (ours) | ResNet101 | 39.2 |

Table 9: Results on the COCO Stuff testing set.

## Conclusion

In this work, we propose Gated Fully Fusion (GFF) to fully fuse multi-level feature maps controlled by learned gate maps. The novel module bridges the gap between high resolution with low semantics and low resolution with high semantics. We explore the proposed GFF for the task of semantic segmentation and achieve new state-of-the-art results on four challenging scene parsing datasets. In particular, we find that missing low-level details can be fused into each feature level in the pyramid, which indicates that our module handles small and thin objects in the scene well.

## References

Bilinski, P., and Prisacariu, V. 2018. Dense decoder shortcut connections for single-pass semantic segmentation. In CVPR.
Caesar, H.; Uijlings, J.; and Ferrari, V. 2018. COCO-Stuff: Thing and stuff classes in context. In CVPR, 1209-1218.
Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; and Yuille, A. L. 2015. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR.
Chen, L.-C.; Papandreou, G.; Schroff, F.; and Adam, H. 2017. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; and Yuille, A. L. 2018a. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. PAMI.
Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; and Adam, H. 2018b. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV.
Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; and Schiele, B. 2016. The Cityscapes dataset for semantic urban scene understanding. In CVPR.
Ding, H.; Jiang, X.; Shuai, B.; Liu, A. Q.; and Wang, G. 2018. Context contrasted feature and gated multi-scale aggregation for scene segmentation. In CVPR.
Ding, H.; Jiang, X.; Shuai, B.; Liu, A. Q.; and Wang, G. 2019. Semantic correlation promoted shape-variant context for segmentation. In CVPR.
Fu, J.; Liu, J.; Tian, H.; Fang, Z.; and Lu, H. 2018. Dual attention network for scene segmentation. arXiv preprint arXiv:1809.02983.
Hariharan, B.; Arbelaez, P.; Girshick, R.; and Malik, J. 2015. Hypercolumns for object segmentation and fine-grained localization. In CVPR.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation.
Huang, G.; Liu, Z.; van der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In CVPR.
Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; and Liu, W. 2018. CCNet: Criss-cross attention for semantic segmentation. arXiv preprint arXiv:1811.11721.
Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
Kampffmeyer, M.; Salberg, A.-B.; and Jenssen, R. 2016. Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks. In CVPRW.
Ke, T.-W.; Hwang, J.-J.; Liu, Z.; and Yu, S. X. 2018. Adaptive affinity fields for semantic segmentation. In ECCV.
Kirillov, A.; He, K.; Girshick, R. B.; and Dollár, P. 2017. A unified architecture for instance and semantic segmentation.
Kong, S., and Fowlkes, C. C. 2018. Recurrent scene parsing with perspective understanding in the loop. In CVPR.
Lee, C.-Y.; Xie, S.; Gallagher, P.; Zhang, Z.; and Tu, Z. 2015. Deeply-supervised nets. In Artificial Intelligence and Statistics.
Li, Y., and Gupta, A. 2018. Beyond grids: Learning graph representations for visual recognition. In NeurIPS.
Liang, X.; Zhou, H.; and Xing, E. 2018. Dynamic-structured semantic propagation network. In CVPR.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In ECCV, 740-755. Springer.
Lin, G.; Milan, A.; Shen, C.; and Reid, I. D. 2017a. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In CVPR.
Lin, T.-Y.; Dollár, P.; Girshick, R. B.; He, K.; Hariharan, B.; and Belongie, S. J. 2017b. Feature pyramid networks for object detection. In CVPR.
Lin, D.; Ji, Y.; Lischinski, D.; Cohen-Or, D.; and Huang, H. 2018. Multi-scale context intertwining for semantic segmentation. In ECCV.
Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In CVPR.
Mottaghi, R.; Chen, X.; Liu, X.; Cho, N.-G.; Lee, S.-W.; Fidler, S.; Urtasun, R.; and Yuille, A. 2014. The role of context for object detection and semantic segmentation in the wild. In CVPR.
Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch. In NIPS-W.
Peng, C.; Zhang, X.; Yu, G.; Luo, G.; and Sun, J. 2017. Large kernel matters -- improve semantic segmentation by global convolutional network. In CVPR.
Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI.
Srivastava, R. K.; Greff, K.; and Schmidhuber, J. 2015. Training very deep networks. In NIPS.
Wang, X.; Girshick, R.; Gupta, A.; and He, K. 2018. Non-local neural networks. In CVPR.
Xu, H.; Gao, Y.; Yu, F.; and Darrell, T. 2017. End-to-end learning of driving models from large-scale video datasets. In CVPR.
Xu, D.; Ouyang, W.; Wang, X.; and Sebe, N. 2018. PAD-Net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In CVPR.
Yang, M.; Yu, K.; Zhang, C.; Li, Z.; and Yang, K. 2018. DenseASPP for semantic segmentation in street scenes. In CVPR.
Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; and Sang, N. 2018. Learning a discriminative feature network for semantic segmentation. In CVPR.
Zhang, H.; Dana, K.; Shi, J.; Zhang, Z.; Wang, X.; Tyagi, A.; and Agrawal, A. 2018a. Context encoding for semantic segmentation. In CVPR.
Zhang, Z.; Zhang, X.; Peng, C.; Xue, X.; and Sun, J. 2018b. ExFuse: Enhancing feature fusion for semantic segmentation. In ECCV.
Zhao, H.; Shi, J.; Qi, X.; Wang, X.; and Jia, J. 2017. Pyramid scene parsing network. In CVPR.
Zhao, H.; Zhang, Y.; Liu, S.; Shi, J.; Change Loy, C.; Lin, D.; and Jia, J. 2018. PSANet: Point-wise spatial attention network for scene parsing. In ECCV.