# Multi-scale Graph Fusion for Co-saliency Detection

Rongyao Hu (1,2), Zhenyun Deng (3), Xiaofeng Zhu (1,2)
1 Center for Future Media and School of Computer Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
2 School of Natural and Computational Science, Massey University Auckland Campus, New Zealand
3 School of Computer Science, The University of Auckland, New Zealand
hurongyao123, zysjd1991@gmail.com, xfzhu0011@hotmail.com

*Corresponding author. Rongyao Hu and Zhenyun Deng contributed equally to this work. Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

The key challenge of co-saliency detection is to extract discriminative features that distinguish the common salient foregrounds from the backgrounds in a group of relevant images. In this paper, we propose a new co-saliency detection framework that includes two strategies to improve the discriminative ability of the features. First, we segment each image into semantic superpixel clusters and generate images at different scales/sizes for each input image with the VGG-16 model. Different scales capture different patterns, so the multi-scale images describe the image group from multiple perspectives. Second, we propose a new Graph Convolutional Network (GCN) method to fine-tune the multi-scale features, aiming to capture the common information among the features from all scales as well as the private or complementary information of the feature of each scale. Moreover, the proposed GCN method jointly conducts multi-scale feature fine-tuning, graph learning, and feature learning in a unified framework. We evaluated our method on three benchmark data sets against state-of-the-art co-saliency detection methods. Experimental results show that our method outperforms all comparison methods in terms of different evaluation metrics.

Introduction

Co-saliency detection focuses on simulating the human visual system to perceive a scene and search for the common salient foregrounds in a group of images (Zhang et al. 2018; Peng et al. 2020), and has been applied to improve the understanding of image or video content in various applications such as image retrieval (Papushoy and Bors 2015), image co-segmentation (Tsai et al. 2018), and object co-localization (Jerripothula et al. 2017; Wang et al. 2017). In the co-saliency detection task, the semantic category of the common salient objects should be detected from the specific content of the input image group. This involves two key steps: feature extraction, which extracts discriminative features to reliably distinguish the foregrounds from the backgrounds of each image, and model construction, which detects the co-saliency regions from a group of images based on the extracted features.

Feature extraction focuses on extracting either handcrafted features or deep features based on image pixels or superpixels. Popular methods for handcrafted feature extraction include color/texture features (Fu, Cao, and Tu 2013; Shen et al. 2018), the Histogram of Oriented Gradients (HOG) feature (Huang, Feng, and Sun 2015), GIST descriptors (Jerripothula, Cai, and Yuan 2016), etc. Since handcrafted features usually have difficulty capturing the appearance changes of both the common objects and the complex background information (Wang et al. 2019), deep features have been widely designed to explore the semantic connections of co-salient objects (Tsai et al. 2018; Zhang et al. 2018).
For example, (Ren et al. 2020) proposed to extract both deep collaborative features and deep high-to-low features to balance the individual intra-image information, and (Wang et al. 2019) employed the VGG-19 framework to extract high-level group-wise semantic features and visual features for co-saliency detection. Although current feature extraction methods (including handcrafted features and deep features) have achieved success in co-saliency detection, a single type of feature is still insufficient to capture the complex variations between co-salient objects and backgrounds. To this end, multi-view features have been extracted to explore both intra-image and inter-image information for co-saliency detection (Jiang et al. 2019a; Zhang et al. 2020a).

Given the image features, both traditional machine learning methods and deep learning methods have been designed to detect the co-saliency across a group of images. For example, (Zhang, Meng, and Han 2016) regarded the co-saliency detection task as multi-instance learning, where each image and each superpixel region are regarded as a bag and an instance, respectively, so that a multi-instance classifier can predict the locations of the co-salient objects at the instance level. However, feature extraction and co-saliency detection are two separate processes in many traditional machine learning methods. As a result, the features cannot be adjusted based on the result of co-saliency detection, leading to suboptimal detection performance. To address this issue, deep learning integrates these two processes in a unified framework so that each can be adaptively adjusted by the other, which makes it easier to reach good co-saliency detection performance. For example, fully convolutional neural networks were designed to automatically learn high-level semantic features by modeling collaborative relationships among the images (Zhang et al. 2019). Recently, Graph Convolutional Networks (GCNs) were designed to utilize both the feature information and the relationship among the images (i.e., the graph) to improve the performance of co-saliency detection (Jiang et al. 2019a; Xu et al. 2020). However, previous deep learning methods suffer from drawbacks that severely limit their detection effectiveness. For example, deep learning methods focus on extracting self-learnt features from the images without considering their semantic meaning, and thus lack interpretability. Moreover, the convolutional layers and pooling operations in some deep learning methods decrease the size of the feature maps, which easily results in the loss of boundary details (Zhang et al. 2019).

Figure 1: The architecture of the proposed framework for co-saliency detection. Specifically, it involves three key steps: (a) Feature Extraction extracts three-scale deep features to represent each image; (b) Feature Fine-tuning fine-tunes the multi-scale features to obtain discriminative features by considering their common and complementary information; (c) Detection conducts a binary classification task to distinguish the common salient foregrounds from backgrounds.
In this paper, we propose a novel GCN method to fuse multi-scale features based on superpixel regions/clusters for co-saliency detection. To do this, our proposed method involves three steps, i.e., feature extraction, feature fusion by the proposed GCN method, and co-saliency detection, as shown in Figure 1. In the feature extraction step, we first employ the Simple Linear Iterative Clustering (SLIC) algorithm (Achanta et al. 2012) to obtain superpixel-based regions/clusters, including sub-blocks of the background and the saliency regions, for each image. The motivation is that the superpixel representation may adhere to image boundaries better than the pixel representation (Zhang et al. 2018). We also employ VGG-16 to generate multi-scale features for each image. Furthermore, we convert the multi-scale features into vector representations of the image based on the superpixel clusters. In the feature fine-tuning step, we design a new graph fusion method to fine-tune the features of each scale with the help of the features from the other scales. The goal is to comprehensively explore the intra-image correlation within one image and the inter-region relationship across the images. Finally, the output features are first concatenated together and then passed through a fully-connected layer to conduct a binary classification, i.e., the co-saliency detection task is regarded as a classification task.

Compared to previous methods, the contributions of our method are as follows.

- This paper first extracts multi-scale features and then designs a new graph fusion method to fine-tune these features. The multi-scale features can detect patterns of different sizes in the images, and the fusion method fine-tunes the multi-scale features to extract the complementary information and the common information among them. It is noteworthy that previous methods (Zhang et al. 2016; Han et al. 2017) extract handcrafted features, which makes it difficult to explore the comprehensive information among the images. Other methods extract multi-view features to address the limitation of handcrafted features, but leave the correlation among the multiple features unexplored (Liu et al. 2019; Jiang et al. 2019a; Zhang et al. 2020a). Hence, our method is more flexible than these methods.

- This paper proposes a new dynamic GCN method that jointly conducts multi-graph fusion, graph learning, and feature learning in a unified framework. In the literature, (Jiang et al. 2019a) and (Zhang et al. 2020a) focused on conducting multi-graph learning on multi-view data by considering the consistency among the graphs (i.e., the common information) while ignoring the complementary or private information across multi-scale features.

Methodology

Overview

Denoting $\mathcal{P} = \{P_n\}_{n=1}^{N}$ as a set of $N$ related images, co-saliency detection is designed to output the map matrices $\mathcal{M} = \{M_n\}_{n=1}^{N}$, which are used to distinguish the common salient foregrounds from the backgrounds. To this end, our proposed method includes the three steps visualized in Figure 1. Given the input image set $\mathcal{P}$, we first employ the SLIC algorithm to conduct superpixel segmentation and obtain superpixel regions or clusters. Meanwhile, we employ the VGG-16 model (Simonyan and Zisserman 2014) to convert each input image to multi-scale images by removing the fully-connected layers and the softmax layer of the VGG-16 model. Specifically, we store the feature maps output at the third, fourth, and fifth pooling layers.
After this, we combine the superpixel regions and the feature maps of each scale to obtain three hierarchical features $\mathcal{X} = \{X^v\}_{v=1}^{3}$ as the multi-scale features of $\mathcal{P}$, thus converting the co-saliency detection task into a classification task based on superpixel clusters.

Feature Extraction

Perceptual and semantic visual features are essential for co-saliency detection (Zha et al. 2020; Zhang et al. 2018). A superpixel is usually defined as a set of pixels with common characteristics such as pixel intensity, so superpixel-based handcrafted features have been shown to carry more semantic information and perceptual meaning than pixel-based handcrafted features (Gao et al. 2020). However, handcrafted features are not robust to complex visual scenes (Zhang et al. 2020b; Ren et al. 2020). On the contrary, deep features can capture the changes within one image or across images to produce robust co-saliency detection models, but they lack semantic meaning. In this paper, we propose to integrate superpixel segmentation with deep features to generate multi-scale deep features for each image, aiming to produce semantic and discriminative features as well as to convert the co-saliency detection task into a classification problem based on superpixel clusters/regions.

Given a set of $N$ related images $\mathcal{P} = \{P_i\}_{i=1}^{N}$ ($P_i \in \mathbb{R}^{224 \times 224 \times 3}$), we employ the SLIC algorithm (Achanta et al. 2012) to generate superpixel regions for each image $P_i$ by clustering pixels based on their color similarity and proximity in the image plane. As a result, we obtain $n_i$ superpixels for each image. For simplicity, we set all $n_i$ ($i = 1, \dots, N$) to the same value $n$ for a group of images, and denote $N' = Nn$. Meanwhile, we input each image $P_i$ into the pre-trained VGG-16 model to generate three feature maps at different scales, i.e., $P_i \rightarrow \{\tilde{P}^1_i, \tilde{P}^2_i, \tilde{P}^3_i\}$, where $\tilde{P}^1_i \in \mathbb{R}^{56 \times 56 \times 256}$, $\tilde{P}^2_i \in \mathbb{R}^{28 \times 28 \times 512}$, and $\tilde{P}^3_i \in \mathbb{R}^{14 \times 14 \times 512}$ respectively denote the maps obtained from the third, fourth, and fifth pooling layers of the VGG-16 model. We then upsample these feature maps to the same spatial size, i.e., $X^1_i \in \mathbb{R}^{224 \times 224 \times 256}$, $X^2_i \in \mathbb{R}^{224 \times 224 \times 512}$, and $X^3_i \in \mathbb{R}^{224 \times 224 \times 512}$, where 256 and 512 indicate the number of filters, aiming to avoid the loss of boundary details caused by the reduced feature map sizes in $\{\tilde{P}^1_i, \tilde{P}^2_i, \tilde{P}^3_i\}$ (Gao et al. 2020). For each map $X^j_i$ ($i = 1, \dots, N$ and $j = 1, 2, 3$), we use the result of the superpixel segmentation to partition it into $n$ regions. The representation of each superpixel region in each channel is a scalar, namely the average value of the activations of all pixels within the same superpixel. Hence, each image is represented by three matrices at different scales, i.e., $X^1_i \in \mathbb{R}^{n \times 256}$, $X^2_i \in \mathbb{R}^{n \times 512}$, and $X^3_i \in \mathbb{R}^{n \times 512}$, $i = 1, \dots, N$. Furthermore, we use $\mathcal{X} = \{X^1, X^2, X^3\}$ to represent the feature matrices of all relevant images, where $X^1 \in \mathbb{R}^{N' \times 256}$, $X^2 \in \mathbb{R}^{N' \times 512}$, and $X^3 \in \mathbb{R}^{N' \times 512}$. Finally, the initial graph matrix $A^v$ ($v = 1, 2, 3$) for $X^v$ is constructed as $A^v = X^v (X^v)^T \in \mathbb{R}^{N' \times N'}$ (Jiang et al. 2019a).
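To make this feature-extraction step concrete, the sketch below (PyTorch and scikit-image) illustrates how the three scale-specific superpixel feature matrices and the initial graphs could be produced. It is only a minimal illustration under our own assumptions: the tapped layer indices of `torchvision`'s VGG-16 are chosen so that the intermediate maps have the 56/28/14 spatial widths stated above, ImageNet input normalization is omitted, and names such as `multiscale_superpixel_features` are hypothetical, not the authors' released code.

```python
# Minimal sketch of the feature-extraction step (Fig. 1a); all choices are illustrative.
import numpy as np
import torch
import torch.nn.functional as F
from torchvision import models
from skimage.segmentation import slic

vgg = models.vgg16(pretrained=True).features.eval()  # convolutional trunk only (API depends on torchvision version)
TAPS = {15, 22, 29}  # assumed indices whose ReLU outputs are 56x56x256, 28x28x512, 14x14x512

def multiscale_superpixel_features(image, n_superpixels=300):
    """image: float32 array of shape (224, 224, 3) in [0, 1] (normalization omitted here)."""
    # 1) SLIC superpixel segmentation: a region label per pixel.
    labels = slic(image, n_segments=n_superpixels, compactness=10)
    # 2) Forward pass through VGG-16, keeping the three tapped feature maps.
    x = torch.from_numpy(image).permute(2, 0, 1).unsqueeze(0).float()
    feats = []
    with torch.no_grad():
        for idx, layer in enumerate(vgg):
            x = layer(x)
            if idx in TAPS:
                # 3) Upsample back to 224x224 so the superpixel masks apply directly.
                feats.append(F.interpolate(x, size=(224, 224), mode="bilinear",
                                           align_corners=False))
                if len(feats) == 3:
                    break
    # 4) Average the activations inside each superpixel -> one row per region.
    region_ids = np.unique(labels)
    scales = []
    for fmap in feats:                      # fmap: (1, C, 224, 224)
        fmap = fmap.squeeze(0).numpy()      # (C, 224, 224)
        rows = [fmap[:, labels == r].mean(axis=1) for r in region_ids]
        scales.append(np.stack(rows))       # (n_regions, C)
    return scales                           # [X1 (n, 256), X2 (n, 512), X3 (n, 512)]

def initial_graph(Xv):
    """Initial scale-specific affinity A^v = X^v (X^v)^T (Jiang et al. 2019a)."""
    return Xv @ Xv.T
```

Calling the function once per image and stacking the per-image rows would yield the matrices $X^1 \in \mathbb{R}^{N' \times 256}$ and $X^2, X^3 \in \mathbb{R}^{N' \times 512}$ used in the following sections.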
Feature Fine-tuning

In this section, we first review the classical GCN model and then present our proposed graph fusion method in detail.

Graph convolutional network. The GCN method aims to learn a latent representation $O^v = f(X^v, G^v; \Theta^v)$ of the original feature matrix $X^v$ ($v = 1, 2, 3$) while preserving the graph structure of all data points (Kipf and Welling 2016). Generally, GCN includes one input layer, two hidden layers, and one perceptron layer.

Given the input matrix $X^v \in \mathbb{R}^{N' \times d_v}$, which has $N'$ superpixel regions (or samples) with $d_v$ features each, $A^v$ denotes the pair-wise correlation between any two samples. The layer-wise propagation in the $k$-th hidden layer of GCN is

$$F^v_{k+1} = \sigma\left(\tilde{D}_v^{-\frac{1}{2}} \tilde{A}^v \tilde{D}_v^{-\frac{1}{2}} F^v_k \Theta^v_k\right) \qquad (1)$$

where $k = 0, 1, \dots, K-1$ and $K$ is the number of layers, $F^v_0 = X^v$ is the initial feature matrix, $F^v_k$ is the output feature map of the $k$-th layer, $\tilde{A}^v = A^v + I_n$ is the adjacency matrix of the undirected graph with self-loops, and $I_n$ is the identity matrix. $\tilde{D}_v = \mathrm{diag}(\tilde{d}^v_1, \dots, \tilde{d}^v_n)$ is a diagonal matrix with $\tilde{d}^v_i = \sum_j \tilde{a}^v_{ij}$, and $\sigma(\cdot)$ is an activation function such as ReLU. The last perceptron layer is defined as

$$O^v = \mathrm{softmax}\left(\tilde{D}_v^{-\frac{1}{2}} \tilde{A}^v \tilde{D}_v^{-\frac{1}{2}} F^v_K \Theta^v_K\right) \qquad (2)$$

where $O^v$ is the prediction matrix and $\Theta^v = (\Theta^v_0, \dots, \Theta^v_K)$ are trainable parameters, which can be learned by minimizing the cross-entropy loss over the labeled samples:

$$\mathcal{L}_{GCN} = -\sum_{i \in L} \sum_{j=1}^{c} y_{ij} \ln o^v_{ij} \qquad (3)$$

where $L$ denotes the set of labelled samples, $c$ is the number of classes, $y_{ij}$ is the ground truth, and $o^v_{ij}$ is the corresponding prediction.

Different from Convolutional Neural Networks (CNNs) (Krizhevsky, Sutskever, and Hinton 2012), which regard only the feature matrix as the input, GCN regards both the feature matrix and the graph as inputs to generate deep features that preserve the local structure of the graph. As a result, GCN has been demonstrated to outperform CNN in many real applications (Kipf and Welling 2016). Moreover, previous studies (e.g., (Chen, Wu, and Zaki 2019; Jiang et al. 2019b)) showed that the quality of the graph is the key to the effectiveness of the GCN method. In the literature, many methods can be used for constructing the graph, e.g., the k-Nearest-Neighbor (kNN) graph, the ε-neighborhood graph, the fully connected graph, etc. In many previous GCN methods, the graph construction is independent of the feature learning process, which easily results in sub-optimal feature learning. To address this issue, dynamic GCN methods jointly conduct graph learning and feature learning, where the graph is updated by the optimized features and the features are in turn adjusted by the updated graph. As a result, the quality of the graph is improved in a data-driven way, and the output features become more discriminative. To this end, the objective function of the graph learning is

$$\mathcal{L}_{GL}: \ \min_{A^v} \sum_{i,j=1}^{n} \left\| x^v_i Q^v - x^v_j Q^v \right\|_2^2 \, a^v_{ij} + \left\| A^v \right\|_F^2, \quad \text{s.t.} \ \sum_{j=1}^{n} a^v_{ij} = 1, \ a^v_{ij} > 0, \ i, j = 1, \dots, n \qquad (4)$$

where $Q^v \in \mathbb{R}^{d_v \times r}$ ($r \leq d_v$) is a projection matrix and $a^v_{ij}$ denotes the similarity between $x^v_i$ and $x^v_j$.

Figure 2: The structure of the proposed graph fusion, which conducts feature fine-tuning by exploring the common and complementary information of multi-scale features.

Finally, the dynamic GCN method adds the graph learning constraint (i.e., Eq. (4)) as a regularization term of the GCN model:

$$\mathcal{L} = \mathcal{L}_{GCN} + \gamma \mathcal{L}_{GL} \qquad (5)$$

where $\gamma$ is a tuning parameter. Similar to the literature (Li et al. 2018; Jiang et al. 2019a), which approximately optimizes a new variable with fewer tuning parameters rather than directly optimizing Eq. (4) at an expensive time cost, this paper parameterizes $A^v$ through the following formulation:

$$A^v = \sigma\left(X^v Q^v (X^v Q^v)^T\right) \qquad (6)$$

where $\sigma(\cdot)$ is the sigmoid activation function and $Q^v$ is a learnable projection matrix.
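The dynamic GCN branch described by Eqs. (1), (2), and (6) can be summarized by the following PyTorch sketch, in which the adjacency matrix is re-estimated from the projected features through a sigmoid instead of being fixed. It is a schematic re-implementation under our own assumptions (module names, hidden sizes, and the projection dimension are illustrative), not the authors' code.

```python
# Sketch of one dynamic-GCN branch: learned graph (Eq. (6)) + GCN propagation (Eqs. (1)-(2)).
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_adj(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} used in Eqs. (1)-(2)."""
    A_tilde = A + torch.eye(A.size(0), device=A.device)
    d = A_tilde.sum(dim=1)
    D_inv_sqrt = torch.diag(d.clamp(min=1e-8).pow(-0.5))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

class DynamicGCNBranch(nn.Module):
    def __init__(self, d_in, d_hidden, d_out, d_proj=64):
        super().__init__()
        self.Q = nn.Linear(d_in, d_proj, bias=False)       # learnable Q^v of Eq. (6)
        self.theta1 = nn.Linear(d_in, d_hidden, bias=False)
        self.theta2 = nn.Linear(d_hidden, d_out, bias=False)

    def learned_graph(self, X):
        Z = self.Q(X)                                      # X^v Q^v
        return torch.sigmoid(Z @ Z.t())                    # A^v = sigma(X^v Q^v (X^v Q^v)^T)

    def forward(self, X):
        A_hat = normalize_adj(self.learned_graph(X))
        H = F.relu(A_hat @ self.theta1(X))                 # hidden layer, Eq. (1)
        return F.softmax(A_hat @ self.theta2(H), dim=1)    # perceptron layer, Eq. (2)
```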
Proposed graph fusion. In this work, we design a new dynamic GCN based on Eq. (5) and Eq. (6) to fine-tune the features of each scale, which were obtained from VGG-16 and superpixel segmentation. Thus, we obtain one dynamic GCN model for the features of each scale. However, each GCN model would be trained independently of the other two. Hence, we propose a fusion method that combines the three dynamic GCN models to explore the common information among the three models and the complementary information within each model. The proposed fusion structure is shown in Figure 2. Specifically, given the feature matrix $X^v$ and the corresponding graph $A^v$, the layer-wise propagation in the hidden layer of our proposed GCN method is defined as

$$F^v_t = \mathrm{ReLU}\left(\hat{A}^v \hat{F}^v_{(t-1)} \Theta^v_t\right) \qquad (7)$$

where $F^v_t \in \mathbb{R}^{N' \times d^v_t}$ is the new representation of $\hat{F}^v_{(t-1)}$ in the $t$-th layer, $\hat{A}^v = D_v^{-\frac{1}{2}} (A^v + I_{N'}) D_v^{-\frac{1}{2}}$ is the normalized adjacency matrix, and $D_v$ is the diagonal degree matrix of $(A^v + I_{N'})$. $I_{N'}$ is an identity matrix and $\mathrm{ReLU}(\cdot)$ is an activation function. $\Theta^v_t$ is a trainable projection matrix for the $v$-th superpixel feature.

Since the multi-scale features describe the same patterns in a group of images, the features at different sizes/scales can capture the common foregrounds at different scales. Moreover, the features of each scale carry complementary information (or private information, e.g., different foregrounds) that differs from the features of the other scales, while all features should share common information (i.e., the common foregrounds at different scales), as they are assumed to contain the same foregrounds. If the common information is detected, these features will be discriminative for co-saliency detection. Meanwhile, the differences among the features can also benefit the learning of discriminative features. To this end, we define $\hat{F}^v_{(t-1)}$ as

$$\hat{F}^v_{(t-1)} = \sum_{v'=1}^{V} \alpha^{v'}_{(t-1)} F^{v'}_{(t-1)} \qquad (8)$$

where $\alpha^{v'}_{(t-1)}$ indicates the contribution (or weight) of $F^{v'}_{(t-1)}$ to the $v$-th fused features $\hat{F}^v_{(t-1)}$ in the $(t-1)$-th layer, and $\alpha_{(t-1)} = [\alpha^1_{(t-1)}, \dots, \alpha^V_{(t-1)}]$ is a trainable vector. Specifically, our GCN method outputs $F^v_{(t-1)}$, which is then combined with all the other $F^{v'}_{(t-1)}$ ($v' \neq v$) to generate the $v$-th fused features $\hat{F}^v_{(t-1)}$ in the $(t-1)$-th layer. As a result, the feature learning in each scale has access to both the complementary information (i.e., $F^v_{(t-1)}$) and the common information from the other scales $F^{v'}_{(t-1)}$ ($v' \neq v$). After the new representation $F^v_t$ is obtained, its final output is defined as

$$O^v = \mathrm{softmax}\left(\hat{A}^v F^v_t \Theta^v_t\right) \qquad (9)$$

After conducting Eq. (9), we obtain three outputs and concatenate them to obtain

$$Z = \mathrm{FC}([O^1, O^2, O^3]) \qquad (10)$$

where $\mathrm{FC}$ denotes the fully-connected layer and $Z$ denotes the predicted labels.

Co-saliency detection is designed to propagate information from intra-superpixel correlations across the relevant images. Hence, we only consider the prediction performance and employ the class-balanced cross-entropy loss

$$\mathcal{L} = -\sum_{i=1}^{N} \sum_{j=1}^{n} \left[ \eta_i \, y_i(j) \log z_i(j) + (1 - \eta_i)(1 - y_i(j)) \log\left(1 - z_i(j)\right) \right] \qquad (11)$$

where $y_i(j)$ and $z_i(j)$ are the ground truth and the predicted result of the $j$-th superpixel of the $i$-th image, respectively, and $\eta_i$ is the ratio of salient superpixel clusters among all superpixel clusters, which can be calculated in advance by applying the same superpixel partition to the ground truths.
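The fusion scheme of Eqs. (7)-(11) can be sketched as follows. The sketch is only indicative: it keeps the trainable mixing weights of Eq. (8), the graph propagation of Eq. (7), and the class-balanced loss of Eq. (11), but folds the per-scale softmax of Eq. (9) and the fully-connected layer of Eq. (10) into a single sigmoid head for brevity; all names and dimensions are our own assumptions.

```python
# Sketch of the multi-scale fusion GCN (Eqs. (7)-(11)); a simplified, assumed implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionGCN(nn.Module):
    def __init__(self, dims=(256, 512, 512), d_hidden=128, n_layers=2):
        super().__init__()
        self.V = len(dims)
        self.proj = nn.ModuleList(nn.Linear(d, d_hidden, bias=False) for d in dims)
        self.theta = nn.ModuleList(                      # Theta^v_t per scale and layer
            nn.ModuleList(nn.Linear(d_hidden, d_hidden, bias=False)
                          for _ in range(n_layers)) for _ in range(self.V))
        # Trainable fusion weights: row v holds the alpha vector mixing all scales for scale v.
        self.alpha = nn.Parameter(torch.full((self.V, self.V), 1.0 / self.V))
        self.fc = nn.Linear(self.V * d_hidden, 1)        # concatenation + FC, cf. Eq. (10)

    def forward(self, Xs, A_hats):
        # Xs: list of (N', d_v) feature matrices; A_hats: list of normalized graphs (N', N').
        F_prev = [self.proj[v](Xs[v]) for v in range(self.V)]
        for t in range(len(self.theta[0])):
            F_next = []
            for v in range(self.V):
                # Eq. (8): weighted mixture of all scales feeds scale v.
                mixed = sum(self.alpha[v, u] * F_prev[u] for u in range(self.V))
                # Eq. (7): graph propagation with ReLU.
                F_next.append(F.relu(A_hats[v] @ self.theta[v][t](mixed)))
            F_prev = F_next
        z = torch.sigmoid(self.fc(torch.cat(F_prev, dim=1)))   # per-superpixel saliency score
        return z.squeeze(1)

def balanced_bce(z, y, eta):
    """Class-balanced cross-entropy of Eq. (11); eta is the salient-superpixel ratio."""
    return -(eta * y * torch.log(z + 1e-8)
             + (1 - eta) * (1 - y) * torch.log(1 - z + 1e-8)).mean()
```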
Experiments

We experimentally evaluated our method against four comparison methods on three image data sets, i.e., iCoseg, Cosal2015, and MSRC, in terms of four evaluation metrics.

Data Sets

The data set iCoseg (Batra et al. 2010) contains 643 images within 38 different categories. Each image has a manually labeled pixel-wise ground truth for evaluation. The data set Cosal2015 (Zhang et al. 2016) consists of 2015 images of 50 categories, and each group suffers from various challenging issues such as complex environments, occlusions, target appearance variations, and background clutter. The data set MSRC (Winn, Criminisi, and Minka 2005) contains 233 images within 7 categories. The images in this data set are complicated, as the common objects vary unpredictably in color and shape appearance.

Table 1: Results of all methods on three image data sets. The bold number stands for the best result in each column.

iCoseg:

| Method | AUC | Fβ | Sα | AP |
| --- | --- | --- | --- | --- |
| CBCS | 0.9315 | 0.7301 | 0.6707 | 0.7958 |
| ESMG | 0.9317 | 0.7094 | 0.7436 | 0.7728 |
| EGNet | 0.9598 | 0.8651 | 0.8365 | **0.8751** |
| MGLCN | 0.9671 | **0.8912** | 0.8355 | 0.8263 |
| Proposed | **0.9727** | 0.8787 | **0.8391** | 0.8742 |

Cosal2015:

| Method | AUC | Fβ | Sα | AP |
| --- | --- | --- | --- | --- |
| CBCS | 0.8077 | 0.5489 | 0.5439 | 0.5859 |
| ESMG | 0.7687 | 0.4803 | 0.5524 | 0.5111 |
| EGNet | 0.9303 | 0.7909 | 0.8206 | 0.8077 |
| MGLCN | 0.9534 | 0.8845 | 0.8142 | 0.8519 |
| Proposed | **0.9716** | **0.8928** | **0.9341** | **0.8817** |

MSRC:

| Method | AUC | Fβ | Sα | AP |
| --- | --- | --- | --- | --- |
| CBCS | 0.8083 | 0.6563 | 0.4959 | 0.6992 |
| ESMG | 0.7875 | 0.6111 | 0.5452 | 0.6112 |
| EGNet | 0.8624 | 0.7714 | 0.7183 | 0.7618 |
| MGLCN | 0.9415 | 0.8559 | 0.8001 | 0.8427 |
| Proposed | **0.9515** | **0.8565** | **0.8212** | **0.9158** |

Figure 3: Visualization comparisons of all methods on three images, each of which is from one data set.

Comparison Methods

We used four state-of-the-art co-saliency detection methods to evaluate the effectiveness of our proposed framework.

Cluster-Based Co-Saliency detection (CBCS) integrates three bottom-up saliency cues (the spatial distribution cue, the global contrast cue, and the corresponding cue) in a multiplicative way to produce the final co-saliency maps (Fu, Cao, and Tu 2013).

Efficient Saliency-Model-Guided co-saliency detection (ESMG) conducts a two-step saliency-guided method, where the first step uses manifold ranking to recover the co-salient parts missing from each single saliency map and the second step utilizes a ranking framework with various queries to capture the corresponding correlations that guide the co-saliency maps (Li et al. 2014).

Edge Guidance Network (EGNet) designs a single base network consisting of three parts, i.e., edge feature extraction, salient object feature extraction, and a one-to-one guidance network, to improve saliency detection performance (Zhao et al. 2019).

Multiple Graph Learning and Convolutional Network (MGLCN) explores superpixel-level similarity to replace pixel-level saliency detection by embedding both intra-graph and inter-graph learning in a graph convolutional network framework (Jiang et al. 2019a).

Among these methods, CBCS and ESMG are traditional machine learning methods, while EGNet, MGLCN, and our method are deep learning methods.

Figure 4: ROC and PR curves of all methods on three image data sets: (a) iCoseg, (b) Cosal2015, (c) MSRC.

Experiment Setting

In our experiments, we resized all images to 224 × 224 and set the number of superpixel regions to 5000. For the deep learning methods (i.e., EGNet, MGLCN, and our method), we selected the MSRA-B data set (Liu et al. 2010) to train the deep models.
In our method, we set the maximal number of epochs to 10,000 with the Adam optimizer (Kingma and Ba 2014), and set the initial learning rate and the weight decay to 1e-5 and 0.005, respectively. The stopping criterion was no decrease of the objective function for 100 consecutive epochs during training. For a fair comparison, we obtained the source codes of the comparison methods online or from the authors, and followed the experimental settings of the corresponding literature so that all of them output their best performance. All experiments were conducted on a server with four NVIDIA Quadro P4000 (8 GB) GPUs.

The evaluation metrics included the Precision-Recall (PR) curve, the Receiver Operating Characteristic (ROC) curve, the Area Under the Curve (AUC) score, the $F_\beta$ score, the $S_\alpha$ score, and Average Precision (AP) (Fan et al. 2017). Specifically, the $F_\beta$ score is defined as

$$F_\beta = \frac{(1+\beta^2)\,\mathrm{Precision} \times \mathrm{Recall}}{\beta^2\,\mathrm{Precision} + \mathrm{Recall}} \qquad (12)$$

where Precision and Recall are obtained using a self-adaptive threshold $T = \mu + \varepsilon$, with $\mu$ and $\varepsilon$ being the mean and the standard deviation of the saliency map, respectively. We followed (Achanta et al. 2009) and set $\beta^2$ to 0.3. The $S_\alpha$ score describes the structural similarity between the ground truths and the corresponding co-saliency maps, and we followed the literature (Fan et al. 2017) to set all its hyper-parameters to 0.5.
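For completeness, the $F_\beta$ computation with the self-adaptive threshold $T = \mu + \varepsilon$ described above can be written as the short sketch below (NumPy, $\beta^2 = 0.3$); the function name is illustrative.

```python
# Sketch of the F_beta metric (Eq. (12)) with the adaptive threshold T = mu + epsilon.
import numpy as np

def f_beta(saliency_map, gt_mask, beta2=0.3):
    """saliency_map: float array in [0, 1]; gt_mask: binary array of the same shape."""
    T = saliency_map.mean() + saliency_map.std()     # adaptive threshold T = mu + eps
    pred = (saliency_map >= T).astype(np.float32)
    gt = gt_mask.astype(np.float32)
    tp = (pred * gt).sum()
    precision = tp / (pred.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return ((1 + beta2) * precision * recall) / (beta2 * precision + recall + 1e-8)
```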
Results Analysis

We list the results of all methods on the three benchmark data sets in Table 1, where the bold number stands for the best result in each column. We also report the ROC and PR curves of all methods on all data sets in Figure 4.

First, our proposed framework obtained the best performance, followed by MGLCN, EGNet, CBCS, and ESMG. For example, our method improved on average by 1.13%, 0.12%, 4.82%, and 5.03% over the best comparison method (i.e., MGLCN), and by 13.60%, 27.57%, 25.11%, and 25.89% over the worst comparison method (i.e., ESMG), in terms of AUC, Fβ, Sα, and AP, respectively, on the three data sets. This indicates the success of our two strategies for co-saliency detection, i.e., generating multi-scale images for every image, and fusing the multi-scale features to produce discriminative features. In particular, the deep learning methods (i.e., EGNet, MGLCN, and our method) outperformed the traditional methods (i.e., CBCS and ESMG), as the former extract more informative features to describe the salient regions than the latter. This indicates that deep features are suitable for co-saliency detection.

Second, among the deep learning methods, EGNet achieved the worst performance, since the other methods (i.e., MGLCN and our method) extract multiple deep features for co-saliency detection. For example, MGLCN improved on average by 3.65%, 6.81%, 2.48%, and 2.54% over EGNet on the three data sets, in terms of AUC, Fβ, Sα, and AP, respectively. This implies that graph convolutional structures are well suited for co-saliency detection.

Figure 5: Results of our model without/with the process of feature fusion on three data sets: (a) iCoseg, (b) Cosal2015, (c) MSRC.

Figure 6: Visual comparisons between the classification task (left) and the regression task (right) using our framework on three images from the used data sets.

Ablation Analysis

In this section, we verify the effectiveness of our model from two aspects: (1) the effectiveness of our fusion method; and (2) the regression performance of our method.

Graph fusion effectiveness. In our framework, we fuse the features from multiple scales to explore the complementary information within each scale and the common information among all scales. However, we can also omit the fusion process, i.e., separately train three dynamic GCN models and then concatenate their outputs to conduct co-saliency detection; we denote this variant Proposed-s. We report the results of both Proposed and Proposed-s in Figure 5. Obviously, Proposed outperformed Proposed-s on all data sets in terms of the different evaluation metrics. For example, Proposed improved on average by 9.4%, 11.32%, 8.93%, and 9.29% over Proposed-s, in terms of AUC, Fβ, Sα, and AP, respectively. This indicates the importance of feature fusion for multi-scale features.

Regression effectiveness. In this paper, we regarded the co-saliency detection task as a binary classification task, and report the visualizations of all methods in Figure 3. Actually, we can also regard the co-saliency detection task as a regression task, whose visualization can capture the edge boundary more easily than the classification task, because the regression task assigns continuous values to the edge boundary while the classification task assigns binary values. To this end, we report the visualization of our method on the regression task in Figure 6. Comparing the two tasks in terms of visualization, the edge boundary produced by the regression task is more blurred, since it takes continuous pixel gray-scale values, than the one produced by the classification task. Hence, the proposed framework can be designed for both the classification task and the regression task.

Conclusion

In this paper, we proposed a new co-saliency detection framework with two strategies for generating discriminative features, i.e., multi-scale features to capture patterns of different sizes across the images, and feature fusion to extract the common and complementary information among the multi-scale features. Moreover, we embedded these two strategies into our designed dynamic GCN model to jointly conduct feature fusion, graph learning, and feature learning. Experimental results on three benchmark data sets demonstrated that our framework outperformed the state-of-the-art co-saliency detection methods in terms of several evaluation metrics. Moreover, the experimental results also verified the effectiveness of each strategy in our co-saliency detection framework.

Acknowledgements

This work was partially supported by the Natural Science Foundation of China (Grants No. 61876046 and 61672177); the Guangxi Collaborative Innovation Center of Multi-Source Information Integration and Intelligent Processing; the Guangxi Bagui Teams for Innovation and Research; the Marsden Fund of New Zealand (MAU1721); the Project of Guangxi Science and Technology (GuiKe AD17195062); and the Sichuan Science and Technology Program (No. 2019YFG0533).

References

Achanta, R.; Hemami, S.; Estrada, F.; and Süsstrunk, S. 2009. Frequency-tuned salient region detection. In CVPR, 1597–1604.

Achanta, R.; Shaji, A.; Smith, K.; Lucchi, A.; Fua, P.; and Süsstrunk, S. 2012. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(11): 2274–2282.

Batra, D.; Kowdle, A.; Parikh, D.; Luo, J.; and Chen, T. 2010. iCoseg: Interactive co-segmentation with intelligent scribble guidance. In CVPR, 3169–3176.

Chen, Y.; Wu, L.; and Zaki, M. J. 2019. Deep iterative and adaptive learning for graph neural networks. arXiv preprint arXiv:1912.07832.
Fan, D.-P.; Cheng, M.-M.; Liu, Y.; Li, T.; and Borji, A. 2017. Structure-measure: A new way to evaluate foreground maps. In ICCV, 4548–4557.

Fu, H.; Cao, X.; and Tu, Z. 2013. Cluster-based co-saliency detection. IEEE Transactions on Image Processing 22(10): 3766–3778.

Gao, G.; Zhao, W.; Liu, Q.; and Wang, Y. 2020. Co-Saliency Detection with Co-Attention Fully Convolutional Network. IEEE Transactions on Circuits and Systems for Video Technology.

Han, J.; Quan, R.; Zhang, D.; and Nie, F. 2017. Robust object co-segmentation using background prior. IEEE Transactions on Image Processing 27(4): 1639–1651.

Huang, R.; Feng, W.; and Sun, J. 2015. Saliency and co-saliency detection by low-rank multiscale fusion. In ICME, 1–6.

Jerripothula, K. R.; Cai, J.; Lu, J.; and Yuan, J. 2017. Object co-skeletonization with co-segmentation. In CVPR, 3881–3889.

Jerripothula, K. R.; Cai, J.; and Yuan, J. 2016. CATS: Co-saliency activated tracklet selection for video co-localization. In ECCV, 187–202.

Jiang, B.; Jiang, X.; Zhou, A.; Tang, J.; and Luo, B. 2019a. A unified multiple graph learning and convolutional network model for co-saliency estimation. In ACM MM, 1375–1382.

Jiang, B.; Zhang, Z.; Lin, D.; Tang, J.; and Luo, B. 2019b. Semi-supervised learning with graph learning-convolutional networks. In CVPR, 11313–11320.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kipf, T. N.; and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NIPS, 1097–1105.

Li, R.; Wang, S.; Zhu, F.; and Huang, J. 2018. Adaptive graph convolutional neural networks. In AAAI.

Li, Y.; Fu, K.; Liu, Z.; and Yang, J. 2014. Efficient saliency-model-guided visual co-saliency detection. IEEE Signal Processing Letters 22(5): 588–592.

Liu, T.; Yuan, Z.; Sun, J.; Wang, J.; Zheng, N.; Tang, X.; and Shum, H.-Y. 2010. Learning to detect a salient object. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(2): 353–367.

Liu, Y.; Han, J.; Zhang, Q.; and Shan, C. 2019. Deep salient object detection with contextual information guidance. IEEE Transactions on Image Processing 29: 360–374.

Papushoy, A.; and Bors, A. G. 2015. Image retrieval based on query by saliency content. Digital Signal Processing 36: 156–173.

Peng, L.; Yang, Y.; Wang, Z.; Huang, Z.; and Shen, H. T. 2020. MRA-Net: Improving VQA via Multi-modal Relation Attention Network. IEEE Transactions on Pattern Analysis and Machine Intelligence. doi:10.1109/TPAMI.2020.3004830.

Ren, J.; Liu, Z.; Li, G.; Zhou, X.; Bai, C.; and Sun, G. 2020. Co-Saliency Detection Using Collaborative Feature Extraction and High-to-Low Feature Integration. In ICME, 1–6.

Shen, F.; Xu, Y.; Liu, L.; Yang, Y.; Huang, Z.; and Shen, H. T. 2018. Unsupervised Deep Hashing with Similarity-Adaptive and Discrete Optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(12): 3034–3044.

Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Tsai, C.-C.; Li, W.; Hsu, K.-J.; Qian, X.; and Lin, Y.-Y. 2018. Image co-saliency detection and co-segmentation via progressive joint optimization. IEEE Transactions on Image Processing 28(1): 56–71.
Wang, B.; Yang, Y.; Xu, X.; Hanjalic, A.; and Shen, H. T. 2017. Adversarial Cross-Modal Retrieval. In ACM MM, 154–162.

Wang, C.; Zha, Z.-J.; Liu, D.; and Xie, H. 2019. Robust deep co-saliency detection with group semantic. In AAAI, volume 33, 8917–8924.

Winn, J.; Criminisi, A.; and Minka, T. 2005. Object categorization by learned universal visual dictionary. In ICCV, volume 2, 1800–1807.

Xu, X.; Wang, T.; Yang, Y.; Hanjalic, A.; and Shen, H. T. 2020. Radial Graph Convolutional Network for Visual Question Generation. IEEE Transactions on Neural Networks and Learning Systems. doi:10.1109/TNNLS.2020.2986029.

Zha, Z.-J.; Wang, C.; Liu, D.; Xie, H.; and Zhang, Y. 2020. Robust Deep Co-Saliency Detection With Group Semantic and Pyramid Attention. IEEE Transactions on Neural Networks and Learning Systems.

Zhang, D.; Fu, H.; Han, J.; Borji, A.; and Li, X. 2018. A review of co-saliency detection algorithms: Fundamentals, applications, and challenges. ACM Transactions on Intelligent Systems and Technology 9(4): 1–31.

Zhang, D.; Han, J.; Li, C.; Wang, J.; and Li, X. 2016. Detection of co-salient objects by looking deep and wide. International Journal of Computer Vision 120(2): 215–232.

Zhang, D.; Meng, D.; and Han, J. 2016. Co-saliency detection via a self-paced multiple-instance learning framework. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(5): 865–878.

Zhang, K.; Li, T.; Liu, B.; and Liu, Q. 2019. Co-saliency detection via mask-guided fully convolutional networks with multi-scale label smoothing. In CVPR, 3095–3104.

Zhang, K.; Li, T.; Shen, S.; Liu, B.; Chen, J.; and Liu, Q. 2020a. Adaptive Graph Convolutional Network with Attention Graph Clustering for Co-saliency Detection. In CVPR, 9050–9059.

Zhang, Z.; Jin, W.; Xu, J.; and Cheng, M.-M. 2020b. Gradient-Induced Co-Saliency Detection. arXiv preprint arXiv:2004.13364.

Zhao, J.-X.; Liu, J.-J.; Fan, D.-P.; Cao, Y.; Yang, J.; and Cheng, M.-M. 2019. EGNet: Edge guidance network for salient object detection. In ICCV, 8779–8788.