# GM-MLIC: Graph Matching based Multi-Label Image Classification

Yanan Wu¹, He Liu¹, Songhe Feng¹, Yi Jin¹, Gengyu Lyu¹ and Zizhang Wu²
¹Beijing Key Laboratory of Traffic Data Analysis and Mining, School of Computer and Information Technology, Beijing Jiaotong University
²Zongmu Technology
{19112034, liuhe1996, shfeng, yjin, 18112030}@bjtu.edu.cn, zizhang.wu@zongmutech.com

The authors contributed equally to this work. Corresponding author.

## Abstract

Multi-Label Image Classification (MLIC) aims to predict the set of labels present in an image. The key to this problem is to mine the associations between image contents and labels, and further obtain the correct assignments between images and their labels. In this paper, we treat each image as a bag of instances and reformulate the task of MLIC as an instance-label matching selection problem. To model this problem, we propose a novel deep learning framework named Graph Matching based Multi-Label Image Classification (GM-MLIC), where the Graph Matching (GM) scheme is introduced owing to its excellent capability of excavating instance-label relationships. Specifically, we first construct an instance spatial graph and a label semantic graph respectively, and then incorporate them into an assignment graph by connecting each instance to all labels. Subsequently, a graph network block is adopted to aggregate and update all node and edge states on the assignment graph to form structured representations for each instance and label. Our network finally derives a prediction score for each instance-label correspondence and optimizes such correspondence with a weighted cross-entropy loss. Extensive experiments conducted on various image datasets demonstrate the superiority of our proposed method.

## 1 Introduction

Multi-label image classification (MLIC) is an essential computer vision task that aims to assign multiple labels to one image based on its content. Compared with single-label image classification, MLIC is more general and practical, since an arbitrary image in the physical world is likely to contain multiple objects. It thus arises in many applications, such as image retrieval [Wei et al., 2019] and medical diagnosis recognition [Ge et al., 2018]. However, it is also more challenging because of the rich semantic information in an image and the complex dependencies between an image and its labels.

Figure 1: Illustration of the ASsignment Graph Construction (ASGC). The upper part designs an instance spatial graph $G_o$, while the lower part builds a label semantic graph $G_l$. Then each instance is connected to all labels to form the final assignment graph $G_A$.

The key to accomplishing the task of MLIC is to effectively explore the valuable semantic information in the image context, and further obtain the correct assignments between images and their labels. A simple and straightforward way is to treat each image as a bag of instances/proposals, cope with the instances in isolation, and convert the multi-label problem into a set of binary classification problems. However, its performance is essentially limited because it ignores the complex topological structure among labels. This has stimulated research on approaches that capture and mine label correlations in various ways.
For example, Recurrent Neural Networks (RNNs) [Wang et al., 2016; Wang et al., 2017] and Graph Convolutional Networks (GCNs) [Chen et al., 2019b; Wang et al., 2020b] are widely used in MLIC frameworks owing to their competitive performance in explicitly modeling label dependencies. However, most of these methods ignore the associations between semantic labels and local image features, and the spatial context of images is not sufficiently exploited. Some other works [Zhu et al., 2017; Chen et al., 2019a] introduce attention mechanisms to adaptively search semantic-aware instance regions and aggregate features from these regions to identify multiple labels. However, due to the lack of fine-grained supervision, these methods can only locate instance regions roughly; they neither consider the interactions among instances nor explicitly describe the instance-label assignment relationship.

To address the above-mentioned issues, in this paper we fully explore the correspondence (matching) between each instance and label, and reformulate the task of MLIC as an instance-label matching selection problem. Accordingly, we propose a novel Graph Matching based Multi-Label Image Classification (GM-MLIC) deep learning model, which simultaneously incorporates instance spatial relationships, label semantic correlations and the co-occurrence possibility of varying instance-label assignments into a unified framework. Specifically, inspired by the Graph Matching (GM) scheme, which is well suited to structured data, we first design an instance spatial graph that represents each instance feature as a node attribute and the relative locations of adjacent instances as edge attributes. Meanwhile, a label semantic graph is employed to capture the overall semantic correlations; it takes the word embedding of each label as the node attribute, and concatenates the attributes of the two nodes (labels) associated with an edge to form the edge attribute. Then each instance is connected to all labels to form the final assignment graph, as shown in Figure 1, which explicitly models the instance-label matching possibility. Furthermore, the Graph Network Block (GNB) is introduced to our framework to perform computation on the constructed assignment graph, forming structured representations for each instance and label via a graph propagation mechanism. Finally, we design a weighted cross-entropy loss to optimize the network output, which indicates the prediction score of each instance-label correspondence. Extensive experiments demonstrate that our proposed method achieves superior performance against state-of-the-art methods.

## 2 Related Work

The task of MLIC has attracted increasing interest recently. A straightforward way to address the problem is to train an independent binary classifier for each label. However, this does not consider the relationships among labels, and the number of possible label combinations grows exponentially as the number of categories increases. To overcome the challenge of such an enormous output space, some works convert the multi-label problem into a set of multi-class problems over region proposals. For example, [Wei et al., 2015] extracted an arbitrary number of object proposals, then aggregated the label confidences of these proposals with max-pooling to obtain the final multi-label predictions.
[Yang et al., 2016] treated each image as a bag of instances/proposals and solved the MLIC task in a multi-instance learning manner. However, the above methods ignore the label correlations in multi-label images when converting MLIC to a multi-class task. Recently, researchers have focused on exploiting label correlations to facilitate the learning process. [Gong et al., 2013] leveraged a ranking-based learning strategy to train deep convolutional neural networks for MLIC and found that a weighted approximate-ranking loss, which implicitly models label correlation, works best. [Wang et al., 2016] utilized recurrent neural networks (RNNs) to transform labels into embedded label vectors, so that the correlations between labels can be employed. [Chen et al., 2019b] used a Graph Convolutional Network to map a group of label semantic embeddings borrowed from natural language processing into inter-dependent classifiers. [Wang et al., 2020b] proposed to model label correlation by superimposing a label graph built from statistical co-occurrence information onto a graph constructed from knowledge priors of labels. However, none of the aforementioned methods consider the associations between semantic labels and local image features, and the spatial context of images has not been sufficiently exploited.

To solve the above issues, recent works on MLIC attempt to model label correlation with region-based multi-label approaches. For example, [Wang et al., 2017] introduced a spatial transformer to locate semantic-aware instance regions and then captured the spatial dependencies of these regions with a Long Short-Term Memory network. [Chen et al., 2019a] incorporated category semantics to better learn instance features and explored their interactions under the guidance of statistical label dependencies. Although the above methods have achieved competitive performance, they neither consider the spatial location relationships among instances nor explicitly describe the instance-label assignment relationships, which may prevent them from effectively representing the categories' visual features. Different from all these methods, we utilize the GM scheme and propose a novel multi-label image classification learning framework called GM-MLIC, where instance spatial relationships, label semantic correlations and instance-label assignment possibilities are simultaneously incorporated into the framework to improve classification performance. The details of the framework are introduced in the following section.

## 3 The Proposed Method

Given a multi-label image dataset $\mathcal{D} = \{(X_i, Y_i)\}_{i=1}^{N}$, we denote by $X_i = \{x_i^1, x_i^2, \ldots, x_i^M\}$ the $i$-th image, which consists of $M$ instances, and by $Y_i = [y_i^1, y_i^2, \ldots, y_i^C]^{\top}$ the ground-truth label vector for $X_i$, where each instance $x_i^j$ is a $d$-dimensional feature vector and $C$ is the number of all possible labels in the dataset. $y_i^c = 1$ indicates that image $X_i$ is annotated with label $c$, and $y_i^c = 0$ otherwise. GM-MLIC aims to learn a multi-label classification model from the instance-level feature vectors together with the image-level ground-truth label vectors, and further assign predictive labels to a test image.

### 3.1 Overview

The overall architecture of the proposed GM-MLIC is illustrated in Figure 2 and consists of two components: the ASsignment Graph Construction (ASGC) and the Instance-Label Matching Selection (ILMS).
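Before detailing the two components, a minimal sketch of the problem setup defined by the notation above; the shapes and variable names here are illustrative assumptions, not taken from the paper.

```python
import torch

# Illustrative shapes (our assumptions): an image is a bag of M instance
# features, supervised only by an image-level multi-hot label vector.
M, d, C = 10, 2048, 80            # instances per image, feature dim, label count

X_i = torch.randn(M, d)           # instance features x_i^1, ..., x_i^M
Y_i = torch.zeros(C)              # ground-truth label vector Y_i
Y_i[[3, 17]] = 1.0                # e.g. the image is annotated with labels 3 and 17

# GM-MLIC learns instance-label matching scores S in [0, 1]^{M x C} and
# fuses them into an image-level prediction p_i in [0, 1]^C (Section 3.4).
```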
The constructed assignment graph takes the instance spatial graph and the label semantic graph as input, and treats each instance-label connection as a candidate matching edge. The matching selection module is the other core component of our learning framework; it introduces the Graph Network Block (GNB) to convolve the information in the neighborhood of each instance and label, forming a structured representation through several convolution operators. Finally, our model derives a prediction score for each instance-label correspondence and optimizes such correspondence with a weighted cross-entropy loss.

Figure 2: Illustration of our proposed deep learning framework for the MLIC task. Overall, our model consists of two major components: the ASsignment Graph Construction (ASGC) and the Instance-Label Matching Selection (ILMS). The details of ASGC are shown in Figure 1. The ILMS module consists of an Encoder, Graph Convolution and a Decoder. The encoder and decoder are designed as MLPs: the encoder transforms all node and edge attributes into a latent space, and the decoder derives instance-label matching scores from the updated graph state. The graph convolution utilizes the Graph Network Block (GNB) to aggregate and update node and edge attributes.

### 3.2 The Assignment Graph Construction

As depicted in Figure 1, we first construct an instance spatial graph to explore the relationships among spatially adjacent instances. Specifically, we feed an image to a pre-trained Faster R-CNN network [Ren et al., 2016] to generate a set of semantic-aware instances, where each instance has a bounding box $B(x, y, w, h)$. Each of these instances is taken as a node in the instance spatial graph $G_o$, in which edges between pairs of nodes are produced by a k-Nearest-Neighbor criterion. Note that $G_o$ is a directed graph, where the attribute of a node is the feature $f$ of the corresponding instance, and the attribute of an edge is the concatenation of the bounding-box coordinates of its source and receiving instances:

$$v_i^o = f_i, \qquad e_{ij}^o = [\hat{B}_i, \hat{B}_j], \tag{1}$$

where $f_i$ and $\hat{B}_i = (x_i, y_i, x_i + w_i, y_i + h_i)$ denote the attributes and location coordinates of the $i$-th instance respectively, and $[\cdot,\cdot]$ denotes the concatenation of its inputs.

Similar to [Chen et al., 2019b; You et al., 2020], we construct a label semantic graph $G_l$ to capture the topological structure of the label space, where each node of the graph is represented by the word embedding of a label. Different from these methods, our $G_l$ does not require pre-defined label co-occurrence statistics; instead it combines the word embeddings of the two connected nodes (labels) to form the initial edge attributes of $G_l$:

$$v_i^l = w_i, \qquad e_{ij}^l = [w_i, w_j], \tag{2}$$

where $w_i$ denotes the word embedding of the $i$-th label.

To explicitly establish the instance-label matching relationship, we connect each instance in $G_o$ to all labels in $G_l$ to form the instance-label assignment graph $G_A$. In $G_A$, the attribute of the matching edge that connects an instance and a label is

$$e_{ij}^m = [v_i^o, v_j^l]. \tag{3}$$

In this way, we convert the problem of building the complex correspondence between an image and its labels into the problem of selecting reliable edges from the constructed assignment graph. Accordingly, the goal of the MLIC problem is transformed into solving this matching selection problem and obtaining the optimal instance-label assignment.
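As a concrete illustration of this construction, the PyTorch sketch below builds the three edge sets of $G_A$ from detector outputs. The kNN criterion (distance between box centres) and all function names are our assumptions, not details taken from the paper.

```python
import torch

def build_assignment_graph(feats, boxes, label_emb, k=3):
    """Hypothetical sketch of ASGC following Eqs. (1)-(3).

    feats:     (M, d) instance features f_i from the detector
    boxes:     (M, 4) boxes stored as (x, y, x + w, y + h), i.e. B_hat
    label_emb: (C, e) word embeddings w_i of the C labels
    """
    M, C = feats.size(0), label_emb.size(0)

    # Instance spatial graph G_o: link each instance to its k nearest
    # neighbours; distance between box centres is our assumed kNN criterion.
    centres = (boxes[:, :2] + boxes[:, 2:]) / 2
    dist = torch.cdist(centres, centres)
    dist.fill_diagonal_(float("inf"))
    knn = dist.topk(min(k, M - 1), largest=False).indices          # (M, k)
    src = torch.arange(M).repeat_interleave(knn.size(1))
    dst = knn.reshape(-1)
    e_o = torch.cat([boxes[src], boxes[dst]], dim=-1)              # Eq. (1)

    # Label semantic graph G_l: fully connected, edge attribute is the
    # concatenation of the two label embeddings, Eq. (2).
    li = torch.arange(C).repeat_interleave(C)
    lj = torch.arange(C).repeat(C)
    e_l = torch.cat([label_emb[li], label_emb[lj]], dim=-1)

    # Matching edges e_m: every instance is connected to every label, Eq. (3).
    e_m = torch.cat([feats.repeat_interleave(C, dim=0),
                     label_emb.repeat(M, 1)], dim=-1)              # (M*C, d+e)
    return (feats, label_emb), (e_o, e_l, e_m), (src, dst)
```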
### 3.3 Modeling Instance-Label Correspondence

As illustrated in Figure 2, our matching selection module is designed on top of the Graph Network Block presented in [Wang et al., 2020a], which defines a class of functions for relational reasoning over graph-structured representations. Specifically, the matching selection module consists of three main components: an Encoder, Graph Convolution and a Decoder.

**Encoder.** The encoder takes the constructed assignment graph $G_A$ as input and transforms its attributes into a latent feature space with two parametric update functions, $\phi_{enc}^{v}$ and $\phi_{enc}^{e}$. In our framework, $\phi_{enc}^{v}$ and $\phi_{enc}^{e}$ are designed as multi-layer perceptrons (MLPs), which take a node attribute vector and an edge attribute vector respectively as input and transform them into latent spaces. Formally, we denote by $\phi_{enc}^{v}(V_A)$ and $\phi_{enc}^{e}(E_A)$ the node and edge attributes updated by applying $\phi_{enc}^{v}$ and $\phi_{enc}^{e}$ to each node and each edge respectively. The encoder module can then be briefly described as

$$G_A \leftarrow \mathrm{Enc}(G_A) = \left(\phi_{enc}^{v}(V_A),\ \phi_{enc}^{e}(E_A)\right). \tag{4}$$

The updated graph $G_A$ is then passed to the subsequent convolution modules as input.

**Graph Convolution Module.** This module consists of a node convolution layer and an edge convolution layer. The node convolution layer collects the attributes of all the nodes and edges adjacent to each node to compute per-node updates. It is followed by the edge convolution layer, which assembles the attributes of the two nodes associated with each edge to generate a new attribute for this edge. Specifically, for the $i$-th node in $G_o$, an aggregation function gathers the information from its adjacent nodes and associated edges, and an update function outputs the updated attributes according to the gathered information. Given that each instance node $v_i^o$ is connected to other instance nodes by instance edges and to label nodes by matching edges, we design two types of aggregation functions:

$$\hat{v}_i^o = \frac{1}{|\mathcal{N}_o|}\sum_{j \in \mathcal{N}_o} \hat{\rho}_n^{o}([e_{ij}^o, v_j^o]), \qquad \bar{v}_i^o = \frac{1}{|\mathcal{N}_l|}\sum_{j \in \mathcal{N}_l} \rho_n^{o}([e_{ij}^m, v_j^l]), \tag{5}$$

where $\mathcal{N}_o$ and $\mathcal{N}_l$ are the sets of instance nodes and label nodes adjacent to $v_i^o$, and $\hat{\rho}_n^{o}$ and $\rho_n^{o}$ gather the information from instance nodes and label nodes respectively. An update function is then employed to update the attributes of $v_i^o$:

$$v_i^o \leftarrow \phi_n^{o}([v_i^o, \hat{v}_i^o, \bar{v}_i^o]), \tag{6}$$

where $\phi_n^{o}$ takes the concatenation of the current attributes of $v_i^o$ with the gathered information $\hat{v}_i^o$ and $\bar{v}_i^o$, and outputs the updated attributes of $v_i^o$.

Like the instance nodes, the label nodes are connected to two types of nodes by two types of edges. Therefore, we also design two aggregation functions and an update function for each label node $v_i^l$. The aggregation functions are formulated as

$$\hat{v}_i^l = \frac{1}{|\mathcal{N}_l|}\sum_{j \in \mathcal{N}_l} \hat{\rho}_n^{l}([e_{ij}^l, v_j^l]), \qquad \bar{v}_i^l = \frac{1}{|\mathcal{N}_o|}\sum_{j \in \mathcal{N}_o} \rho_n^{l}([e_{ij}^m, v_j^o]), \tag{7}$$

and the update function is

$$v_i^l \leftarrow \phi_n^{l}([v_i^l, \hat{v}_i^l, \bar{v}_i^l]). \tag{8}$$

For an instance edge $e_{ij}^o$, both its source node and its receiving node are in $G_o$. We therefore design an aggregation function and an update function as follows:

$$\hat{e}_{ij}^o = \rho_e^{o}([v_i^o, v_j^o]), \qquad e_{ij}^o \leftarrow \phi_e^{o}([e_{ij}^o, \hat{e}_{ij}^o]), \tag{9}$$

where $\rho_e^{o}$ aggregates the information from the source instance node $v_i^o$ and the receiving node $v_j^o$, and $\phi_e^{o}$ updates the attributes of $e_{ij}^o$ according to the gathered information.
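To make the node convolution concrete, here is a minimal PyTorch sketch of the instance-node update of Eqs. (5)-(6), reusing the edge layout of the earlier sketch. Mean aggregation is read off the $1/|\mathcal{N}|$ factors; the class name, MLP widths and depth are our assumptions.

```python
import torch
import torch.nn as nn

def mlp(d_in, d_out):
    # Small two-layer MLP; the widths and depth are illustrative assumptions.
    return nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU(), nn.Linear(d_out, d_out))

class InstanceNodeConv(nn.Module):
    """Hypothetical sketch of the instance-node update, Eqs. (5)-(6):
    mean-aggregate messages from neighbouring instances and from all
    labels, then update the node state with their concatenation."""

    def __init__(self, dv, de_o, de_m, dl):
        super().__init__()
        self.rho_hat = mlp(de_o + dv, dv)   # \hat{rho}^o_n over instance edges
        self.rho = mlp(de_m + dl, dv)       # rho^o_n over matching edges
        self.phi = mlp(3 * dv, dv)          # update function phi^o_n

    def forward(self, v_o, e_o, src, dst, v_l, e_m):
        M, C = v_o.size(0), v_l.size(0)
        # First term of Eq. (5): mean over adjacent instance nodes/edges.
        msg = self.rho_hat(torch.cat([e_o, v_o[dst]], dim=-1))
        agg_o = torch.zeros(M, msg.size(-1)).index_add_(0, src, msg)
        agg_o = agg_o / torch.bincount(src, minlength=M).clamp(min=1).unsqueeze(-1)
        # Second term of Eq. (5): mean over all C label nodes via matching edges.
        msg_l = self.rho(torch.cat([e_m, v_l.repeat(M, 1)], dim=-1))
        agg_l = msg_l.view(M, C, -1).mean(dim=1)
        # Eq. (6): update from [current state, instance aggregate, label aggregate].
        return self.phi(torch.cat([v_o, agg_o, agg_l], dim=-1))
```

The label-node update of Eqs. (7)-(8) would follow the same aggregate-then-update pattern with its own MLPs.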
Similar to the instance edge convolution operator, the label edge convolution layer consists of an aggregation function and an update function, designed as

$$\hat{e}_{ij}^l = \rho_e^{l}([v_i^l, v_j^l]), \qquad e_{ij}^l \leftarrow \phi_e^{l}([e_{ij}^l, \hat{e}_{ij}^l]). \tag{10}$$

Different from instance edges and label edges, the matching edges connect instance nodes to label nodes. Therefore, the aggregation function gathers information from the two different types of nodes,

$$\hat{e}_{ij}^m = \rho_e^{m}([v_i^o, v_j^l]), \tag{11}$$

and the update function again takes the combination of the aggregated features and the current features as input and outputs the updated attributes:

$$e_{ij}^m \leftarrow \phi_e^{m}([e_{ij}^m, \hat{e}_{ij}^m]). \tag{12}$$

All the above aggregation and update functions are designed as MLPs, but their structures and parameters differ from one another.

**Decoder.** The decoder module reads out the final output from the updated graph state. Since only the attributes of the instance-label matching edges are required for the final evaluation, the decoder contains a single update function $\phi_{dec}^{e}$ that transforms the edge attributes into the desired space:

$$S = \mathrm{Dec}(G_A) = \phi_{dec}^{e}(E_A), \tag{13}$$

where $S \in [0, 1]^{M \times C}$ denotes the prediction scores of matching each instance with each label. Like the other update functions, $\phi_{dec}^{e}$ is parameterized by an MLP.

### 3.4 Optimizing Multi-Label Prediction

In order to interpret each ground-truth label of the input image, there should be at least one instance that best matches it. To account for possibly noisy instances, a cross-instance max-pooling is carried out to fuse the output of our framework into an integrated prediction. Suppose $s_j$ ($j = 1, 2, \ldots, M$) is the prediction score vector of the $j$-th instance from the decoder and $s_j^c$ ($c = 1, 2, \ldots, C$) is the matching score of $s_j$ for the $c$-th category. The cross-instance max-pooling can be formulated as

$$p_i^c = \max(s_1^c, s_2^c, \ldots, s_M^c), \tag{14}$$

where $p_i^c$ is the prediction score for the $c$-th category of the given image $i$. Finally, the training of our network is guided by a weighted cross-entropy loss with the ground-truth labels $Y_i$ as supervision:

$$\mathcal{L} = -\sum_{c=1}^{C} w^c \left[ y_i^c \log(p_i^c) + (1 - y_i^c) \log(1 - p_i^c) \right], \qquad w^c = y_i^c\, e^{\beta (1 - r^c)} + (1 - y_i^c)\, e^{\beta r^c}, \tag{15}$$

where $w^c$ alleviates class imbalance, $\beta$ is a hyperparameter and $r^c$ is the ratio of label $c$ in the training set.

## 4 Experiments

### 4.1 Evaluation Metrics

We adopt six widely used multi-label metrics to evaluate each comparing method: the average per-class precision (CP), recall (CR) and F1 (CF1), and the average overall precision (OP), recall (OR) and F1 (OF1); their detailed definitions can be found in [Chen et al., 2019a]. We report these metrics under the setting that a label is predicted as positive if its estimated probability is greater than 0.5. To compare fairly with the state-of-the-art methods, we also report results for the top-3 labels. In addition, we compute and report the average precision (AP) and mean average precision (mAP).

### 4.2 Implementation Details

In the ASGC module, we apply Faster R-CNN (ResNet-50 FPN) [Ren et al., 2016] to generate a set of instances for each image, where each instance contains, in addition to the visual feature $f$ and bounding box $B$, a preliminary class label $c$ with a confidence score $s$. To save computational cost, we only keep the instances with the top-$m$ confidence scores for each image. For label representations, we adopt 300-dimensional GloVe vectors [Pennington et al., 2014] trained on the Wikipedia dataset.
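A sketch of how such label representations can be assembled follows; the file name, path and the averaging of multi-word label names are our assumptions, not details from the paper.

```python
import numpy as np

def load_label_embeddings(label_names, glove_path="glove.6B.300d.txt"):
    """Hypothetical sketch: build the (C, 300) label node attributes of
    Eq. (2) from pre-trained GloVe vectors stored in a text file."""
    vecs = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            tok, *vals = line.rstrip().split(" ")
            vecs[tok] = np.asarray(vals, dtype=np.float32)
    emb = []
    for name in label_names:
        # Multi-word labels such as "traffic light": average the word vectors
        # (a common convention; the paper does not specify its handling).
        words = [vecs[w] for w in name.lower().split() if w in vecs]
        emb.append(np.mean(words, axis=0) if words else np.zeros(300, np.float32))
    return np.stack(emb)   # one 300-d attribute per label node
```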
In the ILMS module, the Graph Convolution Layer consists of $k$ convolution modules, which are stacked to aggregate the information of $k$-th-order neighborhoods. In our experiments, $k$ is 2, and the output dimensions of the corresponding convolution modules are 512 and 256, respectively. During training, the input images are randomly cropped and resized to 448×448 with random horizontal flips for data augmentation. All modules are implemented in PyTorch, and the optimizer is SGD with momentum 0.9. The weight decay is $10^{-4}$. The initial learning rate is 0.01, decayed by a factor of 10 every 30 epochs. The hyperparameter $\beta$ in Eq. (15) is set to 0 on the VOC 2007 dataset and to 0.4 on both the MS-COCO and NUS-WIDE datasets.

| Methods | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | motor | person | plant | sheep | sofa | train | tv | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HCP | 98.6 | 97.1 | 98.0 | 95.6 | 75.3 | 94.7 | 95.8 | 97.3 | 73.1 | 90.2 | 80.0 | 97.3 | 96.1 | 94.9 | 96.3 | 78.3 | 94.7 | 76.2 | 97.9 | 91.5 | 90.9 |
| CNN-RNN | 96.7 | 83.1 | 94.2 | 92.8 | 61.2 | 82.1 | 89.1 | 94.2 | 64.2 | 83.6 | 70.0 | 92.4 | 91.7 | 84.2 | 93.7 | 59.8 | 93.2 | 75.3 | 99.7 | 78.6 | 84.0 |
| ResNet-101 | 99.5 | 97.7 | 97.8 | 96.4 | 65.7 | 91.8 | 96.1 | 97.6 | 74.2 | 80.9 | 85.0 | 98.4 | 96.5 | 95.9 | 98.4 | 70.1 | 88.3 | 80.2 | 98.9 | 89.2 | 89.9 |
| RNN-Attention | 98.6 | 97.4 | 96.3 | 96.2 | 75.2 | 92.4 | 96.5 | 97.1 | 76.5 | 92.0 | 87.7 | 96.8 | 97.5 | 93.8 | 98.5 | 81.6 | 93.7 | 82.8 | 98.6 | 89.3 | 91.9 |
| ML-GCN | 99.6 | 98.3 | 97.9 | 97.6 | 78.2 | 92.3 | 97.4 | 97.4 | 79.2 | 94.4 | 86.5 | 97.4 | 97.9 | 97.1 | 98.7 | 84.6 | 95.3 | 83.0 | 98.6 | 90.4 | 93.1 |
| SSGRL | 99.5 | 97.1 | 97.6 | 97.8 | 82.6 | 94.8 | 96.7 | 98.1 | 78.0 | 97.0 | 85.6 | 97.8 | 98.3 | 96.4 | 98.8 | 84.9 | 96.5 | 79.8 | 98.4 | 92.8 | 93.4 |
| TSGCN | 98.9 | 98.5 | 96.8 | 97.3 | 87.5 | 94.2 | 97.4 | 97.7 | 84.1 | 92.6 | 89.3 | 98.4 | 98.0 | 96.1 | 98.7 | 84.9 | 96.6 | 87.2 | 98.4 | 93.7 | 94.3 |
| GM-MLIC | 99.4 | 98.7 | 98.5 | 97.6 | 86.3 | 97.1 | 98.0 | 99.4 | 82.5 | 98.1 | 87.7 | 99.2 | 98.9 | 97.5 | 99.3 | 87.0 | 98.3 | 86.5 | 99.1 | 94.9 | 94.7 |

Table 1: Comparisons of per-class AP and mAP (%) with state-of-the-art methods on VOC 2007.

| Methods | mAP | CP | CR | CF1 | OP | OR | OF1 | Top-3 CP | Top-3 CR | Top-3 CF1 | Top-3 OP | Top-3 OR | Top-3 OF1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CNN-RNN [Wang et al., 2016] | 61.2 | - | - | - | - | - | - | 66.0 | 55.6 | 60.4 | 69.2 | 66.4 | 67.8 |
| ResNet-101 [He et al., 2016] | 77.3 | 80.2 | 66.7 | 72.8 | 83.9 | 70.8 | 76.8 | 84.1 | 59.4 | 69.7 | 89.1 | 62.8 | 73.6 |
| RNN-Attention [Wang et al., 2017] | - | - | - | - | - | - | - | 79.1 | 58.7 | 67.4 | 84.0 | 63.0 | 72.0 |
| SRN [Zhu et al., 2017] | 77.1 | 81.6 | 65.4 | 71.2 | 82.7 | 69.9 | 75.8 | 85.2 | 58.8 | 67.4 | 87.4 | 62.5 | 72.9 |
| ML-GCN [Chen et al., 2019b] | 83.0 | 85.1 | 72.0 | 78.0 | 85.8 | 75.4 | 80.3 | 89.2 | 64.1 | 74.6 | 90.5 | 66.5 | 76.7 |
| SSGRL [Chen et al., 2019a] | 83.6 | 89.5 | 68.3 | 76.9 | 91.2 | 70.7 | 79.3 | 91.9 | 62.1 | 73.0 | 93.6 | 64.2 | 76.0 |
| CMA [You et al., 2020] | 83.4 | 82.1 | 73.1 | 77.3 | 83.7 | 76.3 | 79.9 | 87.2 | 64.6 | 74.2 | 89.1 | 66.7 | 76.3 |
| TSGCN [Xu et al., 2020] | 83.5 | 81.5 | 72.3 | 76.7 | 84.9 | 75.3 | 79.8 | 84.1 | 67.1 | 74.6 | 89.5 | 69.3 | 78.1 |
| GM-MLIC | 84.3 | 87.3 | 70.8 | 78.3 | 88.6 | 74.8 | 80.6 | 90.6 | 67.3 | 74.9 | 94.0 | 69.8 | 77.8 |

Table 2: Comparisons with state-of-the-art methods on the MS-COCO dataset (all metrics in %).

| Methods | mAP | CF1 | OF1 | Top-3 CF1 | Top-3 OF1 |
|---|---|---|---|---|---|
| CNN-RNN [Wang et al., 2016] | 56.1 | - | - | 34.7 | 55.2 |
| SRN [Zhu et al., 2017] | 61.8 | 56.9 | 73.2 | 47.7 | 62.2 |
| MLIC-KD-WSD [Liu et al., 2018] | 60.1 | 58.7 | 73.7 | 53.8 | 71.1 |
| PLA [Yazici et al., 2020] | - | 56.2 | - | - | 72.3 |
| CMA [You et al., 2020] | 61.4 | 60.5 | 73.7 | 55.5 | 70.0 |
| GM-MLIC | 62.2 | 61.0 | 74.1 | 55.3 | 72.5 |

Table 3: Comparisons with state-of-the-art methods on NUS-WIDE (all metrics in %).

### 4.3 Comparisons with State-of-the-Arts

**Results on VOC 2007.**
Pascal VOC 2007 [Everingham et al., 2010] is the most widely used dataset for evaluating the MLIC task; it covers 20 common categories and contains a trainval set of 5,011 images and a test set of 4,952 images. We present the per-class AP and the mAP over all categories on VOC 2007 in Table 1. Although the dataset is less complicated and relatively small, our proposed model still achieves 94.7% mAP, which is 0.4%, 1.3% and 1.6% higher than TSGCN, SSGRL and ML-GCN respectively. In particular, GM-MLIC obtains significant improvements on categories of small objects, including bird, plant, sheep and cat, with APs of 98.5%, 87.0%, 98.3% and 99.4% respectively.

**Results on MS-COCO.** Microsoft COCO [Lin et al., 2014] contains a training set of 82,081 images and a validation set of 40,137 images, and covers 80 common categories with about 2.9 instance labels per image. The number of labels also varies considerably across images, which makes MS-COCO more challenging. As shown in Table 2, GM-MLIC outperforms the baselines on most evaluation metrics. Specifically, GM-MLIC surpasses the other comparing methods with 84.3% mAP, 78.3% (74.9%) CF1 (Top-3), 80.6% OF1, 67.3% Top-3 CR and 69.8% Top-3 OR. To visually illustrate the effectiveness of our model, we randomly select some images from different scenes and exhibit the top-3 labels returned by GM-MLIC and SSGRL in Figure 3.

**Results on NUS-WIDE.** The NUS-WIDE dataset [Chua et al., 2009] contains 161,789 images for training and 107,859 images for testing. The dataset is manually annotated with 81 concepts, with 2.4 concept labels per image on average. Experimental results on this dataset are shown in Table 3, from which it is clear that GM-MLIC not only learns effectively from such large-scale data but also achieves superior performance on most evaluation metrics, with 62.2% mAP, 61.0% CF1 and 74.1% (72.5%) OF1 (Top-3).

Figure 3: Top-3 labels returned by GM-MLIC and SSGRL on example MS-COCO images.

### 4.4 Further Analysis

**Ablation Studies.** To evaluate the effectiveness of each component in our proposed framework, we conduct ablation studies on the MS-COCO and VOC 2007 datasets, as shown in Table 4. Specifically, using the label semantic graph $G_l$ alone results in 1.3% higher precision but a 0.7% lower recall score than using the instance spatial graph $G_o$ alone; in other words, introducing $G_o$ into our framework reduces the possibility of missing instances in images. Besides, the instance-label matching edges that incorporate category semantics
can guide the model to better learn semantic-specific features and further improve the classification precision.

| Methods | MS-COCO mAP | MS-COCO CP | MS-COCO CR | MS-COCO CF1 | VOC mAP |
|---|---|---|---|---|---|
| Our w/o $G_l$ | 81.7 | 85.1 | 68.3 | 76.5 | 92.8 |
| Our w/o $G_o$ | 82.9 | 86.4 | 67.6 | 77.2 | 93.3 |
| Our w/o ILM edges | 83.8 | 86.7 | 69.2 | 77.6 | 94.2 |
| Our GM-MLIC | 84.3 | 87.3 | 70.8 | 78.3 | 94.7 |

Table 4: Comparison (%) of our full framework (Our GM-MLIC), our framework without the label semantic graph (Our w/o $G_l$), without the instance spatial graph (Our w/o $G_o$), and without the instance-label matching edges (Our w/o ILM edges) on MS-COCO and VOC 2007.

**Hyper-parameter and Visualization.** Furthermore, we study the performance of our proposed method under different parameter settings. We first vary the number $m$ of extracted instances per image and show the results in Figure 4(a). Note that if we keep all instances, the model contains many noisy instances and is difficult to converge; conversely, when too many instances are filtered out, accuracy drops, since some positive instances may be removed incorrectly. Empirically, the optimal value of $m$ is 10 on VOC 2007 and 35 on both MS-COCO and NUS-WIDE. We then show the performance with different numbers of convolution modules $k$ in Figure 4(b). As the number of convolution modules increases further, accuracy drops on all three datasets; thus we set $k = 2$ in our work. In Figure 5, we visualize the classifiers learned by our proposed method, which demonstrates that a meaningful semantic topology is maintained.

Figure 4: Accuracy comparisons with different parameter values on VOC 2007, MS-COCO and NUS-WIDE: (a) the number of instances $m$; (b) the number of convolution modules $k$.

Figure 5: Visualization of the classifiers learned by our model on the MS-COCO dataset, grouped into semantic clusters such as transport, kitchen, animal, food, sport and person.

## 5 Conclusion

In this paper, we propose a novel graph matching based multi-label image classification framework, GM-MLIC, which reformulates the MLIC problem as a graph matching problem. By incorporating an instance spatial graph and a label semantic graph and establishing instance-label assignments, GM-MLIC utilizes a graph network block to form structured representations for each instance and label by convolving their neighborhoods, which effectively contributes label dependencies and semantic-aware features to the learning model. Extensive experiments on various image datasets demonstrate the superiority of our proposed method.

## Acknowledgments

This work was supported by the National Natural Science Foundation of China (Nos. 61872032, 61972030, 62072027, 62076021), the Beijing Natural Science Foundation (Nos. 4202058, 4202057, 4202060), and in part by the Fundamental Research Funds for the Central Universities (Nos. 2020YJS036, 2019YJS044, 2020YJS026).

## References

[Chen et al., 2019a] Tianshui Chen, Muxin Xu, Xiaolu Hui, Hefeng Wu, and Liang Lin.
Learning semantic-specific graph representation for multi-label image recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 522-531, 2019.

[Chen et al., 2019b] Zhao-Min Chen, Xiu-Shen Wei, Peng Wang, and Yanwen Guo. Multi-label image recognition with graph convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5177-5186, 2019.

[Chua et al., 2009] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yantao Zheng. NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, pages 1-9, 2009.

[Everingham et al., 2010] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 88(2):303-338, 2010.

[Ge et al., 2018] Zongyuan Ge, Dwarikanath Mahapatra, Suman Sedai, Rahil Garnavi, and Rajib Chakravorty. Chest X-rays classification: A multi-label and fine-grained problem. arXiv preprint arXiv:1807.07247, 2018.

[Gong et al., 2013] Yunchao Gong, Yangqing Jia, Thomas Leung, Alexander Toshev, and Sergey Ioffe. Deep convolutional ranking for multilabel image annotation. arXiv preprint arXiv:1312.4894, 2013.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

[Lin et al., 2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740-755. Springer, 2014.

[Liu et al., 2018] Yongcheng Liu, Lu Sheng, Jing Shao, Junjie Yan, Shiming Xiang, and Chunhong Pan. Multi-label image classification via knowledge distillation from weakly-supervised detection. In Proceedings of the 26th ACM International Conference on Multimedia, pages 700-708, 2018.

[Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532-1543, 2014.

[Ren et al., 2016] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137-1149, 2016.

[Wang et al., 2016] Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, and Wei Xu. CNN-RNN: A unified framework for multi-label image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2285-2294, 2016.

[Wang et al., 2017] Zhouxia Wang, Tianshui Chen, Guanbin Li, Ruijia Xu, and Liang Lin. Multi-label image recognition by recurrently discovering attentional regions. In Proceedings of the IEEE International Conference on Computer Vision, pages 464-472, 2017.

[Wang et al., 2020a] Tao Wang, He Liu, Yidong Li, Yi Jin, Xiaohui Hou, and Haibin Ling. Learning combinatorial solver for graph matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7568-7577, 2020.

[Wang et al., 2020b] Ya Wang, Dongliang He, Fu Li, Xiang Long, Zhichao Zhou, Jinwen Ma, and Shilei Wen. Multi-label classification with label graph superimposing.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12265-12272, 2020.

[Wei et al., 2015] Yunchao Wei, Wei Xia, Min Lin, Junshi Huang, Bingbing Ni, Jian Dong, Yao Zhao, and Shuicheng Yan. HCP: A flexible CNN framework for multi-label image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1901-1907, 2015.

[Wei et al., 2019] Shikui Wei, Lixin Liao, Jia Li, Qinjie Zheng, Fei Yang, and Yao Zhao. Saliency inside: Learning attentive CNNs for content-based image retrieval. IEEE Transactions on Image Processing, 28(9):4580-4593, 2019.

[Xu et al., 2020] Jiahao Xu, Hongda Tian, Zhiyong Wang, Yang Wang, Fang Chen, and Wenxiong Kang. Joint input and output space learning for multi-label image classification. IEEE Transactions on Multimedia, 2020.

[Yang et al., 2016] Hao Yang, Joey Tianyi Zhou, Yu Zhang, Bin-Bin Gao, Jianxin Wu, and Jianfei Cai. Exploit bounding box annotations for multi-label object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 280-288, 2016.

[Yazici et al., 2020] Vacit Oguz Yazici, Abel Gonzalez-Garcia, Arnau Ramisa, Bartlomiej Twardowski, and Joost van de Weijer. Orderless recurrent models for multi-label classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 13440-13449, 2020.

[You et al., 2020] Renchun You, Zhiyao Guo, Lei Cui, Xiang Long, Yingze Bao, and Shilei Wen. Cross-modality attention with semantic graph embedding for multi-label classification. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 12709-12716, 2020.

[Zhu et al., 2017] Feng Zhu, Hongsheng Li, Wanli Ouyang, Nenghai Yu, and Xiaogang Wang. Learning spatial regularization with image-level supervisions for multi-label image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5513-5522, 2017.