
Incomplete Attribute Learning with Auxiliary Labels

Kongming Liang1,2,3, Yuhong Guo2, Hong Chang1, Xilin Chen1,3

1 Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
2 School of Computer Science, Carleton University, Ottawa, Canada
3 University of Chinese Academy of Sciences, Beijing 100049, China
kongming.liang@vipl.ict.ac.cn, yuhong.guo@carleton.ca, {changhong, xlchen}@ict.ac.cn

Visual attribute learning is a fundamental and challenging problem for image understanding. Considering the huge semantic space of attributes, it is economically impossible to annotate the presence or absence of all of them for a natural image via crowdsourcing. In this paper, we tackle the incomplete nature of visual attributes by introducing auxiliary labels into a novel transductive learning framework. By jointly predicting the attributes from the input images and modeling the relationship between attributes and auxiliary labels, the missing attributes can be recovered effectively. In addition, the proposed model can be solved efficiently in an alternating manner by optimizing quadratic programming problems and updating parameters with closed-form solutions. Moreover, we propose and investigate different methods for acquiring auxiliary labels. We conduct experiments on three widely used attribute prediction datasets. The experimental results show that our proposed method achieves state-of-the-art performance with access to only partially observed attribute annotations.

1 Introduction

Attributes are semantic properties of objects which can be inferred from visual images. Beyond traditional object recognition, attribute learning offers a promising route to natural image understanding as it is able to provide fine-grained descriptions. According to the definitions in previous works [Farhadi et al., 2009; Russakovsky and Fei-Fei, 2010], attributes usually contain rich semantic meanings, including color, shape, texture and object parts. Recent research has verified that attributes can benefit many relevant computer vision tasks such as image retrieval [Kovashka et al., 2012; Liang et al., 2016] and image captioning [Fang et al., 2015; You et al., 2016]. Moreover, attribute learning makes it possible to do zero-shot classification [Lampert et al., 2014; Jayaraman and Grauman, 2014] by modeling the correlation between seen and unseen object categories.

Direct attribute prediction methods [Farhadi et al., 2009; Lampert et al., 2014] train a binary classifier to predict each individual attribute. Since the binary attribute classifiers are trained independently, they fail to exploit the correlation information between attributes. By taking each attribute as one subtask, [Jayaraman et al., 2014; Chen et al., 2014] formulate attribute learning in a regularization-based multi-task learning framework. In this way, the correlations between attributes are incorporated during the learning process. In addition to leveraging the correlation within attributes, the relationship between attributes and their associated object categories can also play a key role in improving the discriminative ability of attribute classifiers. [Wang and Ji, 2013; Huang et al., 2015] propose to model a high order relationship between attributes and object categories. In this way, they can better recognize the attributes which are hard to predict based only on visual appearance. Moreover, by modeling the attribute classifier in a category-specific way [Liang et al., 2015], different visual attribute manifestations across categories (e.g. the attribute fluffy varies considerably between dog and towel) can be characterized explicitly. Nevertheless, all the above methods assume that the training images come with complete attribute annotations.

Since multiple attributes may be present on a single instance and the space of attributes is almost infinite, exhaustively annotating all the present attributes is economically infeasible. The resulting incompleteness of attribute labels can increase the difficulty of attribute prediction to a large extent. Therefore, it is necessary to build an attribute prediction model that can handle incomplete annotations. Although the incomplete learning problem has received attention in multi-label learning, there is almost no previous work investigating the problem for attribute learning. In this paper, we propose to tackle the attribute learning problem with incomplete annotations. Our contributions are fourfold. First, we propose a novel transductive learning model to predict visual attributes, which is able to exploit both labeled and unlabeled images in the learning process. Second, we incorporate high-level auxiliary labels into the transductive learning model via label matrix completion to improve attribute prediction. By enforcing the low-rank property on the augmented label matrix, the model can infer the missing attributes from both the observed attributes and the high-level auxiliary labels. Third, we investigate different sources of high-level auxiliary labels, including both the existing object category annotations and the knowledge transferred from auxiliary large scale data sources. Finally, we conduct experiments on the widely used datasets for attribute learning. Experimental results demonstrate the effectiveness of our proposed method on attribute learning with incomplete annotations.

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

2 Related Work

A number of previous works have been proposed to tackle the incomplete label problem in the multi-label learning literature. Common approaches include treating the missing labels as negative labels [Sun et al., 2010] or training only on the provided labels [Yu et al., 2014]. [Chen et al., 2013] proposed a fast image tagging algorithm that uses only incomplete tags for image annotation. It co-regularizes both the partially observed tags and the image representation to recover the complete tag labels within a joint convex loss function. [Wu et al., 2013] proposed to infer the missing labels through label completion based on visual similarity and label co-occurrence. Moreover, [Wu et al., 2015] proposed to complete the missing labels by further adding semantic hierarchy constraints. They addressed the incomplete multi-label learning problem by using a mixed graph to exploit the label dependencies according to instance similarity, class co-occurrence, and semantic hierarchy simultaneously. [Zhao and Guo, 2015] proposed to solve incomplete multi-label learning in a semi-supervised way by integrating a Laplacian manifold regularization into the learning procedure. However, directly using the above methods for incomplete attribute learning is not effective: since the visual manifestations of a single attribute vary across different object categories, it is difficult to exploit the correlation between attributes when only visual appearance is considered.

Transferring auxiliary labels from an external knowledge database is an effective way to boost the original learning task. [Hwang and Sigal, 2014] used a taxonomy tree to jointly embed attributes and super-categories into the same space. [Frome et al., 2013] mapped object category labels to their corresponding semantic embeddings. The embeddings of object category labels are learned from textual data in an unsupervised way.
[Lu, 2016] proposed an unsupervised zero-shot learning method to embed large scale object classes by exploiting the outputs of a trained neural network. [Lu et al., 2016] leveraged language priors from semantic word embeddings to improve the visual relationship detection task. [Liang et al., 2015] leveraged auxiliary object category labels to model the high order relationship between image, object and attribute, constructing a common semantic space for embedding the three types of information. Inspired by the above methods, we propose different ways to acquire auxiliary labels which are helpful for incomplete attribute learning.

In this work, we consider learning image attribute predictors in the following setting. Assume we have an input data matrix $X \in \mathbb{R}^{d \times n}$, which contains $n$ images, each represented as a $d$-dimensional feature vector. Without loss of generality, we assume the first $n_\ell$ images, $X_\ell$, from $X$ are labeled training instances associated with an attribute indicator matrix $Y_\ell \in \{0, 1\}^{L \times n_\ell}$, where 1 indicates the presence of the corresponding attribute among the total $L$ predefined attributes, while the attribute indicator matrix $Y_u \in \{0, 1\}^{L \times n_u}$ for the remaining $n_u$ images, $X_u$ (such that $n = n_\ell + n_u$), is unobserved and needs to be predicted. Thus overall we have a partially observed attribute label indicator matrix $Y = [Y_\ell, Y_u]$. Below we present a novel transductive learning method for attribute prediction, which is able to exploit auxiliary labels and can be naturally extended to handle incomplete attribute annotations.

3.1 Attribute Learning with Auxiliary Labels

Though attribute learning can be tackled as a standard label prediction problem, the nature of visual attributes means that related auxiliary label categories often exist on the same images beyond the attribute labels. For example, the object categories are a natural set of auxiliary labels that can be useful for attribute label prediction. Such auxiliary labels and the target attribute labels typically present strong correlation patterns and dependence relationships. We hence propose a novel transductive learning model that not only exploits both labeled and unlabeled images for attribute prediction, but also integrates auxiliary labels into the learning process. In particular, we assume there is a set of $\hat{L}$ auxiliary labels, and the prediction information on these auxiliary labels for all the images can be encoded into a matrix $Z \in [0, 1]^{\hat{L} \times n}$. We formulate our transductive learning in the following framework:

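$$\min_{W, Y_u, M} \; \mathcal{L}(f(X; W), Y) + \frac{\alpha}{2} \left\| [Y; Z] - M \right\|_F^2 + \beta \|M\|_* + \frac{\gamma}{2} \|W\|_F^2, \quad (1)$$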

where $f(\cdot; W)$ denotes the attribute prediction function with model parameter $W$, and $\mathcal{L}(\cdot, \cdot)$ denotes the attribute prediction loss function; $\|\cdot\|_F$ denotes the Frobenius norm and $\|\cdot\|_*$ denotes the nuclear norm; $\alpha$, $\beta$ and $\gamma$ are trade-off parameters. The nuclear norm enforces the low-rank property over the matrix $M$. Together with the second term of the objective function, by pushing the augmented label matrix to be close to a low-rank matrix, we aim to capture the linear correlations between the augmented labels and infer the unobserved labels such as $Y_u$ from the observed ones such as $Z$. The proposed framework is expected to integrate information from both the input matrix $X$ through the prediction function $f$ and the augmented label matrix through the low-rank regularization to enhance attribute prediction. The nuclear norm regularization is nevertheless non-smooth. To obtain a simple learning procedure, we further exploit a well known identity and encode the low-rank property by introducing two low-dimensional matrices, $U \in \mathbb{R}^{(L+\hat{L}) \times m}$ and $V \in \mathbb{R}^{n \times m}$ ($m < \min(L + \hat{L}, n)$), and replacing $M$ with $M = UV^\top$. This leads to the following learning formulation:


Figure 1: The proposed framework for incomplete attribute learning. It can integrate both observed attribute labels and auxiliary labels for attribute prediction. The red part of Y denotes the unlabeled part.

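$$\min_{W, Y_u, U, V} \; \mathcal{L}(f(X; W), Y) + \frac{\alpha}{2} \left\| [Y; Z] - UV^\top \right\|_F^2 + \frac{\beta}{2} \left( \|U\|_F^2 + \|V\|_F^2 \right) + \frac{\gamma}{2} \|W\|_F^2 \quad (2)$$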
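The "well known identity" used to replace the nuclear norm is commonly stated as $\|M\|_* = \min_{M = UV^\top} \frac{1}{2}(\|U\|_F^2 + \|V\|_F^2)$, with the minimum attained by factors built from the SVD. The following NumPy sketch checks this numerically on a synthetic low-rank matrix; the function names and the random test matrix are illustrative assumptions and are not part of the paper.

```python
import numpy as np

def nuclear_norm(M):
    """Nuclear norm: the sum of the singular values of M."""
    return np.linalg.svd(M, compute_uv=False).sum()

def balanced_factors(M, m):
    """Rank-m factors (U, V) from the truncated SVD, so that U @ V.T approximates M."""
    P, s, Qt = np.linalg.svd(M, full_matrices=False)
    root = np.sqrt(s[:m])
    U = P[:, :m] * root      # scale each left singular vector by sqrt(sigma)
    V = Qt[:m, :].T * root   # scale each right singular vector by sqrt(sigma)
    return U, V

rng = np.random.default_rng(0)
# Synthetic rank-3 stand-in for a low-rank label matrix of size (L + L_hat) x n.
M = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 20))
U, V = balanced_factors(M, m=3)
factor_penalty = 0.5 * (np.linalg.norm(U, "fro") ** 2 + np.linalg.norm(V, "fro") ** 2)

# Both checks should hold when m is at least the rank of M.
print(np.allclose(U @ V.T, M), np.isclose(factor_penalty, nuclear_norm(M)))
```

When $m$ is smaller than the true rank, the factorization acts as a hard rank constraint in addition to the Frobenius penalty, which is why $m$ appears as a hyperparameter of the model.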

3.2 Learning with Incomplete Attribute Labels

As mentioned above, complete attribute annotations are typically difficult to obtain, while incomplete attribute annotations are prevalent. In this case, our label indicator matrix $Y_\ell$ for the labeled training images is not completely observed. Hence both $Y_u$ and part of $Y_\ell$ contain missing labels or unobserved entries. Here we use a mask matrix $\Omega \in \{0, 1\}^{L \times n}$ to indicate the observation status of the corresponding entries of $Y$. Our proposed transductive learning model above can nevertheless be naturally extended to handle learning with incomplete attribute labels by learning a label matrix $\hat{Y}$ that fills in all unobserved entries, which leads to the following formulation:

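$$\min_{W, \hat{Y}, U, V} \; \mathcal{L}(f(X; W), \hat{Y}) + \frac{\alpha}{2} \left\| [\hat{Y}; Z] - UV^\top \right\|_F^2 + \frac{\beta}{2} \left( \|U\|_F^2 + \|V\|_F^2 \right) + \frac{\gamma}{2} \|W\|_F^2 \quad (3)$$

$$\text{s.t.} \quad \Omega \circ \hat{Y} = \Omega \circ Y; \quad 0 \le \hat{Y} \le 1$$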

where $\circ$ denotes the element-wise matrix multiplication. The equality constraints preserve the observed labels in the given label matrix $Y$. Ideally, our attribute prediction matrix $\hat{Y}$ should be an indicator matrix, i.e., $\hat{Y} \in \{0, 1\}^{L \times n}$. To facilitate convenient optimization, we relax the integer constraints into the continuous range between 0 and 1. The overall learning framework is illustrated in Fig. 1. In this learning scenario, the low-rank regularization over the augmented label matrix can help to infer the missing attribute labels by exploiting the linear correlations and dependencies between auxiliary labels and attribute labels. For example, by taking the object categories as auxiliary labels, we can infer the attribute ear with high probability if we already know that the object is a cat with the attribute head present. We cannot do such reasoning if the object is a bird, because the ear is not visible on the head of birds under most circumstances.
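The constraint set of this formulation (observed entries fixed, all entries in $[0, 1]$) is a product of per-entry sets, so projecting a candidate label matrix onto it reduces to clipping followed by restoring the observed entries. The helper below is a minimal sketch of that projection; the function name and the toy matrices are assumptions made for illustration.

```python
import numpy as np

def project_onto_constraints(Y_hat, Y_obs, Omega):
    """Clip a candidate label matrix to [0, 1], then restore the observed entries.

    Y_hat : L x n candidate attribute predictions
    Y_obs : L x n given labels (values at unobserved positions are ignored)
    Omega : L x n mask, 1 where the label is observed, 0 otherwise
    """
    Y_proj = np.clip(Y_hat, 0.0, 1.0)
    return np.where(Omega == 1, Y_obs, Y_proj)

# Toy example with L = 2 attributes and n = 3 images.
Y_obs = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
Omega = np.array([[1, 0, 1],
                  [0, 1, 0]])
Y_hat = np.array([[0.2, 1.4, 0.9],
                  [-0.3, 0.1, 0.6]])
Y_feasible = project_onto_constraints(Y_hat, Y_obs, Omega)
print(Y_feasible)
```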

3.3 Optimization

To obtain a concrete learning problem, we propose to use a linear prediction function $f$ and the squared loss function $\mathcal{L}(\cdot, \cdot)$:

$$\mathcal{L}(f(X; W), \hat{Y}) = \left\| W^\top X - \hat{Y} \right\|_F^2 \quad (4)$$

With this loss function, the learning model in Eqn. (3) is a joint minimization problem over four variables: $\hat{Y}$, $W$, $U$ and $V$. The objective function is convex in each variable when the other variables are kept fixed. Therefore, we propose to solve this optimization problem using an alternating optimization procedure. We first initialize $W$ and $\hat{Y}$ by training a linear regression model to predict the partially observed attribute labels:

$$(W, W_z) = \arg\min_{W, W_z} \left\| \Omega \circ \left( [W; W_z]^\top [X; Z] - Y \right) \right\|_F^2$$

$$\hat{Y} = (1 - \Omega) \circ [W; W_z]^\top [X; Z] + \Omega \circ Y, \quad (5)$$

where $W_z$ is the parameter matrix associated with the auxiliary labels, which is only used during the initialization stage. We use the auxiliary labels as inputs in order to achieve a better initialization of $\hat{Y}$. We then initialize $U$ and $V$ by performing SVD on the augmented label matrix, $[\hat{Y}; Z] = P \Sigma Q^\top$, such that

$$U = P_{:,1:m} \Sigma_{1:m,1:m}^{1/2}, \quad V = Q_{:,1:m} \Sigma_{1:m,1:m}^{1/2}. \quad (6)$$
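The initialization in Eqns. (5) and (6) can be sketched in NumPy as follows. The masked least-squares problem decouples over attributes, so each attribute is fitted only on the images where it is observed. This is a sketch under stated assumptions: the function names are hypothetical, no regularization is added to the regression, and the demo uses noiseless real-valued labels purely to check that the fit recovers them.

```python
import numpy as np

def init_labels(X, Z, Y, Omega):
    """Masked least squares (Eqn. 5): returns W (d x L) and the filled-in Y_hat (L x n)."""
    F = np.vstack([X, Z])                    # stacked inputs [X; Z], (d + L_hat) x n
    L = Y.shape[0]
    W_all = np.zeros((F.shape[0], L))        # stacked parameters [W; W_z]
    for i in range(L):
        obs = Omega[i] == 1
        if obs.any():
            # Fit attribute i using only the images where it is observed.
            W_all[:, i] = np.linalg.lstsq(F[:, obs].T, Y[i, obs], rcond=None)[0]
    Y_hat = (1 - Omega) * (W_all.T @ F) + Omega * Y
    return W_all[: X.shape[0]], Y_hat

def init_factors(Y_hat, Z, m):
    """Truncated SVD of [Y_hat; Z] split evenly between U and V (Eqn. 6)."""
    P, s, Qt = np.linalg.svd(np.vstack([Y_hat, Z]), full_matrices=False)
    root = np.sqrt(s[:m])
    return P[:, :m] * root, Qt[:m, :].T * root

# Synthetic check: labels that are exactly linear in [X; Z] should be recovered.
rng = np.random.default_rng(0)
d, L_hat, L, n, m = 5, 2, 3, 30, 2
X = rng.standard_normal((d, n))
Z = rng.random((L_hat, n))
A = rng.standard_normal((d + L_hat, L))
Y_true = A.T @ np.vstack([X, Z])
Omega = np.zeros((L, n), dtype=int)
Omega[:, :15] = 1                            # first 15 images observed for every attribute
W0, Y0 = init_labels(X, Z, Y_true * Omega, Omega)
U0, V0 = init_factors(Y0, Z, m)
```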

Given these initialization values, we iteratively update the four variables, and in each iteration we perform the following two steps. First, given the current values of the parameter matrices $W$, $U$ and $V$, we optimize $\hat{Y}$ in a row-wise manner. The $i$th row of $\hat{Y}$ is updated by solving the following subproblem:

$$\hat{Y}_{i,:} = \arg\min_{\hat{Y}_{i,:}} \left(1 + \frac{\alpha}{2}\right) \hat{Y}_{i,:} \hat{Y}_{i,:}^\top - \left( 2 W_{:,i}^\top X + \alpha U_{i,:} V^\top \right) \hat{Y}_{i,:}^\top$$

$$\text{s.t.} \quad \Omega_{i,:} \circ \hat{Y}_{i,:} = \Omega_{i,:} \circ Y_{i,:}; \quad 0 \le \hat{Y}_{i,:} \le 1 \quad (7)$$

The above formulation is a constrained quadratic programming problem which can be solved efficiently using a standard quadratic solver. Second, given fixed $\hat{Y}$, we use the following closed-form updates for $W$, $U$ and $V$:

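$$W = (2XX^\top + \gamma I)^{-1} 2X\hat{Y}^\top,$$

$$U = [\hat{Y}; Z] \, V \left( V^\top V + \frac{\beta}{\alpha} I \right)^{-1},$$

$$V = [\hat{Y}^\top, Z^\top] \, U \left( U^\top U + \frac{\beta}{\alpha} I \right)^{-1}. \quad (8)$$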
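For reference, the closed-form updates can be traced back to first-order optimality conditions. The sketch below assumes the squared loss of Eqn. (4) and that the factorization term, the factor regularizer and the weight regularizer carry the weights $\alpha/2$, $\beta/2$ and $\gamma/2$ respectively; this weighting is an assumption inferred from the form of the updates, and $A$ is shorthand introduced here for the augmented label matrix.

```latex
\begin{aligned}
&\text{Let } A = [\hat{Y}; Z]. \\
&\nabla_W:\quad 2X(X^\top W - \hat{Y}^\top) + \gamma W = 0
  \;\Rightarrow\; W = (2XX^\top + \gamma I)^{-1}\, 2X\hat{Y}^\top, \\
&\nabla_U:\quad \alpha (UV^\top - A)V + \beta U = 0
  \;\Rightarrow\; U = A V \left(V^\top V + \tfrac{\beta}{\alpha} I\right)^{-1}, \\
&\nabla_V:\quad \alpha (VU^\top - A^\top)U + \beta V = 0
  \;\Rightarrow\; V = A^\top U \left(U^\top U + \tfrac{\beta}{\alpha} I\right)^{-1}.
\end{aligned}
```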

The learning process is stopped when no further performance gain is obtained on the validation set. The overall optimization procedure is summarized in Alg. 1.


Algorithm 1 Optimization procedure

Input: $X$, $Y$, $Z$, $\Omega$; $\alpha$, $\beta$, $\gamma$ and $m$
Initialization: Initialize $W$, $\hat{Y}$ using Eqn. (5). Initialize $U$, $V$ using Eqn. (6).
repeat
    Update $\hat{Y}$ by solving Eqn. (7) with fixed $W$, $U$ and $V$
    Update $W$, $U$ and $V$ using Eqn. (8) with fixed $\hat{Y}$
until no further performance gain
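The alternating procedure can be sketched in NumPy as below. Several choices here are assumptions rather than the authors' implementation: the row-wise quadratic program is solved by entrywise clipping (valid because its quadratic term is a scaled identity, so the box-constrained problem decouples per entry) instead of a generic QP solver; the loop starts from a simple feasible point rather than Eqns. (5) and (6); it runs for a fixed number of sweeps instead of stopping on validation performance; and the objective weights $\alpha/2$, $\beta/2$, $\gamma/2$ are those implied by the closed-form updates. All function names are hypothetical.

```python
import numpy as np

def objective(W, Y_hat, U, V, X, Z, alpha, beta, gamma):
    """Squared loss + factorization fit + Frobenius regularizers (assumed weights)."""
    A = np.vstack([Y_hat, Z])
    return (np.linalg.norm(W.T @ X - Y_hat) ** 2
            + 0.5 * alpha * np.linalg.norm(A - U @ V.T) ** 2
            + 0.5 * beta * (np.linalg.norm(U) ** 2 + np.linalg.norm(V) ** 2)
            + 0.5 * gamma * np.linalg.norm(W) ** 2)

def update_Y(W, U, V, X, Y, Omega, alpha):
    """Each row minimises (1 + alpha/2)*||y||^2 - b.y over the box [0, 1];
    the minimiser is clip(b / (2 + alpha)), with observed entries restored."""
    L = Y.shape[0]
    B = 2.0 * W.T @ X + alpha * U[:L] @ V.T
    return np.where(Omega == 1, Y, np.clip(B / (2.0 + alpha), 0.0, 1.0))

def update_params(X, Y_hat, Z, V, alpha, beta, gamma):
    """Closed-form ridge-style updates for W, then U (given V), then V (given new U)."""
    d, m = X.shape[0], V.shape[1]
    A = np.vstack([Y_hat, Z])
    W = np.linalg.solve(2.0 * X @ X.T + gamma * np.eye(d), 2.0 * X @ Y_hat.T)
    ridge = (beta / alpha) * np.eye(m)
    U = np.linalg.solve(V.T @ V + ridge, (A @ V).T).T
    V = np.linalg.solve(U.T @ U + ridge, (A.T @ U).T).T
    return W, U, V

def fit(X, Y, Z, Omega, alpha, beta, gamma, m, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    L, n = Y.shape
    Y_hat = np.where(Omega == 1, Y, 0.5)          # simple feasible start (assumption)
    W = np.zeros((X.shape[0], L))
    U = 0.1 * rng.standard_normal((L + Z.shape[0], m))
    V = 0.1 * rng.standard_normal((n, m))
    history = []
    for _ in range(n_iter):
        Y_hat = update_Y(W, U, V, X, Y, Omega, alpha)
        W, U, V = update_params(X, Y_hat, Z, V, alpha, beta, gamma)
        history.append(objective(W, Y_hat, U, V, X, Z, alpha, beta, gamma))
    return W, Y_hat, U, V, history

# Synthetic demo data (not from the paper).
rng = np.random.default_rng(1)
d, L, L_hat, n = 6, 4, 3, 40
X = rng.standard_normal((d, n))
Y = (rng.random((L, n)) < 0.3).astype(float)
Z = rng.random((L_hat, n))
Z /= Z.sum(axis=0, keepdims=True)
Omega = (rng.random((L, n)) < 0.3).astype(int)
W, Y_hat, U, V, history = fit(X, Y, Z, Omega, alpha=1.0, beta=0.01, gamma=0.01, m=3)
```

Because every block update is an exact minimization of the same objective, the recorded objective values should be non-increasing across sweeps, which gives a convenient sanity check for an implementation.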

3.4 Mining Auxiliary Labels

In this section, we investigate different types of auxiliary labels, namely human annotated category labels and auxiliary labels transferred from an external knowledge database.

Human Annotated Object Categories

In many tasks, object category annotations are available in the training data. Moreover, from the presence probability of the predefined attributes calculated over images belonging to the same category, as shown in Fig. 2, we found that images from similar object categories usually share more attributes. Therefore, we first investigate using object categories as auxiliary labels to infer the missing attributes. In this case, $Z$ is a sparse matrix with only one non-zero element in each column. For the conventional attribute learning problem, the object category annotations for unseen data are usually not observed. Therefore, we first train a category prediction model using the category supervision of the seen data. The learned model is then used to produce the auxiliary category labels for the unseen data as $Z_u$. Here a ridge regression model is trained using the data $X_\ell$ with known object category annotations $Z_\ell$, and the annotations for $X_u$ are then produced as follows:

$$Z_u = Z_\ell X_\ell^\top (X_\ell X_\ell^\top + \lambda I)^{-1} X_u, \quad (9)$$

where $\lambda$ is the hyperparameter for the L2 regularization of the model parameters in the linear regression model. The auxiliary matrix in Eqn. (3) can be formed as $Z = [Z_\ell, Z_u]$.
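The ridge regression prediction of Eqn. (9) amounts to a single linear solve. The sketch below is illustrative only: the function name and the synthetic one-hot category data are assumptions, and `np.linalg.solve` is used in place of an explicit matrix inverse for numerical stability.

```python
import numpy as np

def predict_auxiliary_labels(X_l, Z_l, X_u, lam):
    """Ridge regression from features to category labels, applied to unseen images.

    X_l : d x n_l labeled features, Z_l : L_hat x n_l category labels,
    X_u : d x n_u unlabeled features, lam : L2 regularization strength.
    """
    d = X_l.shape[0]
    G = X_l @ X_l.T + lam * np.eye(d)             # symmetric positive definite for lam > 0
    B = np.linalg.solve(G, X_l @ Z_l.T).T         # equals Z_l X_l^T G^{-1}, shape L_hat x d
    return B @ X_u

rng = np.random.default_rng(0)
d, L_hat, n_l, n_u = 4, 3, 25, 10
X_l = rng.standard_normal((d, n_l))
X_u = rng.standard_normal((d, n_u))
labels = rng.integers(0, L_hat, size=n_l)
Z_l = np.eye(L_hat)[:, labels]                    # one-hot columns, one category per image
Z_u = predict_auxiliary_labels(X_l, Z_l, X_u, lam=1.0)
```

The predicted columns of `Z_u` are real-valued scores rather than one-hot vectors or probabilities; whether they are further normalized before use is not specified in the text.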

Knowledge Transferring from External Database

In addition to using human annotated category labels, we also investigate auxiliary labels obtained from an external dataset. In particular, we propose to use the Large Scale Visual Recognition Challenge 2012 (ILSVRC 2012) dataset [Russakovsky et al., 2015] as the external database. It contains 1.2 million images and 1000 object categories. We consider the ILSVRC 2012 dataset as the source domain and the dataset for attribute learning as the target domain. Only part of the categories defined in the target domain may appear in the source domain. By using a base model (e.g. AlexNet [Krizhevsky et al., 2012]) pre-trained on ILSVRC 2012, we can extract the category prediction $S \in \mathbb{R}^{c \times n}$ for all the images $X$ in the target domain, where $c = 1000$ denotes the number of source domain object categories and each column of $S$ sums to one. Since the source domain contains many more object categories than the target domain, it is not efficient to directly use all the posterior probabilities as auxiliary labels.

Figure 2: The presence probability of attributes for different object categories, P(attribute|category), on the aPascal and aYahoo datasets (yellow means high probability). [Two panels, aPascal and aYahoo; axes: attributes versus object categories.]

Therefore, we propose the following two ways to make full use of the external knowledge.

Semantic Pooling. We first propose Semantic Pooling to select the most relevant object categories from the source domain as the auxiliary labels for attribute learning. By summing up the presence probabilities of each source domain object category over the target domain images, i.e., $\bar{S}_j = \sum_i S_{j,i}$, we find that the summed probabilities follow a long-tail distribution, which means only a small proportion of object categories from the source domain are relevant to the target dataset. Therefore, we propose to pool the most relevant object categories as the auxiliary labels. We use $R = \{j \mid \bar{S}_j > t\}$ to denote the indices of the selected object categories, where $t$ is the threshold for category selection. We can then produce the auxiliary category label matrix as $Z = S(R, :)$ and further normalize $Z$ so that each column sums to one.

Semantic Propagation. Instead of only using part of the object categories of the source domain, we also try to directly propagate the posterior probabilities of all the source object categories to the target object categories. We model the similarity between the object categories of the source domain and target domain using the WordNet hierarchy [Fellbaum, 1998]. We denote the $i$th object in the source domain and the $j$th object in the target domain as $O^s_i$ and $O^t_j$ respectively. The similarity of the two categories is measured by the Wu-Palmer Similarity [Bird et al., 2009]. It is based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (LCS), and is calculated as

$$K(O^s_i, O^t_j) = \frac{2 \, \mathrm{Depth}(\mathrm{LCS}(O^s_i, O^t_j))}{\mathrm{Depth}(O^s_i) + \mathrm{Depth}(O^t_j)}.$$

The propagation matrix can then be constructed as follows:

$$T_{i,j} = \frac{\exp(\rho K(O^s_i, O^t_j)^2)}{\sum_{k=1}^{\hat{L}} \exp(\rho K(O^s_i, O^t_k)^2)}, \quad (10)$$

where $\rho$ is a parameter to be specified. The auxiliary label matrix can be obtained as $Z = T^\top S$ by propagating the posterior probabilities from the source domain to the target domain.
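Both ways of deriving $Z$ from the source-domain posteriors $S$ can be sketched in a few lines of NumPy. In this sketch the Wu-Palmer similarities are taken as a precomputed matrix `K` (source categories by target categories) rather than queried from WordNet, the propagation is written as $Z = T^\top S$ so that the shapes agree, and the function names and toy values are assumptions for illustration.

```python
import numpy as np

def semantic_pooling(S, t):
    """Keep source categories whose summed probability exceeds t, then renormalize columns."""
    totals = S.sum(axis=1)                       # summed presence probability per category
    R = np.flatnonzero(totals > t)               # indices of the selected categories
    Z = S[R, :]
    Z = Z / np.maximum(Z.sum(axis=0, keepdims=True), 1e-12)
    return Z, R

def propagation_matrix(K, rho):
    """Row-wise softmax of rho * K^2 (Eqn. 10); K is c x L_hat."""
    logits = rho * K ** 2
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability only
    E = np.exp(logits)
    return E / E.sum(axis=1, keepdims=True)

def semantic_propagation(S, K, rho):
    """Propagate source posteriors (c x n) to target categories (L_hat x n)."""
    return propagation_matrix(K, rho).T @ S

# Toy example: c = 3 source categories, n = 2 images, L_hat = 2 target categories.
S = np.array([[0.7, 0.6],
              [0.2, 0.3],
              [0.1, 0.1]])
K = np.array([[0.9, 0.2],     # made-up similarity values in (0, 1]
              [0.3, 0.8],
              [0.5, 0.5]])
Z_pool, R = semantic_pooling(S, t=0.4)
T = propagation_matrix(K, rho=10.0)
Z_prop = semantic_propagation(S, K, rho=10.0)
```

Since each row of $T$ sums to one and each column of $S$ sums to one, the columns of the propagated matrix also sum to one, so no extra normalization is needed in that case.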


Table 1: Detailed information for the datasets.

| Dataset | # images | # attributes | # objects |
|---|---|---|---|
| aPascal | 12785 | 64 | 20 |
| aYahoo | 2644 | 47 | 12 |
| ImageNet attribute | 9600 | 25 | 384 |

4 Experiments

4.1 Experimental Setting

Datasets. We conducted experiments on three real-world datasets for attribute learning. aPascal [Farhadi et al., 2009] contains 6430 training images and 6355 testing images from the Pascal VOC 2008 challenge. Each image belongs to one of twenty object categories and is annotated with 64 binary attribute labels. aYahoo [Farhadi et al., 2009] contains 2644 images belonging to twelve object categories. Each image is annotated with the same 64 binary attributes as the aPascal dataset. After discarding the attributes with no positive data, we obtain 47 attributes for the experiments. INA (ImageNet Attributes [Russakovsky and Fei-Fei, 2010]) contains 9,600 images across 384 categories. Each image is annotated with 25 binary attributes. The information about the three datasets is summarized in Table 1.

Experimental Setup. For the aPascal dataset, we use the default {train, test} split and hold out half of the training data for validation. For aYahoo and INA, we randomly split the dataset into three subsets of equal size for training, validation and testing. We used Convolutional Neural Networks (CNN) [Donahue et al., 2014] to extract 4096-dimensional DeCAF features for each image within the provided bounding box area. The performance of the attribute predictors is measured by mAUC (mean Area Under ROC) and mAP (mean Average Precision) to reflect the average performance over all the attributes. To simulate the incomplete attribute learning setting, we randomly used {10, 20, 30, 40, 50} percent of the annotated labels for model training. We compared all methods using the same data setting, randomly sampled the observed attribute labels and repeated each experiment five times.

Comparison Methods. In the experiments, we compare the proposed approach with the following methods: (1) the mixed graph method for multi-label learning with missing labels (ML-MG) [Wu et al., 2015]; (2) the unified multiplicative framework for attribute learning (UMF) [Liang et al., 2015]; (3) the concatenation method with multiple input information (Concat); and (4) the baseline binary relevance method (BR). The first two methods are state-of-the-art methods for multi-label learning with incomplete labels and visual attribute learning respectively. ML-MG incorporates instance level similarity and label dependencies to handle missing labels. UMF integrates object recognition into the conventional attribute learning model in a multiplicative way. The Concat method also exploits object labels; it concatenates the image features and the auxiliary object labels together as the input data. Compared with UMF, Concat leverages multiple sources of information in an additive way. BR is a widely used method for multi-label classification; for BR, we independently train a logistic regression model for each binary attribute. We used the open-source code of ML-MG and UMF to conduct the experiments. For all the methods, we conducted parameter selection based on the performance on the validation set. For our proposed approach, we select the trade-off parameter $\alpha$ from $\{10^{-2}, 10^{-1}, 1, 10, 100\}$ and select $\beta$ from $\{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}\}$ while setting $\beta$ and $\gamma$ to be equal.
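The incomplete-label protocol and the mAP metric described above can be sketched as follows. This is an illustration under assumptions: observed entries are sampled uniformly over the whole attribute matrix, average precision is computed as the mean of precision values at the ranks of the positive examples, and attributes without any positive example are skipped; the paper does not spell out these details, and the function names are hypothetical.

```python
import numpy as np

def sample_observation_mask(shape, rate, rng):
    """Binary mask Omega with round(rate * size) observed entries chosen uniformly at random."""
    size = int(np.prod(shape))
    k = int(round(rate * size))
    flat = np.zeros(size, dtype=int)
    flat[rng.choice(size, size=k, replace=False)] = 1
    return flat.reshape(shape)

def average_precision(y_true, scores):
    """Mean of precision@k over the ranks k at which a positive example appears."""
    order = np.argsort(-scores, kind="stable")   # rank images by decreasing score
    hits = y_true[order]
    if hits.sum() == 0:
        return float("nan")                      # undefined without positives
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

def mean_average_precision(Y_true, Y_score):
    """Average the per-attribute AP over the rows of an L x n label matrix."""
    aps = [average_precision(Y_true[i], Y_score[i]) for i in range(Y_true.shape[0])]
    return float(np.nanmean(aps))

rng = np.random.default_rng(0)
Omega = sample_observation_mask((64, 100), rate=0.1, rng=rng)   # 10 percent observed

Y_true = np.array([[1, 0, 1, 0],
                   [1, 1, 0, 0]])
Y_score = np.array([[0.9, 0.8, 0.7, 0.1],
                    [0.9, 0.8, 0.2, 0.1]])
m_ap = mean_average_precision(Y_true, Y_score)
```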

4.2 Experiment Results

Incomplete Attribute Learning

In this section, we take human annotated object categories as the auxiliary labels. We use the ground-truth category labels for seen data and acquire object category predictions for unseen data based on Eqn. (9). From the results in Fig. 3, we can see that the approaches that integrate auxiliary labels (UMF, Concat and the proposed method) perform better than the other methods, especially when the observed attribute labels are rare. This demonstrates the effectiveness of using auxiliary labels for attribute prediction. Comparing the different ways of leveraging auxiliary labels, we find that Concat is not very effective, since it tends to fall behind BR on aPascal and aYahoo as the number of observed attribute labels increases. The main difference between Concat and our proposed method is that Concat uses the auxiliary labels to directly infer the missing attributes without considering the observed attributes. Compared with Concat and the proposed method, UMF works well on aPascal and aYahoo but fails on INA, which has many more categories. ML-MG does not perform well on attribute learning tasks. The main reason may be that attributes do not have the semantic label hierarchy (such as animal → horse and plant → grass) that is commonly present in multi-label learning problems. Moreover, the co-occurrence of attributes is hard to exploit if the relationship between attributes and objects is not well considered. Our proposed approach consistently outperforms the other comparison methods in terms of the mAP evaluation metric on all three datasets, especially when only a small portion of attribute labels is observed. When observing 10 percent of the attributes on the aPascal dataset, our proposed method improves the state-of-the-art performance by about 2% according to both mAP and mAUC. For the aYahoo dataset, we achieve the best mAP performance but fall behind UMF on mAUC. Since most attributes are present much less often than they are absent, attribute learning usually suffers from a data imbalance problem. According to [Davis and Goadrich, 2006], the Precision-Recall curve is a more reasonable evaluation metric than the ROC curve for comparing methods on an imbalanced learning task.

Mining Auxiliary Labels

We conduct experiments by taking advantage of the external database. The source domain is specified to be the 1000 object categories defined in the ILSVRC 2012 dataset. We extract the posterior probabilities of source domain object categories on the input images using two base networks: AlexNet [Donahue et al., 2014] and VGG-16 [Simonyan and Zisserman, 2014]. Then we use the two methods proposed in Sec. 3.4 to construct the auxiliary label matrix $Z$. For semantic propagation, we manually map the object categories from both ILSVRC 2012 and the three benchmark datasets into the WordNet hierarchy. Then we can measure the similarity between two arbitrary object categories and compute the propagation matrix based on Eqn. (10). We calculated the performance gain of using auxiliary labels over the binary relevance method (BR). As shown in Fig. 4, mining auxiliary labels is always helpful for learning from incomplete attributes. Comparing the two methods of generating auxiliary labels, Semantic Propagation (PG) achieves better performance than Semantic Pooling (SP) on aPascal and INA. This shows the effectiveness of leveraging the WordNet taxonomy. Compared with using AlexNet as the base network, using VGG always shows better performance. This is reasonable as VGG achieves a lower error rate than AlexNet on the ILSVRC 2012 dataset.

We conducted experiments with different hyperparameter values while observing 10% of the attributes on aPascal. For Semantic Pooling, we modify the threshold to pool the top K object categories of the source domain. As shown in Fig. 5, the performance of Semantic Pooling starts to drop when more object categories are involved as auxiliary labels. For Semantic Propagation, choosing a larger value for $\rho$ achieves better performance. Comparing Fig. 3 and Fig. 4, the proposed method achieves better performance by using human annotated object categories rather than mined auxiliary labels. The main reason is that some of the object categories in the target datasets are missing from the ILSVRC 2012 dataset, even though the latter contains many more object categories. However, mining auxiliary labels is still promising as it does not need any human annotations, which dramatically decreases the cost of labeling.

Figure 3: The performance of different comparison methods on the three benchmark datasets with incomplete attribute labels. [Six panels; x-axis: percentage of observed attributes (10 to 50); methods: BR, Concat, ML-MG, UMF, Proposed.]

Figure 4: Knowledge Transferring from External Database. [x-axis: percentage of observed attributes (10 to 50); y-axis: mAP gain (%); methods: SP, SP_VGG, PG, PG_VGG.]

Figure 5: Hyperparameter tuning on the aPascal validation set. [Two panels: Semantic Pooling (x-axis: top K, 50 to 300) and Semantic Propagation (x-axis: 1e-1 to 1e2).]

5 Conclusion

We proposed a novel transductive learning method that integrates auxiliary labels for incomplete attribute learning. By modeling the relationship between attributes and auxiliary labels, the missing attributes can be recovered effectively. The proposed model can be solved efficiently by alternately optimizing constrained quadratic programming problems and updating parameters with closed-form solutions. In addition, we investigated different ways to acquire auxiliary labels. By taking human annotated object category labels as the auxiliary labels, our proposed method achieves state-of-the-art performance on three widely used datasets. Moreover, the auxiliary labels transferred from a large scale dataset can also improve the performance without adding extra human annotation cost.

Acknowledgments

Research supported by China Scholarship Council (No. 201604910935), Natural Science Foundation of China (No. 61390515) and the Canada Research Chairs program.


References

[Bird et al., 2009] Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc., 2009.

[Chen et al., 2013] Minmin Chen, Alice X Zheng, and Kilian Q Weinberger. Fast image tagging. In Proc. of ICML, pages 1274–1282, 2013.

[Chen et al., 2014] Lin Chen, Qiang Zhang, and Baoxin Li. Predicting multiple attributes via relative multi-task learning. In Proc. of CVPR, pages 1027–1034. IEEE, 2014.

[Davis and Goadrich, 2006] Jesse Davis and Mark Goadrich. The relationship between precision-recall and roc curves. In Proc. of ICML, pages 233–240. ACM, 2006.

[Donahue et al., 2014] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In Proc. of ICML, pages 647–655, 2014.

[Fang et al., 2015] Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh K Srivastava, Li Deng, Piotr Dollár, et al. From captions to visual concepts and back. In Proc. of CVPR, pages 1473–1482, 2015.

[Farhadi et al., 2009] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their attributes. In Proc. of CVPR, pages 1778–1785. IEEE, 2009.

[Fellbaum, 1998] Christiane Fellbaum. WordNet. Wiley Online Library, 1998.

[Frome et al., 2013] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. Devise: A deep visual-semantic embedding model. In Proc. of NIPS, pages 2121–2129, 2013.

[Huang et al., 2015] Sheng Huang, Mohamed Elhoseiny, Ahmed Elgammal, and Dan Yang. Learning hypergraph-regularized attribute predictors. In Proc. of CVPR, pages 409–417, 2015.

[Hwang and Sigal, 2014] Sung Ju Hwang and Leonid Sigal. A unified semantic embedding: Relating taxonomies and attributes. In Proc. of NIPS, pages 271–279, 2014.

[Jayaraman and Grauman, 2014] Dinesh Jayaraman and Kristen Grauman. Zero-shot recognition with unreliable attributes. In Proc. of NIPS, pages 3464–3472, 2014.

[Jayaraman et al., 2014] Dinesh Jayaraman, Fei Sha, and Kristen Grauman. Decorrelating semantic visual attributes by resisting the urge to share. In Proc. of CVPR, pages 1629–1636. IEEE, 2014.

[Kovashka et al., 2012] Adriana Kovashka, Devi Parikh, and Kristen Grauman. Whittlesearch: Image search with relative attribute feedback. In Proc. of CVPR, pages 2973–2980. IEEE, 2012.

[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Proc. of NIPS, pages 1097–1105, 2012.

[Lampert et al., 2014] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE TPAMI, 36(3):453–465, 2014.

[Liang et al., 2015] Kongming Liang, Hong Chang, Shiguang Shan, and Xilin Chen. A unified multiplicative framework for attribute learning. In Proc. of ICCV, pages 2506–2514, 2015.

[Liang et al., 2016] Kongming Liang, Hong Chang, Shiguang Shan, and Xilin Chen. Attribute conjunction learning with recurrent neural network. In Proc. of ECML-PKDD, pages 345–360. Springer, 2016.

[Lu et al., 2016] Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In Proc. of ECCV, pages 852–869. Springer, 2016.

[Lu, 2016] Yao Lu. Unsupervised learning on neural network outputs: with application in zero-shot learning. In Proc. of IJCAI, 2016.

[Russakovsky and Fei-Fei, 2010] Olga Russakovsky and Li Fei-Fei. Attribute learning in large-scale datasets. In Proc. of ECCV Workshop, 2010.

[Russakovsky et al., 2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.

[Simonyan and Zisserman, 2014] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[Sun et al., 2010] Yu-yin Sun, Yin Zhang, and Zhi-hua Zhou. Multi-label learning with weak label. In Proc. of AAAI. Citeseer, 2010.

[Wang and Ji, 2013] Xiaoyang Wang and Qiang Ji. A unified probabilistic approach modeling relationships between attributes and objects. In Proc. of ICCV, 2013.

[Wu et al., 2013] Lei Wu, Rong Jin, and Anil K Jain. Tag completion for image retrieval. IEEE TPAMI, 35(3):716–727, 2013.

[Wu et al., 2015] Baoyuan Wu, Siwei Lyu, and Bernard Ghanem. Ml-mg: multi-label learning with missing labels using a mixed graph. In Proc. of ICCV, 2015.

[You et al., 2016] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In Proc. of CVPR, pages 4651–4659, 2016.

[Yu et al., 2014] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit S Dhillon. Large-scale multi-label learning with missing labels. In Proc. of ICML, pages 593–601, 2014.

[Zhao and Guo, 2015] Feipeng Zhao and Yuhong Guo. Semi-supervised multi-label learning with incomplete labels. In Proc. of IJCAI, pages 4062–4068. AAAI Press, 2015.

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)