# Incomplete Label Distribution Learning

Miao Xu and Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
{xum,zhouzh}@lamda.nju.edu.cn

Abstract. Label distribution learning (LDL) assumes labels can be associated with an instance to some degree, so it can learn the relevance of a label to a particular instance. Although LDL has found successful practical applications, one problem with existing LDL methods is that they are designed for data with complete supervised information, while in reality annotation information may be incomplete, because assigning each label a real value to indicate its association with a particular instance incurs a large cost in labor and time. In this paper, we solve the LDL problem given incomplete supervised information. We propose an objective based on trace norm minimization to exploit the correlation between labels, and develop a proximal gradient descent algorithm as well as an algorithm based on the alternating direction method of multipliers. Experiments validate the effectiveness of our proposal.

This work was supported by the NSFC (61333014), the Collaborative Innovation Center of Novel Software Technology and Industrialization, and the Huawei Fund (YBN2017030027).

1 Introduction

Classical machine learning tasks assume that a label is either associated with an instance or not. However, in some real applications labels may be associated with an instance to some degree, so each instance is annotated by soft labels rather than a single label or a set of labels. Label Distribution Learning (LDL) [Geng, 2016], which learns a mapping from a particular instance to a distribution across all the labels, can assign the relevance of each label to a particular instance. In recent years, LDL has been successfully used in facial age estimation [Geng et al., 2013], action detection in videos [Geng and Ling, 2017], facial expression recognition [Zhou et al., 2015], crowd opinion prediction [Geng and Hou, 2015], and so on.

Despite these successful applications, one problem with existing LDL methods is that they are designed for data with complete supervised information. In reality, annotation information may be incomplete. Annotations are often given by human annotators, and assigning each label a real value to indicate its association with a particular instance incurs a large cost in labor and time, especially when there are many labels and instances. On the other hand, for some labels it may be difficult to give an accurate value indicating how they are related to a particular instance. All these phenomena result in training data with incomplete supervised information, which raises the question of how to do LDL with incomplete annotation (IncomLDL).

At first glance, it may seem trivial to adapt existing LDL algorithms to the IncomLDL problem. Some LDL algorithms are based on the maximum entropy model [Geng et al., 2010; 2013; Geng, 2016]; they can be adapted by optimizing the sum of the entropies of observed labels only. Other LDL algorithms transform the LDL problem into binary classification by sampling labels according to their relevance degrees [Geng, 2016]; they can be adapted to the incomplete case by using the observed supervised information as sampling weights.
Even if these algorithms can be adapted to solve the IncomLDL problem, they treat each label separately, ignoring the fact that LDL is used in scenarios where labels are interlaced with each other [Geng and Ling, 2017]. Without considering the correlation between labels, under severe incomplete annotation the training instances available for each single label are tremendously reduced, so many more training instances would be needed to learn a classifier as good as one learned from complete data. Label correlations should therefore be exploited to reduce the effect of lacking training data in IncomLDL. Moreover, although one advantage of LDL is that this learning paradigm can take label correlation into consideration by assigning similar relevances to similar labels, when annotations are missing such similarities are also lost. Without such similarity, we need to find new ways to characterize label correlation in IncomLDL.

Multi-label learning (MLL) [Zhou and Zhang, 2017], which assumes each instance is associated with multiple labels, is highly related to the LDL problem. However, there is a fundamental difference between MLL and LDL: MLL assumes a label is either related to an instance or not, while LDL uses the relative importance of each label to describe the instance. Similar to LDL, MLL also faces the incomplete annotation problem, and various algorithms have been proposed based on the low-rank assumption [Xu et al., 2013], instance-level smoothness [Wu et al., 2016], and label embedding [Bi and Kwok, 2014]. However, these algorithms are designed to output hard 0/1 labels for multi-label data, while in LDL the prediction for a particular instance is constrained to lie within a probability simplex. Furthermore, adding the probability simplex constraint to the non-smooth unconstrained incomplete MLL problem makes the optimization difficult, so new optimization algorithms need to be developed for the new problem.

In this paper, we solve the IncomLDL problem by considering the correlation between labels, which has never been considered by previous LDL algorithms. Assuming labels are correlated and determined by a few factors, we take the label distribution matrix to be low-rank and propose an optimization objective based on trace norm minimization [Cai et al., 2010]. Our objective requires the entries of the recovered label distribution matrix at the observed positions to be close to the observed values, under the constraint that the recovered label distribution for every instance forms a probability simplex. After showing that, following the standard analysis of convex optimization, it is non-trivial to obtain the optimal solution using the accelerated proximal gradient descent technique [Tseng, 2008], we additionally develop an algorithm based on the alternating direction method of multipliers [Boyd et al., 2011]. Experiments on 15 real label distribution data sets with various missing percentages validate the effectiveness of the proposed algorithms.

The paper is organized as follows. Section 2 briefly reviews related work. Our two proposed algorithms, IncomLDL-prox and IncomLDL-admm, are introduced in Section 3. Experimental results are presented in Section 4, followed by the conclusion in Section 5.
2 Related Work

Label distribution learning (LDL) [Geng, 2016] was first proposed to solve the facial age estimation problem [Geng et al., 2010; 2013], motivated by the fact that faces at neighboring ages are similar; in such problems, a distribution across all ages is more desirable than a single age for a face. Later, [Geng, 2016] observed that in some real applications a distribution across all labels is more desirable than the association of a single label with an instance. For example, in biology experiments, the gene expression level across all time periods is more informative than the expression level at a particular time point [Geng, 2016]; in facial expression recognition, an expression can hardly be described by a single pure emotion, but rather by a mixture of several basic emotions [Zhou et al., 2015]. Additionally, some applications have natural uncertainty in annotation, so the instance is annotated by a distribution across labels rather than a single label or a set of labels. One example is crowdsourcing data such as movie ratings [Geng and Hou, 2015], in which different people may have different attitudes, so the crowd opinion naturally forms a distribution across labels.

Various algorithms have been proposed for LDL, which can be divided into three groups. One group optimizes the sum of log-likelihoods over all training labels and instances. Two representative algorithms, IIS-LDL [Geng et al., 2013] and BFGS-LDL [Geng, 2016], belong to this group, using improved iterative scaling [Pietra et al., 1997] and BFGS [Nocedal and Wright, 2006], respectively, for optimization. Another group is based on problem transformation [Geng, 2016], transforming LDL into a binary classification problem by sampling from the original LDL data using the description degree of each label as the sampling weight; after sampling, base learners such as SVM or Naive Bayes are used for binary classification, yielding the algorithms PT-SVM and PT-Bayes, respectively. The final group of LDL algorithms is based on algorithm adaptation: existing algorithms such as kNN [Geng, 2016] and boosting [Xing et al., 2016] are adapted to fit the LDL schema.

When the training data have incomplete annotation, some of these algorithms can be adapted to handle it. However, they treat each label separately, ignoring the fact that LDL is used in scenarios where labels are interlaced with each other [Geng and Ling, 2017]. Without considering the correlation between labels, under incomplete annotation the training data for a particular label are dramatically reduced, so much more training data would be required to obtain a satisfactory classifier. Moreover, although LDL can naturally exploit label correlation [Geng, 2016] by assigning similar description degrees to similar labels, when a large portion of the label distribution matrix is missing (up to 90% in our experiments), most annotations are unavailable. It is then hard for LDL to acquire such similarity information from the incomplete annotations, so the classical way in which LDL exploits label correlation cannot be used, and new ways to characterize label correlation for IncomLDL are needed.

Multi-label learning (MLL) [Zhou and Zhang, 2017] assumes each instance is associated with a set of labels. Label distribution learning is a natural extension of multi-label learning, extending crisp 0/1 label belongingness to soft probabilistic label belongingness.
There are abundant studies on multi-label learning with incomplete label assignments [Sun et al., 2010; Yang et al., 2013; Bi and Kwok, 2014; Wu et al., 2016]. Trace norm minimization [Goldberg et al., 2010; Zhao and Guo, 2015; Xu et al., 2013; Yu et al., 2014] has been popular, since it is equivalent to a low-rank assumption on the label matrix, which implicitly exploits the label correlation. These approaches are not directly applicable to LDL because, rather than crisp 0/1 label outputs, LDL outputs are constrained to lie in a probability simplex.

To summarize, we need new LDL algorithms that can deal with incomplete supervised information. By considering the label correlation, the proposed algorithms should give better results than existing LDL algorithms.

3 The IncomLDL Methods

3.1 Formalization

We first give a more formal definition of LDL. LDL assigns a value $d_x^y$, called the description degree, to instance $x$ for a particular label $y$, indicating the relevance of label $y$ to instance $x$. In LDL, the $d_x^y$'s for a particular instance form a probability simplex. Note that $d_x^y$ is not the probability that $y$ correctly labels $x$, but the proportion that $y$ accounts for in a full description of $x$ [Geng, 2016]. All labels with non-zero $d_x^y$ are correct labels for describing the instance, and their relative importance is measured by $d_x^y$.

Let $X \in \mathbb{R}^{n \times d}$ be the feature matrix, where $n$ is the number of instances and $d$ is the number of features, so the $i$th row of $X$ is instance $x_i$. $D \in \mathbb{R}^{n \times m}$ is the label distribution matrix, where $m$ is the number of labels and $D_{ij} = d_{x_i}^{y_j}$. Since the supervised information for each instance follows a probability simplex, we have $\sum_{j=1}^{m} D_{ij} = 1$ for all $i \in [n]$ and $D_{ij} \geq 0$ for all $(i, j) \in [n] \times [m]$.

When dealing with incomplete annotation, we assume that entries of the label distribution matrix $D$ are missing uniformly at random. Let $\Omega \subseteq [n] \times [m]$ denote the indices of observed entries, sampled uniformly at random from $D$. We denote the observed label distribution matrix by $\tilde{D}$, which has the same size as $D$, with entries at observed positions equal to those of $D$ and entries at unobserved positions equal to zero. Specifically, we define the linear operator $R_\Omega: \mathbb{R}^{n \times m} \to \mathbb{R}^{n \times m}$ as
$$[R_\Omega(M)]_{ij} = \begin{cases} M_{ij} & \text{if } (i, j) \in \Omega \\ 0 & \text{otherwise.} \end{cases}$$
Finally, we use $\|\cdot\|_F$ to denote the Frobenius norm of a matrix, and $\|\cdot\|_{tr}$ to denote the trace norm of a matrix, i.e., the sum of all its singular values.

Based on the above notation, to solve the IncomLDL problem we propose a learning objective based on the squared loss together with a regularizer exploiting label correlations. Specifically, we have the following optimization objective:
$$\min_{W} \ \frac{1}{2}\|R_\Omega(XW - \tilde{D})\|_F^2 + \lambda \|XW\|_{tr} \quad (1)$$
$$\text{s.t.} \quad XW \mathbf{1}_m = \mathbf{1}_n, \quad XW \geq \mathbf{0}_{n \times m}$$
where $W \in \mathbb{R}^{d \times m}$ is the coefficient matrix we learn, $\mathbf{1}_n$ and $\mathbf{1}_m$ are the all-ones vectors of lengths $n$ and $m$, respectively, and $\mathbf{0}_{n \times m}$ is the all-zero matrix of size $n \times m$. In Eq. 1 we assume a linear relationship between the label matrix $D$ and the feature matrix $X$, since linear classifiers have achieved good performance in previous studies [Geng, 2016]. The recovered label matrix is thus $\hat{D} = XW$, where $W$ holds the linear coefficients and is the variable to be learned. To exploit the correlation between labels, we assume that $D$ is low-rank, i.e., $XW$ has a small trace norm. $\lambda$ is the regularization parameter trading off the trace norm against the Frobenius-norm difference at the observed positions. Because the recovered supervised information for each instance must lie in the probability simplex, i.e., be nonnegative and sum to one, we add the constraint $XW \mathbf{1}_m = \mathbf{1}_n$ to ensure that each row of $\hat{D}$ sums to one, and $XW \geq \mathbf{0}_{n \times m}$ to constrain all entries to be nonnegative. Combining these aspects gives Eq. 1 as our learning objective for the IncomLDL problem.
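To make the objective concrete, the following is a minimal numerical sketch of evaluating the loss in Eq. 1 with numpy; it is not the authors' Matlab implementation, and the function and argument names are ours. The binary mask plays the role of $R_\Omega$, and the trace norm is the sum of singular values of $XW$.

```python
import numpy as np

def incomldl_objective(X, W, D_obs, mask, lam):
    """Value of Eq. 1: masked squared loss plus trace-norm penalty on XW.

    X:     (n, d) feature matrix
    W:     (d, m) coefficient matrix
    D_obs: (n, m) observed label distribution matrix (zeros at missing entries)
    mask:  (n, m) binary matrix, 1 at observed positions (the index set Omega)
    lam:   trace-norm regularization parameter
    """
    XW = X @ W
    # squared loss restricted to observed entries: (1/2) * ||R_Omega(XW - D~)||_F^2
    loss = 0.5 * np.sum((mask * (XW - D_obs)) ** 2)
    # trace norm of XW = sum of its singular values
    trace_norm = np.linalg.svd(XW, compute_uv=False).sum()
    return loss + lam * trace_norm
```

The simplex constraints of Eq. 1 are not checked here; they are enforced by the algorithms described next.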
3.2 Optimizing using Proximal Gradient Descent

Following [Xu et al., 2013], we first use accelerated proximal gradient descent to optimize Eq. 1, which achieves a convergence rate of $O(1/T^2)$ [Tseng, 2008], where $T$ is the number of iterations. Since it is difficult to handle the trace norm of $XW$ directly, we assume $X$ is orthonormal without loss of generality; if $X$ is not orthonormal, we can compute its SVD and use the top right singular vectors as a replacement for $X$. With $X$ orthonormal, the optimization becomes
$$\min_{W} \ \frac{1}{2}\|R_\Omega(XW - \tilde{D})\|_F^2 + \lambda \|W\|_{tr} \quad (2)$$
$$\text{s.t.} \quad XW \mathbf{1}_m = \mathbf{1}_n, \quad XW \geq \mathbf{0}_{n \times m}$$
Ignoring the constraints, accelerated proximal gradient descent optimizes Eq. 2 iteratively. In the $t$th iteration it introduces an auxiliary variable
$$Y_t = W_t + \theta_t\Big(\frac{1}{\theta_{t-1}} - 1\Big)(W_t - W_{t-1}) \quad (3)$$
and then solves the subproblem
$$\min_{W} \ \lambda\|W\|_{tr} + \frac{L}{2}\Big\|W - Y_t + \frac{1}{L}X^\top R_\Omega(XY_t - \tilde{D})\Big\|_F^2 \quad (4)$$
which has a closed-form solution by SVT [Cai et al., 2010]. Here $L$ is the Lipschitz constant, which can be found by line search: we initialize $L$ and increase it until the following condition is violated,
$$\ell(W_{t+1}) \geq \ell(Y_t) + \langle \nabla\ell(Y_t), W_{t+1} - Y_t\rangle + \frac{L}{2}\|W_{t+1} - Y_t\|_F^2 \quad (5)$$
where $\ell(W) = \|R_\Omega(XW - \tilde{D})\|_F^2 / 2$.

One problem with the above procedure is that it is designed for unconstrained problems, while our problem is constrained. A straightforward remedy is to project the solution $W_t$ onto the set defined by the constraints, i.e., to project each row of $\hat{D}_t = XW_t$ onto the probability simplex. Denoting the $i$th row of $\hat{D}_t$ by $\hat{d}_t^i$, the projection solves
$$\min_{d} \ \|d - \hat{d}_t^i\|^2 \quad \text{s.t.} \quad d^\top\mathbf{1}_m = 1, \ d \geq \mathbf{0}_m \quad (6)$$
where, when no confusion arises, we omit the indices $i$ and $t$ in $\hat{d}_t^i$. This problem has an $O(m \log m)$ solution: sort the elements of $\hat{d}$ in descending order into $u$, find the largest index $\rho$ such that $u_\rho + \frac{1}{\rho}(1 - \sum_{i=1}^{\rho} u_i) > 0$, and set $\theta = \frac{1}{\rho}(1 - \sum_{i=1}^{\rho} u_i)$. The optimal $d$ is then given by $d_j = \max\{\hat{d}_j + \theta, 0\}$ for all $j \in [m]$. The proof of correctness of this projection procedure can be found in [Duchi et al., 2008; Wang and Carreira-Perpiñán, 2013].
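For illustration, the sort-based projection just described can be written in a few lines. The sketch below assumes numpy and a 1-D input vector; it is not the authors' Matlab implementation, and the function name is ours.

```python
import numpy as np

def project_to_simplex(d_hat):
    """Euclidean projection of a vector onto the probability simplex,
    following the O(m log m) sort-based procedure of
    [Duchi et al., 2008; Wang and Carreira-Perpinan, 2013]."""
    m = d_hat.shape[0]
    u = np.sort(d_hat)[::-1]          # sort entries in descending order
    cssv = np.cumsum(u)               # prefix sums of the sorted entries
    ks = np.arange(1, m + 1)
    # largest (1-based) index rho with u_rho + (1 - sum_{i<=rho} u_i)/rho > 0
    rho = np.nonzero(u + (1.0 - cssv) / ks > 0)[0][-1] + 1
    theta = (1.0 - cssv[rho - 1]) / rho
    return np.maximum(d_hat + theta, 0.0)
```

Applying this projection to every row of $\hat{D}_t = XW_t$ yields the matrix $R_{prob}(\hat{D}_t)$ used below.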
After we obtain $R_{prob}(\hat{D}_t)$, which projects each row of $\hat{D}_t$ onto the probability simplex, we recover $W_t$ by solving
$$\min_{W} \ \|XW - R_{prob}(\hat{D}_t)\|_F^2 \quad (7)$$
which has the closed-form solution $(X^\top X)^{\dagger}[X^\top R_{prob}(\hat{D}_t)]$.

The IncomLDL-prox algorithm, which uses accelerated proximal gradient descent followed by a projection onto the probability simplex, is summarized in Alg. 1.

Algorithm 1 IncomLDL-prox
1: Initialization: $\theta_1 = \theta_2 \in (0, 1]$, $W_1 = W_2$, $L$, $\gamma > 1$, and stopping criterion $\epsilon$
2: $t = 2$
3: while stopping criterion is not satisfied do
4:     Calculate $Y_t$ by Eq. 3
5:     Calculate $W_{t+1}$ by Eq. 4
6:     while Eq. 5 is satisfied do
7:         $L = L\gamma$
8:         Update $W_{t+1}$ by Eq. 4 with the new $L$
9:     end while
10:    $\theta_{t+1} = (\sqrt{\theta_t^4 + 4\theta_t^2} - \theta_t^2)/2$
11:    $t = t + 1$
12:    $\hat{D}_t = XW_t$
13:    Calculate each row of $R_{prob}(\hat{D}_t)$ by solving Eq. 6
14:    Update $W_t$ by Eq. 7
15: end while

One question remains: does the projection procedure really lead to a minimum of the following constrained optimization problem in the $t$th iteration?
$$\min_{W} \ \lambda\|W\|_{tr} + \frac{L}{2}\|W - Q_t\|_F^2 \quad (8)$$
$$\text{s.t.} \quad XW \mathbf{1}_m = \mathbf{1}_n, \quad XW \geq \mathbf{0}_{n \times m}$$
where $Q_t = Y_t - X^\top R_\Omega(XY_t - \tilde{D})/L$. The Lagrange dual problem of Eq. 8 maximizes, over $\alpha$ and $B$, an objective of the form
$$\mathcal{D}_{\lambda/L}\big[Q_t + X^\top A + X^\top B\big] + \alpha^\top \mathbf{1}_n - \frac{1}{2}\big\|Q_t + X^\top A + X^\top B\big\|_F^2$$
where $\alpha$ and $B$ are the Lagrange multipliers associated with the equality and inequality constraints, respectively, $A \in \mathbb{R}^{n \times m}$ is the concatenation of $m$ copies of $\alpha$, and $\mathcal{D}_{\lambda/L}[\cdot]$ is the SVT operator [Cai et al., 2010]. To solve Eq. 8, we would first solve the Lagrange dual problem for $\alpha$ and $B$; with the optimal $\alpha^*$ and $B^*$, the original unconstrained optimizer $W$ could then be calculated. However, the maximization involving $\mathcal{D}_{\lambda/L}[\cdot]$ has no closed-form solution for $\alpha$ and $B$. Although Alg. 1 tackles the problem by calculating $\mathcal{D}_{\lambda/L}[Q_t]$ and then projecting the solution onto the probability simplex, it cannot be guaranteed to find the optimal solution following the standard analysis of convex optimization [Boyd and Vandenberghe, 2004]. We therefore need a new optimization strategy for Eq. 1.

3.3 Optimizing using ADMM

ADMM (the Alternating Direction Method of Multipliers) [Boyd et al., 2011], which is suited to objectives that are the sum of a smooth function and a non-smooth function, is appropriate for solving Eq. 1. To use ADMM, we first rewrite our objective in the equivalent form
$$\min_{W \in \mathcal{C},\, Z} \ \frac{1}{2}\|R_\Omega(XW - \tilde{D})\|_F^2 + \lambda\|Z\|_{tr} \quad (9)$$
$$\text{s.t.} \quad XW - Z = 0$$
where $\mathcal{C} = \{W \mid XW \mathbf{1}_m = \mathbf{1}_n \text{ and } XW \geq \mathbf{0}_{n \times m}\}$. Eq. 9 can be solved by the following alternating updates in iteration $t$:
$$W_{t+1} = \arg\min_{W \in \mathcal{C}} \ \frac{1}{2}\|R_\Omega(XW - \tilde{D})\|_F^2 + \langle \Lambda_t, XW - Z_t\rangle + \frac{\rho_1}{2}\|XW - Z_t\|_F^2 \quad (10)$$
$$Z_{t+1} = \arg\min_{Z} \ \lambda\|Z\|_{tr} + \langle \Lambda_t, XW_{t+1} - Z\rangle + \frac{\rho_1}{2}\|XW_{t+1} - Z\|_F^2 \quad (11)$$
$$\Lambda_{t+1} = \Lambda_t + \rho_1(XW_{t+1} - Z_{t+1}) \quad (12)$$
Eq. 11 can be rewritten as
$$Z_{t+1} = \arg\min_{Z} \ \frac{1}{2}\Big\|Z - \Big(XW_{t+1} + \frac{\Lambda_t}{\rho_1}\Big)\Big\|_F^2 + \frac{\lambda}{\rho_1}\|Z\|_{tr}$$
which has a closed-form solution. The remaining problem is how to solve Eq. 10. Letting $M = XW$, we rewrite Eq. 10 as
$$\min_{M} \ \frac{1}{2}\|R_\Omega(M - \tilde{D})\|_F^2 + \langle \Lambda_t, M - Z_t\rangle + \frac{\rho_1}{2}\|M - Z_t\|_F^2$$
$$\text{s.t.} \quad M \mathbf{1}_m = \mathbf{1}_n, \quad M \geq \mathbf{0}_{n \times m}$$
This problem decomposes over the rows of $M$. Denoting the $i$th rows of $M$, $\Lambda$, $\tilde{D}$, $Z$ and the corresponding index set by $M_i$, $\Lambda_i$, $\tilde{D}_i$, $Z_i$ and $\Omega_i$, we have
$$\min_{M_i} \ \frac{1}{2}\|R_{\Omega_i}(M_i - \tilde{D}_i)\|^2 + \langle \Lambda_i^t, M_i - Z_i^t\rangle + \frac{\rho_1}{2}\|M_i - Z_i^t\|^2 \quad (13)$$
$$\text{s.t.} \quad M_i \mathbf{1}_m = 1, \quad M_i \geq \mathbf{0}_m$$
Although in [Duchi et al., 2008; Wang and Carreira-Perpiñán, 2013] the projection onto the probability simplex is solved with an $O(m \log m)$ algorithm that searches through $[m]$ from the largest entry to the smallest for a particular $j$ satisfying the KKT conditions, that algorithm is inefficient here. The reason is that the $m$ entries of $M_i$ are divided into $\Omega_i$ and its complement $\bar{\Omega}_i$, and the non-zero $M_{ij}$'s have different weights and different projection objectives due to the KKT conditions, so we would need to search over all possible positions in $\Omega_i$ and $\bar{\Omega}_i$ for the right pair, which is costly ($O(m^2)$ in the worst case) and cannot guarantee a unique optimal solution. We therefore solve Eq. 13 by casting it as a standard QP problem and using a state-of-the-art QP solver, such as an interior-point method.
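For illustration, one way to pose the per-row subproblem of Eq. 13 as a QP is sketched below with cvxpy; the paper only states that a standard QP solver (e.g., an interior-point method) is used, so the cvxpy formulation and the function and argument names are our illustrative assumptions.

```python
import numpy as np
import cvxpy as cp

def solve_row_subproblem(d_row, obs_mask, lam_row, z_row, rho1):
    """Solve Eq. 13 for one row M_i of M = XW.

    d_row:    observed label distribution for this row (zeros where unobserved)
    obs_mask: binary vector, 1 at the positions in Omega_i
    lam_row:  i-th row of the multiplier matrix Lambda_t
    z_row:    i-th row of Z_t
    rho1:     ADMM penalty parameter
    """
    m = d_row.shape[0]
    M_i = cp.Variable(m)
    objective = (
        0.5 * cp.sum_squares(cp.multiply(obs_mask, M_i - d_row))  # (1/2)||R_{Omega_i}(M_i - D_i)||^2
        + cp.sum(cp.multiply(lam_row, M_i - z_row))               # <Lambda_i, M_i - Z_i>
        + 0.5 * rho1 * cp.sum_squares(M_i - z_row)                # (rho1/2)||M_i - Z_i||^2
    )
    constraints = [cp.sum(M_i) == 1, M_i >= 0]                    # probability simplex
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return M_i.value
```

Stacking the returned rows gives $M^*$, after which $W_{t+1}$ is recovered as described next.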
After we obtain $M^*$, we project it back onto the space defined by $X$ through $(X^\top X)^{\dagger}[X^\top M^*]$. The proposed IncomLDL-admm algorithm is summarized in Alg. 2.

Algorithm 2 IncomLDL-admm
1: Initialization: $W_1$, $\Lambda_1$ and $Z_1$, $\lambda$ and $\rho_1$, $t = 1$
2: while stopping criterion is not satisfied do
3:     Calculate each row of $M^*$ by Eq. 13
4:     $W_{t+1} = (X^\top X)^{\dagger}[X^\top M^*]$
5:     Solve $Z_{t+1}$ by Eq. 11
6:     Update $\Lambda_{t+1}$ by Eq. 12
7:     $t = t + 1$
8: end while

According to [He and Yuan, 2012], IncomLDL-admm converges to the optimal solution at an $O(1/T)$ rate. Although this is slower than the $O(1/T^2)$ rate of the accelerated proximal gradient descent method, in practice a good approximate solution is sufficient to obtain satisfactory performance [Boyd et al., 2011].

4 Experiment

We evaluate the proposed algorithms for the IncomLDL problem on 15 real data sets. We implement our approaches in Matlab; all results were obtained on a Linux server with a 2.53GHz CPU and 48GB memory. The 15 data sets cover biochemistry, natural scene recognition, facial expression, and movie rating; details can be found in [Geng, 2016]. Their statistics are summarized in Table 1.

Table 1: Statistics of the 15 data sets, where n is the number of instances, d is the number of features, and m is the number of labels.

| Dataset | n | d | m |
| --- | --- | --- | --- |
| Yeast-alpha | 2465 | 24 | 18 |
| Yeast-cdc | 2465 | 24 | 15 |
| Yeast-elu | 2465 | 24 | 14 |
| Yeast-diau | 2465 | 24 | 7 |
| Yeast-heat | 2465 | 24 | 6 |
| Yeast-spo | 2465 | 24 | 6 |
| Yeast-cold | 2465 | 24 | 4 |
| Yeast-dtt | 2465 | 24 | 4 |
| Yeast-spo5 | 2465 | 24 | 3 |
| Yeast-spoem | 2465 | 24 | 2 |
| Human Genes | 20,542 | 36 | 68 |
| Natural Scene | 2,000 | 294 | 9 |
| SJAFFE | 213 | 243 | 6 |
| SBU 3DFE | 2,500 | 243 | 6 |
| Movie | 7,755 | 1,869 | 5 |

Settings and Baselines

To make these data sets incomplete, we use two kinds of settings. In the first, the (general) incomplete setting, all elements of the label distribution matrix are made missing uniformly at random. We vary the observed rate ω% from 10% to 40% and measure the difference between the ground-truth and the predicted label distribution matrix. We then test the proposed algorithms in the second, transductive setting, in which 10% of the data are test data with no supervised information, accompanied by incomplete training data whose observed rate ω% again varies from 10% to 40%. We use the incomplete training data together with the features of the test data to make predictions, and measure the difference between the ground truth and the predicted distribution matrix on the test data. Each experiment is repeated 10 times and we report results averaged over the 10 trials.
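As a concrete illustration of how such incomplete training data can be generated in the incomplete setting above, the uniformly random masking with observed rate ω% could look like the following numpy sketch (not the authors' code; the function and argument names are ours):

```python
import numpy as np

def make_incomplete(D, omega_pct, rng=None):
    """Keep roughly omega_pct% of the entries of the label distribution matrix D,
    chosen uniformly at random; unobserved entries are set to zero.

    Returns the observed matrix D_obs and the binary mask (the index set Omega).
    """
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(D.shape) < (omega_pct / 100.0)   # True at observed positions
    D_obs = np.where(mask, D, 0.0)
    return D_obs, mask.astype(float)
```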
In IncomLDL-prox, the regularization parameter is selected from $2^{\{-10, -9, \ldots, 9, 10\}}$ by cross-validation on the training data; the parameters $\gamma$ and $\epsilon$ are set to 2 and $10^{-5}$, respectively, and the maximum number of iterations is set to 100. In IncomLDL-admm, the regularization parameter $\lambda$ and the maximum number of iterations are selected in the same way as for IncomLDL-prox; $\rho_1$ is simply set to 1 and all variables are initialized to zero. The stopping criterion parameters $\epsilon_{abs}$ and $\epsilon_{rel}$ are set to $10^{-4}$ and $10^{-2}$, as suggested in the survey [Boyd et al., 2011].

We compare our proposed IncomLDL algorithms with several baselines, all adapted from existing label distribution algorithms to fit the incomplete situation. Note that although these algorithms can solve the incomplete label distribution problem, they consider each label separately and are therefore expected to perform worse than our proposals. The baselines include two maximum entropy algorithms, IIS-LDL [Geng et al., 2013] and BFGS-LDL [Geng, 2016], and two problem transformation algorithms, PT-Bayes and PT-SVM [Geng, 2016]. All the codes are shared by the original authors, and we use the default parameters suggested there, except that we tune the regularization parameter of PT-SVM by 10-fold cross-validation in the same way as for IncomLDL-prox.

Following [Geng, 2016], we use five measurements for the incomplete label distribution problem. Among them, Chebyshev, Clark and Canberra measure the distance between two vectors, so lower is better; Cosine and Intersection measure the similarity between two vectors, so higher is better. Details of these measurements can be found in [Geng, 2016]. There is one additional measurement proposed in [Geng, 2016] that measures the KL-divergence between two vectors; since the KL-divergence involves $\log(d_x^y / \hat{d}_x^y)$ and becomes meaningless when $\hat{d}_x^y$ is zero, we do not use it here.
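For reference, the two measures reported in the tables below can be computed as follows; this is a minimal sketch assuming the standard definitions (Chebyshev as the per-instance maximum absolute difference, Intersection as the per-instance sum of element-wise minima), averaged over instances. The function names are ours.

```python
import numpy as np

def chebyshev(D_true, D_pred):
    """Mean Chebyshev distance between true and predicted distributions (lower is better)."""
    return np.mean(np.max(np.abs(D_true - D_pred), axis=1))

def intersection(D_true, D_pred):
    """Mean Intersection similarity between true and predicted distributions (higher is better)."""
    return np.mean(np.sum(np.minimum(D_true, D_pred), axis=1))
```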
Results

Due to space limitations, we present only representative results here; the other results are similar and will appear in a longer version. We ran experiments for 4 different values of ω% in both the incomplete and the transductive setting, measured with the 5 measurements above. Here we present the Chebyshev results (lower is better) for the incomplete setting on all data with ω% = 10% in Table 2, and the Intersection results (higher is better) for the transductive setting on test data only with ω% = 30% in Table 3.

Table 2: Chebyshev results (lower is better) for the incomplete setting on all data when ω% = 10%. Each column corresponds to an algorithm, where IncomLDL-a and IncomLDL-p abbreviate IncomLDL-admm and IncomLDL-prox, respectively; each row corresponds to a data set. Values are measured by 10-fold cross-validation and shown as mean ± std. The best result on each row is bolded (differences are assessed with a pairwise single-tailed t-test at the 95% confidence level).

| Data Set | IncomLDL-a | IncomLDL-p | IIS-LDL | BFGS-LDL | PT-Bayes | PT-SVM |
| --- | --- | --- | --- | --- | --- | --- |
| Yeast-alpha | **.0135 ± .0000** | **.0135 ± .0000** | .0214 ± .0003 | .0361 ± .0000 | .4401 ± .0329 | .0178 ± .0012 |
| Yeast-cdc | **.0161 ± .0001** | .0162 ± .0000 | .0247 ± .0006 | .0427 ± .0000 | .4543 ± .0359 | .0221 ± .0021 |
| Yeast-cold | **.0513 ± .0001** | .0514 ± .0002 | .0636 ± .0013 | .0945 ± .0000 | .4624 ± .0298 | .0672 ± .0085 |
| Yeast-diau | **.0370 ± .0003** | .0371 ± .0001 | .0479 ± .0009 | .0751 ± .0000 | .4917 ± .0408 | .0510 ± .0064 |
| Yeast-dtt | **.0361 ± .0000** | .0362 ± .0001 | .0518 ± .0011 | .0851 ± .0000 | .4718 ± .0373 | .0501 ± .0096 |
| Yeast-elu | **.0162 ± .0000** | .0163 ± .0000 | .0255 ± .0005 | .0441 ± .0000 | .4431 ± .0254 | .0223 ± .0019 |
| Yeast-heat | **.0422 ± .0002** | .0425 ± .0002 | .0545 ± .0013 | .0802 ± .0000 | .4590 ± .0315 | .0530 ± .0049 |
| Yeast-spo | .0584 ± .0000 | **.0582 ± .0001** | .0671 ± .0015 | .0927 ± .0000 | .4847 ± .0271 | .0684 ± .0058 |
| Yeast-spo5 | **.0912 ± .0002** | .0913 ± .0001 | .0989 ± .0014 | .1327 ± .0000 | .4478 ± .0426 | .0997 ± .0063 |
| Yeast-spoem | .0875 ± .0003 | **.0874 ± .0004** | .0939 ± .0018 | .1190 ± .0000 | .3565 ± .0269 | .0980 ± .0126 |
| Human Gene | **.0533 ± .0000** | **.0533 ± .0000** | .0535 ± .0000 | .0543 ± .0000 | .5453 ± .1388 | .0537 ± .0001 |
| Natural Scene | **.3360 ± .0031** | .3380 ± .0025 | .3576 ± .0024 | .7179 ± .0000 | .3690 ± .0000 | .4168 ± .0213 |
| SJAFFE | **.1078 ± .0021** | .1083 ± .0024 | .1279 ± .0089 | .7771 ± .0000 | .1204 ± .0000 | .1417 ± .0185 |
| SBU 3DFE | **.1170 ± .0000** | .1185 ± .0009 | .1351 ± .0012 | .2301 ± .0000 | .1389 ± .0000 | .1414 ± .0028 |
| Movie | .1257 ± .0000 | **.1237 ± .0011** | .3697 ± .0038 | .4952 ± .0000 | .1807 ± .0000 | .2510 ± .0278 |

Table 3: Intersection results (higher is better) for the transductive setting on test data when ω% = 30%. Each column corresponds to an algorithm, where IncomLDL-a and IncomLDL-p abbreviate IncomLDL-admm and IncomLDL-prox, respectively; each row corresponds to a data set. Values are measured by 10-fold cross-validation and shown as mean ± std. The best result on each row is bolded (differences are assessed with a pairwise single-tailed t-test at the 95% confidence level).

| Data Set | IncomLDL-a | IncomLDL-p | IIS-LDL | BFGS-LDL | PT-Bayes | PT-SVM |
| --- | --- | --- | --- | --- | --- | --- |
| Yeast-alpha | .9621 ± .0001 | **.9625 ± .0005** | .9417 ± .0016 | .8845 ± .0038 | .5749 ± .0180 | .9548 ± .0027 |
| Yeast-cdc | .9567 ± .0013 | **.9575 ± .0011** | .9386 ± .0011 | .8862 ± .0030 | .5888 ± .0136 | .9512 ± .0027 |
| Yeast-cold | **.9406 ± .0015** | .9398 ± .0011 | .9264 ± .0019 | .8902 ± .0035 | .6292 ± .0199 | .9279 ± .0050 |
| Yeast-diau | .9391 ± .0008 | **.9394 ± .0013** | .9249 ± .0018 | .8825 ± .0019 | .5974 ± .0183 | .9249 ± .0053 |
| Yeast-dtt | **.9605 ± .0014** | .9582 ± .0015 | .9406 ± .0025 | .8971 ± .0041 | .6338 ± .0186 | .9514 ± .0044 |
| Yeast-elu | .9573 ± .0008 | **.9584 ± .0010** | .9383 ± .0014 | .8831 ± .0034 | .5724 ± .0292 | .9508 ± .0015 |
| Yeast-heat | .9393 ± .0014 | **.9402 ± .0015** | .9233 ± .0021 | .8858 ± .0031 | .6166 ± .0187 | .9324 ± .0046 |
| Yeast-spo | **.9175 ± .0028** | .9161 ± .0031 | .9048 ± .0032 | .8680 ± .0037 | .6114 ± .0233 | .9022 ± .0064 |
| Yeast-spo5 | **.9154 ± .0031** | .9086 ± .0043 | .9026 ± .0049 | .8727 ± .0052 | .6644 ± .0224 | .9033 ± .0055 |
| Yeast-spoem | .9112 ± .0036 | **.9133 ± .0035** | .9063 ± .0028 | .8893 ± .0058 | .7430 ± .0218 | .9054 ± .0080 |
| Human Gene | **.7868 ± .0007** | **.7868 ± .0042** | .7840 ± .0042 | .7761 ± .0040 | .2707 ± .0769 | .7821 ± .0042 |
| Natural Scene | .4753 ± .0092 | **.5073 ± .0129** | .4635 ± .0127 | .2529 ± .0170 | .2927 ± .0172 | .3686 ± .0408 |
| SJAFFE | **.8555 ± .0070** | .8481 ± .0100 | .8440 ± .0115 | .2014 ± .0156 | .8461 ± .0098 | .8346 ± .0156 |
| SBU 3DFE | **.8588 ± .0037** | .8574 ± .0038 | .8385 ± .0028 | .7044 ± .0081 | .8382 ± .0037 | .8351 ± .0044 |
| Movie | **.8256 ± .0021** | .8211 ± .0033 | .6841 ± .0055 | .4500 ± .0098 | .7391 ± .0027 | .6719 ± .0475 |

In both scenarios, our two proposed algorithms are superior to the baselines. The results are as expected, since both methods exploit label correlation when incomplete annotation leaves insufficient training data. Interestingly, although we cannot guarantee that IncomLDL-prox optimizes our objective Eq. 1, its performance is comparable to that of IncomLDL-admm, even though IncomLDL-admm obtains the best results most of the time; we plan to study this phenomenon in future work. Among the baseline methods, PT-SVM and IIS-LDL perform best.
BFGS-LDL, although reported to be better than IIS-LDL when the data are complete [Geng, 2016], is not suitable for the incomplete case, especially on the Natural Scene and SJAFFE data sets. Comparing Table 2 and Table 3, we can see that in the incomplete setting all algorithms except PT-Bayes are more stable; it is thus much more difficult to predict the totally unlabeled test data in the transductive setting.

5 Conclusion

In this paper, we solve the problem of Incomplete Label Distribution Learning (IncomLDL), where the supervised information is incomplete. To cope with the data deficiency caused by incomplete data, we use the trace norm minimization technique, which allows us to exploit label correlation and reduce the effect of lacking training data. Two algorithms are then proposed, based on proximal gradient descent and the alternating direction method of multipliers. Experiments on 15 data sets show the merits of the proposed algorithms compared to the baselines.

References

[Bi and Kwok, 2014] Wei Bi and James T. Kwok. Multilabel classification with label correlations and missing labels. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI'14), Canada, pages 1680–1686, 2014.

[Boyd and Vandenberghe, 2004] Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, 2004.

[Boyd et al., 2011] Stephen P. Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[Cai et al., 2010] Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.

[Duchi et al., 2008] John C. Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning (ICML'08), Finland, pages 272–279, 2008.

[Geng and Hou, 2015] Xin Geng and Peng Hou. Pre-release prediction of crowd opinion on movies by label distribution learning. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI'15), Argentina, pages 3511–3517, 2015.

[Geng and Ling, 2017] Xin Geng and Miaogen Ling. Soft video parsing by label distribution learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI'17), CA, 2017.

[Geng et al., 2010] Xin Geng, Kate Smith-Miles, and Zhi-Hua Zhou. Facial age estimation by learning from label distributions. In Proceedings of the 24th AAAI Conference on Artificial Intelligence (AAAI'10), GA, 2010.

[Geng et al., 2013] Xin Geng, Chao Yin, and Zhi-Hua Zhou. Facial age estimation by learning from label distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(10):2401–2412, 2013.

[Geng, 2016] Xin Geng. Label distribution learning. IEEE Transactions on Knowledge and Data Engineering, 28(7):1734–1748, 2016.

[Goldberg et al., 2010] Andrew B. Goldberg, Xiaojin Zhu, Ben Recht, Jun-Ming Xu, and Robert D. Nowak. Transduction with matrix completion: Three birds with one stone. In Proceedings of the 24th Annual Conference on Neural Information Processing Systems (NIPS'10), Canada, pages 757–765, 2010.

[He and Yuan, 2012] Bingsheng He and Xiaoming Yuan. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.
[Nocedal and Wright, 2006] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, 2nd edition, 2006.

[Pietra et al., 1997] Stephen D. Pietra, Vincent J. D. Pietra, and John D. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393, 1997.

[Sun et al., 2010] Yu-Yin Sun, Yin Zhang, and Zhi-Hua Zhou. Multi-label learning with weak label. In Proceedings of the 24th AAAI Conference on Artificial Intelligence (AAAI'10), GA, 2010.

[Tseng, 2008] Paul Tseng. On accelerated proximal gradient methods for convex-concave optimization. Technical report, University of Washington, Seattle, 2008.

[Wang and Carreira-Perpiñán, 2013] Weiran Wang and Miguel Á. Carreira-Perpiñán. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. CoRR, abs/1309.1541, 2013.

[Wu et al., 2016] Baoyuan Wu, Siwei Lyu, and Bernard Ghanem. Constrained submodular minimization for missing labels and class imbalance in multi-label learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI'16), AZ, pages 2229–2236, 2016.

[Xing et al., 2016] Chao Xing, Xin Geng, and Hui Xue. Logistic boosting regression for label distribution learning. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'16), NV, pages 4489–4497, 2016.

[Xu et al., 2013] Miao Xu, Rong Jin, and Zhi-Hua Zhou. Speedup matrix completion with side information: Application to multi-label learning. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NIPS'13), NV, pages 2301–2309, 2013.

[Yang et al., 2013] Shu-Jun Yang, Yuan Jiang, and Zhi-Hua Zhou. Multi-instance multi-label learning with weak label. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI'13), China, pages 1862–1868, 2013.

[Yu et al., 2014] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit S. Dhillon. Large-scale multi-label learning with missing labels. In Proceedings of the 31st International Conference on Machine Learning (ICML'14), China, pages 593–601, 2014.

[Zhao and Guo, 2015] Feipeng Zhao and Yuhong Guo. Semi-supervised multi-label learning with incomplete labels. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI'15), Argentina, pages 4062–4068, 2015.

[Zhou and Zhang, 2017] Zhi-Hua Zhou and Min-Ling Zhang. Multi-label learning. In Claude Sammut and Geoffrey I. Webb, editors, Encyclopedia of Machine Learning and Data Mining, pages 875–881. Springer, 2017.

[Zhou et al., 2015] Ying Zhou, Hui Xue, and Xin Geng. Emotion distribution recognition from facial expressions. In Proceedings of the 23rd Annual ACM Conference on Multimedia (MM'15), Australia, pages 1247–1250, 2015.