# Learning From Multi-Dimensional Partial Labels

Haobo Wang1,2, Weiwei Liu3, Yang Zhao2, Tianlei Hu1,2, Ke Chen1,2 and Gang Chen1,2
1Key Lab of Intelligent Computing Based Big Data of Zhejiang Province, Zhejiang University
2College of Computer Science and Technology, Zhejiang University
3School of Computer Science, Wuhan University
{wanghaobo, awalk, chenk, htl, cg}@zju.edu.cn, liuweiwei863@gmail.com

Abstract

Multi-dimensional classification (MDC) has attracted much attention from the community. Though most studies consider fully annotated data, in real practice obtaining fully labeled data in MDC tasks is usually intractable. In this paper, we propose a novel learning paradigm: Multi-Dimensional Partial Label Learning (MDPL), where the ground-truth labels of each instance are concealed in multiple candidate label sets. We first introduce the partial hamming loss for MDPL, which incurs a large loss if the predicted labels are not in the candidate label sets, and provide an empirical risk minimization (ERM) framework. Theoretically, we rigorously prove the conditions for ERM learnability of MDPL in both the independent and dependent cases. Furthermore, we present two MDPL algorithms under our proposed ERM framework. Comprehensive experiments on both synthetic and real-world datasets validate the effectiveness of our proposals.

1 Introduction

Multi-dimensional classification (MDC) aims to assign each instance to multiple classes, and has been applied in a variety of real-world applications, including but not limited to text categorization [Ortigosa-Hernández et al., 2012], gene function prediction [Barutçuoglu et al., 2006] and image annotation [Read et al., 2014; Batal et al., 2013; Arias et al., 2016]. In order to train an effective MDC model, it is typically desirable to obtain a large amount of precisely annotated data. Unfortunately, obtaining fully labeled data in MDC tasks is usually intractable.
As a result, it is non-trivial to learn multi-dimensional classifiers from partially labeled data.

Weakly-supervised learning has been explored to deal with partially labeled data in various settings. For example, semi-supervised learning (SSL) [Chapelle et al., 2002] learns from both labeled and unlabeled data. In positive-unlabeled learning (PUL) [Denis, 1998; Kiryo et al., 2017], only positive labeled data and unlabeled data are available. In partial label learning (PLL) [Cour et al., 2011; Liu and Dietterich, 2012; Wu and Zhang, 2018], the ground-truth label is concealed in a set of candidate labels. Recently, some works have also addressed the weakly-supervised learning problem in the multiple-label setting, such as semi-supervised multi-label learning [Zhan and Zhang, 2017], partial multi-label learning (PML) [Fang and Zhang, 2019; Wang et al., 2019] and semi-supervised multi-dimensional classification [Ortigosa-Hernández et al., 2012].

In this work, we consider a new weakly-supervised MDC scenario where the ground-truth labels of each instance are concealed in multiple candidate label sets, i.e., Multi-Dimensional Partial Label Learning (MDPL). Take the image [Khosla et al., 2011] in Table 1 as an example: it is associated with four class variables {Place, Tree, Dog Breeds, Weather}. It is hard for annotators to identify all the correct labels, but they can provide some candidate labels with much less effort. Label disambiguation and label correlation extraction pose serious challenges in MDPL. Noisy label information decreases the generalization performance of MDPL, but the label correlations provide additional semantic information that helps disambiguate the noisy labels. For example, since there are some trees in the image, it is more likely to depict a Mountain than a Glacier. Our main contributions in this paper are to formulate the MDPL problem and provide an empirical risk minimization (ERM) framework.
In particular, we propose a partial hamming loss that incurs a large loss if the predicted labels are not included in the candidate label sets. Theoretically, we rigorously present the conditions for ERM learnability of MDPL in both the independent and dependent cases. Moreover, we instantiate two MDPL algorithms under our empirical risk minimization framework. Extensive experiments on both synthetic and real-world datasets demonstrate that our proposed methods can effectively handle MDPL tasks.

2 Related Work

2.1 Multi-Dimensional Classification

In multi-dimensional classification (MDC), each object is associated with multiple class variables. MDC is a generalization of multi-label learning (MLL) [Liu and Tsang, 2017; Shen et al., 2018; Liu et al., 2019] that allows each class variable to take more than two values. Compared to MLL problems, the label correlations in MDC are more sophisticated, because the intra-class labels are mutually exclusive, while inter-class labels may still correlate with each other.

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)

| Class Type | Multi-Dimensional | Multi-Dimensional Partial Labels |
| --- | --- | --- |
| Place | Mountain | Mountain, Glacier |
| Tree | Yes | Yes |
| Dog Breeds | Malamute | Siberian Husky, Malamute |
| Weather | Sunny | Sunny, Snowy, Cloudy |

Table 1: An example of an MDPL task for image annotation. In MDC, all the ground-truth labels are provided. In MDPL, only some candidate labels are given, which takes much less time than precise annotation.

One popular strategy for MDC is binary relevance (BR) [Read et al., 2014], which decomposes the original problem into several multi-class classification problems. Despite its computational efficiency, BR neglects the label dependencies and hence its predictive performance is limited.
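The BR decomposition just described can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the `majority_learner` below is a hypothetical stand-in for any real per-dimension multi-class learner.

```python
from collections import Counter

def br_train(examples, train_one):
    """Binary-relevance style decomposition for MDC: fit one independent
    multi-class model per class variable (dimension)."""
    d = len(examples[0][1])  # number of class variables
    return [train_one([(x, y[i]) for x, y in examples]) for i in range(d)]

def br_predict(models, x):
    # Each model predicts only its own dimension, so label
    # dependencies between dimensions are ignored.
    return tuple(m(x) for m in models)

def majority_learner(data):
    # Toy per-dimension learner: always predicts the majority label.
    label = Counter(y for _, y in data).most_common(1)[0][0]
    return lambda x: label

examples = [((0.1,), ("Mountain", "Sunny")),
            ((0.2,), ("Mountain", "Cloudy")),
            ((0.9,), ("Glacier", "Cloudy"))]
models = br_train(examples, majority_learner)
print(br_predict(models, (0.5,)))  # -> ('Mountain', 'Cloudy')
```

Because each dimension is fitted in isolation, any correlation between `Place` and `Weather`, say, is invisible to the learner, which is exactly the shortcoming discussed next.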
To cope with this shortcoming, many works have been proposed, including probabilistic graphical model based algorithms [Batal et al., 2013; Benjumeda et al., 2018], classifier chains [Zaragoza et al., 2011], instance-based approaches [Jia and Zhang, 2019], and so on. Nevertheless, all of them require the training data to be precisely labeled, which is demanding and time-consuming. Consequently, some weakly-supervised multiple-label problems have been studied, such as semi-supervised multi-label learning [Zhao and Guo, 2015; Zhan and Zhang, 2017], partial multi-label learning (PML) [Fang and Zhang, 2019; Wang et al., 2019] and semi-supervised multi-dimensional classification [Ortigosa-Hernández et al., 2012]. However, most of these learning paradigms explore only the multi-label setting, where the labels are restricted to be binary, and it is non-trivial to study the generalized weakly-supervised MDC problem.

2.2 Partial Label Learning

The partial label learning (PLL) setting lies between the fully supervised and unsupervised settings, but is distinct from SSL [Chapelle et al., 2002; Zhan and Zhang, 2017] and PUL [Denis, 1998; Kiryo et al., 2017]. In PLL, each instance is equipped with a set of candidate labels. The ground-truth label is guaranteed to be included, and the remaining labels are termed distractor labels or false positive labels. The biggest challenge in PLL is to disambiguate the ground-truth label from the distractor labels, and many papers [Cour et al., 2011; Liu and Dietterich, 2012; Wu and Zhang, 2018; Feng and An, 2019; Lv et al., 2020] address this problem. There are also some works [Fang and Zhang, 2019; Wang et al., 2019] studying partial multi-label learning, which extends the PLL problem to the multiple-label learning field. Nonetheless, PML restricts the labels to be binary and is thus impractical in many real-world tasks [Read et al., 2014].
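To make the PLL setting concrete, the partial 0/1 loss of [Cour et al., 2011] charges nothing whenever the prediction lands anywhere inside the candidate set; this is also why it underestimates the true 0/1 loss, since a classifier that picks a distractor still scores zero. A minimal sketch with illustrative labels:

```python
def partial_zero_one_loss(pred, candidates):
    # Partial 0/1 loss: no loss is charged as long as the prediction
    # falls inside the candidate set, since the ground truth is hidden
    # somewhere among the candidates.
    return 0 if pred in candidates else 1

# One partially labeled instance; the true label "Malamute" is hidden
# in the candidate set together with the distractor "Siberian Husky".
candidates = {"Malamute", "Siberian Husky"}
print(partial_zero_one_loss("Malamute", candidates))        # 0 (true label)
print(partial_zero_one_loss("Siberian Husky", candidates))  # 0 (distractor also scores 0)
print(partial_zero_one_loss("Samoyed", candidates))         # 1 (outside the set)
```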
To bridge this gap, we propose a novel learning paradigm: multi-dimensional partial label learning, where the ground-truth labels of each instance are concealed in multiple candidate label sets.

3 Learning Framework

We first formulate the problem of MDPL and introduce an empirical risk minimization framework. Consider a standard MDC problem with an input space $\mathcal{X} \subseteq \mathbb{R}^m$ and an output space $\mathcal{Y} = \mathcal{C}_1 \times \mathcal{C}_2 \times \cdots \times \mathcal{C}_d$. Here $\mathcal{C}_i = \{l_{i1}, l_{i2}, \dots, l_{ik_i}\}$ is the $i$-th class space and $\mathcal{Y}$ is their Cartesian product. The ultimate goal of MDC is to induce a mapping from $\mathcal{X}$ to $\mathcal{Y}$ that captures the dependence of the outputs on the inputs. To this end, based on a training dataset $Q = \{(x_i, Y_i) \mid x_i \in \mathcal{X}, Y_i \in \mathcal{Y}, 1 \le i \le n\}$, a learner chooses an optimal hypothesis $h$ from a given hypothesis space $\mathcal{H}$ to minimize the prediction loss. Common choices of prediction loss (or risk) for MDC include the hamming loss and the global loss [Read et al., 2014].

In MDPL tasks, we are interested in the case where the correct labels are adulterated by false positive labels. More specifically, the ground-truth labels are invisible and only a collection of candidate label sets $S = \{s_1, s_2, \dots, s_d\} \in \mathcal{S}$ is given, where $\mathcal{S} = 2^{\mathcal{C}_1} \times 2^{\mathcal{C}_2} \times \cdots \times 2^{\mathcal{C}_d}$ is the candidate class space and $s_i \subseteq \mathcal{C}_i$ is the $i$-th candidate label set for the corresponding class space. We denote a complete example by $(x, Y, S)$, where only the instance vector $x$ and the candidate label collection $S$ are accessible. The goal of MDPL is to learn a multi-dimensional classifier $h: \mathcal{X} \mapsto \mathcal{Y}$ from multi-dimensional partially labeled data by minimizing the expected hamming loss: $L^H_D(h) = \mathbb{E}_{(x,Y,S)\sim D}\big[\frac{1}{d}\sum_{i=1}^{d}\mathbb{I}(h_i(x) \neq y_i)\big]$, where $D$ is the underlying data distribution, $h_i(x)$ is the $i$-th predicted label and $y_i$ denotes the $i$-th ground-truth label. Since the correct labels are invisible in the training dataset, we cannot minimize the standard hamming loss directly.
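For concreteness, the empirical version of the hamming loss just defined, together with the candidate-set variant that the paper develops next, can be sketched as follows. The tiny dataset is invented purely for illustration.

```python
def hamming_loss(preds, truths):
    # Fraction of class variables predicted incorrectly,
    # averaged over examples (empirical version of L^H).
    n, d = len(preds), len(preds[0])
    return sum(p_i != y_i for p, y in zip(preds, truths)
               for p_i, y_i in zip(p, y)) / (n * d)

def partial_hamming_loss(preds, cand_sets):
    # Candidate-set variant: a dimension is charged only when the
    # predicted label falls OUTSIDE its candidate label set.
    n, d = len(preds), len(preds[0])
    return sum(p_i not in s_i for p, S in zip(preds, cand_sets)
               for p_i, s_i in zip(p, S)) / (n * d)

truths = [("Mountain", "Malamute"), ("Glacier", "Husky")]
cands  = [({"Mountain", "Glacier"}, {"Malamute"}),
          ({"Glacier"}, {"Husky", "Samoyed"})]
preds  = [("Glacier", "Malamute"), ("Glacier", "Samoyed")]

print(hamming_loss(preds, truths))         # 0.5: 2 of 4 labels are wrong
print(partial_hamming_loss(preds, cands))  # 0.0: every prediction is a candidate
```

The example makes the underestimation visible: two predictions are actually wrong, yet the partial loss is zero because they hide inside the candidate sets.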
Inspired by the partial 0/1 loss [Cour et al., 2011], we introduce a multi-dimensional version named the expected partial hamming loss,

$$L^P_D(h) = \mathbb{E}_{(x,Y,S)\sim D}\Big[\frac{1}{d}\sum_{i=1}^{d}\mathbb{I}(h_i(x) \notin s_i)\Big] \quad (1)$$

An obvious observation is that the expected partial hamming loss is an underestimate of the true expected hamming loss. Thus, it is not a surrogate, and we have to explore conditions under which minimizing the partial loss also bounds the true loss. Moreover, a large loss is incurred if the predicted labels are not included in the candidate label sets. This motivates us to employ the VC-dimension of an induced binary classification task (predicting whether the labels fall inside or outside the candidate sets) as a bridge to complete our theoretical proof. In summary, based on our proposed expected partial hamming loss, we propose an empirical risk minimization framework for MDPL, in which each hypothesis $h$ is evaluated by the average partial hamming loss,

$$L^P_Z(h) = \frac{1}{nd}\sum_{i=1}^{n}\sum_{j=1}^{d}\mathbb{I}(h_j(x_i) \notin s^i_j) \quad (2)$$

where $Z = \{(x_i, S_i) \mid x_i \in \mathcal{X}, S_i \in \mathcal{S}, 1 \le i \le n\}$ is the partially labeled training dataset and $s^i_j$ is the $j$-th candidate label set of the $i$-th training example.

4 Learnability of MDPL

In this section, we discuss how to bound the true loss using the expected partial hamming loss. In this paper, we only investigate the realizable case, where an optimal hypothesis $h^*$ achieves risk $L^H_D(h^*) = 0$.

4.1 Independent Case

We first consider the independent case, i.e., the labels are independent of each other. Then we can simply decompose the MDPL problem into a set of partial label learning problems. Many works have studied PLL problems based on minimizing an upper bound of the risk $L_{D_p}$, usually the expected 0/1 loss [Cour et al., 2011]: $L_{D_p}(h_p) = \mathbb{E}_{(x,y,s)\sim D_p}[\mathbb{I}(h_p(x) \neq y)]$, where $D_p$ is the underlying distribution of a PLL task. Based on this risk, we obtain the following lemma.

Lemma 1. Assume that the labels in an MDPL problem are independent of each other.
Then if a PLL problem adopts the expected 0/1 loss as its risk and is PAC-learnable with sample complexity $n_0(\mathcal{H}_p, \delta, \epsilon)$, the MDPL problem is also PAC-learnable, with sample complexity

$$n_1(\mathcal{H}, \delta, \epsilon) = \max_i\, n_0(\mathcal{H}^i_p, \delta, \epsilon) \quad (3)$$

where $\mathcal{H}^i_p$ is the $i$-th PLL hypothesis space.

Proof. If a PLL problem is PAC-learnable [Shalev-Shwartz and Ben-David, 2014], then for every $\epsilon, \delta \in (0, 1)$, when the training set has size $n \ge n_0(\mathcal{H}_p, \delta, \epsilon)$, there exists an ERM learner $\mathcal{A}_p$ that returns a hypothesis $h_p \in \mathcal{H}_p$ with expected 0/1 loss $L_{D_p}(h_p) \le \epsilon$. Since the labels in this MDPL problem are independent of each other, the MDPL task can be decomposed into $d$ PLL problems. By running an ERM learner $\mathcal{A}^i_p$ on each single PLL problem and aggregating the hypotheses $h_i = \mathcal{A}^i_p(Z^i_p)$, we obtain an MDPL classifier $h = [h_i]^d$, where $Z^i_p$ is the $i$-th PLL training dataset. When the training set has size $n \ge n_1(\mathcal{H}, \delta, \epsilon) \ge n_0(\mathcal{H}^i_p, \delta, \epsilon)$, for every $\epsilon, \delta \in (0, 1)$, the following inequality holds with probability no less than $1 - \delta$,

$$L^H_D(h) = \frac{1}{d}\sum_{i=1}^{d}\mathbb{E}_{(x,y_i,s_i)\sim D_p}[\mathbb{I}(h_i(x) \neq y_i)] = \frac{1}{d}\sum_{i=1}^{d}L_{D_p}(h_i) \le \frac{1}{d}\cdot d\epsilon = \epsilon \quad (4)$$

We conclude that the MDPL problem is PAC-learnable with sample complexity $n_1(\mathcal{H}, \delta, \epsilon)$.

The learnability of partial label learning has been studied in many works [Cour et al., 2011; Ishida et al., 2017]. For instance, the small ambiguity degree condition, proposed by [Cour et al., 2011], is one of the most popular assumptions in PLL problems.

4.2 Dependent Case

Over the past decades, a variety of works [Zaragoza et al., 2011; Read et al., 2014; Shen et al., 2018] have shown that neglecting label correlations can degrade performance in multiple-label problems. Thus, it is crucial to study under what conditions MDPL tasks can be learned in the dependent case. Here we propose a sufficient condition for the PAC-learnability of MDPL tasks.

Theorem 1. In an MDPL problem, if there exists a positive constant $\gamma > 0$ such that,

$$\forall h \in \mathcal{H}: L^H_D(h) > 0, \quad \frac{L^P_D(h)}{L^H_D(h)} \ge \gamma \quad (5)$$

then in the realizable case, the MDPL problem is PAC-learnable.
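As a toy illustration of condition (5), both losses can be computed exactly on a small hand-built distribution; all numbers here are invented for illustration and do not come from the paper.

```python
# Each entry of the toy distribution: (probability, ground-truth labels,
# candidate label sets), with d = 2 class variables.
dist = [
    (0.5, ("A1", "B1"), ({"A1", "A2"}, {"B1"})),
    (0.5, ("A2", "B2"), ({"A2"}, {"B1", "B2"})),
]
pred = ("A1", "B1")  # a fixed hypothesis that always outputs ("A1", "B1")

LH = LP = 0.0
for p, y, S in dist:
    d = len(y)
    LH += p * sum(pi != yi for pi, yi in zip(pred, y)) / d     # hamming loss
    LP += p * sum(pi not in si for pi, si in zip(pred, S)) / d  # partial loss

print(LH, LP)  # 0.5 0.25 -> ratio L^P / L^H = 0.5
```

For this hypothesis the partial loss is half the true hamming loss, so condition (5) holds with γ = 0.5: driving the partial loss to zero forces the hamming loss to zero as well.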
We first introduce an MDC algorithm, Class Powerset (CP) [Read et al., 2014], into the MDPL scenario. The basic idea is to transform the MDC problem into a multi-class classification problem by regarding each label combination as a new class. It then learns a multi-class classifier $f: \mathcal{X} \mapsto \tilde{\mathcal{Y}}$, where $\tilde{\mathcal{Y}}$ is the new label space of size $\prod_{i=1}^{d} k_i$. Since all the label combinations are considered, we can fully explore the label correlations across the class spaces. Specifically, we call an ERM learner $\mathcal{A}_{cp}$ that returns a hypothesis minimizing the multi-class risk, i.e., the expected 0/1 loss. Nonetheless, in the MDPL setting, the learner does not have direct access to the precise data. To deal with this issue, we involve a surrogate loss called the global loss $L^G$ with its corresponding partial global loss $L^{GP}$, defined as,

$$L^G_D(h) = \mathbb{E}_{(x,Y,S)\sim D}[\mathbb{I}(\exists i,\ h_i(x) \neq y_i)], \quad L^{GP}_D(h) = \mathbb{E}_{(x,Y,S)\sim D}[\mathbb{I}(\exists i,\ h_i(x) \notin s_i)] \quad (6)$$

We can immediately obtain their relations,

$$\frac{1}{d}L^G_D(h) \le L^H_D(h) \le L^G_D(h), \quad \frac{1}{d}L^{GP}_D(h) \le L^P_D(h) \le L^{GP}_D(h) \quad (7)$$

The last step is to design a CP algorithm $\mathcal{A}_{cp}$ that minimizes the empirical partial global loss,

$$\mathcal{A}_{cp}(Z) = \mathop{\mathrm{argmin}}_{h \in \mathcal{H}_{mc}} L^{GP}_Z(h) = \mathop{\mathrm{argmin}}_{h \in \mathcal{H}_{mc}} \frac{1}{n}\sum_{i=1}^{n}\mathbb{I}(\exists j,\ h_j(x_i) \notin s^i_j) \quad (8)$$

In the traditional MDC setting, this is a typical multi-class learning problem. Let $\mathcal{H}^{cp}_b$ be a binary hypothesis class with VC-dimension $\tau = \mathrm{VCdim}(\mathcal{H}^{cp}_b)$; e.g., $\mathcal{H}^{cp}_b$ is linear with $\tau = m$. Suppose that the multi-class hypothesis space $\mathcal{H}_{mc}$ is constructed on top of $\mathcal{H}^{cp}_b$ using the one-versus-all strategy. According to Lemma 29.5 in [Shalev-Shwartz and Ben-David, 2014], the Natarajan dimension [Natarajan, 1989] of $\mathcal{H}_{mc}$ enjoys the upper bound

$$\mathrm{Ndim}(\mathcal{H}_{mc}) \le 3\tau K \log(\tau K) \quad (9)$$

where $\mathrm{Ndim}(\cdot)$ denotes the Natarajan dimension of a hypothesis space and $K = \prod_{i=1}^{d} k_i$ is the number of label combinations. Nevertheless, in the MDPL setting, our learning problem is no longer a multi-class task and the Natarajan dimension cannot directly yield the sample complexity.
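The class powerset reduction and the inner indicator of Eq. (8) can be sketched as follows; the label spaces are toy examples and `class_powerset` is an illustrative helper, not the paper's implementation.

```python
from itertools import product

def class_powerset(class_spaces):
    # Enumerate every label combination as one "super-class" (CP reduction).
    return list(product(*class_spaces))

def partial_global_loss(pred, cand_sets):
    # Inner indicator of Eq. (8): 1 iff ANY predicted label falls
    # outside its candidate label set.
    return int(any(p not in s for p, s in zip(pred, cand_sets)))

spaces = [["Mountain", "Glacier"], ["Sunny", "Cloudy", "Snowy"]]
combos = class_powerset(spaces)
print(len(combos))  # 6 = K = prod(k_i), the size of the new label space

# For one partial example, an ERM-style CP learner only keeps label
# combinations that incur zero empirical partial global loss:
S = ({"Mountain"}, {"Sunny", "Snowy"})
consistent = [c for c in combos if partial_global_loss(c, S) == 0]
print(consistent)  # [('Mountain', 'Sunny'), ('Mountain', 'Snowy')]
```

Note how the super-class view keeps whole label combinations together, which is what lets CP exploit correlations that BR throws away, at the price of a label space that grows as the product of the $k_i$.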
Our strategy is to construct a binary classification task from the problem above. Given a partial example $(x, S)$, the binary classifier should predict whether any predicted label $y_i$ falls outside its corresponding candidate label set, i.e., it returns $\mathbb{I}(\exists i,\ y_i \notin s_i)$. We observe that the binary classification loss is exactly the partial global loss. Therefore, we can design an ERM learner $\mathcal{A}_b$ that calls $\mathcal{A}_{cp}$ and then transforms the prediction to the binary output space. Compared with the class powerset method, it is impractical but provides good theoretical results. Now our task is to explore the VC-dimension of this binary classifier. Denote the hypothesis space of the binary classification task by $\mathcal{H}_b$. We have the following lemmas.

Lemma 2. Let $K = \prod_{i=1}^{d} k_i$. The VC-dimension of $\mathcal{H}_b$ can be bounded by,

$$\mathrm{VCdim}(\mathcal{H}_b) \le \frac{3\tau K \log(\tau K)}{\log 2 - e^{-1}}\big(\log(3\tau K \log(\tau K)) + 2\log K\big) \quad (10)$$

Proof. Let $\nu = \mathrm{VCdim}(\mathcal{H}_b)$ and $\mu = \mathrm{Ndim}(\mathcal{H}_{mc})$. Then, the maximum size of a set that $\mathcal{H}_b$ can shatter is $\nu$. In other words, there are $2^\nu$ different dichotomies (i.e., labelings) induced by $\mathcal{H}_b$ over these $\nu$ instances. Based on Lemma 29.4 in [Shalev-Shwartz and Ben-David, 2014], we can conclude that,

$$2^\nu \le \nu^\mu K^{2\mu} \quad (11)$$

Taking the natural logarithm of both sides yields,

$$\nu \log 2 \le \mu \log \nu + 2\mu \log K \quad (12)$$

To bound $\nu$, we involve the function $g(x) = e\log x - x$. Its maximum value $g(x) = 0$ is attained when $g'(x) = 0$, i.e., at $x = e$. Hence, $g(x) \le 0$ holds for all $x > 0$. Choosing $x = \frac{\nu}{\mu}$ gives $\log \nu \le \frac{\nu}{e\mu} + \log \mu$. Hence,

$$\nu \log 2 \le \mu\Big(\frac{\nu}{e\mu} + \log \mu\Big) + 2\mu \log K = \frac{\nu}{e} + \mu \log \mu + 2\mu \log K \quad (13)$$

Combining Eq. (9) and Eq. (13), we obtain the desired result.

Lemma 3. For every $\delta, \epsilon \in (0, 1)$, every distribution $D$ over $\mathcal{X}$, and the binary classification task defined above, if the realizable assumption holds, then when running algorithm $\mathcal{A}_b$ on a training set of size $n$ satisfying

$$n \ge n_2(\mathcal{H}_b, \delta, \epsilon) = \frac{432\nu}{\epsilon^2}\big(8\nu\log(e/\nu) + 2\log(2/\delta)\big) \quad (14)$$

the algorithm returns a hypothesis $h$ such that, with probability at least $1 - \delta$, $L^{GP}_D(h) \le \epsilon$.
The proof can be found in [Shalev-Shwartz and Ben-David, 2014]. Now, recalling Eq. (5) and Eq. (7), when $\mathcal{A}_b$ runs on a training set of size $n_2(\mathcal{H}_b, \delta, \epsilon\gamma)$, the hamming loss is bounded as,

$$L^H_D(h) \le \frac{1}{\gamma}L^P_D(h) \le \frac{1}{\gamma}L^{GP}_D(h) \le \epsilon \quad (15)$$

Taking the corresponding multi-class hypothesis $h$ as our solution, we obtain a provable algorithm ensuring that MDPL problems are PAC-learnable with the finite sample complexity $n_2(\mathcal{H}_b, \delta, \epsilon\gamma)$. Therefore, Theorem 1 is proved.

4.3 Further Discussion

Remark on the Proposed Condition. Suppose we know the distribution of partial examples. We can then design a Bayesian optimal classifier with zero partial hamming loss. In the realizable case, our goal is to find a hypothesis $h^*$ that satisfies $L^H_D(h^*) = 0$. Denote the optimal hypothesis set by $\mathcal{H}^*$. If there exists a hypothesis $\hat{h} \notin \mathcal{H}^*$ such that $L^P_D(\hat{h}) = 0$, even the Bayesian optimal classifier cannot guarantee to return an optimal solution. Hence, our sufficient condition actually ensures that the ERM learner returns an optimal hypothesis from $\mathcal{H}^*$.

Relation to PML. Another observation is that the recently popularized task of partial multi-label learning also benefits from our theoretical analysis. In a typical PML problem, the ground-truth binary labels are adulterated with some irrelevant labels. If we regard each candidate label as a two-element candidate label set, PML can be categorized as an MDPL problem. In the independent case, each positive label is accompanied by a negative label; thus, it should be treated as a positive-unlabeled learning problem instead of a PLL problem, whose learnability is analyzed in [Denis, 1998]. In the dependent case, PML enjoys the same PAC-learnability as MDPL. In reality, PML is an atypical branch of MDPL, because only the positive labels are partially labeled.

Practical Implementation. According to our theoretical analysis, two MDPL algorithms are instantiated under our ERM framework.
In the independent case, we propose MDPL-BR, which reduces an MDPL problem to multiple PLL tasks that can be solved by any off-the-shelf PLL method. For the dependent case, we present the MDPL-CP method.

| Dataset (avg.#CLs) | MDPL-CP | MDPL-BR | MDPL-kNN | P-VLS | CoH |
| --- | --- | --- | --- | --- | --- |
| Puppy (1.3 1.1 1.4 1.4) | **.603 ± .047** | .384 ± .033 | .432 ± .043 | .578 ± .084 | .529 ± .073 |
| 1 2 1 2 3 | **.736 ± .050** | .367 ± .052 | .461 ± .038 | .659 ± .044 | .473 ± .045 |
| 1 2 2 2 3 | **.673 ± .041** | .352 ± .033 | .360 ± .025 | .659 ± .053 | .436 ± .024 |
| 1 2 2 2 4 | **.664 ± .081** | .359 ± .032 | .364 ± .056 | .646 ± .063 | .418 ± .046 |
| 1 2 2 2 5 | **.609 ± .023** | .340 ± .062 | .404 ± .041 | .601 ± .034 | .400 ± .045 |
| 3 2 1 | **.942 ± .013** | .910 ± .010 | .921 ± .006 | .683 ± .125 | .928 ± .008 |
| 4 4 2 | **.939 ± .025** | .862 ± .020 | .906 ± .021 | .331 ± .031 | .922 ± .008 |
| 6 4 2 | **.928 ± .051** | .888 ± .011 | .883 ± .015 | .454 ± .070 | .919 ± .009 |
| 7 5 2 | .902 ± .090 | .837 ± .021 | .868 ± .018 | .576 ± .030 | **.913 ± .007** |
| 2 1 2 1 2 2 1 | **.632 ± .009** | .614 ± .009 | .604 ± .003 | .622 ± .014 | .601 ± .016 |
| 2 2 2 2 2 2 2 | **.631 ± .016** | .605 ± .005 | .586 ± .008 | .629 ± .018 | .616 ± .009 |
| 2 2 3 3 2 2 2 | **.621 ± .015** | .577 ± .009 | .566 ± .010 | .616 ± .012 | .578 ± .017 |
| 3 3 3 3 3 3 3 | .612 ± .011 | .524 ± .022 | .513 ± .010 | **.622 ± .018** | .554 ± .015 |
| 1 1 1 2 2 2 2 | **.671 ± .015** | .644 ± .005 | .638 ± .008 | .659 ± .014 | .615 ± .016 |
| 2 2 2 2 2 2 2 | **.660 ± .006** | .638 ± .003 | .623 ± .005 | .658 ± .008 | .604 ± .018 |
| 3 2 3 2 2 3 2 | **.653 ± .013** | .601 ± .010 | .579 ± .005 | .643 ± .016 | .596 ± .014 |
| 3 3 3 3 3 3 3 | **.648 ± .010** | .543 ± .007 | .533 ± .013 | .646 ± .013 | .568 ± .011 |
| 2 2 1 1 1 1 1 | .961 ± .002 | **.962 ± .001** | .960 ± .001 | .799 ± .014 | .952 ± .016 |
| 2 2 1 1 2 2 1 | .960 ± .001 | **.961 ± .001** | .960 ± .002 | .717 ± .049 | .954 ± .008 |
| 3 3 1 1 2 2 1 | **.959 ± .001** | .953 ± .003 | **.959 ± .001** | .710 ± .022 | .943 ± .004 |
| 3 3 2 1 2 2 2 | **.960 ± .002** | .896 ± .004 | .958 ± .003 | .746 ± .039 | .949 ± .004 |

Table 2: Results of hamming accuracy on all datasets (mean ± standard deviation). The best ones are in bold. avg.#CLs: average number of candidate labels; each configuration for a synthetic dataset gives the average number of candidate labels on each dimension.

Note that by Eq. (7), partial hamming
loss is also a surrogate loss for the partial global loss. Therefore, we unify the two algorithms by minimizing the proposed partial hamming loss. To validate the theoretical results, we consider all the label combinations in the experiments. However, such a strategy may limit the scalability of MDPL-CP. This problem can be alleviated by an ensemble technique [Tsoumakas et al., 2011]. Due to the page limit, we leave it for future work.

5 Experiments

In this section, we evaluate the performance of our proposed methods on both synthetic and real-world datasets. All the computations are performed on a workstation with an i7-5930K CPU, a TITAN Xp GPU and 64GB main memory, running Linux.

5.1 Datasets

Synthetic Datasets. We follow the experimental setting in [Wang et al., 2019] and synthesize a total of 20 MDPL datasets from 5 real-world MDC datasets collected from the UCI repository [Dheeru and Karra Taniskidou, 2017]: 1) Bridges estimates bridge properties from specific constraints; 2) WQplant and WQanimal determine the plant and animal species in Slovenian rivers; 3) Flare predicts the number of times certain types of solar flare occurred within a 24-hour period; 4) Thyroid determines types of thyroid problems based on patient information. For each class variable of an example, we randomly select some negative labels and aggregate them with the ground-truth label to obtain a candidate label set. Different configurations are controlled by the average number of candidate labels in each class space. The detailed information is reported in Table 3.

| Datasets | N | m | d | K |
| --- | --- | --- | --- | --- |
| Puppy | 102 | 1,000 | 4 | 2-4 |
| Bridges | 108 | 7 | 5 | 2-6 |
| Flare | 1,066 | 10 | 3 | 3-8 |
| WQanimal | 1,060 | 16 | 7 | 4 |
| WQplant | 1,060 | 16 | 7 | 4 |
| Thyroid | 9,172 | 29 | 7 | 2-5 |

Table 3: Statistics of the experimental datasets.

Puppy Dataset. Because MDPL is a new learning setting, there is no publicly available MDPL dataset yet. To further support our empirical studies, this paper builds one real-world MDPL dataset, Puppy.
A total of 102 dog images are collected and categorized into 4 class variables {Place, Tree, Dog Breeds, Weather}. We manually tagged all the data examples with ground-truth labels. The candidate label sets are collected by crowdsourcing. Moreover, we extract the 1000-dimensional fc-8 features of these images using a pre-trained VGG-19 [Simonyan and Zisserman, 2015] model.

Figure 1: Some MDPL image annotation examples on Puppy. For each image, we show the candidate labels and the labels predicted by all the methods. The black labels denote the ground-truth or correctly predicted labels; the red labels denote the distractor labels or wrongly predicted ones.

All the datasets are randomly split into 80% training and 20% testing. We run five trials on each dataset and report the mean hamming accuracy with standard deviation.

5.2 Baselines

We compare the proposed algorithms with three state-of-the-art baselines: 1) P-VLS: PARTICLE [Fang and Zhang, 2019] is an effective PML method that integrates label propagation and calibrated label ranking techniques. By regarding each nominal label as a binary label, the MDPL problem can be transformed into a PML problem and solved by PARTICLE. In this work, we choose the virtual-label-splitting-based version, i.e., P-VLS. 2) CoH: CoH [Shen et al., 2018] is a label-embedding-based multi-label algorithm that jointly compresses the input and output into a latent space by co-hashing.
We employ it to deal with MDPL tasks by treating all the candidate labels as valid ones. Note that P-VLS and CoH return a group of positive labels of uncertain size; hence, we take the label with the maximum score in each class space as the prediction. 3) MDPL-kNN: we induce a k-nearest-neighbor model from the MDPL data, and the prediction is made by voting in each class space.

We use CLPL [Cour et al., 2011] as the base PLL predictor in the MDPL-BR method. For MDPL-CP, we choose a linear classifier as the base hypothesis. In practice, we also adopt the naive convex surrogate loss of [Jin and Ghahramani, 2002] to implement MDPL-CP. For our methods, the empirical risk is optimized by stochastic gradient descent, and we add an l2 regularization term. The learning rate and the regularization parameters are fine-tuned by cross-validation. The number of nearest neighbors is set to k = 10 for all the kNN-based approaches. Following the experimental setting in [Fang and Zhang, 2019], we set thr = 0.9 and α = 0.95 for P-VLS. Finally, following [Shen et al., 2018], the parameters of CoH are set to α = 100 and d = 10.

5.3 Results

Table 2 lists the hamming accuracy of all the methods on Puppy and the 20 synthetic MDPL datasets. Figure 1 shows predictive results on some test examples from the Puppy dataset. From the results, we can see that: 1) The MDPL-CP algorithm achieves the best performance, which verifies our theoretical analysis. For instance, on the Puppy dataset, MDPL-CP improves on the best result of the baselines by 4.3%. By considering all the label combinations, it fully explores the label correlations with theoretically guaranteed disambiguation ability. 2) MDPL-BR works well on some datasets such as Thyroid. However, it generally underperforms MDPL-CP because it neglects the correlations, which further impairs its disambiguation ability. 3) P-VLS and CoH are inferior to the MDPL-CP method.
Since they are designed for partial multi-label tasks, they wrongly learn the label correlations of intra-class labels due to the false positive labels, which leads to degenerated performance. 4) Without considering either correlations or ambiguity, MDPL-kNN underperforms MDPL-CP and MDPL-BR. From these observations, we conclude that our proposed methods can effectively tackle MDPL problems.

6 Conclusion

In this paper, we propose a novel learning paradigm named Multi-Dimensional Partial Label Learning (MDPL), where each data instance is equipped with multiple candidate label sets. Based on our proposed partial hamming loss, we present an empirical risk minimization framework for MDPL. Theoretically, we rigorously prove the ERM learnability of MDPL under specific conditions. We further provide two effective MDPL algorithms under our ERM framework. In the independent case, we propose MDPL-BR, which decomposes the original task into a series of partial label learning problems. In the dependent case, we propose MDPL-CP, which fully explores the label correlations. Extensive experiments on both synthetic and real-world datasets validate our theoretical studies and the effectiveness of our proposed methods.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant No. 61976161) and the Key R&D Program of Zhejiang Province (Grant No. 2020C01024).

References

[Arias et al., 2016] Jacinto Arias, José A. Gámez, Thomas D. Nielsen, and José Miguel Puerta. A scalable pairwise class interaction framework for multidimensional classification. Int. J. Approx. Reasoning, 68:194-210, 2016.
[Barutçuoglu et al., 2006] Zafer Barutçuoglu, Robert E. Schapire, and Olga G. Troyanskaya. Hierarchical multi-label prediction of gene function. Bioinformatics, 22(7):830-836, 2006.
[Batal et al., 2013] Iyad Batal, Charmgil Hong, and Milos Hauskrecht.
An efficient probabilistic framework for multi-dimensional classification. In CIKM, pages 2417-2422, 2013.
[Benjumeda et al., 2018] Marco Benjumeda, Concha Bielza, and Pedro Larrañaga. Tractability of most probable explanations in multidimensional Bayesian network classifiers. Int. J. Approx. Reasoning, 93:74-87, 2018.
[Chapelle et al., 2002] Olivier Chapelle, Jason Weston, and Bernhard Schölkopf. Cluster kernels for semi-supervised learning. In NeurIPS, pages 585-592, 2002.
[Cour et al., 2011] Timothée Cour, Benjamin Sapp, and Ben Taskar. Learning from partial labels. Journal of Machine Learning Research, 12:1501-1536, 2011.
[Denis, 1998] François Denis. PAC learning from positive statistical queries. In ALT, pages 112-126, 1998.
[Dheeru and Karra Taniskidou, 2017] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.
[Fang and Zhang, 2019] Jun-Peng Fang and Min-Ling Zhang. Partial multi-label learning via credible label elicitation. In AAAI, 2019.
[Feng and An, 2019] Lei Feng and Bo An. Partial label learning with self-guided retraining. In AAAI, pages 3542-3549, 2019.
[Ishida et al., 2017] Takashi Ishida, Gang Niu, Weihua Hu, and Masashi Sugiyama. Learning from complementary labels. In NeurIPS, pages 5644-5654, 2017.
[Jia and Zhang, 2019] Bin-Bin Jia and Min-Ling Zhang. Multi-dimensional classification via kNN feature augmentation. In AAAI, 2019.
[Jin and Ghahramani, 2002] Rong Jin and Zoubin Ghahramani. Learning with multiple labels. In NeurIPS, pages 897-904, 2002.
[Khosla et al., 2011] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Fei-Fei Li. Novel dataset for fine-grained image categorization: Stanford Dogs. In Proc. CVPR Workshop on Fine-Grained Visual Categorization (FGVC), volume 2, 2011.
[Kiryo et al., 2017] Ryuichi Kiryo, Gang Niu, Marthinus Christoffel du Plessis, and Masashi Sugiyama.
Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, pages 1674-1684, 2017.
[Liu and Dietterich, 2012] Li-Ping Liu and Thomas G. Dietterich. A conditional multinomial mixture model for superset label learning. In NeurIPS, pages 557-565, 2012.
[Liu and Tsang, 2017] Weiwei Liu and Ivor W. Tsang. Making decision trees feasible in ultrahigh feature and label dimensions. J. Mach. Learn. Res., 18:81:1-81:36, 2017.
[Liu et al., 2019] Weiwei Liu, Donna Xu, Ivor W. Tsang, and Wenjie Zhang. Metric learning for multi-output tasks. IEEE Trans. Pattern Anal. Mach. Intell., 41(2):408-422, 2019.
[Lv et al., 2020] Jiaqi Lv, Miao Xu, Lei Feng, Gang Niu, Xin Geng, and Masashi Sugiyama. Progressive identification of true labels for partial-label learning. CoRR, abs/2002.08053, 2020.
[Natarajan, 1989] B. K. Natarajan. On learning sets and functions. Machine Learning, 4:67-97, 1989.
[Ortigosa-Hernández et al., 2012] Jonathan Ortigosa-Hernández, Juan Diego Rodríguez, Leandro Alzate, Manuel Lucania, Iñaki Inza, and José Antonio Lozano. Approaching sentiment analysis by using semi-supervised learning of multi-dimensional classifiers. Neurocomputing, 92:98-115, 2012.
[Read et al., 2014] Jesse Read, Concha Bielza, and Pedro Larrañaga. Multi-dimensional classification with super-classes. IEEE Trans. Knowl. Data Eng., 26(7):1720-1733, 2014.
[Shalev-Shwartz and Ben-David, 2014] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[Shen et al., 2018] Xiaobo Shen, Weiwei Liu, Ivor W. Tsang, Quan-Sen Sun, and Yew-Soon Ong. Compact multi-label learning. In AAAI, pages 4066-4073, 2018.
[Simonyan and Zisserman, 2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[Tsoumakas et al., 2011] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis P. Vlahavas. Random k-labelsets for multilabel classification. IEEE Trans. Knowl.
Data Eng., 23(7):1079-1089, 2011.
[Wang et al., 2019] Haobo Wang, Weiwei Liu, Yang Zhao, Chen Zhang, Tianlei Hu, and Gang Chen. Discriminative and correlative partial multi-label learning. In IJCAI, pages 3691-3697, 2019.
[Wu and Zhang, 2018] Xuan Wu and Min-Ling Zhang. Towards enabling binary decomposition for partial label learning. In IJCAI, pages 2868-2874, 2018.
[Zaragoza et al., 2011] Julio H. Zaragoza, Luis Enrique Sucar, Eduardo F. Morales, Concha Bielza, and Pedro Larrañaga. Bayesian chain classifiers for multidimensional classification. In IJCAI, pages 2192-2197, 2011.
[Zhan and Zhang, 2017] Wang Zhan and Min-Ling Zhang. Inductive semi-supervised multi-label learning with co-training. In KDD, pages 1305-1314, 2017.
[Zhao and Guo, 2015] Feipeng Zhao and Yuhong Guo. Semi-supervised multi-label learning with incomplete labels. In IJCAI, pages 4062-4068, 2015.