# Copula Multi-label Learning

Weiwei Liu
School of Computer Science, Wuhan University, Wuhan, China 430072
liuweiwei863@gmail.com

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

**Abstract.** A formidable challenge in multi-label learning is to model the interdependencies between labels and features. Unfortunately, the statistical properties of existing multi-label dependency models are still not well understood. Copulas are a powerful tool for modeling the dependence of multivariate data and have achieved great success in a wide range of applications, such as finance, econometrics and systems neuroscience. This inspires us to develop a novel copula multi-label learning paradigm for modeling label and feature dependencies. The copula-based paradigm enables us to reveal new statistical insights in multi-label learning. In particular, this paper first leverages the kernel trick to construct a continuous distribution in the output space, and then estimates the proposed model semiparametrically, where the copula is modeled parametrically while the marginal distributions are modeled nonparametrically. Theoretically, we show that our estimator is an unbiased and consistent estimator and follows asymptotically a normal distribution. Moreover, we bound the mean squared error of the estimator. Experimental results from various domains validate the superiority of the proposed approach.

## 1 Introduction

Multi-label learning [1, 2, 3, 4, 5, 6], which allows multiple labels for each instance simultaneously, is of paramount importance in a variety of fields ranging from protein function classification and video annotation to automatic image categorization. For example, an image may carry the tags Cloud, Tree and Sky; labels such as Government, Policy and Election may be needed to describe the subject of a video; and a gene can belong to the functions Protein Synthesis, Metabolism and Transcription.

Binary relevance (BR) [7] is one of the most popular baselines for multi-label learning; it independently trains a binary classifier for each label. Recently, much of the multi-label learning literature [8, 9, 10, 11, 12, 13] has shown that the independence assumption among labels and features leads to degraded performance. A plethora of methods have been motivated by the need to model this dependence. For example, the classifier chain (CC) model [14] captures label dependency by using binary label predictions as extra input attributes for the subsequent classifiers in a chain. CCA [15] uses canonical correlation analysis to learn label dependency. CPLST [10] uses principal component analysis to capture both label and feature dependencies. Unfortunately, the statistical properties of all these methods are still not well understood. This paper aims to fill this gap. In particular, the work in this paper is inspired by Sklar's observation below.

**Sklar's Observation.** Sklar's Theorem [16] shows that the univariate margins and the multivariate dependence structure can be separated, and that the dependence structure can be represented by a copula [17]. Therefore, a copula contains all the information needed to measure dependence, and it is invariant to any nonlinear, strictly increasing transformations of the marginal variables.

**Contributions.** Motivated by the copula's strength in modeling dependence, we develop a novel copula multi-label learning paradigm for modeling label and feature dependencies.
Our main contributions are summarized as follows. First, we provide a statistical understanding of multi-label learning. Second, this paper leverages the kernel trick to construct a continuous distribution in the output space, and then uses a semiparametric approach to estimate the proposed model, where the copula is modeled parametrically while the marginal distributions are modeled nonparametrically. Theoretically, we show that our estimator is an unbiased and consistent estimator and follows asymptotically a normal distribution. Moreover, we bound the mean squared error (MSE) of the estimator; our results show that the MSE goes to 0 as the number of training samples goes to infinity. Experimental results on several real-world data sets from a wide range of domains demonstrate that the proposed approach outperforms existing dependency models.

We organize this paper as follows. We first discuss related work in the following paragraph. Section 2 presents some basic definitions and the n-dimensional copula. Section 3 introduces the copula multi-label learning paradigm and its estimators. Section 4 presents the statistical properties of the proposed estimator, and experimental results are presented in Section 5. The last section provides our conclusions.

**Related Work.** Copula [17] modeling has become exceedingly popular in recent years and has been used successfully in a diverse range of applications, especially in finance [18], actuarial science [19], survival analysis [20], systems neuroscience [21] and econometrics [22]. Owing to their flexibility and simplicity in modeling dependence, copulas have also been widely explored in the machine learning community, for instance in domain adaptation [23, 24], Markov chain Monte Carlo [25], latent Markov networks [26], Bayesian networks [27], kernel learning [28] and graphical models [29, 30]. However, it is unclear whether copulas are advantageous for multi-label learning. This work provides an affirmative answer.

## 2 Preliminaries

We denote by $\mathrm{Dom}\,H$ and $\mathrm{Ran}\,H$ the domain and range of a function $H$. Given two real vectors $\mathbf{a} = [a_1, \ldots, a_n]$ and $\mathbf{b} = [b_1, \ldots, b_n]$, we write $\mathbf{a} \le \mathbf{b}$ if $a_i \le b_i$ for all $i \in \{1, 2, \ldots, n\}$. For $\mathbf{a} \le \mathbf{b}$, let $[\mathbf{a}, \mathbf{b}]$ denote the n-box $B = [a_1, b_1] \times \cdots \times [a_n, b_n]$, the Cartesian product of $n$ closed intervals. The vertices of an n-box $B$ are the points $\mathbf{d} = [d_1, \ldots, d_n]$, where each $d_i$ is equal to either $a_i$ or $b_i$. We first present the following concepts and notation.

**Definition 1.** A sequence of random variables $x_n$ is said to converge to a constant $\varpi$ in probability, written $x_n \xrightarrow{P} \varpi$, if for every $\epsilon > 0$, $P(|x_n - \varpi| < \epsilon) \to 1$ as $n \to \infty$, or equivalently $P(|x_n - \varpi| \ge \epsilon) \to 0$ as $n \to \infty$.

**Definition 2.** Given two sequences of random variables $x_n$ and $y_n$, $x_n$ is of smaller order in probability than $y_n$, written $x_n = o_p(y_n)$, if $x_n / y_n \xrightarrow{P} 0$. In particular, $x_n = o_p(1)$ if and only if $x_n \xrightarrow{P} 0$.

**Definition 3.** Given two sequences of random variables $x_n$ and $y_n$, $x_n$ is of order less than or equal to that of $y_n$ in probability, written $x_n = O_p(y_n)$, if given $\epsilon > 0$ there exist a constant $M = M(\epsilon)$ and an integer $m = m(\epsilon)$ such that $P(|x_n| \le M |y_n|) \ge 1 - \epsilon$ for all $n > m$.

**Definition 4.** A sequence of random variables $x_n$ with cumulative distribution functions (CDF) $F_n$ converges in distribution to a random variable $x$ with CDF $F$, written $x_n \xrightarrow{D} x$, if $\lim_{n \to \infty} F_n(x) = F(x)$ for all continuity points $x$ of $F$.
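As a textbook illustration of these notions (this example is ours, not from the paper), let $x_1, \ldots, x_n$ be i.i.d. random variables with mean $\mu$ and finite variance $\sigma^2$, and let $\bar{x}_n$ denote their sample mean. The law of large numbers and the central limit theorem give
$$\bar{x}_n \xrightarrow{P} \mu, \qquad \bar{x}_n - \mu = O_p(n^{-1/2}), \qquad \sqrt{n}\,(\bar{x}_n - \mu) \xrightarrow{D} N(0, \sigma^2).$$
This $O_p(n^{-1/2})$ rate and asymptotic normality are exactly the kind of behavior that Assumptions A and B and the results of Section 4 establish for the quantities entering the proposed estimator.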
Next, we introduce some important definitions before turning to copulas.

**Definition 5 (H-volume).** Let $S_1, S_2, \ldots, S_n$ be nonempty subsets of $[-\infty, +\infty]$, and let $H$ be a real function of $n$ variables such that $\mathrm{Dom}\,H = S_1 \times \cdots \times S_n$. Let $B = [\mathbf{a}, \mathbf{b}]$ be an n-box all of whose vertices are in $\mathrm{Dom}\,H$. Then the H-volume of $B$ is given by
$$V_H(B) = \sum_{\mathbf{d}} \mathrm{sgn}(\mathbf{d})\, H(\mathbf{d}),$$
where the sum is taken over all vertices $\mathbf{d}$ of $B$, and
$$\mathrm{sgn}(\mathbf{d}) = \begin{cases} 1 & \text{if } d_i = a_i \text{ for an even number of } i\text{'s}, \\ -1 & \text{if } d_i = a_i \text{ for an odd number of } i\text{'s}. \end{cases}$$

**Definition 6 (n-increasing).** A real function $H$ of $n$ variables is n-increasing if $V_H(B) \ge 0$ for all n-boxes $B$ whose vertices lie in $\mathrm{Dom}\,H$.

**Definition 7.** Suppose that the domain of a real function $H$ of $n$ variables is $\mathrm{Dom}\,H = S_1 \times \cdots \times S_n$, where each $S_i$ has a least element $a_i$. We say that $H$ is grounded if $H(\mathbf{d}) = 0$ for all $\mathbf{d}$ in $\mathrm{Dom}\,H$ such that $d_i = a_i$ for at least one $i$. If each $S_i$ is nonempty and has a greatest element $b_i$, then $H$ has margins, and the one-dimensional margins of $H$ are the functions $H_i$ given by $\mathrm{Dom}\,H_i = S_i$ and $H_i(x) = H(b_1, \ldots, b_{i-1}, x, b_{i+1}, \ldots, b_n)$ for all $x$ in $S_i$. Higher-dimensional margins are defined by fixing fewer places in $H$. One-dimensional margins are simply called margins, and for $i \ge 2$ we write $i$-margin for an $i$-dimensional margin.

**Definition 8 (n-dimensional copula).** An n-dimensional copula (or n-copula) is a function $C : [0,1]^n \to [0,1]$ such that (i) $C$ is n-increasing and grounded; (ii) $C$ has margins $C_i$, $i \in \{1, 2, \ldots, n\}$, which satisfy $C_i(u) = u$ for all $u$ in $[0,1]$.

Note that for any n-copula $C$ with $n \ge 3$, each $i$-margin of $C$ is an $i$-copula, for $2 \le i < n$. Next, we introduce the most important theorem regarding copulas, Sklar's Theorem.

**Theorem 1 (Sklar's Theorem [16]).** Let $H$ be an n-dimensional distribution function with marginal CDFs $F_1, \ldots, F_n$. Then there exists an n-copula $C$ such that for all $(x_1, x_2, \ldots, x_n) \in [-\infty, +\infty]^n$,
$$H(x_1, \ldots, x_n) = C(F_1(x_1), \ldots, F_n(x_n)).$$
If $F_1, \ldots, F_n$ are all continuous, then $C$ is unique; otherwise $C$ is uniquely determined on $\mathrm{Ran}\,F_1 \times \cdots \times \mathrm{Ran}\,F_n$. Conversely, if $C$ is an n-copula and $F_1, \ldots, F_n$ are distribution functions, then the function $H$ defined above is an n-dimensional distribution function with marginal CDFs $F_1, \ldots, F_n$.

Sklar's Theorem indicates that a copula allows a complete separation of the dependence modeling from the marginal distributions: by specifying a copula, one summarizes all the dependencies between the margins. Assume that $H$ has $n$-th order partial derivatives. Using the chain rule, we derive the joint density $f$ from the copula construction:
$$f(x_1, \ldots, x_n) = \frac{\partial^n C(F_1(x_1), \ldots, F_n(x_n))}{\partial F_1(x_1) \cdots \partial F_n(x_n)} \prod_{i=1}^n f_i(x_i) = c(F_1(x_1), \ldots, F_n(x_n)) \prod_{i=1}^n f_i(x_i), \tag{1}$$
where $c(F_1(x_1), \ldots, F_n(x_n))$ is called the copula density function and $f_i$ is the marginal density function of $x_i$.
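As a concrete numerical illustration of Eq. (1) (our sketch, not code from the paper), the following Python snippet checks the copula factorization for a bivariate standard normal distribution with an arbitrarily chosen correlation of 0.6: the joint density equals the Gaussian copula density evaluated at the marginal CDF values, multiplied by the two marginal densities.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def gaussian_copula_density(u, rho):
    """Bivariate Gaussian copula density c(u1, u2; rho) at uniforms u = Phi(x)."""
    z = norm.ppf(u)                                  # map uniforms back to normal scores
    Z = np.array([[1.0, rho], [rho, 1.0]])
    quad = z @ (np.linalg.inv(Z) - np.eye(2)) @ z
    return np.linalg.det(Z) ** -0.5 * np.exp(-0.5 * quad)

rho = 0.6                       # illustrative correlation, not a value from the paper
x = np.array([0.3, -1.2])       # an arbitrary evaluation point

# Left-hand side of Eq. (1): the joint density f(x1, x2)
lhs = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]]).pdf(x)

# Right-hand side of Eq. (1): copula density times the product of marginal densities
u = norm.cdf(x)
rhs = gaussian_copula_density(u, rho) * norm.pdf(x).prod()

print(lhs, rhs)                 # the two values agree up to floating-point error
```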
## 3 Copula Multi-label Learning

We denote the transpose of a vector/matrix by the superscript $\top$ and logarithms to base 2 by $\log$. Let $\|\cdot\|_2$ denote the $\ell_2$ norm. Assume that $\mathbf{x} = (x_1, \ldots, x_p)^\top \in \mathbb{R}^{p \times 1}$ is a random vector representing an input (feature), and $\mathbf{z} = (z_1, \ldots, z_q)^\top \in \{0,1\}^{q \times 1}$ is a binary random vector representing the corresponding output (label) in multi-label classification (MLC). We denote by $F_a$ and $f_a$ the CDF and density function of $x_a$, $a \in \{1, \ldots, p\}$, respectively. Let $p_j$ denote the probability mass function of $z_j$, $j \in \{1, \ldots, q\}$.

### 3.1 Kernel Trick: Constructing a Continuous Distribution

Note that the output of MLC is a Boolean-valued variable, and it is non-trivial to apply Sklar's Theorem to discrete variables. In this paper, we replace each discrete distribution with a continuous one. In particular, we leverage a kernel density with a uniform kernel and a small bandwidth to construct the continuous distribution. Let $b$ be the bandwidth; $b$ should be less than or equal to half the distance between the two discrete points. For an observation at $z_j$, $j \in \{1, \ldots, q\}$, the kernel density function is $p_j(z_j)/(2b)$ on $[z_j - b, z_j + b]$. By setting $b = 0.5$, we transform the binary variable $z_j$ into a continuous variable $y_j$ with CDF
$$F_j(y_j) = \begin{cases} 0 & y_j < -0.5, \\ p_j(0)\,(y_j + 0.5) & -0.5 \le y_j \le 0.5, \\ p_j(0) + p_j(1)\,(y_j - 0.5) & 0.5 < y_j \le 1.5, \\ 1 & y_j > 1.5. \end{cases}$$

### 3.2 Copula Modeling

Using Sklar's Theorem, the CDF of $(y_1, \ldots, y_q, x_1, \ldots, x_p)$ can be expressed as $C(F_1(y_1), \ldots, F_q(y_q), F_1(x_1), \ldots, F_p(x_p))$, where $C$ is the $(p+q)$-copula function of $(y_1, \ldots, y_q, x_1, \ldots, x_p)$. In what follows, we focus on the $(p+1)$-margin of $C$ on the variables $(y_j, x_1, \ldots, x_p)$, $j \in \{1, \ldots, q\}$, which inherits the label dependence information from the $(p+q)$-copula function. Using Eq. (1), we derive the conditional density of $y_j$ given $\mathbf{x}$ as
$$f_{y_j}(y_j \mid \mathbf{x}) = \frac{\Upsilon_j(y_j)\, c(F_j(y_j), F_1(x_1), \ldots, F_p(x_p))}{c_x(F_1(x_1), \ldots, F_p(x_p))},$$
where $\Upsilon_j(y_j)$ is the density function of $y_j$ and $c_x(F_1(x_1), \ldots, F_p(x_p)) = \frac{\partial^p C(1, F_1(x_1), \ldots, F_p(x_p))}{\partial F_1(x_1) \cdots \partial F_p(x_p)}$ is the copula density of $x_1, \ldots, x_p$. The conditional mean $\Xi_j(\mathbf{x})$ of $y_j$ given $\mathbf{x}$ can be derived as
$$\Xi_j(\mathbf{x}) = E(y_j \mid \mathbf{x}) = \int y_j f_{y_j}(y_j \mid \mathbf{x})\, dy_j = \int \frac{y_j \Upsilon_j(y_j)\, c(F_j(y_j), F_1(x_1), \ldots, F_p(x_p))}{c_x(F_1(x_1), \ldots, F_p(x_p))}\, dy_j = \frac{\vartheta_j(F_1(x_1), \ldots, F_p(x_p))}{c_x(F_1(x_1), \ldots, F_p(x_p))} = E\big(y_j\, \omega(F_j(y_j), F_1(x_1), \ldots, F_p(x_p))\big), \tag{2}$$
where $\omega(F_j(y_j), F_1(x_1), \ldots, F_p(x_p)) = \frac{c(F_j(y_j), F_1(x_1), \ldots, F_p(x_p))}{c_x(F_1(x_1), \ldots, F_p(x_p))}$ and $\vartheta_j(F_1(x_1), \ldots, F_p(x_p)) = E\big(y_j\, c(F_j(y_j), F_1(x_1), \ldots, F_p(x_p))\big)$. Eq. (2) shows that the conditional mean of $y_j$ given $\mathbf{x}$ can be obtained from the copula density. It also indicates that the conditional mean is a weighted mean, where the weights are induced by the copula function $\omega$ defined above.

**Proposition 1.** If $p = 1$ or $x_1, \ldots, x_p$ are mutually independent, Eq. (2) reduces to $\Xi_j(\mathbf{x}) = \vartheta_j(F_1(x_1), \ldots, F_p(x_p))$, $j \in \{1, \ldots, q\}$.

Given estimators $\hat{\omega}$, $\hat{F}_j$ and $\hat{F}_1, \ldots, \hat{F}_p$ for $\omega$, $F_j$ and $F_1, \ldots, F_p$, respectively, $\Xi_j$ can be estimated by $\hat{\Xi}_j(\mathbf{x}) = \hat{E}\big(y_j\, \hat{\omega}(\hat{F}_j(y_j), \hat{F}_1(x_1), \ldots, \hat{F}_p(x_p))\big)$. To estimate $\omega$, one needs estimators for the copula densities $c$ and $c_x$. Finally, we make the multi-label predictions based on the value of $\hat{\Xi}_j(\mathbf{x})$. This paper uses a semiparametric approach in which the copula is modeled parametrically while the marginal distributions are modeled nonparametrically.

### 3.3 Estimators

Let $(y_1^{(i)}, \ldots, y_q^{(i)}, x_1^{(i)}, \ldots, x_p^{(i)})$, $i \in \{1, \ldots, n\}$, be $n$ independent and identically distributed (i.i.d.) training samples generated from the distribution of $(y_1, \ldots, y_q, x_1, \ldots, x_p)$. For $j \in \{1, \ldots, q\}$, the probability mass functions are estimated by $\hat{p}_j(0) = \frac{\sum_{i=1}^n \mathbb{I}(y_j^{(i)} = 0)}{n}$ and $\hat{p}_j(1) = 1 - \hat{p}_j(0)$, where $\mathbb{I}(\cdot)$ is the indicator function. $F_j$ is estimated by
$$\hat{F}_j(y_j) = \begin{cases} 0 & y_j < -0.5, \\ \hat{p}_j(0)\,(y_j + 0.5) & -0.5 \le y_j \le 0.5, \\ \hat{p}_j(0) + \hat{p}_j(1)\,(y_j - 0.5) & 0.5 < y_j \le 1.5, \\ 1 & y_j > 1.5. \end{cases}$$

We use the kernel smoothing method to estimate $F_1, \ldots, F_p$. Let $k(\cdot)$ be a symmetric probability density function and $h$ be the bandwidth. For $j \in \{1, \ldots, p\}$, $F_j$ is estimated by
$$\hat{F}_j(x_j) = \frac{1}{n} \sum_{i=1}^n K\Big(\frac{x_j - x_j^{(i)}}{h}\Big), \qquad K(t) = \int_{-\infty}^t k(u)\, du.$$
We make the following assumption on $\hat{F}_j(x_j)$.

**Assumption A.** For $j \in \{1, \ldots, p\}$,
$$\hat{F}_j(x_j) = \frac{\sum_{i=1}^n \mathbb{I}(x_j^{(i)} \le x_j)}{n} + o_p(n^{-1/2}).$$
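Before turning to the copula parameter, the two marginal estimators above can be sketched in a few lines of Python. This is a minimal illustration under our own naming and with a Gaussian choice for the kernel $k$; the paper does not specify $k$ beyond requiring a symmetric density.

```python
import numpy as np
from scipy.stats import norm

def label_cdf(y, z_train):
    """Plug-in estimate of F_j from Section 3.3: piecewise-linear CDF of the
    continuous label y_j, with p_j(0), p_j(1) estimated from binary observations."""
    p0 = np.mean(np.asarray(z_train) == 0)
    p1 = 1.0 - p0
    y = np.asarray(y, dtype=float)
    return np.where(y < -0.5, 0.0,
           np.where(y <= 0.5, p0 * (y + 0.5),
           np.where(y <= 1.5, p0 + p1 * (y - 0.5), 1.0)))

def feature_cdf(x, x_train, h=0.1):
    """Kernel-smoothed CDF estimate (1/n) * sum_i K((x - x_i)/h), with K the CDF
    of the kernel k; here k is taken to be the standard normal density."""
    x = np.asarray(x, dtype=float)
    return norm.cdf((x[..., None] - np.asarray(x_train)) / h).mean(axis=-1)

# toy usage on synthetic data
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=100)     # binary observations of one label
xf = rng.normal(size=100)            # observations of one feature
print(label_cdf([0.0, 1.0], z), feature_cdf([0.0, 0.5], xf, h=0.1))
```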
We use a parametric approach to estimate the copula. Suppose that the $(p+q)$-copula density belongs to a given parametric family $\mathcal{C} = \{c(\cdot\,; \theta), \theta \in \mathbb{R}^{v \times 1}\}$, and let $\theta^*$ be the true but unknown copula parameter. The maximum pseudo-likelihood estimator [31, 32] $\hat{\theta}$ is one of the most popular estimators of $\theta^*$; it is defined as
$$\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^n \log c\big(\hat{F}_1(y_1^{(i)}), \ldots, \hat{F}_q(y_q^{(i)}), \hat{F}_1(x_1^{(i)}), \ldots, \hat{F}_p(x_p^{(i)}); \theta\big).$$
Let $\theta_j^* \in \mathbb{R}^{d \times 1}$ and $\hat{\theta}_j \in \mathbb{R}^{d \times 1}$ be, respectively, the true parameter and its estimator for the $(p+1)$-margin on the variables $(y_j, x_1, \ldots, x_p)$. We make the following assumption on $\hat{\theta}_j$.

**Assumption B.** For $j \in \{1, \ldots, q\}$,
$$\hat{\theta}_j - \theta_j^* = \frac{\sum_{i=1}^n \psi_i}{n} + o_p(n^{-1/2}),$$
where $\psi_i = \psi(F_j(y_j^{(i)}), F_1(x_1^{(i)}), \ldots, F_p(x_p^{(i)}); \theta_j^*)$ is a $d$-dimensional random vector such that $E(\psi) = (0, \ldots, 0)^\top$ and $E\|\psi\|_2^2 < \infty$. [32] shows that the maximum pseudo-likelihood estimator satisfies this assumption.

The following analysis focuses on the $(p+1)$-margin $c$ on the variables $(y_j, x_1, \ldots, x_p)$. To simplify the analysis, we introduce some notation. For $a \in \{1, \ldots, p+1\}$, $c_a = \frac{\partial c}{\partial u_a}$. For $a \in \{1, \ldots, p\}$, $c_{x,a} = \frac{\partial c_x}{\partial u_a}$ and $\vartheta_{j,a} = \frac{\partial \vartheta_j}{\partial u_a}$. Let $\nabla c_x = (c_{x,1}, \ldots, c_{x,p})^\top$ and $\nabla \vartheta_j = (\vartheta_{j,1}, \ldots, \vartheta_{j,p})^\top$. Let $\dot{c} = (\frac{\partial c}{\partial \theta_1}, \ldots, \frac{\partial c}{\partial \theta_d})^\top$, $\dot{c}_x = (\frac{\partial c_x}{\partial \theta_1}, \ldots, \frac{\partial c_x}{\partial \theta_d})^\top$ and $\dot{\vartheta}_j = (\frac{\partial \vartheta_j}{\partial \theta_1}, \ldots, \frac{\partial \vartheta_j}{\partial \theta_d})^\top$, where $\vartheta_j(u_1, \ldots, u_p; \theta) = E\big(y_j\, c(F_j(y_j), u_1, \ldots, u_p; \theta)\big)$. We make the following assumptions.

**Assumption C.** (i) $c$ and $c_a$, $a \in \{1, \ldots, p+1\}$, are continuous. (ii) $E|y_j| < \infty$ for $j \in \{1, \ldots, q\}$. (iii) $E\big(y_j\, c_a(F_j(y_j), F_1(x_1), \ldots, F_p(x_p); \theta_j^*)\big)^2 < \infty$ and $E\big(y_j\, c(F_j(y_j), F_1(x_1), \ldots, F_p(x_p); \theta_j^*)\big)^2 < \infty$ for $j \in \{1, \ldots, q\}$ and $a \in \{1, \ldots, p+1\}$. (iv) $E\big(y_j\, \frac{\partial c(F_j(y_j), F_1(x_1), \ldots, F_p(x_p); \theta_j^*)}{\partial \theta_b}\big)^2 < \infty$ for $j \in \{1, \ldots, q\}$ and $b \in \{1, \ldots, d\}$.
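The maximum pseudo-likelihood step can be sketched as follows for the Gaussian copula used later in the experiments. The parametrization (a single equicorrelation value) and the grid-search optimizer are simplifications of ours; the paper does not commit to a particular optimizer at this level of detail.

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_logdensity(U, Z):
    """Log-density of a d-dimensional Gaussian copula at pseudo-observations U (n x d)."""
    S = norm.ppf(np.clip(U, 1e-6, 1 - 1e-6))          # normal scores, kept away from 0 and 1
    M = np.linalg.inv(Z) - np.eye(Z.shape[0])
    quad = np.einsum('ij,jk,ik->i', S, M, S)           # row-wise quadratic forms
    return -0.5 * np.log(np.linalg.det(Z)) - 0.5 * quad

def fit_equicorrelation(U, grid=np.linspace(-0.2, 0.9, 56)):
    """Maximum pseudo-likelihood over Z(rho) = (1 - rho) I + rho 11^T (grid search)."""
    d = U.shape[1]
    best_rho, best_ll = 0.0, -np.inf
    for rho in grid:
        Z = (1 - rho) * np.eye(d) + rho * np.ones((d, d))
        if np.linalg.det(Z) <= 0:                      # skip non positive-definite candidates
            continue
        ll = gaussian_copula_logdensity(U, Z).sum()    # sum_i log c(U_i; theta)
        if ll > best_ll:
            best_rho, best_ll = rho, ll
    return best_rho

# toy usage: U holds the plugged-in marginal CDF values of labels and features, one row per sample
rng = np.random.default_rng(0)
U = rng.uniform(size=(200, 3))
print(fit_equicorrelation(U))
```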
## 4 Main Results

This section presents the statistical properties of our proposed estimator. The proofs can be found in the Supplementary Materials.

### 4.1 Unbiasedness and Consistency

We first consider the simple case $p = 1$. Proposition 1 shows that $\Xi_j(x_1) = \vartheta_j(F_1(x_1); \theta_j^*) = E\big(y_j\, c(F_j(y_j), F_1(x_1); \theta_j^*)\big)$ can be estimated by
$$\hat{\Xi}_j(x_1) = \frac{1}{n} \sum_{i=1}^n y_j^{(i)}\, c\big(\hat{F}_j(y_j^{(i)}), \hat{F}_1(x_1); \hat{\theta}_j\big).$$
We first provide the following lemma.

**Lemma 1.** For $j \in \{1, \ldots, q\}$, suppose that Assumption C holds. If $\hat{F}_1(x_1) = F_1(x_1) + O_p(n^{-1/2})$ and $\hat{\theta}_j = \theta_j^* + O_p(n^{-1/2})$, then we have
$$\hat{\Xi}_j(x_1) = \frac{1}{n} \sum_{i=1}^n y_j^{(i)}\, c\big(F_j(y_j^{(i)}), F_1(x_1); \theta_j^*\big) + \frac{1}{n} \sum_{i=1}^n y_j^{(i)}\big(\hat{F}_j(y_j^{(i)}) - F_j(y_j^{(i)})\big)\, c_1\big(F_j(y_j^{(i)}), F_1(x_1); \theta_j^*\big) + \big(\hat{F}_1(x_1) - F_1(x_1)\big)\, \vartheta_{j,1}(F_1(x_1); \theta_j^*) + \big(\hat{\theta}_j - \theta_j^*\big)^\top \dot{\vartheta}_j(F_1(x_1); \theta_j^*) + o_p(n^{-1/2}).$$

Based on Lemma 1, we present the following theorem.

**Theorem 2.** Given $p = 1$, under Assumptions A, B and the conditions of Lemma 1, for $j \in \{1, \ldots, q\}$, $\hat{\Xi}_j(x_1)$ is an unbiased and consistent estimator of $\Xi_j(x_1)$.

In the general case $p > 1$, let $F(\mathbf{x}) = (F_1(x_1), \ldots, F_p(x_p))$ and $\hat{F}(\mathbf{x}) = (\hat{F}_1(x_1), \ldots, \hat{F}_p(x_p))$, so that $\Xi_j(\mathbf{x}) = \frac{\vartheta_j(F(\mathbf{x}); \theta_j^*)}{c_x(F(\mathbf{x}); \theta_j^*)}$. Similarly to the case $p = 1$, we estimate the numerator of $\Xi_j(\mathbf{x})$ by $\hat{\vartheta}_j(\hat{F}(\mathbf{x}); \hat{\theta}_j) = \frac{1}{n} \sum_{i=1}^n y_j^{(i)}\, c(\hat{F}_j(y_j^{(i)}), \hat{F}(\mathbf{x}); \hat{\theta}_j)$. Under Assumptions A, B and C, suppose that $\hat{F}_a(x_a) = F_a(x_a) + O_p(n^{-1/2})$, $a \in \{1, \ldots, p\}$, and $\hat{\theta}_j = \theta_j^* + O_p(n^{-1/2})$, $j \in \{1, \ldots, q\}$. According to Lemma 1, we obtain
$$\hat{\vartheta}_j(\hat{F}(\mathbf{x}); \hat{\theta}_j) - \vartheta_j(F(\mathbf{x}); \theta_j^*) = \frac{1}{n} \sum_{i=1}^n \Psi_i(\mathbf{x}; \theta_j^*) + o_p(n^{-1/2}), \tag{3}$$
where $\Psi_i(\mathbf{x}; \theta_j^*) = y_j^{(i)}\big(\hat{F}_j(y_j^{(i)}) - F_j(y_j^{(i)})\big)\, c_1\big(F_j(y_j^{(i)}), F(\mathbf{x}); \theta_j^*\big) + \sum_{a=1}^p \big(\mathbb{I}(x_a^{(i)} \le x_a) - F_a(x_a)\big)\, \vartheta_{j,a}(F(\mathbf{x}); \theta_j^*) + \psi_i^\top \dot{\vartheta}_j(F(\mathbf{x}); \theta_j^*)$. The denominator of $\Xi_j(\mathbf{x})$ can be estimated by $\hat{c}_x(\hat{F}(\mathbf{x}); \hat{\theta}_j) = \frac{1}{n} \sum_{i=1}^n c(\hat{F}_j(y_j^{(i)}), \hat{F}(\mathbf{x}); \hat{\theta}_j)$. Similarly, under the conditions for the numerator estimator, we also obtain
$$\hat{c}_x(\hat{F}(\mathbf{x}); \hat{\theta}_j) - c_x(F(\mathbf{x}); \theta_j^*) = \frac{1}{n} \sum_{i=1}^n W_i(\mathbf{x}; \theta_j^*) + o_p(n^{-1/2}), \tag{4}$$
where $W_i(\mathbf{x}; \theta_j^*) = \sum_{a=1}^p \big(\mathbb{I}(x_a^{(i)} \le x_a) - F_a(x_a)\big)\, c_{x,a}(F(\mathbf{x}); \theta_j^*) + \psi_i^\top \dot{c}_x(F(\mathbf{x}); \theta_j^*)$. For $j \in \{1, \ldots, q\}$, the estimator of $\Xi_j(\mathbf{x})$ is given by
$$\hat{\Xi}_j(\mathbf{x}) = \frac{\hat{\vartheta}_j(\hat{F}(\mathbf{x}))}{\hat{c}_x(\hat{F}(\mathbf{x}))} = \frac{\sum_{i=1}^n y_j^{(i)}\, c\big(\hat{F}_j(y_j^{(i)}), \hat{F}(\mathbf{x}); \hat{\theta}_j\big)}{\sum_{i=1}^n c\big(\hat{F}_j(y_j^{(i)}), \hat{F}(\mathbf{x}); \hat{\theta}_j\big)}.$$
Eq. (3) and Eq. (4) lead to the following theorem.

**Theorem 3.** Under Assumptions A, B and C, suppose that $\hat{F}_a(x_a) = F_a(x_a) + O_p(n^{-1/2})$, $a \in \{1, \ldots, p\}$, and $\hat{\theta}_j = \theta_j^* + O_p(n^{-1/2})$, $j \in \{1, \ldots, q\}$. Then
$$\hat{\Xi}_j(\mathbf{x}) - \Xi_j(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^n \frac{\Psi_i(\mathbf{x}) - \Xi_j(\mathbf{x})\, W_i(\mathbf{x})}{c_x(F(\mathbf{x}))} + o_p(n^{-1/2}).$$

Based on Theorem 3, we derive the following corollary.

**Corollary 1.** Given $p > 1$, under the conditions of Theorem 3, for $j \in \{1, \ldots, q\}$, $\hat{\Xi}_j(\mathbf{x})$ is an unbiased and consistent estimator of $\Xi_j(\mathbf{x})$.

### 4.2 Asymptotic Normality

Let $N(0, 1)$ denote the standard Gaussian distribution and $\mathrm{Var}$ the variance. We first consider the simple case $p = 1$; based on Lemma 1 and the central limit theorem (CLT) [33], we provide the following theorem.

**Theorem 4.** If $p = 1$, suppose that Assumptions A, B and the conditions of Lemma 1 hold. For $j \in \{1, \ldots, q\}$, we have
$$\frac{\sqrt{n}\big(\hat{\Xi}_j(x_1) - \Xi_j(x_1)\big)}{\sqrt{\mathrm{Var}(\Psi_i(x_1))}} \xrightarrow{D} N(0, 1),$$
where $\Psi_i(x_1) = y_j^{(i)}\big(\hat{F}_j(y_j^{(i)}) - F_j(y_j^{(i)})\big)\, c_1\big(F_j(y_j^{(i)}), F_1(x_1); \theta_j^*\big) + \big(\mathbb{I}(x_1^{(i)} \le x_1) - F_1(x_1)\big)\, \vartheta_{j,1}(F_1(x_1); \theta_j^*) + \psi_i^\top \dot{\vartheta}_j(F_1(x_1); \theta_j^*)$.

Based on Theorem 3 and the CLT, we provide the following theorem for the general case.

**Theorem 5.** Given $p > 1$, under the conditions of Theorem 3, for $j \in \{1, \ldots, q\}$, we have
$$\frac{\sqrt{n}\big(\hat{\Xi}_j(\mathbf{x}) - \Xi_j(\mathbf{x})\big)}{\sqrt{\mathrm{Var}\Big(\frac{\Psi_i(\mathbf{x}) - \Xi_j(\mathbf{x})\, W_i(\mathbf{x})}{c_x(F(\mathbf{x}))}\Big)}} \xrightarrow{D} N(0, 1).$$
Theorem 5 shows that $\sqrt{n}\big(\hat{\Xi}_j(\mathbf{x}) - \Xi_j(\mathbf{x})\big)$ asymptotically follows a normal distribution with mean 0 and variance $\mathrm{Var}\Big(\frac{\Psi_i(\mathbf{x}) - \Xi_j(\mathbf{x})\, W_i(\mathbf{x})}{c_x(F(\mathbf{x}))}\Big)$.

### 4.3 Bounds on the Mean Squared Error

The mean squared error (MSE) is a basic measure of the accuracy of the estimator $\hat{\Xi}_j(\mathbf{x})$, $j \in \{1, \ldots, q\}$, at an arbitrary fixed point $\mathbf{x} = (x_1, \ldots, x_p)^\top \in \mathbb{R}^{p \times 1}$. It is defined as
$$\mathrm{MSE}_j(\mathbf{x}) = E\big(\hat{\Xi}_j(\mathbf{x}) - \Xi_j(\mathbf{x})\big)^2. \tag{5}$$
Eq. (5) can be rewritten as
$$\mathrm{MSE}_j(\mathbf{x}) = b_j^2(\mathbf{x}) + \sigma_j^2(\mathbf{x}), \tag{6}$$
where $b_j(\mathbf{x}) = E(\hat{\Xi}_j(\mathbf{x})) - \Xi_j(\mathbf{x})$ is the bias and $\sigma_j^2(\mathbf{x}) = E\big(\hat{\Xi}_j(\mathbf{x}) - E(\hat{\Xi}_j(\mathbf{x}))\big)^2$ is the variance of the estimator $\hat{\Xi}_j(\mathbf{x})$ at the point $\mathbf{x}$. Section 4.1 shows that $\hat{\Xi}_j(\mathbf{x})$ is an unbiased estimator, so $b_j(\mathbf{x}) = 0$. Next, we bound the variance of the estimator. The following theorem covers the case $p = 1$.

**Theorem 6.** Given $p = 1$, suppose that Assumptions A, B and the conditions of Lemma 1 hold. For $j \in \{1, \ldots, q\}$, assume that the density function of $y_j$ satisfies $\Upsilon_j(y_j) \le \Upsilon_{\max} < \infty$ and that $c(\hat{F}_j(y_j), \hat{F}_1(x_1)) \le c_{\max} < \infty$. Then we have
$$\mathrm{MSE}_j(x_1) \le \frac{7\, \Upsilon_{\max}\, c_{\max}^2}{6n}.$$

Similarly, we obtain the following theorem for the general case.

**Theorem 7.** Given $p > 1$, suppose that the conditions of Theorem 3 hold. For $j \in \{1, \ldots, q\}$, assume that the density function of $y_j$ satisfies $\Upsilon_j(y_j) \le \Upsilon_{\max} < \infty$. Then we have $\mathrm{MSE}_j(\mathbf{x}) \le 7\, \Upsilon_{\max} \cdots$

**Remark.** Theorems 6 and 7 show that the MSE of $\hat{\Xi}_j(x_1)$ and $\hat{\Xi}_j(\mathbf{x})$, $j \in \{1, \ldots, q\}$, goes to 0 as $n$ goes to infinity.
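Before moving to the experiments, note that the estimator $\hat{\Xi}_j(\mathbf{x})$ from Section 4.1 amounts to a copula-weighted average of the transformed training labels. The sketch below assumes a fitted copula density is available as a callable; thresholding the conditional mean at 0.5 to obtain a binary prediction is an illustrative choice of ours, not a rule stated in the paper.

```python
import numpy as np

def predict_label_j(u_x, y_j, u_yj, copula_density, threshold=0.5):
    """Copula-weighted prediction for one label j:
        Xi_hat_j(x) = sum_i y_j^(i) c(F_hat_j(y_j^(i)), F_hat(x))
                      / sum_i       c(F_hat_j(y_j^(i)), F_hat(x))
    u_x            : F_hat(x), CDF-transformed test feature vector, shape (p,)
    y_j            : transformed training labels y_j^(i), shape (n,)
    u_yj           : F_hat_j(y_j^(i)) for each training sample, shape (n,)
    copula_density : callable returning the fitted (p+1)-dimensional copula density c(u)
    threshold      : illustrative cut-off for turning the conditional mean into a bit
    """
    weights = np.array([copula_density(np.concatenate(([u], u_x))) for u in u_yj])
    xi_hat = np.dot(y_j, weights) / weights.sum()      # empirical version of Eq. (2)
    return xi_hat, int(xi_hat >= threshold)
```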
## 5 Experiment

### 5.1 Data Sets and Baselines

This paper considers two popular families of copulas.

**Multivariate normal copula.** A Gaussian copula with a given correlation matrix $Z \in [-1, 1]^{d \times d}$ is defined as $C(u_1, \ldots, u_d; Z) = \Phi_Z(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_d))$, where $\Phi^{-1}$ is the inverse CDF of a standard normal distribution and $\Phi_Z$ is the joint CDF of a multivariate normal distribution with zero mean vector and correlation matrix $Z$. The density function is
$$c(u_1, \ldots, u_d; Z) = \det(Z)^{-1/2} \exp\Big(-\tfrac{1}{2}\, \varsigma^\top (Z^{-1} - I_d)\, \varsigma\Big),$$
where $\det(Z)$ is the determinant of $Z$, $\varsigma = (\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_d))^\top$ and $I_d$ is the identity matrix.

**Multivariate Student's t copula.** A Student's t copula with a given correlation matrix $Z \in [-1, 1]^{d \times d}$ and degrees of freedom $\nu$ is defined as $C(u_1, \ldots, u_d; Z, \nu) = T_{Z,\nu}(T_\nu^{-1}(u_1), \ldots, T_\nu^{-1}(u_d))$, where $T_\nu^{-1}$ is the inverse CDF of a Student's t-distribution with $\nu$ degrees of freedom and $T_{Z,\nu}$ is the joint CDF of a multivariate Student's t-distribution with correlation matrix $Z$ and $\nu$ degrees of freedom. The density function is
$$c(u_1, \ldots, u_d; Z, \nu) = \det(Z)^{-1/2}\, \frac{\Gamma\big(\tfrac{\nu+d}{2}\big)\, \big(\Gamma\big(\tfrac{\nu}{2}\big)\big)^{d-1}}{\big(\Gamma\big(\tfrac{\nu+1}{2}\big)\big)^{d}}\, \frac{\big(1 + \tfrac{1}{\nu}\, \varsigma^\top Z^{-1} \varsigma\big)^{-(\nu+d)/2}}{\prod_{i=1}^d \big(1 + \varsigma_i^2/\nu\big)^{-(\nu+1)/2}},$$
where $\Gamma$ is the gamma function and $\varsigma = (\varsigma_1, \ldots, \varsigma_d)^\top$ with $\varsigma_i = T_\nu^{-1}(u_i)$, $i = 1, \ldots, d$.

We abbreviate our proposed copula multi-label learning with the multivariate normal copula and with the multivariate Student's t copula as CML+GAU and CML+ST, respectively. This section evaluates the performance of the proposed method on five real-world benchmark data sets from various domains: EMOTIONS (music), SCENE (image), MEDICAL (text), YEAST (biology) and ENRON (text). The statistics of these data sets are available at http://mulan.sourceforge.net/datasets-mlc.html. We compare CML+GAU and CML+ST with several multi-label learning approaches that aim to capture the interdependencies between labels: BR, CC, CCA and CPLST. We use the code provided by the respective authors with default parameters. The bandwidth is set to $h = 0.1$ in the experiments. To fairly measure the performance of our methods and the baselines, we use Hamming Loss, Example-F1, Micro-F1 and Macro-F1 as evaluation measures [10, 34, 35]. The smaller the value of Hamming Loss, the better the performance, while the larger the values of the other three measures, the better the performance. We perform 3-fold cross-validation on each data set and report the mean and standard deviation of each evaluation measure.

Table 1: The results of Example-F1 on the various data sets (mean ± standard deviation). The best results are in bold.

| DATA SET | BR | CC | CCA | CPLST | CML+GAU | CML+ST |
|---|---|---|---|---|---|---|
| EMOTIONS | 0.5051 ± 0.0288 | 0.5063 ± 0.0349 | 0.5141 ± 0.0283 | 0.5178 ± 0.0686 | **0.5215 ± 0.0642** | 0.5146 ± 0.0389 |
| SCENE | 0.5203 ± 0.0509 | 0.5358 ± 0.0229 | 0.5212 ± 0.0559 | 0.5311 ± 0.0533 | **0.5378 ± 0.0250** | 0.5367 ± 0.0571 |
| MEDICAL | 0.1633 ± 0.0738 | 0.1690 ± 0.0145 | 0.1713 ± 0.0866 | 0.1716 ± 0.0372 | **0.1767 ± 0.0160** | 0.1728 ± 0.0551 |
| YEAST | 0.5578 ± 0.0153 | 0.5620 ± 0.0329 | 0.5609 ± 0.0124 | 0.5649 ± 0.0323 | 0.5647 ± 0.0112 | **0.5893 ± 0.0429** |
| ENRON | 0.1707 ± 0.0536 | 0.1799 ± 0.0107 | 0.1756 ± 0.0413 | 0.1822 ± 0.0225 | **0.1888 ± 0.0313** | 0.1831 ± 0.0322 |

Table 2: The results of Macro-F1 on the various data sets (mean ± standard deviation). The best results are in bold.

| DATA SET | BR | CC | CCA | CPLST | CML+GAU | CML+ST |
|---|---|---|---|---|---|---|
| EMOTIONS | 0.5475 ± 0.0172 | 0.5510 ± 0.0304 | 0.5492 ± 0.0259 | 0.5528 ± 0.0393 | **0.5701 ± 0.0147** | 0.5534 ± 0.0355 |
| SCENE | 0.5801 ± 0.0540 | 0.5814 ± 0.0163 | 0.5910 ± 0.0335 | 0.5926 ± 0.0312 | **0.6084 ± 0.0217** | 0.5953 ± 0.0411 |
| MEDICAL | 0.0669 ± 0.0417 | 0.0769 ± 0.0153 | 0.0677 ± 0.0276 | 0.0733 ± 0.0463 | 0.0751 ± 0.0018 | **0.0808 ± 0.0199** |
| YEAST | 0.3342 ± 0.0115 | 0.3393 ± 0.0267 | 0.3373 ± 0.0106 | 0.3453 ± 0.0381 | 0.3523 ± 0.0143 | **0.3590 ± 0.0145** |
| ENRON | 0.0165 ± 0.0054 | 0.0176 ± 0.0018 | 0.0216 ± 0.0033 | 0.0194 ± 0.0016 | 0.0218 ± 0.0016 | **0.0220 ± 0.0030** |

### 5.2 Results

Tables 1 and 2 report the Example-F1 and Macro-F1 results of our methods and the baselines on the different data sets. The Hamming Loss and Micro-F1 results are reported in the Supplementary Materials. From Tables 1 and 2, we observe the following. 1) BR generally underperforms: it does not consider the distributions of and relationships between labels, so it achieves much lower accuracy. 2) CPLST outperforms CC and CCA, because CPLST captures both the label and the feature dependencies. 3) Our proposed methods are the most successful on all data sets. These empirical results illustrate the superiority of the proposed model and corroborate our theoretical studies.
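For reference, the four evaluation measures used above can be computed with scikit-learn as follows (the Y matrices are binary indicator arrays of shape n_samples x n_labels); this is standard library usage, not code from the paper.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

def evaluate(Y_true, Y_pred):
    """Hamming Loss (lower is better) and three F1 variants (higher is better)."""
    return {
        "hamming_loss": hamming_loss(Y_true, Y_pred),
        "example_f1":   f1_score(Y_true, Y_pred, average="samples", zero_division=0),
        "micro_f1":     f1_score(Y_true, Y_pred, average="micro", zero_division=0),
        "macro_f1":     f1_score(Y_true, Y_pred, average="macro", zero_division=0),
    }

# toy usage with two samples and three labels
Y_true = np.array([[1, 0, 1], [0, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0]])
print(evaluate(Y_true, Y_pred))
```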
## 6 Conclusion

The great success of copulas in a wide range of applications inspired us to develop a novel copula multi-label learning paradigm for modeling label and feature dependencies and to reveal new statistical insights in multi-label learning. In particular, after leveraging the kernel trick to construct a continuous distribution in the output space, we use a semiparametric approach to estimate the proposed model. Theoretically, we show that our estimator is an unbiased and consistent estimator and that its distribution converges to a normal distribution. Moreover, we provide a bound on the MSE of the estimator. The experimental results demonstrate the effectiveness of the proposed approach.

## Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant 61976161 and the Science and Technology Major Project of Hubei Province (Next-Generation AI Technologies).

## References

[1] Grigorios Tsoumakas, Min-Ling Zhang, and Zhi-Hua Zhou. Introduction to the special issue on learning from multi-label data. Machine Learning, 88(1-2):1-4, 2012.

[2] Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, and Prateek Jain. Sparse local embeddings for extreme multi-label classification. In NIPS, pages 730-738, 2015.

[3] Ian En-Hsu Yen, Xiangru Huang, Pradeep Ravikumar, Kai Zhong, and Inderjit S. Dhillon. PD-Sparse: A primal and dual sparse approach to extreme multiclass and multilabel classification. In ICML, pages 3069-3077, 2016.

[4] Weiwei Liu and Ivor W. Tsang. Making decision trees feasible in ultrahigh feature and label dimensions. Journal of Machine Learning Research, 18(81):1-36, 2017.

[5] Weiwei Liu, Ivor W. Tsang, and Klaus-Robert Müller. An easy-to-hard learning paradigm for multiple classes and multiple labels. Journal of Machine Learning Research, 18(94):1-38, 2017.

[6] Weiwei Liu, Donna Xu, Ivor W. Tsang, and Wenjie Zhang. Metric learning for multi-output tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):408-422, 2019.

[7] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis P. Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pages 667-685, 2010.

[8] Weiwei Liu and Ivor W. Tsang. On the optimality of classifier chain for multi-label classification. In NIPS, pages 712-720, 2015.

[9] Weiwei Liu and Ivor W. Tsang. Large margin metric learning for multi-label prediction. In AAAI, pages 2800-2806, 2015.

[10] Yao-Nan Chen and Hsuan-Tien Lin. Feature-aware label space dimension reduction for multi-label classification. In NIPS, pages 1538-1546, 2012.

[11] Xiaobo Shen, Weiwei Liu, Ivor W. Tsang, Quan-Sen Sun, and Yew-Soon Ong. Multilabel prediction via cross-view search. IEEE Transactions on Neural Networks and Learning Systems, 29(9):4324-4338, 2018.

[12] Xiaobo Shen, Weiwei Liu, Ivor W. Tsang, Quan-Sen Sun, and Yew-Soon Ong. Compact multi-label learning. In AAAI, pages 4066-4073, 2018.

[13] Joey Tianyi Zhou, Hao Zhang, Di Jin, and Xi Peng. Dual adversarial transfer for sequence labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

[14] Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333-359, 2011.

[15] Yi Zhang and Jeff G. Schneider. Multi-label output codes using canonical correlation analysis. In AISTATS, pages 873-882, 2011.
[16] A. Sklar. Random variables, distribution functions, and copulas: a personal look backward and forward. Institute of Mathematical Statistics Lecture Notes - Monograph Series, 28:1-14, 1996.

[17] Roger B. Nelsen. An Introduction to Copulas. Springer, 2006.

[18] Umberto Cherubini, Elisa Luciano, and Walter Vecchiato. Copula Methods in Finance. Wiley, 2004.

[19] Edward W. Frees and Emiliano A. Valdez. Understanding relationships using copulas. North American Actuarial Journal, 2(1):1-25, 1998.

[20] David Oakes. Bivariate survival models induced by frailties. Journal of the American Statistical Association, 84(406):487-493, 1989.

[21] Pietro Berkes, Frank D. Wood, and Jonathan W. Pillow. Characterizing neural dependencies with copula models. In NIPS, pages 129-136, 2008.

[22] Andrew Patton. Modelling asymmetric exchange rate dependence. International Economic Review, 47(2):527-556, 2006.

[23] David López-Paz, José Miguel Hernández-Lobato, and Bernhard Schölkopf. Semi-supervised domain adaptation with non-parametric copulas. In NIPS, pages 674-682, 2012.

[24] Joey Tianyi Zhou, Ivor W. Tsang, Sinno Jialin Pan, and Mingkui Tan. Multi-class heterogeneous domain adaptation. Journal of Machine Learning Research, 20:57:1-57:31, 2019.

[25] Alfredo A. Kalaitzis and Ricardo Bezerra de Andrade e Silva. Flexible sampling of discrete data correlations without the marginal distributions. In NIPS, pages 2517-2525, 2013.

[26] Rongjing Xiang and Jennifer Neville. Collective inference for network data with copula latent Markov networks. In WSDM, pages 647-656, 2013.

[27] Gal Elidan. Copula Bayesian networks. In NIPS, pages 559-567, 2010.

[28] Barnabás Póczos, Zoubin Ghahramani, and Jeff G. Schneider. Copula-based kernel dependency measures. In ICML, pages 775-782, 2012.

[29] Sergey Kirshner. Learning with tree-averaged densities and distributions. In NIPS, pages 761-768, 2007.

[30] Han Liu, John D. Lafferty, and Larry A. Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10:2295-2328, 2009.

[31] C. Genest, K. Ghoudi, and L.-P. Rivest. A semiparametric estimation procedure of dependence parameters in multivariate families of distributions. Biometrika, 82:543-552, 1995.

[32] Hideatsu Tsukahara. Semiparametric estimation in copula models. The Canadian Journal of Statistics, 33(3):357-375, 2005.

[33] Robert J. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, 1980.

[34] Joey Tianyi Zhou, Hao Zhang, Di Jin, Hongyuan Zhu, Meng Fang, Rick Siow Mong Goh, and Kenneth Kwok. Dual adversarial neural transfer for low-resource named entity recognition. In ACL, pages 3461-3471, 2019.

[35] Joey Tianyi Zhou, Ivor W. Tsang, Shen-Shyang Ho, and Klaus-Robert Müller. N-ary decomposition for multi-class classification. Machine Learning, 108(5):809-830, 2019.