# Theoretical Analysis of Label Distribution Learning

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Jing Wang, Xin Geng
MOE Key Laboratory of Computer Network and Information Integration
School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
{wangjing91, xgeng}@seu.edu.cn

## Abstract

As a novel learning paradigm, label distribution learning (LDL) explicitly models label ambiguity through the notion of label description degree. Although much work has been done on real-world applications, theoretical results on LDL remain unexplored. In this paper, we rethink LDL from a theoretical perspective, towards analyzing the learnability of LDL. Firstly, risk bounds for three representative LDL algorithms (AA-kNN, AA-BP and SA-ME) are provided. For AA-kNN, Lipschitzness of the label distribution function is assumed to bound the risk, and for AA-BP and SA-ME, Rademacher complexity is utilized to give data-dependent risk bounds. Secondly, a generalized plug-in decision theorem is proposed to understand the relation between LDL and classification, uncovering that approximating the conditional probability distribution function in absolute loss guarantees approaching the optimal classifier, and data-dependent error probability bounds are also presented for the corresponding LDL algorithms when used for classification. As far as we know, this is perhaps the first research on the theory of LDL.

## Introduction

Traditional learning paradigms include single-label learning (SLL) and multi-label learning (MLL) (Zhang and Zhou 2014). SLL assumes that an instance is associated with one label, while MLL assumes that multiple labels are assigned to an instance. Essentially, both SLL and MLL aim at finding the related label or labels that describe the instance, while neither supports the relative importance of the labels assigned to the instance. However, in some real-world applications, labels are related to the instance with different degrees of relative importance. Thus it seems reasonable to label an instance with a soft label rather than a hard one (e.g., a single label or a set of labels). Inspired by this, (Geng 2016) proposes a novel learning paradigm, Label Distribution Learning (LDL), which labels an instance with a distribution of description degrees over the label space, called a label distribution, and learns a mapping from instance to label distribution directly. Compared with MLL, where positive labels are commonly treated equally (with description degree implicitly equal to $1/c$ for $c$ positive labels), LDL allows explicit modeling of the different relative importance of the labels assigned to the instance, which is more suitable for many real scenarios.

Recently, LDL has been extensively applied in many real-world applications, which can be classified into three classes according to the source of the label distribution. The first class features label distributions that come from the data itself, including pre-release rating prediction on movies (Geng and Hou 2015), emotion recognition (Zhou, Xue, and Geng 2015), etc. The second class is characterized by label distributions originating from prior knowledge, with applications including age estimation (Geng, Yin, and Zhou 2013), head pose estimation (Geng and Xia 2014), etc.
The third class involves label distributions that are learned from the data automatically. Applications of this class include label-importance-aware multi-label learning (Li, Zhang, and Geng 2015), beauty sensing (Ren and Geng 2017), video parsing (Geng and Ling 2017), etc. LDL succeeds in such a variety of fields because the explicit introduction of label ambiguity via the label distribution boosts the performance of real-world applications.

In this paper, we re-examine LDL from a theoretical perspective, towards analyzing the generalization of LDL algorithms. Precisely, there are at least two questions related to the generalization of LDL. The first concerns the generalization of LDL itself, and the second concerns the generalization of LDL when used to perform classification. On the one hand, LDL can be regarded as a kind of multi-output regression (Borchani et al. 2015), and generalization results for multi-output regression algorithms (Kakade, Sridharan, and Tewari 2009) can be transferred to LDL to some extent. However, the special characteristics of LDL should not be neglected. The first thing to note is the probability simplex constraint on the label distribution. As suggested by (Geng 2016), the target label distribution is real-valued and satisfies distribution constraints, i.e., $\eta_x^y \in [0, 1]$ and $\sum_{y \in \mathcal{Y}} \eta_x^y = 1$ (formally defined in the Preliminary). This constraint is often satisfied by applying a softmax function to each output of a multi-output regression model, which increases the complexity of the corresponding model. The second thing to note is the specialization of the measures used for LDL. Although Lipschitzness of the loss function is a common assumption when bounding the risk, the Kullback-Leibler (KL) divergence used as the loss function for LDL (e.g., in SA-ME) does not satisfy Lipschitzness. On the other hand, we find that LDL is often adopted to perform classification implicitly. For example, in applications such as age estimation (Geng, Yin, and Zhou 2013), head-pose estimation (Geng and Xia 2014), and pre-release rating prediction on movies (Geng and Hou 2015), the label corresponding to the maximum predicted description degree is treated as the predicted label. As far as we know, the generalization of such a framework has never been touched.

The main contributions of this paper are summarized as follows:
1) We establish risk bounds for three representative LDL algorithms, i.e., AA-BP, SA-ME, and AA-kNN, where the bounds for the first two are data-dependent risk bounds, and the bound for AA-kNN is a consistency bound under a Lipschitz assumption.
2) We generalize the binary plug-in decision theorem to multi-class classification, discovering that LDL dominates classification.
3) We provide, to the best of our knowledge, the first data-dependent error probability bounds for three representative LDL algorithms used to perform classification.

The rest of this paper is organized as follows. Firstly, related works are briefly reviewed. Secondly, notations are introduced in the Preliminary. Then risk bounds for three representative LDL algorithms are provided. Next, the generalization of LDL to perform classification is examined. Finally, we conclude the paper.

## Related Work

LDL (Geng 2016) is a novel learning paradigm, which labels an instance with a label distribution and learns a mapping from instance to label distribution directly. Existing studies mainly focus on algorithm design and improvement. Three strategies are embraced to design algorithms for LDL (Geng 2016). The first is Problem Transformation (PT).
Algorithms of this class first transform the dataset equipped with label distributions into a single-label dataset via re-sampling, and then SLL algorithms are employed on the transformed single-label dataset. Two representative algorithms are PT-SVM and PT-Bayes, which apply SVM and the Bayes classifier, respectively. The second strategy is Algorithm Adaptation (AA), which extends certain existing learning algorithms to handle label distributions seamlessly. Concretely, k-NN is adapted such that, for an instance, the mean of the label distributions of its k nearest neighbors is calculated as the predicted label distribution, which is denoted by AA-kNN. Besides, a three-layer back-propagation neural network with multiple outputs is adapted to minimize the sum-squared loss between the output of the neural network and the real label distribution, which is denoted by AA-BP. The third strategy is Specialized Algorithms (SA), which match the characteristics of LDL. Two specialized algorithms, i.e., SA-IIS and SA-BFGS, are proposed by applying the maximum entropy model (Berger, Pietra, and Pietra 1996) with KL divergence as the loss function to learn the label distribution. Notice that all the above algorithms are designed without a learnability guarantee.

Recently, two lines of improvement on LDL are noteworthy. The first tackles the shortage of label distribution datasets. Compared with SLL and MLL, LDL datasets are more difficult to acquire. (Xu, Tao, and Geng 2018) proposes to boost logical labels (from SLL/MLL datasets) into real-valued label distributions by graph Laplacian label enhancement. (Hou et al. 2017) utilizes semi-supervised learning, with abundant unlabeled data, to adaptively learn the label distribution. (Xu and Zhou 2017) deals with the situation where the label distribution is only partially observed, and jointly recovers and learns it by matrix completion with a trace-norm regularizer. The second line exploits the label correlations of LDL. (Zhou, Xue, and Geng 2015) represents label correlations by Pearson's correlation coefficient. (Jia et al. 2018) addresses label correlations by adding a regularizer, which encodes label correlations into distances between labels. (Zheng, Jia, and Li 2018) considers label correlations to be local, encodes them into a local correlation vector, and learns the optimal encoding vector and the LDL model simultaneously. These algorithmic improvements are mainly guided by practice, without taking theoretical feasibility into consideration.

There is little work on the theory of LDL. One recent paper (Zhao and Zhou 2018) partially studies the generalization of LDL. However, the major motivation of that paper is to take the structure of the label space into consideration by applying the optimal transport distance (Villani 2008) instead of traditional LDL measures. Moreover, the proposed risk bound is too broad, without providing a bound on the Rademacher complexity of the hypothesis space. Besides, the characteristics of LDL measures are not taken into consideration at all. The risk bounds we develop in this paper match the special characteristics of LDL (i.e., the probability simplex constraint and the loss function characteristics) and are thus broadly applicable.

For the generalization of LDL to perform classification, one related topic is the plug-in decision theorem (Devroye, Györfi, and Lugosi 1996) (formally described in Theorem 8). It states that the error probability difference between a decision function and the optimal one is bounded by the expectation of the absolute difference between the plug-in function and the conditional probability distribution function.
The plug-in decision theorem has been widely used to prove the consistency of many decision rules, such as Stone's Theorem (Stone 1977), the strong consistency of the k-NN rule (Biau and Devroye 2015), the consistency of the kernel rule (Devroye and Krzyżak 1989), etc. Note that the classic plug-in decision theorem is only established in the binary setting, which limits its applications. In this paper, we generalize the plug-in decision theorem and develop data-dependent error probability bounds for the corresponding LDL algorithms.

## Preliminary

Denote the input space by $\mathcal{X} \subseteq \mathbb{R}^d$ and the label space by $\mathcal{Y} = \{y_1, y_2, \dots, y_m\}$. Define the label distribution function $\eta : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, which satisfies $\eta(x, y) \ge 0$ and $\sum_{y} \eta(x, y) = 1$. Let the training set be $S = \{(x_1, \eta(x_1)), \dots, (x_n, \eta(x_n))\}$, where $x_i$ is sampled according to an underlying probability distribution $\mathcal{D}$, and $\eta(x_i) = (\eta_{x_i}^{y_1}, \dots, \eta_{x_i}^{y_m})$ is determined by the unknown label distribution function $\eta$, with $\eta_{x_i}^{y_j} = \eta(x_i, y_j)$ for convenience of notation. The goal of LDL is to learn the unknown function $\eta$.

Given a function class $\mathcal{H}$ and a loss function $\ell : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}_+$, for a function $h \in \mathcal{H}$, its risk and empirical risk are defined as $L_{\mathcal{D}}(h) = \mathbb{E}_{x \sim \mathcal{D}}[\ell(h(x), \eta(x))]$ and $L_S(h) = \frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i), \eta(x_i))$, respectively. Recall the definition of the Rademacher complexity w.r.t. $S$ and $\ell$,
$$\hat{\mathfrak{R}}_n(\ell \circ \mathcal{H} \circ S) = \mathbb{E}_{\epsilon_1,\dots,\epsilon_n}\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \epsilon_i\, \ell(h(x_i), \eta(x_i))\right],$$
where $\epsilon_1, \dots, \epsilon_n$ are $n$ independent Rademacher random variables with $P(\epsilon_i = 1) = P(\epsilon_i = -1) = 1/2$. Then

**Lemma 1.** (Bartlett and Mendelson 2003; Mohri, Rostamizadeh, and Talwalkar 2012) Let $\mathcal{H}$ be a family of functions and let the loss function $\ell$ be bounded by $\mu$. Then for any $\delta > 0$, with probability at least $1-\delta$, for all $h \in \mathcal{H}$,
$$L_{\mathcal{D}}(h) \le L_S(h) + 2\hat{\mathfrak{R}}_n(\ell \circ \mathcal{H} \circ S) + 3\mu\sqrt{\frac{\ln(2/\delta)}{2n}}.$$

## Risk Bounds for LDL Algorithms

We now provide risk bounds for three representative algorithms from two classes, i.e., AA-kNN and AA-BP from algorithm adaptation, and SA-ME (maximum entropy) from specialized algorithms. Notice that (Geng 2016) proposes two specialized algorithms, i.e., SA-IIS and SA-BFGS, which differ only in the underlying optimization methods. In this paper we focus on the generalization ability of algorithms, and SA-ME represents SA-IIS and SA-BFGS collectively from the perspective of the underlying model, i.e., the maximum entropy model. Algorithms from problem transformation are not covered, because this type of algorithm circumvents learning the label distribution directly, resorting to resampling and single-label learning algorithms instead.

### AA-kNN

AA-kNN extends k-NN to deal with label distributions. Given a new instance $x$, its $k$ nearest neighbors are first selected in the training set. Then the mean of the label distributions of the $k$ nearest neighbors is calculated as the label distribution of $x$, i.e.,
$$\hat{\eta}_x^{y_i} = \frac{1}{k}\sum_{j \in N_k(x)} \eta_{x_j}^{y_i}, \quad i = 1, 2, \dots, m,$$
where $\hat{\eta} : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is the output function of AA-kNN with $\hat{\eta}_x^{y_i} = \hat{\eta}(x, y_i)$ for simplicity of notation, and $N_k(x)$ is the index set of the $k$ nearest neighbors of $x$. For convenience of analysis, we assume $\mathcal{X} = [0, 1]^d$.

**Theorem 2.** Assume $\eta(\cdot, y_i)$ is $c_i$-Lipschitz w.r.t. $\mathcal{X}$, and let $c = \sum_{i=1}^{m} c_i$. Then for any $\delta > 0$, with probability at least $1-\delta$ over the choice of $S$,
$$\mathbb{E}_x\left[\sum_{i=1}^{m} |\hat{\eta}_x^{y_i} - \eta_x^{y_i}|\right] \le \frac{4c\sqrt{d}}{\delta}\left(\frac{2k}{n}\right)^{1/(d+1)}. \tag{2}$$

*Proof.* With the Lipschitz assumption on $\eta(\cdot, y_i)$, we have
$$\sum_{i=1}^{m} |\hat{\eta}_x^{y_i} - \eta_x^{y_i}| \le \sum_{i=1}^{m} \frac{1}{k}\sum_{j \in N_k(x)} |\eta_x^{y_i} - \eta_{x_j}^{y_i}| \le \sum_{i=1}^{m} \frac{c_i}{k}\sum_{j \in N_k(x)} \|x_j - x\|_2 = \frac{c}{k}\sum_{j \in N_k(x)} \|x_j - x\|_2.$$
Thus to prove Theorem 2, it suffices to prove
$$\mathbb{E}_{x,S}\left[\frac{1}{k}\sum_{j \in N_k(x)} \|x_j - x\|_2\right] \le 4\sqrt{d}\left(\frac{2k}{n}\right)^{1/(d+1)}, \tag{3}$$
which is tricky and left to the appendices. Then apply Markov's inequality to Eq. (3), and Theorem 2 follows directly.
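To make the AA-kNN rule above concrete, the following minimal NumPy sketch predicts a label distribution by averaging the label distributions of the $k$ nearest training instances. It is an illustration of the adapted k-NN rule described above, not the reference implementation; the Euclidean metric, the brute-force neighbor search, and the variable names are assumptions.

```python
import numpy as np

def aa_knn_predict(X_train, D_train, x, k=5):
    """AA-kNN sketch: average the label distributions of the k nearest neighbors.

    X_train: (n, d) training instances
    D_train: (n, m) training label distributions (each row sums to 1)
    x:       (d,)   query instance
    """
    # Euclidean distances from x to every training instance (brute force).
    dists = np.linalg.norm(X_train - x, axis=1)
    # Index set N_k(x): the k nearest neighbors of x.
    nn = np.argsort(dists)[:k]
    # Predicted distribution: mean of the neighbors' label distributions.
    return D_train[nn].mean(axis=0)
```

Because every row of `D_train` lies on the probability simplex, their average does too, so the AA-kNN output automatically satisfies the distribution constraints.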
### AA-BP

AA-BP adapts a back-propagation neural network to perform label distribution learning. The three-layer neural network has $m$ output units, each of which outputs the description degree of one label. To make sure the output of the neural network satisfies the probability simplex constraint, a softmax activation function is applied to the output units. Similar to multi-output regression, AA-BP minimizes the sum-squared loss between the output of the neural network and the real label distribution. Observing that AA-BP can be regarded as a composition of the softmax function and the function family of a three-layer neural network, its Rademacher complexity w.r.t. $S$ for a loss function $\ell$ is bounded as follows.

**Theorem 3.** Denote the softmax function by $\mathrm{SF}$. Let $\mathcal{H}$ be a family of functions for three-layer neural networks with $m$ outputs (identity activation on the output layer), and let $\mathcal{H}_j$ be the family of functions for the $j$-th output. For a loss function $\ell$ with Lipschitz constant $L_\ell$, we have
$$\hat{\mathfrak{R}}_n(\ell \circ \mathrm{SF} \circ \mathcal{H} \circ S) \le 2\sqrt{2}\, m L_\ell \sum_{j=1}^{m} \hat{\mathfrak{R}}_n(\mathcal{H}_j \circ S). \tag{4}$$

*Proof.* Define the function $\phi(\cdot, \cdot)$ as $\ell(\mathrm{SF}(\cdot), \cdot)$. Next, we show that $\phi$ is Lipschitz. For $p, q \in \mathbb{R}^m$,
$$|\phi(p, \cdot) - \phi(q, \cdot)| \le L_\ell \|\mathrm{SF}(p) - \mathrm{SF}(q)\|_1,$$
where the inequality is according to the Lipschitzness of $\ell$. The right-hand side of the preceding inequality equals
$$L_\ell \sum_{i=1}^{m}\left|\frac{1}{1+\sum_{j \ne i} e^{p_j - p_i}} - \frac{1}{1+\sum_{j \ne i} e^{q_j - q_i}}\right|,$$
where $p_i$ and $q_i$ are the $i$-th elements of $p$ and $q$, respectively. Observing that the function $v \mapsto \frac{1}{1+\sum_i \exp(v_i)}$ is 1-Lipschitz for $v \in \mathbb{R}^m$, the preceding expression is bounded by
$$L_\ell \sum_{i=1}^{m} \|p - \mathbf{1}p_i - q + \mathbf{1}q_i\|_2 \le L_\ell \sum_{i=1}^{m} \left(\|p - q\|_2 + \sqrt{m}\,|p_i - q_i|\right) = L_\ell\left(m\|p - q\|_2 + \sqrt{m}\sum_{i=1}^{m} |p_i - q_i|\right) \le 2m L_\ell \|p - q\|_2,$$
where the last inequality is according to the Cauchy-Schwarz inequality. Recall the definition of the Rademacher complexity,
$$\hat{\mathfrak{R}}_n(\ell \circ \mathrm{SF} \circ \mathcal{H} \circ S) = \mathbb{E}\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \epsilon_i\, \phi(h(x_i), \eta(x_i))\right],$$
and according to (Maurer 2016), with $\phi$ being $2mL_\ell$-Lipschitz, the right-hand side of the above equation is bounded by
$$2\sqrt{2}\, m L_\ell\, \mathbb{E}\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m} \epsilon_{i,j}\, h_j(x_i)\right], \tag{5}$$
where $h_j(x_i)$ is the $j$-th component of $h(x_i)$, and the $\epsilon_{i,j}$ are $n \times m$ i.i.d. Rademacher variables. Notice that the function class of a multi-output neural network can be regarded as a direct sum of function classes of scalar-output neural networks. Suppose $\mathcal{H}_1, \dots, \mathcal{H}_m$ are classes of functions for three-layer neural networks with scalar output; then $\mathcal{H} = \oplus_{j \in [m]}\mathcal{H}_j = \{x \mapsto [h_1(x), \dots, h_m(x)]^{\mathsf T} : h_j \in \mathcal{H}_j\}$, and Eq. (5) equals
$$2\sqrt{2}\, m L_\ell\, \mathbb{E}\left[\sum_{j=1}^{m}\sup_{h_j \in \mathcal{H}_j} \frac{1}{n}\sum_{i=1}^{n} \epsilon_{i,j}\, h_j(x_i)\right] = 2\sqrt{2}\, m L_\ell \sum_{j=1}^{m} \hat{\mathfrak{R}}_n(\mathcal{H}_j \circ S),$$
which finishes the proof of Theorem 3.

The Rademacher complexity of the class of functions for three-layer neural networks with scalar output satisfies

**Lemma 4.** (Bartlett and Mendelson 2003; Gao and Zhou 2016) Let $\sigma$ be Lipschitz with constant $L_\sigma$. Define the class of functions $\mathcal{H}_j = \{x \mapsto \sum_i w_{j,i}\,\sigma(v_i \cdot x) : \|w_j\|_2 \le B_1, \|v_i\|_2 \le B_0\}$. Its Rademacher complexity satisfies
$$\hat{\mathfrak{R}}_n(\mathcal{H}_j \circ S) \le \frac{L_\sigma B_0 B_1}{\sqrt{n}} \max_{i \in [n]} \|x_i\|_2.$$

Accordingly, the right-hand side of Eq. (4) is bounded as
$$\hat{\mathfrak{R}}_n(\ell \circ \mathrm{SF} \circ \mathcal{H} \circ S) \le \frac{2\sqrt{2}\, m^2 L_\ell L_\sigma B_0 B_1}{\sqrt{n}} \max_{i \in [n]} \|x_i\|_2.$$
For AA-BP, the sum-squared loss is 2-Lipschitz and bounded by 2. Finally, the data-dependent risk bound for AA-BP is

**Theorem 5.** Let $\mathcal{F}$ be the family of functions for AA-BP defined above, with sum-squared loss and weight constraints as in Lemma 4. For any $\delta > 0$, with probability at least $1-\delta$, for all $f \in \mathcal{F}$,
$$\mathbb{E}_{x\sim\mathcal{D}}\left[\sum_{i=1}^{m}\left(f_x^{y_i} - \eta_x^{y_i}\right)^2\right] \le \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}\left(\eta_{x_i}^{y_j} - f_{x_i}^{y_j}\right)^2 + \frac{8\sqrt{2}\, m^2 L_\sigma B_0 B_1}{\sqrt{n}}\max_{i \in [n]}\|x_i\|_2 + 6\sqrt{\frac{\ln(2/\delta)}{2n}}.$$

### SA-ME

SA-ME applies the maximum entropy model to learn the label distribution, i.e.,
$$f_x^{y_j} = \frac{1}{Z}\exp(w_j \cdot x), \tag{7}$$
where $Z = \sum_{j}\exp(w_j \cdot x)$ is the normalization factor. Actually, Eq. (7) can be regarded as a composition of the softmax function and multi-output linear regression, namely $\mathrm{SF} \circ \mathcal{H}$, where $\mathcal{H}$ represents a class of functions of multi-output linear regression.
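As a rough illustration of Eq. (7), the sketch below evaluates the maximum entropy model as a softmax over linear scores $w_j \cdot x$; AA-BP uses the same softmax output layer, with a three-layer network producing the scores instead of the linear map. The function and variable names are illustrative assumptions, and the max-subtraction is only a standard numerical-stability trick that leaves Eq. (7) unchanged.

```python
import numpy as np

def sa_me_predict(W, x):
    """Sketch of the maximum entropy model of Eq. (7): f_x^{y_j} = exp(w_j . x) / Z.

    W: (m, d) weight matrix, one row w_j per label
    x: (d,)   instance
    """
    scores = W @ x                   # w_j . x for every label y_j
    scores = scores - scores.max()   # stabilize exp(); Eq. (7) is unchanged
    expd = np.exp(scores)
    return expd / expd.sum()         # divide by Z = sum_j exp(w_j . x)
```

Training SA-ME then amounts to minimizing the empirical KL divergence between the given label distributions and this output, e.g., by BFGS (SA-BFGS) or improved iterative scaling (SA-IIS).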
SA-ME uses the KL divergence as its loss function, denoted by $\mathrm{KL} : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}_+$. The Rademacher complexity of SA-ME w.r.t. $S$ for the loss function $\mathrm{KL}$ satisfies

**Theorem 6.** Let $\mathcal{H}$ be a family of functions for multi-output linear regression, and let $\mathcal{H}_j$ be the family of functions for the $j$-th output. The Rademacher complexity of SA-ME with the KL loss satisfies
$$\hat{\mathfrak{R}}_n(\mathrm{KL} \circ \mathrm{SF} \circ \mathcal{H} \circ S) \le \sqrt{2}\,(\sqrt{m}+1)\sum_{j=1}^{m} \hat{\mathfrak{R}}_n(\mathcal{H}_j \circ S). \tag{8}$$

*Proof.* Note that $\mathrm{KL}(u, \cdot)$ is not $\rho$-Lipschitz over $\mathbb{R}^m$ for any $\rho \in \mathbb{R}$ and $u \in \mathbb{R}^m$, thus Theorem 3 cannot be applied directly. Define the function $\phi(\cdot, \cdot)$ as $\mathrm{KL}(\cdot, \mathrm{SF}(\cdot))$. Next we show that $\phi(u, \cdot)$ satisfies Lipschitzness. For $p, q \in \mathbb{R}^m$,
$$|\phi(u, p) - \phi(u, q)| = |\mathrm{KL}(u, \mathrm{SF}(p)) - \mathrm{KL}(u, \mathrm{SF}(q))|,$$
which equals
$$\left|\sum_{i=1}^{m} u_i\left(\ln\frac{\exp(p_i)}{\sum_{j=1}^{m}\exp(p_j)} - \ln\frac{\exp(q_i)}{\sum_{j=1}^{m}\exp(q_j)}\right)\right| = \left|\sum_{i=1}^{m} u_i\left(\ln\Big(1+\sum_{j \ne i} e^{q_j - q_i}\Big) - \ln\Big(1+\sum_{j \ne i} e^{p_j - p_i}\Big)\right)\right|.$$
Observing that $v \mapsto \ln(1+\sum_i \exp(v_i))$ is 1-Lipschitz for $v \in \mathbb{R}^m$, the right-hand side of the preceding equation is bounded by
$$\sum_{i=1}^{m} u_i\,\|p - \mathbf{1}p_i - q + \mathbf{1}q_i\|_2 \le \|p - q\|_2 + \sqrt{m}\sum_{i=1}^{m} u_i|p_i - q_i| \le (\sqrt{m} + 1)\|p - q\|_2,$$
namely, $\phi$ is $(\sqrt{m}+1)$-Lipschitz. Similar to the discussion of bounding the Rademacher complexity for AA-BP, we have
$$\hat{\mathfrak{R}}_n(\mathrm{KL} \circ \mathrm{SF} \circ \mathcal{H} \circ S) \le \sqrt{2}\,(\sqrt{m}+1)\sum_{j=1}^{m} \hat{\mathfrak{R}}_n(\mathcal{H}_j \circ S),$$
which concludes the proof.

As discussed above, although KL alone does not satisfy Lipschitzness, $\mathrm{KL} \circ \mathrm{SF}$ is $(\sqrt{m}+1)$-Lipschitz. Define the class of functions of the $j$-th output with weight constraints as $\mathcal{H}_j = \{x \mapsto w_j \cdot x : \|w_j\|_2 \le 1\}$. According to (Kakade, Sridharan, and Tewari 2009), the Rademacher complexity of $\mathcal{H}_j$ satisfies $\hat{\mathfrak{R}}_n(\mathcal{H}_j \circ S) \le \frac{\max_{i \in [n]}\|x_i\|_2}{\sqrt{n}}$. Then the right-hand side of Eq. (8) is bounded as
$$\hat{\mathfrak{R}}_n(\mathrm{KL} \circ \mathrm{SF} \circ \mathcal{H} \circ S) \le \frac{\sqrt{2}\,(\sqrt{m}+1)\, m}{\sqrt{n}}\max_{i \in [n]}\|x_i\|_2.$$
Accordingly, the data-dependent risk bound for SA-ME is

**Theorem 7.** Let $\mathcal{F}$ be the family of functions for SA-ME defined above with the KL divergence as loss function, bounded by a constant $b$. For any $\delta > 0$, with probability at least $1-\delta$, for all $f \in \mathcal{F}$,
$$\mathbb{E}_{x\sim\mathcal{D}}\left[\sum_{j=1}^{m}\eta_x^{y_j}\ln\frac{\eta_x^{y_j}}{f_x^{y_j}}\right] \le \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}\eta_{x_i}^{y_j}\ln\frac{\eta_{x_i}^{y_j}}{f_{x_i}^{y_j}} + \frac{2\sqrt{2}\,(\sqrt{m}+1)\, m}{\sqrt{n}}\max_{i \in [n]}\|x_i\|_2 + 3b\sqrt{\frac{\ln(2/\delta)}{2n}}.$$

As (Cha 2007) suggests, when implementing the KL divergence a zero entry is replaced by a very small value, say $\gamma > 0$, to avoid division by zero. Then for probability distributions $p, q \in \mathbb{R}^m$ with $p_i \ge \gamma$ and $q_i \ge \gamma$,
$$\sum_{i=1}^{m} p_i \ln\frac{p_i}{q_i} \le \sum_{i=1}^{m} p_i \ln\frac{1}{\gamma} = \ln\frac{1}{\gamma},$$
thus there exists a constant $b \ge \ln\frac{1}{\gamma}$ such that $\mathrm{KL}(\cdot, \cdot) \le b$ (e.g., $b = 35$ for $\gamma = 1\times 10^{-15}$).

## Relation between LDL and Classification

LDL aims at learning the unknown label distribution function $\eta$ by minimizing the distance (or maximizing the similarity) between the given distribution and the output distribution. However, in practice, LDL is often applied to perform classification. Firstly, a label distribution function $\hat{\eta}$ is learned from the training sample $S$ with label distributions. Secondly, a given instance $x$ is classified by $\hat{\eta}$, with the label corresponding to the maximum predicted description degree taken as the predicted label, i.e., $y^* = \arg\max_{y \in \mathcal{Y}} \hat{\eta}(x, y)$. In this part we examine the feasibility of using LDL algorithms to perform classification. To make the analysis possible, we stay in the probabilistic setting, where the underlying label distribution function $\eta$ is the conditional probability distribution function, i.e., $\eta(x, y) = P(y \mid x)$. Unlike LDL, where the unknown label distribution function $\eta$ is deterministic though unknown, the label variable in classification is stochastic and sampled according to the conditional probability distribution. Thus we start with the plug-in decision theorem to bound the error probability by the risk of LDL (with absolute loss), and then obtain the upper bound on the error probability using the union bound.
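The classification-via-LDL rule just described is a one-liner on top of any learned label distribution function; the sketch below makes the plug-in decision explicit. The names are illustrative, and `predict_distribution` stands for any predictor of $\hat{\eta}(x,\cdot)$, e.g., the AA-kNN or SA-ME sketches above.

```python
import numpy as np

def plug_in_classify(predict_distribution, x, labels):
    """Plug-in decision: return the label with the maximum predicted description degree."""
    dist = np.asarray(predict_distribution(x))   # eta_hat(x, .) as an (m,) vector
    return labels[int(np.argmax(dist))]          # y* = argmax_y eta_hat(x, y)
```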
### Generalized Plug-in Decision Theorem

As a preliminary, we first introduce the well-known plug-in decision theorem. The classic plug-in decision theorem applies to binary decisions. For a given $x$ with label $y$, the Bayes decision function $h^*$ is
$$h^*(x) = \begin{cases} y_0 & \text{if } \eta_1(x) \le 1/2, \\ y_1 & \text{otherwise}, \end{cases}$$
where $\eta_1(x) = P(y = y_1 \mid x)$. Consider another decision $h$ with plug-in decision function
$$h(x) = \begin{cases} y_0 & \text{if } \hat{\eta}_1(x) \le 1/2, \\ y_1 & \text{otherwise}, \end{cases}$$
where $\hat{\eta}_1(x)$ (with $0 \le \hat{\eta}_1(x) \le 1$) is an approximation of $\eta_1(x)$. Then we have

**Theorem 8.** (Devroye, Györfi, and Lugosi 1996) For the plug-in decision function $h$ defined above, the difference in error probability between $h$ and $h^*$ satisfies
$$P(h(x) \ne y) - P(h^*(x) \ne y) \le 2\,\mathbb{E}_x\left[\,|\hat{\eta}_1(x) - \eta_1(x)|\,\right].$$

The theorem states that if $\hat{\eta}_1(x)$ is close to $\eta_1(x)$ in absolute value, then the 0/1 risk of the decision $h$ is near that of the optimal decision function $h^*$. The preceding plug-in decision theorem only applies to binary classification, and we generalize it to multi-class classification. Formally, given an instance $x$ with conditional probability distribution function $\eta$ (multi-class), the corresponding Bayes decision is
$$h^*(x) = \arg\max_{y \in \mathcal{Y}} \eta(x, y).$$
Similarly, assume we have access to $\hat{\eta}$ that approximates $\eta$; the plug-in decision is defined as
$$\hat{h}(x) = \arg\max_{y \in \mathcal{Y}} \hat{\eta}(x, y).$$

**Theorem 9.** For the plug-in decision function $\hat{h}$ defined above, we have
$$P(\hat{h}(x) \ne y) - P(h^*(x) \ne y) \le \mathbb{E}_x\left[\sum_{i=1}^{m} |\hat{\eta}_i(x) - \eta_i(x)|\right],$$
where $\eta_i(x)$ and $\hat{\eta}_i(x)$ represent $\eta(x, y_i)$ and $\hat{\eta}(x, y_i)$, respectively.

*Proof.* Given an instance $x$, the conditional error probability of $\hat{h}$ is
$$P(\hat{h}(x) \ne y \mid x) = 1 - P(\hat{h}(x) = y \mid x) = 1 - \sum_{i=1}^{m} P(\hat{h}(x) = y_i, y = y_i \mid x) = 1 - \sum_{i=1}^{m} \mathrm{Kr}(\hat{h}(x), y_i)\,\eta_i(x),$$
where $\mathrm{Kr}(\cdot, \cdot)$ is the Kronecker delta function. Then the difference in conditional error probability between $\hat{h}$ and $h^*$ is
$$P(\hat{h}(x) \ne y \mid x) - P(h^*(x) \ne y \mid x) = \sum_{i=1}^{m}\left(\mathrm{Kr}(h^*(x), y_i) - \mathrm{Kr}(\hat{h}(x), y_i)\right)\eta_i(x).$$
Without loss of generality, let $y_k$ be the prediction of $\hat{h}$ and $y_j$ be the prediction of $h^*$; then
$$P(\hat{h}(x) \ne y \mid x) - P(h^*(x) \ne y \mid x) = \eta_j(x) - \eta_k(x). \tag{10}$$

**Lemma 10.** Let $a \ge b$ and $d \ge c$. Then $|a - b| \le |a - c| + |b - d|$.

*Proof.* If $c \le b$ or $d \ge a$, the inequality is obvious. If $c \ge b$ and $a \ge d$, then $|a - b| = |a - c| + |c - b| \le |a - c| + |b - d|$.

For $k \ne j$, we have $\hat{\eta}_j(x) \le \hat{\eta}_k(x)$ and $\eta_k(x) \le \eta_j(x)$, and according to Lemma 10 (with $a = \eta_j(x)$, $b = \eta_k(x)$, $c = \hat{\eta}_j(x)$, $d = \hat{\eta}_k(x)$), $|\eta_j(x) - \eta_k(x)| \le |\hat{\eta}_j(x) - \eta_j(x)| + |\hat{\eta}_k(x) - \eta_k(x)|$. For $k = j$, the left-hand side of Eq. (10) reduces to 0. Thus we have
$$P(\hat{h}(x) \ne y \mid x) - P(h^*(x) \ne y \mid x) \le \sum_{i=1}^{m} |\hat{\eta}_i(x) - \eta_i(x)|. \tag{11}$$
Taking expectations of both sides of Eq. (11) finishes the proof of Theorem 9.

### Measures for LDL

As already discussed for Theorem 9, in the probabilistic setting LDL with absolute loss is sufficient for classification, because approximating the conditional probability distribution function in absolute loss guarantees approximating the optimal classifier. (Geng 2016) suggests six measures for LDL, i.e., the Chebyshev distance, Clark distance, Canberra metric, KL divergence, cosine coefficient, and intersection similarity, which belong to the Minkowski family, the $\chi^2$ family, the $L_1$ family, the Shannon entropy family, the inner product family, and the intersection family, respectively (Cha 2007). For probability distributions $p, q \in \mathbb{R}^m$, the formulations of the six measures are summarized in Table 1, where ↓ after a distance measure indicates "the smaller the better", and ↑ after a similarity measure indicates "the larger the better".

Table 1: Measures for LDL

| Measure | Formula |
|---|---|
| Chebyshev ↓ | $\mathrm{Dis}_1(p, q) = \max_i \lvert p_i - q_i \rvert$ |
| Clark ↓ | $\mathrm{Dis}_2(p, q) = \sqrt{\sum_i (p_i - q_i)^2 / (p_i + q_i)^2}$ |
| Canberra ↓ | $\mathrm{Dis}_3(p, q) = \sum_i \lvert p_i - q_i \rvert / (p_i + q_i)$ |
| Kullback-Leibler ↓ | $\mathrm{Dis}_4(p, q) = \sum_i p_i \ln(p_i / q_i)$ |
| Cosine ↑ | $\mathrm{Sim}_1(p, q) = p \cdot q / (\lVert p \rVert_2 \lVert q \rVert_2)$ |
| Intersection ↑ | $\mathrm{Sim}_2(p, q) = \sum_i \min(p_i, q_i)$ |
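For reference, a direct NumPy transcription of Table 1 might look as follows. This is a sketch under the assumption, following (Cha 2007), that zero entries are clipped to a small $\gamma$ before the ratio-based measures are computed, so that no division by zero occurs; the function and key names are illustrative.

```python
import numpy as np

def ldl_measures(p, q, gamma=1e-15):
    """The six LDL measures of Table 1 for two label distributions p and q."""
    p = np.clip(np.asarray(p, dtype=float), gamma, None)
    q = np.clip(np.asarray(q, dtype=float), gamma, None)
    return {
        "chebyshev":    np.max(np.abs(p - q)),                             # smaller is better
        "clark":        np.sqrt(np.sum((p - q) ** 2 / (p + q) ** 2)),      # smaller is better
        "canberra":     np.sum(np.abs(p - q) / (p + q)),                   # smaller is better
        "kl":           np.sum(p * np.log(p / q)),                         # smaller is better
        "cosine":       p @ q / (np.linalg.norm(p) * np.linalg.norm(q)),   # larger is better
        "intersection": np.sum(np.minimum(p, q)),                          # larger is better
    }
```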
**Theorem 11.** All measures in Table 1 are sufficient for LDL to perform classification.

*Proof.* Denote the absolute loss by $\mathrm{Dis}(p, q) = \sum_i |p_i - q_i|$. For the Chebyshev, Clark and Canberra distances, it is straightforward to validate that $\mathrm{Dis}_1 \ge \frac{1}{m}\mathrm{Dis}$, $\mathrm{Dis}_3 \ge \frac{1}{2}\mathrm{Dis}$ and $\mathrm{Dis}_2 \ge \frac{1}{2\sqrt{m}}\mathrm{Dis}$. For the KL distance, we have $\mathrm{Dis}_4 \ge \frac{1}{4}\mathrm{Dis}^2$. For the cosine similarity, we have $1 - \frac{1}{m}\mathrm{Sim}_1 \ge \frac{1}{2m}\mathrm{Dis}^2$, and for the intersection similarity, it holds that $1 - \mathrm{Sim}_2 = \frac{1}{2}\mathrm{Dis}$. In conclusion, all measures in Table 1 bound the absolute loss in some way. Details of the proof are left to the appendix.

### Data-dependent Error Probability Bounds for LDL

As discussed above, the error probability is directly related to the absolute loss, and Theorem 11 states that the measures in Table 1 bound the absolute loss in some way. Accordingly, the risk bounds for AA-kNN (absolute loss), AA-BP (sum-squared loss), and SA-ME (KL loss) can be extended to error probability bounds. Formally, let $f$ be a learned label distribution function, and define the corresponding decision function as $g(f(x)) = \arg\max_{y \in \mathcal{Y}} f(x, y)$. Notice that the optimal error probability $L^*$ appears on the left-hand side of Theorem 9, and it can be bounded with Hoeffding's inequality. The optimal classifier $h^*$ outputs the label corresponding to the maximum conditional probability, so the empirical error probability of $h^*$ is $L_S(h^*) = \frac{1}{n}\sum_{i=1}^{n}\big(1 - \max_{y \in \mathcal{Y}}\eta_{x_i}^{y}\big)$. By Hoeffding's inequality, for any $\epsilon > 0$,
$$P\{|L_S(h^*) - L^*| \ge \epsilon\} \le 2\exp(-2n\epsilon^2),$$
namely, for any $\delta > 0$, with probability at least $1-\delta$,
$$|L_S(h^*) - L^*| \le \sqrt{\frac{\ln(2/\delta)}{2n}}. \tag{12}$$
Finally, combining Theorem 9 and Eq. (12) with the results of Theorem 2, Theorem 5 and Theorem 7 respectively, we conclude with the following theorems.

**Theorem 12.** Let $\eta(\cdot, y_i)$ be $c_i$-Lipschitz with $c = \sum_{i=1}^m c_i$, and let $\hat{\eta}$ be the output function of AA-kNN. Then for any $\delta > 0$, with probability at least $1-\delta$,
$$P(g(\hat{\eta}(x)) \ne y) \le \frac{8c\sqrt{d}}{\delta}\left(\frac{2k}{n}\right)^{1/(d+1)} + \sqrt{\frac{\ln(4/\delta)}{2n}} + 1 - \frac{1}{n}\sum_{i=1}^{n}\max_{y \in \mathcal{Y}}\eta_{x_i}^{y}.$$

Different from AA-kNN, AA-BP minimizes the sum-squared loss, and note that for a random variable $z \in \mathbb{R}^m$,
$$\mathbb{E}_z\|z\|_1 \le \sqrt{m}\,\mathbb{E}_z\|z\|_2 \le \sqrt{m}\sqrt{\mathbb{E}_z\|z\|_2^2},$$
where the first inequality is according to the Cauchy-Schwarz inequality and the second is according to Jensen's inequality. Thus the error probability for AA-BP satisfies

**Theorem 13.** Let $\mathcal{F}$ be the family of functions for AA-BP defined above. For any $\delta > 0$, with probability at least $1-\delta$, for all $f \in \mathcal{F}$,
$$P(g(f(x)) \ne y) \le \sqrt{m}\sqrt{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}\left(\eta_{x_i}^{y_j} - f_{x_i}^{y_j}\right)^2 + \frac{8\sqrt{2}\, m^2 L_\sigma B_0 B_1}{\sqrt{n}}\max_{i \in [n]}\|x_i\|_2 + 7\sqrt{\frac{\ln(4/\delta)}{2n}}} + 1 - \frac{1}{n}\sum_{i=1}^{n}\max_{y \in \mathcal{Y}}\eta_{x_i}^{y},$$
where $\sqrt{\cdot}$ is the root operator.

Observing that for probability distributions $p, q \in \mathbb{R}^m$, $\|p - q\|_1 \le 2\sqrt{\mathrm{KL}(p, q)}$, we obtain

**Theorem 14.** Let $\mathcal{F}$ be the family of functions for SA-ME defined above, and let $\bar{b} = 3(b + 1)$. For any $\delta > 0$, with probability at least $1-\delta$, for all $f \in \mathcal{F}$,
$$P(g(f(x)) \ne y) \le 2\sqrt{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}\eta_{x_i}^{y_j}\ln\frac{\eta_{x_i}^{y_j}}{f_{x_i}^{y_j}} + \frac{2\sqrt{2}\,(\sqrt{m}+1)\, m}{\sqrt{n}}\max_{i \in [n]}\|x_i\|_2 + \bar{b}\sqrt{\frac{\ln(4/\delta)}{2n}}} + 1 - \frac{1}{n}\sum_{i=1}^{n}\max_{y \in \mathcal{Y}}\eta_{x_i}^{y}.$$
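The term $1 - \frac{1}{n}\sum_{i=1}^{n}\max_{y \in \mathcal{Y}}\eta_{x_i}^{y}$ shared by Theorems 12-14 is simply the empirical error probability $L_S(h^*)$ of the Bayes classifier computed from the training label distributions. A one-line sketch, where the array name `D_train` is an assumption:

```python
import numpy as np

def empirical_bayes_error(D_train):
    """L_S(h*) = 1 - (1/n) * sum_i max_y eta_{x_i}^y for an (n, m) array of label distributions."""
    return 1.0 - D_train.max(axis=1).mean()
```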
## Conclusion

This paper studies the learnability of LDL from two aspects, i.e., the generalization of LDL itself, and the generalization of LDL when used to perform classification. On the one hand, for the generalization of LDL, risk bounds for three representative LDL algorithms, i.e., AA-kNN, AA-BP and SA-ME, are provided, with convergence rates $O((k/n)^{1/(d+1)})$, $O(1/\sqrt{n})$ and $O(1/\sqrt{n})$ respectively, which indicates the learnability of LDL. On the other hand, for the generalization of LDL to perform classification, a generalized plug-in decision theorem is proposed, showing that minimizing the absolute loss is sufficient for the corresponding LDL decision function to approach the optimal classifier. Furthermore, six commonly used LDL measures are also shown to be sufficient for classification, since all six measures bound the absolute loss in some way. Besides, data-dependent error probability bounds for LDL are given to demonstrate the feasibility of using LDL to perform classification.

## Appendices

### Details of the Proof of Theorem 11

For discrete probability distributions $p, q \in \mathbb{R}^m$, the missing proofs of the relations between the measures in Table 1 and the absolute loss are as follows.

**Intersection similarity.** For the intersection similarity and the absolute distance, it holds that $1 - \mathrm{Sim}_2 = \mathrm{Dis}/2$.

*Proof.* Firstly, for $a, b \in \mathbb{R}$ we have $\min(a, b) = (a+b)/2 - |a-b|/2$, and it follows that
$$\mathrm{Sim}_2(p, q) = \sum_{i=1}^{m}\left[\frac{p_i + q_i}{2} - \frac{|p_i - q_i|}{2}\right] = 1 - \frac{\mathrm{Dis}}{2}.$$

**Cosine similarity.** For the cosine similarity and the absolute distance, it holds that $1 - \frac{1}{m}\mathrm{Sim}_1 \ge \frac{1}{2m}\mathrm{Dis}^2$.

*Proof.* According to the law of cosines, we have
$$\|p\|_2^2 + \|q\|_2^2 - 2\|p\|_2\|q\|_2\,\mathrm{Sim}_1(p, q) = \|p - q\|_2^2.$$
Notice that $\|p\|_1 = 1$ and $\|q\|_1 = 1$, and that $\mathrm{Sim}_1 \ge 0$ for nonnegative $p, q$. Applying the Cauchy-Schwarz inequality to both sides of the above equation ($\|p\|_2^2 \le 1$, $\|q\|_2^2 \le 1$, $\|p\|_2\|q\|_2 \ge 1/m$ and $\|p - q\|_2^2 \ge \mathrm{Dis}^2/m$), it follows that
$$2 - \frac{2}{m}\mathrm{Sim}_1(p, q) \ge \frac{\mathrm{Dis}^2}{m}.$$

**Kullback-Leibler distance.** For the Kullback-Leibler distance and the absolute distance, it holds that $\mathrm{Dis}_4 \ge \mathrm{Dis}^2/4$.

*Proof.* The proof follows (Cover and Thomas 2012), where the probability distributions are continuous; here the KL divergence is applied to discrete probability distributions. Observe that $\mathrm{Dis}_4(p, q) = -\sum_i p_i \ln\frac{q_i}{p_i}$, which equals
$$-\sum_i p_i\left\{\ln\min\left(\frac{q_i}{p_i}, 1\right) + \ln\max\left(\frac{q_i}{p_i}, 1\right)\right\} \ge -\ln\sum_i p_i\min\left(\frac{q_i}{p_i}, 1\right) - \ln\sum_i p_i\max\left(\frac{q_i}{p_i}, 1\right)$$
$$= -\ln\sum_i \min(p_i, q_i) - \ln\sum_i \max(p_i, q_i) = -\ln\sum_i\frac{p_i + q_i - |p_i - q_i|}{2} - \ln\sum_i\frac{p_i + q_i + |p_i - q_i|}{2}$$
$$= -\ln\left\{1 - \mathrm{Dis}(p, q)/2\right\} - \ln\left\{1 + \mathrm{Dis}(p, q)/2\right\} = -\ln\left(1 - \mathrm{Dis}^2/4\right),$$
where the inequality is according to Jensen's inequality. Rearranging terms, it follows that
$$\left(\tfrac{1}{2}\mathrm{Dis}\right)^2 \le 1 - \exp(-\mathrm{Dis}_4) \le \mathrm{Dis}_4,$$
where the second inequality is according to the elementary inequality $1 - e^{-x} \le x$ for any $x \ge 0$, which concludes the proof.

### Proof of Equation (3)

The proof follows (Shalev-Shwartz and Ben-David 2014), which bounds the expected distance between a random $x$ and its closest neighbor in the training set.

*Proof.* For $k = 1$, according to (Shalev-Shwartz and Ben-David 2014),
$$\mathbb{E}_{x,S}\left[\|x_{N_1(x)} - x\|_2\right] \le 4\sqrt{d}\, n^{-1/(d+1)}.$$
One can easily check that for $k = 1$ the right-hand side of Eq. (3) is $2^{1/(d+1)} \cdot 4\sqrt{d}\, n^{-1/(d+1)} > 4\sqrt{d}\, n^{-1/(d+1)}$, thus Eq. (3) holds. Furthermore, for $k \ge 2$,

**Lemma 15.** (Shalev-Shwartz and Ben-David 2014) Let $C_1, C_2, \dots, C_r$ be a collection of subsets of $\mathcal{X}$. Then for any $k \ge 2$,
$$\mathbb{E}_S\left[\sum_{i:\,|C_i \cap S| < k} P[C_i]\right] \le \frac{2rk}{n}.$$