# Theoretical Analysis of Label Distribution Learning

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Jing Wang, Xin Geng
MOE Key Laboratory of Computer Network and Information Integration
School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
{wangjing91, xgeng}@seu.edu.cn

## Abstract

As a novel learning paradigm, label distribution learning (LDL) explicitly models label ambiguity through the notion of label description degree. Although much work has been done on real-world applications, theoretical results on LDL remain unexplored. In this paper, we rethink LDL from a theoretical perspective, towards analyzing the learnability of LDL. Firstly, risk bounds for three representative LDL algorithms (AA-kNN, AA-BP and SA-ME) are provided. For AA-kNN, Lipschitzness of the label distribution function is assumed to bound the risk, and for AA-BP and SA-ME, Rademacher complexity is utilized to give data-dependent risk bounds. Secondly, a generalized plug-in decision theorem is proposed to understand the relation between LDL and classification, uncovering that approximating the conditional probability distribution function in absolute loss guarantees approaching the optimal classifier, and data-dependent error probability bounds are also presented for the corresponding LDL algorithms when used for classification. As far as we know, this is perhaps the first research on the theory of LDL.

## Introduction

Traditional learning paradigms include single-label learning (SLL) and multi-label learning (MLL) (Zhang and Zhou 2014). SLL assumes that an instance is associated with one label, while MLL assumes that multiple labels are assigned to an instance. Essentially, both SLL and MLL aim at finding the related label or labels that describe the instance, while neither supports the relative importance of the labels assigned to the instance. However, in some real-world applications, labels are related to the instance with different degrees of relative importance. Thus it seems reasonable to label an instance with a soft label rather than a hard one (e.g., a single label or a set of labels). Inspired by this, (Geng 2016) proposes a novel learning paradigm, Label Distribution Learning (LDL), which labels an instance with a distribution of description degrees over the label space, called a label distribution, and learns a mapping from instance to label distribution directly. Compared with MLL, where positive labels are commonly treated equally (with description degree implicitly equal to $1/c$ for $c$ positive labels), LDL allows explicit modeling of the different relative importance of the labels assigned to the instance, which is more suitable for many real scenarios.

Recently, LDL has been extensively applied in many real-world applications, which can be classified into three classes according to the source of the label distribution. The first class features label distributions that come from the data itself, including pre-release rating prediction on movies (Geng and Hou 2015), emotion recognition (Zhou, Xue, and Geng 2015), etc. The second class is characterized by label distributions originating from prior knowledge, with applications including age estimation (Geng, Yin, and Zhou 2013), head pose estimation (Geng and Xia 2014), etc.
The third class involves label distributions that are learned from the data automatically. Applications of this class include label-importance-aware multi-label learning (Li, Zhang, and Geng 2015), beauty sensing (Ren and Geng 2017), video parsing (Geng and Ling 2017), etc. LDL succeeds in such a variety of fields because the explicit introduction of label ambiguity via the label distribution boosts the performance of real-world applications.

In this paper, we re-examine LDL from a theoretical perspective, towards analyzing the generalization of LDL algorithms. Precisely, there are at least two questions related to the generalization of LDL. The first concerns the generalization of LDL itself, and the second concerns the generalization of LDL when used to perform classification. On the one hand, LDL can be regarded as a kind of multi-output regression (Borchani et al. 2015), and generalization results for multi-output regression algorithms (Kakade, Sridharan, and Tewari 2009) can be transferred to LDL to some extent. However, the special characteristics of LDL should not be neglected. The first thing to note is the probability simplex constraint on the label distribution. As suggested by (Geng 2016), the target label distribution is real-valued and satisfies distribution constraints, i.e., $\eta_x^y \in [0, 1]$ and $\sum_{y \in \mathcal{Y}} \eta_x^y = 1$ (formally defined in the Preliminary). This constraint is often satisfied by applying a softmax function to each output of a multi-output regression model, which increases the complexity of the corresponding model. The second thing to note is the specialization of the measures used for LDL. Although Lipschitzness of the loss function is a common assumption when bounding the risk, the Kullback-Leibler (KL) divergence used as the loss function for LDL (e.g., in SA-ME) does not satisfy Lipschitzness. On the other hand, we find that LDL is often adopted to perform classification implicitly. For example, in applications such as age estimation (Geng, Yin, and Zhou 2013), head-pose estimation (Geng and Xia 2014), and pre-release rating prediction on movies (Geng and Hou 2015), the label corresponding to the maximum predicted description degree is treated as the predicted label. As far as we know, the generalization of such a framework has never been touched.

The main contributions of this paper are summarized as follows:
1) We establish risk bounds for three representative LDL algorithms, i.e., AA-BP, SA-ME, and AA-kNN, where the bounds for the first two are data-dependent risk bounds, and the bound for AA-kNN is a consistency bound under a Lipschitz assumption.
2) We generalize the binary plug-in decision theorem to multi-class classification, discovering that LDL dominates classification.
3) We provide, to the best of our knowledge, the first data-dependent error probability bounds for three representative LDL algorithms used to perform classification.

The rest of this paper is organized as follows. Firstly, related works are briefly reviewed. Secondly, notations are introduced in the Preliminary. Then risk bounds for three representative LDL algorithms are provided. Next, the generalization of LDL to perform classification is examined. Finally, we conclude the paper.

## Related Work

LDL (Geng 2016) is a novel learning paradigm, which labels an instance with a label distribution and learns a mapping from instance to label distribution directly. Existing studies mainly focus on algorithm design and improvement. Three strategies are embraced to design algorithms for LDL (Geng 2016). The first is Problem Transformation (PT).
Algorithms of this class first transform the dataset equipped with label distributions into a single-label dataset via re-sampling, and then SLL algorithms are employed on the transformed single-label dataset. Two representative algorithms are PT-SVM and PT-Bayes, which apply SVM and the Bayes classifier, respectively. The second strategy is Algorithm Adaptation (AA), which extends certain existing learning algorithms to handle label distributions seamlessly. Concretely, k-NN is adapted such that, for an instance, the mean of the label distributions of its k nearest neighbors is calculated as the predicted label distribution, which is denoted by AA-kNN. Besides, a three-layer back-propagation neural network with multiple outputs is adapted to minimize the sum-squared loss between the output of the neural network and the real label distribution, which is denoted by AA-BP. The third strategy is Specialized Algorithms (SA), which match the characteristics of LDL. Two specialized algorithms, i.e., SA-IIS and SA-BFGS, are proposed by applying the maximum entropy model (Berger, Pietra, and Pietra 1996) with KL divergence as the loss function to learn the label distribution. Notice that all the above algorithms are designed without a learnability guarantee.

Recently, two lines of improvement on LDL are noteworthy. The first tackles the shortage of label distribution datasets. Compared with SLL and MLL, LDL datasets are more difficult to acquire. (Xu, Tao, and Geng 2018) proposes to boost logical labels (from SLL/MLL datasets) into real-valued label distributions by graph Laplacian label enhancement. (Hou et al. 2017) utilizes semi-supervised learning, with abundant unlabeled data, to adaptively learn the label distribution. (Xu and Zhou 2017) deals with the situation where the label distribution is only partially observed, and jointly recovers and learns it by matrix completion with a trace-norm regularizer. The second line exploits the label correlations of LDL. (Zhou, Xue, and Geng 2015) represents label correlations by Pearson's correlation coefficient. (Jia et al. 2018) addresses label correlations by adding a regularizer, which encodes label correlations into distances between labels. (Zheng, Jia, and Li 2018) considers label correlations to be local, encodes them into a local correlation vector, and learns the optimal encoding vector and the LDL model simultaneously. These algorithmic improvements are mainly guided by practice, without taking theoretical feasibility into consideration.

There is little work on the theory of LDL. One recent paper (Zhao and Zhou 2018) partially studies the generalization of LDL. However, the major motivation of that paper is to take the structure of the label space into consideration by applying the optimal transport distance (Villani 2008) instead of traditional LDL measures. Moreover, the proposed risk bound is too broad, without providing a bound on the Rademacher complexity of the hypothesis space. Besides, the characteristics of LDL measures are not taken into consideration at all. The risk bounds we develop in this paper match the special characteristics of LDL (i.e., the probability simplex constraint and the loss function characteristics) and are thus broadly applicable.

For the generalization of LDL to perform classification, one related topic is the plug-in decision theorem (Devroye, Györfi, and Lugosi 1996) (formally described in Theorem 8). It states that the error probability difference between a decision function and the optimal one is bounded by the expectation of the absolute difference between the plug-in function and the conditional probability distribution function.
The plug-in decision theorem has been widely used to prove the consistency of many decision rules, such as Stone's Theorem (Stone 1977), the strong consistency of the k-NN rule (Biau and Devroye 2015), the consistency of the kernel rule (Devroye and Krzyżak 1989), etc. Note that the classic plug-in decision theorem is only established in the binary setting, which limits its applications. In this paper, we generalize the plug-in decision theorem and develop data-dependent error probability bounds for the corresponding LDL algorithms.

## Preliminary

Denote the input space by $\mathcal{X} \subseteq \mathbb{R}^d$ and the label space by $\mathcal{Y} = \{y_1, y_2, \dots, y_m\}$. Define the label distribution function $\eta : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, which satisfies $\eta(x, y) \ge 0$ and $\sum_{y} \eta(x, y) = 1$. Let the training set be $S = \{(x_1, \eta(x_1)), \dots, (x_n, \eta(x_n))\}$, where $x_i$ is sampled according to an underlying probability distribution $\mathcal{D}$, and $\eta(x_i) = (\eta_{x_i}^{y_1}, \dots, \eta_{x_i}^{y_m})$ is determined by the unknown label distribution function $\eta$, with $\eta_{x_i}^{y_j} = \eta(x_i, y_j)$ for convenience of notation. The goal of LDL is to learn the unknown function $\eta$.

Given a function class $\mathcal{H}$ and a loss function $\ell : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}_+$, for a function $h \in \mathcal{H}$, its risk and empirical risk are defined as $L_{\mathcal{D}}(h) = \mathbb{E}_{x \sim \mathcal{D}}[\ell(h(x), \eta(x))]$ and $L_S(h) = \frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i), \eta(x_i))$, respectively. Recall the definition of the Rademacher complexity w.r.t. $S$ and $\ell$,
$$\hat{\mathfrak{R}}_n(\ell \circ \mathcal{H} \circ S) = \mathbb{E}_{\epsilon_1,\dots,\epsilon_n}\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \epsilon_i\, \ell(h(x_i), \eta(x_i))\right],$$
where $\epsilon_1, \dots, \epsilon_n$ are $n$ independent Rademacher random variables with $P(\epsilon_i = 1) = P(\epsilon_i = -1) = 1/2$. Then

**Lemma 1.** (Bartlett and Mendelson 2003; Mohri, Rostamizadeh, and Talwalkar 2012) Let $\mathcal{H}$ be a family of functions and let the loss function $\ell$ be bounded by $\mu$. Then for any $\delta > 0$, with probability at least $1-\delta$, for all $h \in \mathcal{H}$,
$$L_{\mathcal{D}}(h) \le L_S(h) + 2\hat{\mathfrak{R}}_n(\ell \circ \mathcal{H} \circ S) + 3\mu\sqrt{\frac{\ln(2/\delta)}{2n}}.$$

## Risk Bounds for LDL Algorithms

We now provide risk bounds for three representative algorithms from two classes, i.e., AA-kNN and AA-BP from algorithm adaptation, and SA-ME (maximum entropy) from specialized algorithms. Notice that (Geng 2016) proposes two specialized algorithms, i.e., SA-IIS and SA-BFGS, which differ only in the underlying optimization methods. In this paper we focus on the generalization ability of algorithms, and SA-ME represents SA-IIS and SA-BFGS collectively from the perspective of the underlying model, i.e., the maximum entropy model. Algorithms from problem transformation are not covered, because this type of algorithm circumvents learning the label distribution directly, resorting to resampling and single-label learning algorithms instead.

### AA-kNN

AA-kNN extends k-NN to deal with label distributions. Given a new instance $x$, its $k$ nearest neighbors are first selected in the training set. Then the mean of the label distributions of the $k$ nearest neighbors is calculated as the label distribution of $x$, i.e.,
$$\hat{\eta}_x^{y_i} = \frac{1}{k}\sum_{j \in N_k(x)} \eta_{x_j}^{y_i}, \quad i = 1, 2, \dots, m,$$
where $\hat{\eta} : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is the output function of AA-kNN with $\hat{\eta}_x^{y_i} = \hat{\eta}(x, y_i)$ for simplicity of notation, and $N_k(x)$ is the index set of the $k$ nearest neighbors of $x$. For convenience of analysis, we assume $\mathcal{X} = [0, 1]^d$.

**Theorem 2.** Assume $\eta(\cdot, y_i)$ is $c_i$-Lipschitz w.r.t. $\mathcal{X}$, and let $c = \sum_{i=1}^{m} c_i$. Then for any $\delta > 0$, with probability at least $1-\delta$ over the choice of $S$,
$$\mathbb{E}_x\left[\sum_{i=1}^{m} |\hat{\eta}_x^{y_i} - \eta_x^{y_i}|\right] \le \frac{4c\sqrt{d}}{\delta}\left(\frac{2k}{n}\right)^{1/(d+1)}. \tag{2}$$

*Proof.* With the Lipschitz assumption on $\eta(\cdot, y_i)$, we have
$$\sum_{i=1}^{m} |\hat{\eta}_x^{y_i} - \eta_x^{y_i}| \le \sum_{i=1}^{m} \frac{1}{k}\sum_{j \in N_k(x)} |\eta_x^{y_i} - \eta_{x_j}^{y_i}| \le \sum_{i=1}^{m} \frac{c_i}{k}\sum_{j \in N_k(x)} \|x_j - x\|_2 = \frac{c}{k}\sum_{j \in N_k(x)} \|x_j - x\|_2.$$
Thus to prove Theorem 2, it suffices to prove
$$\mathbb{E}_{x,S}\left[\frac{1}{k}\sum_{j \in N_k(x)} \|x_j - x\|_2\right] \le 4\sqrt{d}\left(\frac{2k}{n}\right)^{1/(d+1)}, \tag{3}$$
which is tricky and left to the appendices. Then apply Markov's inequality to Eq. (3), and Theorem 2 follows directly.
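To make the AA-kNN rule above concrete, the following minimal NumPy sketch predicts a label distribution by averaging the label distributions of the $k$ nearest training instances. It is an illustration of the adapted k-NN rule described above, not the reference implementation; the Euclidean metric, the brute-force neighbor search, and the variable names are assumptions.

```python
import numpy as np

def aa_knn_predict(X_train, D_train, x, k=5):
    """AA-kNN sketch: average the label distributions of the k nearest neighbors.

    X_train: (n, d) training instances
    D_train: (n, m) training label distributions (each row sums to 1)
    x:       (d,)   query instance
    """
    # Euclidean distances from x to every training instance (brute force).
    dists = np.linalg.norm(X_train - x, axis=1)
    # Index set N_k(x): the k nearest neighbors of x.
    nn = np.argsort(dists)[:k]
    # Predicted distribution: mean of the neighbors' label distributions.
    return D_train[nn].mean(axis=0)
```

Because every row of `D_train` lies on the probability simplex, their average does too, so the AA-kNN output automatically satisfies the distribution constraints.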
### AA-BP

AA-BP adapts a back-propagation neural network to perform label distribution learning. The three-layer neural network has $m$ output units, each of which outputs the description degree of one label. To make sure the output of the neural network satisfies the probability simplex constraint, a softmax activation function is applied to the output units. Similar to multi-output regression, AA-BP minimizes the sum-squared loss between the output of the neural network and the real label distribution. Observing that AA-BP can be regarded as a composition of the softmax function and the function family of a three-layer neural network, its Rademacher complexity w.r.t. $S$ for a loss function $\ell$ is bounded as follows.

**Theorem 3.** Denote the softmax function by $\mathrm{SF}$. Let $\mathcal{H}$ be a family of functions for three-layer neural networks with $m$ outputs (identity activation on the output layer), and let $\mathcal{H}_j$ be the family of functions for the $j$-th output. For a loss function $\ell$ with Lipschitz constant $L_\ell$, we have
$$\hat{\mathfrak{R}}_n(\ell \circ \mathrm{SF} \circ \mathcal{H} \circ S) \le 2\sqrt{2}\, m L_\ell \sum_{j=1}^{m} \hat{\mathfrak{R}}_n(\mathcal{H}_j \circ S). \tag{4}$$

*Proof.* Define the function $\phi(\cdot, \cdot)$ as $\ell(\mathrm{SF}(\cdot), \cdot)$. Next, we show that $\phi$ is Lipschitz. For $p, q \in \mathbb{R}^m$,
$$|\phi(p, \cdot) - \phi(q, \cdot)| \le L_\ell \|\mathrm{SF}(p) - \mathrm{SF}(q)\|_1,$$
where the inequality is according to the Lipschitzness of $\ell$. The right-hand side of the preceding inequality equals
$$L_\ell \sum_{i=1}^{m}\left|\frac{1}{1+\sum_{j \ne i} e^{p_j - p_i}} - \frac{1}{1+\sum_{j \ne i} e^{q_j - q_i}}\right|,$$
where $p_i$ and $q_i$ are the $i$-th elements of $p$ and $q$, respectively. Observing that the function $v \mapsto \frac{1}{1+\sum_i \exp(v_i)}$ is 1-Lipschitz for $v \in \mathbb{R}^m$, the preceding expression is bounded by
$$L_\ell \sum_{i=1}^{m} \|p - \mathbf{1}p_i - q + \mathbf{1}q_i\|_2 \le L_\ell \sum_{i=1}^{m} \left(\|p - q\|_2 + \sqrt{m}\,|p_i - q_i|\right) = L_\ell\left(m\|p - q\|_2 + \sqrt{m}\sum_{i=1}^{m} |p_i - q_i|\right) \le 2m L_\ell \|p - q\|_2,$$
where the last inequality is according to the Cauchy-Schwarz inequality. Recall the definition of the Rademacher complexity,
$$\hat{\mathfrak{R}}_n(\ell \circ \mathrm{SF} \circ \mathcal{H} \circ S) = \mathbb{E}\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \epsilon_i\, \phi(h(x_i), \eta(x_i))\right],$$
and according to (Maurer 2016), with $\phi$ being $2mL_\ell$-Lipschitz, the right-hand side of the above equation is bounded by
$$2\sqrt{2}\, m L_\ell\, \mathbb{E}\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m} \epsilon_{i,j}\, h_j(x_i)\right], \tag{5}$$
where $h_j(x_i)$ is the $j$-th component of $h(x_i)$, and the $\epsilon_{i,j}$ are $n \times m$ i.i.d. Rademacher variables. Notice that the function class of a multi-output neural network can be regarded as a direct sum of function classes of scalar-output neural networks. Suppose $\mathcal{H}_1, \dots, \mathcal{H}_m$ are classes of functions for three-layer neural networks with scalar output; then $\mathcal{H} = \oplus_{j \in [m]}\mathcal{H}_j = \{x \mapsto [h_1(x), \dots, h_m(x)]^{\mathsf T} : h_j \in \mathcal{H}_j\}$, and Eq. (5) equals
$$2\sqrt{2}\, m L_\ell\, \mathbb{E}\left[\sum_{j=1}^{m}\sup_{h_j \in \mathcal{H}_j} \frac{1}{n}\sum_{i=1}^{n} \epsilon_{i,j}\, h_j(x_i)\right] = 2\sqrt{2}\, m L_\ell \sum_{j=1}^{m} \hat{\mathfrak{R}}_n(\mathcal{H}_j \circ S),$$
which finishes the proof of Theorem 3.

The Rademacher complexity of the class of functions for three-layer neural networks with scalar output satisfies

**Lemma 4.** (Bartlett and Mendelson 2003; Gao and Zhou 2016) Let $\sigma$ be Lipschitz with constant $L_\sigma$. Define the class of functions $\mathcal{H}_j = \{x \mapsto \sum_i w_{j,i}\,\sigma(v_i \cdot x) : \|w_j\|_2 \le B_1, \|v_i\|_2 \le B_0\}$. Its Rademacher complexity satisfies
$$\hat{\mathfrak{R}}_n(\mathcal{H}_j \circ S) \le \frac{L_\sigma B_0 B_1}{\sqrt{n}} \max_{i \in [n]} \|x_i\|_2.$$

Accordingly, the right-hand side of Eq. (4) is bounded as
$$\hat{\mathfrak{R}}_n(\ell \circ \mathrm{SF} \circ \mathcal{H} \circ S) \le \frac{2\sqrt{2}\, m^2 L_\ell L_\sigma B_0 B_1}{\sqrt{n}} \max_{i \in [n]} \|x_i\|_2.$$
For AA-BP, the sum-squared loss is 2-Lipschitz and bounded by 2. Finally, the data-dependent risk bound for AA-BP is

**Theorem 5.** Let $\mathcal{F}$ be the family of functions for AA-BP defined above, with sum-squared loss and weight constraints as in Lemma 4. For any $\delta > 0$, with probability at least $1-\delta$, for all $f \in \mathcal{F}$,
$$\mathbb{E}_{x\sim\mathcal{D}}\left[\sum_{i=1}^{m}\left(f_x^{y_i} - \eta_x^{y_i}\right)^2\right] \le \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}\left(\eta_{x_i}^{y_j} - f_{x_i}^{y_j}\right)^2 + \frac{8\sqrt{2}\, m^2 L_\sigma B_0 B_1}{\sqrt{n}}\max_{i \in [n]}\|x_i\|_2 + 6\sqrt{\frac{\ln(2/\delta)}{2n}}.$$

### SA-ME

SA-ME applies the maximum entropy model to learn the label distribution, i.e.,
$$f_x^{y_j} = \frac{1}{Z}\exp(w_j \cdot x), \tag{7}$$
where $Z = \sum_{j}\exp(w_j \cdot x)$ is the normalization factor. Actually, Eq. (7) can be regarded as a composition of the softmax function and multi-output linear regression, namely $\mathrm{SF} \circ \mathcal{H}$, where $\mathcal{H}$ represents a class of functions of multi-output linear regression.
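As a rough illustration of Eq. (7), the sketch below evaluates the maximum entropy model as a softmax over linear scores $w_j \cdot x$; AA-BP uses the same softmax output layer, with a three-layer network producing the scores instead of the linear map. The function and variable names are illustrative assumptions, and the max-subtraction is only a standard numerical-stability trick that leaves Eq. (7) unchanged.

```python
import numpy as np

def sa_me_predict(W, x):
    """Sketch of the maximum entropy model of Eq. (7): f_x^{y_j} = exp(w_j . x) / Z.

    W: (m, d) weight matrix, one row w_j per label
    x: (d,)   instance
    """
    scores = W @ x                   # w_j . x for every label y_j
    scores = scores - scores.max()   # stabilize exp(); Eq. (7) is unchanged
    expd = np.exp(scores)
    return expd / expd.sum()         # divide by Z = sum_j exp(w_j . x)
```

Training SA-ME then amounts to minimizing the empirical KL divergence between the given label distributions and this output, e.g., by BFGS (SA-BFGS) or improved iterative scaling (SA-IIS).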
SA-ME uses the KL divergence as its loss function, denoted by $\mathrm{KL} : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}_+$. The Rademacher complexity of SA-ME w.r.t. $S$ for the loss function $\mathrm{KL}$ satisfies

**Theorem 6.** Let $\mathcal{H}$ be a family of functions for multi-output linear regression, and let $\mathcal{H}_j$ be the family of functions for the $j$-th output. The Rademacher complexity of SA-ME with the KL loss satisfies
$$\hat{\mathfrak{R}}_n(\mathrm{KL} \circ \mathrm{SF} \circ \mathcal{H} \circ S) \le \sqrt{2}\,(\sqrt{m}+1)\sum_{j=1}^{m} \hat{\mathfrak{R}}_n(\mathcal{H}_j \circ S). \tag{8}$$

*Proof.* Note that $\mathrm{KL}(u, \cdot)$ is not $\rho$-Lipschitz over $\mathbb{R}^m$ for any $\rho \in \mathbb{R}$ and $u \in \mathbb{R}^m$, thus Theorem 3 cannot be applied directly. Define the function $\phi(\cdot, \cdot)$ as $\mathrm{KL}(\cdot, \mathrm{SF}(\cdot))$. Next we show that $\phi(u, \cdot)$ satisfies Lipschitzness. For $p, q \in \mathbb{R}^m$,
$$|\phi(u, p) - \phi(u, q)| = |\mathrm{KL}(u, \mathrm{SF}(p)) - \mathrm{KL}(u, \mathrm{SF}(q))|,$$
which equals
$$\left|\sum_{i=1}^{m} u_i\left(\ln\frac{\exp(p_i)}{\sum_{j=1}^{m}\exp(p_j)} - \ln\frac{\exp(q_i)}{\sum_{j=1}^{m}\exp(q_j)}\right)\right| = \left|\sum_{i=1}^{m} u_i\left(\ln\Big(1+\sum_{j \ne i} e^{q_j - q_i}\Big) - \ln\Big(1+\sum_{j \ne i} e^{p_j - p_i}\Big)\right)\right|.$$
Observing that $v \mapsto \ln(1+\sum_i \exp(v_i))$ is 1-Lipschitz for $v \in \mathbb{R}^m$, the right-hand side of the preceding equation is bounded by
$$\sum_{i=1}^{m} u_i\,\|p - \mathbf{1}p_i - q + \mathbf{1}q_i\|_2 \le \|p - q\|_2 + \sqrt{m}\sum_{i=1}^{m} u_i|p_i - q_i| \le (\sqrt{m} + 1)\|p - q\|_2,$$
namely, $\phi$ is $(\sqrt{m}+1)$-Lipschitz. Similar to the discussion of bounding the Rademacher complexity for AA-BP, we have
$$\hat{\mathfrak{R}}_n(\mathrm{KL} \circ \mathrm{SF} \circ \mathcal{H} \circ S) \le \sqrt{2}\,(\sqrt{m}+1)\sum_{j=1}^{m} \hat{\mathfrak{R}}_n(\mathcal{H}_j \circ S),$$
which concludes the proof.

As discussed above, although KL alone does not satisfy Lipschitzness, $\mathrm{KL} \circ \mathrm{SF}$ is $(\sqrt{m}+1)$-Lipschitz. Define the class of functions of the $j$-th output with weight constraints as $\mathcal{H}_j = \{x \mapsto w_j \cdot x : \|w_j\|_2 \le 1\}$. According to (Kakade, Sridharan, and Tewari 2009), the Rademacher complexity of $\mathcal{H}_j$ satisfies $\hat{\mathfrak{R}}_n(\mathcal{H}_j \circ S) \le \frac{\max_{i \in [n]}\|x_i\|_2}{\sqrt{n}}$. Then the right-hand side of Eq. (8) is bounded as
$$\hat{\mathfrak{R}}_n(\mathrm{KL} \circ \mathrm{SF} \circ \mathcal{H} \circ S) \le \frac{\sqrt{2}\,(\sqrt{m}+1)\, m}{\sqrt{n}}\max_{i \in [n]}\|x_i\|_2.$$
Accordingly, the data-dependent risk bound for SA-ME is

**Theorem 7.** Let $\mathcal{F}$ be the family of functions for SA-ME defined above with the KL divergence as loss function, bounded by a constant $b$. For any $\delta > 0$, with probability at least $1-\delta$, for all $f \in \mathcal{F}$,
$$\mathbb{E}_{x\sim\mathcal{D}}\left[\sum_{j=1}^{m}\eta_x^{y_j}\ln\frac{\eta_x^{y_j}}{f_x^{y_j}}\right] \le \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}\eta_{x_i}^{y_j}\ln\frac{\eta_{x_i}^{y_j}}{f_{x_i}^{y_j}} + \frac{2\sqrt{2}\,(\sqrt{m}+1)\, m}{\sqrt{n}}\max_{i \in [n]}\|x_i\|_2 + 3b\sqrt{\frac{\ln(2/\delta)}{2n}}.$$

As (Cha 2007) suggests, when implementing the KL divergence a zero entry is replaced by a very small value, say $\gamma > 0$, to avoid division by zero. Then for probability distributions $p, q \in \mathbb{R}^m$ with $p_i \ge \gamma$ and $q_i \ge \gamma$,
$$\sum_{i=1}^{m} p_i \ln\frac{p_i}{q_i} \le \sum_{i=1}^{m} p_i \ln\frac{1}{\gamma} = \ln\frac{1}{\gamma},$$
thus there exists a constant $b \ge \ln\frac{1}{\gamma}$ such that $\mathrm{KL}(\cdot, \cdot) \le b$ (e.g., $b = 35$ for $\gamma = 1\times 10^{-15}$).

## Relation between LDL and Classification

LDL aims at learning the unknown label distribution function $\eta$ by minimizing the distance (or maximizing the similarity) between the given distribution and the output distribution. However, in practice, LDL is often applied to perform classification. Firstly, a label distribution function $\hat{\eta}$ is learned from the training sample $S$ with label distributions. Secondly, a given instance $x$ is classified by $\hat{\eta}$, with the label corresponding to the maximum predicted description degree taken as the predicted label, i.e., $y^* = \arg\max_{y \in \mathcal{Y}} \hat{\eta}(x, y)$. In this part we examine the feasibility of using LDL algorithms to perform classification. To make the analysis possible, we stay in the probabilistic setting, where the underlying label distribution function $\eta$ is the conditional probability distribution function, i.e., $\eta(x, y) = P(y \mid x)$. Unlike LDL, where the unknown label distribution function $\eta$ is deterministic though unknown, the label variable in classification is stochastic and sampled according to the conditional probability distribution. Thus we start with the plug-in decision theorem to bound the error probability by the risk of LDL (with absolute loss), and then obtain the upper bound on the error probability using the union bound.
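The classification-via-LDL rule just described is a one-liner on top of any learned label distribution function; the sketch below makes the plug-in decision explicit. The names are illustrative, and `predict_distribution` stands for any predictor of $\hat{\eta}(x,\cdot)$, e.g., the AA-kNN or SA-ME sketches above.

```python
import numpy as np

def plug_in_classify(predict_distribution, x, labels):
    """Plug-in decision: return the label with the maximum predicted description degree."""
    dist = np.asarray(predict_distribution(x))   # eta_hat(x, .) as an (m,) vector
    return labels[int(np.argmax(dist))]          # y* = argmax_y eta_hat(x, y)
```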
### Generalized Plug-in Decision Theorem

As a preliminary, we first introduce the well-known plug-in decision theorem. The classic plug-in decision theorem applies to binary decisions. For a given $x$ with label $y$, the Bayes decision function $h^*$ is
$$h^*(x) = \begin{cases} y_0 & \text{if } \eta_1(x) \le 1/2, \\ y_1 & \text{otherwise}, \end{cases}$$
where $\eta_1(x) = P(y = y_1 \mid x)$. Consider another decision $h$ with plug-in decision function
$$h(x) = \begin{cases} y_0 & \text{if } \hat{\eta}_1(x) \le 1/2, \\ y_1 & \text{otherwise}, \end{cases}$$
where $\hat{\eta}_1(x)$ (with $0 \le \hat{\eta}_1(x) \le 1$) is an approximation of $\eta_1(x)$. Then we have

**Theorem 8.** (Devroye, Györfi, and Lugosi 1996) For the plug-in decision function $h$ defined above, the difference in error probability between $h$ and $h^*$ satisfies
$$P(h(x) \ne y) - P(h^*(x) \ne y) \le 2\,\mathbb{E}_x\left[\,|\hat{\eta}_1(x) - \eta_1(x)|\,\right].$$

The theorem states that if $\hat{\eta}_1(x)$ is close to $\eta_1(x)$ in absolute value, then the 0/1 risk of the decision $h$ is near that of the optimal decision function $h^*$. The preceding plug-in decision theorem only applies to binary classification, and we generalize it to multi-class classification. Formally, given an instance $x$ with conditional probability distribution function $\eta$ (multi-class), the corresponding Bayes decision is
$$h^*(x) = \arg\max_{y \in \mathcal{Y}} \eta(x, y).$$
Similarly, assume we have access to $\hat{\eta}$ that approximates $\eta$; the plug-in decision is defined as
$$\hat{h}(x) = \arg\max_{y \in \mathcal{Y}} \hat{\eta}(x, y).$$

**Theorem 9.** For the plug-in decision function $\hat{h}$ defined above, we have
$$P(\hat{h}(x) \ne y) - P(h^*(x) \ne y) \le \mathbb{E}_x\left[\sum_{i=1}^{m} |\hat{\eta}_i(x) - \eta_i(x)|\right],$$
where $\eta_i(x)$ and $\hat{\eta}_i(x)$ represent $\eta(x, y_i)$ and $\hat{\eta}(x, y_i)$, respectively.

*Proof.* Given an instance $x$, the conditional error probability of $\hat{h}$ is
$$P(\hat{h}(x) \ne y \mid x) = 1 - P(\hat{h}(x) = y \mid x) = 1 - \sum_{i=1}^{m} P(\hat{h}(x) = y_i, y = y_i \mid x) = 1 - \sum_{i=1}^{m} \mathrm{Kr}(\hat{h}(x), y_i)\,\eta_i(x),$$
where $\mathrm{Kr}(\cdot, \cdot)$ is the Kronecker delta function. Then the difference in conditional error probability between $\hat{h}$ and $h^*$ is
$$P(\hat{h}(x) \ne y \mid x) - P(h^*(x) \ne y \mid x) = \sum_{i=1}^{m}\left(\mathrm{Kr}(h^*(x), y_i) - \mathrm{Kr}(\hat{h}(x), y_i)\right)\eta_i(x).$$
Without loss of generality, let $y_k$ be the prediction of $\hat{h}$ and $y_j$ be the prediction of $h^*$; then
$$P(\hat{h}(x) \ne y \mid x) - P(h^*(x) \ne y \mid x) = \eta_j(x) - \eta_k(x). \tag{10}$$

**Lemma 10.** Let $a \ge b$ and $d \ge c$. Then $|a - b| \le |a - c| + |b - d|$.

*Proof.* If $c \le b$ or $d \ge a$, the inequality is obvious. If $c \ge b$ and $a \ge d$, then $|a - b| = |a - c| + |c - b| \le |a - c| + |b - d|$.

For $k \ne j$, we have $\hat{\eta}_j(x) \le \hat{\eta}_k(x)$ and $\eta_k(x) \le \eta_j(x)$, and according to Lemma 10 (with $a = \eta_j(x)$, $b = \eta_k(x)$, $c = \hat{\eta}_j(x)$, $d = \hat{\eta}_k(x)$), $|\eta_j(x) - \eta_k(x)| \le |\hat{\eta}_j(x) - \eta_j(x)| + |\hat{\eta}_k(x) - \eta_k(x)|$. For $k = j$, the left-hand side of Eq. (10) reduces to 0. Thus we have
$$P(\hat{h}(x) \ne y \mid x) - P(h^*(x) \ne y \mid x) \le \sum_{i=1}^{m} |\hat{\eta}_i(x) - \eta_i(x)|. \tag{11}$$
Taking expectations of both sides of Eq. (11) finishes the proof of Theorem 9.

### Measures for LDL

As already discussed for Theorem 9, in the probabilistic setting LDL with absolute loss is sufficient for classification, because approximating the conditional probability distribution function in absolute loss guarantees approximating the optimal classifier. (Geng 2016) suggests six measures for LDL, i.e., the Chebyshev distance, Clark distance, Canberra metric, KL divergence, cosine coefficient, and intersection similarity, which belong to the Minkowski family, the $\chi^2$ family, the $L_1$ family, the Shannon entropy family, the inner product family, and the intersection family, respectively (Cha 2007). For probability distributions $p, q \in \mathbb{R}^m$, the formulations of the six measures are summarized in Table 1, where ↓ after a distance measure indicates "the smaller the better", and ↑ after a similarity measure indicates "the larger the better".

Table 1: Measures for LDL

| Measure | Formula |
|---|---|
| Chebyshev ↓ | $\mathrm{Dis}_1(p, q) = \max_i \lvert p_i - q_i \rvert$ |
| Clark ↓ | $\mathrm{Dis}_2(p, q) = \sqrt{\sum_i (p_i - q_i)^2 / (p_i + q_i)^2}$ |
| Canberra ↓ | $\mathrm{Dis}_3(p, q) = \sum_i \lvert p_i - q_i \rvert / (p_i + q_i)$ |
| Kullback-Leibler ↓ | $\mathrm{Dis}_4(p, q) = \sum_i p_i \ln(p_i / q_i)$ |
| Cosine ↑ | $\mathrm{Sim}_1(p, q) = p \cdot q / (\lVert p \rVert_2 \lVert q \rVert_2)$ |
| Intersection ↑ | $\mathrm{Sim}_2(p, q) = \sum_i \min(p_i, q_i)$ |
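For reference, a direct NumPy transcription of Table 1 might look as follows. This is a sketch under the assumption, following (Cha 2007), that zero entries are clipped to a small $\gamma$ before the ratio-based measures are computed, so that no division by zero occurs; the function and key names are illustrative.

```python
import numpy as np

def ldl_measures(p, q, gamma=1e-15):
    """The six LDL measures of Table 1 for two label distributions p and q."""
    p = np.clip(np.asarray(p, dtype=float), gamma, None)
    q = np.clip(np.asarray(q, dtype=float), gamma, None)
    return {
        "chebyshev":    np.max(np.abs(p - q)),                             # smaller is better
        "clark":        np.sqrt(np.sum((p - q) ** 2 / (p + q) ** 2)),      # smaller is better
        "canberra":     np.sum(np.abs(p - q) / (p + q)),                   # smaller is better
        "kl":           np.sum(p * np.log(p / q)),                         # smaller is better
        "cosine":       p @ q / (np.linalg.norm(p) * np.linalg.norm(q)),   # larger is better
        "intersection": np.sum(np.minimum(p, q)),                          # larger is better
    }
```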
**Theorem 11.** All measures in Table 1 are sufficient for LDL to perform classification.

*Proof.* Denote the absolute loss by $\mathrm{Dis}(p, q) = \sum_i |p_i - q_i|$. For the Chebyshev, Clark and Canberra distances, it is straightforward to validate that $\mathrm{Dis}_1 \ge \frac{1}{m}\mathrm{Dis}$, $\mathrm{Dis}_3 \ge \frac{1}{2}\mathrm{Dis}$ and $\mathrm{Dis}_2 \ge \frac{1}{2\sqrt{m}}\mathrm{Dis}$. For the KL distance, we have $\mathrm{Dis}_4 \ge \frac{1}{4}\mathrm{Dis}^2$. For the cosine similarity, we have $1 - \frac{1}{m}\mathrm{Sim}_1 \ge \frac{1}{2m}\mathrm{Dis}^2$, and for the intersection similarity, it holds that $1 - \mathrm{Sim}_2 = \frac{1}{2}\mathrm{Dis}$. In conclusion, all measures in Table 1 bound the absolute loss in some way. Details of the proof are left to the appendix.

### Data-dependent Error Probability Bounds for LDL

As discussed above, the error probability is directly related to the absolute loss, and Theorem 11 states that the measures in Table 1 bound the absolute loss in some way. Accordingly, the risk bounds for AA-kNN (absolute loss), AA-BP (sum-squared loss), and SA-ME (KL loss) can be extended to error probability bounds. Formally, let $f$ be a learned label distribution function, and define the corresponding decision function as $g(f(x)) = \arg\max_{y \in \mathcal{Y}} f(x, y)$. Notice that the optimal error probability $L^*$ appears on the left-hand side of Theorem 9, and it can be bounded with Hoeffding's inequality. The optimal classifier $h^*$ outputs the label corresponding to the maximum conditional probability, so the empirical error probability of $h^*$ is $L_S(h^*) = \frac{1}{n}\sum_{i=1}^{n}\big(1 - \max_{y \in \mathcal{Y}}\eta_{x_i}^{y}\big)$. By Hoeffding's inequality, for any $\epsilon > 0$,
$$P\{|L_S(h^*) - L^*| \ge \epsilon\} \le 2\exp(-2n\epsilon^2),$$
namely, for any $\delta > 0$, with probability at least $1-\delta$,
$$|L_S(h^*) - L^*| \le \sqrt{\frac{\ln(2/\delta)}{2n}}. \tag{12}$$
Finally, combining Theorem 9 and Eq. (12) with the results of Theorem 2, Theorem 5 and Theorem 7 respectively, we conclude with the following theorems.

**Theorem 12.** Let $\eta(\cdot, y_i)$ be $c_i$-Lipschitz with $c = \sum_{i=1}^m c_i$, and let $\hat{\eta}$ be the output function of AA-kNN. Then for any $\delta > 0$, with probability at least $1-\delta$,
$$P(g(\hat{\eta}(x)) \ne y) \le \frac{8c\sqrt{d}}{\delta}\left(\frac{2k}{n}\right)^{1/(d+1)} + \sqrt{\frac{\ln(4/\delta)}{2n}} + 1 - \frac{1}{n}\sum_{i=1}^{n}\max_{y \in \mathcal{Y}}\eta_{x_i}^{y}.$$

Different from AA-kNN, AA-BP minimizes the sum-squared loss, and note that for a random variable $z \in \mathbb{R}^m$,
$$\mathbb{E}_z\|z\|_1 \le \sqrt{m}\,\mathbb{E}_z\|z\|_2 \le \sqrt{m}\sqrt{\mathbb{E}_z\|z\|_2^2},$$
where the first inequality is according to the Cauchy-Schwarz inequality and the second is according to Jensen's inequality. Thus the error probability for AA-BP satisfies

**Theorem 13.** Let $\mathcal{F}$ be the family of functions for AA-BP defined above. For any $\delta > 0$, with probability at least $1-\delta$, for all $f \in \mathcal{F}$,
$$P(g(f(x)) \ne y) \le \sqrt{m}\sqrt{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}\left(\eta_{x_i}^{y_j} - f_{x_i}^{y_j}\right)^2 + \frac{8\sqrt{2}\, m^2 L_\sigma B_0 B_1}{\sqrt{n}}\max_{i \in [n]}\|x_i\|_2 + 7\sqrt{\frac{\ln(4/\delta)}{2n}}} + 1 - \frac{1}{n}\sum_{i=1}^{n}\max_{y \in \mathcal{Y}}\eta_{x_i}^{y},$$
where $\sqrt{\cdot}$ is the root operator.

Observing that for probability distributions $p, q \in \mathbb{R}^m$, $\|p - q\|_1 \le 2\sqrt{\mathrm{KL}(p, q)}$, we obtain

**Theorem 14.** Let $\mathcal{F}$ be the family of functions for SA-ME defined above, and let $\bar{b} = 3(b + 1)$. For any $\delta > 0$, with probability at least $1-\delta$, for all $f \in \mathcal{F}$,
$$P(g(f(x)) \ne y) \le 2\sqrt{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}\eta_{x_i}^{y_j}\ln\frac{\eta_{x_i}^{y_j}}{f_{x_i}^{y_j}} + \frac{2\sqrt{2}\,(\sqrt{m}+1)\, m}{\sqrt{n}}\max_{i \in [n]}\|x_i\|_2 + \bar{b}\sqrt{\frac{\ln(4/\delta)}{2n}}} + 1 - \frac{1}{n}\sum_{i=1}^{n}\max_{y \in \mathcal{Y}}\eta_{x_i}^{y}.$$
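The term $1 - \frac{1}{n}\sum_{i=1}^{n}\max_{y \in \mathcal{Y}}\eta_{x_i}^{y}$ shared by Theorems 12-14 is simply the empirical error probability $L_S(h^*)$ of the Bayes classifier computed from the training label distributions. A one-line sketch, where the array name `D_train` is an assumption:

```python
import numpy as np

def empirical_bayes_error(D_train):
    """L_S(h*) = 1 - (1/n) * sum_i max_y eta_{x_i}^y for an (n, m) array of label distributions."""
    return 1.0 - D_train.max(axis=1).mean()
```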
## Conclusion

This paper studies the learnability of LDL from two aspects, i.e., the generalization of LDL itself, and the generalization of LDL when used to perform classification. On the one hand, for the generalization of LDL, risk bounds for three representative LDL algorithms, i.e., AA-kNN, AA-BP and SA-ME, are provided, with convergence rates $O((k/n)^{1/(d+1)})$, $O(1/\sqrt{n})$ and $O(1/\sqrt{n})$ respectively, which indicates the learnability of LDL. On the other hand, for the generalization of LDL to perform classification, a generalized plug-in decision theorem is proposed, showing that minimizing the absolute loss is sufficient for the corresponding LDL decision function to approach the optimal classifier. Furthermore, six commonly used LDL measures are also shown to be sufficient for classification, since all six measures bound the absolute loss in some way. Besides, data-dependent error probability bounds for LDL are given to demonstrate the feasibility of using LDL to perform classification.

## Appendices

### Details of the Proof of Theorem 11

For discrete probability distributions $p, q \in \mathbb{R}^m$, the missing proofs of the relations between the measures in Table 1 and the absolute loss are as follows.

**Intersection similarity.** For the intersection similarity and the absolute distance, it holds that $1 - \mathrm{Sim}_2 = \mathrm{Dis}/2$.

*Proof.* Firstly, for $a, b \in \mathbb{R}$ we have $\min(a, b) = (a+b)/2 - |a-b|/2$, and it follows that
$$\mathrm{Sim}_2(p, q) = \sum_{i=1}^{m}\left[\frac{p_i + q_i}{2} - \frac{|p_i - q_i|}{2}\right] = 1 - \frac{\mathrm{Dis}}{2}.$$

**Cosine similarity.** For the cosine similarity and the absolute distance, it holds that $1 - \frac{1}{m}\mathrm{Sim}_1 \ge \frac{1}{2m}\mathrm{Dis}^2$.

*Proof.* According to the law of cosines, we have
$$\|p\|_2^2 + \|q\|_2^2 - 2\|p\|_2\|q\|_2\,\mathrm{Sim}_1(p, q) = \|p - q\|_2^2.$$
Notice that $\|p\|_1 = 1$ and $\|q\|_1 = 1$, and that $\mathrm{Sim}_1 \ge 0$ for nonnegative $p, q$. Applying the Cauchy-Schwarz inequality to both sides of the above equation ($\|p\|_2^2 \le 1$, $\|q\|_2^2 \le 1$, $\|p\|_2\|q\|_2 \ge 1/m$ and $\|p - q\|_2^2 \ge \mathrm{Dis}^2/m$), it follows that
$$2 - \frac{2}{m}\mathrm{Sim}_1(p, q) \ge \frac{\mathrm{Dis}^2}{m}.$$

**Kullback-Leibler distance.** For the Kullback-Leibler distance and the absolute distance, it holds that $\mathrm{Dis}_4 \ge \mathrm{Dis}^2/4$.

*Proof.* The proof follows (Cover and Thomas 2012), where the probability distributions are continuous; here the KL divergence is applied to discrete probability distributions. Observe that $\mathrm{Dis}_4(p, q) = -\sum_i p_i \ln\frac{q_i}{p_i}$, which equals
$$-\sum_i p_i\left\{\ln\min\left(\frac{q_i}{p_i}, 1\right) + \ln\max\left(\frac{q_i}{p_i}, 1\right)\right\} \ge -\ln\sum_i p_i\min\left(\frac{q_i}{p_i}, 1\right) - \ln\sum_i p_i\max\left(\frac{q_i}{p_i}, 1\right)$$
$$= -\ln\sum_i \min(p_i, q_i) - \ln\sum_i \max(p_i, q_i) = -\ln\sum_i\frac{p_i + q_i - |p_i - q_i|}{2} - \ln\sum_i\frac{p_i + q_i + |p_i - q_i|}{2}$$
$$= -\ln\left\{1 - \mathrm{Dis}(p, q)/2\right\} - \ln\left\{1 + \mathrm{Dis}(p, q)/2\right\} = -\ln\left(1 - \mathrm{Dis}^2/4\right),$$
where the inequality is according to Jensen's inequality. Rearranging terms, it follows that
$$\left(\tfrac{1}{2}\mathrm{Dis}\right)^2 \le 1 - \exp(-\mathrm{Dis}_4) \le \mathrm{Dis}_4,$$
where the second inequality is according to the elementary inequality $1 - e^{-x} \le x$ for any $x \ge 0$, which concludes the proof.

### Proof of Equation (3)

The proof follows (Shalev-Shwartz and Ben-David 2014), which bounds the expected distance between a random $x$ and its closest neighbor in the training set.

*Proof.* For $k = 1$, according to (Shalev-Shwartz and Ben-David 2014),
$$\mathbb{E}_{x,S}\left[\|x_{N_1(x)} - x\|_2\right] \le 4\sqrt{d}\, n^{-1/(d+1)}.$$
One can easily check that for $k = 1$ the right-hand side of Eq. (3) is $2^{1/(d+1)} \cdot 4\sqrt{d}\, n^{-1/(d+1)} > 4\sqrt{d}\, n^{-1/(d+1)}$, thus Eq. (3) holds. Furthermore, for $k \ge 2$,

**Lemma 15.** (Shalev-Shwartz and Ben-David 2014) Let $C_1, C_2, \dots, C_r$ be a collection of subsets of $\mathcal{X}$. Then for any $k \ge 2$,
$$\mathbb{E}_S\left[\sum_{i:\,|C_i \cap S| < k} P[C_i]\right] \le \frac{2rk}{n}.$$