# Compact Multi-Label Learning

Xiaobo Shen, Weiwei Liu, Ivor W. Tsang, Quan-Sen Sun, Yew-Soon Ong

School of Computer Science and Engineering, Nanyang Technological University; School of Computer Science and Engineering, The University of New South Wales; Center for Artificial Intelligence, University of Technology Sydney; School of Computer Science and Engineering, Nanjing University of Science and Technology

{njust.shenxiaobo, liuweiwei863}@gmail.com, ivor.tsang@uts.edu.au, sunquansen@njust.edu.cn, asysong@ntu.edu.sg

## Abstract

Embedding methods have shown promising performance in multi-label prediction because they can discover label dependency. However, most embedding methods do not align the input and output well, which degrades prediction performance, and they suffer from expensive prediction costs when applied to large-scale datasets. To address these issues, this paper proposes a Co-Hashing (CoH) method that formulates multi-label learning from the perspective of cross-view learning. CoH regards the input and output as two views and learns a common latent Hamming space, in which input and output pairs are compressed into compact binary embeddings. CoH enjoys two key benefits: 1) the input and output are well aligned and their correlations are exploited; 2) prediction is very efficient using fast cross-view kNN search in the Hamming space. Moreover, we provide a generalization error bound for our method. Extensive experiments on eight real-world datasets demonstrate the superiority of the proposed CoH over state-of-the-art methods in terms of both prediction accuracy and efficiency.

## Introduction

In multi-label learning (Zhang and Zhou 2014; Liu, Tsang, and Müller 2017), each instance is represented by a set of labels. For example, a document can be associated with a range of topics, such as sports, finance, and education, and an image may be tagged with both beach and tree. The task of multi-label learning is to learn a function that can predict the proper label sets for unseen instances. It is becoming increasingly relevant to a number of applications, ranging from document classification to gene function prediction and automatic image annotation.

Much of the literature (Chen and Lin 2012; Guo and Schuurmans 2013; Zhang and Zhou 2014) has shown that multi-label learning methods usually achieve better prediction performance by explicitly capturing label dependency. To this end, (Hsu et al. 2009) were the first to propose embedding the label vectors into a low-dimensional subspace using random projection and then building a regression model in the embedding space. To learn better embeddings, various embedding methods have since been developed, such as canonical correlation analysis (CCA) (Zhang and Schneider 2011) and maximum margin output coding (MMOC) (Zhang and Schneider 2012). Such methods have achieved impressive performance on small datasets. However, their decoding schemes often require solving a quadratic programming (QP) problem in a combinatorial space, which is very computationally expensive. To improve efficiency, large margin kNN (LM-kNN) (Liu and Tsang 2015) has recently been proposed for fast multi-label prediction.
LM-kNN learns a distance metric that discovers label dependency, so that instances with very different label sets are moved further apart. More importantly, LM-kNN is the first approach to employ k-nearest neighbor (kNN) search in the embedding space for prediction, which avoids an expensive decoding procedure. Another state-of-the-art method that uses kNN search for fast prediction is the recently proposed sparse local embedding for extreme classification (SLEEC) (Bhatia et al. 2015). SLEEC first seeks the embedding of labels by preserving the pairwise distances between a few nearest label neighbors, and then learns a regressor in the embedding space. A gap usually exists between the input and output in the multi-label setting, and the performance of the kNN decision rule for multi-label prediction depends heavily on the alignment between the input and output. The above two methods fail to explore this correlation, so their learned embeddings are not well aligned, leading to degraded prediction performance. In addition, their prediction efficiency relies on the speed of kNN search, which is slow for large-scale applications. How to further improve prediction efficiency also remains underexplored.

To address the aforementioned issues, this paper regards the input and output as two distinct views and provides a new insight into multi-label learning from a cross-view perspective (Quadrianto and Lampert 2011; Dhillon, Foster, and Ungar 2011; Guo and Xiao 2012; Shen et al. 2017b), which aims to integrate the complementary advantages of the two views to accomplish multi-label prediction tasks. Specifically, we propose Co-Hashing (CoH) to learn a common latent Hamming space, where the input and output are jointly embedded into compact binary codes and well aligned. In this latent subspace, the correlation between input and output is established, and knowledge can be better transferred. In the prediction phase, we directly search for the kNN label embeddings of any testing instance in the constructed latent subspace. The cross-view kNN search over label embeddings benefits from the view alignment, leading to better multi-label prediction performance. CoH is also inspired by the success of hashing in large-scale approximate nearest neighbor search (Wang et al. 2016), which improves prediction efficiency by accelerating kNN search in the learned Hamming subspace. Extensive experiments on eight real-world datasets demonstrate that the proposed CoH outperforms the state-of-the-art methods in terms of both prediction accuracy and efficiency.

## Related Work

### Multi-label Learning

Embedding methods have shown promising results for multi-label learning, especially when the number of labels is large. They project the label vector into a low-dimensional subspace and then perform regression in the embedding subspace. A number of embedding methods have been proposed based on various compression and decompression techniques (Hsu et al. 2009; Zhang and Schneider 2011; Tai and Lin 2012; Zhang and Schneider 2012). To name a few, (Tai and Lin 2012) proposed the principal label space transformation (PLST), which uses principal component analysis (PCA) to compress the label vector. Based on canonical correlation analysis (CCA), (Zhang and Schneider 2011) took both features and labels into consideration. After that, maximum margin output coding (MMOC) (Zhang and Schneider 2012) was developed to learn an output coding.
Recently, sparse local embedding for extreme classification (SLEEC) was proposed to seek the embedding of labels by preserving the pairwise distances between a few nearest label neighbors. Generally speaking, these methods do not establish a good alignment between the input and output, which limits the performance of kNN search for multi-label prediction.

### Cross-view Learning

Cross-view learning (Guo and Xiao 2012) aims to analyze the relationships among different views and to bridge the gap between them. Its basic idea is to seek a common latent subspace in which embeddings from different views are well aligned. As a result, cross-view learning allows features from different views to be matched in the latent subspace. It has been successfully applied in many settings, such as image-text retrieval, cross-language text classification, and person re-identification. In this work, we consider multi-label learning from the perspective of cross-view learning. Specifically, we assume that the input and output share a common subspace, which reflects the latent semantics. The common subspace establishes the correlations between the input and output, so label search and prediction in this subspace are meaningful. To achieve this goal, we formulate multi-label learning as a model that learns semantic latent embeddings shared by both the input and output.

### Hash Code Learning

Hashing (Wang et al. 2016) has become a widely studied solution to approximate nearest neighbor (ANN) search because of its great gains in both storage and computation over massive data. The basic idea of hashing (Weiss, Torralba, and Fergus 2009; Gong et al. 2013; Shen et al. 2015; 2017a; 2017b) is to map high-dimensional data into a low-dimensional discrete code space, called the Hamming space, while preserving the similarity structure of the original space. Each data point is then represented by a short hash code consisting of a sequence of bits. Some supervised hashing methods (Gong et al. 2013; Zhao et al. 2015; Lai et al. 2016) have been proposed in the multi-label setting. However, the proposed CoH differs from them: they are developed only for retrieval and cannot be applied to multi-label learning. The motivation for developing CoH is to achieve fast multi-label prediction using the hashing technique. Benefiting from efficient Hamming distance computation, CoH improves the efficiency of kNN search. To our knowledge, this is the first work that applies hashing to accelerate multi-label prediction.

## The Proposed Method

Let $\{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\}_{i=1}^{n}$ be the training dataset, where $\mathbf{x}^{(i)} \in \mathcal{X} \subseteq \mathbb{R}^{p}$ is an input (instance) vector, $\mathbf{y}^{(i)} \in \mathcal{V} \subseteq \{0,1\}^{q}$ is the corresponding output (label) vector, and $n$ denotes the number of training instances. Let $\mathbf{X} = [\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(n)}] \in \mathbb{R}^{p \times n}$ be the instance matrix and $\mathbf{Y} = [\mathbf{y}^{(1)}, \mathbf{y}^{(2)}, \ldots, \mathbf{y}^{(n)}] \in \mathbb{R}^{q \times n}$ be the label matrix. The goal of multi-label learning is to learn a multi-label classifier $f: \mathbb{R}^{p} \rightarrow \{0,1\}^{q}$ that accurately predicts the label vector for any unseen instance.
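To make this layout concrete, the following minimal NumPy sketch builds a synthetic dataset with the shapes used throughout the paper; the sizes and label sparsity are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, q = 1000, 50, 20                      # number of instances, feature dimension, number of labels
X = rng.standard_normal((p, n))             # instance matrix: one column per training instance
Y = (rng.random((q, n)) < 0.1).astype(int)  # binary label matrix: Y[k, i] = 1 iff label k is on for instance i

# The learning goal is a classifier f: R^p -> {0,1}^q that maps the feature
# vector of an unseen instance to a binary label vector.
x_new = rng.standard_normal(p)
```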
### Formulation

Inspired by the success of cross-view learning (Quadrianto and Lampert 2011; Dhillon, Foster, and Ungar 2011; Guo and Xiao 2012; Shen et al. 2017b), we consider its benefits for multi-label learning. In cross-view learning, it is commonly assumed that data describing the same objects in different views share a common subspace (Guo and Xiao 2012). Accordingly, we assume that a low-dimensional common latent Hamming space exists for the input and output, in which the semantic embeddings $\mathbf{B} = [\mathbf{b}^{(1)}, \ldots, \mathbf{b}^{(n)}] \in \{-1,1\}^{d \times n}$ appropriately represent the input and output training pairs, and $\mathbf{b}^{(i)} \in \{-1,1\}^{d}$ denotes the binary embedding of the $i$-th training pair. The binary constraint on $\mathbf{B}$ enables fast multi-label prediction, benefiting from efficient computation in the Hamming space (Wang et al. 2016). To this end, we derive the objective function of the proposed CoH as follows:

$$
\begin{aligned}
\min_{\mathbf{U}, \mathbf{V}, \mathbf{B}} \quad & \|\mathbf{U}^{\top}\mathbf{X} - \mathbf{B}\|_F^2 + \|\mathbf{V}^{\top}\mathbf{Y} - \mathbf{B}\|_F^2 + \alpha \operatorname{Tr}\!\left(\mathbf{B}\mathbf{L}\mathbf{B}^{\top}\right) \\
\text{s.t.} \quad & \mathbf{B} \in \{-1,1\}^{d \times n}, \ \mathbf{B}\mathbf{B}^{\top} = n\mathbf{I}_d
\end{aligned}
\tag{1}
$$

where the constraint $\mathbf{B}\mathbf{B}^{\top} = n\mathbf{I}_d$ is introduced to make the $d$ bits mutually uncorrelated, so that the redundancy among these bits is minimized; $\mathbf{L} = \mathbf{I}_n - \mathbf{S} \in \mathbb{R}^{n \times n}$ is the normalized graph Laplacian (Belkin, Niyogi, and Sindhwani 2006). In this work, the normalized similarity graph is defined as $\mathbf{S} = \mathbf{Y}^{\top}\boldsymbol{\Lambda}^{-1}\mathbf{Y}$, where $\boldsymbol{\Lambda} = \operatorname{diag}(\mathbf{Y}\mathbf{1}) \in \mathbb{R}^{q \times q}$ normalizes each row. In (1), the first two terms model the mapping loss, and the last manifold regularization term preserves the semantic similarity structure between the embeddings in the Hamming space.

Problem (1) is difficult to solve because of the binary constraint. We first define the set $\Omega = \{\mathbf{Z} \in \mathbb{R}^{d \times n} \mid \mathbf{Z}\mathbf{Z}^{\top} = n\mathbf{I}_d\}$ and then provide a new formulation that softens the orthogonality constraint in (1):

$$
\begin{aligned}
\min_{\mathbf{U}, \mathbf{V}, \mathbf{B}} \quad & \|\mathbf{U}^{\top}\mathbf{X} - \mathbf{B}\|_F^2 + \|\mathbf{V}^{\top}\mathbf{Y} - \mathbf{B}\|_F^2 + \alpha \operatorname{Tr}\!\left(\mathbf{B}\mathbf{L}\mathbf{B}^{\top}\right) + \frac{\rho}{2}\operatorname{dist}^2(\mathbf{B}, \Omega) \\
\text{s.t.} \quad & \mathbf{B} \in \{-1,1\}^{d \times n}
\end{aligned}
\tag{2}
$$

where $\operatorname{dist}(\mathbf{B}, \Omega) = \min_{\mathbf{Z} \in \Omega} \|\mathbf{B} - \mathbf{Z}\|_F$ measures the distance from $\mathbf{B}$ to the set $\Omega$, and $\rho \geq 0$ is a regularization parameter. In (2), we allow a certain discrepancy between $\mathbf{B}$ and $\Omega$, so that (2) is more flexible than (1). Since $\operatorname{Tr}(\mathbf{B}^{\top}\mathbf{B}) = \operatorname{Tr}(\mathbf{Z}^{\top}\mathbf{Z}) = nd$, (2) can be further transformed into the following problem:

$$
\begin{aligned}
\min_{\mathbf{U}, \mathbf{V}, \mathbf{B}, \mathbf{Z}} \quad & \|\mathbf{U}^{\top}\mathbf{X} - \mathbf{B}\|_F^2 + \|\mathbf{V}^{\top}\mathbf{Y} - \mathbf{B}\|_F^2 + \alpha \operatorname{Tr}\!\left(\mathbf{B}\mathbf{L}\mathbf{B}^{\top}\right) - \rho \operatorname{Tr}\!\left(\mathbf{B}^{\top}\mathbf{Z}\right) \\
\text{s.t.} \quad & \mathbf{B} \in \{-1,1\}^{d \times n}, \ \mathbf{Z} \in \mathbb{R}^{d \times n}, \ \mathbf{Z}\mathbf{Z}^{\top} = n\mathbf{I}_d
\end{aligned}
\tag{3}
$$

In the next section, we propose a computationally tractable optimization algorithm to solve (3).
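The pieces of (3) can be checked against a short NumPy sketch that evaluates the objective for given variables; the dense construction of $\mathbf{S}$ and the helper name `coh_objective` are illustrative choices rather than the paper's implementation.

```python
import numpy as np

def coh_objective(X, Y, U, V, B, Z, alpha, rho):
    """Evaluate the CoH objective (3) for given variables.

    Shapes: X is p x n, Y is q x n (binary), U is p x d, V is q x d,
    B is d x n with entries in {-1, +1}, Z is d x n with Z Z^T = n I_d.
    """
    n = X.shape[1]
    # Normalized label-similarity graph S = Y^T diag(Y 1)^{-1} Y and Laplacian L = I_n - S.
    lam_inv = 1.0 / np.maximum(Y.sum(axis=1), 1e-12)   # diagonal of Lambda^{-1}, length q
    S = Y.T @ (lam_inv[:, None] * Y)                   # n x n similarity graph
    L = np.eye(n) - S
    fit_x = np.linalg.norm(U.T @ X - B, "fro") ** 2    # input-view mapping loss
    fit_y = np.linalg.norm(V.T @ Y - B, "fro") ** 2    # output-view mapping loss
    smooth = alpha * np.trace(B @ L @ B.T)             # manifold regularization term
    relax = -rho * float(np.sum(B * Z))                # softened orthogonality term, Tr(B^T Z)
    return fit_x + fit_y + smooth + relax
```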
### Optimization

In essence, (3) is a nonlinear mixed-integer optimization problem involving the discrete constraint on $\mathbf{B}$ and the non-convex orthogonality constraint on $\mathbf{Z}$, and it is generally NP-hard. We propose a tractable alternating algorithm that iteratively optimizes each variable. The overall procedure of CoH is summarized in Algorithm 1.

**Update $\mathbf{U}$, $\mathbf{V}$.** For a given $\mathbf{B}$, (3) reduces to a least-squares minimization problem with respect to $\mathbf{U}$ and $\mathbf{V}$. The closed-form solution of $\mathbf{U}$ is $\mathbf{U} = (\mathbf{X}\mathbf{X}^{\top} + \epsilon\mathbf{I}_p)^{-1}\mathbf{X}\mathbf{B}^{\top}$, where $\epsilon$ is a small non-negative parameter to avoid overfitting; we simply set $\epsilon = 0.001$. The solution of $\mathbf{V}$ is obtained similarly, $\mathbf{V} = (\mathbf{Y}\mathbf{Y}^{\top} + \epsilon\mathbf{I}_q)^{-1}\mathbf{Y}\mathbf{B}^{\top}$.

**Update $\mathbf{B}$.** By dropping the terms irrelevant to $\mathbf{B}$, we have

$$
\begin{aligned}
\max_{\mathbf{B}} \quad & f(\mathbf{B}) = \operatorname{Tr}\!\left(\mathbf{B}\mathbf{S}\mathbf{B}^{\top}\right) + \frac{1}{\alpha}\operatorname{Tr}\!\left[\left(2(\mathbf{U}^{\top}\mathbf{X} + \mathbf{V}^{\top}\mathbf{Y}) + \rho\mathbf{Z}\right)\mathbf{B}^{\top}\right] \\
\text{s.t.} \quad & \mathbf{B} \in \{-1,1\}^{d \times n}
\end{aligned}
\tag{4}
$$

Problem (4) can be solved via the majorization method, which is widely used in statistics. The main idea of majorization is to iteratively optimize a surrogate function. Specifically, we iteratively optimize a local function $\hat{f}_i(\mathbf{B})$ that linearizes $f(\mathbf{B})$ at the point $\mathbf{B}^{(i)}$ and employ $\hat{f}_i(\mathbf{B})$ as a surrogate of $f(\mathbf{B})$. Given $\mathbf{B}^{(i)}$, the next discrete point $\mathbf{B}^{(i+1)}$ is derived by optimizing the following objective function:

$$
\begin{aligned}
\max_{\mathbf{B}} \quad & \hat{f}_i(\mathbf{B}) = f(\mathbf{B}^{(i)}) + \left\langle \nabla f(\mathbf{B}^{(i)}), \mathbf{B} - \mathbf{B}^{(i)} \right\rangle \\
\text{s.t.} \quad & \mathbf{B} \in \{-1,1\}^{d \times n}
\end{aligned}
\tag{5}
$$

where $\nabla f(\mathbf{B}^{(i)}) = 2\mathbf{B}^{(i)}\mathbf{S} + \frac{1}{\alpha}\left[2(\mathbf{U}^{\top}\mathbf{X} + \mathbf{V}^{\top}\mathbf{Y}) + \rho\mathbf{Z}\right]$. Note that $\nabla f(\mathbf{B}^{(i)})$ may contain zero entries, so multiple solutions for $\mathbf{B}^{(i+1)}$ may exist. To avoid this ambiguity, we introduce the function

$$
C(x, y) = \begin{cases} x, & x \neq 0 \\ y, & x = 0 \end{cases}
$$

Then the updating rule for $\mathbf{B}^{(i+1)}$ is defined as

$$
\mathbf{B}^{(i+1)} = \operatorname{sign}\!\left(C\!\left(\nabla f(\mathbf{B}^{(i)}), \mathbf{B}^{(i)}\right)\right)
\tag{6}
$$

where $\operatorname{sign}(\cdot)$ is the sign function and $C$ is applied element-wise.

**Update $\mathbf{Z}$.** The sub-problem with respect to $\mathbf{Z}$ is

$$
\max_{\mathbf{Z}} \ \operatorname{Tr}\!\left(\mathbf{B}^{\top}\mathbf{Z}\right) \quad \text{s.t.} \ \mathbf{Z}\mathbf{Z}^{\top} = n\mathbf{I}_d
\tag{7}
$$

Although the constraint on $\mathbf{Z}$ is non-convex, (7) fortunately admits a closed-form solution. The singular value decomposition (SVD) of $\mathbf{B}$ is $\mathbf{B} = \mathbf{P}\boldsymbol{\Sigma}\mathbf{Q}^{\top} = \sum_{k=1}^{r} \sigma_k \mathbf{p}_k \mathbf{q}_k^{\top}$, where $r$ is the rank of $\mathbf{B}$, $\sigma_1, \ldots, \sigma_r$ are the positive singular values, and $\mathbf{P} = [\mathbf{p}_1, \ldots, \mathbf{p}_r]$ and $\mathbf{Q} = [\mathbf{q}_1, \ldots, \mathbf{q}_r]$ contain the left and right singular vectors, respectively. The closed-form solution of (7) is characterized by the following theorem.

**Theorem 1.** $\mathbf{Z}^{*} = \sqrt{n}\,\mathbf{P}\mathbf{Q}^{\top}$ is an optimal solution to the optimization problem in (7).

*Proof.* Problem (7) is the classic orthogonal Procrustes problem. The proof can be adapted from (Schönemann 1966).

**Algorithm 1: Co-Hashing (CoH)**

Input: training dataset $\{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\}_{i=1}^{n}$, embedding dimensionality $d$, regularization parameter $\alpha$.
Output: $\mathbf{U}$, $\mathbf{V}$, $\mathbf{B}$.

1. Initialize $\mathbf{Z} \in \mathbb{R}^{d \times n}$ by performing PCA on $\mathbf{Y}$.
2. Initialize $\mathbf{B} = \operatorname{sign}(\mathbf{Z})$.
3. repeat
4. Compute the closed-form solution of $\mathbf{U}$: $\mathbf{U} = (\mathbf{X}\mathbf{X}^{\top} + \epsilon\mathbf{I}_p)^{-1}\mathbf{X}\mathbf{B}^{\top}$.
5. Compute the closed-form solution of $\mathbf{V}$: $\mathbf{V} = (\mathbf{Y}\mathbf{Y}^{\top} + \epsilon\mathbf{I}_q)^{-1}\mathbf{Y}\mathbf{B}^{\top}$.
6. Update $\mathbf{B}$ via (6).
7. Compute the SVD of $\mathbf{B}$: $\mathbf{B} = \mathbf{P}\boldsymbol{\Sigma}\mathbf{Q}^{\top}$.
8. Update $\mathbf{Z}$: $\mathbf{Z} = \sqrt{n}\,\mathbf{P}\mathbf{Q}^{\top}$.
9. until convergence.

### Convergence Analysis

We now analyze the convergence of CoH. The convergence of the $\mathbf{B}$ sub-problem is given first.

**Lemma 1.** For the $\mathbf{B}$ sub-problem, let $\mathbf{B}^{(i)}$ be the sequence of binary codes generated by the discrete optimization (6). Then $f(\mathbf{B}^{(i)})$ converges.

*Proof.* Since $\mathbf{S}$ is positive semidefinite, $f$ is a convex function, and hence $f(\mathbf{B}) \geq \hat{f}_i(\mathbf{B})$. Moreover, $\hat{f}_i(\mathbf{B}^{(i+1)}) \geq \hat{f}_i(\mathbf{B}^{(i)})$ and $\hat{f}_i(\mathbf{B}^{(i)}) = f(\mathbf{B}^{(i)})$ hold. Combining these facts gives $f(\mathbf{B}^{(i+1)}) \geq f(\mathbf{B}^{(i)})$, i.e., the updates monotonically increase the objective value of (4). Together with the fact that $f(\mathbf{B}^{(i)})$ is upper bounded, the sequence $f(\mathbf{B}^{(i)})$ converges.

Based on the above lemma, we further obtain the following convergence theorem for CoH.

**Theorem 2.** The alternating updating rules in Algorithm 1 monotonically decrease the objective value of CoH, i.e., (3), in each iteration, and Algorithm 1 converges to a local minimum of CoH.

*Proof.* Let $\{\mathbf{U}_{\ell}, \mathbf{V}_{\ell}, \mathbf{B}_{\ell}, \mathbf{Z}_{\ell}\}$ be the solution at the $\ell$-th iteration of Algorithm 1, and let $\varrho_{\ell} = \varrho(\mathbf{U}_{\ell}, \mathbf{V}_{\ell}, \mathbf{B}_{\ell}, \mathbf{Z}_{\ell})$ be the corresponding objective value. At the $(\ell+1)$-th iteration, since each sub-problem is solved exactly, we have

$$
\varrho(\mathbf{U}_{\ell}, \mathbf{V}_{\ell}, \mathbf{B}_{\ell}, \mathbf{Z}_{\ell})
\geq \varrho(\mathbf{U}_{\ell+1}, \mathbf{V}_{\ell}, \mathbf{B}_{\ell}, \mathbf{Z}_{\ell})
\geq \varrho(\mathbf{U}_{\ell+1}, \mathbf{V}_{\ell+1}, \mathbf{B}_{\ell}, \mathbf{Z}_{\ell})
\geq \varrho(\mathbf{U}_{\ell+1}, \mathbf{V}_{\ell+1}, \mathbf{B}_{\ell+1}, \mathbf{Z}_{\ell})
\geq \varrho(\mathbf{U}_{\ell+1}, \mathbf{V}_{\ell+1}, \mathbf{B}_{\ell+1}, \mathbf{Z}_{\ell+1})
\tag{8}
$$

Thus, the sequence $\{\varrho_{\ell}\}$ is monotonically decreasing. In addition, the objective function $\varrho(\mathbf{U}, \mathbf{V}, \mathbf{B}, \mathbf{Z})$ is lower bounded. Therefore, Algorithm 1 is guaranteed to converge to a local minimum of CoH.

In addition, our empirical study shows that CoH typically converges within about 30 iterations.

### Computational Complexity Analysis

For the training phase of CoH, the updates of $\mathbf{U}$ and $\mathbf{V}$ require $O(p^3 + p^2 n)$ and $O(q^3 + q^2 n)$, respectively. Updating $\mathbf{B}$ requires $O(dpn + dqn)$, and updating $\mathbf{Z}$ takes $O(d^2 n + rdn)$. Given $n \gg p, q > d$, the total training cost of CoH is $O((p^2 + q^2)n)$. In the testing stage, CoH requires $O(pd)$ to generate the binary embedding and then takes $O(\zeta n)$ to predict, where $O(\zeta)$ denotes the cost of a $d$-bit Hamming distance calculation, which is clearly more efficient than $d$ real-valued distance calculations.
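Putting the updates together, the following NumPy sketch mirrors Algorithm 1 under the formulas above; the PCA-based initialization details, the fixed numbers of outer and inner iterations, and the dense $n \times n$ similarity graph are simplifying assumptions made for illustration.

```python
import numpy as np

def train_coh(X, Y, d=64, alpha=1.0, rho=1.0, eps=1e-3, outer_iters=30, inner_iters=5):
    """Sketch of Algorithm 1: alternating optimization of CoH (assumes d <= q)."""
    p, n = X.shape
    q = Y.shape[0]
    # Normalized label-similarity graph S = Y^T diag(Y 1)^{-1} Y (dense here for simplicity).
    lam_inv = 1.0 / np.maximum(Y.sum(axis=1), 1e-12)
    S = Y.T @ (lam_inv[:, None] * Y)                              # n x n

    # Steps 1-2: initialize Z by PCA on Y, then B = sign(Z).
    Yc = Y - Y.mean(axis=1, keepdims=True)
    W, _, _ = np.linalg.svd(Yc, full_matrices=False)              # principal directions of the labels
    Z = W[:, :d].T @ Yc                                           # d x n PCA scores
    B = np.where(Z >= 0, 1.0, -1.0)                               # sign(Z), mapping zeros to +1

    for _ in range(outer_iters):
        # Steps 4-5: closed-form ridge solutions for U and V.
        U = np.linalg.solve(X @ X.T + eps * np.eye(p), X @ B.T)   # p x d
        V = np.linalg.solve(Y @ Y.T + eps * np.eye(q), Y @ B.T)   # q x d

        # Step 6: discrete update of B by iterating rule (6).
        M = (2.0 * (U.T @ X + V.T @ Y) + rho * Z) / alpha
        for _ in range(inner_iters):
            G = 2.0 * B @ S + M                                   # gradient of f at the current B
            B = np.where(G != 0, np.sign(G), B)                   # keep the old bit where the gradient is zero

        # Steps 7-8: orthogonal Procrustes update of Z from the SVD of B.
        P, _, Qt = np.linalg.svd(B, full_matrices=False)
        Z = np.sqrt(n) * (P @ Qt)
    return U, V, B
```

In practice, $\mathbf{B}\mathbf{S}$ would be computed as $((\mathbf{B}\mathbf{Y}^{\top})\boldsymbol{\Lambda}^{-1})\mathbf{Y}$ without forming the $n \times n$ matrix $\mathbf{S}$ explicitly, which keeps the cost linear in $n$, consistent with the complexity analysis above.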
## Multi-Label Prediction via Cross-view Search

This section transforms multi-label prediction into a cross-view search problem. For a testing instance $\mathbf{x}$, we first obtain its semantic binary embedding, i.e., $\operatorname{sign}(\mathbf{U}^{\top}\mathbf{x})$, and then perform cross-view kNN search among the labels in the constructed latent Hamming space. The prediction scheme is detailed in Algorithm 2.

**Algorithm 2: Testing Algorithm**

Input: testing instance $\mathbf{x}$, number of nearest neighbors $k$, mappings of the input and output $\mathbf{U}$, $\mathbf{V}$.
Output: $\mathbf{y}$.

1. Compute the embedding of $\mathbf{x}$: $\mathbf{b} = \operatorname{sign}(\mathbf{U}^{\top}\mathbf{x})$.
2. Compute the embeddings of the training labels, $\mathbf{B}^{(l)} = \operatorname{sign}(\mathbf{V}^{\top}\mathbf{Y})$, or directly use the embedding from the training stage, $\mathbf{B}^{(l)} = \mathbf{B}$.
3. Obtain the set $\mathcal{N}_{\mathbf{x}}$ by finding the $k$ nearest neighbors of $\mathbf{x}$ in $\mathbf{B}^{(l)}$.
4. Obtain the prediction $\mathbf{y}$ by voting over $\mathcal{N}_{\mathbf{x}}$.

The Hamming distance between $\mathbf{x}$ and the label of the $i$-th training instance can be computed as $\|\operatorname{sign}(\mathbf{U}^{\top}\mathbf{x}) - \operatorname{sign}(\mathbf{V}^{\top}\mathbf{y}^{(i)})\|_F^2$. In practice, the distance can alternatively be computed as $\|\operatorname{sign}(\mathbf{U}^{\top}\mathbf{x}) - \mathbf{b}^{(i)}\|_F^2$, i.e., by reusing the binary codes learned during training.
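A minimal NumPy sketch of Algorithm 2 is given below; the brute-force distance computation and the 0.5 majority-vote threshold are assumptions made for illustration, since the paper only specifies voting over the retrieved neighbors.

```python
import numpy as np

def predict_coh(x, U, V, Y_train, B_train=None, k=10):
    """Sketch of Algorithm 2: cross-view kNN prediction in the Hamming space."""
    b = np.where(U.T @ x >= 0, 1.0, -1.0)             # step 1: d-bit code of the test instance
    # Step 2: label-side codes, either recomputed from V or reused from training.
    B_l = np.where(V.T @ Y_train >= 0, 1.0, -1.0) if B_train is None else B_train
    # Step 3: Hamming distances; for +/-1 codes, #differing bits = (d - b_i^T b) / 2.
    ham = 0.5 * (b.shape[0] - B_l.T @ b)
    nn = np.argsort(ham)[:k]                          # indices of the k nearest label codes
    # Step 4: predict by voting over the neighbours' label vectors (0.5 threshold assumed).
    votes = Y_train[:, nn].mean(axis=1)
    return (votes >= 0.5).astype(int)
```

In a deployment, the $\pm 1$ codes would typically be packed into machine words (e.g., with `np.packbits`) so that each distance reduces to a few XOR and popcount operations, which is the source of the Hamming-space speedups discussed in this paper.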
## Generalization Error Bound

CoH is characterized by a distribution $\mathcal{D}$ on the space of input and output $\mathcal{X} \times \{0,1\}^q$, where $\mathcal{X} \subseteq \mathbb{R}^p$. Let a sample $(\mathbf{x}^{(j)}, \mathbf{y}^{(j)})$ be drawn i.i.d. from the distribution $\mathcal{D}$, where $\mathbf{y}^{(j)} \in \{0,1\}^q$ ($j \in \{1, \ldots, n\}$) is the ground-truth label vector. Assume $n$ samples $D = \{(\mathbf{x}^{(1)}, \mathbf{y}^{(1)}), \ldots, (\mathbf{x}^{(n)}, \mathbf{y}^{(n)})\}$ are drawn i.i.d. $n$ times from the distribution $\mathcal{D}$, which is denoted by $D \sim \mathcal{D}^n$. Let $f^{D}_{knn_i}(\mathbf{x})$ denote the prediction of the $i$-th label for input $\mathbf{x}$ using CoH+kNN trained on $D$. The performance of CoH+kNN, $(f^{D}_{knn_1}(\cdot), \ldots, f^{D}_{knn_q}(\cdot)): \mathcal{X} \rightarrow \{0,1\}^q$, is measured by its generalization error, i.e., its expected loss on a new example $(\mathbf{x}, \mathbf{y})$ drawn according to $\mathcal{D}$:

$$
\mathbb{E}_{D \sim \mathcal{D}^n, (\mathbf{x}, \mathbf{y}) \sim \mathcal{D}} \sum_{i=1}^{q} \ell\!\left(y_i, f^{D}_{knn_i}(\mathbf{x})\right)
\tag{9}
$$

where $y_i$ denotes the $i$-th label and $\ell(y_i, f^{D}_{knn_i}(\mathbf{x}))$ is the loss function for the $i$-th label. We use the following loss function for analysis:

$$
\ell\!\left(y_i, f^{D}_{knn_i}(\mathbf{x})\right) = \mathbb{P}\!\left(y_i \neq f^{D}_{knn_i}(\mathbf{x})\right)
\tag{10}
$$

For the $i$-th label, we further define

$$
\nu^{i}_{j}(\mathbf{x}) = \mathbb{P}(y_i = j \mid \mathbf{x}), \quad j \in \{0,1\}
\tag{11}
$$

The Bayes optimal classifier $b^{*}_{i}$ for the $i$-th label is defined as

$$
b^{*}_{i}(\mathbf{x}) = \arg\max_{j \in \{0,1\}} \nu^{i}_{j}(\mathbf{x})
\tag{12}
$$

Before deriving our results, we first present several important definitions and theorems.

**Definition 1 (Covering Numbers, (Shawe-Taylor et al. 1998)).** Let $(\mathcal{X}, d)$ be a metric space, $A$ a subset of $\mathcal{X}$, and $\varepsilon > 0$. A set $B \subseteq \mathcal{X}$ is an $\varepsilon$-cover for $A$ if for every $a \in A$ there exists $b \in B$ such that $d(a, b) < \varepsilon$. The $\varepsilon$-covering number of $A$, $N(\varepsilon, A, d)$, is the minimal cardinality of an $\varepsilon$-cover for $A$ (if there is no such finite cover, it is defined as $\infty$).

**Definition 2 (Doubling Dimension, (Krauthgamer and Lee 2004; Kontorovich and Weiss 2014)).** Let $(\mathcal{X}, d)$ be a metric space, and let $\lambda$ be the smallest value such that every ball in $\mathcal{X}$ can be covered by $\lambda$ balls of half the radius. The doubling dimension of $\mathcal{X}$ is defined as $\operatorname{ddim}(\mathcal{X}) = \log_2(\lambda)$.

**Theorem 3 ((Krauthgamer and Lee 2004; Kontorovich and Weiss 2014)).** Let $(\mathcal{X}, d)$ be a metric space with diameter $\operatorname{diam}(\mathcal{X}) = \sup_{\mathbf{x}, \mathbf{x}' \in \mathcal{X}} d(\mathbf{x}, \mathbf{x}')$. The $\varepsilon$-covering number of $\mathcal{X}$, $N(\varepsilon, \mathcal{X}, d)$, is bounded by

$$
N(\varepsilon, \mathcal{X}, d) \leq \left(\frac{2\operatorname{diam}(\mathcal{X})}{\varepsilon}\right)^{\operatorname{ddim}(\mathcal{X})}
\tag{13}
$$

We provide the following generalization error bound for CoH+1NN.

**Theorem 4.** Given a metric space $(\mathcal{X}, d_{pro})$, assume the function $\nu^{i}: \mathcal{X} \rightarrow [0,1]$ is Lipschitz with constant $L$ with respect to the sup-norm for each label. Suppose $\mathcal{X}$ has a finite doubling dimension $\operatorname{ddim}(\mathcal{X}) = D < \infty$ and $\operatorname{diam}(\mathcal{X}) = 1$. Let $D = \{(\mathbf{x}^{(1)}, \mathbf{y}^{(1)}), \ldots, (\mathbf{x}^{(n)}, \mathbf{y}^{(n)})\}$ and $(\mathbf{x}, \mathbf{y})$ be drawn i.i.d. from the distribution $\mathcal{D}$. Then

$$
\mathbb{E}_{D \sim \mathcal{D}^n, (\mathbf{x}, \mathbf{y}) \sim \mathcal{D}} \sum_{i=1}^{q} \mathbb{P}\!\left(y_i \neq f^{D}_{1nn_i}(\mathbf{x})\right)
\leq \sum_{i=1}^{q} 2\,\mathbb{P}\!\left(b^{*}_{i}(\mathbf{x}) \neq y_i\right) + 3qL\left(\|\mathbf{U}\|_F + \|\mathbf{V}\|_F\right) n^{-\frac{1}{D+1}}
$$

Inspired by Theorem 19.5 in (Shalev-Shwartz and Ben-David 2014), we derive the following lemma for CoH+kNN.

**Lemma 2.** Given a metric space $(\mathcal{X}, d_{pro})$, assume the function $\nu^{i}: \mathcal{X} \rightarrow [0,1]$ is Lipschitz with constant $L$ with respect to the sup-norm for each label. Suppose $\mathcal{X}$ has a finite doubling dimension $\operatorname{ddim}(\mathcal{X}) = D < \infty$ and $\operatorname{diam}(\mathcal{X}) = 1$. Let $D = \{(\mathbf{x}^{(1)}, \mathbf{y}^{(1)}), \ldots, (\mathbf{x}^{(n)}, \mathbf{y}^{(n)})\}$ and $(\mathbf{x}, \mathbf{y})$ be drawn i.i.d. from the distribution $\mathcal{D}$. Then

$$
\mathbb{E}_{D \sim \mathcal{D}^n, (\mathbf{x}, \mathbf{y}) \sim \mathcal{D}} \sum_{i=1}^{q} \mathbb{P}\!\left(y_i \neq f^{D}_{knn_i}(\mathbf{x})\right)
\leq \sum_{i=1}^{q} \left(1 + \sqrt{8/k}\right)\mathbb{P}\!\left(b^{*}_{i}(\mathbf{x}) \neq y_i\right) + q\left(6L\left(\|\mathbf{U}\|_F + \|\mathbf{V}\|_F\right) + k\right) n^{-\frac{1}{D+1}}
$$

The following corollary reveals important statistical properties of CoH+1NN and CoH+kNN.

**Corollary 1.** As $n$ goes to infinity, the errors of CoH+1NN and CoH+kNN converge to the sum over the labels of twice the Bayes error and of $1 + \sqrt{8/k}$ times the Bayes error, respectively.

## Experiments

In this section, we evaluate the performance of the proposed method for multi-label classification. All computations are performed on a Red Hat Enterprise 64-bit Linux workstation with an 18-core Intel Xeon E5-2680 2.80 GHz CPU and 256 GB of memory.

### Experimental Setup

**Datasets.** The experiments are conducted on eight multi-label real-world datasets, including six medium-sized datasets¹, i.e., CAL500, COREL5K, DELICIOUS, EUR-LEX, NUS-WIDE-V, and NUS-WIDE-B, and two large-scale datasets², i.e., WIKI10 and DELICIOUS-L. We split the training and testing sets of the two NUS-WIDE datasets, WIKI10, and DELICIOUS-L following the publicly available experimental settings (Chua et al. 2009; Bhatia et al. 2015), while 10-fold cross-validation is applied to the other datasets. The statistics of the eight real-world datasets are summarized in Table 1.

Table 1: Statistics of the eight real-world datasets.

| Dataset | #Instances | #Training | #Testing | #Features | #Labels | Label cardinality | Domain |
|---|---|---|---|---|---|---|---|
| CAL500 | 502 | 452 | 50 | 68 | 174 | 26.044 | music |
| COREL5K | 5,000 | 4,500 | 500 | 499 | 374 | 3.522 | images |
| DELICIOUS | 16,105 | 14,495 | 1,610 | 500 | 983 | 19.020 | text |
| EUR-LEX | 19,348 | 17,413 | 1,935 | 5,000 | 3,993 | 1.292 | text |
| NUS-WIDE-V | 269,648 | 161,789 | 107,859 | 128 | 81 | 1.869 | images |
| NUS-WIDE-B | 269,648 | 161,789 | 107,859 | 500 | 81 | 1.869 | images |
| WIKI10 | 20,762 | 14,146 | 6,616 | 101,938 | 30,938 | 18.64 | text |
| DELICIOUS-L | 296,701 | 196,606 | 100,095 | 782,585 | 205,443 | 75.54 | text |

**Comparison Methods.** We compare the proposed method with seven state-of-the-art multi-label learning methods: BR (Tsoumakas, Katakis, and Vlahavas 2009), PLST (Tai and Lin 2012), CCA (Zhang and Schneider 2011), kNN, ML-kNN (Zhang and Zhou 2007), LM-kNN (Liu and Tsang 2015), and SLEEC (Bhatia et al. 2015). The linear classification/regression package LIBLINEAR (Fan et al. 2008) with ℓ₂-regularized logistic regression is adopted to train the classifier for BR. Following the experimental settings of (Liu and Tsang 2015), we set η = 0.4 in LM-kNN and C = 10 in BR and LM-kNN. According to the original settings (Bhatia et al. 2015), we set the number of clusters to n/6000 and the number of learners to 15 for SLEEC. In the proposed CoH, the regularization parameter α is empirically set to 100 for the two NUS-WIDE datasets and 1 for the other datasets.
The dimension of the common subspace in SLEEC and the proposed CoH is empirically set to 50 for the two NUS-WIDE, WIKI10, and DELICIOUS-L datasets, and to 100 for the others.

Table 2: Predictive performance comparison on the six medium-sized real-world datasets (mean ± standard deviation where 10-fold cross-validation is used). N/A denotes not available.

Hamming loss:

| Dataset | BR | PLST | CCA | kNN | ML-kNN | LM-kNN | SLEEC | CoH |
|---|---|---|---|---|---|---|---|---|
| CAL500 | .1390±.0113 | .1376±.0043 | .0916±.0022 | .1453±.0047 | .1398±.0055 | .1493±.0031 | .1489±.0056 | .1124±.0037 |
| COREL5K | .0174±.0057 | .0094±.0001 | N/A | .0095±.0001 | .0094±.0001 | .0095±.0002 | .0092±.0001 | .0082±.0002 |
| DELICIOUS | .0181±.0002 | .0184±.0002 | N/A | .0187±.0002 | .0182±.0002 | .0179±.0002 | .0176±.0002 | .0162±.0003 |
| EUR-LEX | .0331±.0015 | .0014±.0001 | N/A | .0011±.0001 | .0011±.0001 | .0009±.0000 | .0010±.0000 | .0008±.0000 |
| NUS-WIDE-V | .0209 | .2140 | N/A | .0213 | .1590 | .0213 | .0203 | .0165 |
| NUS-WIDE-B | .0215 | .0217 | N/A | .0228 | .0836 | .0216 | .0213 | .0193 |

Example-F1:

| Dataset | BR | PLST | CCA | kNN | ML-kNN | LM-kNN | SLEEC | CoH |
|---|---|---|---|---|---|---|---|---|
| CAL500 | .3446±.0150 | .3062±.0108 | .3520±.0191 | .3561±.0131 | .3216±.0180 | .3511±.0161 | .3099±.0315 | .3602±.0216 |
| COREL5K | .0781±.0166 | .0587±.0054 | N/A | .0223±.0056 | .0178±.0069 | .1295±.0092 | .0839±.0061 | .1996±.0108 |
| DELICIOUS | .2093±.0051 | .1131±.0039 | N/A | .1878±.0046 | .1518±.0044 | .2553±.0043 | .2197±.0044 | .2745±.0033 |
| EUR-LEX | .4013±.0098 | .0725±.0083 | N/A | .3409±.0057 | .3005±.0049 | .3818±.0045 | .3705±.0058 | .3915±.0076 |
| NUS-WIDE-V | .1255 | .0097 | N/A | .1382 | .1476 | .0982 | .1703 | .2838 |
| NUS-WIDE-B | .0981 | .0825 | N/A | .1263 | .0815 | .0877 | .1528 | .2622 |

Micro-F1:

| Dataset | BR | PLST | CCA | kNN | ML-kNN | LM-kNN | SLEEC | CoH |
|---|---|---|---|---|---|---|---|---|
| CAL500 | .3448±.0161 | .3032±.0095 | .3537±.0213 | .3593±.0127 | .3184±.0162 | .3542±.0157 | .3077±.0321 | .3664±.0234 |
| COREL5K | .0956±.0260 | .0801±.0081 | N/A | .0321±.0075 | .0278±.0117 | .1670±.0137 | .1215±.0093 | .2127±.0111 |
| DELICIOUS | .2447±.0051 | .1423±.0058 | N/A | .2154±.0047 | .1738±.0047 | .3104±.0058 | .2532±.0035 | .3011±.0039 |
| EUR-LEX | .4266±.0070 | .1059±.0115 | N/A | .4011±.0061 | .3489±.0055 | .4344±.0051 | .4235±.0056 | .4689±.0085 |
| NUS-WIDE-V | .2584 | .1986 | N/A | .2772 | .2826 | .2093 | .3249 | .3282 |
| NUS-WIDE-B | .2162 | .1910 | N/A | .2458 | .1768 | .1993 | .2822 | .3119 |

Macro-F1:

| Dataset | BR | PLST | CCA | kNN | ML-kNN | LM-kNN | SLEEC | CoH |
|---|---|---|---|---|---|---|---|---|
| CAL500 | .0781±.0115 | .0410±.0019 | .0917±.0022 | .0971±.0060 | .0534±.0034 | .1001±.0067 | .0462±.0081 | .1173±.0115 |
| COREL5K | .0276±.0039 | .0111±.0017 | N/A | .0052±.0014 | .0086±.0036 | .0266±.0033 | .0240±.0022 | .0423±.0049 |
| DELICIOUS | .0933±.0042 | .0283±.0008 | N/A | .0550±.0024 | .0481±.0018 | .1383±.0074 | .0702±.0014 | .0993±.0040 |
| EUR-LEX | .1039±.0066 | .0204±.0031 | N/A | .0877±.0032 | .0635±.0019 | .0919±.0029 | .0805±.0030 | .1053±.0046 |
| NUS-WIDE-V | .0266 | .0162 | N/A | .0682 | .0159 | .0171 | .0437 | .0498 |
| NUS-WIDE-B | .0202 | .0139 | N/A | .0373 | .0206 | .0143 | .0302 | .0454 |

¹ http://mulan.sourceforge.net
² http://manikvarma.org/downloads/XC/XMLRepository.html
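
For reference, the metrics reported in Table 2 can be computed as in the following sketch; it uses the standard definitions of these measures on 0/1 label matrices, and returning 0 for empty rows or columns is an assumed convention.

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of label entries predicted incorrectly (both matrices are q x n, binary)."""
    return float(np.mean(Y_true != Y_pred))

def _f1(tp, size_true, size_pred):
    denom = size_true + size_pred
    return np.where(denom > 0, 2.0 * tp / np.maximum(denom, 1), 0.0)

def example_f1(Y_true, Y_pred):
    """F1 computed per instance (column), then averaged over instances."""
    tp = (Y_true * Y_pred).sum(axis=0)
    return float(np.mean(_f1(tp, Y_true.sum(axis=0), Y_pred.sum(axis=0))))

def micro_f1(Y_true, Y_pred):
    """F1 with true positives and label counts pooled over all entries."""
    return float(_f1((Y_true * Y_pred).sum(), Y_true.sum(), Y_pred.sum()))

def macro_f1(Y_true, Y_pred):
    """F1 computed per label (row), then averaged over labels."""
    tp = (Y_true * Y_pred).sum(axis=1)
    return float(np.mean(_f1(tp, Y_true.sum(axis=1), Y_pred.sum(axis=1))))
```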
Table 3: Efficiency comparison on the six medium-sized real-world datasets. Training and testing times are in seconds (shown as training / testing). N/A denotes not available.

| Method | CAL500 | COREL5K | DELICIOUS | EUR-LEX | NUS-WIDE-V | NUS-WIDE-B |
|---|---|---|---|---|---|---|
| BR | 1.33 / 1.78×10⁻³ | 8.63 / 6.28×10⁻³ | 120.70 / 0.03 | 2.14×10³ / 0.49 | 222.21 / 0.07 | 511.83 / 0.02 |
| PLST | 0.02 / 4.00×10⁻³ | 0.82 / 0.12 | 9.15 / 0.91 | 106.26 / 13.94 | 0.98 / 0.07 | 2.49 / 0.14 |
| CCA | 895.50 / 1.43×10⁴ | N/A | N/A | N/A | N/A | N/A |
| kNN | N/A / 0.02 | N/A / 3.24 | N/A / 30.95 | N/A / 393.87 | N/A / 3.70×10³ | N/A / 1.24×10⁴ |
| ML-kNN | 1.03 / 0.19 | 23.01 / 4.02 | 178.49 / 29.93 | 1.03×10³ / 155.60 | 5.57×10³ / 3.66×10³ | 3.02×10⁴ / 2.02×10⁴ |
| LM-kNN | 13.23 / 0.03 | 1.20×10³ / 2.35 | 2.40×10⁴ / 56.31 | 4.87×10³ / 316.75 | 1.64×10⁴ / 4.83×10³ | 2.08×10⁴ / 3.17×10³ |
| SLEEC | 27.99 / 0.06 | 736.45 / 3.93 | 3.28×10³ / 16.62 | 5.23×10³ / 41.73 | 9.72×10³ / 914.56 | 1.14×10⁴ / 1.32×10³ |
| CoH | 0.12 / 4.00×10⁻³ | 3.82 / 0.05 | 22.31 / 0.78 | 115.23 / 1.34 | 44.52 / 246.50 | 52.74 / 248.56 |

Following similar settings to (Zhang and Zhou 2007; Bhatia et al. 2015), the $k$ in kNN search is selected by 10-fold cross-validation over the range {1, 5, 10, 20} for all kNN-based methods. The running time of MMOC (Zhang and Schneider 2012) exceeds one week on most datasets, so we do not report its results. Our empirical study shows that CoH performs well over a wide range of parameter values.

**Evaluation Metrics.** Following (Guo and Schuurmans 2013; Liu and Tsang 2015; Bhatia et al. 2015), we adopt widely used metrics to evaluate the prediction performance of all methods: Hamming loss, Example-F1, Micro-F1, and Macro-F1 for the medium-sized datasets, and Precision@5 and nDCG@5 for the large-scale datasets.

### Results on Medium-Sized Datasets

**Performance.** We conduct the performance evaluation on the six medium-sized multi-label datasets. Table 2 reports the prediction performance of all methods in terms of Hamming loss, Example-F1, Micro-F1, and Macro-F1. From Table 2, we observe the following:

- The proposed CoH generally achieves the best performance on the six medium-sized datasets. For example, on the NUS-WIDE-B dataset, CoH improves the best baseline results in terms of Hamming loss, Example-F1, Micro-F1, and Macro-F1 by 0.2%, 11.0%, 2.9%, and 1.5%, respectively. These results demonstrate the superior performance of the proposed CoH and corroborate our theoretical results.
- Among the kNN-based baselines, LM-kNN and SLEEC achieve the better performance, verifying their effectiveness. The conventional kNN method usually outperforms ML-kNN.
- CCA can only be run on the small CAL500 dataset due to its very high time cost. PLST generally underperforms on all datasets, which is consistent with the empirical results in (Tai and Lin 2012; Liu and Tsang 2015). BR is inferior to the proposed CoH, mainly because it fails to consider the correlations between labels.

**Time.** Table 3 reports both the training and testing time of all methods. In terms of training time, PLST is the fastest but performs poorly. The proposed CoH generally takes second place in training efficiency and is much faster than the other kNN-based multi-label methods.
For instance, on the NUS-WIDE-B dataset, CoH is significantly more efficient: nearly 400 times faster than LM-kNN and 200 times faster than SLEEC. CCA is the most time-consuming because of its expensive encoding procedure.

In terms of testing time, the proposed CoH is the most efficient among the kNN-based methods, owing to the efficient computation in the Hamming space. SLEEC is also fast on the two NUS-WIDE datasets, as it uses a clustering technique to accelerate prediction. CoH is nearly 4 times faster than SLEEC and at least 10 times faster than the other kNN-based methods on the two NUS-WIDE datasets. In addition, BR and PLST are also efficient because of their simple decoding schemes, which, however, fail to obtain good performance. CCA is significantly slower than the other methods.

### Results on Large-Scale Datasets

We evaluate performance on the two large-scale datasets, WIKI10 and DELICIOUS-L. The label vectors in large-scale datasets are highly sparse, so we can further enforce a sparsity constraint on the transformation matrices $\mathbf{U}$ and $\mathbf{V}$ in CoH, which helps to select the most meaningful labels and reduces the computation cost. The efficient feature-generating paradigm (Tan, Tsang, and Wang 2014; Liu and Tsang 2017) is used to select labels, and it can easily handle millions of labels. The performance of SLEEC and CoH is reported in Table 4. From Table 4, we can see that CoH achieves comparable accuracy with less training and testing time.

Table 4: Performance comparison on the two large-scale real-world datasets. Training and testing times are in seconds.

| Method | WIKI10 P@5 | WIKI10 nDCG@5 | WIKI10 Train (s) | WIKI10 Test (s) | DELICIOUS-L P@5 | DELICIOUS-L nDCG@5 | DELICIOUS-L Train (s) | DELICIOUS-L Test (s) |
|---|---|---|---|---|---|---|---|---|
| SLEEC | .6270 | .6813 | 1.97×10³ | 49.54 | .3943 | .4137 | 5.64×10⁴ | 2.64×10³ |
| CoH | .6334 | .6678 | 2.59×10² | 6.72 | .4136 | .4256 | 2.39×10³ | 478.58 |

We also present a case study in which the proposed CoH is applied to multi-label image annotation. We apply CoH to the NUS-WIDE-V dataset and illustrate the annotation results of several randomly selected images in Figure 1. CoH correctly predicts most labels for these images, which suggests that CoH works well in real-world image annotation applications.

Figure 1: Several multi-label image annotation examples on the NUS-WIDE-V dataset. For each image, we show the ground-truth annotations and the labels predicted by the proposed CoH.

## Conclusion

This paper provides a cross-view insight into multi-label prediction that fully exploits the correlations between the input and output. The proposed CoH learns a latent discrete code space shared by both the input and output, and performs fast multi-label prediction via efficient cross-view search in the Hamming space. We empirically show that, on the NUS-WIDE datasets, CoH improves the prediction efficiency of LM-kNN and SLEEC by around 10 and 4 times, respectively. The proposed CoH is theoretically shown to diminish the generalization error bound of multi-label prediction. The empirical studies verify our theoretical results, and the proposed CoH achieves superior performance over the state-of-the-art methods with less computation cost.

## Acknowledgments

This research is supported by the National Science Foundation of China (Grants No. 61273251 and 61673220), ARC Future Fellowship FT130100746, ARC grants LP150100671 and DP180100106, and is partially supported by the Data Science and Artificial Intelligence Center (DSAIR) at Nanyang Technological University.

## References

Belkin, M.; Niyogi, P.; and Sindhwani, V. 2006. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR 7:2399–2434.
Bhatia, K.; Jain, H.; Kar, P.; Varma, M.; and Jain, P. 2015. Sparse local embeddings for extreme multi-label classification. In NIPS, 730–738.

Chen, Y.-N., and Lin, H.-T. 2012. Feature-aware label space dimension reduction for multi-label classification. In NIPS, 1529–1537.

Chua, T.; Tang, J.; Hong, R.; Li, H.; Luo, Z.; and Zheng, Y. 2009. NUS-WIDE: A real-world web image database from National University of Singapore. In CIVR.

Dhillon, P.; Foster, D. P.; and Ungar, L. H. 2011. Multi-view learning of word embeddings via CCA. In NIPS, 199–207.

Fan, R.-E.; Chang, K.-W.; Hsieh, C.-J.; Wang, X.-R.; and Lin, C.-J. 2008. LIBLINEAR: A library for large linear classification. JMLR 9:1871–1874.

Gong, Y.; Lazebnik, S.; Gordo, A.; and Perronnin, F. 2013. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. TPAMI 35(12):2916–2929.

Guo, Y., and Schuurmans, D. 2013. Multi-label classification with output kernels. In ECML/PKDD, 417–432.

Guo, Y., and Xiao, M. 2012. Cross language text classification via subspace co-regularized multi-view learning. In ICML, 1615–1622.

Hsu, D.; Kakade, S.; Langford, J.; and Zhang, T. 2009. Multi-label prediction via compressed sensing. In NIPS, 772–780.

Kontorovich, A., and Weiss, R. 2014. Maximum margin multiclass nearest neighbors. In ICML, 892–900.

Krauthgamer, R., and Lee, J. R. 2004. Navigating nets: Simple algorithms for proximity search. In SODA, 798–807.

Lai, H.; Yan, P.; Shu, X.; Wei, Y.; and Yan, S. 2016. Instance-aware hashing for multi-label image retrieval. TIP 25(6):2469–2479.

Liu, W., and Tsang, I. W. 2015. Large margin metric learning for multi-label prediction. In AAAI, 2800–2806.

Liu, W., and Tsang, I. W. 2017. Making decision trees feasible in ultrahigh feature and label dimensions. JMLR 18(81):1–36.

Liu, W.; Tsang, I. W.; and Müller, K.-R. 2017. An easy-to-hard learning paradigm for multiple classes and multiple labels. JMLR 18(94):1–38.

Quadrianto, N., and Lampert, C. H. 2011. Learning multi-view neighborhood preserving projections. In ICML, 425–432.

Schönemann, P. H. 1966. A generalized solution of the orthogonal Procrustes problem. Psychometrika 31(1):1–10.

Shalev-Shwartz, S., and Ben-David, S. 2014. Understanding Machine Learning: From Theory to Algorithms. New York, USA: Cambridge University Press.

Shawe-Taylor, J.; Bartlett, P. L.; Williamson, R. C.; and Anthony, M. 1998. Structural risk minimization over data-dependent hierarchies. TIT 44(5):1926–1940.

Shen, F.; Shen, C.; Liu, W.; and Tao Shen, H. 2015. Supervised discrete hashing. In CVPR, 37–45.

Shen, X.; Liu, W.; Tsang, I. W.; Shen, F.; and Sun, Q. 2017a. Compressed k-means for large-scale clustering. In AAAI, 2527–2533.

Shen, X.; Shen, F.; Sun, Q.-S.; Yang, Y.; Yuan, Y.-H.; and Shen, H. T. 2017b. Semi-paired discrete hashing: Learning latent hash codes for semi-paired cross-view retrieval. TCYB 47(12):4275–4288.

Tai, F., and Lin, H.-T. 2012. Multilabel classification with principal label space transformation. Neural Computation 24(9):2508–2542.

Tan, M.; Tsang, I. W.; and Wang, L. 2014. Towards ultrahigh dimensional feature selection for big data. JMLR 15(1):1371–1429.

Tsoumakas, G.; Katakis, I.; and Vlahavas, I. 2009. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, 667–685.

Wang, J.; Liu, W.; Kumar, S.; and Chang, S.-F. 2016. Learning to hash for indexing big data - a survey. Proceedings of the IEEE 104(1):34–57.
Weiss, Y.; Torralba, A.; and Fergus, R. 2009. Spectral hashing. In NIPS, 1753–1760.

Zhang, Y., and Schneider, J. 2011. Multi-label output codes using canonical correlation analysis. In AISTATS, 873–882.

Zhang, Y., and Schneider, J. 2012. Maximum margin output coding. In ICML, 1575–1582.

Zhang, M.-L., and Zhou, Z.-H. 2007. ML-KNN: A lazy learning approach to multi-label learning. PR 40(7):2038–2048.

Zhang, M.-L., and Zhou, Z.-H. 2014. A review on multi-label learning algorithms. TKDE 26(8):1819–1837.

Zhao, F.; Huang, Y.; Wang, L.; and Tan, T. 2015. Deep semantic ranking based hashing for multi-label image retrieval. In CVPR, 1556–1564.