Journal of Machine Learning Research 20 (2019) 1-31 Submitted 12/13; Revised 4/17; Published 4/19

Multi-class Heterogeneous Domain Adaptation

Joey Tianyi Zhou ZHOUTY@IHPC.A-STAR.EDU.SG
Institute of High Performance Computing (IHPC)
Agency for Science, Technology and Research (A*STAR), Singapore 138632

Ivor W. Tsang IVOR.TSANG@UTS.EDU.AU
Centre for Artificial Intelligence
University of Technology Sydney, Australia

Sinno Jialin Pan SINNOPAN@NTU.EDU.SG
School of Computer Science and Engineering
Nanyang Technological University, Singapore 639798

Mingkui Tan MINGKUITAN@SCUT.EDU.CN
School of Software Engineering
South China University of Technology, China

Editor: Shie Mannor

Abstract

A crucial issue in heterogeneous domain adaptation (HDA) is learning a feature mapping between different types of features across domains. Inspired by language translation, in which a word translated from one language corresponds to only a few words in another, we present an efficient method named Sparse Heterogeneous Feature Representation (SHFR) for multi-class HDA that learns a sparse feature transformation between domains with multiple classes. Specifically, we formulate the problem of learning the feature transformation as a compressed sensing problem by treating the binary classifiers in the target domain, decomposed from the target multi-class classification problem, as measurement sensors. We show that the estimation error of the learned transformation decreases as the number of binary classifiers increases. In other words, for adaptation across heterogeneous domains to be successful, it is necessary to construct a sufficient number of incoherent binary classifiers from the original multi-class classification problem. To achieve this, we propose to apply the Error Correcting Output Codes (ECOC) scheme to generate incoherent classifiers.
To speed up the learning of the feature transformation across domains, we apply an efficient batch-mode algorithm to solve the resultant nonnegative sparse recovery problem. Theoretically, we present a generalization error bound for our proposed HDA method under a multi-class setting. Lastly, we conduct extensive experiments on both synthetic and real-world datasets to demonstrate the superiority of our proposed method over existing state-of-the-art HDA methods in terms of prediction accuracy and training efficiency.

Keywords: Heterogeneous domain adaptation, multi-class classification, compressed sensing.

©2019 Joey Tianyi Zhou, Ivor W. Tsang, Sinno Jialin Pan, Mingkui Tan. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/13-580.html.

1. Introduction

In many applications, it is often expensive or time consuming to annotate sufficient labeled training data to build machine learning-based information systems. Often, only a few labeled training instances are available in the domain of interest, while many labeled training instances are available in another domain, referred to as a source domain. Because of the differences between domains, a model trained on the labeled data of the source domain cannot be applied directly to the target domain. To address this problem, domain adaptation, which aims to adapt a model from a source domain to a target domain with little or no additional human supervision, has been proposed (Blitzer et al., 2006; Jiang and Zhai, 2007; Huang et al., 2007; Pan et al., 2008, 2011).
Apart from thorough theoretical studies (Ben-David et al., 2006; Ben-David et al., 2010), domain adaptation techniques have been successfully applied to various applications, such as WiFi localization (Pan et al., 2008, 2011), natural language processing (Daumé III, 2007; Jiang and Zhai, 2007), sentiment analysis (Blitzer et al., 2007b; Pan et al., 2010; Seah et al., 2011; Zhou et al., 2014, 2016a), object categorization (Duan et al., 2009; Gong et al., 2012; Hoffman et al., 2013), object detection (Donahue et al., 2013; Hoffman et al., 2014), and information retrieval (Zhou et al., 2016b, 2018). Most existing domain adaptation works are based on the assumption that data from different domains can be represented by the same feature space of the same dimensionality (Pan and Yang, 2010). However, this assumption may not hold in many real-world scenarios. For instance, in a cross-language (e.g., English/Spanish) text classification task (Prettenhofer and Stein, 2010), the feature spaces correspond to the vocabularies of the different languages, so the features across domains are neither in the same feature space nor of the same dimensionality. To relax this assumption, there has been increased focus on domain adaptation across heterogeneous feature spaces, referred to as Heterogeneous Domain Adaptation (HDA) (Dai et al., 2008; Yang et al., 2009). Besides cross-language text classification, many other real-world applications can be formulated as HDA problems, such as image classification using auxiliary text data (Zhu et al., 2011), where the heterogeneous features correspond to pixels and bags of textual words, respectively, and image classification using auxiliary images with different sets of features (Saenko et al., 2010; Kulis et al., 2011), where the heterogeneous features correspond to different computer vision features. Existing HDA approaches can be classified into two main categories.
In the first category, two dense feature mappings P and Q are learned to transform the source domain data $X_S$ and the target domain data $X_T$, respectively, into a new latent common feature space such that the difference between the mapped domain data $PX_S$ and $QX_T$ is reduced (Shi et al., 2010; Prettenhofer and Stein, 2010; Wang and Mahadevan, 2011; Duan et al., 2012). For example, Shi et al. (2010) proposed a heterogeneous spectral mapping (HeMap) method to learn dense orthogonal mappings based on spectral embedding without using any label information. The resultant optimization problem is a standard eigen-decomposition problem. However, this approach is known to suffer from scalability issues. Wang and Mahadevan (2011) proposed a manifold alignment method known as DAMA to align heterogeneous features in a latent space based on manifold regularization. Unfortunately, DAMA only works on data with a manifold structure, which limits its applicability to data for which the manifold assumption is satisfied. Furthermore, DAMA results in a generalized eigen-decomposition problem over a series of matrices whose sizes depend on the dimensionality of the data. Duan et al. (2012) proposed a Heterogeneous Feature Augmentation (HFA) method to augment heterogeneous features with homogeneous common features learned using a maximum-margin approach from both source and target domains. However, the proposed model results in an expensive semidefinite programming (SDP) problem. In the second category, a dense feature mapping G is learned to transform heterogeneous data from one domain to the other directly, so that the difference between $GX_S$ and $X_T$ is minimized or the alignment between $GX_S$ and $X_T$ is maximized. Harel and Mannor (2011) proposed a method named MOMAP that learns dense rotation matrices to match the data distributions of the source and target domains.
The resultant optimization problem is solved through singular value decomposition (SVD) for each class independently. Kulis et al. (2011) proposed an asymmetric regularized cross-domain transformation (ARC-t) method to learn an asymmetric transformation across domains based on metric learning. Similar to DAMA, ARC-t also utilizes the label information to construct similarity and dissimilarity constraints between instances from the source and target domains, respectively. However, the computational complexities of ARC-t and its kernelized version depend quadratically on the feature dimensionality and the data size, respectively. Though most existing HDA methods have shown promising results, they still suffer from the following three major limitations.

1. Dense Feature Mappings. Existing methods tend to recover dense feature mappings, which is not tenable without sufficient constraints. As we will show, dense feature mappings may inevitably lose interpretability and contain a significant amount of noise, which can degrade the classification performance.

2. Multi-class Problems. To address multi-class classification problems, most existing HDA methods (Dai et al., 2008; Wang and Mahadevan, 2011; Duan et al., 2012) simply adopt the one-vs-all strategy to learn multiple binary classifiers independently. Unfortunately, this strategy fails to fully exploit the underlying structure among the multiple classes. Consequently, this heuristic scheme cannot guarantee good performance on multi-class HDA problems.

3. High Computational Cost. Since the size of the feature mapping G scales with the product of the dimensionalities of the source and target domains, the computational cost of estimating the mapping is extremely high, especially for high-dimensional source and target data, e.g., in MOMAP, HeMap, DAMA, and ARC-t. To address this computational issue, Duan et al. (2012) and Kulis et al. (2011) proposed kernelized versions to learn the feature mapping.
However, the kernelized methods still suffer from high computational cost on large-scale data in terms of the number of data instances.

2. Motivations and Contributions

In this work, we propose a Sparse Heterogeneous Feature Representation (SHFR) approach to address the above issues under the following assumptions.

2.1 Motivations

1. Sparse feature representation: The feature mapping G between two domains is row-sparse, i.e., each target domain feature can be represented by a small subset of the source domain features.

2. Class-invariance transformation: Instances belonging to different classes share the same feature mapping G.

[Figure 1: Illustration of the sparse feature representation matrix G, mapping source domain (English document) words, e.g., Great, Good, Wholesome, and Healthy, to the target domain (Spanish document) word Sano.]

The above two assumptions are common in real-world multilingual text categorization applications. Recall that the sparsity assumption on the feature mapping across domains implies that each feature in one domain can be represented by only a small subset of features in another domain. Here, we again use the multi-language (i.e., English/Spanish) text classification problem as a motivating example. Typically, the word "Sano" in Spanish has a similar meaning to the English words "Great", "Good", "Wholesome", and "Healthy", but not to all English words. Therefore, by assuming that the feature mapping across domains is linear, a feature or word in the Spanish domain can be represented by a linear combination of only a few features or words in the English domain. As illustrated in Figure 1, the sparse matrix G denotes the feature mapping from the English domain to the Spanish domain.
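For concreteness, the sparse-row idea above can be sketched in a few lines of numpy. The weights are the toy values from Figure 1; the vocabularies, word order, and array layout are illustrative assumptions, not data from the paper.

```python
import numpy as np

# Hypothetical toy vocabularies (illustrative only).
english_vocab = ["Great", "Good", "Fine", "Bad", "Wholesome", "Poor", "Healthy", "Sick"]

# One row of the sparse mapping G, for the target feature "Sano":
# Sano = 0.5*Great + 0.3*Good + 0.1*Wholesome + 0.1*Healthy
g_sano = np.zeros(len(english_vocab))
g_sano[english_vocab.index("Great")] = 0.5
g_sano[english_vocab.index("Good")] = 0.3
g_sano[english_vocab.index("Wholesome")] = 0.1
g_sano[english_vocab.index("Healthy")] = 0.1

# Mapping an English bag-of-words vector x_S onto the "Sano" coordinate: (G x_S)_i = g_i^T x_S
x_S = np.zeros(len(english_vocab))
x_S[english_vocab.index("Great")] = 1.0   # the document contains the word "Great"
sano_weight = g_sano @ x_S
print(sano_weight)  # 0.5
```

Only four of the eight entries of this row are nonzero, which is exactly the row-sparsity assumed above.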
Based on the sparse matrix G, the word Sano in Spanish can be represented sparsely by only four English words as Sano = 0.5·Great + 0.3·Good + 0.1·Wholesome + 0.1·Healthy. This sparsity characteristic, which also facilitates a significant reduction in computational cost on very high-dimensional data, has not been explored in existing HDA methods. The feature mapping of the word Sano is invariant across classes, for example, CCAT and C15. To encode the class-invariance property into the feature mapping, we propose to learn a common G underlying all the classes, in a similar spirit to multi-task feature learning (Argyriou et al., 2007) and domain-invariant feature learning (Gong et al., 2013). We first decompose the multi-class classification problem into multiple binary classification tasks, and then jointly optimize all the binary classification tasks and the feature mapping G, such that G is class-invariant across all the binary classification tasks.

2.2 Our Contributions

Based on these observations, we propose a new approach to learning a feature mapping for HDA with application to cross-lingual multi-class text categorization. To estimate this feature mapping, we leverage the weight vectors of the binary classifiers learned in the source and target domains. We then formulate the problem of learning the feature mapping between domains as a Compressed Sensing (CS) problem, which aims to recover a sparse signal from a small number of measurements of an underdetermined linear system by taking advantage of the sparseness or compressibility of the signal (Donoho, 2006; Candès et al., 2006). The main contributions of this paper are as follows:

1. We propose a Sparse Heterogeneous Feature Representation (SHFR) approach to learn a sparse transformation for HDA by exploring the common underlying structure among multiple classes.

2.
Our major finding is that, based on CS theory, a sparse feature mapping can be learned if and only if a sufficient number of classifiers are provided. Based on this analysis, we propose to generate sufficient classifiers under the Error Correcting Output Codes (ECOC) scheme to accurately estimate the sparse feature mapping.

3. To reduce the high computational cost, we propose a batch-mode pursuit algorithm to efficiently solve the resultant nonnegative lasso problem of SHFR. The computational cost of SHFR is independent of the data size and the dimensionality of the source domain data, and scales linearly with the dimensionality of the target domain data.

4. This paper is the first to present a generalization error bound for multi-class HDA, which depends on the number of tasks as well as the degree of sparsity of the feature mapping G.

The remainder of the paper is organized as follows. In Section 3, we first show how to formulate an HDA problem as a compressed sensing problem and describe the details of the proposed method, SHFR. We then theoretically analyze a generalization error bound for SHFR in Section 4, and develop a batch-mode algorithm in Section 5 to efficiently solve the resultant compressed sensing problem. In Section 6, we conduct extensive experiments on both toy and real-world datasets to demonstrate the effectiveness and efficiency of SHFR. We conclude the paper and discuss future work in Section 7.

3. Sparse Heterogeneous Feature Representation

We study the HDA problem with one source domain and one target domain under the multi-class setting. Let $\{(x_{S_i}, y_{S_i})\}_{i=1}^{n_S}$ denote a set of labeled training instances of the source domain, where $x_{S_i} \in \mathbb{R}^{d_S}$ denotes the i-th instance and $y_{S_i} \in \{1, 2, \ldots, c\}$ denotes its label. Let $\{(x_{T_i}, y_{T_i})\}_{i=1}^{n_T}$ be a set of labeled training instances of the target domain, where $n_T \ll n_S$, $x_{T_i} \in \mathbb{R}^{d_T}$, and $y_{T_i} \in \{1, 2, \ldots, c\}$.
Given a binary task $t \in \{1, 2, \ldots, n_c\}$, we assume that the classifiers for both the source and target domains are linear and can be written as $f^t(x) = (w^t)^\top x$, where $w^t$ is the weight vector of the t-th classifier. Since there are sufficient labeled data in the source domain, i.e., $\{(x_{S_i}, y_{S_i})\}_{i=1}^{n_S}$, we can learn a robust set of weight vectors $\{w_S^t\}_{t=1}^{n_c}$ for the source classifiers of the binary tasks decomposed from the multi-class classification problem. Similarly, we can build a corresponding set of weight vectors $\{w_T^t\}_{t=1}^{n_c}$ with the limited labeled data $\{(x_{T_i}, y_{T_i})\}_{i=1}^{n_T}$ for the target domain.

3.1 Feature Mapping for HDA

Recall that, for HDA problems, the feature dimensionalities of the source and target domains may not be equal, i.e., $d_S \neq d_T$. To make effective learning across heterogeneous domains possible, a transformation matrix $G \in \mathbb{R}^{d_T \times d_S}$ was introduced in ARC-t (Kulis et al., 2011; Saenko et al., 2010) to learn the similarity $x_{T_i}^\top G x_{S_i}$ between a source instance $x_{S_i} \in \mathbb{R}^{d_S}$ and a target instance $x_{T_i} \in \mathbb{R}^{d_T}$. Data points can be transformed from the source feature space to the target feature space via $x_{T_i} = G x_{S_i}$, or equivalently, the target domain data can be mapped into the source domain via $G^\top$ (Saenko et al., 2010). To learn the transformation matrix G, corresponding pairs across domains are usually assumed to be given in advance (Qi et al., 2011) or generated based on the label information in both the source and target domains (Saenko et al., 2010). However, in practice, instance correspondences between domains are usually missing, or the instance correspondences constructed using label information are imprecise. Therefore, instead of learning the transformation from the raw input data directly, we adapt an idea from a multi-task learning method (Ando and Zhang, 2005)^1 and propose to learn the feature mapping across heterogeneous features based on the source and target predictive structures, i.e., the $\{w_S^t\}$'s and $\{w_T^t\}$'s.
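The per-domain predictive structures above can be sketched as follows. The paper does not fix the binary learner at this point, so this sketch substitutes a simple ridge-regression classifier on ±1 labels; the one-vs-all decomposition, dimensions, and synthetic data are all assumptions for illustration.

```python
import numpy as np

def binary_weights(X, y, lam=0.1):
    """Ridge-regression weight vector for one binary task (labels in {-1,+1}).
    A least-squares stand-in for the linear classifiers f^t(x) = (w^t)^T x."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
c, d_S, d_T, n_S, n_T = 4, 20, 10, 200, 30   # n_T << n_S: few labeled target instances

X_S = rng.normal(size=(n_S, d_S)); y_S = rng.integers(1, c + 1, n_S)
X_T = rng.normal(size=(n_T, d_T)); y_T = rng.integers(1, c + 1, n_T)

# One-vs-all decomposition: binary task t separates class t from the rest.
W_S = np.stack([binary_weights(X_S, np.where(y_S == t, 1.0, -1.0)) for t in range(1, c + 1)])
W_T = np.stack([binary_weights(X_T, np.where(y_T == t, 1.0, -1.0)) for t in range(1, c + 1)])
print(W_S.shape, W_T.shape)  # (4, 20) (4, 10)
```

The rows of `W_S` and `W_T` play the roles of $\{w_S^t\}$ and $\{w_T^t\}$; note the two weight matrices live in spaces of different dimensionality, which is what the mapping G must bridge.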
We can learn the transformation matrix $G \in \mathbb{R}^{d_T \times d_S}$ either by maximizing the dependency between the transformed weight vectors of the source classifiers and the weight vectors of the target classifiers via $\max_G\ (w_T^t)^\top G w_S^t$, or by minimizing the distance between these two vectors via $\min_G\ \|w_T^t - G w_S^t\|$. For simplicity of theoretical analysis, we adopt the latter objective to learn the transformation G. Given a binary task $t \in \{1, \ldots, n_c\}$ and a transformation $G \in \mathbb{R}^{d_T \times d_S}$, the relationship between the weight vectors of the source and target classifiers can be modeled as follows:

$w_T^t - G w_S^t = \Delta w^t,$

where $\Delta w^t$ is referred to as a delta weight vector, and its $\ell_2$-norm $\|\Delta w^t\|_2$ measures the difference between the target weight vector $w_T^t$ and the transformed source weight vector $G w_S^t$. Note that the relationship is linear. This is because, in this paper, we focus on the application of HDA to cross-language text categorization, where a linear form for modeling the relationship between the predictive structures of the two domains is well suited to high-dimensional text data. To model a nonlinear relationship between the source and target predictive structures, which may be more useful in other application areas, e.g., image classification or speech recognition, one can apply an explicit nonlinear feature mapping $\phi(x)$ to both the source and target domain data before training the weight vectors for each domain.

3.2 Proposed Formulation

To use the robust source domain weight vectors $w_S^t$ to make predictions on the target domain data, the difference between the source and target domains after transformation should be minimized. We therefore propose to learn the transformation G by minimizing $\|\Delta w^t\|_2$. As mentioned in Section 2, the feature mapping G should be class-invariant and sparse, and for multi-language text classification and many other real-world applications, the feature mapping G between feature spaces should be nonnegative.
The motivation for the nonnegative constraint on the feature mapping G is similar to that for applying nonnegative matrix factorization (NMF) (Lee and Seung, 2001) to text mining. Here, we aim to approximate "input vectors", i.e., target features, by a nonnegative linear combination of a set of nonnegative "basis vectors", i.e., source features, through the mapping G. This is also similar to multiple kernel learning (MKL) (Lanckriet et al., 2004; Gönen and Alpaydın, 2011), where the target kernel is represented by a nonnegative linear combination of base kernels. In the context of multi-lingual text classification, the nonnegative entries of G can be regarded as explicit soft global co-occurrences between the source and target vocabularies, whereas without the nonnegative constraint on G, it is difficult to interpret the physical or semantic meaning behind G. Let $G = [g_1, g_2, \ldots, g_{d_T}]^\top$. By imposing $\ell_1$-regularization and the nonnegative constraint on each $g_i$, the learning of G can be formulated as the following nonnegative LASSO problem (Yuan and Lin, 2007; Slawski and Hein, 2011):

$\min_G\ \sum_{t=1}^{n_c} \|w_T^t - G w_S^t\|_2^2 + \sum_i \lambda_i \|g_i\|_1, \quad \text{s.t. } g_i \geq 0, \ \forall i,$ (1)

where $\lambda_i > 0$ is the trade-off parameter for the regularization term on each $g_i$. In (1), the first term of the objective minimizes the difference between $w_T^t$ and $G w_S^t$ over all the $n_c$ tasks, the second term enforces sparsity on each row of G, and the constraints preserve nonnegative linear relations between the source and target predictive structures.

1. Multi-task learning aims to simultaneously learn the classifiers for multiple tasks by sharing parameters among all the tasks, where each task has only limited labeled data. In contrast, the goal of HDA is to learn a classifier only for a target domain with limited labeled target data by leveraging the models learned from the source domain.
Note that in practice, once the source domain classifiers, in terms of the weight vectors $\{w_S^t\}_{t=1}^{n_c}$, are learned offline, the source domain training data can be discarded. For a new domain of interest, i.e., the target domain, the weight vectors $\{w_T^t\}_{t=1}^{n_c}$ can first be learned with a few labeled training instances, and then G can be learned from $\{w_S^t\}_{t=1}^{n_c}$ instead of the original source domain training data, which significantly reduces the learning complexity. This learning scheme differs markedly from most existing HDA methods, which require the original source domain training data to be available when learning the feature mapping across domains. More discussion of the complexity issue can be found in Section 5.1. As G is learned through all the binary classification tasks, knowledge can be transferred from the easy tasks to the difficult ones. In Section 4.2, we theoretically show that as long as a significant number of the base binary tasks are not badly designed, the generalization error of the target classification model is guaranteed to be small. To solve the optimization problem (1), we first rewrite it in the following equivalent form:

$\min_{\{g_i \geq 0\}}\ \sum_{t=1}^{n_c} \sum_{i=1}^{d_T} (w_{T_i}^t - (w_S^t)^\top g_i)^2 + \sum_i \lambda_i \|g_i\|_1,$ (2)

where $w_{T_i}^t$ is the i-th element of the vector $w_T^t$. By exchanging the order of the two summations in the first term, the above optimization problem can be reformulated as the following CS problems:

$\min_{g_i \geq 0}\ \|b_i - D g_i\|_2^2 + \lambda_i \|g_i\|_1, \quad i = 1, \ldots, d_T,$ (3)

where $b_i$ is the vector concatenating $\{w_{T_i}^t\}_{t=1}^{n_c}$, and $D = [w_S^1, w_S^2, \ldots, w_S^{n_c}]^\top \in \mathbb{R}^{n_c \times d_S}$. This type of CS problem has been well studied, and a number of algorithms have been proposed to solve it, such as nonnegative least squares (NNLS) (Cantarella and Piatek, 2004), the accelerated proximal gradient algorithm (APG) (Toh and Yun, 2010), and orthogonal matching pursuit (OMP) (Zhang, 2009). In this work, we propose an efficient batch-mode algorithm to solve it, which is introduced in detail in Section 5.
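As a rough illustration of solving one such subproblem, the following sketch uses a projected proximal-gradient (ISTA) loop, a generic stand-in for the batch-mode algorithm of Section 5. The problem sizes, the regularization value, and the random Gaussian dictionary are all illustrative assumptions.

```python
import numpy as np

def nonneg_lasso(D, b, lam=0.05, n_iter=2000):
    """ISTA sketch for min_{g >= 0} ||b - D g||_2^2 + lam * ||g||_1.
    Each step: gradient descent on the quadratic, then the prox of the
    l1 term restricted to the nonnegative orthant: g <- max(0, v - step*lam)."""
    g = np.zeros(D.shape[1])
    step = 1.0 / (2 * np.linalg.norm(D, 2) ** 2)   # 1/L for the gradient of ||b - Dg||^2
    for _ in range(n_iter):
        grad = 2 * D.T @ (D @ g - b)
        g = np.maximum(0.0, g - step * (grad + lam))
    return g

rng = np.random.default_rng(1)
nc, d_S = 25, 40                      # nc binary tasks act as measurements; nc < d_S
D = rng.normal(size=(nc, d_S))        # rows play the role of source weight vectors w_S^t
g_true = np.zeros(d_S)
g_true[[2, 7, 11]] = [0.5, 0.3, 0.2]  # a sparse, nonnegative row of G
b = D @ g_true                        # target-side measurements b_i = {w_Ti^t}
g_hat = nonneg_lasso(D, b)
print(np.linalg.norm(g_hat - g_true)) # small when recovery succeeds
```

Even though the system is underdetermined ($n_c < d_S$), the sparsity and nonnegativity of the true row allow accurate recovery from a modest number of measurements, which is the mechanism the analysis in Section 4.1 quantifies.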
After the sparse transformation G has been learned, we can reuse the source domain classifiers to predict the label of any unseen test instance $x_T$ from the target domain via $y_T = F(\{(G w_S^t)^\top x_T\}_{t=1}^{n_c})$, where $F(\cdot)$ is a decision function that combines the predictions of all the $n_c$ source classifiers into a final prediction.

4. Generalization Error Bound for HDA

In the past decade, theoretical analysis for domain adaptation has been widely studied (Blitzer et al., 2007a; Mansour et al., 2009; Ben-David et al., 2010), focusing on understanding the conditions and assumptions under which domain adaptation succeeds. However, most previous theoretical works are restricted by the assumption that the data from the source and target domains are represented in the same feature space, and therefore cannot be applied directly to analyze the generalization error of our proposed method, SHFR, under the HDA setting. In this section, we analyze the theoretical guarantees of SHFR in detail. In contrast to existing generalization error analyses for homogeneous domain adaptation, we derive the generalization error bound for SHFR mainly from the perspectives of ECOC and compressed sensing.

4.1 Estimation Error Bound for the Sparse Feature Mapping

Recall that the optimization problem (3) is composed of $d_T$ lasso problems. In general, we have $n_c < d_S$ for relatively high-dimensional data; therefore, (3) is an underdetermined linear system (Donoho, 2006). However, based on compressed sensing theory, if $g_i$ is sparse, it is possible to obtain a solution given sufficient measurements and a matrix D that satisfies certain conditions (Donoho, 2006; Candès et al., 2006). In particular, one such condition is the sparse Riesz condition or the RIP condition, which requires that any two columns of D be as incoherent as possible (Donoho, 2006; Zhang and Huang, 2008; Candès et al., 2006).
For simplicity of presentation, let $k_i$ denote the number of nonzero entries of $g_i$ (its non-sparsity degree), and let $\hat g_i$ denote an estimator of $g_i$. According to Theorem 3 and Remark 4 in (Zhang and Huang, 2008), under some restricted conditions, the estimation error $\|g_i - \hat g_i\|_2$ of each independent subproblem can be bounded by $O\big(\sqrt{k_i \log d_S / n_c}\big)$, where $d_S$ denotes the dimensionality of the source domain data and $n_c$ is the number of tasks. Hence, we obtain the following lemma.

Lemma 1 Under the sparse Riesz condition, the estimation error $\|\Delta G_{CS}\|_F = \|G - \hat G\|_F$ in (1) is bounded by $O\big(d_T \sqrt{k \log d_S / n_c}\big)$, where $k$, $d_S$, $d_T$, $n_c$, and $\hat G$ denote the largest row non-sparsity degree of G, the dimensionality of the source and target domain data, the number of tasks, and the estimator of G, respectively.

Note that besides the sparse Riesz condition, Lemma 1 also holds under some other restricted conditions, such as the RIP condition (Zhang and Huang, 2008; Candès et al., 2006). Both the sparse Riesz condition and the RIP condition require that any two columns of D be as incoherent as possible (Donoho, 2006; Zhang and Huang, 2008; Candès et al., 2006). However, although the RIP condition for randomly designed matrices has been thoroughly investigated (Baraniuk et al., 2008), it is in general more difficult to verify for deterministically designed matrices. In contrast, the sparse Riesz condition for a deterministically designed matrix holds with high probability if the following condition is satisfied (see Proposition 1 in (Zhang and Huang, 2008)):

$\max_{|A|=k}\ \inf_{\alpha \geq 1}\ \sum_{i \in A,\, i \neq j} |\rho_{ji}|^{\alpha/(\alpha-1)} \leq \delta,$

where $k$ denotes the non-sparsity degree and $\rho_{ji} = d_j^\top d_i / n_c$.

Remark 2 Note that $\rho_{ji}$ can be regarded as the correlation between the columns $d_j$ and $d_i$. Therefore, the sparse Riesz condition for a deterministically designed matrix can be evaluated through the correlations between the columns of the dictionary matrix.
The smaller the sum of correlations $\sum |\rho_{ji}|$ (or $\delta$), the more likely the sparse Riesz condition is to hold for D. According to Lemma 1 and Remark 2, to reduce the reconstruction error of the feature mapping G, it is necessary to 1) construct as many classifiers as possible to increase the number of measurements, i.e., $n_c$ needs to be sufficiently large, and 2) ensure that the columns of D constructed from the classifiers are incoherent, i.e., D is incoherent. For a multi-class classification problem with labels $\{1, 2, \ldots, c\}$, $c$ binary classifiers can be generated using the one-vs-all strategy (Dietterich and Bakiri, 1995). However, when $c$ is small, this strategy cannot generate sufficiently many binary classifiers. Alternatively, the one-vs-one strategy may be used to generate $c(c-1)/2$ binary classifiers. However, the classifiers generated in this way may be highly redundant, i.e., some classifiers may be strongly correlated with one another. To address this issue, we propose to use the Error Correcting Output Codes (ECOC) scheme (Dietterich and Bakiri, 1995; Zhou et al.) to generate sufficient binary classifiers for estimating the feature mapping G. As will be shown empirically in Section 6.5, with a properly designed ECOC coding matrix, greater incoherence of D can be guaranteed than under the one-vs-all or one-vs-one strategies.

4.2 Generalization Error Analysis of HDA Based on ECOC

Under the ECOC scheme, the multi-class hypothesis $h$ is composed of a set of binary classifiers $\{\hat G w_S^t\}_{t=1}^{n_c}$. However, in contrast to the one-vs-all scheme, ECOC is more general and consists of two steps: encoding, i.e., designing a coding matrix M that induces binary classifiers, and decoding, i.e., combining the results of all the binary classifiers to make predictions based on the coding matrix. To analyze the generalization error bound of SHFR for multi-class HDA based on ECOC, we first need to analyze the principles of ECOC for multi-class classification.
As discussed in Dietterich and Bakiri (1995) and Allwein et al. (2001), the prediction performance of a multi-class classification model learned under the ECOC scheme depends on the design of the coding matrix. Given a coding matrix $M \in \{-1, 0, +1\}^{c \times n_c}$, each class is associated with a row of M, and each binary classifier is associated with a column of M. An entry $M(i, j) = +1$ means that class $i$ is treated as a positive class by the classifier $f_j$; $M(i, j) = -1$ means that class $i$ is treated as a negative class by $f_j$; otherwise, class $i$ is not taken into account when training and testing $f_j$. In this way, both the one-vs-all scheme and the one-vs-one scheme can be viewed as special cases of the ECOC scheme. As pointed out by Dietterich and Bakiri (1995), a coding matrix is good if each codeword, i.e., each row, is well separated in Hamming distance from each of the other codewords, and each bit-position function $f_k$, i.e., each column, is uncorrelated with the functions to be learned for the other bit positions $f_j$, $j \neq k$. The power of a coding matrix M for multi-class classification can be measured by the minimum Hamming distance between any pair of its codewords (Allwein et al., 2001), denoted by $\min(M)$. For instance, for the one-vs-all code, $\min(M) = 2$, and for the one-vs-one code, $\min(M) = (c-2)/2 + 1$. For a random matrix with components chosen over $\{-1, 0, +1\}$, the expected value of $\min(M)$ for any distinct pair of rows is $n_c/2$. Here, we first follow the proof procedure for the error analysis of multi-class classification described in Allwein et al. (2001) to analyze the condition under which the ECOC decoding makes a test error. After that, we derive the generalization error bound of the proposed SHFR under the ECOC scheme. Given a coding matrix $M \in \{-1, 0, +1\}^{c \times n_c}$, each row corresponds to a class and each column corresponds to a binary task.
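The coding matrices of the two classical schemes can be constructed explicitly. This sketch measures separation with the plain Hamming count (number of differing entries), which reproduces the one-vs-all value of 2 quoted above; the function names and class count are illustrative.

```python
import numpy as np
from itertools import combinations

def one_vs_all_code(c):
    """Coding matrix M in {-1,+1}^{c x c}: column t pits class t against the rest."""
    return 2 * np.eye(c, dtype=int) - 1

def one_vs_one_code(c):
    """Coding matrix M in {-1,0,+1}^{c x c(c-1)/2}: one column per class pair,
    with zeros for the classes a column ignores."""
    pairs = list(combinations(range(c), 2))
    M = np.zeros((c, len(pairs)), dtype=int)
    for t, (i, j) in enumerate(pairs):
        M[i, t], M[j, t] = 1, -1
    return M

def min_hamming(M):
    """Minimum Hamming distance (count of differing entries) over distinct row pairs."""
    return min(int(np.sum(M[r1] != M[r2]))
               for r1, r2 in combinations(range(M.shape[0]), 2))

c = 4
print(one_vs_all_code(c).shape, min_hamming(one_vs_all_code(c)))  # (4, 4) 2
print(one_vs_one_code(c).shape)                                   # (4, 6)
```

Any two one-vs-all codewords differ in exactly their own two columns, hence the minimum distance of 2; a well-designed ECOC matrix drives this separation higher.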
For each $t \in \{1, 2, \ldots, n_c\}$, we denote by $w_S^t$ and $h_S^t$ the source domain weight vector and the hypothesis of the t-th task defined by the coding matrix M, respectively. We denote by $\hat G w_S^t$ and $h_T^t$ the estimated target domain weight vector and hypothesis of the t-th task, respectively. We also define the loss function

$L(z) = (1 - z)_+ = \max\{1 - z, 0\},$ (4)

in terms of the margin z. For example, given a target domain instance $x_{T_i}$ with label $y_{T_i}$, the corresponding margin $z_{T_i}^t$ for the t-th task is defined as

$z_{T_i}^t = M(y_{T_i}, t)\, h_T^t(x_{T_i}) = M(y_{T_i}, t)\, ((\hat G w_S^t)^\top x_{T_i}).$ (5)

For instance, an SVM seeks to minimize this margin-based loss under $\ell_2$-norm regularization:

$\min_w\ \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} (1 - y_i w^\top x_i)_+,$ (6)

where the margin is $z = y_i w^\top x_i$. When the prediction is correct and $M(y_{T_i}, t) \neq 0$ for $x_{T_i}$, the margin $z_{T_i}^t > 0$, which results in $0 \leq L(z) < 1$; when the prediction is wrong and $M(y_{T_i}, t) \neq 0$ for $x_{T_i}$, the margin $z_{T_i}^t < 0$, which results in $L(z) > 1$; when $M(y_{T_i}, t) = 0$, the margin $z_{T_i}^t = 0$, which results in $L(z) = 1$. We define the $L(z)$ loss-based decoding scheme for prediction as follows:

$d_L(M(r, :), h(x_{T_i})) = \sum_{t=1}^{n_c} L(M(r, t)\, h_T^t(x_{T_i})).$ (7)

The predicted label $\hat y \in \{1, 2, \ldots, c\}$ is given by $\hat y = \arg\min_r d_L(M(r, :), h(x_{T_i}))$. To understand the error-correcting ability of ECOC, we first define the distance between the codewords of any distinct pair of rows, $M(r_1, :)$ and $M(r_2, :)$, of the coding matrix M as

$d(M(r_1, :), M(r_2, :)) = \sum_{s=1}^{n_c} d(M(r_1, s), M(r_2, s)),$ (8)

where $d(M(r_1, s), M(r_2, s)) = \frac{1 - \mathrm{sign}(M(r_1, s) M(r_2, s))}{2}$. We define $\rho = \min_{r_1 \neq r_2} d(M(r_1, :), M(r_2, :))$ as the minimum distance between any two rows of the coding matrix M, which is pre-computed before training.
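Loss-based decoding as in (7) can be sketched directly in a few lines; the coding matrix and the classifier outputs below are illustrative values, not from the paper's experiments.

```python
import numpy as np

def hinge(z):
    # L(z) = (1 - z)_+, Eq. (4)
    return np.maximum(1.0 - z, 0.0)

def loss_based_decode(M, h):
    """Loss-based decoding, Eq. (7): pick the row r minimizing
    d_L(M(r,:), h(x)) = sum_t L(M(r,t) * h^t(x))."""
    d_L = hinge(M * h[None, :]).sum(axis=1)   # row r, column t -> L(M[r,t] * h[t])
    return int(np.argmin(d_L)) + 1            # classes are labeled 1..c

# One-vs-all code for c = 3 classes (illustrative).
M = np.array([[ 1, -1, -1],
              [-1,  1, -1],
              [-1, -1,  1]])
h = np.array([-0.8, 0.9, -0.4])               # binary classifier outputs h^t(x)
print(loss_based_decode(M, h))                # 2
```

Here classifier 2 fires positively and the others negatively, so row 2 accumulates the smallest total hinge loss and wins the decoding.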
Proposition 3 Given a coding matrix M ∈ {−1, 0, +1}^{c×n_c}, a set of hypothesis outputs h^1(x), …, h^{n_c}(x) on a test instance x generated by n_c base classifiers, and a convex loss function L(z) in terms of the margin z, if x is misclassified by the loss-based ECOC decoding, then

Σ_{t=1}^{n_c} L(M(y, t) h^t(x)) ≥ ρ L(0), (9)

where M(y, t) denotes the code of the ground-truth label y for task t generated by the coding matrix M. In other words, the loss of the vector of predictions h(x) is at least ρ L(0).

Proof Suppose that the loss-based ECOC decoding incorrectly classifies a test instance x with the ground-truth label y. There must then exist a label r ≠ y such that d_L(M(y, :), h(x)) ≥ d_L(M(r, :), h(x)). Using the definitions of the loss function and margin introduced in (4) and (5), we denote z_t = M(y, t) h^t(x) and z'_t = M(r, t) h^t(x). Using the definition of d_L in (7), we then have

Σ_{t=1}^{n_c} L(z_t) ≥ Σ_{t=1}^{n_c} L(z'_t). (10)

Let S = {t : M(r, t) ≠ M(y, t), M(r, t) ≠ 0, M(y, t) ≠ 0} be the set of columns of M whose r-th and y-th row entries are different and both nonzero, and let S0 = {t : M(r, t) = 0 or M(y, t) = 0} be the set of columns whose r-th or y-th row entry is zero. If t ∉ S ∪ S0, then z_t = z'_t, and (10) implies that

Σ_{t∈S∪S0} L(z_t) ≥ Σ_{t∈S∪S0} L(z'_t),

which, in turn, implies that

Σ_{t=1}^{n_c} L(z_t) ≥ Σ_{t∈S∪S0} L(z_t) ≥ (1/2) Σ_{t∈S∪S0} (L(z_t) + L(z'_t)) = (1/2) Σ_{t∈S} (L(z_t) + L(z'_t)) + (1/2) Σ_{t∈S0} (L(z_t) + L(z'_t)). (11)

If t ∈ S, then z'_t = −z_t and (L(−z_t) + L(z_t))/2 ≥ L(0) due to convexity. If t ∈ S0, then either z_t = 0 or z'_t = 0; hence L(z'_t) + L(z_t) ≥ L(0), and the second term of (11) is at least L(0)|S0|/2. Therefore,

Σ_{t=1}^{n_c} L(z_t) ≥ (1/2) Σ_{t∈S} (L(z_t) + L(z'_t)) + (1/2) Σ_{t∈S0} (L(z_t) + L(z'_t)) ≥ L(0) (|S| + |S0|/2) = L(0) d(M(r, :), M(y, :)) ≥ ρ L(0), (12)

where the equality in (12) follows from the fact that

d(M(r, s), M(y, s)) = (1 − sign(M(r, s) M(y, s)))/2 = 0 if M(r, s) = M(y, s) with both entries nonzero; 1/2 if M(r, s) = 0 or M(y, s) = 0; 1 if M(r, s) = −M(y, s) with both entries nonzero,

and the last inequality from ρ ≤ d(M(r, :), M(y, :)). This completes the proof.
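Since Proposition 3 is a deterministic statement, it can be checked numerically. The following sketch (a hypothetical random code and synthetic noisy margins) runs loss-based decoding and verifies that every misclassified instance incurs a total loss of at least ρ·L(0), and that the resulting error-rate bound of Theorem 5 below also holds:

```python
import numpy as np

def hinge(z):
    return np.maximum(1.0 - z, 0.0)   # L(z), with L(0) = 1

rng = np.random.default_rng(0)
c, n_cols, n_test = 4, 12, 300
M = rng.choice([-1, 0, 1], size=(c, n_cols)).astype(float)
y = rng.integers(0, c, size=n_test)
# Noisy task margins: each task leans toward the correct code bit.
H = M[y] + 1.5 * rng.standard_normal((n_test, n_cols))

# Loss-based decoding (7): predict the row with the smallest total loss.
losses = hinge(M[None, :, :] * H[:, None, :]).sum(axis=2)   # n_test x c
y_hat = losses.argmin(axis=1)

# rho: minimum generalized distance (8) between distinct rows of M.
dist = lambda r1, r2: np.sum((1 - np.sign(r1 * r2)) / 2)
rho = min(dist(M[i], M[j]) for i in range(c) for j in range(i + 1, c))

# Proposition 3: every misclassified instance has total loss >= rho * L(0).
total_loss = hinge(M[y] * H).sum(axis=1)     # sum_t L(M(y, t) h_t(x))
print(bool(np.all(total_loss[y_hat != y] >= rho)))   # True

# Theorem 5: error rate <= eps * n_c / (rho * L(0)).
err, eps = np.mean(y_hat != y), np.mean(hinge(M[y] * H))
print(bool(err <= eps * n_cols / rho))               # True
```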
Remark 4 From Proposition 3, it can be seen that misclassification on a test instance (x, y) implies that Σ_{t=1}^{n_c} L(M(y, t) h^t(x)) ≥ ρ L(0). In other words, the prediction codes are not required to be exactly the same as the ground-truth codes for all the base classifiers. As long as the loss is smaller than ρ L(0), ECOC can rectify the errors committed by some base classifiers and still make an accurate prediction. This error-correcting ability is very important, especially when the labeled data in the target domain is insufficient. The proposition holds for any convex margin-based loss function L(z). In this paper, we use the hinge loss defined in (4), for which L(0) = 1.

Theorem 5 Let ϵ be the averaged loss of the target domain hypotheses h_T^1, …, h_T^{n_c} on the target test data {{(x_Ti, M(y_Ti, t))}_{i=n_T+1}^{n_Te}}_{t=1}^{n_c} with respect to the coding matrix M ∈ {−1, 0, +1}^{c×n_c}, where c is the cardinality of the label set. Let the convex loss function be L(z) = |1 − z|₊. The multi-class HDA generalization error of SHFR using sparse random encoding and loss-based decoding is then at most ϵ n_c / (ρ L(0)).

Proof According to Proposition 3, the loss for any misclassified test instance x_T satisfies

ρ L(0) ≤ Σ_{t=1}^{n_c} L(M(y_T, t) h_T^t(x_T)).

Let a be the number of incorrect predictions on a test sample of size n_Te. Summing the above inequality over the misclassified instances and enlarging the right-hand side to a sum over all test instances, we obtain

a ρ L(0) ≤ Σ_i Σ_{t=1}^{n_c} L(M(y_Ti, t) h_T^t(x_Ti)). (13)

Then we have

a ≤ (1/(ρ L(0))) Σ_i Σ_{t=1}^{n_c} L(M(y_Ti, t) h_T^t(x_Ti)) = n_Te ϵ n_c / (ρ L(0)),

since ϵ = (1/(n_Te n_c)) Σ_i Σ_{t=1}^{n_c} L(M(y_Ti, t) h_T^t(x_Ti)) and L(0) = |1 − 0|₊ = 1. Therefore, the generalization error rate a/n_Te is bounded by ϵ n_c / (ρ L(0)). The proof is completed.

Remark 6 The above error bound is usually small in the setting studied in this paper for the following reasons. When sufficient classifiers are generated using ECOC and G satisfies the sparsity assumption, the reconstruction error is small and bounded according to compressed sensing theory.
On the other hand, the loss L(M(y_T, t) h_T^t(x_T)) for some of the {h_T^t}'s, which rely on G, may be large because the target domain data is limited. The data-independent term n_c is usually predefined, and the minimum distance ρ is determined correspondingly, as ρ monotonically increases with n_c according to (8). Given the predefined ECOC coding matrix, as long as the averaged loss ϵ is small, the target classifier is still able to make a precise prediction according to Theorem 5. In other words, SHFR fails to achieve satisfactory classification accuracy only if most or all of the binary classifiers generated using ECOC and trained on the target domain data are poorly designed. In summary, we show that the multi-class HDA generalization error of SHFR based on the ECOC scheme has a linear relationship with the average loss over all generated binary classifiers when using random encoding and loss-based decoding. In addition to theoretical analyses, García-Pedrajas and Ortiz-Boyer (2011) conducted comprehensive experiments and concluded that ECOC and one-vs-one are superior to other multi-class classification schemes, and are the best choices for either powerful learners or simple learners. Furthermore, Montazer et al. (2012) claimed that, of several multi-class classification schemes, sparse ECOC is better than dense ECOC, while dense ECOC is better than one-vs-one, and one-vs-all is the worst. Ghani (2000) proved a theoretical bound that a randomly-constructed binary matrix is not well row-separated with probability at most 1/c^4, where c is the number of classes. Therefore, in our experiments, we adopt the sparse and random ECOC scheme for multi-class HDA problems.

4.3 Robust Transformation Learning using ECOC

The robustness of ECOC is another important motivation for using ECOC in SHFR.
Recall that error-correcting codes can be viewed as a compact form of voting, where a certain number of incorrect votes can be corrected by the correct votes (Dietterich and Bakiri, 1995). Given a total of T classifiers, voting-based methods are guaranteed to make a correct decision as long as there are at least ⌊T/2⌋ + 1 correct classifiers (Dietterich, 2000), where ⌊·⌋ is the floor function that returns the largest integer not greater than its input. In other words, even though there are misclassifications due to incorrect base classifiers, good final classification accuracy can still be achieved by using the ECOC scheme. Previous research has shown that although bit errors are unavoidable in real-world applications, it is still possible to make correct decisions using error-correcting codes with a sufficient number of good learners (Ghani, 2000). This property is particularly important for the proposed multi-class HDA method, SHFR, since some of the learned binary classifiers in the target domain may not be accurate due to insufficient label information. SHFR with ECOC is able to learn a robust transformation matrix G even when some target binary classifiers are not precise. The robustness of SHFR with ECOC will be verified empirically in Section 6, where we demonstrate that even with inaccurate target classifiers, the class-invariant G learned by SHFR with the ECOC scheme can still dramatically enhance prediction accuracy in the target domain. Combining all the above theoretical analyses, we conclude that the generalization error of our proposed SHFR method for multi-class HDA problems is small and bounded given a properly designed ECOC coding matrix.

5. Efficient Batch Decoding for SHFR

In this section, we present an algorithm to solve the proposed optimization problem (3) of SHFR in detail.
Recall that problem (3) contains d_T LASSO problems with nonnegative constraints, which can be solved by adapting existing algorithms such as NNLS solvers (Cantarella and Piatek, 2004; Slawski and Hein, 2011) or LASSO solvers, e.g., APG (Toh and Yun, 2010) and OMP (Zhang, 2009). However, the time complexity of these methods is O(n_c d_S) for each LASSO problem, which makes problem (3) intractable when the source domain data is of very high dimensionality and the number of LASSO problems d_T is very large. Recall that each of the optimization problems in (3) can be cast as the following problem

min_{g_i ≥ 0} (1/2) ‖b_i − D g_i‖₂² + λ_i ‖g_i‖₁, (15)

where b_i ∈ R^{n_c×1} is the vector concatenating the i-th entries of w_T^t over all n_c tasks, and D = [w_S^1, w_S^2, …, w_S^{n_c}]ᵀ ∈ R^{n_c×d_S}. For simplicity and without loss of clarity, we drop the subscript i from g_i and b_i in the following discussion. Given any g, let

ξ = b − Dg (16)

be the residual. Problem (15) can then be equivalently reformulated as

min_{g≥0, ξ} λ‖g‖₁ + (1/2)‖ξ‖², s.t. ξ = b − Dg. (17)

Solving this problem directly can be very expensive when the source domain data is of very high dimensionality. Following Tan et al. (2015a), we propose to solve problem (15) via a matching pursuit method, called matching pursuit LASSO (MPL), as shown in Algorithm 1. In Algorithm 1, the residual ξ_0 is initialized as ξ_0 = b since g_0 = 0. At the t-th step, we first choose the B components of g with the B largest values of Dᵀξ_t, and merge their indices J_t into the active set I_t. We then solve a reduced problem whose scale is only O(n_c |I_t|):

min_{g≥0, ξ} λ‖g‖₁ + (1/2)‖ξ‖², s.t. ξ = b − Dg, g_{I_t^c} = 0, (18)

Algorithm 1 Matching pursuit LASSO for solving the nonnegative LASSO problem.
Initialize the residual ξ_0 = b, I_0 = ∅, and let t = 1.
1: Compute q = Dᵀξ_{t−1}, choose the B largest q_j, and record their indices in J_t.
2: Let I_t = I_{t−1} ∪ J_t.
3: Let g_{I_t^c} = 0, and solve the subproblem (18) to update g_{I_t}.
4: Compute ξ_t = b − D_{I_t} g_{I_t}.
5: Terminate if the stopping condition is met; otherwise, let t = t + 1 and go to step 1.

Here I_t denotes the index set of the selected columns of D, and I_t^c is the complement of I_t. The subproblem can be solved efficiently using a projected proximal gradient (PG) method. Note that the number B in step 1 is a small positive integer, e.g., B = 1. Due to the nonnegative constraint g ≥ 0, we select the elements with the B largest values of Dᵀξ rather than of |Dᵀξ|.

Batch-mode MPL. In Algorithm 1, solving each subproblem has a complexity of O(k n_c), where k denotes the degree of sparsity, namely k = |I_t|, while computing Dᵀξ has a complexity of O(d_S n_c). Note that problem (3) has d_T sub-problems of the form (15). Therefore, computing Dᵀξ dominates the whole complexity of the decoding if both n_c and d_T are very large. To address this computational burden, we propose a batch-mode MPL (BMPL) algorithm. Note that Dᵀξ = Dᵀ(b − Dg) = Dᵀb − DᵀDg. We can pre-compute β = Dᵀb and Q = DᵀD, and store them in memory. Since g_{I_t^c} = 0 at the t-th iteration, we have Dᵀξ_t = Dᵀb − [DᵀD]_{I_t} g_{I_t} = β − Q_{I_t} g_{I_t}, where Q_{I_t} denotes the columns of Q indexed by I_t. The computational cost of Dᵀξ is thus reduced to O(k n_c), which saves considerable computation when k ≪ d_S. Note that β and Q are shared by all the LASSO tasks, so they need to be calculated only once. With this strategy, the overall computational cost of solving (3) when d_S is large is significantly reduced to O(d_T n_c k) (Tan et al., 2015a,b).

5.1 Complexity Comparison

We use linear SVMs to build the base classifiers for the source and target domains, which are obtained using the Liblinear solver (Fan et al., 2008) and pre-trained offline. The learning of G does not depend on the number of training instances. According to the analysis in Section 5, the computational cost of SHFR is O(d_T n_c k), where the sparsity degree k ≪ d_S.
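The BMPL procedure above can be sketched in numpy as follows (hypothetical function names and toy dimensions; the subproblem (18) is solved here by a simple projected gradient loop rather than the authors' exact PG solver, and the stopping rule is an illustrative correlation check):

```python
import numpy as np

def nn_lasso_pg(D_I, b, lam, n_iter=500):
    """Projected gradient for the reduced problem (18):
    min_{g >= 0} 0.5*||b - D_I g||^2 + lam*||g||_1 (here ||g||_1 = sum(g))."""
    g = np.zeros(D_I.shape[1])
    step = 1.0 / (np.linalg.norm(D_I, 2) ** 2 + 1e-12)
    for _ in range(n_iter):
        grad = D_I.T @ (D_I @ g - b) + lam    # l1 gradient is lam since g >= 0
        g = np.maximum(g - step * grad, 0.0)  # project onto the nonnegative orthant
    return g

def batch_mpl(D, b, lam=0.01, B=2, max_steps=10):
    """Batch-mode matching pursuit LASSO (BMPL) sketch: Q = D^T D and
    beta = D^T b are precomputed once, so each correlation
    D^T xi = beta - Q_I g_I costs O(k n_c) instead of O(d_S n_c)."""
    beta, Q = D.T @ b, D.T @ D
    I, g_I = [], np.zeros(0)
    for _ in range(max_steps):
        corr = beta - Q[:, I] @ g_I           # = D^T xi via the precomputed blocks
        corr[I] = -np.inf                     # never reselect active atoms
        J = np.argsort(corr)[-B:]             # indices of the B largest correlations
        if np.all(corr[J] <= lam):            # no atom can improve the objective
            break
        I = sorted(set(I) | set(J.tolist()))
        g_I = nn_lasso_pg(D[:, I], b, lam)    # solve the reduced problem on I
    g = np.zeros(D.shape[1])
    g[I] = g_I
    return g

rng = np.random.default_rng(0)
D = rng.standard_normal((60, 200))            # n_c = 60 tasks, d_S = 200 source dims
g_true = np.zeros(200)
g_true[[3, 50, 120]] = [1.0, 0.5, 2.0]        # sparse nonnegative ground truth
b = D @ g_true                                # noiseless "measurements"
g_hat = batch_mpl(D, b)
print(np.flatnonzero(g_hat > 0.3))            # support of the recovered mapping
```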
Compared to state-of-the-art HDA methods, our proposed method is much more efficient. The MOMAP method transforms the problem into c separate singular value decomposition (SVD) problems, resulting in a complexity of O(4d_S²d_T + 8d_S d_T² + 9d_T³). The HeMap method is based on standard eigen-decomposition, whose complexity is O((n_S + n_T)³). The ARC-t method solves an optimization problem that contains n_S n_T constraints by applying an alternating projection method (e.g., Bregman's algorithm (Censor and Zenios, 1997)), resulting in a complexity of O(d_S d_T n_S n_T). The HFA method adopts an alternating projection method to solve a semidefinite program (SDP), where the transformation matrix to be learned lies in R^{(n_S+n_T)×(n_S+n_T)}, resulting in a time complexity bounded by O((n_S + n_T)³). Therefore, ARC-t and HFA perform inefficiently when the data size is large. DAMA first constructs a series of combinatorial Laplacian matrices in R^{(d_S+d_T)×(d_S+d_T)}, and then solves a generalized eigen-decomposition problem with time complexity bounded by O((d_S + d_T)³). DAMA is therefore computationally very expensive when the data dimensionality is high. The time complexities of the different methods are summarized in Table 1.

Table 1: Complexity comparison of HDA methods.

Method | Complexity
MOMAP (Harel and Mannor, 2011) | O(4d_S²d_T + 8d_S d_T² + 9d_T³)
HeMap (Shi et al., 2010) | O((n_S + n_T)³)
DAMA (Wang and Mahadevan, 2011) | O((d_S + d_T)³)
ARC-t (Kulis et al., 2011) | O(d_S d_T n_S n_T)
HFA (Duan et al., 2012) | O((n_S + n_T)³)
SHFR (our proposed method) | O(d_T n_c k)

6. Experiments

In this section, we conduct experiments on both toy and real-world datasets to verify the effectiveness and efficiency of our proposed method SHFR for multi-class HDA. The parameter settings of our proposed method and the baseline methods are as follows.
For ARC-t and HFA, which require kernel functions to measure data similarity, we use the RBF kernel for learning the transformation. Parameter tuning remains an open research issue, as cross-validation is not applicable to HDA problems due to the limited amount of labeled data in the target domain. We therefore tune the parameters of the comparison methods over a predefined range and report their best results on the test data. We use linear SVMs as the base classifiers, whose regularization parameter C is cross-validated over the range {0.01, 0.1, 1, 10, 100} on the source domain data and then used for all the comparison methods. For our proposed method SHFR, we empirically set the maximum number of iterations to 20, λ = 0.01 and B = 2 in the BMPL algorithm, and generate ECOC codes as long as possible for each dataset.

6.1 Experiments on Synthetic Datasets

We first compare the performance of different HDA methods in terms of recovering a ground-truth feature mapping G on a 20-class toy dataset. To generate the toy dataset, we first randomly generate 150 instances with 150 features for each class from different Gaussian distributions to form a source domain X_S ∈ R^{150×3,000}. We then construct the ground-truth sparse feature mapping G ∈ R^{100×150} as follows: for each row i, we set G_ij = 1/5 for j = i, i+1, …, i+4, and G_ij = 0 otherwise. This construction of G implies that each target domain feature is represented by five source domain features. The ground-truth feature mapping is displayed in Figure 2(a), where the dark area represents the zero entries and the bright area denotes the nonzero values of G. Lastly, we construct the target domain data X_T ∈ R^{100×3,000} by setting X_T = G X_S. When conducting the experiment, we randomly select five instances per class from the target domain data X_T as the labeled training data, and apply the different HDA methods to them together with all 3,000 source domain labeled instances to recover the feature mapping G.
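The toy-data construction can be sketched as follows (hypothetical variable names; drawing the per-class Gaussian means from a standard normal is an assumption, as the paper only states that each class comes from a different Gaussian):

```python
import numpy as np

rng = np.random.default_rng(0)

# 20 classes x 150 instances, 150 source features, one Gaussian per class.
n_classes, n_per_class, d_s, d_t = 20, 150, 150, 100
means = rng.standard_normal((n_classes, d_s))
X_s = np.concatenate(
    [rng.normal(loc=m, scale=1.0, size=(n_per_class, d_s)) for m in means]
).T                                      # d_s x 3000, instances as columns

# Banded sparse ground-truth mapping: each target feature is a uniform
# combination of 5 consecutive source features (G_ij = 1/5 for j = i..i+4).
G = np.zeros((d_t, d_s))
for i in range(d_t):
    G[i, i:i + 5] = 1.0 / 5.0

X_t = G @ X_s                            # target domain data, d_t x 3000
print(X_s.shape, X_t.shape)              # (150, 3000) (100, 3000)
```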
In this experiment, the HDA methods DAMA and ARC-t are adopted as the baselines. For ease of comparison, we present the recovered matrix G for each of the three methods in Figures 2(b)-2(d).

Figure 2: Illustration of the recovered feature mappings using different methods on the toy data. Black represents 0 and white represents 1.

From Figure 2(b), we observe that DAMA fails to recover the structure of G, while ARC-t performs better, as can be seen in Figure 2(c). However, the Gs recovered by these two methods are not sparse. In contrast, SHFR recovers the sparse G almost perfectly, with little noise, as shown in Figure 2(d). These experimental results demonstrate that by explicitly adding sparsity constraints, SHFR is able to recover the feature mapping G more accurately.

6.2 Experiments on Real-world Datasets

We conduct experiments on three real-world datasets, the Multilingual Reuters Collection, the BBC Collection, and the Cross-lingual Sentiment Dataset, to verify the effectiveness and efficiency of SHFR. The reported results are averaged over 10 independent data splits.

6.2.1 DATASETS AND EXPERIMENTAL SETUP

Multilingual Reuters Collection2 is a text dataset with over 11,000 news articles from six categories in five languages, i.e., English, French, German, Italian and Spanish, represented by bag-of-words features weighted by TF-IDF. Following the setting in Duan et al. (2012), we use Spanish as the target domain and the other four languages as source domains, which results in four multi-class HDA problems. For each class, we randomly select 100 instances from the source domain and 10 instances from the target domain for training. We randomly select 10,000 instances from the target domain as the test data.

2. http://multilingreuters.iit.nrc.ca/ReutersMultiLingualMultiView.htm
Note that the original data is of very high dimensionality, and the baseline methods cannot handle such high-dimensional features. To conduct the comparison experiments, we first perform PCA with 60% energy preserved on the TF-IDF features. We thereby obtain 1,131 features for the English documents, 1,230 features for the French documents, 1,417 features for the German documents, 1,041 features for the Italian documents, and 807 features for the Spanish documents. In contrast to the baseline methods, we use the original features for SHFR, since it can efficiently handle high-dimensional data.

BBC Collection3 was collected for multi-view learning, and each instance is represented by three views. It was constructed from a single-view BBC corpus by splitting news articles into related views of text. We consider View 3 as the target domain, and View 1 and View 2 as source domains. Similar to the pre-processing on the Reuters dataset, we perform PCA on the original data to reduce the dimensionality so that the other baselines can be applied. The reduced dimensions for View 1, View 2 and View 3 are 203, 205 and 418, respectively. We randomly select 70% of the source domain instances and 10 target domain instances per class for training. The remaining target domain instances are used as the test data.

Cross-lingual Sentiment Dataset4 consists of Amazon product reviews of three product categories: books, DVDs and music. These reviews are written in four languages: English, German, French, and Japanese. We treat the English reviews as the source domain data and the reviews in the other languages as the target domain data. After performing PCA, the reviews have 715, 929, 964, and 874 features for English, German, French and Japanese, respectively. We randomly select 1,500 source domain instances and 10 target domain instances per class for training, and use the remaining 5,970 target domain instances for testing.
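Energy-preserving PCA of this kind can be sketched as follows (a generic SVD-based implementation on synthetic data, not the authors' exact preprocessing; `pca_energy` is a hypothetical helper name):

```python
import numpy as np

def pca_energy(X, energy=0.6):
    """Project the rows of X onto the smallest number of principal
    components that retain the given fraction of total variance."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)     # cumulative energy
    k = int(np.searchsorted(ratio, energy) + 1)    # components needed
    return Xc @ Vt[:k].T, k

rng = np.random.default_rng(0)
# Synthetic data with decaying per-feature scales, so a few components
# carry most of the energy.
X = rng.standard_normal((500, 50)) @ np.diag(np.linspace(3, 0.1, 50))
Z, k = pca_energy(X, energy=0.6)
print(Z.shape[1] == k and k < 50)   # True: far fewer dimensions retained
```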
6.2.2 OVERALL COMPARISON RESULTS

Comparison results between SHFR and the other baselines on the three real-world datasets are reported in Tables 2-4. From the tables, we observe that the SVMs trained on only a small number of target domain instances, denoted by SVM-T, using either the one-vs-one or the one-vs-all strategy, perform the worst on average. Moreover, the results of SVM-T using one-vs-one and one-vs-all are not consistent. For instance, on the BBC dataset in Table 3, SVM-T using the one-vs-all strategy performs much better than SVM-T using the one-vs-one strategy, while on the sentiment dataset in Table 4, SVM-T using the one-vs-one strategy performs much better than SVM-T using the one-vs-all strategy. The reason is that the labeled training data is too limited to train a precise and stable classifier in the target domain. Compared to one-vs-all and one-vs-one, SVM-T with ECOC shows much more stable performance and superior results on most problems, because of the error-correcting ability of ECOC when the labeled data is limited. The classification accuracies of the HDA baseline methods, DAMA, ARC-t and HFA, are comparable on the three datasets except for the BBC dataset, where DAMA performs much worse than the other two methods. This may be because the performance of DAMA is sensitive to the intrinsic manifold structure of the data; if the manifold assumption does not hold on the data, the performance of DAMA drops significantly. Our proposed SHFR method using either the one-vs-one or the ECOC scheme performs best on the three datasets. SHFR can further improve the classification accuracy, compared to one-vs-one, by using the ECOC scheme. The superior performance benefits from both the global transformation G and the error-correcting ability of ECOC. As discussed in Section 4.1, the recovered feature mapping G tends to be more accurate as more binary tasks are constructed.

3. http://mlg.ucd.ie/datasets/segment.html
4. http://www.uni-weimar.de/cms/medien/webis/research/corpora/corpus-webis-cls-10.html

Table 2: Multilingual Reuters Collection: comparison results in terms of classification accuracy (%). Results of SHFR are significantly better than the other baselines, judged by the t-test with a significance level of 0.05. SVM-T does not use the source domain, so its results are listed only once.

Source Domain | SVM-T (1-vs-all) | SVM-T (1-vs-1) | SVM-T (ECOC) | DAMA | ARC-t | HFA | SHFR (1-vs-1) | SHFR (ECOC)
English | 65.40±4.45 | 65.21±7.85 | 68.01±2.48 | 63.42±2.62 | 66.72±2.27 | 68.82±1.68 | 69.45±1.56 | 72.82±1.08
French | — | — | — | 65.12±1.28 | 67.67±2.05 | 68.14±1.71 | 70.77±1.57 | 74.01±1.25
German | — | — | — | 66.98±2.45 | 68.15±1.72 | 68.42±1.57 | 71.62±1.69 | 74.15±1.14
Italian | — | — | — | 67.56±2.23 | 67.47±2.32 | 69.59±2.51 | 70.75±1.54 | 73.35±1.31

Table 3: BBC Collection: comparison results in terms of classification accuracy (%). Results of SHFR are significantly better than the other baselines, judged by the t-test with a significance level of 0.05.

Source Domain | SVM-T (1-vs-all) | SVM-T (1-vs-1) | SVM-T (ECOC) | DAMA | ARC-t | HFA | SHFR (1-vs-1) | SHFR (ECOC)
View 1 | 73.35±4.98 | 68.78±11.7 | 76.45±5.47 | 67.42±2.25 | 75.78±2.69 | 72.45±10.65 | 89.81±1.20 | 90.57±1.49
View 2 | — | — | — | 66.87±1.73 | 74.53±2.48 | 73.75±8.06 | 88.92±1.67 | 91.85±0.96

Table 4: Cross-lingual Sentiment Dataset: comparison results in terms of classification accuracy (%). Results of SHFR are significantly better than the other baselines, judged by the t-test with a significance level of 0.05.

Target Domain | SVM-T (1-vs-all) | SVM-T (1-vs-1) | SVM-T (ECOC) | DAMA | ARC-t | HFA | SHFR (1-vs-1) | SHFR (ECOC)
French | 48.40±3.45 | 58.30±5.01 | 57.45±3.17 | 55.18±3.46 | 53.46±5.18 | 56.46±3.42 | 60.80±3.08 | 62.12±2.61
German | 49.78±5.12 | 61.62±6.31 | 62.18±4.51 | 56.60±3.72 | 57.29±2.17 | 55.14±3.15 | 63.85±3.46 | 65.57±2.32
Japanese | 48.25±6.34 | 56.13±4.81 | 57.27±2.61 | 53.56±2.67 | 55.75±3.02 | 55.02±4.67 | 59.67±3.75 | 62.58±2.86
6.3 Impact of the Training Sample Size of the Target Domain

We examine the impact of the size of the labeled training sample in the target domain on the overall HDA performance in terms of classification accuracy. We vary the number of target domain training instances from five to 20. In this experiment, we only report the results on the Reuters dataset, using English as the source domain and Spanish as the target domain. The experimental results are shown in Figure 3.

Figure 3: Comparison of different HDA methods under varying sizes of target labeled data.

From the figure, we observe that SHFR consistently outperforms the baseline methods for all numbers of labeled training instances in the target domain. In particular, SHFR shows significantly better performance than the baselines when the size of the target domain labeled data is smaller than 10. Based on Theorem 5, the overall generalization error of SHFR depends on the accuracy of the binary classifiers generated based on ECOC. To empirically demonstrate this point, we report the averaged accuracy of all binary classifiers generated based on ECOC on the Reuters dataset under varying numbers of labeled training instances in the target domain in Figure 4(a). We also report the overall multi-class prediction accuracy versus the averaged accuracy of all binary classifiers in Figure 4(b).

Figure 4: Relationships between the binary base classifiers and multi-class SHFR (ECOC). (a) Averaged binary accuracy vs. target data size. (b) Multi-class accuracy vs. averaged binary accuracy.
From the figures, we see that more labeled target data leads to more precise target binary classifiers, which in turn improves the multi-class HDA classification accuracy. In particular, we observe from Figure 4(b) that the multi-class classification accuracy increases roughly in proportion to the averaged binary task accuracy. The reason is that more precise target binary classifiers generally induce a better alignment between the target and source classifiers.

6.4 Impact of the Dimensionality of the Target Domain

In this experiment, we compare the performance of the different HDA methods under varying dimensionality of the target domain data. Here, we only report experimental results on the English-Spanish HDA task on the Reuters dataset. We first select the 5,000 most frequent features in the English domain as the source domain. We vary the dimensionality of the Spanish domain, i.e., the target domain, over the range {500, 1000, 2000, 3000, 5000} by selecting the most frequent features of this domain. The results are shown in Figure 5, from which we observe that SHFR consistently outperforms the other baselines, especially when the dimensionality of the target domain data is low.

Figure 5: Comparison of different HDA methods under varying dimensionality of the target domain data.

Table 5: Comparison of the coherence degree Σ_{j≠i} |ρ_ji| under different coding schemes.

Coding Scheme | English | French | German | Italian
1-vs-all | 0.8532±0.1358 | 0.7557±0.1289 | 0.2738±0.0245 | 0.9331±0.5701
1-vs-1 | 0.3374±0.0290 | 0.2509±0.0576 | 0.1168±0.0110 | 0.3322±0.1679
ECOC | 0.1349±0.0142 | 0.1478±0.0225 | 0.0732±0.0145 | 0.1725±0.1047

6.5 Incoherent Dictionary D Construction

In this experiment, we use the sum of the absolute correlations over all pairs of columns of the dictionary D, i.e., Σ_{j≠i} |ρ_ji|, to evaluate the coherence degree of D.
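This coherence measure can be computed as follows (an illustrative sketch; the synthetic dictionaries mimic correlated one-vs-all-style columns that share a common component versus incoherent random columns, not the actual classifier weight vectors):

```python
import numpy as np

def coherence_degree(D):
    """Sum of absolute pairwise column correlations, sum_{j != i} |rho_ji|.
    A smaller value means a more incoherent dictionary."""
    Dc = D - D.mean(axis=0)                      # center each column
    Dn = Dc / np.linalg.norm(Dc, axis=0)         # unit-norm columns
    R = Dn.T @ Dn                                # column correlation matrix
    return float(np.abs(R).sum() - D.shape[1])   # drop the diagonal (|rho_ii| = 1)

rng = np.random.default_rng(0)
base = rng.standard_normal(200)
# Columns sharing a common component (coherent, loosely analogous to
# one-vs-all classifiers that all separate one class from the same "rest")
# versus fully independent columns.
coherent = base[:, None] + 0.3 * rng.standard_normal((200, 6))
incoherent = rng.standard_normal((200, 6))
print(coherence_degree(coherent) > coherence_degree(incoherent))   # True
```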
In other words, the smaller the value of Σ_{j≠i} |ρ_ji| is, the more incoherent the dictionary is (Zhang and Huang, 2008). The results of the different coding schemes on the Reuters dataset are summarized in Table 5. From the table, we observe that ECOC significantly reduces the coherence degree, enabling the construction of a better dictionary for the formulated compressed sensing problem.

6.6 Error Correction through Learning a Global G

As discussed in Section 4, the weight vectors of the binary classifiers constructed in the target domain, i.e., the {w_T^t}'s, may be unreliable due to the lack of target labeled data, which may affect the estimation of G. To verify that SHFR can correct the bias of some binary classifiers, we compare SVM-T and SHFR in terms of classification accuracy on each binary task on the Reuters dataset. The results are shown in Table 6, where each column corresponds to a binary task, indexed by k ∈ {1, …, 15}.

Table 6: Comparison results on each binary task in terms of classification accuracy (%).

Binary Classifiers | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
SVM-T | 74.42 | 67.98 | 85.67 | 82.53 | 43.32 | 75.95 | 66.30 | 73.92
SHFR | 65.39 | 84.81 | 82.45 | 94.28 | 80.76 | 76.75 | 79.74 | 84.63
Difference | -9.03 | 16.82 | -3.22 | 11.75 | 37.44 | 0.80 | 13.44 | 10.70

Binary Classifiers | 9 | 10 | 11 | 12 | 13 | 14 | 15
SVM-T | 42.78 | 76.56 | 78.32 | 65.43 | 72.74 | 46.56 | 49.71
SHFR | 84.45 | 70.07 | 90.35 | 81.35 | 83.97 | 80.75 | 86.43
Difference | 41.67 | -6.49 | 12.03 | 15.92 | 11.23 | 34.20 | 36.72

From the table, we observe that the predictions of SVM-T on some binary tasks are inaccurate due to the limited amount of labeled data; the accuracy is even below 50% on tasks 5, 9, 14 and 15. However, by learning the transformation G jointly over all the binary tasks, we are able to reduce the bias of these weak binary classifiers and increase their accuracy by more than 30% on average.
Experiments on the other two datasets yield similar results: compared to SVM-T, the accuracy of the binary classifiers obtained through G is increased on average by 26.53% on the BBC dataset and by 3.01% on the Sentiment dataset. The experimental results in Table 6 also explain why SHFR outperforms ARC-t in the experiments reported in Tables 2-4 and Figure 3. ARC-t aims to align the target data with the source data via a transformation G learned from similarity and dissimilarity constraints, which are constructed from a large number of source domain labeled instances and a few target domain labeled instances. In the multi-class setting, due to the limited number of target domain labeled instances, the number of similarity constraints that can be constructed when the number of classes is large is much smaller than the number of dissimilarity constraints. In contrast, SHFR aligns the target classifiers with the source classifiers using a transformation G that is shared by all the induced binary tasks or classifiers; in other words, G is estimated from all the classifiers jointly. By borrowing the idea from multi-task feature learning, which jointly optimizes all classifiers to learn a global feature transformation G, it is still possible to estimate a stable and precise G for multi-class HDA. Furthermore, as discussed in Section 4.3, ECOC with a well-designed coding matrix is powerful for correcting errors, as shown in Table 6.

6.7 Impact of Dichotomizer Size on Classification Error

As proven in Section 4, the generalization error of SHFR for multi-class HDA depends on the averaged loss over all the tasks, which relies on two factors: the number of classifiers (or measurements) and the degree of sparsity of G. When G is sparse and sufficiently many binary classifiers are constructed, it is possible to recover a precise G by using the dictionary constructed from the {w_S^t}'s.
To demonstrate how the generalization error of SHFR changes under varying numbers of classifiers, or dichotomizers, we conduct experiments on the Reuters dataset. The experimental results are shown in Figure 6.

Figure 6: Accuracy vs. dichotomizer (or task) size.

From this figure, we observe that when more dichotomizers are constructed, the predictions in the target domain become more accurate. This verifies that more dichotomizers provide more information to recover the feature mapping G. Furthermore, the standard deviation decreases with an increasing number of dichotomizers; however, as observed from the figure, the multi-class accuracy no longer increases once the number of classifiers reaches 31. There are two reasons for this observation: 1) the redundancy in information among the constructed binary classifiers prevents further improvement in estimating G, and 2) according to García-Pedrajas and Ortiz-Boyer (2011), when there are many dichotomizers, the minimum distance in ECOC becomes small, which decreases the error-correction ability. In general, SHFR obtains better and more stable classification accuracy with an increasing number of dichotomizers.

Table 7: Impact of the nonnegative constraints and ℓ1-norm regularization: comparison between the results of different solvers in terms of classification accuracy (%). Results of SHFR are significantly better than the other baselines except for the cases denoted with *, judged by the t-test with a significance level of 0.05.

Dataset | Domain | SHFR-LASSO | SHFR-NNLS | SHFR
Reuters | English | 67.08±1.63 | 71.47±1.56 | 72.82±1.08
Reuters | French | 67.17±1.26 | 72.08±1.19 | 74.01±1.25
Reuters | German | 69.56±1.28 | 73.62±1.75* | 74.15±1.14
Reuters | Italian | 68.19±2.51 | 71.98±1.78 | 73.35±1.31
BBC | View 1 | 84.72±2.18 | 88.66±1.29 | 90.57±1.49
BBC | View 2 | 87.06±2.01 | 90.25±1.43 | 91.85±0.96
Sentiment | French | 56.12±2.30 | 60.17±2.05 | 62.12±2.61
Sentiment | German | 58.83±2.24 | 63.47±1.49 | 65.57±2.32
Sentiment | Japanese | 56.72±1.81 | 61.51±1.76* | 62.58±2.86

6.8 Impacts of Nonnegative Constraints and ℓ1-Norm Regularization

As discussed in Section 4, similar to the motivation of NMF, the nonnegative constraints on G serve to approximate the input vectors (i.e., target features) by nonnegative linear combinations of nonnegative basis vectors (i.e., source features). The idea of a nonnegative linear relationship can also be found in multiple kernel learning, where the target kernel is approximated by a nonnegative linear combination of base kernels. To verify the impact of the nonnegative constraints on G, we add a baseline for comparison by removing the nonnegative constraints from the optimization problem of SHFR; this variant is denoted by SHFR-LASSO. In this case, the model reduces to a standard LASSO problem. Regarding the ℓ1-norm regularization term in SHFR, note that some studies in the literature have shown that nonnegative least squares (NNLS) can indeed achieve sparse recovery of nonnegative signals in a noiseless setting (Bruckstein et al., 2008; Donoho and Tanner, 2010; Wang and Tang, 2009; Wang et al., 2011). However, the noiseless assumption does not hold in practice. As mentioned by Slawski et al. (2013), apart from the sign constraints, NNLS consists only of a fitting term, and is thus prone to overfitting, especially when the training data is high-dimensional.
The ℓ1-norm regularizer is therefore necessary to prevent over-adaptation to noisy data and to enforce desired structural properties of the solution, similar to the idea of adding sparsity to the coefficient vector of a predictive model (Slawski et al., 2013). In our proposed optimization problem (1), it is insufficient to achieve a desirable solution for G without the ℓ1-norm term in the objective, since both w_S^t and w_T^t are high-dimensional and some of the {w_T^t}'s are not precise. As stated in Theorem 1 of Slawski and Hein (2011), exact sparsity, i.e., the ℓ1-norm, is not needed when the matrix satisfies the self-regularizing property; this is not applicable to our optimization problem, as G does not have this property. Using the ℓ1-norm together with the nonnegative constraints ensures that a sparser solution can be achieved. To verify the impact of the ℓ1-norm on G, we add another baseline for comparison by removing the ℓ1-norm from the optimization problem of SHFR, which is denoted by SHFR-NNLS. In this case, the model reduces to an NNLS problem, and we adopt the NNLS solver proposed by Cantarella and Piatek (2004) to solve it.5 The results on the three multilingual datasets are summarized in Table 7. We observe from the table that SHFR-LASSO performs the worst and that SHFR-NNLS improves on it through the nonnegative constraint. This is probably because the feature mapping G is learned globally and the features are naturally nonnegative for text classification. As SHFR explicitly imposes both nonnegativity and sparsity in learning the transformation G, it further improves on SHFR-NNLS by avoiding over-adapting to the noise in the high-dimensional data.
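The three variants compared above differ only in which constraints are active. A small synthetic sketch (our own illustration; the paper instead modifies the solver of Tan et al. (2015a,b)) covers all three cases with one proximal-gradient routine by toggling the ℓ1 weight and the sign constraint:

```python
import numpy as np


def prox_solve(A, b, lam=0.0, nonneg=False, iters=4000):
    """Proximal gradient for 0.5*||A x - b||^2 + lam*||x||_1,
    optionally with x >= 0. The proximal step is soft-thresholding,
    followed by clipping when the sign constraint is active."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = x - step * (A.T @ (A @ x - b))                    # gradient step
        x = np.sign(x) * np.maximum(np.abs(x) - step * lam, 0.0)  # l1 prox
        if nonneg:
            x = np.maximum(x, 0.0)                            # sign constraint
    return x


rng = np.random.default_rng(0)
n, d, k = 100, 200, 10
x_true = np.zeros(d)
x_true[rng.choice(d, k, replace=False)] = rng.uniform(0.5, 1.5, k)  # sparse, nonnegative
A = rng.normal(size=(n, d)) / np.sqrt(n)
b = A @ x_true + 0.01 * rng.normal(size=n)   # noisy measurements

x_lasso = prox_solve(A, b, lam=0.05)                 # l1 only (SHFR-LASSO analogue)
x_nnls = prox_solve(A, b, nonneg=True)               # sign constraint only (SHFR-NNLS analogue)
x_shfr = prox_solve(A, b, lam=0.05, nonneg=True)     # both, as in SHFR
```

With a nonnegative true signal, the sign-constrained solutions contain no negative entries by construction, while the plain lasso solution may; combining both constraints matches the setting the section argues for.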
6.9 Experiments on Removing Negative Alignment

We have observed from Table 6 that, after learning and applying the transformation G, the accuracy of SHFR on some constructed binary tasks is slightly lower than that of SVM-T, which can be referred to as negative transfer (Pan and Yang, 2010). Negative transfer in SHFR mainly occurs for the following two reasons:

1. Negative alignment between the source and target classifiers, which happens when the classification accuracy of the corresponding target classifier is below 50%. This occurs because some initial target classifiers are inaccurate due to the limited number of target domain labeled data.

2. The transformation G is class-invariant and is therefore enforced to generalize well over all classifiers. In this case, G may correct the bias of some weak classifiers, but may also smooth out the strength of good classifiers.

In this experiment, we study how the performance of SHFR is affected by removing negative alignments, that is, how SHFR performs if only the good target classifiers whose accuracy is above 50% are used.5 We investigate whether dropping negative alignments improves accuracy on all the binary tasks as well as on the final multi-class HDA classification problem. We sort the binary tasks created from ECOC according to their accuracy on the test data and remove the poorly performing tasks. Experimental results are shown in Table 8 and Figure 7.

5. In practice, the nonnegative lasso problem has been well studied and applied to many applications such as dimensionality reduction (Wang and Ye, 2014) and index tracking (Wu et al., 2014). Nevertheless, how to solve the nonnegative lasso problem is not the focus of our paper. In this paper, we address it by modifying the algorithm proposed by Tan et al. (2015a,b), which has demonstrated superior performance over many state-of-the-art solvers.

Table 8: Comparison results on each binary task after removing negative alignment in terms of classification accuracy (%).

Binary Classifiers    1      2      3      4      5      6      7      8
SVM-T               74.42  67.98  85.67  82.53  43.32  75.95  66.30  73.92
SHFR-RN             85.39  83.26  86.45  92.43  83.79  78.19  77.45  81.28
Difference          10.97  15.28   0.87   9.90  40.47   2.24  11.15   7.36

Binary Classifiers    9     10     11     12     13     14     15
SVM-T               42.78  76.56  78.32  65.43  72.74  46.56  49.71
SHFR-RN             84.15  78.45  92.14  82.12  84.58  82.72  88.61
Difference          41.37   1.89  13.82  16.69  11.84  36.16  38.90

Compared to the results in Table 6, we see from Table 8 that removing the negative alignments improves the accuracy of all binary classifiers and thus avoids negative transfer. From Figure 7, we observe that when the number of dichotomizers is small (e.g., fewer than 30), SHFR with negative alignments removed achieves better multi-class classification accuracy. In practice, however, it is difficult to evaluate target classifiers due to the limited number of target domain labeled data. Interestingly, Figure 7 also shows that when the number of dichotomizers or tasks is sufficient, the classification accuracy of SHFR with or without removing negative alignments converges to the same value. This is because the sub-class partitioning in ECOC often leads to higher binary accuracy when many dichotomizers or tasks are created; in this case, fewer binary target classifiers have accuracy below 50%. We can therefore conclude that SHFR with negative alignments removed improves classification accuracy when there are insufficient classifiers. In the multi-class HDA setting, however, the difficulty of evaluating alignments makes this post-processing impractical.
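The removal step described above amounts to filtering the induced binary tasks before G is learned. A hypothetical sketch (the function and variable names are our own; the 50% threshold comes from the text):

```python
import numpy as np


def drop_negative_alignments(W_t, W_s, acc, thresh=0.5):
    """Keep only the dichotomizers whose target classifier beats chance.

    W_t, W_s : (L, d_t) and (L, d_s) arrays whose rows are the weight
               vectors of the L induced binary target/source classifiers.
    acc      : length-L accuracies of the target classifiers, estimated
               on the (few) labeled target examples.
    Returns the filtered weight matrices and the kept task indices.
    """
    keep = np.asarray(acc) > thresh
    return W_t[keep], W_s[keep], np.flatnonzero(keep)


# Toy usage: the sub-chance tasks (accuracy <= 0.5) are dropped before
# the measurement matrix for recovering G is formed.
acc = [0.74, 0.43, 0.86, 0.47, 0.77]
W_t = np.zeros((5, 3))
W_s = np.zeros((5, 4))
W_t_kept, W_s_kept, kept = drop_negative_alignments(W_t, W_s, acc)
```

As the section notes, this filter is only practical when the per-task accuracies can actually be estimated, which is hard with few labeled target examples.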
Fortunately, SHFR achieves the same performance as SHFR-RN when there are sufficient dichotomizers or tasks, which makes SHFR more practical for the multi-class HDA problem.

6.10 Training Time Comparison

Lastly, to verify the computational efficiency of SHFR, we conduct two experiments comparing the training time for learning the feature mapping between SHFR and the other baselines under varying numbers of data dimensions and data instances. These experiments are conducted on a toy dataset in which the source and target data are constructed to have the same dimensionality. In the first experiment, shown in Figure 8(a), we fix the number of training data (including both source and target domain data) to 3,000 and vary the data dimensions over {100, 200, 500, 1000, 2000, 5000}. We observe that SHFR is faster than all the baselines, and its training time increases slowly with the number of data dimensions. This is because the time complexity of SHFR is linear in the target domain dimensionality. We can also see that the training time of DAMA increases dramatically as the number of data dimensions increases. In contrast, the training time of HFA and ARC-t does not increase much with the number of data dimensions. This is because these two methods adopt the kernel trick, so that their computational time depends on the number of data instances instead of the data dimensions.

Figure 7: Accuracy vs. dichotomizer size when removing tasks with negative alignment (SHFR-ECOC vs. SHFR-ECOC-RN).

Figure 8: Time complexity comparison of DAMA, ARC-t, HFA, and SHFR: (a) training time vs. data dimensions; (b) training time vs. data size.
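The linear dependence on the target dimensionality arises because each column of G can be estimated independently from an L x d_s system, one per target feature. A minimal numpy sketch of this column-wise scheme (our own simplification: plain least squares with clipping standing in for the nonnegative sparse solver):

```python
import numpy as np


def estimate_G_columnwise(W_s, W_t):
    """Estimate G (d_s x d_t) from W_s (L x d_s) and W_t (L x d_t),
    assuming w_T ~ G^T w_S, i.e. W_t ~ W_s @ G. Each column of G is an
    independent least-squares problem over the L classifier pairs, so
    the total cost grows linearly with the target dimensionality d_t."""
    d_s, d_t = W_s.shape[1], W_t.shape[1]
    G = np.zeros((d_s, d_t))
    for j in range(d_t):
        g, *_ = np.linalg.lstsq(W_s, W_t[:, j], rcond=None)
        G[:, j] = np.maximum(g, 0.0)  # crude stand-in for the nonnegative constraint
    return G
```

Because no n x n kernel matrix is formed, the cost is driven by the number of dichotomizers L and the dimensions, not by the number of training instances, which is consistent with the scaling observed in Figure 8.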
In the second experiment, shown in Figure 8(b), we fix the data dimension to 1,000 and vary the number of training instances over {100, 500, 1000, 2000, 5000}. From the figure, we observe that the training time of SHFR changes only slightly as the number of training instances increases, whereas the training time of ARC-t and HFA grows polynomially with the number of training instances. These two experiments verify the superior computational efficiency of SHFR compared with the baseline methods.

7. Conclusion and Future Work

In this paper, we propose a sparse heterogeneous feature representation (SHFR) method that learns a sparse transformation between heterogeneous features for multi-class HDA by exploring the common underlying structures of multiple classes between the source and target domains. SHFR encodes the sparsity and class-invariance properties in learning the feature mapping, and learning the feature mapping can be cast as a Compressed Sensing (CS) problem. Based on CS theory, we show that the number of binary learning tasks affects the multi-class HDA performance. In addition, the proposed method has superior scalability over other methods. Extensive experiments demonstrate the effectiveness, efficiency, and stability of SHFR. In the future, we would like to investigate how to extend the proposed method by exploring nonlinear feature transformations between image and text feature representations.

Acknowledgments

We would like to thank the action editor Shie Mannor and the anonymous reviewers for their valuable comments and constructive suggestions, which greatly contributed to improving the final version of the paper. Joey Tianyi Zhou was partially supported by programmatic grants no. A1687b0033 and A18A1b0045 from the Singapore government's Research, Innovation and Enterprise 2020 plan (Advanced Manufacturing and Engineering domain). Ivor W.
Tsang acknowledges partial funding from the ARC Future Fellowship FT130100746 and ARC grants LP150100671 and DP180100106. Sinno J. Pan thanks the support from the NTU Singapore Nanyang Assistant Professorship (NAP) grant M4081532.020, Singapore MOE AcRF Tier-2 grant MOE2016-T2-2-060, and the Data Science & Artificial Intelligence Research Centre at NTU Singapore. Mingkui Tan was partially supported by National Natural Science Foundation of China (NSFC) grant 61602185, the Program for Guangdong Introducing Innovative and Entrepreneurial Teams 2017ZT07X183, and Guangdong Provincial Scientific and Technological Funds under Grant 2018B010107001.

References

Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. J. Mach. Learn. Res., 1:113-141, September 2001.

Rie K. Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res., 6:1817-1853, 2005.

Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In NIPS, pages 41-48. MIT Press, 2007.

Richard Baraniuk, Mark Davenport, Ronald DeVore, and Michael Wakin. A simple proof of the restricted isometry property for random matrices. Constr. Approx., 28(3):253-263, 2008.

Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In NIPS, pages 137-144, 2006.

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Mach. Learn., 79(1-2):151-175, May 2010.

John Blitzer, Ryan McDonald, and Fernando Pereira. Domain adaptation with structural correspondence learning. In EMNLP, pages 120-128, 2006.

John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. In NIPS, 2007a.

John Blitzer, Mark Dredze, and Fernando Pereira.
Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL, 2007b.

Alfred M. Bruckstein, Michael Elad, and Michael Zibulevsky. On the uniqueness of nonnegative sparse solutions to underdetermined systems of equations. IEEE Trans. Inform. Theory, 54(11):4813-4820, 2008.

Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59(8):1207-1223, August 2006.

Jason Cantarella and Michael Piatek. TSNNLS: A solver for large sparse least squares problems with non-negative variables. arXiv preprint cs/0408029, 2004.

Yair Censor and Stavros A. Zenios. Parallel Optimization: Theory, Algorithms and Applications. Oxford University Press, 1997.

Wenyuan Dai, Yuqiang Chen, Gui-Rong Xue, Qiang Yang, and Yong Yu. Translated learning: Transfer learning across different feature spaces. In NIPS, pages 353-360, 2008.

Hal Daumé III. Frustratingly easy domain adaptation. In ACL, pages 256-263. ACL, June 2007.

Thomas G. Dietterich. Ensemble methods in machine learning. In MCS, pages 1-15. Springer-Verlag, 2000.

Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. J. Artif. Intell. Res., 2:263-286, 1995.

Jeff Donahue, Judy Hoffman, Erik Rodner, Kate Saenko, and Trevor Darrell. Semi-supervised domain adaptation with instance constraints. In CVPR, 2013.

David L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52:1289-1306, 2006.

David L. Donoho and Jared Tanner. Counting the faces of randomly-projected hypercubes and orthants, with applications. Discrete & Comput. Geom., 43(3):522-541, 2010.

Lixin Duan, Ivor W. Tsang, Dong Xu, and Tat-Seng Chua. Domain adaptation from multiple sources via auxiliary classifiers. In ICML, pages 289-296, 2009.

Lixin Duan, Dong Xu, and Ivor W. Tsang. Learning with augmented features for heterogeneous domain adaptation. In ICML, 2012.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9:1871-1874, 2008.

Nicolás García-Pedrajas and Domingo Ortiz-Boyer. An empirical study of binary classifier fusion methods for multiclass classification. Inf. Fusion, 12(2):111-130, April 2011.

Rayid Ghani. Using error-correcting codes for text classification. In ICML, pages 303-310, 2000.

Mehmet Gönen and Ethem Alpaydın. Multiple kernel learning algorithms. J. Mach. Learn. Res., 12:2211-2268, 2011.

Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pages 2066-2073. IEEE, 2012.

Boqing Gong, Kristen Grauman, and Fei Sha. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In ICML, pages 222-230, 2013.

Maayan Harel and Shie Mannor. Learning from multiple outlooks. In ICML, pages 401-408, 2011.

Judy Hoffman, Erik Rodner, Jeff Donahue, Kate Saenko, and Trevor Darrell. Efficient learning of domain-invariant image representations. In ICLR, 2013.

Judy Hoffman, Sergio Guadarrama, Eric S. Tzeng, Ronghang Hu, Jeff Donahue, Ross Girshick, Trevor Darrell, and Kate Saenko. LSDA: Large scale detection through adaptation. In NIPS, pages 3536-3544, 2014.

Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Karsten M. Borgwardt, and Bernhard Schölkopf. Correcting sample selection bias by unlabeled data. In NIPS, pages 601-608, 2007.

Jing Jiang and ChengXiang Zhai. Instance weighting for domain adaptation in NLP. In ACL, pages 264-271. ACL, 2007.

Brian Kulis, Kate Saenko, and Trevor Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In CVPR, pages 1785-1792, 2011.

G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. J.
Mach. Learn. Res., 5:27-72, 2004.

Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In NIPS, pages 556-562, 2001.

Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In COLT, 2009.

Gholam Ali Montazer, Sergio Escalera, et al. Error correcting output codes for multiclass classification: Application to two image vision problems. In AISP, pages 508-513. IEEE, 2012.

Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Trans. Knowl. Data Eng., 22(10):1345-1359, October 2010.

Sinno Jialin Pan, James T. Kwok, and Qiang Yang. Transfer learning via dimensionality reduction. In AAAI, pages 677-682, July 2008.

Sinno Jialin Pan, Xiaochuan Ni, Jian-Tao Sun, Qiang Yang, and Chen Zheng. Cross-domain sentiment classification via spectral feature alignment. In WWW, pages 751-760. ACM, 2010.

Sinno Jialin Pan, Ivor W. Tsang, James T. Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Trans. Neural Netw., 22(2):199-210, 2011.

Peter Prettenhofer and Benno Stein. Cross-language text classification using structural correspondence learning. In ACL, pages 1118-1127, 2010.

Guojun Qi, Charu C. Aggarwal, and Thomas S. Huang. Towards semantic knowledge propagation from text corpus to web images. In WWW, pages 297-306, 2011.

Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In ECCV, pages 213-226. Springer-Verlag, 2010.

Chun-Wei Seah, Ivor Wai-Hung Tsang, and Yew-Soon Ong. Healing sample selection bias by source classifier selection. In ICDM, pages 577-586, 2011.

Xiaoxiao Shi, Qi Liu, Wei Fan, Philip S. Yu, and Ruixin Zhu. Transfer learning on heterogenous feature spaces via spectral transformation. In ICDM, pages 1049-1054, 2010.

Martin Slawski and Matthias Hein. Sparse recovery by thresholded non-negative least squares. In NIPS, pages 1926-1934, 2011.
Martin Slawski, Matthias Hein, et al. Non-negative least squares for high-dimensional linear models: Consistency and sparse recovery without regularization. Electron. J. Stat., 7:3004-3056, 2013.

Mingkui Tan, Ivor W. Tsang, and Li Wang. Matching pursuit LASSO part I: Sparse recovery over big dictionary. IEEE Trans. Signal Processing, 63(3):727-741, 2015a.

Mingkui Tan, Ivor W. Tsang, and Li Wang. Matching pursuit LASSO part II: Applications and sparse recovery over batch signals. IEEE Trans. Signal Processing, 63(3):742-753, 2015b.

Kim-Chuan Toh and Sangwoon Yun. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pac. J. Optim., 6:615-640, 2010.

Chang Wang and Sridhar Mahadevan. Heterogeneous domain adaptation using manifold alignment. In IJCAI, pages 1541-1546, 2011.

Jie Wang and Jieping Ye. Two-layer feature reduction for sparse-group lasso via decomposition of convex sets. In NIPS, pages 2132-2140, 2014.

Meng Wang and Ao Tang. Conditions for a unique non-negative solution to an underdetermined system. In Allerton, pages 301-307. IEEE, 2009.

Meng Wang, Weiyu Xu, and Ao Tang. A unique nonnegative solution to an underdetermined system: From vectors to matrices. IEEE Trans. Signal Processing, 59(3):1007-1016, 2011.

Lan Wu, Yuehan Yang, and Hanzhong Liu. Nonnegative-lasso and application in index tracking. Comput. Stat. Data An., 70:116-126, 2014.

Qiang Yang, Yuqiang Chen, Gui-Rong Xue, Wenyuan Dai, and Yong Yu. Heterogeneous transfer learning for image clustering via the social web. In ACL/IJCNLP, pages 1-9, 2009.

Ming Yuan and Yi Lin. On the non-negative garrotte estimator. J. Roy. Stat. Soc. B, 69(2):143-161, 2007.

Cun-Hui Zhang and Jian Huang. The sparsity and bias of the LASSO selection in high-dimensional linear regression. Ann. Statist., 36(4):1567-1594, 2008.

Tong Zhang. On the consistency of feature selection using greedy least squares regression. J. Mach.
Learn. Res., 10:555-568, 2009.

Joey Tianyi Zhou, Ivor W. Tsang, Shen-Shyang Ho, and Klaus-Robert Müller. N-ary decomposition for multi-class classification. Mach. Learn., pages 1-22.

Joey Tianyi Zhou, Sinno Jialin Pan, Ivor W. Tsang, and Yan Yan. Hybrid heterogeneous transfer learning through deep learning. In AAAI, pages 2213-2220, 2014.

Joey Tianyi Zhou, Sinno Jialin Pan, Ivor W. Tsang, and Shen-Shyang Ho. Transfer learning for cross-language text categorization through active correspondences construction. In AAAI, pages 2400-2406, 2016a.

Joey Tianyi Zhou, Xinxing Xu, Sinno Jialin Pan, Ivor W. Tsang, Zheng Qin, and Rick Siow Mong Goh. Transfer hashing with privileged information. In IJCAI, pages 2414-2420, 2016b.

Joey Tianyi Zhou, Heng Zhao, Xi Peng, Meng Fang, Zheng Qin, and Rick Siow Mong Goh. Transfer hashing: From shallow to deep. IEEE Trans. Neural Netw. Learning Syst., 29(12):6191-6201, 2018.

Yin Zhu, Yuqiang Chen, Zhongqi Lu, Sinno Jialin Pan, Gui-Rong Xue, Yong Yu, and Qiang Yang. Heterogeneous transfer learning for image classification. In AAAI, 2011.