# Transfer Learning via Learning to Transfer

Ying Wei (1, 2), Yu Zhang (1), Junzhou Huang (2), Qiang Yang (1)

(1) Hong Kong University of Science and Technology, Hong Kong. (2) Tencent AI Lab, Shenzhen, China. Correspondence to: Ying Wei, Qiang Yang.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

In transfer learning, what and how to transfer are two primary issues to be addressed, as different transfer learning algorithms applied between a source and a target domain result in different knowledge transferred and thereby different performance improvements in the target domain. Determining the optimal algorithm, the one that maximizes the performance improvement, requires either exhaustive exploration or considerable expertise. Meanwhile, it is widely accepted in educational psychology that human beings improve their transfer learning skills of deciding what to transfer through meta-cognitive reflection on inductive transfer learning practices. Motivated by this, we propose a novel transfer learning framework known as Learning to Transfer (L2T) to automatically determine what and how best to transfer by leveraging previous transfer learning experiences. We establish the L2T framework in two stages: 1) we learn a reflection function encrypting transfer learning skills from experiences; and 2) we infer what and how best to transfer for a future pair of domains by optimizing the reflection function. We also theoretically analyse the algorithmic stability and generalization bound of L2T, and empirically demonstrate its superiority over several state-of-the-art transfer learning algorithms.

## 1. Introduction

Inspired by human beings' capabilities to transfer knowledge across tasks, transfer learning aims to leverage knowledge from a source domain to improve the learning performance or minimize the number of labeled examples required in a target domain. It is of particular significance when tackling tasks with limited labeled examples. Transfer learning has proved its wide applicability in, for example, image classification (Long et al., 2015), sentiment classification (Blitzer et al., 2006), dialog systems (Mo et al., 2016), and urban computing (Wei et al., 2016).

Three key research issues in transfer learning, pointed out by Pan & Yang, are when to transfer, how to transfer, and what to transfer. Once transfer learning from a source domain is considered to benefit a target domain (when to transfer), an algorithm (how to transfer) discovers the transferable knowledge across domains (what to transfer). Different algorithms are likely to discover different transferable knowledge, and thereby lead to uneven transfer learning effectiveness, which is evaluated by the performance improvement over non-transfer baselines in a target domain. To achieve the optimal performance improvement for a target domain given a source domain, researchers may try tens to hundreds of transfer learning algorithms covering instance (Dai et al., 2007), parameter (Tommasi et al., 2014), and feature (Pan et al., 2011) based algorithms. Such brute-force exploration is computationally expensive and practically impossible. As a tradeoff, a sub-optimal improvement is usually obtained from a heuristically selected algorithm, which unfortunately requires considerable expertise applied in an ad hoc and unsystematic manner.

Exploring different algorithms is not the only way to optimize what to transfer. Previous transfer learning experiences also help, as has been widely accepted in educational psychology (Luria, 1976; Belmont et al., 1982). Human beings sharpen their transfer learning skills of deciding what to transfer by conducting meta-cognitive reflection on diverse transfer learning experiences.
For example, children who are good at playing chess may transfer mathematical skills, visuospatial skills, and decision making skills learned from chess to solve arithmetic problems, to solve pattern matching puzzles, and to play basketball, respectively. At a later age, it will be easier for them to decide to transfer mathematical and decision making skills learned from chess, rather than visuospatial skills, to market investment. Unfortunately, all existing transfer learning algorithms transfer from scratch and ignore previous transfer learning experiences.

Motivated by this, we propose a novel transfer learning framework called Learning to Transfer (L2T). The key idea of L2T is to enhance the transfer learning effectiveness from a source to a target domain by leveraging previous transfer learning experiences to optimize what and how to transfer between them. To achieve this goal, we establish L2T in two stages. During the first stage, we encode each transfer learning experience into three components: a pair of source and target domains, the transferred knowledge between them parameterized as latent feature factors, and the performance improvement. We learn from all experiences a reflection function which maps a pair of domains and the transferred knowledge between them to the performance improvement. The reflection function, therefore, is believed to encrypt transfer learning skills of deciding what and how to transfer. In the second stage, what to transfer between a newly arrived pair of domains is optimized so that the value of the learned reflection function, which corresponds to the performance improvement, is maximized.

The contribution of this paper is a novel transfer learning framework which opens a new door to improving transfer learning effectiveness by taking advantage of previous transfer learning experiences.
L2T can discover more transferable knowledge in a systematic and automatic fashion without requiring considerable expertise. We also provide theoretical analyses of its algorithmic stability and generalization bound, and conduct comprehensive empirical studies showing L2T's superiority over state-of-the-art transfer learning algorithms.

## 2. Related Work

**Transfer Learning.** Pan & Yang identified three key research issues in transfer learning as what, how, and when to transfer. Parameters (Yang et al., 2007a; Tommasi et al., 2014), instances (Dai et al., 2007), or latent feature factors (Pan et al., 2011) can be transferred between domains. A few works (Yang et al., 2007a; Tommasi et al., 2014) transfer parameters from source domains to regularize parameters of SVM-based models in a target domain. In (Dai et al., 2007), a basic learner in a target domain is boosted by borrowing the most useful source instances. Various techniques capable of learning transferable latent feature factors between domains have been investigated extensively. These techniques include manually selected pivot features (Blitzer et al., 2006), dimension reduction (Pan et al., 2011; Baktashmotlagh et al., 2013; 2014), collective matrix factorization (Long et al., 2014), dictionary learning and sparse coding (Raina et al., 2007; Zhang et al., 2016), manifold learning (Gopalan et al., 2011; Gong et al., 2012), and deep learning (Yosinski et al., 2014; Long et al., 2015; Tzeng et al., 2015). Unlike L2T, all existing transfer learning studies transfer from scratch, i.e., they consider only the pair of domains of interest and ignore previous transfer learning experiences. Better yet, L2T can even pool the wisdom of all these algorithms, considering that any algorithm mentioned above can be applied in a transfer learning experience.

Figure 1. Illustration of the differences between our work and the other three lines of work (panels: Transfer Learning, Multi-task Learning, Lifelong Learning, Learning to Transfer; each panel marks which of Task 1, ..., Task N, Task N+1 are used for training and which for testing).

**Multi-task Learning.** Multi-task learning (Caruana, 1997; Argyriou et al., 2007) trains multiple related tasks simultaneously and learns shared knowledge among tasks, so that all tasks reinforce each other in generalization abilities. However, multi-task learning assumes that training and testing examples follow the same distribution, as Figure 1 shows, which differs from the transfer learning setting we focus on.

**Lifelong Learning.** Assuming that a new learning task lies in the same environment as the training tasks, learning to learn (Thrun & Pratt, 1998) or meta-learning (Maurer, 2005; Finn et al., 2017; Al-Shedivat et al., 2018) transfers the knowledge shared among training tasks to the new task. Ruvolo & Eaton (2013) and Pentina & Lampert (2015) consider lifelong learning as online meta-learning. Though L2T and lifelong (meta) learning both aim to improve a learning system by leveraging histories, L2T differs from them in that each historical experience we consider is a transfer learning task rather than a traditional learning task, as Figure 1 illustrates. Thus we learn transfer learning skills instead of task-sharing knowledge.

## 3. Learning to Transfer

We begin by briefly introducing the proposed L2T framework. Then we detail the two stages in L2T, i.e., learning transfer learning skills from previous transfer learning experiences and applying those skills to infer what and how to transfer for a future pair of source and target domains.

### 3.1. The L2T Framework

An L2T agent has previously conducted transfer learning several times, and has kept a record of $N_e$ transfer learning experiences. We define each transfer learning experience as $E_e = (\langle \mathcal{S}_e, \mathcal{T}_e \rangle, a_e, l_e)$, in which $\mathcal{S}_e = \{\mathbf{X}^s_e, \mathbf{y}^s_e\}$ and $\mathcal{T}_e = \{\mathbf{X}^t_e, \mathbf{y}^t_e\}$ denote a source domain and a target domain, respectively.
$\mathbf{X}^{*}_e \in \mathbb{R}^{n^{*}_e \times m}$ represents the feature matrix of a domain with $n^{*}_e$ examples in an $m$-dimensional feature space $\mathcal{X}^{*}_e$, where the superscript $*$ can be either $s$ or $t$ to denote a source or a target domain. $\mathbf{y}^{*}_e \in \mathcal{Y}^{*}_e$ denotes the vector of labels, whose length is $n^{*}_{le}$. The number of labeled target examples is much smaller than that of labeled source examples, i.e., $n^t_{le} \ll n^s_{le}$. We focus on the setting $\mathcal{X}^s_e = \mathcal{X}^t_e$ and $\mathcal{Y}^s_e \neq \mathcal{Y}^t_e$ for each pair of domains. $a_e \in \mathcal{A} = \{a_1, \dots, a_{N_a}\}$ denotes the transfer learning algorithm that has been applied between $\mathcal{S}_e$ and $\mathcal{T}_e$. Suppose that the knowledge transferred by the algorithm $a_e$ can be parameterized as $\mathbf{W}_e$. Finally, each transfer learning experience is labeled by the performance improvement ratio $l_e = p^{st}_e / p^t_e$, where $p^t_e$ is the learning performance (e.g., classification accuracy) on a test dataset in $\mathcal{T}_e$ without transfer and $p^{st}_e$ is that on the same test dataset after transferring $\mathbf{W}_e$ from $\mathcal{S}_e$.

Figure 2. Illustration of the L2T framework: in the training stage, we have $N_e$ transfer learning experiences $\{E_1, \dots, E_{N_e}\}$ from which we learn a reflection function $f$ encrypting transfer learning skills (step 1); in the testing stage, given the $(N_e+1)$-th source-target pair and the learned reflection function $f$, we optimize the transferred knowledge between them, i.e., $\mathbf{W}^{*}_{N_e+1} = \arg\max_{\mathbf{W}} f(\mathcal{S}_{N_e+1}, \mathcal{T}_{N_e+1}, \mathbf{W})$, by maximizing the value of $f$ (step 2).

With $N_e$ transfer learning experiences $\{E_1, \dots, E_{N_e}\}$ as the input, the L2T agent learns a function $f$ such that $f(\mathcal{S}_e, \mathcal{T}_e, \mathbf{W}_e)$ approximates $l_e$, as shown in the training stage of Figure 2.
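
The record kept for each experience can be pictured with a small data structure. The following is a minimal sketch under stated assumptions: the class and field names (`Domain`, `Experience`, `acc_transfer`, `acc_no_transfer`) are hypothetical and chosen only for illustration; the paper prescribes the tuple $E_e$ and the ratio $l_e = p^{st}_e / p^t_e$, not any particular implementation.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Domain:
    """A domain {X, y}: features of all examples, labels of the labeled ones."""
    X: np.ndarray  # feature matrix, shape (n, m)
    y: np.ndarray  # label vector, length n_l (number of labeled examples)


@dataclass
class Experience:
    """One transfer learning experience E_e (hypothetical field names)."""
    source: Domain
    target: Domain
    algorithm: str          # name of the applied algorithm a_e
    W: np.ndarray           # transferred knowledge W_e, e.g. shape (m, u)
    acc_transfer: float     # p^{st}_e: test performance after transfer
    acc_no_transfer: float  # p^t_e: test performance without transfer

    @property
    def improvement_ratio(self) -> float:
        """The label l_e = p^{st}_e / p^t_e of the experience."""
        return self.acc_transfer / self.acc_no_transfer


# Illustrative values only; they are not results from the paper.
example = Experience(
    source=Domain(X=np.zeros((10, 4)), y=np.zeros(10)),
    target=Domain(X=np.zeros((6, 4)), y=np.zeros(3)),
    algorithm="TCA",
    W=np.zeros((4, 2)),
    acc_transfer=0.6,
    acc_no_transfer=0.5,
)
print(example.improvement_ratio)
```

A ratio above 1 means that transfer helped on this pair of domains; the reflection function is trained to predict this number from the domains and $\mathbf{W}_e$.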
We call $f$ a reflection function, which encrypts meta-cognitive transfer learning skills, i.e., what and how to transfer so as to maximize the improvement ratio given a pair of domains. Whenever a new pair of domains $\langle \mathcal{S}_{N_e+1}, \mathcal{T}_{N_e+1} \rangle$ arrives, the L2T agent can optimize the knowledge to be transferred, i.e., $\mathbf{W}^{*}_{N_e+1}$, by maximizing the value of $f$ (see step 2 of the testing stage in Figure 2).

### 3.2. Parameterizing What to Transfer

The transfer learning algorithms applied can vary from experience to experience. Uniformly parameterizing what to transfer for any algorithm out of the base algorithm set $\mathcal{A}$ is a prerequisite for learning the reflection function. In this work, we consider $\mathcal{A}$ to contain algorithms transferring single-level latent feature factors, because existing parameter-based and instance-based algorithms cannot address the transfer learning setting we focus on (i.e., $\mathcal{X}^s_e = \mathcal{X}^t_e$ and $\mathcal{Y}^s_e \neq \mathcal{Y}^t_e$). Though a limited number of parameter-based algorithms (Yang et al., 2007a; Tommasi et al., 2014) can transfer across domains in heterogeneous label spaces, they can only handle binary classification problems. Deep neural network based algorithms (Yosinski et al., 2014; Long et al., 2015; Tzeng et al., 2015), which transfer latent feature factors at multiple levels, are left for our future research. As a result, we parameterize what to transfer with a latent feature factor matrix $\mathbf{W}$, which is elaborated in the following.

Latent feature factor based algorithms aim to learn domain-invariant feature factors across domains. Consider classifying dog pictures as a source domain and cat pictures as a target domain. The domain-invariant feature factors may include eyes, mouth, tails, etc. What to transfer, in this case, is the shared feature factors across domains. The way of defining domain-invariant feature factors dictates two groups of latent feature factor based algorithms, i.e., common latent space based and manifold ensemble based algorithms.
**Common Latent Space Based.** This line of algorithms, including but not limited to TCA (Pan et al., 2011), LSDT (Zhang et al., 2016), and DIP (Baktashmotlagh et al., 2013), assumes that domain-invariant feature factors lie in a single shared latent space. We denote by $\phi$ the function mapping the original feature representation into the latent space. If $\phi$ is linear, it can be represented as an embedding matrix $\mathbf{W} \in \mathbb{R}^{m \times u}$, where $u$ is the dimensionality of the latent space. Therefore, we can parameterize what to transfer with $\mathbf{W}$, which describes $u$ latent feature factors. If $\phi$ is nonlinear, what to transfer can still be parameterized with $\mathbf{W}$. Though a nonlinear $\phi$ is not explicitly specified in most cases, such as LSDT using sparse coding, the target examples represented in the latent space, $\mathbf{Z}^t_e = \phi(\mathbf{X}^t_e) \in \mathbb{R}^{n^t_e \times u}$, are always available. Consequently, we obtain the similarity metric matrix (Cao et al., 2013) in the latent space, i.e., $\mathbf{G} = (\mathbf{X}^t_e)^{\dagger} \mathbf{Z}^t_e (\mathbf{Z}^t_e)^T [(\mathbf{X}^t_e)^T]^{\dagger} \in \mathbb{R}^{m \times m}$, according to $\mathbf{X}^t_e \mathbf{G} (\mathbf{X}^t_e)^T = \mathbf{Z}^t_e (\mathbf{Z}^t_e)^T$, where $(\mathbf{X}^t_e)^{\dagger}$ is the pseudo-inverse of $\mathbf{X}^t_e$. The LDL decomposition $\mathbf{G} = \mathbf{L}\mathbf{D}\mathbf{L}^T$ yields the latent feature factor matrix $\mathbf{W} = \mathbf{L}\mathbf{D}^{1/2}$.

**Manifold Ensemble Based.** Initiated by Gopalan et al., manifold ensemble based algorithms consider that a source and a target domain share multiple subspaces (of the same dimension) as points on the Grassmann manifold between them. If $n_u$ subspaces on the manifold are sampled, the representation of target examples on $u$ domain-invariant latent factors becomes $\mathbf{Z}^{t(n_u)}_e = [\phi_1(\mathbf{X}^t_e), \dots, \phi_{n_u}(\mathbf{X}^t_e)] \in \mathbb{R}^{n^t_e \times n_u u}$. When all continuous subspaces on the manifold are sampled, i.e., $n_u \to \infty$, Gong et al. proved that $\mathbf{Z}^{t(\infty)}_e (\mathbf{Z}^{t(\infty)}_e)^T = \mathbf{X}^t_e \mathbf{G} (\mathbf{X}^t_e)^T$, where $\mathbf{G}$ is the similarity metric matrix. For computational details of $\mathbf{G}$, please refer to (Gong et al., 2012).
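
The recovery of a latent feature factor matrix from latent target representations, as described for the common latent space case, can be sketched numerically. This is an illustrative sketch, not the authors' code: the helper names `ldl_psd` and `latent_factors` are hypothetical, the LDL routine is a plain unpivoted variant that assumes $\mathbf{G}$ is symmetric positive semi-definite, and the data are synthetic, generated from a linear map so that the identity $\mathbf{X}\mathbf{G}\mathbf{X}^T = \mathbf{Z}\mathbf{Z}^T$ can be checked.

```python
import numpy as np


def ldl_psd(G, rel_tol=1e-10):
    """Unpivoted LDL^T of a symmetric PSD matrix (assumption: G is PSD).

    Returns a unit lower-triangular L and the diagonal d of D.
    """
    A = np.array(G, dtype=float)        # working copy holding Schur complements
    m = A.shape[0]
    L = np.eye(m)
    d = np.zeros(m)
    tol = rel_tol * max(np.trace(A), 1e-300)
    for j in range(m):
        pivot = A[j, j]
        if pivot > tol:
            col = A[j + 1:, j].copy()
            d[j] = pivot
            L[j + 1:, j] = col / pivot
            A[j + 1:, j + 1:] -= np.outer(col, col) / pivot
        # Otherwise the pivot is numerically zero; for a PSD matrix the rest
        # of the column is then zero as well, so d[j] stays 0.
    return L, d


def latent_factors(X_t, Z_t):
    """Compute G = X^+ Z Z^T (X^+)^T and W = L D^{1/2}."""
    X_pinv = np.linalg.pinv(X_t)              # shape (m, n)
    G = X_pinv @ Z_t @ Z_t.T @ X_pinv.T       # shape (m, m)
    G = 0.5 * (G + G.T)                       # remove rounding asymmetry
    L, d = ldl_psd(G)
    W = L * np.sqrt(d)                        # same as L @ diag(sqrt(d))
    return W, G


# Synthetic check: Z is produced by a hidden linear map, so X G X^T = Z Z^T.
rng = np.random.default_rng(0)
X_t = rng.normal(size=(20, 4))
Z_t = X_t @ rng.normal(size=(4, 2))
W, G = latent_factors(X_t, Z_t)
print(np.allclose(W @ W.T, G, atol=1e-8))
```

Note that $\mathbf{W}$ obtained this way is $m \times m$, with columns for zero pivots equal to zero; it reproduces the similarities $\mathbf{Z}\mathbf{Z}^T$ rather than the individual coordinates of $\mathbf{Z}$.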
Therefore, $\mathbf{W} = \mathbf{L}\mathbf{D}^{1/2}$, with $\mathbf{L}$ and $\mathbf{D}$ obtained by performing LDL decomposition on $\mathbf{G} = \mathbf{L}\mathbf{D}\mathbf{L}^T$, is also qualified to represent latent feature factors distributed in a series of subspaces on a manifold.

### 3.3. Learning from Experiences

The goal here is to learn a reflection function $f$ such that $f(\mathcal{S}_e, \mathcal{T}_e, \mathbf{W}_e)$ approximates $l_e$ for all experiences $\{E_1, \dots, E_{N_e}\}$. The improvement ratio $l_e$ is closely related to two aspects: 1) the difference between a source and a target domain in the shared latent space, and 2) the discriminative ability of a target domain in the latent space. A smaller difference guarantees more overlap between domains in the latent space, which signifies more transferable latent feature factors and, as a result, higher improvement ratios. The discriminative ability of a target domain in the latent space is also vital to improving performance. Therefore, we build $f$ to take both aspects into consideration.

**The Difference between a Source and a Target Domain.** We follow (Pan et al., 2011) and adopt the maximum mean discrepancy (MMD) (Gretton et al., 2012b) to measure the difference between domains. By mapping two domains into the reproducing kernel Hilbert space (RKHS), MMD empirically evaluates the distance between the mean of the source examples and that of the target examples:

$$
\begin{aligned}
\hat{d}^2_e(\mathbf{X}^s_e \mathbf{W}_e, \mathbf{X}^t_e \mathbf{W}_e)
&= \Big\| \frac{1}{n^s_e} \sum_{i=1}^{n^s_e} \varphi(\mathbf{x}^s_{ei} \mathbf{W}_e) - \frac{1}{n^t_e} \sum_{j=1}^{n^t_e} \varphi(\mathbf{x}^t_{ej} \mathbf{W}_e) \Big\|^2_{\mathcal{H}} \\
&= \frac{1}{(n^s_e)^2} \sum_{i,i'=1}^{n^s_e} K(\mathbf{x}^s_{ei} \mathbf{W}_e, \mathbf{x}^s_{ei'} \mathbf{W}_e)
 + \frac{1}{(n^t_e)^2} \sum_{j,j'=1}^{n^t_e} K(\mathbf{x}^t_{ej} \mathbf{W}_e, \mathbf{x}^t_{ej'} \mathbf{W}_e)
 - \frac{2}{n^s_e n^t_e} \sum_{i,j=1}^{n^s_e, n^t_e} K(\mathbf{x}^s_{ei} \mathbf{W}_e, \mathbf{x}^t_{ej} \mathbf{W}_e),
\end{aligned}
\tag{1}
$$

where $\mathbf{x}^t_{ej}$ is the $j$-th example in $\mathbf{X}^t_e$, and $\varphi$ maps from the $u$-dimensional latent space to the RKHS $\mathcal{H}$. $K(\cdot, \cdot) = \langle \varphi(\cdot), \varphi(\cdot) \rangle$ is the kernel function. Different kernels $K$ lead to different MMD distances and thereby different values of $f$. Thus learning the reflection function $f$ is equivalent to optimizing $K$ so that the MMD distance can well characterize the improvement ratio $l_e$ for all pairs of domains.
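
Equation (1) can be evaluated directly from kernel matrices. The sketch below is illustrative only: the function names are hypothetical, the data are synthetic, and the kernel is the RBF form $\exp(-\|a-b\|^2/\delta)$ that the paper adopts, evaluated at a few arbitrarily chosen bandwidths to show that different kernels give different distances.

```python
import numpy as np


def rbf_kernel(A, B, delta):
    """K(a, b) = exp(-||a - b||^2 / delta) for all rows a of A and b of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / delta)


def mmd2(Xs, Xt, W, delta):
    """Empirical squared MMD of Equation (1) in the latent space defined by W."""
    Zs, Zt = Xs @ W, Xt @ W
    return (rbf_kernel(Zs, Zs, delta).mean()          # 1/(n_s^2) * sum K(s, s')
            + rbf_kernel(Zt, Zt, delta).mean()        # 1/(n_t^2) * sum K(t, t')
            - 2.0 * rbf_kernel(Zs, Zt, delta).mean()) # 2/(n_s n_t) * sum K(s, t)


def mmd_vector(Xs, Xt, W, deltas):
    """One squared MMD per kernel bandwidth."""
    return np.array([mmd2(Xs, Xt, W, delta) for delta in deltas])


# Synthetic data: a target close to the source and a target shifted away from it.
rng = np.random.default_rng(0)
Xs = rng.normal(size=(50, 5))
Xt_near = rng.normal(size=(50, 5))
Xt_far = Xt_near + 3.0
W = np.eye(5)[:, :2]          # hypothetical embedding keeping two coordinates

print(mmd_vector(Xs, Xt_near, W, [0.5, 1.0, 2.0]))
print(mmd_vector(Xs, Xt_far, W, [0.5, 1.0, 2.0]))
```

The shifted target yields a visibly larger distance than the nearby one, and the values change with the bandwidth, which is why the choice of kernel matters for the reflection function.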
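Inspired by multi-kernel MMD (Gretton et al., 2012b), we parameterize $K$ as a linear combination of $N_k$ PSD kernels, i.e., $K = \sum_{k=1}^{N_k} \beta_k K_k$ ($\beta_k \geq 0, \forall k$), and learn the coefficients $\boldsymbol{\beta} = [\beta_1, \dots, \beta_{N_k}]$ instead. Using $\boldsymbol{\beta}$, the MMD can be rewritten as $\hat{d}^2_e(\mathbf{X}^s_e \mathbf{W}_e, \mathbf{X}^t_e \mathbf{W}_e) = \sum_{k=1}^{N_k} \beta_k \hat{d}^2_{e(k)}(\mathbf{X}^s_e \mathbf{W}_e, \mathbf{X}^t_e \mathbf{W}_e) = \boldsymbol{\beta}^T \hat{\mathbf{d}}_e$, where $\hat{\mathbf{d}}_e = [\hat{d}^2_{e(1)}, \dots, \hat{d}^2_{e(N_k)}]$ with $\hat{d}^2_{e(k)}$ computed by the $k$-th kernel $K_k$. In this paper, we consider RBF kernels $K_k(\mathbf{a}, \mathbf{b}) = \exp(-\|\mathbf{a} - \mathbf{b}\|^2 / \delta_k)$ obtained by varying the bandwidth $\delta_k$.

Unfortunately, the MMD alone is insufficient to measure the difference between domains. The distance variance among all pairs of instances across domains is also required to fully characterize the difference: a pair of domains with small MMD but extremely high variance still has little overlap. Equation (1) is actually the empirical estimate of $d^2_e(\mathbf{X}^s_e \mathbf{W}_e, \mathbf{X}^t_e \mathbf{W}_e) = \mathbb{E}_{\mathbf{x}^s_e \mathbf{x}^{s\prime}_e \mathbf{x}^t_e \mathbf{x}^{t\prime}_e}\, h(\mathbf{x}^s_e, \mathbf{x}^{s\prime}_e, \mathbf{x}^t_e, \mathbf{x}^{t\prime}_e)$ (Gretton et al., 2012b), where

$$
h(\mathbf{x}^s_e, \mathbf{x}^{s\prime}_e, \mathbf{x}^t_e, \mathbf{x}^{t\prime}_e) = K(\mathbf{x}^s_e \mathbf{W}_e, \mathbf{x}^{s\prime}_e \mathbf{W}_e) + K(\mathbf{x}^t_e \mathbf{W}_e, \mathbf{x}^{t\prime}_e \mathbf{W}_e) - K(\mathbf{x}^s_e \mathbf{W}_e, \mathbf{x}^{t\prime}_e \mathbf{W}_e) - K(\mathbf{x}^{s\prime}_e \mathbf{W}_e, \mathbf{x}^t_e \mathbf{W}_e).
$$

Consequently, the distance variance $\sigma^2_e$ equals

$$
\sigma^2_e(\mathbf{X}^s_e \mathbf{W}_e, \mathbf{X}^t_e \mathbf{W}_e) = \mathbb{E}_{\mathbf{x}^s_e \mathbf{x}^{s\prime}_e \mathbf{x}^t_e \mathbf{x}^{t\prime}_e} \Big[ \big( h(\mathbf{x}^s_e, \mathbf{x}^{s\prime}_e, \mathbf{x}^t_e, \mathbf{x}^{t\prime}_e) - \mathbb{E}_{\mathbf{x}^s_e \mathbf{x}^{s\prime}_e \mathbf{x}^t_e \mathbf{x}^{t\prime}_e}\, h(\mathbf{x}^s_e, \mathbf{x}^{s\prime}_e, \mathbf{x}^t_e, \mathbf{x}^{t\prime}_e) \big)^2 \Big].
$$

To be consistent with the MMD characterized by $N_k$ PSD kernels, we rewrite $\sigma^2_e = \boldsymbol{\beta}^T \mathbf{Q}_e \boldsymbol{\beta}$, where

$$
\mathbf{Q}_e = \operatorname{cov}(h) =
\begin{bmatrix}
\sigma_{e(1,1)} & \cdots & \sigma_{e(1,N_k)} \\
\vdots & \ddots & \vdots \\
\sigma_{e(N_k,1)} & \cdots & \sigma_{e(N_k,N_k)}
\end{bmatrix}.
$$

Each element is $\sigma_{e(k_1,k_2)} = \operatorname{cov}(h_{k_1}, h_{k_2}) = \mathbb{E}\,[(h_{k_1} - \mathbb{E} h_{k_1})(h_{k_2} - \mathbb{E} h_{k_2})]$. Note that $\mathbb{E} h_{k_1}$ is shorthand for $\mathbb{E}_{\mathbf{x}^s_e \mathbf{x}^{s\prime}_e \mathbf{x}^t_e \mathbf{x}^{t\prime}_e}\, h_{k_1}(\mathbf{x}^s_e, \mathbf{x}^{s\prime}_e, \mathbf{x}^t_e, \mathbf{x}^{t\prime}_e)$, where $h_{k_1}$ is calculated using the $k_1$-th kernel. We detail the empirical estimate $\hat{\mathbf{Q}}_e$ of $\mathbf{Q}_e$ in the supplementary material due to the page limit.

**The Discriminative Ability of a Target Domain.** In view of the limited labeled examples in a target domain, we resort to unlabeled examples to evaluate the discriminative ability.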
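The principles of the unlabeled discriminant criterion are two-fold: 1) similar examples should still be neighbours after being embedded into the latent space; and 2) dissimilar examples should be far away from each other. We adopt the unlabeled discriminant criterion proposed in (Yang et al., 2007b),

$$
\tau_e = \frac{\operatorname{tr}(\mathbf{W}_e^T \mathbf{S}^N_e \mathbf{W}_e)}{\operatorname{tr}(\mathbf{W}_e^T \mathbf{S}^L_e \mathbf{W}_e)},
$$

where

$$
\mathbf{S}^L_e = \sum_{j,j'=1}^{n^t_e} \frac{H_{jj'}}{(n^t_e)^2} (\mathbf{x}^t_{ej} - \mathbf{x}^t_{ej'})(\mathbf{x}^t_{ej} - \mathbf{x}^t_{ej'})^T
$$

is the local scatter covariance matrix, with the neighbour information $H_{jj'}$ defined as

$$
H_{jj'} =
\begin{cases}
K(\mathbf{x}^t_{ej}, \mathbf{x}^t_{ej'}), & \text{if } \mathbf{x}^t_{ej} \in N_r(\mathbf{x}^t_{ej'}) \text{ and } \mathbf{x}^t_{ej'} \in N_r(\mathbf{x}^t_{ej}), \\
0, & \text{otherwise}.
\end{cases}
$$

That is, if $\mathbf{x}^t_{ej}$ and $\mathbf{x}^t_{ej'}$ are mutual $r$-nearest neighbours, $H_{jj'}$ equals the kernel value $K(\mathbf{x}^t_{ej}, \mathbf{x}^t_{ej'})$. When the unlabeled discriminant criterion $\tau_e$ is maximized, the local scatter covariance matrix guarantees the first principle, while the non-local scatter covariance matrix,

$$
\mathbf{S}^N_e = \sum_{j,j'=1}^{n^t_e} \frac{K(\mathbf{x}^t_{ej}, \mathbf{x}^t_{ej'}) - H_{jj'}}{(n^t_e)^2} (\mathbf{x}^t_{ej} - \mathbf{x}^t_{ej'})(\mathbf{x}^t_{ej} - \mathbf{x}^t_{ej'})^T,
$$

enforces the second principle. $\tau_e$ also depends on the kernels, which in this case indicate different neighbour information and different degrees of similarity between neighbouring examples. With $\tau_{e(k)}$ obtained from the $k$-th kernel $K_k$, the unlabeled discriminant criterion $\tau_e$ can be written as $\tau_e = \sum_{k=1}^{N_k} \beta_k \tau_{e(k)} = \boldsymbol{\beta}^T \boldsymbol{\tau}_e$, where $\boldsymbol{\tau}_e = [\tau_{e(1)}, \dots, \tau_{e(N_k)}]$.

**The Optimization Problem.** Combining the two aspects mentioned above to model the reflection function $f$, we finally formulate the optimization problem as follows,

$$
\begin{aligned}
\boldsymbol{\beta}^*, \lambda^*, \mu^*, b^* = \arg\min_{\boldsymbol{\beta}, \lambda, \mu, b}\ & \sum_{e=1}^{N_e} L_h\Big( \boldsymbol{\beta}^T \hat{\mathbf{d}}_e + \lambda \boldsymbol{\beta}^T \hat{\mathbf{Q}}_e \boldsymbol{\beta} + \frac{\mu}{\boldsymbol{\beta}^T \boldsymbol{\tau}_e} + b,\ \frac{1}{l_e} \Big) + \gamma_1 R(\boldsymbol{\beta}, \lambda, \mu, b), \\
\text{s.t.}\ & \beta_k \geq 0,\ \forall k \in \{1, \dots, N_k\},\ \lambda \geq 0,\ \mu \geq 0,
\end{aligned}
\tag{2}
$$

where $1/f = \boldsymbol{\beta}^T \hat{\mathbf{d}}_e + \lambda \boldsymbol{\beta}^T \hat{\mathbf{Q}}_e \boldsymbol{\beta} + \frac{\mu}{\boldsymbol{\beta}^T \boldsymbol{\tau}_e} + b$ and $L_h(\cdot)$ is the Huber regression loss (Huber et al., 1964), constraining the value of $1/f$ to be as close to $1/l_e$ as possible. $\gamma_1$ controls the complexity of the parameters through $\ell_2$-regularization.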
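
To make the structure of problem (2) concrete, the sketch below evaluates $1/f$ and the regularized Huber objective for given parameters. It is an illustration under stated assumptions, not the authors' solver: the function names are hypothetical, the Huber threshold `delta` is an assumed hyperparameter the text does not specify, the regularizer $R$ is assumed to be the sum of squared parameters, and the non-negativity constraints are left to whichever constrained optimizer would minimize this objective.

```python
import numpy as np


def inverse_reflection(beta, lam, mu, b, d_hat, Q_hat, tau):
    """1/f = beta^T d + lam * beta^T Q beta + mu / (beta^T tau) + b."""
    return float(beta @ d_hat + lam * (beta @ Q_hat @ beta) + mu / (beta @ tau) + b)


def huber(residual, delta=1.0):
    """Huber loss: quadratic near zero, linear beyond the threshold delta."""
    r = abs(residual)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)


def objective(beta, lam, mu, b, experiences, gamma1, delta=1.0):
    """Objective of problem (2) for experiences given as (d_hat, Q_hat, tau, l)."""
    loss = sum(
        huber(inverse_reflection(beta, lam, mu, b, d_hat, Q_hat, tau) - 1.0 / l, delta)
        for d_hat, Q_hat, tau, l in experiences
    )
    reg = float(beta @ beta) + lam ** 2 + mu ** 2 + b ** 2   # assumed form of R
    return loss + gamma1 * reg


# Toy numbers with N_k = 2 kernels and a single experience (illustrative only).
beta = np.array([0.5, 0.5])
experience = (np.array([0.2, 0.4]), np.eye(2), np.array([2.0, 2.0]), 2.0)
print(inverse_reflection(beta, 0.2, 0.4, 0.1, *experience[:3]))
print(objective(beta, 0.2, 0.4, 0.1, [experience], gamma1=0.1))
```

In this toy case $1/f = 0.3 + 0.1 + 0.2 + 0.1 = 0.7$ against a target of $1/l_e = 0.5$, so the Huber term is $0.5 \cdot 0.2^2 = 0.02$.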
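Minimizing the difference between domains, including the MMD distance $\boldsymbol{\beta}^T \hat{\mathbf{d}}_e$ and the distance variance $\boldsymbol{\beta}^T \hat{\mathbf{Q}}_e \boldsymbol{\beta}$, and meanwhile maximizing the discriminant criterion $\boldsymbol{\beta}^T \boldsymbol{\tau}_e$ in the target domain will contribute to a large performance improvement ratio $l_e$ (i.e., a small $1/l_e$). $\lambda$ and $\mu$ balance the importance of the three terms in $f$, and $b$ is the bias term.

### 3.4. Inferring What to Transfer

Once the L2T agent has learned the reflection function $f(\mathcal{S}, \mathcal{T}, \mathbf{W}; \boldsymbol{\beta}^*, \lambda^*, \mu^*, b^*)$, it takes advantage of the function to optimize what to transfer, i.e., the latent feature factor matrix $\mathbf{W}$, for a newly arrived source domain $\mathcal{S}_{N_e+1}$ and target domain $\mathcal{T}_{N_e+1}$. The optimal latent feature factor matrix $\mathbf{W}^*_{N_e+1}$ should maximize the value of $f$. To this end, we optimize the following objective with regard to $\mathbf{W}$,

$$
\begin{aligned}
\mathbf{W}^*_{N_e+1} &= \arg\max_{\mathbf{W}}\ f(\mathcal{S}_{N_e+1}, \mathcal{T}_{N_e+1}, \mathbf{W}; \boldsymbol{\beta}^*, \lambda^*, \mu^*, b^*) - \gamma_2 \|\mathbf{W}\|^2_F \\
&= \arg\min_{\mathbf{W}}\ (\boldsymbol{\beta}^*)^T \hat{\mathbf{d}}_{\mathbf{W}} + \lambda^* (\boldsymbol{\beta}^*)^T \hat{\mathbf{Q}}_{\mathbf{W}} \boldsymbol{\beta}^* + \mu^* \frac{1}{(\boldsymbol{\beta}^*)^T \boldsymbol{\tau}_{\mathbf{W}}} + \gamma_2 \|\mathbf{W}\|^2_F,
\end{aligned}
\tag{3}
$$

where $\|\cdot\|_F$ denotes the matrix Frobenius norm and $\gamma_2$ controls the complexity of $\mathbf{W}$. The first and second terms in problem (3) can be calculated as

$$
(\boldsymbol{\beta}^*)^T \hat{\mathbf{d}}_{\mathbf{W}} = \sum_{k=1}^{N_k} \beta^*_k \Big[ \frac{1}{a^2} \sum_{i,i'=1}^{a} K_k(\mathbf{v}_i \mathbf{W}, \mathbf{v}_{i'} \mathbf{W}) + \frac{1}{b^2} \sum_{j,j'=1}^{b} K_k(\mathbf{w}_j \mathbf{W}, \mathbf{w}_{j'} \mathbf{W}) - \frac{2}{ab} \sum_{i,j=1}^{a,b} K_k(\mathbf{v}_i \mathbf{W}, \mathbf{w}_j \mathbf{W}) \Big],
$$

$$
(\boldsymbol{\beta}^*)^T \hat{\mathbf{Q}}_{\mathbf{W}} \boldsymbol{\beta}^* = \frac{1}{n^2 - 1} \sum_{i,i'=1}^{n} \Big\{ \sum_{k=1}^{N_k} \beta^*_k \big[ K_k(\mathbf{v}_i \mathbf{W}, \mathbf{v}_{i'} \mathbf{W}) + K_k(\mathbf{w}_i \mathbf{W}, \mathbf{w}_{i'} \mathbf{W}) - 2 K_k(\mathbf{v}_i \mathbf{W}, \mathbf{w}_{i'} \mathbf{W}) \big] - \frac{1}{n^2} \sum_{i,i'=1}^{n} \sum_{k=1}^{N_k} \beta^*_k \big[ K_k(\mathbf{v}_i \mathbf{W}, \mathbf{v}_{i'} \mathbf{W}) + K_k(\mathbf{w}_i \mathbf{W}, \mathbf{w}_{i'} \mathbf{W}) - 2 K_k(\mathbf{v}_i \mathbf{W}, \mathbf{w}_{i'} \mathbf{W}) \big] \Big\}^2,
$$

where the shorthand $\mathbf{v}_i = \mathbf{x}^s_{(N_e+1)i}$, $\mathbf{v}_{i'} = \mathbf{x}^s_{(N_e+1)i'}$, $\mathbf{w}_j = \mathbf{x}^t_{(N_e+1)j}$, $\mathbf{w}_{j'} = \mathbf{x}^t_{(N_e+1)j'}$, $a = n^s_{N_e+1}$, and $b = n^t_{N_e+1}$ is used due to the space limit (here $b$ is a sample size, not the bias term of problem (2)). Note that $n = \min(n^s_{N_e+1}, n^t_{N_e+1})$. The third term in problem (3) can be computed as

$$
(\boldsymbol{\beta}^*)^T \boldsymbol{\tau}_{\mathbf{W}} = \sum_{k=1}^{N_k} \beta^*_k \frac{\operatorname{tr}(\mathbf{W}^T \mathbf{S}^N_k \mathbf{W})}{\operatorname{tr}(\mathbf{W}^T \mathbf{S}^L_k \mathbf{W})}.
$$

We optimize the non-convex problem (3) w.r.t. $\mathbf{W}$ by employing a conjugate gradient method, for which the gradient is listed in the supplementary material.

## 4. Stability and Generalization Bounds

In this section, we theoretically investigate how previous transfer learning experiences influence a transfer learning task of interest.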
In the supplementary material, we also provide and prove the algorithmic stability and generalization bound for latent feature factor based transfer learning algorithms that do not consider experiences.

Consider $\mathcal{S} = \{\langle \mathcal{S}_1, \mathcal{T}_1 \rangle, \dots, \langle \mathcal{S}_{N_e}, \mathcal{T}_{N_e} \rangle\}$ to be $N_e$ transfer learning experiences, or the so-called meta-samples (Maurer, 2005). Let $L(\mathcal{S})$ be our algorithm that learns meta-cognitive knowledge from the $N_e$ transfer learning experiences in $\mathcal{S}$ and applies the knowledge to the $(N_e+1)$-th transfer learning task $\langle \mathcal{S}_{N_e+1}, \mathcal{T}_{N_e+1} \rangle$. To analyse the stability and give the generalization bound, we make an assumption on the distribution from which all $N_e$ transfer learning experiences, as meta-samples, are sampled. For every environment $\mathcal{E}$, all $N_e$ pairs of source and target domains in $\mathcal{S}$ are drawn according to an algebraic $\beta$-mixing stationary distribution $(D_{\mathcal{E}})^{N_e}$, which is not i.i.d. Intuitively, the algebraic $\beta$-mixing stationary distribution (see Definition 2 in (Mohri & Rostamizadeh, 2010)) with the $\beta$-mixing coefficient $\beta(m) \leq \beta_0 / m^r$ models the dependence between future samples and past samples separated by a distance of at least $m$. The independent block technique (Bernstein, 1927) has been widely adopted to deal with non-i.i.d. learning problems. Under this assumption, $L(\mathcal{S})$ is uniformly stable.

**Theorem 1.** Suppose that for any $\mathbf{x}^t_e$ and for any $y^t_e$ we have $\|\mathbf{x}^t_e\|_2 \leq r_x$ and $|y^t_e| \leq B$. Meanwhile, for any $e$-th transfer learning experience, we assume that the latent feature factor matrix satisfies $\|\mathbf{W}_e\| \leq r_W$. To meet the assumption above, we reasonably simplify $L(\mathcal{S})$ so that the latent feature factor matrix for the $(N_e+1)$-th transfer learning task is a linear combination of all $N_e$ historical latent feature factor matrices plus a noisy latent feature matrix $\mathbf{W}_\epsilon$ satisfying $\|\mathbf{W}_\epsilon\| \leq r_\epsilon$, i.e., $\mathbf{W}_{N_e+1} = \sum_{e=1}^{N_e} c_e \mathbf{W}_e + \mathbf{W}_\epsilon$ with each coefficient $0 \leq c_e \leq 1$. Our algorithm $L(\mathcal{S})$ is uniformly stable.
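For any $\langle \mathcal{S}, \mathcal{T} \rangle$ as the coming transfer learning task, the following inequality holds:

$$
\big| l_{emp}(L(\mathcal{S}), \langle \mathcal{S}, \mathcal{T} \rangle) - l_{emp}(L(\mathcal{S}^{e_0}), \langle \mathcal{S}, \mathcal{T} \rangle) \big| \leq \frac{4(4N_e - 3 + r_\epsilon / r_W) B^2 r_x}{\lambda N_e^2} \sim O\Big( \frac{B^2 r_x}{\lambda N_e} \Big),
$$

where $\mathcal{S} = \{\langle \mathcal{S}_1, \mathcal{T}_1 \rangle, \dots, \langle \mathcal{S}_{e_0-1}, \mathcal{T}_{e_0-1} \rangle, \langle \mathcal{S}_{e_0}, \mathcal{T}_{e_0} \rangle, \langle \mathcal{S}_{e_0+1}, \mathcal{T}_{e_0+1} \rangle, \dots, \langle \mathcal{S}_{N_e}, \mathcal{T}_{N_e} \rangle\}$ denotes the full set of meta-samples, and $\mathcal{S}^{e_0} = \{\langle \mathcal{S}_1, \mathcal{T}_1 \rangle, \dots, \langle \mathcal{S}_{e_0-1}, \mathcal{T}_{e_0-1} \rangle, \langle \mathcal{S}'_{e_0}, \mathcal{T}'_{e_0} \rangle, \langle \mathcal{S}_{e_0+1}, \mathcal{T}_{e_0+1} \rangle, \dots, \langle \mathcal{S}_{N_e}, \mathcal{T}_{N_e} \rangle\}$ represents the meta-samples with the $e_0$-th meta-example replaced by $\langle \mathcal{S}'_{e_0}, \mathcal{T}'_{e_0} \rangle$.

By generalizing the sample $S$ to the meta-samples $\mathcal{S}$ and the hypothesis $h_S$ to L2T $L(\mathcal{S})$, we apply Corollary 21 in (Mohri & Rostamizadeh, 2010) to give the generalization bound of our algorithm $L(\mathcal{S})$ in Theorem 2.

**Theorem 2.** Let $\delta' = \delta - (N_e)^{\frac{1}{2(r+1)} - \frac{1}{4}}$ ($r > 1$ is required). Then for any sample $\mathcal{S}$ of size $N_e$ drawn according to an algebraic $\beta$-mixing stationary distribution, and $\delta \geq 0$ such that $\delta' \geq 0$, the following generalization bound holds with probability at least $1 - \delta$:

$$
\big| R(L(\mathcal{S})) - R_{N_e}(L(\mathcal{S})) \big| < O\Big( (N_e)^{\frac{1}{2(r+1)} - \frac{1}{4}} \sqrt{\log(1/\delta')} \Big),
$$

where $R(L(\mathcal{S}))$ and $R_{N_e}(L(\mathcal{S}))$ denote the expected risk and the empirical risk of L2T over meta-samples, respectively. A larger mixing parameter $r$, indicating more independence, leads to a tighter bound.

Theorem 2 tells us that as the number of transfer learning experiences, i.e., $N_e$, increases, L2T tends to produce a tighter generalization bound. This fact lays the foundation for further conducting L2T in an online manner, which can gradually assimilate transfer learning experiences and continuously improve. The detailed proofs of Theorems 1 and 2 can be found in the supplementary material.

## 5. Experiments

**Datasets.** We evaluate the L2T framework on two image datasets, Caltech-256 (Griffin et al., 2007) and Sketches (Eitz et al., 2012). Caltech-256, collected from Google Images, contains a total of 30,607 images in 256 categories. The Sketches dataset, in contrast, consists of 20,000 unique sketches drawn by human beings that are evenly distributed over 250 different categories.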
We construct each pair of source and target domains by randomly sampling three categories from Caltech-256 as the source domain and randomly sampling three categories from Sketches as the target domain; an example is given in the supplementary material. Consequently, there are $20{,}000 / 250 \times 3 = 240$ examples in the target domain of each pair. In total, we generate 1,000 training pairs for preparing transfer learning experiences, 500 validation pairs to determine the hyperparameters of the reflection function, and 500 testing pairs to evaluate the reflection function. We characterize each image from both datasets with 4,096-dimensional features extracted by a convolutional neural network pre-trained on ImageNet.

In this paper we generate transfer learning experiences by ourselves, because we are the first to consider transfer learning experiences and there exist no off-the-shelf datasets. In real-world applications, either the number of labeled examples in a target domain or the transfer learning algorithm could vary from experience to experience. In order to mimic the real environment, we prepare each transfer learning experience by randomly selecting a transfer learning algorithm from a base set $\mathcal{A}$ and randomly setting the number of labeled target examples in the range of [3, 120]. The randomly generated training experiences, lying in the same environment (generated by one dataset), are non-i.i.d., which fits the algebraic $\beta$-mixing assumption made in Section 4.

**Baselines and Evaluation Metrics.** We compare L2T with the following nine baseline algorithms in three classes: Non-transfer: Original builds a model using labeled data in a target domain only. Common latent space based transfer learning algorithms: TCA (Pan et al., 2011), ITL (Shi & Sha, 2012), CMF (Long et al., 2014), LSDT (Zhang et al., 2016), STL (Raina et al., 2007), DIP (Baktashmotlagh et al., 2013) and SIE (Baktashmotlagh et al., 2014).
Manifold ensemble based algorithms: GFK (Gong et al., 2012). The eight feature-based transfer learning algorithms also constitute the base set $\mathcal{A}$. Based on the feature representations obtained by the different algorithms, we use the nearest-neighbor classifier to perform three-class classification for the target domain.

One evaluation metric is the classification accuracy on the testing examples of a target domain. However, accuracies are incomparable across target domains at different levels of difficulty. The other evaluation metric we adopt is therefore the performance improvement ratio defined in Section 3.1, so as to compare L2T over different pairs of domains.

Figure 4. Average performance improvement ratio comparison over 500 testing pairs of source and target domains (x-axis: the number of labeled examples in a target domain, from 3 to 120; y-axis: performance improvement ratio; methods: TCA, ITL, CMF, LSDT, GFK, STL, DIP, SIE, L2T).

**Performance Comparison.** In this experiment, we learn a reflection function from 1,000 transfer learning experiences, and evaluate the reflection function on 500 testing pairs of source and target domains by comparing the average performance improvement ratio to the baselines. In building the reflection function, we use 33 RBF kernels with the bandwidth $\delta_k$ in the range of $[2^{-8}\eta : 2^{0.5}\eta : 2^{8}\eta]$, where $\eta = \frac{1}{n^s_e n^t_e N_e} \sum_{e=1}^{N_e} \sum_{i,j=1}^{n^s_e, n^t_e} \|\mathbf{x}^s_{ei} \mathbf{W} - \mathbf{x}^t_{ej} \mathbf{W}\|^2_2$ follows the median trick (Gretton et al., 2012a). As Figure 4 shows, on average the proposed L2T framework outperforms the baselines by up to 10% when varying the number of labeled samples in the target domain. As the number of labeled target examples increases from 3 to 120, the performance improvement ratio becomes smaller because the accuracy of Original without transfer tends to increase.
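
The bandwidth grid used for the reflection function can be reproduced in a few lines. This is a hedged sketch: the function names are hypothetical, the grid is read as multiplicative steps of $2^{0.5}$ from $2^{-8}\eta$ to $2^{8}\eta$ (which gives exactly 33 values), and `mean_sq_distance` computes the scale for a single source-target pair, whereas the text averages it over all $N_e$ experiences.

```python
import numpy as np


def mean_sq_distance(Zs, Zt):
    """Mean squared Euclidean distance between all source/target latent pairs."""
    sq = (Zs ** 2).sum(1)[:, None] + (Zt ** 2).sum(1)[None, :] - 2.0 * Zs @ Zt.T
    return float(np.maximum(sq, 0.0).mean())


def bandwidth_grid(eta):
    """Bandwidths 2^{-8} eta, 2^{-7.5} eta, ..., 2^{8} eta (33 values)."""
    exponents = np.linspace(-8.0, 8.0, 33)   # step of 0.5 in the exponent
    return (2.0 ** exponents) * eta


# Synthetic latent representations of one pair of domains (illustrative only).
rng = np.random.default_rng(0)
Zs, Zt = rng.normal(size=(30, 2)), rng.normal(size=(20, 2))
eta = mean_sq_distance(Zs, Zt)
deltas = bandwidth_grid(eta)
print(len(deltas), deltas[0] / eta, deltas[-1] / eta)
```

Each bandwidth in the grid defines one kernel $K_k$, so the coefficient vector $\boldsymbol{\beta}$ has 33 entries in this setting.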
Figure 3. Classification accuracies on six pairs of source and target domains (x-axis: the number of labeled examples in a target domain; y-axis: classification accuracy; methods: Original, TCA, ITL, CMF, LSDT, GFK, STL, DIP, SIE, L2T). Each panel lists the source categories followed by the target categories: (a) galaxy / harpsichord / saturn → kangaroo / standing-bird / sun; (b) bat / mountain-bike / saddle → bush / person / walkie-talkie; (c) microwave / spider / watch → spoon / trumpet / wheel; (d) bridge / harp / traffic-light → door-handle / hand / present; (e) bridge / helicopter / tripod → key / parrot / traffic-light; (f) calculator / straw / french-horn → doorknob / palm-tree / scissors.

The baseline algorithms behave differently. The transferable knowledge learned by LSDT helps a target domain considerably when training examples are scarce, while GFK performs poorly until more training examples become available. STL is almost the worst baseline because it learns a dictionary from the source domain only and ignores the target domain. It runs a high risk of failure, especially when two domains are distant. DIP and SIE, which minimize the MMD and the Hellinger distance between domains, respectively, subject to manifold constraints, are competitive.
Note that we have run the paired t-test between L2T and each baseline, with all the p-values on the order of $10^{-12}$, from which we conclude that L2T is significantly superior. We also randomly select six of the 500 testing pairs and compare the classification accuracies of the different algorithms for each pair in Figure 3. The performance of all baselines varies from pair to pair. Among the baseline methods, TCA performs best when transferring between the domains in Figure 3a, and LSDT is the best in Figure 3c. However, L2T consistently outperforms the baselines in all the settings. For some pairs, e.g., Figures 3a, 3c and 3f, the three classes in the target domain are comparatively easy to tell apart, hence Original (without transfer) can achieve even better results than some transfer learning algorithms. In this case, L2T still improves by discovering the best transferable knowledge from the source domain, especially when the number of labeled examples is small (see Figures 3c and 3f). If two domains are closely related, e.g., the source with galaxy and saturn and the target with sun in Figure 3a, L2T discovers even more transferable knowledge and yields a more significant improvement.

Varying the Experiences
We further investigate how the transfer learning experiences used to learn the reflection function influence the performance of L2T. In this experiment, we evaluate on 50 pairs randomly sampled from the 500 testing pairs, so that a wide range of cases can be investigated efficiently. The sampled set is unbiased and sufficient to characterize such influence, as evidenced by the asymptotic consistency between the average performance improvement ratio on the 500 pairs in Figure 4 and that on the 50 pairs in the last row of Table 1. First, we fix the number of transfer learning experiences to be 1,000 and vary the set of base transfer learning algorithms. The results are shown in Table 1.
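The comparisons drawn from Table 1 can be checked with a short script. The sketch below copies the ratios of the eight base algorithms and three of the L2T rows from Table 1; the helper names (`best_baseline`, `margins`) are illustrative and not part of the paper.

```python
from typing import Dict, List, Tuple

# Performance improvement ratios copied from Table 1 (50 sampled testing pairs).
LABELED = [3, 15, 30, 45, 60, 75, 90, 105, 120]

BASELINES: Dict[str, List[float]] = {
    "TCA":  [1.0181, 1.0024, 0.9965, 0.9973, 0.9941, 0.9933, 0.9938, 0.9927, 0.9928],
    "ITL":  [1.0188, 1.0248, 1.0250, 1.0254, 1.0250, 1.0224, 1.0232, 1.0224, 1.0224],
    "CMF":  [0.9607, 1.0203, 1.0224, 1.0218, 1.0190, 1.0158, 1.0144, 1.0142, 1.0125],
    "LSDT": [1.0828, 1.0168, 0.9988, 0.9940, 0.9895, 0.9867, 0.9854, 0.9834, 0.9837],
    "GFK":  [0.9729, 1.0180, 1.0232, 1.0243, 1.0246, 1.0219, 1.0239, 1.0229, 1.0225],
    "STL":  [0.9973, 0.9771, 0.9715, 0.9713, 0.9715, 0.9694, 0.9705, 0.9693, 0.9693],
    "DIP":  [1.0875, 1.0633, 1.0518, 1.0465, 1.0425, 1.0372, 1.0365, 1.0343, 1.0317],
    "SIE":  [1.0745, 1.0579, 1.0485, 1.0448, 1.0412, 1.0359, 1.0359, 1.0334, 1.0318],
}

L2T_VARIANTS: Dict[str, List[float]] = {
    "ITL + L2T": [1.1210, 1.0737, 1.0577, 1.0506, 1.0456, 1.0398, 1.0394, 1.0361, 1.0359],
    "DIP + L2T": [1.1605, 1.0927, 1.0718, 1.0620, 1.0562, 1.0500, 1.0483, 1.0461, 1.0451],
    "all + L2T": [1.1872, 1.1054, 1.0795, 1.0699, 1.0616, 1.0551, 1.0531, 1.0500, 1.0502],
}


def best_baseline(column: int) -> Tuple[str, float]:
    """Return (name, ratio) of the strongest base algorithm in one column."""
    name = max(BASELINES, key=lambda n: BASELINES[n][column])
    return name, BASELINES[name][column]


def margins(l2t_row: List[float], reference_row: List[float]) -> List[float]:
    """Column-wise difference between an L2T row and a reference row."""
    return [round(a - b, 4) for a, b in zip(l2t_row, reference_row)]


if __name__ == "__main__":
    for col, n in enumerate(LABELED):
        name, ratio = best_baseline(col)
        gain = L2T_VARIANTS["all + L2T"][col] - ratio
        print(f"{n:>3} labeled: best baseline {name} ({ratio:.4f}), "
              f"all + L2T gain {gain:+.4f}")
```

On these numbers, DIP is the strongest base algorithm in every column except the last, where SIE is marginally ahead, and all + L2T exceeds the best base algorithm in every column, with the largest gap (about 0.10) at 3 labeled examples.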
Even with experiences generated by a single base algorithm, e.g., ITL or DIP, L2T can still learn a reflection function that decides what to transfer significantly better (p-value < 0.05) than using ITL or DIP directly. With more base algorithms involved, the transfer learning experiences are more diverse and cover more situations of source-target pairs and the knowledge transferred between them. As a result, L2T learns a better reflection function and thereby achieves higher performance improvement ratios, which is consistent with Theorem 2, where a larger r, indicating more independence between experiences, gives a tighter bound. Second, we fix the set of base algorithms to include all eight baselines and vary the number of transfer learning experiences used for training. As shown in Figure 5, the average performance improvement ratio achieved by L2T tends to increase as the number of labeled examples in the target domain decreases, given that Original (without transfer) performs extremely poorly with scarce labeled examples.

Table 1. The performance improvement ratios obtained by varying the approaches used to generate transfer learning experiences. For example, ITL + L2T denotes L2T learning from experiences generated by ITL only, and the second row of results for ITL + L2T is the p-value compared to ITL.
| # of labeled examples | 3 | 15 | 30 | 45 | 60 | 75 | 90 | 105 | 120 |
|---|---|---|---|---|---|---|---|---|---|
| TCA | 1.0181 | 1.0024 | 0.9965 | 0.9973 | 0.9941 | 0.9933 | 0.9938 | 0.9927 | 0.9928 |
| ITL | 1.0188 | 1.0248 | 1.0250 | 1.0254 | 1.0250 | 1.0224 | 1.0232 | 1.0224 | 1.0224 |
| CMF | 0.9607 | 1.0203 | 1.0224 | 1.0218 | 1.0190 | 1.0158 | 1.0144 | 1.0142 | 1.0125 |
| LSDT | 1.0828 | 1.0168 | 0.9988 | 0.9940 | 0.9895 | 0.9867 | 0.9854 | 0.9834 | 0.9837 |
| GFK | 0.9729 | 1.0180 | 1.0232 | 1.0243 | 1.0246 | 1.0219 | 1.0239 | 1.0229 | 1.0225 |
| STL | 0.9973 | 0.9771 | 0.9715 | 0.9713 | 0.9715 | 0.9694 | 0.9705 | 0.9693 | 0.9693 |
| DIP | 1.0875 | 1.0633 | 1.0518 | 1.0465 | 1.0425 | 1.0372 | 1.0365 | 1.0343 | 1.0317 |
| SIE | 1.0745 | 1.0579 | 1.0485 | 1.0448 | 1.0412 | 1.0359 | 1.0359 | 1.0334 | 1.0318 |
| ITL + L2T | 1.1210 | 1.0737 | 1.0577 | 1.0506 | 1.0456 | 1.0398 | 1.0394 | 1.0361 | 1.0359 |
| (p-value) | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0002 | 0.0002 |
| DIP + L2T | 1.1605 | 1.0927 | 1.0718 | 1.0620 | 1.0562 | 1.0500 | 1.0483 | 1.0461 | 1.0451 |
| (p-value) | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| (LSDT/GFK/SIE) + L2T | 1.1660 | 1.0973 | 1.0746 | 1.0652 | 1.0573 | 1.0506 | 1.0485 | 1.0451 | 1.0429 |
| (p-value) | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| (TCA/ITL/CMF/GFK/LSDT/SIE) + L2T | 1.1712 | 1.0954 | 1.0707 | 1.0607 | 1.0529 | 1.0469 | 1.0449 | 1.0421 | 1.0416 |
| (p-value) | 0.0000 | 0.0000 | 0.0001 | 0.0001 | 0.0106 | 0.0019 | 0.0002 | 0.0047 | 0.0106 |
| all + L2T | 1.1872 | 1.1054 | 1.0795 | 1.0699 | 1.0616 | 1.0551 | 1.0531 | 1.0500 | 1.0502 |

[Figure 5. Varying the number of transfer learning experiences. Axes: the number of experiences; the number of labeled examples in a target domain.]

[Figure 6. Varying the components constituted in the f. Legend: MMD, Variance, Discriminant, MMD+Variance, MMD+Discriminant, Discriminant+Variance, L2T.]

[Figure 7. Varying the number of kernels considered in the f. Axes: different kernels; the number of labeled examples in a target domain.]

More importantly, the average performance improvement ratio increases as the number of experiences increases, which is consistent with Theorem 2.

Varying the Reflection Function
We also study the influence of different configurations of the reflection function on the performance of L2T.
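Since the MMD between domains is one of the components varied in this study, it may help to recall what the quantity measures. The following is a minimal sketch of the standard biased empirical estimate of the squared MMD under an RBF kernel. It is not the authors' implementation: the kernel form exp(-||x - y||^2 / bandwidth) is an assumption, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def rbf(x: Vector, y: Vector, bandwidth: float) -> float:
    """RBF kernel; the exp(-||x - y||^2 / bandwidth) form is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / bandwidth)


def mean_kernel(xs: List[Vector], ys: List[Vector], bandwidth: float) -> float:
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(source: List[Vector], target: List[Vector],
                bandwidth: float) -> float:
    """Biased empirical estimate of the squared MMD between two samples."""
    return (mean_kernel(source, source, bandwidth)
            + mean_kernel(target, target, bandwidth)
            - 2.0 * mean_kernel(source, target, bandwidth))


def mmd_per_bandwidth(source: List[Vector], target: List[Vector],
                      bandwidths: List[float]) -> List[float]:
    """One squared-MMD value per kernel bandwidth."""
    return [mmd_squared(source, target, b) for b in bandwidths]
```

Evaluating `mmd_per_bandwidth` over a grid of bandwidths gives one distance per kernel; how those per-kernel values are weighted inside the reflection function is specified by the paper's method section and is not reproduced here.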
First, we vary the components considered in building the reflection function f, as shown in Figure 6. Considering a single type of component, whether MMD, variance, or the discriminant criterion, yields inferior performance and even negative transfer. L2T, which takes all three factors into consideration, outperforms the others, demonstrating that the three components are all necessary and mutually reinforcing. With all three components included, we plot the values of the learned β in the supplementary material. Second, we change the kernels used. In Figure 7, we present results obtained by either narrowing or extending the range $[2^{-8}\eta : 2^{0.5}\eta : 2^{8}\eta]$. More kernels (e.g., $[2^{-12}\eta : 2^{0.5}\eta : 2^{12}\eta]$), which are capable of encoding better transfer learning skills in the reflection function, achieve larger performance improvement ratios.

6. Conclusion
In this paper, we propose a novel L2T framework for transfer learning which automatically optimizes what and how to transfer between a source and a target domain by leveraging previous transfer learning experiences. In particular, L2T learns a reflection function mapping a pair of domains and the knowledge transferred between them to the performance improvement ratio. When a new pair of domains arrives, L2T optimizes what and how to transfer by maximizing the value of the learned reflection function. We believe that L2T opens a new door to improving transfer learning by leveraging transfer learning experiences. Many research issues, e.g., incorporating hierarchical latent feature factors as what to transfer and designing online L2T, can be further examined.

Acknowledgements
We thank the reviewers for their valuable comments to improve this paper. The research has been supported by National Grant Fundamental Research (973 Program) of China under Project 2014CB340304, Hong Kong CERG projects 16211214/16209715/16244616, Hong Kong ITF ITS/391/15FX and NSFC 61673202.
References
Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I., and Abbeel, P. Continuous adaptation via meta-learning in nonstationary and competitive environments. In ICLR, 2018.
Argyriou, A., Evgeniou, T., and Pontil, M. Multi-task feature learning. In NIPS, pp. 41-48, 2007.
Baktashmotlagh, M., Harandi, M. T., Lovell, B. C., and Salzmann, M. Unsupervised domain adaptation by domain invariant projection. In ICCV, pp. 769-776, 2013.
Baktashmotlagh, M., Harandi, M. T., Lovell, B. C., and Salzmann, M. Domain adaptation on the statistical manifold. In CVPR, pp. 2481-2488, 2014.
Belmont, J. M., Butterfield, E. C., Ferretti, R. P., et al. To secure transfer of training instruct self-management skills. In Detterman, D. K. and Sternberg, R. J. P. (eds.), How and How Much Can Intelligence be Increased, pp. 147-154. Ablex Norwood, NJ, 1982.
Bernstein, S. Sur l'extension du théorème limite du calcul des probabilités aux sommes de quantités dépendantes. Mathematische Annalen, 97(1):1-59, 1927.
Blitzer, J., McDonald, R., and Pereira, F. Domain adaptation with structural correspondence learning. In EMNLP, pp. 120-128, 2006.
Cao, Q., Ying, Y., and Li, P. Similarity metric learning for face recognition. In ICCV, pp. 2408-2415, 2013.
Caruana, R. Multitask learning. Machine Learning, 28:41-75, 1997.
Dai, W., Yang, Q., Xue, G.-R., and Yu, Y. Boosting for transfer learning. In ICML, pp. 193-200, 2007.
Eitz, M., Hays, J., and Alexa, M. How do humans sketch objects? ACM Trans. Graph. (Proc. SIGGRAPH), 31(4):44:1-44:10, 2012.
Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pp. 1126-1135, 2017.
Gong, B., Shi, Y., Sha, F., and Grauman, K. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pp. 2066-2073, 2012.
Gopalan, R., Li, R., and Chellappa, R. Domain adaptation for object recognition: An unsupervised approach. In ICCV, pp. 999-1006, 2011.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. JMLR, 13(Mar):723-773, 2012a.
Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., and Sriperumbudur, B. K. Optimal kernel choice for large-scale two-sample tests. In NIPS, pp. 1205-1213, 2012b.
Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. 2007.
Huber, P. J. et al. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73-101, 1964.
Long, M., Wang, J., Ding, G., Shen, D., and Yang, Q. Transfer learning with graph co-regularization. TKDE, 26(7):1805-1818, 2014.
Long, M., Cao, Y., Wang, J., and Jordan, M. Learning transferable features with deep adaptation networks. In ICML, pp. 97-105, 2015.
Luria, A. R. Cognitive Development: Its Cultural and Social Foundations. Harvard University Press, 1976.
Maurer, A. Algorithmic stability and meta-learning. JMLR, 6(Jun):967-994, 2005.
Mo, K., Li, S., Zhang, Y., Li, J., and Yang, Q. Personalizing a dialogue system with transfer learning. arXiv preprint arXiv:1610.02891, 2016.
Mohri, M. and Rostamizadeh, A. Stability bounds for stationary ϕ-mixing and β-mixing processes. JMLR, 11(Feb):789-814, 2010.
Pan, S. J. and Yang, Q. A survey on transfer learning. TKDE, 22(10):1345-1359, 2010.
Pan, S. J., Tsang, I. W., Kwok, J. T., and Yang, Q. Domain adaptation via transfer component analysis. TNN, 22(2):199-210, 2011.
Pentina, A. and Lampert, C. H. Lifelong learning with non-iid tasks. In NIPS, pp. 1540-1548, 2015.
Raina, R., Battle, A., Lee, H., Packer, B., and Ng, A. Y. Self-taught learning: transfer learning from unlabeled data. In ICML, pp. 759-766, 2007.
Ruvolo, P. and Eaton, E. ELLA: An efficient lifelong learning algorithm. In ICML, pp. 507-515, 2013.
Shi, Y. and Sha, F. Information-theoretical learning of discriminative clusters for unsupervised domain adaptation. In ICML, pp. 1079-1086, 2012.
Thrun, S. and Pratt, L. Learning to learn. Springer Science & Business Media, 1998.
Tommasi, T., Orabona, F., and Caputo, B. Learning categories from few examples with multi model knowledge transfer. TPAMI, 36(5):928-941, 2014.
Tzeng, E., Hoffman, J., Darrell, T., and Saenko, K. Simultaneous deep transfer across domains and tasks. In ICCV, pp. 4068-4076, 2015.
Wei, Y., Zheng, Y., and Yang, Q. Transfer knowledge between cities. In KDD, pp. 1905-1914, 2016.
Yang, J., Yan, R., and Hauptmann, A. G. Adapting SVM classifiers to data with shifted distributions. In ICDM, pp. 69-76, 2007a.
Yang, J., Zhang, D., Yang, J.-y., and Niu, B. Globally maximizing, locally minimizing: unsupervised discriminant projection with applications to face and palm biometrics. TPAMI, 29(4), 2007b.
Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. How transferable are features in deep neural networks? In NIPS, pp. 3320-3328, 2014.
Zhang, L., Zuo, W., and Zhang, D. LSDT: Latent sparse domain transfer learning for visual adaptation. TIP, 25(3):1177-1191, 2016.