# Learning What and Where to Transfer

Yunhun Jang*¹², Hankook Lee*¹, Sung Ju Hwang³⁴⁵, Jinwoo Shin¹⁴⁵

As the application of deep learning has expanded to real-world problems with insufficient volumes of training data, transfer learning has recently gained much attention as a means of improving performance in such small-data regimes. However, when existing methods are applied between heterogeneous architectures and tasks, it becomes important to manage their detailed configurations, which often require exhaustive tuning to reach the desired performance. To address this issue, we propose a novel transfer learning approach based on meta-learning that can automatically learn what knowledge to transfer from the source network and where to transfer it in the target network. Given source and target networks, we propose an efficient training scheme to learn meta-networks that decide (a) which pairs of layers between the source and target networks should be matched for knowledge transfer and (b) which features, and how much knowledge from each feature, should be transferred. We validate our meta-transfer approach against recent transfer learning methods on various datasets and network architectures, on which our automated scheme significantly outperforms prior baselines that decide what and where to transfer in a hand-crafted manner.

*Equal contribution. ¹School of Electrical Engineering, KAIST, Korea; ²OMNIOUS, Korea; ³School of Computing, KAIST, Korea; ⁴Graduate School of AI, KAIST, Korea; ⁵AITRICS, Korea. Correspondence to: Jinwoo Shin. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Figure 1. Top: prior approaches, where knowledge is transferred between hand-picked pairs of layers, without considering the importance of channels. Bottom: our meta-transfer method, where the meta-networks f and g automatically decide the amounts of knowledge transferred between layers of the two networks and the importance of channels during transfer. Line width indicates the amount of transfer for each pair of layers and each channel.

## 1. Introduction

Learning deep neural networks (DNNs) requires large datasets, but it is expensive to collect a sufficient amount of labeled samples for each target task. A popular approach for handling such lack of data is transfer learning (Pan & Yang, 2010), whose goal is to transfer knowledge from a known source task to a new target task. The most widely used method for transfer learning is pre-training with fine-tuning (Razavian et al., 2014): first train a source DNN (e.g., ResNet (He et al., 2016)) on a large dataset (e.g., ImageNet (Deng et al., 2009)), and then use the learned weights as an initialization for training a target DNN. Yet fine-tuning is not a panacea. If the source and target tasks are semantically distant, it may provide no benefit. Cui et al. (2018) suggest sampling from the source dataset depending on the target task for pre-training, but this is only possible when the source dataset is available. There is also no straightforward way to use fine-tuning if the network architectures for the source and target tasks differ largely. Several existing works can be applied to this challenging scenario of knowledge transfer between heterogeneous DNNs and tasks. Learning without forgetting (LwF) (Li & Hoiem, 2018) applies knowledge distillation, suggested in Hinton et al. (2015), to transfer learning by introducing an additional output layer on the target model, and thus can handle situations where the source and target tasks are different.
FitNet (Romero et al., 2015) proposes a teacher-student training scheme for transferring the knowledge from a wider teacher network to a thinner student network, using the teacher's feature maps to guide the learning of the student. To guide the student network, FitNet uses an ℓ2 matching loss between the source and target features. Attention transfer (Zagoruyko & Komodakis, 2017) and Jacobian matching (Srinivas & Fleuret, 2018) take approaches similar to FitNet, but transfer the source knowledge using attention maps generated from feature maps, or Jacobians. Our motivation is that these methods, while allowing knowledge transfer between heterogeneous source and target tasks/architectures, have no mechanism to identify which source information to transfer, or between which layers of the networks. Some source information is more important than other information, while some is irrelevant or even harmful depending on the task difference. For example, since network layers generate representations at different levels of abstraction (Zeiler & Fergus, 2014), the information in lower layers might be more useful when the input domains of the tasks are similar but the actual tasks differ (e.g., fine-grained image classification tasks). Furthermore, under heterogeneous network architectures, it is not straightforward to associate a layer of the source network with one of the target network. Since there has been no mechanism to learn what to transfer to where, existing approaches require a careful manual configuration of layer associations between the source and target networks depending on the tasks, which cannot be optimal.

Contribution. To tackle this problem, we propose a novel transfer learning method based on the concept of meta-learning (Naik & Mammone, 1992; Thrun & Pratt, 2012) that learns what information to transfer, and to where, from source networks to target networks with heterogeneous architectures and tasks.
Our goal is learning to learn transfer rules that perform knowledge transfer automatically, considering the differences in architectures and tasks between source and target, without hand-crafted tuning of transfer configurations. Specifically, we learn meta-networks that generate the weights for each feature and for each pair of source and target layers, jointly with the target network. The method can thus automatically learn to identify which source network knowledge is useful, and where it should be transferred (see Figure 1). We validate our method, learning to transfer what and where (L2T-ww), on multiple source and target task combinations between heterogeneous DNN architectures, and obtain significant improvements over existing transfer learning methods. Our contributions are as follows:

- We introduce meta-networks for transfer learning that automatically decide which feature maps (channels) of a source model are useful and relevant for learning a target task, and which source layers should be transferred to which target layers.
- To learn the parameters of the meta-networks, we propose an efficient meta-learning scheme. Our main novelty is to evaluate the one-step adaptation performance (meta-objective) of a target model learned by minimizing the transfer objective only (as an inner-objective). This scheme significantly accelerates the inner-loop procedure compared to the standard scheme.
- The proposed method achieves significant improvements over baseline transfer learning methods in our experiments. For example, in the ImageNet experiment, our meta-transfer learning method achieves 65.05% accuracy on CUB200, while the second best baseline obtains 58.90%. In particular, our method outperforms the baselines by a large margin when the target task has an insufficient number of training samples and when transferring from multiple source models.

Organization. The rest of the paper is organized as follows.
In Section 2, we describe our method for selective knowledge transfer and the training scheme for learning the proposed meta-networks. Section 3 presents our experimental results under various settings, and Section 4 concludes.

## 2. Learning What and Where to Transfer

Our goal is to learn to transfer useful knowledge from the source network to the target network, without requiring manual layer association or feature selection. To this end, we propose a meta-learning method that learns what knowledge of the source network to transfer to which layer in the target network. In this paper, we primarily focus on transfer learning between convolutional neural networks, but our method is generic and applicable to other types of deep neural networks as well. In Section 2.1, we describe meta-networks that learn what to transfer (selectively transferring only the useful channels/features to the target model) and where to transfer (deciding a layer-matching configuration that encourages learning the target task). Section 2.2 presents how to train the proposed meta-networks jointly with the target network.

### 2.1. Weighted Feature Matching

If a convolutional neural network is well-trained on a task, its intermediate feature spaces should contain useful knowledge for the task, so mimicking the well-trained features can be helpful for training another network. To formalize a loss enforcing this effect, let x be an input and y be the corresponding (ground-truth) output; for image classification tasks, {x} are images and {y} their class labels. Let $S^m(x)$ be the intermediate feature maps of the mth layer of the pre-trained source network S. Our goal is then to train another target network $T_\theta$ with parameters θ utilizing the knowledge of S. Let $T^n_\theta(x)$ be the intermediate feature maps of the nth layer of the target network. Then, we minimize the following ℓ2 objective, similar to that used in FitNet (Romero et al., 2015), to transfer the knowledge from $S^m(x)$ to $T^n_\theta(x)$:

$$\left\| r_\theta(T^n_\theta(x)) - S^m(x) \right\|_2^2,$$

where $r_\theta$ is a linear transformation parameterized by θ, such as a pointwise convolution. We refer to this method as feature matching. Here, the parameter θ consists of both the parameters of the linear transformation $r_\theta$ and of the non-linear neural network $T_\theta$, where the former is only necessary for training the latter and is not required at test time.

Figure 2. Our meta-transfer learning method for selective knowledge transfer. The meta-networks are parameterized by φ and are learned via meta-learning. Dashed lines indicate flows of tensors such as feature maps, and solid lines denote ℓ2 feature matching. (a) Where to transfer: $g^{m,n}_\phi$ outputs the weights $\lambda^{m,n}$ of matching pairs between the mth and nth layers of the source and target models, respectively. (b) What to transfer: $f^{m,n}_\phi$ outputs weights for each channel.

What to transfer. In general transfer learning settings, the target model is trained on a task that differs from that of the source model. In this case, not all the intermediate features of the source model may be useful for learning the target task. Thus, to give more attention to the useful channels, we consider a weighted feature matching loss that can emphasize channels according to their utility on the target task:

$$\mathcal{L}^{m,n}_{\mathrm{wfm}}(\theta\,|\,x, w^{m,n}) = \sum_{c} \frac{w^{m,n}_c}{HW} \sum_{i,j} \left( r_\theta(T^n_\theta(x))_{c,i,j} - S^m(x)_{c,i,j} \right)^2,$$

where H × W is the spatial size of $S^m(x)$ and $r_\theta(T^n_\theta(x))$, the inner summation is over i ∈ {1, 2, …, H} and j ∈ {1, 2, …, W}, and $w^{m,n}_c$ is the non-negative weight of channel c with $\sum_c w^{m,n}_c = 1$. Since the channels important to transfer can vary for each input image, we set the channel weights as a function, $w^{m,n} = [w^{m,n}_c] = f^{m,n}_\phi(S^m(x))$, by taking the softmax output of a small meta-network that takes features of the source model as input. We let φ denote the parameters of meta-networks throughout this paper.
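As a concrete reference, the weighted matching loss $\mathcal{L}^{m,n}_{\mathrm{wfm}}$ for a single input can be sketched in NumPy as follows. This is an illustrative re-implementation (the function and variable names are ours, not from the authors' code), and it assumes the linear transform $r_\theta$ has already been applied to the target feature map:

```python
import numpy as np

def weighted_feature_matching_loss(target_feat, source_feat, channel_weights):
    """Weighted l2 feature-matching loss between two (C, H, W) feature maps.

    target_feat:     r_theta(T^n_theta(x)) after the pointwise transform
    source_feat:     S^m(x), same shape as target_feat
    channel_weights: w^{m,n}, non-negative and summing to 1, shape (C,)
    """
    _, H, W = source_feat.shape
    # squared difference per channel, averaged over spatial positions
    per_channel = ((target_feat - source_feat) ** 2).sum(axis=(1, 2)) / (H * W)
    # channels with larger w^{m,n}_c contribute more to the loss
    return float((channel_weights * per_channel).sum())
```

With uniform weights this reduces to unweighted feature matching à la FitNet (up to a constant factor); non-uniform weights emphasize the channels the meta-network deems useful for the target task.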
Where to transfer. When transferring knowledge from a source model to a target model, deciding which pairs (m, n) of source and target layers to match is crucial for effectiveness. Previous approaches (Romero et al., 2015; Zagoruyko & Komodakis, 2017) select the pairs manually, based on prior knowledge of the architectures or semantic similarities between the tasks. For example, attention transfer (Zagoruyko & Komodakis, 2017) matches the last feature maps of each group of residual blocks in ResNet (He et al., 2016). However, finding the optimal layer association is not a trivial problem and requires exhaustive trial-and-error tuning when the models have different numbers of layers or heterogeneous architectures, e.g., ResNet (He et al., 2016) and VGG (Simonyan & Zisserman, 2015). Hence, we introduce a learnable parameter $\lambda^{m,n} \geq 0$ for each pair (m, n), which decides the amount of transfer between the mth and nth layers of the source and target models, respectively. We also set $\lambda^{m,n} = g^{m,n}_\phi(S^m(x))$ for each pair (m, n) as the output of a meta-network $g^{m,n}$ that automatically decides which pairs of layers are important for learning the target task. The combined transfer loss, given the channel weights w and matching-pair weights λ, is

$$\mathcal{L}_{\mathrm{wfm}}(\theta\,|\,x, \phi) = \sum_{(m,n) \in \mathcal{C}} \lambda^{m,n} \, \mathcal{L}^{m,n}_{\mathrm{wfm}}(\theta\,|\,x, w^{m,n}),$$

where $\mathcal{C}$ is the set of candidate pairs. Our final loss $\mathcal{L}_{\mathrm{total}}$ for training the target model is then given as

$$\mathcal{L}_{\mathrm{total}}(\theta\,|\,x, y, \phi) = \mathcal{L}_{\mathrm{org}}(\theta\,|\,x, y) + \beta \mathcal{L}_{\mathrm{wfm}}(\theta\,|\,x, \phi),$$

where $\mathcal{L}_{\mathrm{org}}$ is the original loss (e.g., cross-entropy) and β > 0 is a hyper-parameter. We note that $w^{m,n}$ and $\lambda^{m,n}$ decide what and where to transfer, respectively. We provide an illustration of our transfer learning scheme in Figure 2.

### 2.2. Training Meta-Networks and the Target Model

Our goal is to achieve high performance on the target task when the target model is learned using the training objective $\mathcal{L}_{\mathrm{total}}(\cdot\,|\,x, y, \phi)$.
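To fix ideas, the composition of this objective from per-pair losses can be sketched as below; a minimal illustration with hypothetical names, assuming the per-pair losses $\mathcal{L}^{m,n}_{\mathrm{wfm}}$ and the weights $\lambda^{m,n}$ have already been computed:

```python
def total_transfer_loss(task_loss, pair_losses, lambdas, beta=0.5):
    """L_total = L_org + beta * sum over pairs of lambda^{m,n} * L^{m,n}_wfm.

    task_loss:   L_org(theta | x, y), e.g. cross entropy, as a float
    pair_losses: dict {(m, n): L^{m,n}_wfm} over the candidate set C
    lambdas:     dict {(m, n): lambda^{m,n} >= 0} from the meta-networks g
    beta:        transfer-loss weight (0.5 here is an assumed default,
                 not a setting taken from the paper)
    """
    transfer = sum(lambdas[pair] * loss for pair, loss in pair_losses.items())
    return task_loss + beta * transfer
```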
To maximize this performance, the feature matching term $\mathcal{L}_{\mathrm{wfm}}(\cdot\,|\,x, \phi)$ should encourage learning of features useful for the target task, e.g., predicting labels. To measure and increase the usefulness of the feature matching decided by the meta-networks parameterized by φ, a standard approach is the following bilevel scheme (Colson et al., 2007) for training φ, e.g., see (Finn et al., 2017; Franceschi et al., 2018):

1. Update θ to minimize $\mathcal{L}_{\mathrm{total}}(\theta\,|\,x, y, \phi)$ for T times.
2. Measure $\mathcal{L}_{\mathrm{org}}(\theta\,|\,x, y)$ and update φ to minimize it.

In the above, the actual objective $\mathcal{L}_{\mathrm{total}}$ for learning the target model is used in the inner loop, and the original loss $\mathcal{L}_{\mathrm{org}}$ is used as a meta-objective to measure how effective $\mathcal{L}_{\mathrm{total}}$ is for learning a well-performing target model. However, since our meta-networks affect the learning procedure of the target model only weakly, through the regularization term $\mathcal{L}_{\mathrm{wfm}}$, their influence on $\mathcal{L}_{\mathrm{org}}$ can be very marginal unless one uses a very large number of inner-loop iterations T. Consequently, updating φ using the gradient $\nabla_\phi \mathcal{L}_{\mathrm{org}}$ becomes difficult. To tackle this challenge, we propose the following alternative scheme:

1. Update θ to minimize $\mathcal{L}_{\mathrm{wfm}}(\theta\,|\,x, \phi)$ for T times.
2. Update θ to minimize $\mathcal{L}_{\mathrm{org}}(\theta\,|\,x, y)$ once.
3. Measure $\mathcal{L}_{\mathrm{org}}(\theta\,|\,x, y)$ and update φ to minimize it.

In the first stage, given the current parameters $\theta_0 = \theta$, we update the target model T times via gradient-based algorithms minimizing $\mathcal{L}_{\mathrm{wfm}}$; the resulting parameters $\theta_T$ are thus learned using only the knowledge of the source model. Since transfer takes the form of feature matching, it is feasible to learn features useful for the target task by selectively mimicking the source features. More importantly, this increases the influence of the regularization term $\mathcal{L}_{\mathrm{wfm}}$ on the learning procedure of the target model in the inner loop, since the target features are trained solely by the source knowledge (without target labels).
The second stage is a one-step adaptation $\theta_{T+1}$ from $\theta_T$ toward the target labels. In the third stage, the task-specific objective $\mathcal{L}_{\mathrm{org}}(\theta_{T+1})$ then measures how quickly the target model has adapted to the target task (via only one step from $\theta_T$), on the sample used in the first and second stages. Finally, the meta-parameters φ can be trained by minimizing $\mathcal{L}_{\mathrm{org}}(\theta_{T+1})$. This 3-stage scheme enables significantly faster training of φ than the standard 2-stage one, because it ties the effect of the regularization term $\mathcal{L}_{\mathrm{wfm}}$ more directly to the original loss $\mathcal{L}_{\mathrm{org}}$, and allows choosing a small T while still updating φ meaningfully (we choose T = 2 in our experiments).

Algorithm 1: Learning θ with meta-parameters φ

    Input: dataset D_train = {(x_i, y_i)}, learning rate α
    repeat
        Sample a batch B ⊂ D_train with |B| = B
        Update θ to minimize (1/B) Σ_{(x,y)∈B} L_total(θ | x, y, φ)
        Initialize θ_0 ← θ
        for t = 0 to T − 1 do
            θ_{t+1} ← θ_t − α ∇_θ (1/B) Σ_{(x,y)∈B} L_wfm(θ_t | x, φ)
        end for
        θ_{T+1} ← θ_T − α ∇_θ (1/B) Σ_{(x,y)∈B} L_org(θ_T | x, y)
        Update φ using ∇_φ (1/B) Σ_{(x,y)∈B} L_org(θ_{T+1} | x, y)
    until done

When using the vanilla gradient descent algorithm for the updates, the 3-stage training scheme for learning the meta-parameters φ can be written formally as the following optimization problem:

$$\begin{aligned} \underset{\phi}{\text{minimize}} \quad & \mathcal{L}_{\mathrm{org}}(\theta_{T+1}\,|\,x, y) \\ \text{subject to} \quad & \theta_{T+1} = \theta_T - \alpha \nabla_\theta \mathcal{L}_{\mathrm{org}}(\theta_T\,|\,x, y), \\ & \theta_{t+1} = \theta_t - \alpha \nabla_\theta \mathcal{L}_{\mathrm{wfm}}(\theta_t\,|\,x, \phi), \quad t = 0, \ldots, T-1, \end{aligned}$$

where α > 0 is a learning rate. To solve this optimization problem, we use Reverse-HG (Franceschi et al., 2017), which can compute $\nabla_\phi \mathcal{L}_{\mathrm{org}}(\theta_{T+1}\,|\,x, y)$ efficiently using Hessian-vector products. To train the target model jointly with the meta-networks, we alternately update the target model parameters θ and the meta-network parameters φ: we first update the target model for a single step with the objective $\mathcal{L}_{\mathrm{total}}(\theta\,|\,x, y, \phi)$, and then, given the current target model parameters, update the meta-network parameters φ using the 3-stage bilevel training scheme described above. This eliminates a separate meta-training phase for learning φ.
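To illustrate the mechanics of the 3-stage scheme, the toy sketch below instantiates it with scalar parameters: $\mathcal{L}_{\mathrm{wfm}}$ pulls θ toward a "source" location controlled by φ, $\mathcal{L}_{\mathrm{org}}$ is a quadratic with optimum at 3, and the meta-gradient is estimated by central finite differences as a stand-in for Reverse-HG. All names and constants are illustrative only:

```python
# Toy instantiation of the 3-stage scheme with scalar theta and phi.
#   L_wfm(theta | phi) = (theta - phi)^2    ("feature matching" toward phi)
#   L_org(theta)       = (theta - 3.0)^2    (task loss; target optimum at 3)

ALPHA, T, TARGET = 0.1, 2, 3.0

def inner_then_adapt(theta0, phi):
    theta = theta0
    for _ in range(T):                        # stage 1: T steps on L_wfm only
        theta -= ALPHA * 2 * (theta - phi)
    theta -= ALPHA * 2 * (theta - TARGET)     # stage 2: one step on L_org
    return theta

def meta_loss(theta0, phi):
    # stage 3: L_org of the one-step-adapted parameters
    return (inner_then_adapt(theta0, phi) - TARGET) ** 2

# Meta-updates of phi via a finite-difference gradient (stand-in for Reverse-HG).
phi, theta0, eps, meta_lr = 0.0, 0.0, 1e-4, 0.5
for _ in range(100):
    grad = (meta_loss(theta0, phi + eps) - meta_loss(theta0, phi - eps)) / (2 * eps)
    phi -= meta_lr * grad
```

After the meta-updates, the adapted parameters land near the task optimum. Note that φ itself need not converge to 3: it is positioned so that transfer pre-training followed by a single task step performs well, which is exactly what the meta-objective measures.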
The proposed training scheme is formally outlined in Algorithm 1.

## 3. Experiments

We validate our meta-transfer learning method, which learns what and where to transfer between heterogeneous network architectures and tasks.

### 3.1. Setups

Network architectures and tasks for source and target. To evaluate various transfer learning methods, including ours, we perform experiments on two scales of image classification tasks, 32×32 and 224×224.

Figure 3. (a)–(c) Matching configurations C between ResNet32 (left) and VGG9 (right): (a) single, (b) one-to-one, (c) all-to-all. (d) The amounts $\lambda^{m,n}$ of transfer between layers after learning; line width indicates the transfer amount, and lines with $\lambda^{m,n}$ less than 0.1 are omitted.

For the 32×32 scale, we use the TinyImageNet¹ dataset as the source task, and the CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009), and STL-10 (Coates et al., 2011) datasets as target tasks. We train a 32-layer ResNet (He et al., 2016) and a 9-layer VGG (Simonyan & Zisserman, 2015) on the source and target tasks, respectively. For the 224×224 scale, the ImageNet (Deng et al., 2009) dataset is used as the source dataset, and the Caltech-UCSD Birds 200 (Wah et al., 2011), MIT Indoor Scene Recognition (Quattoni & Torralba, 2009), Stanford 40 Actions (Yao et al., 2011), and Stanford Dogs (Khosla et al., 2011) datasets as target tasks. For these datasets, we use a 34-layer and an 18-layer ResNet as the source and target models, respectively, unless otherwise stated.

Meta-network architecture.
For all experiments, we construct the meta-networks as 1-layer fully-connected networks, one for each pair (m, n) ∈ C, where C is the set of candidate pairs, or matching configuration (see Figure 3). Each takes the globally average-pooled features of the mth layer of the source network as input, and outputs $w^{m,n}_c$ and $\lambda^{m,n}$. For the channel assignments w, we use the softmax activation to generate them while satisfying $\sum_c w^{m,n}_c = 1$; for the transfer amounts λ between layers, we use ReLU6 (Krizhevsky & Hinton, 2010), i.e., min(max(0, x), 6), to ensure non-negativity of λ and to prevent $\lambda^{m,n}$ from becoming too large.

Compared schemes for transfer learning. We compare our methods with the following prior methods and their combinations: learning without forgetting (LwF) (Li & Hoiem, 2018), attention transfer (AT) (Zagoruyko & Komodakis, 2017), and unweighted feature matching (FM) (Romero et al., 2015).² Here, AT and FM transfer knowledge at the feature level, as our method does, by matching attention maps or feature maps, respectively, between source and target layers. Feature-level transfer methods generally choose layers just before down-scaling, e.g., the last layer of each residual group in ResNet, and match pairs of layers of the same spatial size.

¹ https://tiny-imagenet.herokuapp.com/
² In our experimental setup, we reproduce, for these baselines, relative improvements over training from scratch similar to those reported in the original papers. We do not report the results of Jacobian matching (JM) (Srinivas & Fleuret, 2018), as the improvement of LwF+AT+JM over LwF+AT is marginal in our setups.
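A minimal sketch of one such pair of meta-networks, with assumed shapes and our own function names: global average pooling of $S^m(x)$, followed by a single fully-connected layer with softmax (for $w^{m,n}$) or ReLU6 (for $\lambda^{m,n}$):

```python
import numpy as np

def relu6(x):
    return np.minimum(np.maximum(x, 0.0), 6.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def meta_forward(source_feat, W_f, b_f, W_g, b_g):
    """Forward pass of the 1-layer meta-networks f^{m,n} and g^{m,n}.

    source_feat: S^m(x), shape (C, H, W)
    W_f, b_f:    weights/bias of f, shapes (C, C) and (C,)
    W_g, b_g:    weights/bias of g, shape (C,) and a scalar
    """
    z = source_feat.mean(axis=(1, 2))   # global average pooling -> (C,)
    w = softmax(W_f @ z + b_f)          # channel weights, sum to 1
    lam = float(relu6(W_g @ z + b_g))   # transfer amount, clipped to [0, 6]
    return w, lam
```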
Following this convention, we evaluate two hand-crafted configurations (single, one-to-one) for the prior methods and a new configuration (all-to-all) for our methods:

- (a) single: use a single pair consisting of the last feature layer of the source model and a target layer with the same spatial size;
- (b) one-to-one: connect each layer just before down-scaling in the source model to a target layer of the same spatial size;
- (c) all-to-all: use all pairs of layers just before down-scaling, e.g., between the ResNet and VGG architectures we consider 3 × 5 = 15 pairs.

For matching features of different spatial sizes, we simply use bilinear interpolation. These configurations are illustrated in Figure 3. Among the various combinations of prior methods and matching configurations, we report only those achieving meaningful performance gains.

### 3.2. Evaluation on Various Target Tasks

We first evaluate the effect of learning what to transfer (L2T-w) without learning where to transfer. To this end, we use the conventional hand-crafted matching configurations, single and one-to-one, illustrated in Figures 3(a) and 3(b), respectively. For most cases reported in Table 1, L2T-w improves performance on the target tasks compared to its unweighted counterpart (FM); for fine-grained target tasks transferred from ImageNet, the gain of L2T-w over FM is more significant. These results support that our method of learning what to transfer is most effective when the target task has a specific type of input distribution, e.g., fine-grained classification, while the source model is trained on a general task.
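The bilinear resizing used above to compare features of different spatial sizes can be sketched as follows; a minimal align-corners implementation of our own, purely for illustration (in practice any library resize would do):

```python
import numpy as np

def bilinear_resize(feat, out_h, out_w):
    """Bilinearly resize a (C, H, W) feature map to (C, out_h, out_w).

    Align-corners interpolation: output corners coincide with input corners.
    """
    C, H, W = feat.shape
    ys = np.linspace(0.0, H - 1, out_h)            # source row coordinates
    xs = np.linspace(0.0, W - 1, out_w)            # source column coordinates
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[None, :, None]                  # fractional row offsets
    wx = (xs - x0)[None, None, :]                  # fractional column offsets
    top = feat[:, y0][:, :, x0] * (1 - wx) + feat[:, y0][:, :, x1] * wx
    bot = feat[:, y1][:, :, x0] * (1 - wx) + feat[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy
```

After resizing one feature map to the other's spatial size, the ℓ2 matching loss can be applied elementwise.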
Figure 4. Change of $\lambda^{m,n}$ during training, with STL-10 as the target task and TinyImageNet as the source task. We plot the mean and standard deviation of $\lambda^{m,n}$ over all samples every 10 epochs: (a) transfer from $S^1(x)$, (b) transfer from $S^2(x)$, (c) transfer from $S^3(x)$.

Table 1. Classification accuracy (%) of transfer learning from TinyImageNet (32×32) or ImageNet (224×224) to CIFAR-100, STL-10, Caltech-UCSD Birds 200 (CUB200), MIT Indoor Scene Recognition (MIT67), Stanford 40 Actions (Stanford40), and Stanford Dogs. For TinyImageNet, ResNet32 and VGG9 are used as the source and target models, respectively; for ImageNet, ResNet34 and ResNet18 are used.

| Method | CIFAR-100 | STL-10 | CUB200 | MIT67 | Stanford40 | Stanford Dogs |
|---|---|---|---|---|---|---|
| Scratch | 67.69 ± 0.22 | 65.18 ± 0.91 | 42.15 ± 0.75 | 48.91 ± 0.53 | 36.93 ± 0.68 | 58.08 ± 0.26 |
| LwF | 69.23 ± 0.09 | 68.64 ± 0.58 | 45.52 ± 0.66 | 53.73 ± 2.14 | 39.73 ± 1.63 | 66.33 ± 0.45 |
| AT (one-to-one) | 67.54 ± 0.40 | 74.19 ± 0.22 | 57.74 ± 1.17 | 59.18 ± 1.57 | 59.29 ± 0.91 | 69.70 ± 0.08 |
| LwF+AT (one-to-one) | 68.75 ± 0.09 | 75.06 ± 0.57 | 58.90 ± 1.32 | 61.42 ± 1.68 | 60.20 ± 1.34 | 72.67 ± 0.26 |
| FM (single) | 69.40 ± 0.67 | 75.00 ± 0.34 | 47.60 ± 0.31 | 55.15 ± 0.93 | 42.93 ± 1.48 | 66.05 ± 0.76 |
| FM (one-to-one) | 69.97 ± 0.24 | 76.38 ± 1.18 | 48.93 ± 0.40 | 54.88 ± 1.24 | 44.50 ± 0.96 | 67.25 ± 0.88 |
| L2T-w (single) | 70.27 ± 0.09 | 74.35 ± 0.92 | 51.95 ± 0.83 | 60.41 ± 0.37 | 46.25 ± 3.66 | 69.16 ± 0.70 |
| L2T-w (one-to-one) | 70.02 ± 0.19 | 76.42 ± 0.52 | 56.61 ± 0.20 | 59.78 ± 1.90 | 48.19 ± 1.42 | 69.84 ± 1.45 |
| L2T-ww (all-to-all) | 70.96 ± 0.61 | 78.31 ± 0.21 | 65.05 ± 1.19 | 64.85 ± 2.75 | 63.08 ± 0.88 | 78.08 ± 0.96 |

(CIFAR-100 and STL-10 use TinyImageNet as the source; the remaining columns use ImageNet.)

Next, instead of using hand-crafted matching pairs of layers, we also learn where to transfer, starting from all matching pairs as illustrated in Figure 3(c). Our final proposed scheme, learning to transfer what and where (L2T-ww), often improves performance significantly compared to hand-crafted matching (L2T-w). As a result, L2T-ww achieves the best accuracy, by a large margin, in all cases reported in Table 1; e.g., on the CUB200 dataset, we attain a 10.4% relative improvement over the second best baseline.
Figure 3(d) shows the amounts $\lambda^{m,n}$ of transfer between pairs of layers after learning to transfer from TinyImageNet to STL-10. As shown in the figure, our method transfers knowledge to higher layers in the target model: $\lambda^{2,5} = 1.40$, $\lambda^{1,5} = 2.62$, $\lambda^{3,4} = 2.88$, $\lambda^{2,4} = 0.74$. The amounts $\lambda^{m,n}$ for the other pairs are smaller than 0.1, except $\lambda^{1,2} = 0.21$. Clearly, such matching pairs are not trivial to find by hand-crafted tuning, which justifies the usefulness of our method for learning where to transfer. Furthermore, since our method outputs sample-wise $\lambda^{m,n}$, the amounts of transfer are adjusted more effectively than with matching pairs fixed across all samples. For example, the amounts of transfer from the source features $S^1(x)$ have relatively smaller variance over samples (Figure 4(a)) than those of $S^3(x)$ (Figure 4(c)). This is because higher-level features are more task-specific while lower-level features are more task-agnostic. It evidences that the meta-networks $g_\phi$ adjust the amounts of transfer for each sample, considering the relationship between the tasks and the levels of abstraction of the features.

### 3.3. Experiments on Limited-Data Regimes

When a target task has only a small number of labeled training samples, transfer learning can be even more effective. To evaluate our method (L2T-ww) in this limited-data scenario, we use CIFAR-10 as the target task while reducing the number of samples. We use N ∈ {50, 100, 250, 500, 1000} training samples for each class,

Table 2. Classification accuracy (%) of VGG9 on STL-10 transferred from multiple source models. The first source model is ResNet32 trained on TinyImageNet; the additional source model is one of three: ResNet20 trained on TinyImageNet, another ResNet32 trained on TinyImageNet, or ResNet32 trained on CIFAR-10. We report the performance of the target model transferred from a single source model and from two source models.
| Method \ Second source | None | TinyImageNet (ResNet20) | TinyImageNet (ResNet32) | CIFAR-10 (ResNet32) |
|---|---|---|---|---|
| Scratch | 65.18 ± 0.91 | 65.18 ± 0.91 | 65.18 ± 0.91 | 65.18 ± 0.91 |
| LwF | 68.64 ± 0.58 | 68.56 ± 2.24 | 68.05 ± 2.12 | 69.51 ± 0.63 |
| AT | 74.19 ± 0.22 | 73.24 ± 0.12 | 73.78 ± 1.16 | 73.99 ± 0.51 |
| LwF+AT | 75.06 ± 0.57 | 74.72 ± 0.46 | 74.77 ± 0.30 | 74.41 ± 1.51 |
| FM (single) | 75.00 ± 0.34 | 75.83 ± 0.56 | 75.99 ± 0.11 | 74.60 ± 0.73 |
| FM (one-to-one) | 76.38 ± 1.18 | 77.45 ± 0.48 | 77.69 ± 0.79 | 77.15 ± 0.41 |
| L2T-ww (all-to-all) | 78.31 ± 0.21 | 79.35 ± 0.41 | 79.80 ± 0.52 | 80.52 ± 0.29 |

(The first source is ResNet32 trained on TinyImageNet in all columns.)

Figure 5. Transfer from TinyImageNet to CIFAR-10 with varying numbers of training samples per class in CIFAR-10 (curves for Scratch, LwF, AT, LwF+AT, and L2T-ww); the x-axis is plotted in logarithmic scale.

and compare the performance of learning from scratch, LwF, AT, LwF+AT, and L2T-ww. The results are reported in Figure 5. They show that our method achieves larger improvements over the baselines as the volume of the target dataset becomes smaller. For example, for N = 50, our method achieves 64.91% classification accuracy, while the baselines LwF+AT, AT, LwF, and scratch achieve 53.76%, 51.76%, 43.32%, and 39.99%, respectively. Observe that ours needs only 50 samples per class to achieve accuracy similar to that of LwF with 250 samples per class.

### 3.4. Experiments on Multi-Source Transfer

In practice, one may have multiple source models pre-trained on various source datasets. Transfer from multiple sources can potentially provide more knowledge for learning a target task; however, using them simultaneously may require even more hand-crafted configuration of transfer, such as balancing the transfer from the different sources or choosing different pairs of layers depending on the source models.
To evaluate the effects of using multiple source models, we consider scenarios with transfer from two source models simultaneously, where the models have different architectures (ResNet20, ResNet32) or are trained on different datasets (TinyImageNet, CIFAR-10). In Table 2, we report the results of ours (L2T-ww) and other transfer methods on the target task STL-10 with a 9-layer VGG as the target model architecture. Our method consistently improves target model performance as the sources become more informative (from left to right in Table 2), i.e., when using a larger second source model (ResNet20 → ResNet32) or a different second source dataset (TinyImageNet → CIFAR-10); this does not hold for any other method. In particular, comparing the best performance of each method transferred from two TinyImageNet models with that from TinyImageNet+CIFAR-10 models as sources, one can conclude that ours is the only method that effectively aggregates the heterogeneous source knowledge, i.e., TinyImageNet+CIFAR-10. This shows the importance of choosing the right transfer configuration when using multiple source models, and confirms that our method can automatically decide a useful configuration from the many possible candidate pairs.

### 3.5. Visualization

By learning what to transfer, our weighted feature matching allocates larger attention to task-related channels of the feature maps. To visualize the attention used in knowledge transfer, we compare saliency maps (Simonyan et al., 2014) for unweighted (FM) and weighted (L2T-w) matching between the last layers of the source and target models. Saliency maps can be computed as

$$M_{i,j} = \max_{c} \left| \frac{\partial}{\partial x_{c,i,j}} \mathcal{L}^{m,n}_{\mathrm{wfm}}(\theta\,|\,x, w^{m,n}) \right|,$$
where x is an image, c is a channel of the image (e.g., RGB), and (i, j) ∈ {1, 2, …, H} × {1, 2, …, W} is a pixel position. For the unweighted case, we use uniform weights; for the weighted case, we use the outputs $w^{m,n} = f^{m,n}_\phi(S^m(x))$ of the meta-networks learned by our meta-training scheme. When computing saliency maps, we use normalized gradients. Figure 6 shows which pixels are more or less activated in the saliency map of L2T-w compared to FM. As shown in the figure, pixels containing task-specific objects (birds or dogs) are more activated with L2T-w, while background pixels are less activated. This means the weights $w^{m,n}$ make the transferred knowledge of the source model more task-specific, which in turn improves transfer learning.

Figure 6. More (second column) and less (third column) activated pixels in the saliency maps of L2T-w compared to unweighted feature matching (FM), on images from the (a) CUB200 and (b) Stanford Dogs datasets. One can observe that the more activated pixels induced by L2T-w tend to correspond to the locations of task-specific objects, while the less activated pixels are spread over the entire image.

## 4. Conclusion

We propose a transfer method based on meta-learning that transfers knowledge selectively depending on the tasks and architectures. Our method transfers the knowledge more important for learning a target task by identifying what and where to transfer using meta-networks. To learn the meta-networks, we design an efficient meta-learning scheme that requires only a few steps in the inner-loop procedure, which allows the target model and the meta-networks to be trained jointly. We believe our work sheds new light on complex transfer learning between heterogeneous and/or multiple network architectures and tasks.
Acknowledgements

This work was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government MSIT (No.2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion) and by the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF2018R1A5A1059921).

References

Coates, A., Ng, A., and Lee, H. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS 2011), 2011.

Colson, B., Marcotte, P., and Savard, G. An overview of bilevel optimization. Annals of Operations Research, 2007.

Cui, Y., Song, Y., Sun, C., Howard, A., and Belongie, S. Large scale fine-grained categorization and domain-specific transfer learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), 2009.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2017.

Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2017.

Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. Bilevel programming for hyperparameter optimization and meta-learning. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 2018.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. In Deep Learning and Representation Learning Workshop, Advances in Neural Information Processing Systems 29 (NIPS 2015), 2015.

Khosla, A., Jayadevaprakash, N., Yao, B., and Fei-Fei, L. Novel dataset for fine-grained image categorization. In The 1st Workshop on Fine-Grained Visual Categorization, the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), 2011.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Krizhevsky, A. and Hinton, G. Convolutional deep belief networks on CIFAR-10, 2010.

Li, Z. and Hoiem, D. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

Naik, D. K. and Mammone, R. Meta-neural networks that learn by learning. In International Joint Conference on Neural Networks (IJCNN), 1992.

Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 2010.

Quattoni, A. and Torralba, A. Recognizing indoor scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), 2009.

Razavian, A. S., Azizpour, H., Sullivan, J., and Carlsson, S. CNN features off-the-shelf: An astounding baseline for recognition. In The IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2014), 2014.

Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., and Bengio, Y. FitNets: Hints for thin deep nets. In The 3rd International Conference on Learning Representations (ICLR 2015), 2015.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In The 3rd International Conference on Learning Representations (ICLR 2015), 2015.

Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. In The 2nd International Conference on Learning Representations Workshop (ICLR 2014), 2014.

Srinivas, S. and Fleuret, F. Knowledge transfer with Jacobian matching. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 2018.

Thrun, S. and Pratt, L. Learning to learn. Springer Science & Business Media, 2012.

Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

Yao, B., Jiang, X., Khosla, A., Lin, A. L., Guibas, L., and Fei-Fei, L. Human action recognition by learning bases of action attributes and parts. In The IEEE International Conference on Computer Vision (ICCV 2011), 2011.

Zagoruyko, S. and Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In The 5th International Conference on Learning Representations (ICLR 2017), 2017.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In The European Conference on Computer Vision (ECCV 2014), 2014.