# Deep Model Transferability from Attribution Maps

Jie Song¹,³, Yixin Chen¹, Xinchao Wang², Chengchao Shen¹, Mingli Song¹,³
¹Zhejiang University, ²Stevens Institute of Technology, ³Alibaba-Zhejiang University Joint Institute of Frontier Technologies
{sjie,chenyix,chengchaoshen,brooksong}@zju.edu.cn, xinchao.wang@stevens.edu

Abstract

Exploring the transferability between heterogeneous tasks sheds light on their intrinsic interconnections, and consequently enables knowledge transfer from one task to another so as to reduce the training effort of the latter. In this paper, we propose an embarrassingly simple yet very efficacious approach to estimating the transferability of deep networks, especially those handling vision tasks. Unlike the seminal work of taskonomy that relies on a large number of annotations as supervision and is thus computationally cumbersome, the proposed approach requires no human annotations and imposes no constraints on the architectures of the networks. This is achieved, specifically, by projecting deep networks into a model space, wherein each network is treated as a point and the distance between two points is measured by the deviation of their produced attribution maps. The proposed approach is several orders of magnitude faster than taskonomy, and meanwhile preserves a task-wise topological structure highly similar to the one obtained by taskonomy. Code is available at https://github.com/zju-vipa/TransferbilityFromAttributionMaps.

1 Introduction

Deep learning has brought about unprecedented advances in many if not all the major artificial intelligence tasks, especially computer vision ones. The state-of-the-art performances, however, come at the cost of an often burdensome training process that requires an enormous number of human annotations and GPU hours, as well as partially interpretable and thus only intermittently predictable black-box behaviors. Understanding the intrinsic relationships between such deep-learning tasks, if any, may on the one hand elucidate the rationale behind the encouraging results achieved by deep learning, and on the other hand allow for more predictable and explainable transfer learning from one task to another, so that the training effort can be significantly reduced.

The seminal work of taskonomy [37] made the pioneering attempt towards disentangling the relationships between visual tasks through a computational approach. This is accomplished by first training all the task models and then all the feasible transfers among models, in a fully supervised manner. Based on the obtained transfer performances, an affinity matrix of transferability is derived, upon which an Integer Program can be further imposed to compute the final budget-constrained task-transferability graph. Despite the intriguing results achieved, the training cost, especially that of the combinatorial transferability learning, makes taskonomy prohibitively expensive. Even for the first-order transferability estimation, the training cost grows quadratically with respect to the number of tasks involved; when adding a new task to the graph, the transferability has to be explicitly trained between the new task and all those in the task dictionary.

In this paper, we propose an embarrassingly simple yet competent approach to estimating the transferability between different tasks, with a focus on computer vision ones.
Unlike taskonomy, which relies on training the task-specific models and their transfers using human annotations, our approach assumes that no labelled data are available and that we are given only the pre-trained deep networks, which can nowadays be found effortlessly online. Moreover, we do not impose constraints on the architectures of the deep networks, such as requiring networks handling different tasks to share the same architecture.

At the heart of our approach is to project pre-trained deep networks into a common space, termed the model space. The model space accepts networks of heterogeneous architectures handling different tasks, and transforms each network into a point. The distance between two points in the model space is then taken as the measure of their relatedness and the consequent transferability. Such a construction of the model space enables prompt model insertion or deletion, as updating the transferability graph boils down to computing nearest neighbors in the model space, which is therefore much lighter than taskonomy, which requires pair-wise re-training for each newly added task.

The projection into the model space is attained by feeding unlabelled images, which can be obtained handily online, into a network and then computing the corresponding attribution maps. An attribution map signals the pixels in the input image that are highly relevant to the downstream tasks or hidden representations, and therefore highlights the attention of a network for a specific task. In other words, the model space can be thought of as a space defined on top of attribution maps, where the affinity between points or networks is evaluated using the distance between their produced attribution maps, which again requires no supervision and can be computed very quickly. The intuition behind adopting attribution maps for network-affinity estimation is rather straightforward: models focusing on similar regions of input images are expected to produce correlated representations, and thus potentially give rise to favorable transfer-learning results. This assumption is inspired by the work of [36], which utilizes the attention of a teacher model to guide the learning of a student and produces encouraging results.

Despite its very simple nature, the proposed approach yields truly promising results: it leads to a speedup of several orders of magnitude and meanwhile maintains a highly similar transferability topology, as compared to taskonomy. In addition, experiments on vision tasks beyond those involved in taskonomy also produce intuitively plausible results, validating the proposed approach and providing us with insights on their transferability.

Our contribution is therefore a lightweight and effective approach towards estimating the transferability between deep visual models, achieved by projecting each model into a common space and approximating their affinity using attribution maps. It requires no human annotations and is readily applicable to pre-trained networks specializing in various tasks and of heterogeneous architectures. Running at a speed several orders of magnitude faster than taskonomy and producing competitively similar results, the proposed model may serve as a competent transferability estimator and an effectual substitute for taskonomy, especially when human annotations are unavailable, when the model library is large in size, or when frequent model insertion or update takes place.
2 Related Work

We briefly review here some topics that are most related to the proposed work, including model reusing, transfer learning, and attribution methods for deep models.

Model Reusing. Reusing pre-trained models has been an active research topic in recent years. Hinton et al. [9] first proposed the concept of knowledge distillation, where trained cumbersome teacher models are reused to produce soft labels for training a lightweight student model. Following their teacher-student scheme, more advanced methods [24, 36, 6, 15] have been proposed to fully exploit the knowledge encoded in the trained teacher model. However, in these works all the teachers and the student are trained for the same task. To reuse models of different tasks, Rusu et al. [25] propose the progressive neural net to extract useful features from multiple teachers for a new task. Parisotto et al. [19] propose Actor-Mimic to use the guidance from several expert teachers of distinct tasks. However, none of these works explore the relatedness among different tasks. In this paper, by explicitly modeling the model transferability, we provide an effective method to pick the trained model most beneficial for solving the target task.

Figure 1: An illustrative diagram of the workflow of the proposed method. It mainly consists of three steps: collecting probe data, computing attribution maps, and estimating model transferability.

Transfer Learning. Another way of reusing trained models is to transfer the trained model to another task by reusing the features extracted from certain layers. Razavian et al. [22] demonstrated that features extracted from deep neural networks can be used as generic image representations to tackle a diverse range of visual tasks. Yosinski et al. [34] investigated the transferability of deep features extracted from every layer of a deep neural network. Azizpour et al. [2] investigated several factors affecting the transferability of deep features. Recently, the effects of pre-training datasets for transfer learning have been studied [13, 7, 12, 33, 23]. None of these works, however, explicitly quantify the relatedness among different tasks or trained models to provide a principled way for model selection. Zamir et al. [37] proposed a fully computational approach, known as taskonomy, to address this challenging problem. However, taskonomy requires labeled data and is computationally expensive, which limits its application to large-scale real-world problems. Recently, Dwivedi and Roig [4] proposed to use representation similarity analysis to approximate the task taxonomy. In this paper, we introduce a model space for modeling task transferability and propose to measure the transferability via attribution maps, which, unlike taskonomy, requires no human annotations and works directly on pre-trained models. We believe our method is a good complement to existing works.

Attribution Methods for Deep Models. Attribution refers to assigning importance scores to the inputs for a specified output. Existing attribution methods can be mainly divided into two groups: perturbation-based [38, 39, 40] and gradient-based [28, 3, 27, 30, 26, 18, 1] methods.
Perturbation-based methods compute the attribution of an input feature by making perturbations, e.g., removing, masking or altering, to individual inputs or neurons and observing the impact on later neurons. However, such methods are computationally inefficient, as each perturbation requires a separate forward propagation through the network. Gradient-based methods, on the other hand, estimate the attributions for all input features in one or a few forward and backward passes through the network, which renders them generally more efficient. Simonyan et al. [28] construct attributions by taking the absolute value of the partial derivative of the target output with respect to the input features. Later, Layer-wise Relevance Propagation (ϵ-LRP) [3], gradient*input [27], integrated gradients [30] and DeepLIFT [26] were proposed to aid understanding of the information flow in deep neural networks. In this paper, we directly adopt some of these off-the-shelf methods to produce the attribution maps. Devising attribution methods more suitable for our problem is left to future work.

3 Estimating Model Transferability from Attribution Maps

We provide in this section the details of the proposed transferability estimator. We start by giving the problem setup and an overview of the method, followed by describing its three steps, and finally present an efficiency analysis.

3.1 Problem Setup

Assume we are given a set of pre-trained deep models $\mathcal{M} = \{m_1, m_2, \ldots, m_N\}$, where $N$ is the total number of models involved. No constraints are imposed on the architectures of these models. We use $t_i$ to denote the task handled by model $m_i$, and use $\mathcal{T} = \{t_1, t_2, \ldots, t_N\}$ to denote the task dictionary, i.e., the set of all the tasks involved in $\mathcal{M}$. Furthermore, we assume that no labelled annotations are available. Our goal is to efficiently quantify the transferability between different tasks in $\mathcal{T}$, so that given a target task, we can read out from the learned transferability matrix the source task that potentially yields the highest transfer performance.

3.2 Overview

The core idea of our method is to embed the pre-trained deep models into the model space, wherein models are represented by points and model transferability is measured by the distance between the corresponding points. To this end, we utilize attribution maps to construct such a model space. The assumption is that related models should produce similar attribution maps for the same input image. The workflow of our method consists of three steps, as shown in Figure 1. First, we collect an unlabeled probe dataset, which will be used to construct the model space, from a randomly selected data distribution. Second, for each trained model, we adopt off-the-shelf attribution methods to compute the attribution maps of all images in the constructed probe dataset. Finally, for each model, all its attribution maps are collectively viewed as a single point in the model space, based on which the model transferability is estimated. In what follows, we provide details for each of the three steps.

3.3 Key Steps

Step 1: Building the Probe Dataset. As deep models handling different tasks, or even the same one, may be of heterogeneous architectures or trained on data from various domains, it is non-trivial to measure their transferability directly from their outputs or intermediate features. To bypass this problem, we feed the same input images to these models and measure the model transferability by the similarity of their responses to the same stimuli. We term the set of all such input images the probe data, which is shared by all the tasks involved. Intuitively, the probe dataset should be not only large in size but also rich in diversity, as models in $\mathcal{M}$ may be trained on various domains for different tasks. However, experiments show that the proposed method works surprisingly well even when the probe data are collected from a single domain and are of moderately small size (about 1,000 images). The produced transferability relationship is highly similar to the one derived by taskonomy. This property renders the proposed method attractive, as little effort is required for collecting the probe data. More details can be found in Section 4.2.3.
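As a concrete illustration of Step 1, the following is a minimal sketch (our own, not the authors' released code) of how such a probe dataset could be assembled; the directory path and file pattern are placeholders, and the default size follows the roughly 1,000-image setting described above.

```python
import random
from pathlib import Path

def build_probe_dataset(image_dir, size=1000, seed=0):
    """Randomly sample `size` unlabeled images to serve as probe data.

    image_dir: placeholder path to any folder of images; no labels are needed.
    """
    paths = sorted(Path(image_dir).glob("*.jpg"))  # file pattern is illustrative
    random.seed(seed)
    return random.sample(paths, min(size, len(paths)))
```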
Step 2: Computing Attribution Maps. Let us denote the collected probe data by $\mathcal{X} = \{X_1, X_2, \ldots, X_{N_p}\}$, with $X_i = [x^i_1, x^i_2, \ldots, x^i_{WHC}] \in \mathbb{R}^{WHC}$, where $W$, $H$ and $C$ respectively denote the width, the height and the number of channels of the input images, and $N_p$ is the size of the probe data. Note that for brevity the maps are written in vectorized form here. Model $m_i$ takes an input $\tilde{X} = T_i(X) \in \mathbb{R}^{W_i H_i C_i}$ and produces a hidden representation $R = [r_1, r_2, \ldots, r_D]$. Here, $T_i$ serves as a preprocessing function that transforms the images in the probe data for model $m_i$, as we allow different models to take images of different sizes as input, and $D$ is the dimension of the representation.

For each model $m_i$ in $\mathcal{M}$, our goal in this step is to produce an attribution map $A^i_j = [a^i_{j1}, a^i_{j2}, \ldots] \in \mathbb{R}^{WHC}$ for each image $X_j$ in the probe data $\mathcal{X}$. In fact, an attribution map $A^{i,k}_j$ can be computed for each unit $r_k$ in $R$. However, as we consider the transferability of $R$ as a whole, we average the attribution maps of all units in $R$ to obtain the overall attribution map: $A^i_j = \frac{1}{D}\sum_{k=1}^{D} A^{i,k}_j$.

Specifically, we adopt three off-the-shelf attribution methods to produce the attribution maps: saliency map [28], gradient * input [27], and ϵ-LRP [3]. Saliency map computes attributions by taking the absolute value of the partial derivative of the target output with respect to the input. Gradient * input amounts to a first-order Taylor approximation of how the output would change if the input were set to zero. ϵ-LRP, on the other hand, computes the attributions by redistributing the prediction score (output) layer by layer until the input layer is reached. For all three attribution methods, the overall attribution map $A^i_j$ can be computed through a single forward-and-backward propagation [1] in TensorFlow. The formulations of the three attribution methods are summarized in Table 1. More details can be found in [28, 27, 3, 1].

Table 1: Mathematical formulations of saliency map [28], gradient * input [27] and ϵ-LRP [3]. Note that the superscript $g$ denotes a novel definition of the partial derivative [1].

| Method | Attribution map $A^i_j$ |
| --- | --- |
| Saliency Map [28] | $\frac{1}{D}\sum_{d=1}^{D}\left\lvert\frac{\partial r_d}{\partial X_j}\right\rvert$ |
| Gradient * Input [27] | $\frac{1}{D}\sum_{d=1}^{D} X_j \odot \frac{\partial r_d}{\partial X_j}$ |
| ϵ-LRP [3, 1] | $\frac{1}{D}\sum_{d=1}^{D} X_j \odot \frac{\partial^g r_d}{\partial X_j}$, with $g = \frac{f(z)}{z}$ |

For model $m_i$, the produced attribution map $\tilde{A}^i$ is of the same size as the input $\tilde{X}$, i.e., $\tilde{A}^i \in \mathbb{R}^{W_i H_i C_i}$. We apply the inverse of $T_i$ to transform the attribution maps back to the same size as the images in the probe data: $A^i = T_i^{-1}(\tilde{A}^i)$, with $A^i \in \mathbb{R}^{WHC}$. As the attribution maps of all models are transformed into the same size, the transferability can be computed based on these maps.
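To make Step 2 concrete, below is a minimal sketch (not the authors' released code) of computing gradient * input attribution maps for one model, assuming its encoder is available as a tf.keras.Model; the function name and the use of TensorFlow 2's GradientTape are our own choices.

```python
import tensorflow as tf

def gradient_x_input_attribution(encoder, images):
    """Sketch of Step 2 for the gradient * input method.

    encoder: assumed to be a tf.keras.Model mapping preprocessed probe
             images of shape (B, H_i, W_i, C_i) to a representation (B, D).
    Returns one attribution map per image, with the same shape as `images`.
    """
    images = tf.convert_to_tensor(images, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(images)
        representation = encoder(images)                  # (B, D)
        # gradient * input is linear in the output units, so averaging the
        # per-unit maps A^{i,k}_j over k equals attributing the mean of R,
        # which costs only one forward-and-backward pass.
        target = tf.reduce_mean(representation, axis=-1)  # (B,)
    grads = tape.gradient(target, images)                 # (B, H_i, W_i, C_i)
    return images * grads
```

For the saliency map, the absolute value would instead be taken per unit before averaging, and in all cases the resulting maps are transformed back to the common probe resolution, i.e., the $T_i^{-1}$ step above.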
Step 3: Estimating Model Transferability. Once Step 2 is completed, we have $N_p$ attribution maps $\mathcal{A}^i = \{A^i_1, A^i_2, \ldots, A^i_{N_p}\}$ for each model $m_i$, where $A^i_j$ denotes the attribution map of the $j$-th image $X_j$ in $\mathcal{X}$. The model $m_i$ can then be viewed as a point in the model space $\mathbb{R}^{N_p WHC}$, formed by concatenating all of its attribution maps. The distance between two models is taken to be
$$d(m_i, m_j) = N_p - \sum_{k=1}^{N_p} \mathrm{cos\_sim}(A^i_k, A^j_k), \qquad (1)$$
where $\mathrm{cos\_sim}(A^i_k, A^j_k) = \frac{A^i_k \cdot A^j_k}{\lVert A^i_k\rVert\,\lVert A^j_k\rVert}$. The model transferability map, which captures the pairwise transferability relationships, can then be derived from these distances.

The model transferability, as shown by taskonomy [37], is inherently asymmetric. In other words, if model $m_i$ ranks first in being transferred to task $t_j$ among all the models (except $m_j$) in $\mathcal{M}$, $m_j$ does not necessarily rank first in being transferred to task $t_i$. Yet, the proposed model space is symmetric in distance, as we have $d(m_i, m_j) = d(m_j, m_i)$. We argue that the symmetry of the distance in the model space has little negative effect on the transferability relationships, as the transferability rankings of the source tasks are computed by relative comparison of distances. Experiments demonstrate that, even with the symmetric model space, the proposed method is able to effectively approximate the asymmetric transferability relationships produced by taskonomy.
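The following is a minimal sketch of Step 3 under the notation above, assuming the attribution maps have already been transformed back to a common resolution and vectorized; the helper names are illustrative, not taken from the released code.

```python
import numpy as np

def model_distance(attr_i, attr_j, eps=1e-12):
    """Eq. (1): distance between two models from their attribution maps.

    attr_i, attr_j: arrays of shape (Np, W*H*C) holding the vectorized
    attribution maps of models m_i and m_j on the same Np probe images.
    """
    cos_sim = np.sum(attr_i * attr_j, axis=1) / (
        np.linalg.norm(attr_i, axis=1) * np.linalg.norm(attr_j, axis=1) + eps)
    return len(attr_i) - cos_sim.sum()   # N_p - sum_k cos_sim(A^i_k, A^j_k)

def distance_matrix(attributions):
    """Symmetric pairwise distances for a list of per-model attribution arrays."""
    n = len(attributions)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = model_distance(attributions[i], attributions[j])
    return dist
```

Ranking the candidate source models for a target task then amounts to sorting one row of this matrix, which is why inserting a new model only requires computing its attribution maps and one additional row of distances.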
3.4 Efficiency Analysis

Here we make a rough comparison between the efficiency of the proposed approach and that of taskonomy. As we assume task-specific trained models are available, we compare the computation cost of our method with that of only the transfer modeling in taskonomy. For taskonomy, let us assume the transfer model is trained for $E$ epochs on training data of size $N$; then for a task dictionary of size $T$, the computation cost amounts to approximately $E N T(T-1)$ forward-and-backward propagations¹. For our method working on the probe dataset, however, only a single forward-and-backward propagation is required per model and probe image. The overall computation cost for building the model space in our method is thus about $TM$ forward-and-backward propagations, where $M$ is the size of the probe dataset and usually $M \ll N$. The proposed method is therefore about $\frac{EN(T-1)}{M}$ times more efficient than taskonomy. This also means the speedup over taskonomy becomes even more significant as more tasks are involved and $T$ grows. In our experiments, the proposed method takes about 20 GPU hours on one Quadro P5000 card to compute the pairwise transferability relationships for 20 pre-trained taskonomy models, while taskonomy takes thousands of GPU hours on the cloud² for the same number of tasks.

¹Here, for simplicity, we ignore the computation-cost differences caused by the model architectures.
²As the hardware configurations are not clear here, we list the GPU hours only for a rough comparison.

4 Experiments

4.1 Experimental Settings

Pre-trained Models. Two groups of trained models are adopted to validate the proposed method. In the first group, we adopt 20 trained models of single-image tasks released by taskonomy [37], of which the task relatedness has been constructed and also released. It is used as the oracle to evaluate the proposed method. Note that all these models adopt an encoder-decoder architecture, where the encoder is used to extract representations and the decoder makes task predictions. For these models, the attribution maps are computed with respect to the output of the encoder. To further validate the proposed method, we construct a second group of trained models, collected online. We have managed to obtain 18 trained models in this group: two VGGs [29] (VGG16, VGG19), three ResNets [8] (ResNet50, ResNet101, ResNet152), two Inceptions (Inception V3 [32], Inception-ResNet-v2 [31]), three MobileNets [10] (MobileNet, 0.5 MobileNet, 0.25 MobileNet), four Inpaintings [35] (ImageNet, CelebA, CelebA-HQ, Places), FCRN [14], FCN [17], PRN [5] and Tiny Face Detector [11]. All these models are also viewed in an encoder-decoder manner: the sub-model that produces the most compact features is viewed as the encoder and the remainder as the decoder. Similar to the taskonomy models, the attribution maps are computed with respect to the output of the encoder. More details of these models can be found in the supplementary material.

Probe Datasets. We build three datasets, taskonomy data [37], indoor scene [20], and COCO [16], as the probe data to evaluate our method. The domain difference between taskonomy data and COCO is much larger than that between taskonomy data and indoor scene. For all three datasets, we randomly select about 1,000 images to construct the probe datasets. More details of the three probe datasets are provided in the supplementary material. In Section 4.2.3, we demonstrate the performance of the proposed method evaluated on these three probe datasets.

Figure 2: Visualization of attribution maps produced using ϵ-LRP on taskonomy models. Some tasks produce visually similar attribution maps, such as Rgb2depth and Rgb2mist.

4.2 Experiments on Models in Taskonomy

4.2.1 Visualization of Attribution Maps

We first visualize the attribution maps produced by various trained models for the same input images. Two examples are given in Figure 2. Attribution maps are produced by ϵ-LRP on taskonomy data. From the two examples, we can see that some tasks produce visually similar attribution maps, for example, ⟨Rgb2depth, Rgb2mist⟩³, ⟨Class 1000, Class Places⟩ and ⟨Denoise, Keypoint 2D⟩. Within each cluster, the trained models pay attention to similar regions, so the knowledge they have learned is intuitively highly correlated (as seen in Section 4.2.2) and can be transferred from one to another (as seen in Section 4.2.3). With only two examples, the constructed model transferability may deviate from the underlying model relatedness; however, such deviation is alleviated by aggregating the results of more examples drawn from the data distribution. For more visualization examples, please see the supplementary material.

³Here we use ⟨·⟩ to denote a cluster of tasks whose attribution maps are highly similar.

4.2.2 Rationality of the Assumption

Here we adopt Singular Vector Canonical Correlation Analysis (SVCCA) [21] to validate the rationality underlying our assumption: if tasks produce similar attribution maps, the representations extracted from the corresponding models should be highly correlated, and thus they are expected to yield favorable transfer-learning performance to each other. In SVCCA, each neuron is represented by an activation vector, i.e., its set of responses to a set of inputs, and hence a layer can be represented by the subspace
spanned by the activation vectors of all the neurons in this layer. SVCCA first applies Singular Value Decomposition (SVD) to each subspace to obtain new subspaces that comprise the most important directions of the original subspaces, and then uses Canonical Correlation Analysis (CCA) to compute a series of correlation coefficients between the new subspaces. The overall correlation is measured by the average of these correlation coefficients.

Figure 3: Left: visualization of the correlation matrix from SVCCA. Middle: the difference between the correlation matrix from SVCCA and the transferability matrix derived from attribution maps. Both of them are normalized for better visualization. Right: the Correlation-Priority Curve (CPC).

Experimental results on taskonomy data with ϵ-LRP are shown in Figure 3. On the left, the correlation matrix over the pre-trained taskonomy models is visualized. In the middle, we plot the difference between the correlation matrix and the model transferability matrix derived from attribution maps in the proposed method. It can be seen that the values in the difference matrix are in general small, implying that the correlation matrix is highly similar to the model transferability matrix. To further quantify the overall similarity between these two matrices, we compute their Pearson correlation ($\rho_p = 0.939$) and Spearman correlation ($\rho_s = 0.660$). All these results show that the similarity of attribution maps is a good indicator of the correlation between representations.

In addition, we can see that some tasks, like Edge 3D and Colorization, tend to be more correlated with the other tasks, as the colors of the corresponding rows or columns are darker than those of others, while some other tasks, like Vanishing Point, are not. In taskonomy, the priorities⁴ of Edge 3D, Colorization and Vanishing Point are 5.4, 5.8 and 14.2, respectively. This indicates that more correlated representations tend to be more suitable for transfer learning to each other. To make this clearer, we depict the Correlation-Priority Curve (CPC) on the right of Figure 3. In this figure, for each priority $p$ shown on the abscissa, the correlation shown on the ordinate is computed as
$$\mathrm{correlation}(p) = \frac{1}{\sum_{i\neq j}\mathbb{I}(r^i_j = p)}\sum_{i\neq j}\mathbb{I}(r^i_j = p)\,\rho_{i,j},$$
where $\mathbb{I}$ is the indicator function and $\rho_{i,j}$ is the correlation between the representations extracted from models $m_i$ and $m_j$. It can be seen that as the priority becomes lower, the average correlation becomes weaker. All these results verify the rationality underlying the assumption.

⁴The priority of a task $i$ refers to its average ranking when transferred to the other tasks: $p_i = \frac{1}{N}\sum_{j=1}^{N} r^i_j$, where $r^i_j$ denotes the ranking of task $i$ when transferred to task $j$. A smaller value of $p_i$ denotes a higher priority.

4.2.3 Deep Model Transferability

We adopt two evaluation metrics, P@K and R@K⁵, which are widely used in the information retrieval field, to compare the model transferability constructed by our method with that of taskonomy. Each target task is viewed as a query, and its top-5 source tasks that produce the best transfer performances in taskonomy are regarded as relevant to the query. To better understand the results, we introduce a baseline using random ranking, as well as the oracle, an ideal method that always produces perfect results.

⁵P: precision; R: recall; @K: only the top-K results are examined.
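As a concrete reference for this evaluation protocol, the following is a minimal sketch (our own illustration, not the authors' evaluation script) of P@K and R@K for a single query, assuming the taskonomy top-5 source tasks are given as the relevant set.

```python
import numpy as np

def precision_recall_at_k(predicted_ranking, relevant, k):
    """P@K and R@K for one query (target task).

    predicted_ranking: source tasks ordered by estimated transferability,
                       best first (e.g., by sorting the distances of Eq. (1)).
    relevant:          the top-5 source tasks according to taskonomy.
    """
    hits = len(set(predicted_ranking[:k]) & set(relevant))
    return hits / k, hits / len(relevant)

def average_pr_at_k(rankings, relevant_sets, k):
    """Average P@K and R@K over all target tasks (queries)."""
    pr = np.array([precision_recall_at_k(r, rel, k)
                   for r, rel in zip(rankings, relevant_sets)])
    return pr.mean(axis=0)
```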
Additionally, we also evaluate SVCCA for computing the model transferability relationships. The experimental results are depicted in Figure 4.

Figure 4: From left to right: P@K curve, R@K curve and task similarity tree constructed by ϵ-LRP. Results of SVCCA are produced using validation data from taskonomy.

Based on the results, we can draw the following conclusions.

The topological structure of the model transferability derived from the proposed method is similar to that of the oracle. For example, when only the top-3 predictions are examined, the precision can be about 85% on COCO with ϵ-LRP. To see this more clearly, we also depict the task similarity tree constructed by agglomerative hierarchical clustering in Figure 4. This tree is again highly similar to that of taskonomy, where 3D, 2D, geometric, and semantic tasks cluster together.

ϵ-LRP and gradient * input generally produce better performance than saliency. This phenomenon can in part be explained by the fact that saliency generates attributions entirely from gradients, which denote the direction for optimization; the gradients alone are not able to fully reflect the relevance between the inputs and the outputs of the deep model, thus leading to inferior results. It also implies that the choice of attribution method can affect the performance of our method. Devising better attribution methods may further improve the accuracy of our method, which is left as future work.

The proposed method works quite well on probe data from different domains, such as indoor scene and COCO. This implies that the proposed method is robust, to some degree, to the choice of the probe data, which makes data collection effortless. Furthermore, it can be seen that the probe data from indoor scene and COCO surprisingly predict the taskonomy transferability better than the probe data from taskonomy itself. We conjecture that more complex textures disentangle the attributions better, so the probe data from COCO and indoor scene, which are generally more complex in texture, yield superior results to taskonomy data when used as probe data. However, more research is necessary to discover whether this explanation holds in general.

SVCCA also works well in estimating the transferability of taskonomy models. However, the proposed method yields superior or comparable performance to SVCCA when using gradient * input and ϵ-LRP for attribution. What's more, as the proposed method measures transferability by computing distances, it is several times more efficient than SVCCA, especially when the hidden representation is large in dimension or a new task is added into a large task dictionary.

With all these observations and the fact that the proposed method is significantly more efficient than taskonomy, the proposed method is indeed an effectual substitute for taskonomy, especially when human annotations are unavailable, when the model library is large in size, or when frequent model insertion and update takes place.
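For reference, a task similarity tree like the one in Figure 4 can be derived from the pairwise distances by standard agglomerative clustering; the sketch below uses SciPy and average linkage, which are our own choices since the text does not specify the exact linkage criterion.

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

def task_similarity_tree(dist, task_names):
    """Agglomerative clustering of tasks from a symmetric distance matrix.

    dist:       (T, T) symmetric matrix of pairwise model distances, e.g.
                computed with Eq. (1), with zeros on the diagonal.
    task_names: list of T task names used to label the dendrogram leaves.
    """
    condensed = squareform(dist, checks=False)   # condensed upper triangle
    tree = linkage(condensed, method="average")  # linkage choice is illustrative
    return dendrogram(tree, labels=task_names, no_plot=True)
```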
4.3 Experiments on Models beyond Taskonomy

To give a more comprehensive view of the proposed method, we also conduct experiments on the pre-trained models collected online, beyond those of taskonomy. Results are shown in Figure 5. The left two subfigures show the correlation matrix from SVCCA and the model transferability matrix produced by our method. The right two subfigures depict the task similarity trees produced by SVCCA and by the proposed method. The classification and inpainting models are listed in different colors. We have the following observations.

The proposed method produces an affinity matrix and a task similarity tree similar to those derived from SVCCA, although the collected models are heterogeneous in architectures, tasks, and input sizes. These results further validate that models producing similar attribution maps also produce highly correlated representations.

All the ImageNet-trained classification models, despite their different architectures, tend to cluster together. Furthermore, models trained for the same task with similar architectures tend to be more closely related than those with dissimilar architectures. For example, ResNet50 is more related to ResNet101 and ResNet152 than to the VGG, MobileNet and Inception models, indicating that the architecture plays a certain role in regularization for solving the tasks.
Figure 5: Results on collected models beyond taskonomy. From left to right: affinity matrix from SVCCA, affinity matrix from attribution maps, task similarity tree from SVCCA, and task similarity tree from attribution maps.

The inpainting models, albeit trained on data from different domains, also tend to cluster together. This implies that different models of the same task, even when trained on data from different domains, tend to play similar roles in transfer learning. However, more research is necessary to verify whether this observation holds in general.

We also merge the two groups into one to further evaluate the proposed method; the results are provided in the supplementary material and offer more insights on model transferability.

5 Conclusion

We introduce in this paper an embarrassingly simple yet efficacious approach towards estimating the transferability between deep models, without using any human annotations. Specifically, we project the pre-trained models of interest into a model space, wherein each model is treated as a point and the distance between two points is used to approximate their transferability. The projection into the model space is achieved by computing attribution maps on an unlabelled probe dataset. The proposed approach imposes no constraints on the architectures of the models, and turns out to be robust to the selection of the probe data. Despite its lightweight construction, it yields a transferability map highly similar to the one obtained by taskonomy yet runs several orders of magnitude faster, and may therefore serve as a compact and expeditious transferability estimator, especially when no annotations are available, when the model library is large in size, or when frequent model insertion or update takes place.

Acknowledgments

This work is supported by the National Key Research and Development Program (2016YFB1200203), the National Natural Science Foundation of China (61572428), the Key Research and Development Program of Zhejiang Province (2018C01004), and the Major Scientific Research Project of Zhejiang Lab (No. 2019KD0AC01).

References

[1] Marco B. Ancona, Enea Ceolini, Cengiz Oztireli, and Markus H. Gross. Towards better understanding of gradient-based attribution methods for deep neural networks. In International Conference on Learning Representations, 2018.
[2] H. Azizpour, A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson. Factors of transferability for a generic convnet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1790–1802, Sep. 2016.
[3] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, Wojciech Samek, and Oscar Deniz Suarez. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 2015.
[4] Kshitij Dwivedi and Gemma Roig. Representation similarity analysis for efficient task taxonomy & transfer learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[5] Yao Feng, Fan Wu, Xiaohu Shao, Yanfeng Wang, and Xi Zhou. Joint 3D face reconstruction and dense alignment with position map regression network.
In European Conference on Computer Vision, 2018.
[6] Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born-again neural networks. In International Conference on Machine Learning, pages 1602–1611, 2018.
[7] Kaiming He, Ross B. Girshick, and Piotr Dollár. Rethinking ImageNet pre-training. CoRR, abs/1811.08883, 2018.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[9] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[10] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
[11] Peiyun Hu and Deva Ramanan. Finding tiny faces. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1522–1530, 2017.
[12] Mi-Young Huh, Pulkit Agrawal, and Alexei A. Efros. What makes ImageNet good for transfer learning? CoRR, abs/1608.08614, 2016.
[13] Simon Kornblith, Jon Shlens, and Quoc V. Le. Do better ImageNet models transfer better? In Computer Vision and Pattern Recognition, 2019.
[14] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 239–248. IEEE, 2016.
[15] Xu Lan, Xiatian Zhu, and Shaogang Gong. Knowledge distillation by on-the-fly native ensemble. In Advances in Neural Information Processing Systems, pages 7527–7537, 2018.
[16] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 2014.
[17] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[18] Grégoire Montavon, Sebastian Lapuschkin, Alexander Binder, Wojciech Samek, and Klaus-Robert Müller. Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition, 65:211–222, 2017.
[19] Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actor-Mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.
[20] Ariadna Quattoni and Antonio Torralba. Recognizing indoor scenes. In Computer Vision and Pattern Recognition, 2009.
[21] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems, pages 6076–6085, 2017.
[22] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 512–519, 2014.
[23] Stephan R. Richter, Zeeshan Hayder, and Vladlen Koltun. Playing for benchmarks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[24] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio.
FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
[25] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
[26] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In International Conference on Machine Learning, 2017.
[27] Avanti Shrikumar, Peyton Greenside, Anna Shcherbina, and Anshul Kundaje. Not just a black box: Learning important features through propagating activation differences. CoRR, abs/1605.01713, 2016.
[28] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.
[29] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2015.
[30] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, 2017.
[31] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[32] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016.
[33] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition, pages 1521–1528, 2011.
[34] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, pages 3320–3328, 2014.
[35] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5505–5514, 2018.
[36] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. CoRR, abs/1612.03928, 2017.
[37] Amir R. Zamir, Alexander Sax, William Shen, Leonidas J. Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[38] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, 2014.
[39] Jian Zhou and Olga G. Troyanskaya. Predicting effects of noncoding variants with deep learning based sequence model. Nature Methods, 12:931–934, 2015.
[40] Luisa M. Zintgraf, Taco Cohen, Tameem Adel, and Max Welling. Visualizing deep neural network decisions: Prediction difference analysis. In International Conference on Learning Representations, 2017.