# Deep Model Reassembly

Xingyi Yang¹, Daquan Zhou¹,², Songhua Liu¹, Jingwen Ye¹, Xinchao Wang¹
¹National University of Singapore  ²Bytedance
{xyang,daquan.zhou,songhua.liu}@u.nus.edu, {jingweny,xinchao}@nus.edu.sg

## Abstract

In this paper, we explore a novel knowledge-transfer task, termed Deep Model Reassembly (DeRy), for general-purpose model reuse. Given a collection of heterogeneous models pre-trained from distinct sources and with diverse architectures, the goal of DeRy, as its name implies, is to first dissect each model into distinctive building blocks, and then selectively reassemble the derived blocks to produce customized networks under both hardware resource and performance constraints. The ambitious nature of DeRy inevitably imposes significant challenges, including, in the first place, the feasibility of its solution. We strive to showcase that, through a dedicated paradigm proposed in this paper, DeRy can be made not only possible but practically efficient. Specifically, we conduct the partitions of all pre-trained networks jointly via a cover set optimization, and derive a number of equivalence sets, within each of which the network blocks are treated as functionally equivalent and hence interchangeable. The equivalence sets learned in this way, in turn, enable picking and assembling blocks to customize networks subject to certain constraints, which is achieved via solving an integer program backed up with a training-free proxy to estimate the task performance. The reassembled models give rise to gratifying performances with the user-specified constraints satisfied. We demonstrate that on ImageNet, the best reassembled model achieves 78.6% top-1 accuracy without fine-tuning, which could be further elevated to 83.2% with end-to-end training. Our code is available at https://github.com/Adamdad/DeRy.

## 1 Introduction

The unprecedented advances of deep learning and its pervasive impact across various domains are partially attributed to, among many other factors, the numerous pre-trained models released online. Thanks to the generosity of our community, models of diverse architectures specializing in the same or distinct tasks can be readily downloaded and executed in a plug-and-play manner, which, in turn, largely alleviates the model-reproducing effort. The sheer number of pre-trained models also enables extensive knowledge-transfer tasks, such as knowledge distillation, in which the pre-trained models can be reused to produce lightweight or multi-task students.

In this paper, we explore a novel knowledge-transfer task, which we coin Deep Model Reassembly (DeRy). Unlike most prior tasks that largely focus on reusing pre-trained models as a whole, DeRy, as the name implies, goes deeper into the building blocks of pre-trained networks. Specifically, given a collection of such pre-trained heterogeneous models, or Model Zoo, DeRy attempts to first dissect the pre-trained models into building blocks and then reassemble the building blocks to tailor models subject to users' specifications, such as the computational constraints of the derived network. As such, apart from the flexibility for model customization, DeRy is expected to aggregate knowledge from heterogeneous models without increasing computation cost, thereby preserving or even enhancing the downstream performances.
Figure 1: Overall workflow of DeRy. It partitions pre-trained models into equivalence sets of neural blocks and then reassembles them for downstream transfer. Both steps are optimized by solving constrained programs.

Admittedly, the nature of DeRy per se makes it a highly challenging and ambitious task; in fact, it is even unclear whether a solution is feasible, given that no constraints are imposed over the model architectures in the model zoo. Besides, the reassembly process, which assumes the building blocks can be extracted in the first place, calls for a lightweight strategy to approximate the model performances without re-training, since the reassembled model, apart from the parametric constraints, is expected to behave reasonably well.

We demonstrate in this paper that, through a dedicated optimization paradigm, DeRy can be made not only possible but highly efficient. At the heart of our approach is a two-stage strategy that first partitions pre-trained networks into building blocks to form equivalence sets, and then selectively assembles building blocks to customize tailored models. Each equivalence set, specifically, comprises various building blocks extracted from heterogeneous pre-trained models, which are treated as functionally equivalent and hence interchangeable. Moreover, the optimization of the two steps is purposely decoupled, so that once the equivalence sets are obtained and fixed, they can readily serve as the basis for future network customization.

We show the overall workflow of the proposed DeRy in Figure 1. It starts by dissecting pre-trained models into disjoint sets of neural blocks through solving a cover set optimization problem, and derives a number of equivalence sets, within each of which the neural blocks are treated as functionally swappable. In the second step, DeRy searches for the optimal block-wise reassembly in a training-free manner. Specifically, the transferability of a candidate reassembly is estimated by counting the number of linear regions in feature representations [55], which reduces the search cost by $10^4$ times as compared to training all models exhaustively. The reassembled networks, apart from satisfying the user-specified hard constraints, give rise to truly encouraging results. We demonstrate through experiments that the reassembled model achieves >78% top-1 accuracy on ImageNet with all blocks frozen. If we allow for fine-tuning, the performances can be further elevated, sometimes even surpassing any pre-trained network in the model zoo. This phenomenon showcases that DeRy is indeed able to aggregate knowledge from various models and enhance the results. Besides, DeRy imposes no constraints on the network architectures in the model zoo, and may therefore readily handle various backbones such as CNNs, transformers, and MLPs.

Our contributions are thus summarized as follows.

1. We explore a new knowledge-transfer task termed Deep Model Reassembly (DeRy), which enables reassembling customized networks from a zoo of pre-trained models under user-specified constraints.
2. We introduce a novel two-stage strategy towards solving DeRy, by first partitioning the networks into equivalence sets and then reassembling neural blocks to customize networks. The two steps are modeled and solved using constrained programming, backed up with training-free performance approximations that significantly speed up the knowledge-transfer process.

3. The proposed approach achieves competitive performance on a series of transfer learning benchmarks, sometimes even surpassing any candidate in the model zoo, which, in turn, sheds light on the universal connectivity among pre-trained neural networks.

## 2 Related Work

Transfer learning from Model Zoo. A standard deep transfer learning paradigm is to leverage a single trained neural network and fine-tune the model on the target task [85, 87, 45, 35, 98, 96, 33], or impart the knowledge to other models [31, 88, 70, 90, 91, 89, 48]. The availability of large-scale model repositories brings about a new problem of transfer learning from a model zoo rather than with a single model. Currently, there are three major solutions. One line of work focuses on selecting one best model for deployment, either by exhaustive fine-tuning [39, 74, 87] or by quantifying the model transferability [94, 92, 57, 76, 4, 73, 76, 6, 41] on the target task. However, due to the unreliable measurement of transferability, the best-model selection may be inaccurate, possibly resulting in a suboptimal solution. The second idea is to apply ensemble methods [19, 99, 2, 97], which inevitably leads to prohibitive computational costs at test time. The third approach is to adaptively fuse multiple pre-trained models into a single target model. However, those methods can only combine identical [71, 18, 78] or homogeneous [72, 58] network structures, whereas most model zoos contain diverse architectures. In contrast to the standard approaches in Table 1, DeRy dissects the pre-trained models into building blocks and rearranges them in order to reassemble new pre-trained models.

| Problem | No need to retrain | Adaptive Architecture | No Additional Computation | Utilize All Models | Heterogeneous Architecture |
|---|---|---|---|---|---|
| Single Model Transfer | ✓ | ✗ | ✓ | ✗ | ✗ |
| Zoo Transfer by Selection | ✓ | ✗ | ✓ | ✗ | ✓ |
| Zoo Transfer by Ensemble | ✓ | ✗ | ✗ | ✓ | ✓ |
| Zoo Transfer by Parameter Fusion | ✓ | ✗ | ✓ | ✓ | ✗ |
| Neural Architecture Search | ✗ | ✓ | - | - | - |
| DeRy | ✓ | ✓ | ✓ | ✓ | ✓ |

Table 1: Comparison of a series of transfer learning tasks and our proposed Deep Model Reassembly.

Neural Representation Similarity. Measuring similarities between deep neural network representations provides a practical tool to investigate the forward dynamics of deep models. Let $X \in \mathbb{R}^{n \times d_1}$ and $Y \in \mathbb{R}^{n \times d_2}$ denote two activation matrices for the same $n$ examples. A neural similarity index $s(X, Y)$ is a scalar measuring the representation similarity between $X$ and $Y$, although it does not necessarily satisfy the triangle inequality required of a proper metric. Several methods have been proposed, including linear regression [86, 31], canonical correlation analysis (CCA) [65, 27, 64], centered kernel alignment (CKA) [40], and generalized shape metrics [81]. In this study, we lift the representation similarity to the function level to quantify the distance between two neural blocks.

Neural Architecture Search. Automatic neural architecture search (NAS) has achieved promising performance-efficiency trade-offs while reducing human effort.
With a pre-defined search space [75, 47, 82, 69], designing the optimal architecture is formalized as a discrete optimization problem, where the best solution can be found with reinforcement learning (RL) [100], evolutionary algorithms (EA) [66], or gradient-based search [47]. Because it is costly to measure the performance of each candidate, several surrogate methods like one-shot NAS [62, 5, 47], predictor-based NAS [46, 51, 80], and zero-shot NAS [55, 9, 1] have been proposed to accelerate the evaluation process. In this paper, we similarly formalize network reassembly as a search problem. However, compared with NAS, which searches from random initialization, DeRy searches for the optimal structural combination alongside the network weights. In addition, the search space of DeRy is not preset heuristically, but determined by the network partition results.

Network Stitching. Initially proposed by [44], model stitching aims to plug the bottom layers of one network into the top layers of another network, thus forming a stitched network [3, 16]. It provides an alternative approach to investigate the representation similarity and invariance of neural networks. A recent line of work achieves competitive performance by stitching a vision transformer on top of a ResNet [74]. Instead of stitching two identically structured networks in a bottom-top manner, in our study, we investigate assembling arbitrary pre-trained networks by model stitching.

## 3 Deep Model Reassembly

In this section, we dive into the proposed DeRy. We first formulate DeRy, and then define the functional similarity and equivalence sets of neural blocks to partition networks by maximizing the overall groupability. The resulting neural blocks are then linked by solving an integer program.

Figure 2: The top-1 accuracy difference between off-the-shelf pre-trained models on 4 downstream tasks.

| Backbone | Init. | #Params (M) | Acc (%) |
|---|---|---|---|
| ResNet50 | in1k sup | 23.71 | 84.67 |
| ResNet50 | inat2021 sup | 23.71 | 82.57 |
| ResNet50 | inat2021 (Stage 1&2) + in1k (Stage 3&4) | 23.98 | 85.30 |
| ResNet50 | in1k sup | 23.71 | 84.67 |
| Swin-T | in1k sup | 27.60 | 85.56 |
| ResNet50 (Stage 1&2) + Swin-T (Stage 3&4) | in1k sup | 27.94 | 85.77 |

Table 3: Accuracy on CIFAR-100 with the pre-trained networks and their reassembled ones.

### 3.1 Problem Formulation

Assume we have a collection of $N$ pre-trained deep neural network models $\mathcal{Z} = \{M_i\}_{i=1}^{N}$, each composed of $L_i \in \mathbb{N}$ layers of operations $\{F_i^{(l)}\}_{l=1}^{L_i}$, so that $M_i = F_i^{(1)} \circ F_i^{(2)} \circ \cdots \circ F_i^{(L_i)}$. Each model may be trained on a different task or with a different structure. We call $\mathcal{Z}$ a Model Zoo. We define a learning task $T$ composed of a labeled training set $D_{tr} = \{x_j, y_j\}_{j=1}^{M}$ and a test set $D_{ts} = \{x_j\}_{j=1}^{L}$.

Definition 1 (Deep Model Reassembly) Given a task $T$, our goal is to find the best-performing $L$-layer compositional model $M^*$ on $T$, subject to hard computational or parametric constraints. We therefore formulate it as an optimization problem

$$M^* = \arg\max_{M} P_T(M), \quad \text{s.t. } M = F_{i_1}^{(l_1)} \circ F_{i_2}^{(l_2)} \circ \cdots \circ F_{i_L}^{(l_L)}, \quad |M| \leq C \quad (1)$$

where $F_i^{(l)}$ is the $l$-th layer of the $i$-th model, $P_T(M)$ indicates the performance on $T$, and $|M| \leq C$ denotes the constraints. For two consecutive layers with a dimension mismatch, we add a single stitching layer with a 1×1 convolution operation to adjust the feature size. The stitching layer structure is described in the Supplementary.

No Single Wins For All. Figure 2 provides a preliminary experiment in which 8 different pre-trained models are fine-tuned on 4 different image classification tasks. It is clear that no single model universally dominates in transfer evaluations.
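As a concrete illustration of the stitching layer introduced in the formulation above, the sketch below shows one plausible form of such an adapter in PyTorch. The class name, the optional spatial resizing, and the bilinear interpolation are our own illustrative assumptions; the paper only specifies a 1×1 convolution and defers the exact structure to the Supplementary. An adapter of this kind is also what connects the stages in the preliminary reassembly experiments discussed around Table 3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StitchingLayer(nn.Module):
    """Illustrative 1x1-convolution adapter between two reassembled blocks.

    Hypothetical sketch: maps an (N, C_in, H, W) feature map produced by the
    preceding block to the (C_out, H_out, W_out) shape expected by the next
    block. The exact structure used by DeRy is given in the supplementary.
    """

    def __init__(self, c_in, c_out, out_size=None):
        super().__init__()
        self.proj = nn.Conv2d(c_in, c_out, kernel_size=1)  # channel adaptation
        self.out_size = out_size                            # optional (H_out, W_out)

    def forward(self, x):
        x = self.proj(x)
        if self.out_size is not None and tuple(x.shape[-2:]) != tuple(self.out_size):
            # resolution adaptation when the two blocks disagree on spatial size
            x = F.interpolate(x, size=self.out_size, mode="bilinear", align_corners=False)
        return x

# Example: grafting a block that emits 512x28x28 onto one expecting 256x14x14
stitch = StitchingLayer(c_in=512, c_out=256, out_size=(14, 14))
feat = torch.randn(2, 512, 28, 28)
print(stitch(feat).shape)  # torch.Size([2, 256, 14, 14])
```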
This observation builds up our primary motivation to reassemble trained models rather than trust the best single candidate.

Reassembly Might Win. Table 3 compares the test performance between a reassembled model and its predecessors. The bottom two stages of the ResNet50 trained on iNaturalist2021 (inat2021 sup) [77] are stitched with stages 3&4 of the ResNet50 trained on ImageNet-1k (in1k sup) to form a new model for fine-tuning on CIFAR-100. This reassembled model improves over its predecessors by 0.63%/2.73% accuracy, respectively. A similar phenomenon is observed for the model reassembled from ResNet50 in1k and Swin-T in1k. Despite its simplicity, the experiment provides concrete evidence that neural network reassembly could possibly lead to better models in knowledge transfer.

Reducing the Complexity. With $M = \sum_{i=1}^{N} L_i$ layers overall, the search space of Eq. 1 has the size of the number of $L$-permutations of $M$, $P(M, L)$, which is undesirably large. To reduce the overall search cost, we intend to partition the networks into blocks rather than adopting the layer-wise-divided setting. Moreover, it is time-consuming to evaluate each model on the target data through full fine-tuning. Therefore, we hope to accelerate the model evaluation, ideally without any model training. Based on the above discussion, the essence of DeRy lies in two steps: (1) partition the networks into blocks, and (2) reassemble the factorized neural blocks. In the following sections, we elaborate on "what is a good partition?" and "what is a good assembly?".

### 3.2 Network Partition by Functional Equivalence

A network partition [21] is a division of a neural network into disjoint sub-nets. In this study, we refer specifically to the partition of a neural network $M_i$ along depth into $K$ blocks $\{B_i^{(k)}\}_{k=1}^{K}$, so that each block is a stack of $p$ layers $B_i^{(k)} = F_i^{(l)} \circ F_i^{(l+1)} \circ \cdots \circ F_i^{(l+p)}$ and $k$ is its stage index. Inspired by the hierarchical property of deep neural networks, we aim to partition the neural networks according to their function level, for example, dividing the network into a low-level block that identifies curves and a high-level block that recognizes semantics. Although we cannot strictly differentiate "low-level" from "high-level", it is feasible to define functional equivalence.

Definition 2 (Functional Equivalence) Given two functions $B$ and $B'$ with the same input space $\mathcal{X}$ and output space $\mathcal{Y}$, let $d: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ be a metric defined on $\mathcal{Y}$. If for all inputs $x \in \mathcal{X}$ the outputs are equivalent, $d(B(x), B'(x)) = 0$, we say $B$ and $B'$ are functionally equivalent.

A function is then uniquely determined by its peers that generate the same output given the same input. However, we can no longer define functional equivalence directly among neural networks, since network blocks might have varied input-output dimensions. It is neither possible to feed the same input to intermediate blocks with different input dimensions, nor to obtain a mathematically valid metric space [13, 7] when the output dimensions are not identical. We therefore resort to recent measurements of neural representation similarity [27, 40] and define a functional similarity for neural networks. The intuition is simple: two networks are functionally similar when they produce similar outputs given similar inputs.

Definition 3 (Functional Similarity for Neural Networks) Assume we have a neural similarity index $s(\cdot, \cdot)$ and two neural networks $B: \mathcal{X} \subseteq \mathbb{R}^{n \times d_{in}} \to \mathcal{Y} \subseteq \mathbb{R}^{n \times d_{out}}$ and $B': \mathcal{X}' \subseteq \mathbb{R}^{n \times d'_{in}} \to \mathcal{Y}' \subseteq \mathbb{R}^{n \times d'_{out}}$.
For any two batches of inputs $X \in \mathcal{X}$ and $X' \in \mathcal{X}'$ with large similarity $s(X, X') > \epsilon$, the functional similarity between $B$ and $B'$ is defined as their output similarity $s(B(X), B'(X'))$.

This definition generalizes to typical knowledge distillation (KD) [31] when $d_{in} = d'_{in}$, which we elaborate in the Appendix. We also show in the Appendix that Def. 3 provides a necessary but insufficient condition for two identical networks. Using the method of Lagrange multipliers, the conditional similarity in Def. 3 can be further simplified to $S(B, B') = s(B(X), B'(X')) + s(X, X')$, which is a summation of the input and output similarities. The full derivation is given in the Appendix.

Finding the Equivalence Sets of Neural Blocks. With Def. 3, we are equipped with the mathematical tools to partition the networks into equivalence sets of blocks. Blocks in each set are expected to have high similarity, and are thus treated as functionally equivalent and hence interchangeable. In graphical terms, we represent each neural network as a path graph $G(V, E)$ [25] with two nodes of vertex degree 1 and the other $n-2$ nodes of vertex degree 2. The ultimate goal is to find the best partition of each graph into $K$ disjoint sub-graphs along the depth, such that the dissected sub-nets are concurrently grouped into $K$ functional equivalence sets, and the sub-graphs within each group have maximum internal functional similarity $S(B, B')$. In addition, we take the mild assumption that each sub-graph should have approximately similar size, $|B_i^{(k)}| < (1 + \epsilon)\frac{|M_i|}{K}$, where $|\cdot|$ indicates the model size and $\epsilon$ is a coefficient controlling the size limit for each block. We solve the above problem by posing a tri-level constrained optimization with joint clustering and partitioning

$$\max_{B_{a_j}} J\big(A, \{B_i^{(k)}\}\big) = \max_{A_{(ik,j)} \in \{0,1\}} \sum_{i,k} \sum_{j=1}^{K} A_{(ik,j)}\, S\big(B_i^{(k)}, B_{a_j}\big) \qquad \text{(Clustering)} \quad (2)$$

$$\text{s.t. } \sum_{j=1}^{K} A_{(ik,j)} = 1, \qquad \{B_i^{(k)}\}_{k=1}^{K} = \arg\max_{B_i^{(k)}} \sum_{k=1}^{K} \sum_{j=1}^{K} A_{(ik,j)}\, S\big(B_i^{(k)}, B_{a_j}\big) \qquad \text{(Partition)} \quad (3)$$

$$B_i^{(1)} \cup B_i^{(2)} \cup \cdots \cup B_i^{(K)} = M_i, \qquad B_i^{(k_1)} \cap B_j^{(k_2)} = \emptyset, \quad \forall\, k_1 \neq k_2 \quad (4)$$

$$|B_i^{(k)}| < (1 + \epsilon)\frac{|M_i|}{K}, \qquad k = 1, \dots, K \quad (5)$$

where $A \in \{0,1\}^{KN \times K}$ is the 0-1 assignment matrix, in which $A_{(ik,j)} = 1$ denotes that block $B_i^{(k)}$ belongs to the $j$-th equivalence set, and 0 otherwise. Note that each block belongs to exactly one equivalence set, so $\sum_{j=1}^{K} A_{(ik,j)} = 1$. $B_{a_j}$ is the anchor node for the $j$-th equivalence set, which has the maximum summed similarity with all blocks in set $j$. $B_i^{(k_1)} \cap B_j^{(k_2)} = \emptyset$ refers to the fact that no two blocks have overlapping nodes. The inner optimization largely resembles the conventional set cover problem [32] or the $(K, 1+\epsilon)$ graph partition problem [36] that directly partitions a graph into $K$ sets. Although graph partitioning is an NP-hard [28] problem, heuristic graph partitioning algorithms like the Kernighan-Lin (KL) algorithm [38] and the Fiduccia-Mattheyses (FM) algorithm [23] can be applied to solve our problem efficiently. In our implementation, we utilize a variant of the KL algorithm. Starting from a randomly initialized network partition $\{B^{(k)}\}_{k=1}^{K}|_{t=0}$ for $M$ at $t = 0$, we iteratively find the optimal separation by swapping nodes (network layers).
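Before turning to the swap rule that realizes this iteration, the snippet below sketches how the similarity term $S(B, B')$ driving the objective $J$ can be computed in practice. It is a minimal sketch that assumes linear CKA [40] as the similarity index $s(\cdot, \cdot)$ and activations flattened into $(n, d)$ matrices; the function names and the NumPy implementation are our own illustrative choices.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between activation matrices x (n, d1) and y (n, d2)."""
    x = x - x.mean(axis=0, keepdims=True)   # center each feature dimension
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    return cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))

def functional_similarity(block_a, block_b, x_a, x_b):
    """S(B, B') = s(B(X), B'(X')) + s(X, X') from Section 3.2.

    block_a / block_b: callables mapping a batch to (n, d) flattened activations.
    x_a / x_b: the two input batches, flattened to (n, d) matrices.
    """
    out_sim = linear_cka(block_a(x_a), block_b(x_b))  # output similarity
    in_sim = linear_cka(x_a, x_b)                     # input similarity
    return out_sim + in_sim
```

In our pipeline, such pairwise scores are precomputed offline into a similarity table (see Section 4), so the partition and reassembly stages never re-run these forward passes.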
Given two consecutive blocks $B_i^{(k)}|_t = F_i^{(l)} \circ \cdots \circ F_i^{(l+p_k)}$ and $B_i^{(k+1)}|_t = F_i^{(l+p_k+1)} \circ \cdots \circ F_i^{(l+p_k+p_{k+1})}$ at time $t$, we conduct a forward and a backward layer swap between the successive blocks, and the partition achieving the largest objective value becomes the new partition

$$\big(B_i^{(k)}|_{t+1}, B_i^{(k+1)}|_{t+1}\big) = \arg\max\Big\{J\big(B_i^{(k)}|_t, B_i^{(k+1)}|_t\big),\; J\big(B_i^{(k)}|_t^{f}, B_i^{(k+1)}|_t^{f}\big),\; J\big(B_i^{(k)}|_t^{b}, B_i^{(k+1)}|_t^{b}\big)\Big\} \quad (6)$$

where the forward swap moves the boundary layer $F_i^{(l+p_k)}$ from $B_i^{(k)}$ into $B_i^{(k+1)}$, and the backward swap moves $F_i^{(l+p_k+1)}$ from $B_i^{(k+1)}$ into $B_i^{(k)}$:

$$\big(B_i^{(k)}|_t^{f}, B_i^{(k+1)}|_t^{f}\big) = \big(F_i^{(l)} \circ \cdots \circ F_i^{(l+p_k-1)},\; F_i^{(l+p_k)} \circ \cdots \circ F_i^{(l+p_k+p_{k+1})}\big), \quad \big(B_i^{(k)}|_t^{b}, B_i^{(k+1)}|_t^{b}\big) = \big(F_i^{(l)} \circ \cdots \circ F_i^{(l+p_k+1)},\; F_i^{(l+p_k+2)} \circ \cdots \circ F_i^{(l+p_k+p_{k+1})}\big) \quad (7)$$

For the outer optimization, we perform K-Means-style [52] clustering. With the current network partition $\{B^{(k)}\}_{k=1}^{K}$, we alternate between assigning each block to an equivalence set $G_j$ and identifying the anchor block $B_{a_j} \in G_j$ within each set. It has been proved that both the KL and K-Means algorithms converge to a local minimum depending on the initial partition and anchor selection. We repeat the optimization for $R = 200$ runs with different seeds and select the best partition as our final result.

### 3.3 Network Reassembly by Solving an Integer Program

Having divided each deep network into $K$ partitions, each belonging to one of the $K$ equivalence sets, all we want now is to find the best combination of neural blocks as a new pre-trained model under certain computational constraints. Consider $K$ disjoint equivalence sets $G_1, \dots, G_K$ of blocks to be reassembled into a new deep network under a parameter constraint $C_{param}$ and a computational constraint $C_{FLOPs}$; the objective is to choose exactly one block from each group $G_j$, as well as from each network stage index $j$, such that the reassembled model achieves optimal performance on the target task without exceeding the capacity. We introduce two binary matrices $X_{(ik,j)}$ and $Y_{(ik,j)}$ to uniquely identify the reassembled model $M(X, Y)$. $X_{(ik,j)}$ takes the value 1 if and only if $B_i^{(k)}$ is chosen in group $G_j$, and $Y_{(ik,j)} = 1$ if $B_i^{(k)}$ comes from the $k$-th block. The selected blocks are arranged by their block stage indices. The problem is formulated as

$$\max_{X, Y} \; P_T\big(M(X, Y)\big) \quad (8)$$

$$\text{s.t. } |M(X, Y)| \leq C_{param}, \quad \text{FLOPs}\big(M(X, Y)\big) \leq C_{FLOPs} \quad (9)$$

$$\sum_{i}\sum_{k=1}^{K} X_{(ik,j)} = 1, \quad X_{(ik,j)} \in \{0, 1\}, \quad j = 1, \dots, K \quad (10)$$

$$\sum_{i}\sum_{j=1}^{K} Y_{(ik,j)} = 1, \quad Y_{(ik,j)} \in \{0, 1\}, \quad k = 1, \dots, K \quad (11)$$

where $P_T$ is again the task performance. Equations 10 and 11 indicate that each model possesses only a single block from each equivalence set and from each stage index. As such, the reassembled blocks are automatically ordered by their stage indices in their original models. The problem falls exactly into a 0-1 integer programming [60] problem with a non-linear objective. Conventional methods train each $M(X, Y)$ to obtain $P_T$. Instead of training each candidate till convergence, we estimate the transferability of a network by counting its linear regions as a training-free proxy.

Estimating the Performance with a Training-Free Proxy. The number of linear regions [56, 26] is a theoretically grounded tool to describe the expressivity of a neural network, and it has been successfully applied to NAS without training [55, 10]. We therefore calculate the data-dependent linear regions to estimate the transfer performance of each model-task combination. The intuition is straightforward: the network can hardly learn to distinguish inputs with similar binary codes. We apply random search to generate reassembly candidates. For a whole mini-batch of inputs, we feed them into each network and binarize the feature vectors using a sign function.
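A minimal sketch of this training-free search is given below, under our own simplifying assumptions: candidates are drawn uniformly at random with one block per equivalence set (whose index is taken to coincide with the stage index), the parameter/FLOPs constraints are abstracted into a `fits_budget` callback, and the sign-binarized activations are scored with the Hamming-distance kernel and log-determinant spelled out in the next paragraph, following NASWOT [55]. All names here are illustrative rather than those of our released code.

```python
import numpy as np

def naswot_score(binary_codes):
    """Score a candidate from the sign-binarized activations of one mini-batch.

    binary_codes: (n, d) array in {0, 1}, one row per input example.
    Builds a kernel from Hamming similarities between rows and returns its
    log-determinant, as in NASWOT-style training-free evaluation.
    """
    n, d = binary_codes.shape
    # hamming[i, j] = number of positions where codes i and j differ
    hamming = (binary_codes[:, None, :] != binary_codes[None, :, :]).sum(-1)
    kernel = (d - hamming).astype(np.float64)  # similar codes -> large entries
    sign, logdet = np.linalg.slogdet(kernel)
    return logdet if sign > 0 else -np.inf     # degenerate kernels rank last

def random_search(equiv_sets, fits_budget, get_codes, num_candidates=500, rng=None):
    """Randomly sample block combinations (one block per set/stage) and keep the best."""
    rng = rng if rng is not None else np.random.default_rng(0)
    best_score, best_candidate = -np.inf, None
    for _ in range(num_candidates):
        candidate = [blocks[rng.integers(len(blocks))] for blocks in equiv_sets]
        if not fits_budget(candidate):               # enforce C_param / C_FLOPs
            continue
        score = naswot_score(get_codes(candidate))   # forward a mini-batch, sign-binarize
        if score > best_score:
            best_score, best_candidate = score, candidate
    return best_candidate, best_score
```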
Similar to NASWOT [55], we compute the kernel matrix $K$ using the Hamming distance $d(\cdot, \cdot)$ and rank the models by $\log(\det K)$. Since the computation of $K$ requires nothing more than a few batches of network forwarding, we replace $P_T$ in Equation 8 with the NASWOT score for fast model evaluation.

## 4 Experiments

In this section, we first explore some basic properties of the proposed DeRy task, and then evaluate our solution on a series of transfer learning benchmarks to verify its efficiency.

Figure 4: FROZEN-TUNING accuracy on ImageNet by replacing the 3rd and 4th stages of R50 with target blocks.

Figure 5: Pair-wise linear CKA between pre-trained R50 and (1) R101, (2) RX50, and (3) Reg8G.

Model Zoo Setup. We construct our model zoo by collecting pre-trained weights from Torchvision¹, timm², and OpenMMLab³. We include a series of manually designed CNN models like ResNet [30] and ResNeXt [84], as well as NAS-based architectures like RegNetY [63] and MobileNetV3 [34]. Due to the recent popularity of vision transformers, we also take several well-known attention-based architectures into consideration, including Vision Transformer (ViT) [20] and Swin-Transformer [49]. In addition to varying the network structures pre-trained on ImageNet, we include models with a variety of pre-training strategies, including SimCLR [8], MoCo v2 [11], and BYOL [24] for ResNet50, and MoCo v3 [12] and MAE [29] for ViT-B. These models are pre-trained on ImageNet-1k [68], ImageNet-21k [67], X-rays [15], and iNaturalist2021 [77]. In total, we end up with 21 network architectures with 30 pre-trained weights. We manually identify the atomic nodes to satisfy our line graph assumption; each network is therefore a line graph composed of atomic nodes.

Implementation Details. For all experiments, we set the partition number $K = 4$ and the block size coefficient $\epsilon = 0.2$. We sample 1/20 of the samples from each training set to calculate the linear CKA representation similarity. The NASWOT [55] score is estimated as a 5-batch average, where each mini-batch contains 32 samples. We set 5 levels of computational constraints, with $C_{param} \in \{10, 20, 30, 50, 90\}$ and $C_{FLOPs} \in \{3, 5, 6, 10, 20\}$, denoted as DeRy($K$, $C_{param}$, $C_{FLOPs}$). For each setting, we randomly generate 500 candidates. Each reassembled model is evaluated under 2 protocols: (1) FROZEN-TUNING, where we freeze all trained blocks and only update the parameters of the stitching layers and the last linear classifier, and (2) FULL-TUNING, where all network parameters are updated. All experiments are conducted on a server with 8 GeForce RTX 3090 GPUs. To reduce the cost of the feature similarity calculation, we construct the similarity table offline on ImageNet. The complexity analysis and full derivation are shown in the Appendix.

### 4.1 Exploring the Properties of Deep Reassembly

Similarity, Position, and Reassembly-ability. Figure 4 validates our functional similarity, the reassembled block selection, and their effect on model performance. For the ResNet50 trained on ImageNet, we replace its 3rd and 4th stages with a target block from another pre-trained network (ResNet101, ResNeXt50, or RegNetY8G), connected by a single stitching layer. Then, the reassembled networks are re-trained on ImageNet for 20 epochs under the FROZEN-TUNING protocol. The functional similarity derived in Section 3.2 is shown as the diameter of each circle. We observe that the stitching position makes a substantial difference to the reassembled model performance.
When replaced with a target block of the same stage index, the reassembled model performs surprisingly well, with 70% top-1 accuracy, even though its predecessors were trained with different architectures, seeds, and hyperparameters. It is also noted that, although the functional similarity is not numerically proportional to the target performance, it correctly reflects the performance ranking within the same target network. This suggests that our functional similarity provides a reasonable criterion to identify equivalence sets. In sum, the coupling among similarity, position, and performance explains our design of selecting one block per equivalence set as well as per stage index. We also visualize the linear CKA [40] similarity between the R50 and the target networks in Figure 5. An interesting finding is the diagonal pattern of the feature similarity: the representations at the same stage are highly similar. More similarity visualizations are provided in the Appendix.

Partition Results. Due to the space limitation, the partition results of the model zoo are provided in the Appendix. Our observation is that the equivalence sets tend to cluster the blocks by stage index. For example, all bottom layers of varied pre-trained networks fall within the same equivalence set. It provides the valuable insight that neural networks learn similar patterns at similar network stages.

¹ https://pytorch.org/vision/stable/index.html
² https://github.com/rwightman/pytorch-image-models
³ https://github.com/open-mmlab

Figure 6: Plots of NASWOT [55] score and test accuracy for (Left) 10 pre-trained models on 8 downstream tasks and (Right) the timm model zoo on ImageNet. τ is the Kendall's Tau correlation.

Architecture or Pre-trained Weights? Since DeRy searches for the architecture and the weights concurrently, a natural question arises: do both the architecture and the pre-trained weights contribute to the final improvement, or does only the architecture count? We provide experiments in the Appendix showing that both factors contribute. It is observed that training the DeRy architecture from scratch leads to a substantial performance drop compared with the DeRy model that has both the new structure and the pre-trained weights. This validates our argument that our reassembled models benefit from the pre-trained models for efficient transfer learning.

Verifying the Training-Free Proxy. As the first attempt to apply NASWOT to measure model transferability, we verify its efficacy before applying it to the DeRy task. We adopt the score to rank 10 pre-trained models on 8 image classification tasks, as well as the timm model zoo on ImageNet, as shown in Figure 6. We also compute the Kendall's Tau correlation [37] between the fine-tuned accuracy and the NASWOT score. We observe that the NASWOT score provides a reasonable predictor of model transferability with a high Kendall's Tau correlation.

### 4.2 Transfer Learning with Reassembled Models

Evaluation on ImageNet-1k. We first compare the reassembled networks on ImageNet [68] with the current best-performing architectures. We train each model for either 100 epochs as SHORT-TRAINING or 300 epochs as FULL-TRAINING. Except for DeRy, all models are trained from scratch. We optimize each network with AdamW [50], with an initial learning rate of 1e-3 and cosine lr decay, a mini-batch size of 1024, and a weight decay of 0.05. We apply RandAug [17], Mixup [95], and CutMix [93] as data augmentation. All models are trained and tested at 224 image resolution.
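For reference, the optimizer and schedule above correspond to a short PyTorch sketch like the one below. It is illustrative only: the FROZEN-TUNING selection by matching parameter names containing `stitch` or `head` is our own assumption about module naming, and the RandAug/Mixup/CutMix pipeline is omitted for brevity.

```python
import torch

def build_optimizer(model, epochs, steps_per_epoch, frozen_tuning=False):
    """AdamW + cosine decay matching the ImageNet recipe described above."""
    if frozen_tuning:
        # FROZEN-TUNING: only stitching layers and the final classifier are trainable.
        for name, p in model.named_parameters():
            p.requires_grad = ("stitch" in name) or ("head" in name)  # assumed naming
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(params, lr=1e-3, weight_decay=0.05)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs * steps_per_epoch)  # cosine lr decay over training
    return optimizer, scheduler
```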
Table 7 provides the top-1 accuracy comparison on ImageNet under various computational constraints. We underline the best-performing model in the model zoo. First, it is worth noting that DeRy provides very competitive models, even under the FROZEN-TUNING or SHORT-TRAINING protocols. DeRy(4,90,20) manages to reach 78.6% with only 1.27M trainable parameters, which provides a convincing clue that heterogeneously trained models are largely graftable. With only SHORT-TRAINING, DeRy models also match the fully trained models in the zoo. For example, DeRy(4,10,3) reaches 76.9% accuracy within 100 epochs of training, surpassing all small-sized models. The performance can be further improved to 78.4% with the standard 300-epoch training. Second, DeRy brings about faster convergence. We compare with ResNet-50 and Swin-T under the same SHORT-TRAINING setting in Table 9 and Figure 10. It is clear that, by assembling the off-the-shelf pre-trained blocks, the DeRy models can be optimized faster than their competitors, achieving 0.9% and 0.2% accuracy improvements over the Swin-T model with fewer parameters and lower computational requirements. Third, as showcased in Figure 8, DeRy is able to search for diverse and hybrid network structures. DeRy(4,10,3) learns to adopt lightweight blocks like MobileNetV3, while DeRy(4,90,20) arrives at a large CNN-Swin hybrid architecture. A similar hybrid strategy has been proven efficient in manual network design [54, 83].

| Architecture | #Train/All Params (M) | FLOPs (G) | Top-1 |
|---|---|---|---|
| RSB-ResNet-18 | 11.69/11.69 | 1.82 | 70.6 |
| RegNetY-800M | 6.30/6.30 | 0.8 | 76.3 |
| ViT-T16 | 5.7/5.7 | 1.3 | 74.1 |
| DeRy(4,10,3)-FZ | 1.02/7.83 | 2.99 | 41.2 |
| DeRy(4,10,3)-FT* | 7.83/7.83 | 2.99 | 76.9 |
| DeRy(4,10,3)-FT | 7.83/7.83 | 2.99 | 78.4 |
| RSB-ResNet-50 | 25.56/25.56 | 4.12 | 79.8 |
| RegNetY-4GF | 20.60/20.60 | 4.0 | 79.4 |
| ViT-S16 | 22.0/22.0 | 4.6 | 79.6 |
| Swin-T | 28.29/28.29 | 4.36 | 81.2 |
| DeRy(4,30,6)-FZ | 1.57/24.89 | 4.47 | 60.5 |
| DeRy(4,30,6)-FT* | 24.89/24.89 | 4.47 | 79.6 |
| DeRy(4,30,6)-FT | 24.89/24.89 | 4.47 | 81.2 |
| RSB-ResNet-101 | 44.55/44.55 | 7.85 | 81.3 |
| RegNetY-8GF | 39.20/39.20 | 8.1 | 81.7 |
| Swin-S | 49.61/49.61 | 8.52 | 82.8 |
| DeRy(4,50,10)-FZ | 3.92/40.41 | 6.43 | 72.0 |
| DeRy(4,50,10)-FT* | 40.41/40.41 | 6.43 | 81.3 |
| DeRy(4,50,10)-FT | 40.41/40.41 | 6.43 | 82.3 |
| RegNetY-16GF | 83.6/83.6 | 16.0 | 82.9 |
| ViT-B16 | 86.86/86.86 | 33.03 | 79.8 |
| Swin-B | 87.77/87.77 | 15.14 | 83.1 |
| DeRy(4,90,20)-FZ | 1.27/80.66 | 13.29 | 78.6 |
| DeRy(4,90,20)-FT* | 80.66/80.66 | 13.29 | 82.4 |
| DeRy(4,90,20)-FT | 80.66/80.66 | 13.29 | 83.2 |

Table 7: Top-1 accuracy of models trained on ImageNet. * means the model is trained for 100 epochs. FZ and FT denote that the reassembled blocks are frozen or fine-tuned, respectively. Trainable parameter counts are given before the slash in the #Train/All column.

Figure 8: Reassembled structures on ImageNet.

| Architecture | Params (M) | FLOPs (G) | Top-1 | Top-5 |
|---|---|---|---|---|
| ResNet-50 | 25.56 | 4.12 | 76.8 | 93.3 |
| Swin-T | 28.29 | 4.36 | 78.3 | 94.6 |
| DeRy(30,6)-FT | 24.89 | 4.47 | 79.6 | 94.8 |
| ResNet-101 | 44.55 | 7.85 | 79.0 | 94.5 |
| Swin-S | 49.61 | 8.52 | 80.8 | 95.7 |
| DeRy(50,10)-FT | 40.41 | 6.43 | 81.2 | 95.6 |

Table 9: Top-1 and Top-5 accuracy for the ImageNet 100-epoch FULL-TUNING experiment.

Figure 10: (Left) Test accuracy and (Right) train loss comparison under the 100-epoch training on ImageNet.

Transfer Image Classification. We evaluate transfer learning performance on 9 natural image datasets. These datasets cover a wide range of image classification tasks, including 3 object classification tasks, CIFAR-10 [43], CIFAR-100 [43], and Caltech-101 [22]; 5 fine-grained classification tasks, Flower-102 [59], Stanford Cars [42], FGVC Aircraft [53], Oxford-IIIT Pets [61], and CUB-Bird [79]; and 1 texture classification task, DTD [14]. We FULL-TUNE all candidate networks in the model zoo and compare them with our DeRy model. Two model selection strategies, LogME [92] and LEEP [57], are also taken as baselines. For a fair comparison, we additionally train the reassembled network on ImageNet for 100 epochs to further boost the transfer performance. Following [8, 41], we perform hyperparameter tuning for each model-task combination, as elaborated in the Appendix.

Figure 11: Transfer performance on 9 image classification tasks with the model zoo and our DeRy (x-axis: parameters in M; legend: Model Zoo Rand Init., Model Zoo Pretrained, DeRy, DeRy+In1k, best model in Model Zoo, LEEP model, LogME model). Each blue or orange point refers to a single model trained from scratch or from pre-trained weights.

Figure 11 compares the transfer performance between our proposed DeRy and all candidate models. By constructing models from building blocks, DeRy generally surpasses all networks trained from scratch within the same computational constraints, and even beats pre-trained ones on Cars, Aircraft, and Flower. If we allow for pre-training on ImageNet (DeRy+In1k), the test accuracy is further promoted, even beyond the best-performing candidate in the original model zoo (highlighted in Figure 11). The performance improvement rises as the parameter constraint increases, which demonstrates the scalability of the proposed solution. Model selection approaches like LogME and LEEP do not necessarily pick the optimal model, thus failing to release the full potential of the model zoo. These findings provide encouraging evidence that DeRy gives rise to an alternative approach to improving performance when transferring from a zoo of models.

### 4.3 Ablation Study

To support the effectiveness of the DeRy pipeline, we further verify the influence of each design choice in our solution through ablation studies.
| Cover Set Partition | Train-Free Reassembly | Acc (%) | Search Cost (GPU days) |
|---|---|---|---|
| ✓ | ✓ | 72.0 | 0.23 |
| ✗ | ✓ | 70.5 | 1.48 |
| ✓ | ✗ | 73.5 | 135 |
| ✗ | ✗ | 72.2 | 135 |

Table 2: Ablation study on the partition and reassembly strategy.

Partition and Reassembly Strategy. First, we conduct experiments by replacing (1) the cover set partition and (2) the training-free reassembly with random partition or random search. For the partition ablation, we randomly dissect each network into $K$ partitions and reassemble the blocks in an order-less manner using our training-free proxy. For the reassembly ablation, we retain the cover set partition and fine-tune each randomly reassembled network for 100 epochs. Due to the computation limitation, we only evaluate 25 candidates for the reassembly ablation. We report the 100-epoch FROZEN-TUNING top-1 accuracy and the search time on ImageNet in Table 2 under the DeRy(4,50,10) setting. Note that we do not include the similarity computation time in our accounting, since it is computed offline. We see that the majority of the search cost comes from the fine-tuning stage. The training-free proxy largely alleviates this tremendous computational cost, reducing it by $10^4$ times with marginal performance degradation. On the other hand, the cover set model partition not only improves the transfer performance but also reduces the reassembly search space from $O\big(\prod_{i=1}^{N} \binom{L_i-1}{K-1}\big)$ to $O(1)$. Both stages are crucial.

Granularity of Partition. To see the impact of the partition number $K$, we show experimental results with different $K \in \{4, 5, 6\}$. We set the configuration to DeRy($K$, 30, 6). The reassembled network is trained under the FULL-TUNING setting on ImageNet for 100 epochs. We report the parameter size and FLOPs as well as the top-1 and top-5 accuracy. As demonstrated in Table 3, we notice that, as the partition number $K$ increases, the performance of the reassembled model remains quite stable or slightly increases. This suggests that DeRy is highly flexible with different granularities of partitioning.

| Partition Number | #Param (M) | FLOPs (G) | Top-1 | Top-5 |
|---|---|---|---|---|
| K = 4 | 24.89 | 4.47 | 79.6 | 94.8 |
| K = 5 | 21.14 | 5.53 | 79.7 | 94.9 |
| K = 6 | 23.38 | 5.39 | 79.8 | 95.0 |

Table 3: Ablation study on partition granularity with $K \in \{4, 5, 6\}$.

## 5 Conclusion

In this study, we explore a novel knowledge-transfer task called Deep Model Reassembly (DeRy). DeRy seeks to deconstruct heterogeneous pre-trained neural networks into building blocks and then reassemble them into models subject to user-defined constraints. We provide a proof-of-concept solution to show that DeRy can be made not only possible but practically efficient. Specifically, pre-trained networks are partitioned jointly via a cover set optimization to form a series of equivalence sets. The learned equivalence sets enable choosing and assembling blocks to customize networks, which is accomplished by solving an integer program with a training-free task-performance proxy. DeRy not only achieves gratifying performance on a series of transfer learning benchmarks, but also sheds light on the functional similarity between neural networks revealed by stitching heterogeneous models.

Acknowledgement. This research is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG2-RP-2021-023). Xinchao Wang is the corresponding author.

## References

[1] Mohamed S Abdelfattah, Abhinav Mehrotra, Łukasz Dudziak, and Nicholas Donald Lane. Zero-cost proxies for lightweight NAS. In International Conference on Learning Representations, 2021.

[2] Andrea Agostinelli, Jasper Uijlings, Thomas Mensink, and Vittorio Ferrari.
Transferability metrics for selecting source model ensembles. arXiv preprint arXiv:2111.13011, 2021.

[3] Yamini Bansal, Preetum Nakkiran, and Boaz Barak. Revisiting model stitching to compare neural representations. Advances in Neural Information Processing Systems, 34, 2021.

[4] Yajie Bao, Yang Li, Shao-Lun Huang, Lin Zhang, Lizhong Zheng, Amir Zamir, and Leonidas Guibas. An information-theoretic approach to transferability in task transfer learning. In 2019 IEEE International Conference on Image Processing (ICIP), pages 2309–2313. IEEE, 2019.

[5] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 550–559. PMLR, 10–15 Jul 2018.

[6] Daniel Bolya, Rohit Mittapalli, and Judy Hoffman. Scalable diverse model selection for accessible transfer learning. Advances in Neural Information Processing Systems, 34, 2021.

[7] Valerii Vladimirovich Buldygin and IU V Kozachenko. Metric characterization of random variables and random processes, volume 188. American Mathematical Soc., 2000.

[8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.

[9] W. Chen, X. Gong, and Z. Wang. Neural architecture search on ImageNet in four GPU hours: A theoretically inspired perspective, 2021.

[10] Wuyang Chen, Xinyu Gong, and Zhangyang Wang. Neural architecture search on ImageNet in four GPU hours: A theoretically inspired perspective. In International Conference on Learning Representations, 2021.

[11] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.

[12] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9640–9649, 2021.

[13] B. Choudhary. The Elements of Complex Analysis. New Age International Publishers, 1992.

[14] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.

[15] Joseph Paul Cohen, Joseph D. Viviano, Paul Bertin, Paul Morrison, Parsa Torabian, Matteo Guarrera, Matthew P Lungren, Akshay Chaudhari, Rupert Brooks, Mohammad Hashir, and Hadrien Bertrand. TorchXRayVision: A library of chest X-ray datasets and models. In Medical Imaging with Deep Learning, 2022.

[16] Adrián Csiszárik, Péter Kőrösi-Szabó, Ákos Matszangosz, Gergely Papp, and Dániel Varga. Similarity and matching of neural network representations. Advances in Neural Information Processing Systems, 34, 2021.

[17] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020.

[18] Dong Dai and Tong Zhang. Greedy model averaging. Advances in Neural Information Processing Systems, 24, 2011.

[19] Thomas G Dietterich. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pages 1–15. Springer, 2000.
[20] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[21] Tomas Feder, Pavol Hell, Sulamita Klein, and Rajeev Motwani. Complexity of graph partition problems. In Proceedings of the thirty-first annual ACM symposium on Theory of computing, pages 464–472, 1999.

[22] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, pages 178–178. IEEE, 2004.

[23] Charles M Fiduccia and Robert M Mattheyses. A linear-time heuristic for improving network partitions. In 19th design automation conference, pages 175–181. IEEE, 1982.

[24] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271–21284, 2020.

[25] Jonathan L Gross, Jay Yellen, and Mark Anderson. Graph theory and its applications. Chapman and Hall/CRC, 2018.

[26] Boris Hanin and David Rolnick. Complexity of linear regions in deep networks. In International Conference on Machine Learning, pages 2596–2604. PMLR, 2019.

[27] David R Hardoon, Sandor Szedmak, and John Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural computation, 16(12):2639–2664, 2004.

[28] Juris Hartmanis. Computers and intractability: A guide to the theory of NP-completeness (Michael R. Garey and David S. Johnson). SIAM Review, 24(1):90, 1982.

[29] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv:2111.06377, 2021.

[30] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[31] Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2(7), 2015.

[32] Dorit S Hochba. Approximation algorithms for NP-hard problems. ACM SIGACT News, 28(2):40–52, 1997.

[33] Qibin Hou, Daquan Zhou, and Jiashi Feng. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13713–13722, 2021.

[34] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1314–1324, 2019.

[35] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.

[36] George Karypis and Vipin Kumar. Multilevel k-way hypergraph partitioning. VLSI Design, 11(3):285–300, 2000.

[37] Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.

[38] Brian W Kernighan and Shen Lin. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, 49(2):291–307, 1970.
[39] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (BiT): General visual representation learning. In European conference on computer vision, pages 491–507. Springer, 2020.

[40] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International Conference on Machine Learning, pages 3519–3529. PMLR, 2019.

[41] Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2661–2671, 2019.

[42] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.

[43] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[44] Karel Lenc and Andrea Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 991–999, 2015.

[45] Xingjian Li, Haoyi Xiong, Hanchao Wang, Yuxuan Rao, Liping Liu, Zeyu Chen, and Jun Huan. Delta: Deep learning transfer using feature map with attention for convolutional networks. arXiv preprint arXiv:1901.09229, 2019.

[46] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.

[47] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2019.

[48] Huihui Liu, Yiding Yang, and Xinchao Wang. Overcoming catastrophic forgetting in graph neural networks. In Proceedings of the AAAI conference on artificial intelligence, 2021.

[49] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. International Conference on Computer Vision (ICCV), 2021.

[50] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

[51] Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural architecture optimization. In NIPS 2018. Microsoft, December 2018.

[52] James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.

[53] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013.

[54] Sachin Mehta and Mohammad Rastegari. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178, 2021.

[55] Joe Mellor, Jack Turner, Amos Storkey, and Elliot J Crowley. Neural architecture search without training. In International Conference on Machine Learning, pages 7588–7598. PMLR, 2021.

[56] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. Advances in neural information processing systems, 27, 2014.

[57] Cuong Nguyen, Tal Hassner, Matthias Seeger, and Cedric Archambeau.
LEEP: A new measure to evaluate transferability of learned representations. In International Conference on Machine Learning, pages 7294–7305. PMLR, 2020.

[58] Dang Nguyen, Khai Nguyen, Dinh Phung, Hung Bui, and Nhat Ho. Model fusion of heterogeneous neural networks via cross-layer alignment. arXiv preprint arXiv:2110.15538, 2021.

[59] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.

[60] Christos H Papadimitriou and Kenneth Steiglitz. Combinatorial optimization: Algorithms and complexity. Courier Corporation, 1998.

[61] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.

[62] Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. Efficient neural architecture search via parameters sharing. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4095–4104. PMLR, 10–15 Jul 2018.

[63] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10428–10436, 2020.

[64] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in neural information processing systems, 30, 2017.

[65] JO Ramsay, Jos ten Berge, and GPH Styan. Matrix correlation. Psychometrika, 49(3):403–423, 1984.

[66] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V. Le, and Alexey Kurakin. Large-scale evolution of image classifiers. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2902–2911. PMLR, 06–11 Aug 2017.

[67] Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. ImageNet-21k pretraining for the masses, 2021.

[68] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[69] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[70] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

[71] Yang Shu, Zhi Kou, Zhangjie Cao, Jianmin Wang, and Mingsheng Long. Zoo-tuning: Adaptive transfer from a zoo of models. In International Conference on Machine Learning, pages 9626–9637. PMLR, 2021.

[72] Sidak Pal Singh and Martin Jaggi. Model fusion via optimal transport. Advances in Neural Information Processing Systems, 33:22045–22055, 2020.

[73] Jie Song, Yixin Chen, Xinchao Wang, Chengchao Shen, and Mingli Song. Deep model transferability from attribution maps. In Advances in Neural Information Processing Systems, 2019.
[74] Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270, 2021.

[75] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

[76] Anh T Tran, Cuong V Nguyen, and Tal Hassner. Transferability and hardness of supervised classification tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1395–1405, 2019.

[77] Grant Van Horn, Elijah Cole, Sara Beery, Kimberly Wilber, Serge Belongie, and Oisin Mac Aodha. Benchmarking representation learning for natural world image collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12884–12893, 2021.

[78] Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni. Federated learning with matched averaging. arXiv preprint arXiv:2002.06440, 2020.

[79] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-201, Caltech, 2010.

[80] Wei Wen, Hanxiao Liu, Yiran Chen, Hai Li, Gabriel Bender, and Pieter-Jan Kindermans. Neural predictor for neural architecture search. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 660–676, Cham, 2020. Springer International Publishing.

[81] Alex Williams, Erin Kunz, Simon Kornblith, and Scott Linderman. Generalized shape metrics on neural representations. Advances in Neural Information Processing Systems, 34, 2021.

[82] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. FBNet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

[83] Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, and Ross Girshick. Early convolutions help transformers see better. Advances in Neural Information Processing Systems, 34:30392–30400, 2021.

[84] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017.

[85] LI Xuhong, Yves Grandvalet, and Franck Davoine. Explicit inductive bias for transfer learning with convolutional networks. In International Conference on Machine Learning, pages 2825–2834. PMLR, 2018.

[86] Daniel LK Yamins, Ha Hong, Charles F Cadieu, Ethan A Solomon, Darren Seibert, and James J DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23):8619–8624, 2014.

[87] Xingyi Yang, Xuehai He, Yuxiao Liang, Yue Yang, Shanghang Zhang, and Pengtao Xie. Transfer learning or self-supervised learning? A tale of two pretraining paradigms. arXiv preprint arXiv:2007.04234, 2020.

[88] Xingyi Yang, Jingwen Ye, and Xinchao Wang. Factorizing knowledge in neural networks. European Conference on Computer Vision, 2022.

[89] Yiding Yang, Zunlei Feng, Mingli Song, and Xinchao Wang.
Factorizable graph convolutional networks. Advances in Neural Information Processing Systems, 33:20286–20296, 2020.

[90] Yiding Yang, Jiayan Qiu, Mingli Song, Dacheng Tao, and Xinchao Wang. Distilling knowledge from graph convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

[91] Jingwen Ye, Yixin Ji, Xinchao Wang, Kairi Ou, Dapeng Tao, and Mingli Song. Student becoming the master: Knowledge amalgamation for joint scene parsing, depth estimation, and more. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2829–2838, 2019.

[92] Kaichao You, Yong Liu, Jianmin Wang, and Mingsheng Long. LogME: Practical assessment of pre-trained models for transfer learning. In International Conference on Machine Learning, pages 12133–12143. PMLR, 2021.

[93] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6023–6032, 2019.

[94] Guojun Zhang, Han Zhao, Yaoliang Yu, and Pascal Poupart. Quantifying and improving transferability in domain generalization. Advances in Neural Information Processing Systems, 34, 2021.

[95] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

[96] Daquan Zhou, Qibin Hou, Yunpeng Chen, Jiashi Feng, and Shuicheng Yan. Rethinking bottleneck structure for efficient mobile network design. In European Conference on Computer Vision, pages 680–697. Springer, 2020.

[97] Daquan Zhou, Xiaojie Jin, Xiaochen Lian, Linjie Yang, Yujing Xue, Qibin Hou, and Jiashi Feng. AutoSpace: Neural architecture search with less human interference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 337–346, 2021.

[98] Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, and Jiashi Feng. DeepViT: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886, 2021.

[99] Zhi-Hua Zhou. Ensemble learning. In Machine learning, pages 181–210. Springer, 2021.

[100] Barret Zoph and Quoc Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017.