Published as a conference paper at ICLR 2024

INITIALIZING MODELS WITH LARGER ONES

Zhiqiu Xu1, Yanjie Chen2, Kirill Vishniakov3, Yida Yin2, Zhiqiang Shen3, Trevor Darrell2, Lingjie Liu1, Zhuang Liu4
1University of Pennsylvania  2UC Berkeley  3MBZUAI  4Meta AI Research

ABSTRACT

Weight initialization plays an important role in neural network training. Widely used initialization methods are proposed and evaluated for networks that are trained from scratch. However, the growing number of pretrained models now offers new opportunities for tackling this classical problem of weight initialization. In this work, we introduce weight selection, a method for initializing smaller models by selecting a subset of weights from a pretrained larger model. This enables the transfer of knowledge from pretrained weights to smaller models. Our experiments demonstrate that weight selection can significantly enhance the performance of small models and reduce their training time. Notably, it can also be used together with knowledge distillation. Weight selection offers a new approach to leverage the power of pretrained models in resource-constrained settings, and we hope it can be a useful tool for training small models in the large-model era. Code is available at https://github.com/OscarXZQ/weight-selection.

1 INTRODUCTION

The initialization of neural network weights is crucial for their optimization. Proper initialization aids in model convergence and prevents issues like gradient vanishing. Two prominent initialization techniques, Xavier initialization (Glorot & Bengio, 2010) and Kaiming initialization (He et al., 2015), have played substantial roles in neural network training. They remain the default methods in modern deep learning libraries like PyTorch (Paszke et al., 2019).

Figure 1: Large pretrained models offer new opportunities for initializing small models.

These methods were developed for training neural networks from random initialization, which was the common practice at the time. However, the landscape has changed. A variety of pretrained models are now readily available, thanks to collective efforts from the community (Wolf et al., 2019; Wightman, 2019). These models are trained on large datasets like ImageNet-21K (Deng et al., 2009) and LAION-5B (Schuhmann et al., 2022) and are often optimized by experts. As a result, fine-tuning from these pretrained models (Kolesnikov et al., 2020; Hu et al., 2022) is usually considered a preferred option today, rather than training models from scratch.

However, these pretrained large models can be prohibitive in their resource demand, preventing their wide adoption in resource-constrained settings, e.g., on mobile devices. For many pretrained model families, even the smallest model instance can be considered extremely large in certain contexts. For example, masked autoencoders (MAE) (He et al., 2022) and CLIP (Radford et al., 2021) both provide ViT-Base (Dosovitskiy et al., 2021), an 80M-parameter architecture, as their smallest pretrained Transformer model. This is already too large for applications on edge devices, and the smallest LLaMA (Touvron et al., 2023) model is another 100 times larger still, with 7B parameters. With few small pretrained models available, developers would have to train them from scratch on target datasets to suit their needs. This approach misses the opportunity to utilize large pretrained models, whose knowledge is learned from extensive training on large data.
In this work, we tackle this issue by introducing a weight initialization method that uses large pretrained models to train small models. Specifically, we introduce weight selection, where a subset of weights from a pretrained large model is selected to initialize a smaller model. This allows knowledge learned by the large model to transfer to the small model through its weights.

Thanks to the modular design of modern neural networks, weight selection involves only three simple steps: layer selection, component mapping, and element selection. This method can be applied to any smaller model within the same model family as the large model. Using weight selection to initialize a small model is straightforward and adds no extra computational cost compared to training from scratch. It could also be useful even for large model training, e.g., initializing a LLaMA-7B with trained weights from LLaMA-30B.

We apply weight selection to train small models on image classification datasets of different scales. We observe significant improvements in accuracy across datasets and models compared with baselines. Weight selection also substantially reduces the training time required to reach the same level of accuracy. Additionally, it can work alongside another popular method for knowledge transfer from large models, knowledge distillation (Hinton et al., 2015). We believe weight selection can be a general technique for training small models. Our work also encourages further research on utilizing pretrained models for efficient deployment.

2 RELATED WORK

Weight initialization. Weight initialization is a crucial aspect of model training. Glorot & Bengio (2010) maintain constant variance across layers by setting the initial values of the weights using a normal distribution, aiming to prevent gradient vanishing or explosion. He et al. (2015) modify this scheme to adapt to ReLU activations (Nair & Hinton, 2010). Mishkin & Matas (2016) enforce orthogonality in weight matrices to keep gradients from vanishing or exploding. Saxe et al. (2014) and Vorontsov et al. (2017) put soft constraints on weight matrices to ensure orthogonality. There are also methods that use external sources of knowledge, such as the data distribution or unsupervised training, to initialize neural networks. A data-dependent initialization can be obtained by performing K-means clustering or PCA (Krähenbühl et al., 2015; Tang et al., 2017) on the training dataset. Larochelle et al. (2009), Masci et al. (2011), Trinh et al. (2019), and Gani et al. (2022) show that training on unsupervised objectives can provide a better initialization for supervised training.

Utilizing pretrained models. Transfer learning (Zhuang et al., 2020) is a common framework for using model weights pretrained on large-scale data: the model architecture is maintained and the model is fine-tuned on specific downstream tasks (Kolesnikov et al., 2020). Knowledge distillation involves training a usually smaller student model to approximate the output of a teacher model (Hinton et al., 2015; Tian et al., 2019; Beyer et al., 2022). This allows the student model to maintain the performance of the teacher while being computationally efficient. Another alternative approach for using pretrained models is weight pruning (LeCun et al., 1990; Han et al., 2015; Li et al., 2017b; Liu et al., 2019), which removes less significant weights from the model, making it more efficient without significantly compromising performance. Lin et al.
(2020) transform the parameters of a large network into those of an analogous smaller one through learnable linear layers, using knowledge distillation to match block outputs. Sanh et al. (2019) and Shleifer & Rush (2020) create smaller models by initializing with a subset of layers from a pretrained BERT (Devlin et al., 2018); this method requires the smaller model to have the same width as the teacher. Czyzewski et al. (2022) borrow weights from existing convolutional networks to initialize novel convolutional neural networks. Chen et al. (2016) and Chen et al. (2021) transform weights from smaller networks into an effective initialization for larger models to accelerate convergence. Trockman et al. (2023) initialize convolutional layers with a Gaussian distribution fit to a pretrained model's covariance. Similarly, Trockman & Kolter (2023) initialize self-attention layers according to diagonal patterns observed in pretrained ViTs. These two methods use statistics from pretrained models but do not directly utilize the pretrained parameters. The concurrent work of Xia et al. (2023) applies structured pruning to reduce existing large language models to smaller versions and uses the pruned result as an initialization. Zhong et al. (2024) explore knowledge transfer between transformer-based LLMs through parameters. Weight selection, in contrast, directly utilizes pretrained parameters, does not require extra computation, and is suitable for initializing any smaller variant of the pretrained model.

3 WEIGHT SELECTION

Given a pretrained model, our goal is to obtain an effective weight initialization for a smaller-size model within the same model family. Borrowing terminology from knowledge distillation, we refer to the pretrained model as the teacher and the model we aim to initialize as the student.

3.1 APPROACH

Modern neural network architectures often follow a modular approach: design a layer and replicate it to build the model (He et al., 2016; Vaswani et al., 2017; Tolstikhin et al., 2021; Dosovitskiy et al., 2021; Liu et al., 2022). This design ethos promotes scalability: models can be widened by increasing the embedding dimension or the number of channels in each block, and deepened by stacking more layers. It also enables us to perform weight selection in three steps: layer selection, component mapping, and element selection.

Figure 2: Weight selection. To initialize a smaller variant of a pretrained model, we uniformly select parameters from the corresponding component of the pretrained model.

Layer selection. Selecting layers from the teacher is the first step. For each layer in the student, a layer from the teacher is selected as the source for initialization. The procedure for layer selection differs slightly between isotropic architectures and hierarchical architectures. An isotropic architecture refers to a neural network in which every layer follows a consistent, uniform layerwise design throughout the model; ViT (Dosovitskiy et al., 2021) and MLP-Mixer (Tolstikhin et al., 2021) belong to this category. A hierarchical architecture is characterized by multi-scale representations and a hierarchy in embedding dimensions, typically with stages of varying scales and embedding dimensions. For example, classic convolutional networks like VGG (Simonyan & Zisserman, 2014) progressively decrease spatial dimensions while increasing channel dimensions, capturing multi-scale features.
Modern architectures like Swin Transformer (Liu et al., 2021) and ConvNeXt (Liu et al., 2022) also employ this hierarchical design. For isotropic architectures, we select the first N layers from the teacher, where N is the student's layer count; we denote this first-N selection. When dealing with hierarchical structures like ConvNeXt (Liu et al., 2022), first-N selection is applied within each individual stage. An alternative method is uniform layer selection, where evenly-spaced layers in the teacher are selected. Empirical results in Section 4.3 and Appendix B show that first-N layer selection is ideal for weight selection.

Component mapping. In the second step, we map components between the student and the teacher. The previous step gives a layer mapping from teacher to student, so the task reduces to initializing one student layer with one teacher layer. Thanks to the modular approach adopted in modern neural network design, layers in models of the same family have an identical set of components that differ only in their width. Matching the corresponding components is thus a natural one-to-one mapping.

Element selection. Upon establishing component mapping, the next step is to initialize each student component using its larger counterpart from the teacher. The default method for element selection is uniform selection, where evenly-spaced elements are selected from the teacher's tensor, as shown in Figure 2. Details on uniform selection and other element selection methods are introduced in the next part.

3.2 METHODS FOR ELEMENT SELECTION

In this part, we formalize element selection and introduce different selection criteria. Consider a weight tensor Ws from the student that we seek to initialize with the teacher's weight tensor Wt. If Wt has shape (t1, t2, ..., tn), then Ws, which is of the same component type as Wt, also spans n dimensions. Our goal is to select a subset of Wt's elements to initialize Ws. Several possible methods for element selection are discussed below; we compare their performance in Section 4.3. We find that as long as consistency is maintained (as explained in the paragraph "Random selection (with consistency)"), weight selection achieves a similar level of performance across selection methods. We propose uniform selection as the default for weight selection in practice.

Uniform selection (default). For each dimension i of Wt, select evenly-spaced si slices out of ti. For example, to initialize a linear layer Ws of shape 2×3 with a linear layer Wt of shape 4×6, we select the 1st and 3rd slices along the first dimension, and the 1st, 3rd, and 5th slices along the second dimension. We present pseudocode for uniform selection in Algorithm 1. The algorithm starts with a copy of the teacher's weight tensor Wt and iteratively performs selection on all dimensions of Wt to reach the desired shape for the student. Notably, in architectures that incorporate grouped components, such as the multi-head attention module in ViTs and the grouped convolution in ResNeXt (Xie et al., 2017), uniform selection absorbs information from all groups. For example, when applied to ViTs, uniform selection selects parameters from all heads in the attention block, which is likely to be beneficial for inheriting knowledge from the pretrained ViTs.
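To make the selection concrete, below is a minimal PyTorch sketch of uniform element selection. It mirrors the pseudocode formalized in Algorithm 1 next; the function name and the exact index formula are illustrative choices of ours, not the released implementation.

```python
import torch

def uniform_element_selection(wt: torch.Tensor, s_shape) -> torch.Tensor:
    """Pick evenly-spaced slices of the teacher tensor `wt` along every
    dimension so that the result has the student shape `s_shape`."""
    assert wt.dim() == len(s_shape), "student and teacher tensors must have the same rank"
    ws = wt.clone()
    for dim, s in enumerate(s_shape):
        t = wt.shape[dim]
        assert s <= t, "student dimension cannot exceed teacher dimension"
        if s == t:
            continue  # nothing to shrink along this dimension
        # evenly-spaced indices; e.g. t=6, s=3 -> [0, 2, 4], matching the 4x6 -> 2x3 example
        indices = torch.arange(s) * t // s
        ws = ws.index_select(dim, indices)
    return ws

# Example: initialize a 2x3 student linear weight from a 4x6 teacher weight.
teacher_w = torch.arange(24, dtype=torch.float32).reshape(4, 6)
student_w = uniform_element_selection(teacher_w, (2, 3))  # keeps rows [0, 2] and columns [0, 2, 4]
```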
Algorithm 1: Uniform element selection from the teacher's weight tensor

Input: Wt, the teacher's weight tensor
Input: s, the desired shape of the student's weight tensor
Output: Ws with shape s
 1: procedure UNIFORMELEMENTSELECTION(Wt, s)
 2:     Ws ← copy of Wt                                ▷ student's weight tensor
 3:     n ← length of Wt.shape
 4:     for i = 1 to n do
 5:         dt ← Wt.shape[i]
 6:         ds ← s[i]
 7:         indices ← select ds evenly-spaced numbers from 1 to dt
 8:         Ws ← select indices along Ws's i-th dimension
 9:     end for
10:     return Ws
11: end procedure

Consecutive selection. For each dimension i of Wt, select consecutive si slices out of ti. In contrast to uniform selection, for architectures with grouped components, consecutive selection keeps some entire groups while omitting the rest. For architectures without such grouped components, consecutive selection is equivalent to uniform selection.

Random selection (with consistency). For all weight tensors, and for each dimension i of Wt, select the same randomly-generated set of si slices out of ti. Through empirical experiments in Section 4.3, we find that consistency (selecting the same indices for all weight matrices) is key for weight selection to reach its best performance. The motivation for maintaining consistency stems from the existence of residual connections: neurons that are added together in the teacher model should have their operations preserved in the student. Furthermore, maintaining consistency preserves complete neurons during element selection, since only consistent positions are selected. It is worth noting that uniform selection and consecutive selection inherently preserve consistency; both are special instances of random selection with consistency.

Random selection (without consistency). Along every dimension i of Wt, randomly select si slices out of ti. Unlike random selection with consistency, this method does not require selecting the same indices for every weight tensor. We design this method to examine the importance of consistency.

4 EXPERIMENTS

4.1 SETTINGS

Datasets. We evaluate weight selection on 9 image classification datasets, including ImageNet-1K (Deng et al., 2009), CIFAR-10, CIFAR-100 (Krizhevsky, 2009), Flowers (Nilsback & Zisserman, 2008), Pets (Parkhi et al., 2012), STL-10 (Coates et al., 2011), Food-101 (Bossard et al., 2014), DTD (Cimpoi et al., 2014), SVHN (Netzer et al., 2011), and EuroSAT (Helber et al., 2019; 2018). These datasets vary in scale, ranging from 5K to 1.3M training images.

Models. We perform experiments on ViT-T/16 (Touvron et al., 2021a) and ConvNeXt-F (Liu et al., 2022), with ImageNet-21K pretrained ViT-S/16 and ConvNeXt-T as their respective teachers. We obtain weights for the ImageNet-21K pretrained ViT-S/16 from Steiner et al. (2021) and for the ImageNet-21K pretrained ConvNeXt-T from Liu et al. (2022). We present the detailed configurations in Table 1.

configuration           ViT-T (student)   ConvNeXt-F (student)   ViT-S (teacher)   ConvNeXt-T (teacher)
depth                   12                2 / 2 / 6 / 2          12                3 / 3 / 9 / 3
embedding dimension     192               48 / 96 / 192 / 384    384               96 / 192 / 384 / 768
number of heads         3                 -                      6                 -
number of parameters    5M                5M                     22M               28M

Table 1: Model configurations. We perform main experiments on ConvNeXt and ViT, using students that halve the embedding dimensions of their corresponding teachers.

Training. We follow the training recipe from ConvNeXt (Liu et al., 2022), with adjustments to batch size, learning rate, and stochastic depth rate (Huang et al., 2016) for different datasets. See Appendix A for details.
To ensure a fair comparison, we tune hyperparameters on the baseline (training with random initialization), and the same set of hyperparameters is used for training with weight selection.

Random initialization baseline. We use the model-specific default initialization from the timm library (Wightman, 2019), a popular computer vision library with reliable reproducibility. Its default initialization for ViT-T and ConvNeXt-F employs a truncated normal distribution with a standard deviation of 0.02 for linear and convolution layers. The truncated normal distribution, designed to clip initialization values, is widely adopted in developing modern neural networks (Liu et al., 2022).

4.2 RESULTS

Experiment results are presented in Table 2. Across all 9 image classification datasets, weight selection consistently boosts test accuracy, especially on smaller datasets. Notably, weight selection addresses the well-known challenge of training ViT on small datasets, which likely contributes to the significant accuracy improvement for ViT. Training curves for ImageNet-1K are shown in Figure 3. Both models benefit from weight selection early on and maintain this advantage throughout training.

dataset (scale ↓)   random init   weight selection   change
ImageNet-1K         73.9          75.6               +1.6
SVHN                94.9          96.5               +1.6
Food-101            79.6          86.9               +7.3
EuroSAT             97.5          98.6               +1.1
CIFAR-10            92.4          97.0               +4.6
CIFAR-100           72.3          81.4               +9.1
STL-10              61.5          83.4               +21.9
Flowers             62.4          81.9               +19.5
Pets                25.0          68.6               +43.6
DTD                 49.4          62.5               +13.1
(a) ViT-T

dataset (scale ↓)   random init   weight selection   change
ImageNet-1K         76.1          76.4               +0.3
SVHN                95.7          96.9               +1.2
Food-101            86.9          89.0               +2.1
EuroSAT             98.4          98.8               +0.4
CIFAR-10            96.6          97.4               +0.8
CIFAR-100           81.4          84.4               +3.0
STL-10              81.4          92.3               +10.9
Flowers             80.3          94.5               +14.2
Pets                72.9          87.3               +14.4
DTD                 63.7          68.8               +5.1
(b) ConvNeXt-F

Table 2: Test accuracy on image classification datasets. On all 9 datasets, employing weight selection for initialization leads to an improvement in test accuracy. Datasets are ordered by their image counts. Weight selection provides more benefits when evaluated on datasets with fewer images.

Figure 3: Training curves on ImageNet-1K for (a) ViT-T and (b) ConvNeXt-F. When initialized using weight selection from ImageNet-21K pretrained models, both ViT-T (from ViT-S) and ConvNeXt-F (from ConvNeXt-T) exhibit superior performance compared to their randomly-initialized counterparts.

4.3 COMPARISONS

Advantage over classic initialization. We compare weight selection and its variants with two widely-adopted initialization methods, Xavier initialization (Glorot & Bengio, 2010) and Kaiming initialization (He et al., 2015), and present results on CIFAR-100 in Table 3. All variants of weight selection yield considerably better results than classic initialization methods.

Comparison of selection methods. Consistency is the key for weight selection to reach its best performance. For both model architectures, uniform selection, consecutive selection, and random selection with consistency achieve a similar level of performance. Note that uniform selection and consecutive selection inherently maintain consistency. When consistency is removed, a sharp drop in performance is observed. For random selection experiments, we report the median performance over three different selection results.
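To illustrate what consistency means in practice, the sketch below slices every weight tensor with one shared, randomly-drawn index set along the embedding dimension, so complete neurons are kept and positions on the residual stream stay aligned between layers. Tensor names and shapes are illustrative, assuming PyTorch.

```python
import torch

torch.manual_seed(0)
t_dim, s_dim = 384, 192                                   # teacher / student width

# consistent: draw ONE index set and reuse it for every tensor
keep = torch.randperm(t_dim)[:s_dim].sort().values

proj_t = torch.randn(t_dim, t_dim)                        # e.g. an attention output projection
norm_t = torch.randn(t_dim)                               # e.g. a LayerNorm weight
proj_s = proj_t.index_select(0, keep).index_select(1, keep)
norm_s = norm_t.index_select(0, keep)                     # same neurons as proj_s

# Without consistency, each tensor would draw its own indices, e.g.
#   proj_s = proj_t[torch.randperm(t_dim)[:s_dim]][:, torch.randperm(t_dim)[:s_dim]]
# which breaks the correspondence between layers that the comparison above shows matters.
```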
init                                         ViT-T   ConvNeXt-F
timm default (trunc normal)                  72.3    81.4
Xavier (Glorot & Bengio, 2010)               72.1    82.8
Kaiming (He et al., 2015)                    73.0    82.5
weight selection (uniform)                   81.4    84.4
weight selection (consecutive)               81.6    84.0
weight selection (random w/ consistency)     81.7    83.9
weight selection (random w/o consistency)    77.4    82.8

Table 3: Comparison with classic initialization methods. We present CIFAR-100 test accuracy for weight selection's variants and classic initialization methods. Weight selection methods with consistency outperform classic initialization methods by a large margin.

4.4 COMPATIBILITY WITH KNOWLEDGE DISTILLATION

Weight selection transfers knowledge from pretrained models via parameters. Another popular approach for knowledge transfer is knowledge distillation (Hinton et al., 2015), which utilizes the outputs of pretrained models. Here we explore the compatibility of these two techniques.

Settings. We evaluate the performance of combining weight selection with two different approaches to knowledge distillation, logit-based distillation and feature-based distillation, by using weight selection as the initialization for the student in knowledge distillation. Logit-based distillation uses KL-divergence as the loss function for matching the student's and teacher's logits. Denoting the student's output probabilities as p_s and the teacher's output probabilities as p_t, the loss for logit-based distillation can be formulated as

L = L_class + α · KL(p_t || p_s)    (1)

where L_class is the supervised loss and α is the coefficient for the distillation loss. Note that matching logits requires the teacher to be trained on the same dataset as the student. For logit-based distillation, we train ViT-T on ImageNet-1K while using the ImageNet-1K pretrained ViT-S model from DeiT (Touvron et al., 2021a) as the teacher for both knowledge distillation and weight selection. α is set to 1.

Feature-based distillation steps in when a classification head for the target dataset is not available. Denoting the teacher's output as O_t and the student's as O_s, feature-based distillation can be formulated as

L = L_class + α · L1(O_t, MLP(O_s))    (2)

An MLP is used to project the student's output to the teacher's embedding dimension, and an L1 loss is used to match the projected student's output and the teacher's output. For feature-based distillation, we perform CIFAR-100 training experiments on ViT-T, using ImageNet-21K pretrained ViT-S as the teacher for both knowledge distillation and weight selection. We tune α on distillation-only trials and use the same value of α for trials that combine distillation and weight selection.

Results. Table 4 provides results for knowledge distillation and weight selection when applied individually or together. These results show the compatibility between weight selection and different types of knowledge distillation. Without incurring additional inference costs, employing weight selection alone produces a better result than vanilla logit-based distillation and feature-based distillation. More importantly, the combination of weight selection and knowledge distillation delivers the best results, boosting accuracy to 76.0% on ImageNet-1K and 83.9% on CIFAR-100. These results further confirm weight selection's usefulness as an independent technique and its compatibility with knowledge distillation.
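As a concrete illustration, a training step combining the two techniques might look like the sketch below: the student is first initialized with weight selection, then optimized with the supervised loss plus the logit-based distillation term of Equation 1 (α = 1 in the ImageNet-1K setting). The helper name and model handling are our own; this is not the paper's released training code.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, images, labels, alpha=1.0):
    """Supervised loss + logit-based distillation loss (Equation 1).
    `student` is assumed to already be initialized with weight selection."""
    student_logits = student(images)
    with torch.no_grad():
        teacher_logits = teacher(images)
    loss_class = F.cross_entropy(student_logits, labels)
    # KL(p_t || p_s): F.kl_div takes the student's log-probabilities as input
    # and the teacher's probabilities as target.
    loss_kd = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return loss_class + alpha * loss_kd
```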
                              ImageNet-1K (logit-based distillation)   CIFAR-100 (feature-based distillation)
setting                       test acc      change                     test acc      change
baseline                      73.9          -                          72.3          -
distill                       74.8          +0.9                       78.4          +6.4
weight selection              75.5          +1.6                       81.4          +9.1
distill + weight selection    76.0          +2.1                       83.9          +11.6

Table 4: Compatibility with knowledge distillation. Weight selection is useful as an independent technique, and can be combined with knowledge distillation to achieve the best performance.

5 ANALYSIS

In this section, we perform a comprehensive analysis of weight selection. Unless otherwise specified, the standard setting (for comparison) is initializing ViT-T with uniform selection from the ImageNet-21K pretrained ViT-S, trained and evaluated on CIFAR-100.

Reduction in training time. We find that weight selection can significantly reduce training time. We directly measure this reduction by training ViT-T with weight selection for different numbers of epochs and present the results in Figure 4a. The number of warmup epochs is modified to maintain its ratio to the total epochs for each trial. With weight selection, the same performance on CIFAR-100 can be obtained with only 1/3 of the epochs needed when training from random initialization.

Figure 4: Faster training. (a) Comparison with random initialization: ViT-T reaches the same performance on CIFAR-100 with only 1/3 of the epochs needed when training from random initialization. (b) Comparison with pretraining + finetuning: weight selection matches the accuracy obtained with 60 epochs of pretraining (on ImageNet-1K), saving 6.12x training time.

Comparison with transfer learning. We conduct experiments to find the training budget needed for pretraining to match the accuracy of weight selection. For this experiment, we train ViT-T on ImageNet-1K for different numbers of epochs and then fine-tune on CIFAR-100 for 300 epochs. As shown in Figure 4b, it takes 60 epochs of pretraining on ImageNet-1K to achieve the same performance on CIFAR-100. Under this setting, weight selection is 6.12x faster than reaching the same performance with pretraining, and it does not need access to the dataset used for pretraining.

Pretrained models. We study the effect of using different pretrained models as the weight selection teacher. Models with supervised pretraining turn out to be the best teachers. We evaluate the performance of ViT-B as teacher under different pretraining regimes: CLIP (Radford et al., 2021), MAE (He et al., 2022), and DINO (Caron et al., 2021). Table 5 presents the results. Initializing with pretrained weights consistently outperforms random initialization, and the ImageNet-21K supervised teacher provides the most effective initialization. Note that for this experiment, we use ViT-B as the teacher for weight selection, since it is the smallest model that MAE and CLIP provide.

pretrained teacher             CIFAR-10   CIFAR-100   STL-10
supervised (ImageNet-21K)      95.1       77.6        73.1
CLIP (Radford et al., 2021)    94.9       77.3        66.0
MAE (He et al., 2022)          95.9       77.2        71.0
DINO (Caron et al., 2021)      95.0       75.7        69.4

Table 5: Different pretrained models. Supervised pretraining makes the best teacher.

Layer selection.
Shleifer & Rush (2020) select evenly-spaced layers from BERT to initialize small models. We compare two layer selection methods: first-N layer selection and uniform layer selection. To evaluate them, we create ViT-A, a 6-layer model (half of the ViT-T depth) with other configurations identical to ViT-T. In this experiment, we use ViT-A and ConvNeXt-F as students, and ImageNet-21K pretrained ViT-S and ConvNeXt-T as their weight selection teachers. From the results presented in Table 6, we find that first-N layer selection performs consistently better than uniform layer selection. Presumably, since layers initialized by first-N selection are naturally contiguous and closer to input processing, they offer a more effective initialization for smaller models.

setting                     ViT-A   ConvNeXt-F
random init                 69.6    81.3
first-N layer selection     77.6    84.4
uniform layer selection     76.7    83.2

Table 6: Layer selection. First-N layer selection performs better than uniform layer selection.

Comparison with pruning. We test existing structured and unstructured pruning methods for reducing pretrained ViT-S to ViT-T. An important thing to note is that our setting differs from standard neural network pruning: structured pruning (Li et al., 2017a) typically only prunes within residual blocks for networks with residual connections, and unstructured pruning (Han et al., 2015) prunes weights by setting them to zero instead of removing them. Although these pruning methods are not designed for our setting, we can extend structured and unstructured pruning to apply here. Specifically, we adopt L1 pruning and magnitude pruning for element selection. For L1 pruning, we use the L1 norm to prune the embedding dimension. For magnitude pruning, we squeeze the selected parameters into the required shape of the student. We present results in Table 7. L1 pruning yields better results than the random initialization baseline; however, since the L1 norm inevitably breaks consistency, it cannot reach the same performance as weight selection. Magnitude pruning produces only marginally better results than random initialization, presumably due to the squeezing operation, which breaks the original structure.

setting              ViT-T   ConvNeXt-F
random init          72.3    81.4
weight selection     81.4    84.4
L1 pruning           79.5    82.8
magnitude pruning    73.8    81.9

Table 7: Comparison with pruning. L1 and magnitude pruning perform worse than weight selection.

teacher   params   test acc
ViT-S     22M      81.4
ViT-B     86M      77.6
ViT-L     307M     76.9

Table 8: Teacher's size. A smaller teacher provides a better initialization.

Teacher's size. We show that initializing from a teacher of closer size produces better results. A larger teacher means a higher percentage of parameters will be discarded, which translates to more information loss during weight selection. We present results for using ViT-S, ViT-B, and ViT-L as the weight selection teacher to initialize ViT-T in Table 8. Interestingly, even a 5M-parameter subset of the 307M parameters in ViT-L is an effective initialization, increasing accuracy by 4.5%.

Linear probing. We use linear probing to directly measure the raw model's ability as a feature extractor, which can be a good indicator of initialization quality. Linear probing assesses a pretrained model's representations by training a linear classifier on top of the fixed features extracted from the model. Following the recipe in He et al.
(2022), we apply linear probing on CIFAR-100 to evaluate ViT-T and ConvNeXt-F initialized with weight selection from their ImageNet-21K pretrained teachers, ViT-S and ConvNeXt-T respectively. We compare random initialization and weight selection in Table 9. Without any training, weight selection performs significantly better than random initialization at producing features.

setting             ViT-T   ConvNeXt-F
random init         13.5    7.1
weight selection    28.2    23.6

Table 9: Linear probing on CIFAR-100. Weight selection produces a non-trivial feature extractor.

Mimetic initialization. Mimetic initialization (Trockman & Kolter, 2023) uses the diagonal properties observed in pretrained self-attention layers' weights to initialize ViTs. We present results for mimetic initialization in Table 10. By directly utilizing pretrained parameters, weight selection outperforms mimetic initialization by a large margin. In addition, we visualize the products W_q W_k^T and W_v W_proj for the first head in the first attention block of ViT-T with random initialization, pretrained ViT-S, and ViT-T with weight selection. As shown in Figure 5, weight selection enables small models to inherit the desirable diagonal properties in their self-attention layers, which typically only exist in pretrained models.

setting             CIFAR-10   CIFAR-100   STL-10
random init         92.4       72.3        61.5
mimetic init        93.3       74.7        67.5
weight selection    97.0       81.4        83.4

Table 10: Comparison with mimetic initialization. Weight selection significantly outperforms mimetic initialization by directly utilizing pretrained parameters.

Figure 5: Visualization of self-attention layers. Visualization of W_q W_k^T (left) and W_v W_proj (right) for (a) ViT-T with random initialization, (b) pretrained ViT-S, and (c) ViT-T with weight selection. Weight selection inherits the diagonal property of self-attention layers that only exists in pretrained ViTs.

6 CONCLUSION

We propose weight selection, a novel initialization method that utilizes large pretrained models. With no extra cost, it is effective for improving the accuracy of a smaller model and reducing the training time needed to reach a certain accuracy level. We extensively analyze its properties and compare it with alternative methods. We hope our research can inspire further exploration into utilizing large pretrained models to create smaller models.

Acknowledgement. We would like to thank Haodi Zou for creating illustrative figures, and Mingjie Sun, Boya Zeng, Chen Wang, Jiatao Gu, and Maolin Mao for valuable discussions and feedback.

REFERENCES

Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. In ICLR, 2022.

Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, and Alexander Kolesnikov. Knowledge distillation: A good teacher is patient and consistent. In CVPR, 2022.

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101: Mining discriminative components with random forests. In ECCV, 2014.

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.

Cheng Chen, Yichun Yin, Lifeng Shang, Xin Jiang, Yujia Qin, Fengyu Wang, Zhi Wang, Xiao Chen, Zhiyuan Liu, and Qun Liu. bert2bert: Towards reusable pretrained language models. arXiv preprint arXiv:2110.07143, 2021.

Tianqi Chen, Ian Goodfellow, and Jonathon Shlens.
Net2Net: Accelerating learning via knowledge transfer. In ICLR, 2016.

M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In CVPR, 2014.

Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR, 2020.

Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, 2011.

Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. RandAugment: Practical automated data augmentation with a reduced search space. In CVPR Workshops, 2020.

Maciej A Czyzewski, Daniel Nowak, and Kamil Piechowiak. Breaking the architecture barrier: A method for efficient knowledge transfer across networks. arXiv preprint arXiv:2212.13970, 2022.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.

Hanan Gani, Muzammal Naseer, and Mohammad Yaqub. How to train vision transformer on small-scale datasets? In BMVC, 2022.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.

Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NeurIPS, 2015.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.

Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Introducing EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. In IGARSS, 2018.

Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.

Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In ECCV, 2016.

Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big Transfer (BiT): General visual representation learning. In ECCV, 2020.

Philipp Krähenbühl, Carl Doersch, Jeff Donahue, and Trevor Darrell. Data-dependent initializations of convolutional neural networks. In ICLR, 2015.

Alex Krizhevsky.
Learning multiple layers of features from tiny images. Technical report, 2009.

Hugo Larochelle, Yoshua Bengio, Jérôme Louradour, and Pascal Lamblin. Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 2009.

Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In NeurIPS, 1990.

Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In ICLR, 2017a.

Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, and Yichen Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017b.

Ye Lin, Yanyang Li, Ziyang Wang, Bei Li, Quan Du, Tong Xiao, and Jingbo Zhu. Weight distillation: Transferring the knowledge in neural network parameters. arXiv preprint arXiv:2009.09152, 2020.

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.

Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In ICLR, 2019.

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In CVPR, 2022.

Zhuang Liu, Zhiqiu Xu, Joseph Jin, Zhiqiang Shen, and Trevor Darrell. Dropout reduces underfitting. In ICML, 2023.

Jonathan Masci, Ueli Meier, Dan Cireşan, and Jürgen Schmidhuber. Stacked convolutional autoencoders for hierarchical feature extraction. In Artificial Neural Networks and Machine Learning (ICANN), 2011.

Dmytro Mishkin and Jiri Matas. All you need is a good init. In ICLR, 2016.

Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS, 2011.

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. 2008.

Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In CVPR, 2012.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR, 2014.

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.

Sam Shleifer and Alexander M Rush. Pre-trained summarization distillation. arXiv preprint arXiv:2010.13002, 2020.

Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NeurIPS, 2014.
Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270, 2021.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In CVPR, 2016.

Jing Lei Tang, Dong Wang, Zhi Guang Zhang, Li Jun He, Jing Xin, and Yang Xu. Weed identification based on k-means feature learning combined with convolutional neural network. Computers and Electronics in Agriculture, 2017.

Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. arXiv preprint arXiv:1910.10699, 2019.

Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. MLP-Mixer: An all-MLP architecture for vision. In NeurIPS, 2021.

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, 2021a.

Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In ICCV, 2021b.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Trieu H Trinh, Minh-Thang Luong, and Quoc V Le. Selfie: Self-supervised pretraining for image embedding. arXiv preprint arXiv:1906.02940, 2019.

Asher Trockman and J Zico Kolter. Mimetic initialization of self-attention layers. In ICML, 2023.

Asher Trockman, Devin Willmott, and J Zico Kolter. Understanding the covariance structure of convolutional filters. In ICLR, 2023.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.

Eugene Vorontsov, Chiheb Trabelsi, Samuel Kadoury, and Chris Pal. On orthogonality and learning recurrent networks with long term dependencies. In ICML, 2017.

Ross Wightman. GitHub repository: PyTorch image models. https://github.com/rwightman/pytorch-image-models, 2019.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.

Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared LLaMA: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023.

Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.

Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In ICCV, 2019.

Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.

Ming Zhong, Chenxin An, Weizhu Chen, Jiawei Han, and Pengcheng He. Seeking neural nuggets: Knowledge transfer in large language models from a parametric perspective. In ICLR, 2024.
Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In AAAI, 2020.

Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning. Proceedings of the IEEE, 2020.

A TRAINING SETTINGS

Training recipe. We provide our training recipe and configurations in Table 11. The recipe is adapted from ConvNeXt (Liu et al., 2022).

training setting                                              configuration
optimizer                                                     AdamW
base learning rate                                            4e-3
weight decay                                                  0.05
optimizer momentum                                            β1, β2 = 0.9, 0.999
batch size                                                    4096
training epochs                                               300
learning rate schedule                                        cosine decay
warmup epochs                                                 50
warmup schedule                                               linear
layer-wise lr decay (Clark et al., 2020; Bao et al., 2022)    None
randaugment (Cubuk et al., 2020)                              (9, 0.5)
mixup (Zhang et al., 2018)                                    0.8
cutmix (Yun et al., 2019)                                     1.0
random erasing (Zhong et al., 2020)                           0.25
label smoothing (Szegedy et al., 2016)                        0.1
layer scale (Touvron et al., 2021b)                           1e-6
head init scale (Touvron et al., 2021b)                       None
gradient clip                                                 None

Table 11: Our basic recipe.

Hyper-parameters. Table 12 and Table 13 record the batch size, warmup epochs, and training epochs of ConvNeXt-F and ViT-T, respectively, for each dataset. The batch size of each dataset is chosen proportionally to its total size. The warmup epochs are set to around one-fifth of the total training epochs. Base learning rates for ConvNeXt-F and ViT-T are 4e-3 and 2e-3, respectively.

                  C-10   C-100   Pets   Flowers   STL-10   Food-101   DTD   SVHN   EuroSAT   IN-1K
batch size        1024   1024    128    128       128      1024       128   1024   512       4096
warmup epochs     50     50      100    100       50       50         100   10     50        50
training epochs   300    300     600    600       300      300        600   50     300       300
drop path rate    0.1    0.1     0.1    0.1       0        0.1        0.2   0.1    0.1       0

Table 12: Hyper-parameter setting for ConvNeXt-F.

                  C-10   C-100   Pets   Flowers   STL-10   Food-101   DTD   SVHN   EuroSAT   IN-1K
batch size        512    512     512    512       512      512        512   512    512       4096
warmup epochs     50     50      100    100       50       50         100   10     50        50
training epochs   300    300     600    600       300      300        600   50     300       300

Table 13: Hyper-parameter setting for ViT-T.

B ADDITIONAL ANALYSIS

Layer selection. To further confirm the superiority of first-N layer selection, we conduct experiments that rule out the effect of element selection. Specifically, we evaluate the results of directly fine-tuning different subsets of 6 layers out of the 12 layers of a pretrained ViT-T model. We present the results in Table 14. There is a clear trend favoring shallow layers: first-N layers are better than mid-N layers, and both are better than last-N layers. Uniform layer selection is slightly worse than first-N layer selection. Based on this empirical result, we find first-N layer selection a suitable method for weight selection.

setting                     CIFAR-100 test acc
first-N layer selection     81.6
mid-N layer selection       68.3
last-N layer selection      62.0
uniform layer selection     76.3

Table 14: Layer selection. First-N layer selection performs significantly better than uniform layer selection when ruling out the effect of element selection.

We also find that experiments going from ViT-L (which has 24 layers) to ViT-S show a different preference over layer selection: under this setting, uniform layer selection achieves slightly better performance than first-N layer selection.
Our main approach still adopts first-N layer selection as the default method, because in the case of a smaller teacher (which is favored in weight selection), first-N layer selection consistently outperforms its alternatives.

setting                     CIFAR-100 test acc
first-N layer selection     76.9
mid-N layer selection       75.9
last-N layer selection      77.1
uniform layer selection     77.5

Table 15: Layer selection (ViT-L as teacher). Uniform layer selection yields slightly better results than first-N layer selection when the student's size ratio to the teacher is small.

Weight components. We conduct ablation studies on ViT-T to understand the influence of distinct model components on performance. In particular, we evaluate the performance of weight selection when one of the following types of layers is excluded: patch embedding, position embedding, attention block, normalization layer, or MLP layer. As illustrated in Table 16, excluding any component from initialization leads to substantial drops in accuracy on all datasets. The results confirm that initializing with all components from pretrained models is necessary.

setting              CIFAR-10   CIFAR-100   STL-10
random init          92.4       72.3        61.5
weight selection     97.0       81.4        83.4
w/o patch embed      96.8       79.5        77.1
w/o pos embed        95.6       78.4        80.2
w/o attention        96.2       77.3        80.5
w/o normalization    96.2       79.0        79.8
w/o mlp              95.6       78.8        74.2

Table 16: ViT component ablation. Using all components from pretrained models is the best.

Longer training on ImageNet-1K. To assess whether weight selection remains beneficial for extended training durations, we use the improved training recipe from Liu et al. (2023). Specifically, the total epochs are extended to 600, and mixup / cutmix are reduced to 0.3. The results, displayed in Table 17, affirm that our method continues to provide an advantage even under extended training durations. Both ViT-T and ConvNeXt-F, when initialized using weight selection, consistently surpass models with random initialization. This confirms that weight selection does not compromise the model's capacity to benefit from longer training.

                                      ViT-T                ConvNeXt-F
setting                               test acc   change    test acc   change
random init                           73.9       -         76.1       -
weight selection                      75.5       +1.6      76.4       +0.3
random init (longer training)         76.3       -         77.5       -
weight selection (longer training)    77.4       +1.1      77.7       +0.2

Table 17: Longer training. Weight selection's improvement is robust under a stronger recipe.

C RESULTS ON MORE ARCHITECTURES

We present results on additional architectures, namely ResNet (He et al., 2016) and MLP-Mixer (Tolstikhin et al., 2021). Model configurations are provided in Table 18 and the detailed results on image classification datasets are presented in Table 19. Notably, MLP-Mixer enjoys more benefits from weight selection. We speculate that this is due to the different inductive biases of the architectures.

configuration           Mixer-b16-half (student)   ResNet-18 (student)    Mixer-b16 (teacher)   ResNet-34 (teacher)
depth                   12                         2 / 2 / 2 / 2          12                    3 / 4 / 6 / 3
embedding dimension     384                        64 / 128 / 256 / 512   768                   64 / 128 / 256 / 512
number of parameters    15.8M                      11M                    59.9M                 63.5M

Table 18: Model configurations. We perform experiments on MLP-Mixer and ResNet, and use students that halve the embedding dimensions of their corresponding teachers.
dataset (scale ↓)   random init   weight selection   change
SVHN                95.9          96.2               +0.3
Food-101            83.8          85.9               +2.1
EuroSAT             97.5          98.6               +1.1
CIFAR-10            96.4          97.1               +0.7
CIFAR-100           80.3          82.5               +2.2
STL-10              84.3          94.1               +9.8
Flowers             78.1          95.8               +17.7
Pets                78.2          87.8               +9.6
DTD                 49.5          64.6               +15.1
(a) ResNet

dataset (scale ↓)   random init   weight selection   change
SVHN                87.9          93.3               +5.4
Food-101            77.5          83.7               +6.2
EuroSAT             98.4          98.8               +0.4
CIFAR-10            93.5          96.1               +2.6
CIFAR-100           73.4          79.8               +6.4
STL-10              73.3          82.0               +8.7
Flowers             69.1          83.2               +14.1
Pets                50.4          69.6               +19.2
DTD                 42.9          49.3               +6.4
(b) MLP-Mixer

Table 19: Test accuracy on image classification datasets. Weight selection shows consistent improvements on other architectures across all 9 image classification datasets.