# On Warm-Starting Neural Network Training

Jordan T. Ash, Microsoft Research NYC, ash.jordan@microsoft.com
Ryan P. Adams, Princeton University, rpa@princeton.edu

Abstract

In many real-world deployments of machine learning systems, data arrive piecemeal. These learning scenarios may be passive, where data arrive incrementally due to structural properties of the problem (e.g., daily financial data), or active, where samples are selected according to a measure of their quality (e.g., experimental design). In both of these cases, we are building a sequence of models that incorporate an increasing amount of data. We would like each of these models in the sequence to be performant and to take advantage of all the data that are available to that point. Conventional intuition suggests that when solving a sequence of related optimization problems of this form, it should be possible to initialize using the solution of the previous iterate to warm-start the optimization rather than initialize from scratch, and thereby see reductions in wall-clock time. However, in practice this warm-starting seems to yield poorer generalization performance than models that have fresh random initializations, even though the final training losses are similar. While it appears that some hyperparameter settings allow a practitioner to close this generalization gap, they seem to do so only in regimes that damage the wall-clock gains of the warm start. Nevertheless, it is highly desirable to be able to warm-start neural network training, as it would dramatically reduce the resource usage associated with the construction of performant deep learning systems. In this work, we take a closer look at this empirical phenomenon and try to understand when and how it occurs. We also provide a surprisingly simple trick that overcomes this pathology in several important situations, and present experiments that elucidate some of its properties.

1 Introduction

Although machine learning research generally assumes a fixed set of training data, real life is more complicated. One common scenario is a production ML system that must be constantly updated with new data. This situation occurs in finance, online advertising, recommendation systems, fraud detection, and many other domains where machine learning systems are used for prediction and decision making in the real world [1-3]. When new data arrive, the model needs to be updated so that it can be as accurate as possible and account for any domain shift that is occurring.

As a concrete example, consider a large-scale social media website, to which users are constantly uploading images and text. The company requires up-to-the-minute predictive models in order to recommend content, filter out inappropriate media, and select advertisements. There might be millions of new data arriving every day, which need to be rapidly incorporated into production ML pipelines. It is natural in this scenario to imagine maintaining a single model that is updated with the latest data at a regular cadence. Every day, for example, new training might be performed on the model with the updated, larger dataset. Ideally, this new training procedure is initialized from the parameters of yesterday's model, i.e., it is warm-started from those parameters rather than given a fresh initialization. Such an initialization makes intuitive sense: the data used yesterday are mostly the same as the data today, and it seems wasteful to throw away all previous computation.
For convex optimization problems, warm starting is widely used and highly successful (e.g., [1]), and the theoretical properties of online learning are well understood.

Figure 1: A comparison between ResNets trained using a warm start and a random initialization on CIFAR-10. Blue lines are models trained on 50% of CIFAR-10 for 350 epochs and then trained on 100% of the data for a further 350 epochs. Orange lines are models trained on 100% of the data from the start. The two procedures produce similar training performance but differing test performance.

However, warm-starting seems to hurt generalization in deep neural networks. This is particularly troubling because warm-starting does not damage training accuracy. Figure 1 illustrates this phenomenon. Three 18-layer ResNets have been trained on the CIFAR-10 natural image classification task to create these figures. One was trained on 100% of the data, one was trained on 50% of the data, and a third warm-started model was trained on 100% of the data but initialized from the parameters found by the 50%-trained model. All three achieve the upper bound on training accuracy. However, the warm-started network performs worse on test samples than the network trained on the same data but with a new random initialization. Problematically, this phenomenon incentivizes performance-focused researchers and engineers to constantly retrain models from scratch, at potentially enormous financial and environmental cost [4]. This is an example of Red AI [5], disregarding resource consumption in pursuit of raw predictive performance.

The warm-start phenomenon has implications for other situations as well. In active learning, for example, unlabeled samples are abundant but labels are expensive: the goal is to identify maximally informative data to have labeled by an oracle and integrated into the training set. It would be time-efficient to simply warm-start optimization each time new samples are appended to the training set, but such an approach seems to damage generalization in deep neural networks. Although this phenomenon has not received much direct attention from the research community, it seems to be common practice in deep active learning to retrain from scratch after every query step [6, 7]; popular deep active learning repositories on GitHub randomly reinitialize models after every selection [8, 9].

The ineffectiveness of warm-starting has been observed anecdotally in the community, but this paper seeks to examine its properties closely in controlled settings. Note that the findings in this paper are not inconsistent with extensive work on unsupervised pre-training [10, 11] and transfer learning in the small-data and few-shot regimes [12-15]. Rather, here we are examining how to accelerate training in the large-data supervised setting in a way consistent with expectations from convex problems.

This article is structured as follows. Section 2 examines the generalization gap induced by warm-starting neural networks. Section 3 surveys approaches for improving generalization in deep learning, and shows that these techniques do not resolve the problem. In Section 4, we describe a simple trick that overcomes this pathology, and report on experiments that give insights into its behavior in batch online learning and pre-training scenarios. We defer our discussion of related work to Section 5, and include a statement on broader impacts in Section 6.
2 Warm Starting Damages Generalization

In this section we provide empirical evidence that warm starting consistently damages generalization performance in neural networks. We conduct a series of experiments across several different architectures, optimizers, and image datasets. Our goal is to create simple, reproducible settings in which the warm-starting phenomenon is observed.

2.1 Basic Batch Updating

Here we consider the simplest case of warm-starting, in which a single training dataset is partitioned into two subsets that are presented sequentially. In each series of experiments, we randomly segment the training data into two equally-sized portions. The model is trained to convergence on the first half, then is trained on the union of the two batches, i.e., 100% of the data. This is repeated for three classifiers: ResNet-18 [16], a multilayer perceptron (MLP) with three layers and tanh activations, and logistic regression. Models are optimized using either stochastic gradient descent (SGD) or the Adam variant of SGD [17], and are fitted to the CIFAR-10, CIFAR-100, and SVHN image data. All models are trained using a mini-batch size of 128 and a learning rate of 0.001, the smallest learning rate used in the learning schedule for fitting state-of-the-art ResNet models [16]. The effect of these parameters is investigated in Section 3. Presented results are on a held-out, randomly-chosen third of the available data.

| Dataset | Initialization | ResNet (SGD) | ResNet (Adam) | MLP (SGD) | MLP (Adam) | LR (SGD) | LR (Adam) |
|---|---|---|---|---|---|---|---|
| CIFAR-10 | Random init | 56.2 (1.0) | 78.0 (0.6) | 39.0 (0.2) | 39.4 (0.1) | 40.5 (0.6) | 33.8 (0.6) |
| CIFAR-10 | Warm start | 51.7 (0.9) | 74.4 (0.9) | 37.4 (0.2) | 36.1 (0.3) | 39.6 (0.2) | 33.3 (0.2) |
| SVHN | Random init | 89.4 (0.1) | 93.6 (0.2) | 76.5 (0.3) | 76.7 (0.4) | 28.0 (0.2) | 22.4 (1.3) |
| SVHN | Warm start | 87.5 (0.7) | 93.5 (0.4) | 75.4 (0.1) | 69.4 (0.6) | 28.0 (0.3) | 22.2 (0.9) |
| CIFAR-100 | Random init | 18.2 (0.3) | 41.4 (0.2) | 10.3 (0.2) | 11.6 (0.2) | 16.9 (0.18) | 10.2 (0.4) |
| CIFAR-100 | Warm start | 15.5 (0.3) | 35.0 (1.2) | 9.4 (0.0) | 9.9 (0.1) | 16.3 (0.28) | 9.9 (0.3) |

Table 1: Validation percent accuracies of warm-started and randomly-initialized models for the indicated optimizers, models, and datasets. We consider an 18-layer ResNet, a three-layer multilayer perceptron (MLP), and logistic regression (LR).

Figure 2: An online learning experiment for CIFAR-10 data using a ResNet. The horizontal axis shows the total number of samples in the training set available to the learner. The generalization gap between warm-started and randomly-initialized models is significant.

Our results (Table 1) indicate that generalization performance is damaged consistently and significantly for both ResNets and MLPs. This effect is more dramatic for CIFAR-10, which is considered relatively challenging to model (requiring, e.g., data augmentation), than for SVHN, which is considered easier. Logistic regression, which enjoys a convex loss surface, is not significantly damaged by warm starting on any of the datasets. Figure 10 in the Appendix extends these results and shows that the gap is inversely proportional to the fraction of data available in the first round of training.

This result is surprising. Even though MLP and ResNet optimization is non-convex, conventional intuition suggests that the warm-started solution should be close to the full-data solution and therefore a good initialization. One view on pre-training is that the initialization is a prior on weights; we often view prior distributions as arising from inference on old (or hypothetical) data, and so this sort of pre-training should always be helpful.
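For concreteness, a minimal sketch of the two-phase protocol above, assuming PyTorch and two user-supplied callables (`model_fn` builds a fresh classifier, `train_fn` trains it to convergence on a loader); these names are illustrative placeholders rather than the authors' code.

```python
import torch
from torch.utils.data import DataLoader, Subset

def two_phase_experiment(model_fn, dataset, train_fn, seed=0):
    """Warm-start protocol: converge on a random 50% split, then keep training
    on 100% of the data; compare against a randomly-initialized model on 100%."""
    g = torch.Generator().manual_seed(seed)
    idx = torch.randperm(len(dataset), generator=g).tolist()
    half = Subset(dataset, idx[: len(dataset) // 2])

    loader_half = DataLoader(half, batch_size=128, shuffle=True)
    loader_full = DataLoader(dataset, batch_size=128, shuffle=True)

    warm = train_fn(model_fn(), loader_half)   # phase 1: 50% of the data
    warm = train_fn(warm, loader_full)         # phase 2: warm-started on 100%

    fresh = train_fn(model_fn(), loader_full)  # baseline: random init, 100%
    return warm, fresh
```

In the experiments above, `train_fn` would correspond to SGD or Adam with a learning rate of 0.001 and mini-batches of size 128, with a held-out third of the data reserved for validation.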
The generalization gap shown here creates a computational burden for real-life machine learning systems, which must be retrained from scratch to perform well rather than initialized from previous models. First-round results for Table 1 are in Appendix Table 2.

2.2 Online Learning

A common real-world setting involves data that are provided to the machine learning system in a stream. At every step, the learner is given k new samples to append to its training data, and it updates its hypothesis to reflect the larger dataset. Financial data, social media data, and recommendation systems are common examples of scenarios where new samples are constantly arriving. This paradigm is simulated in Figure 2, where we supply CIFAR-10 data, selected randomly without replacement, in batches of 1,000 to an 18-layer ResNet. We examine two cases: 1) the model is retrained from scratch after each batch, starting from a random initialization, and 2) the model is trained to convergence starting from the parameters learned in the previous iteration. In both cases, the models are optimized with Adam, using an initial learning rate of 0.001. Each was run five times with different random seeds and validation sets composed of a random third of the available data, reinitializing Adam's parameters at each step of learning.

Figure 2 shows the trade-off between these two approaches. On the right are the training times: clearly, starting from the previous model is preferable and has the potential to vastly reduce computational costs and wall-clock time. However, as can be seen on the left, generalization performance is worse in the warm-started situation. As more data arrive, the gap in validation accuracy increases substantially. Means and standard deviations across five runs are shown. Although this work focuses on image data, we find consistent results with other dataset and architecture choices (Appendix Figure 13).

3 Conventional Approaches

The design space for initializing and training deep neural network models is very large, so it is important to evaluate whether there is some known method that could be used to help warm-started training find good solutions. Put another way, a reasonable response to this problem is "Did you see whether X helped?", where X might be anything from batch normalization [18] to increasing mini-batch size [19]. This section tries to answer some of these questions and further empirically probe the warm-start phenomenon. Unless otherwise stated, experiments in this section use a ResNet-18 model trained using SGD with a learning rate of 0.001 on CIFAR-10 data. All experiments were run five times to report means and standard deviations. No experiments in this paper use data augmentation or learning rate schedules, and all validation sets are a randomly-chosen third of the training data.

Figure 3: A comparison between ResNets trained from both a warm start and a random initialization on CIFAR-10 for various hyperparameters. Orange dots are randomly-initialized models and blue dots are warm-started models. Warm-started models that perform roughly as well as randomly-initialized models offer no benefit in terms of training time.

3.1 Is this an effect of batch size or learning rate?

One might reasonably ask whether there exist any hyperparameters that close the generalization gap between warm-started and randomly-initialized models.
In particular, can setting a larger learning rate at either the first or second round of learning help the model escape to regions that generalize better? Can shrinking the batch size inject stochasticity that might improve generalization [20, 21]?

Figure 4: Left: Validation accuracy as training progresses on 50% of CIFAR-10. Right: Validation accuracy damage, as percentage difference from random initialization, after training on 100% of the data. Each warm-started model was initialized by training on 50% of the CIFAR data for the indicated number of epochs.

Here we again consider a warm-started experiment of training on 50% of CIFAR-10 until convergence, then training on 100% of CIFAR-10 using the initial round of training as an initialization. We explore all combinations of batch sizes {16, 32, 64, 128} and learning rates {0.001, 0.01, 0.1}, varying them across the two rounds of training. This allows for the possibility that there exist different hyperparameters for the first stage of training that are better when used with a different set after warm-starting. Each combination is run with three random initializations.

Figure 3 visualizes these results. Every resulting 100% model is shown from all three initializations and all combinations, with color indicating whether it was a random initialization or a warm start. The horizontal axis shows the time to completion, excluding the pre-training time, and the vertical axis shows the resulting validation performance. Interestingly, we do find warm-started models that perform as well as randomly-initialized models, but they are unable to do so while benefiting from their warm-started initialization. The training time for warm-started ResNet models that generalize as well as randomly-initialized models is roughly the same as that of the randomly-initialized models. That is, there is no computational benefit to using these warm-started initializations. It is worth noting that this plot does not capture the time or energy required to identify hyperparameters that close the generalization gap; such hyperparameter searches are often the culprit in the resource footprint of deep learning [5]. Wall-clock time is measured by assigning every model identical resources, consisting of 50GB of RAM and an NVIDIA Tesla P100 GPU.

This increased fitting time occurs because warm-started models, when using hyperparameters that generalize relatively well, seem to forget what was learned in the first round of training. Appendix Figure 11 provides evidence of this phenomenon by computing the Pearson correlation between the weights of converged warm-started models and their initialization weights, again across various choices of learning rate and batch size, and comparing it to validation accuracy. Models that generalize well have little correlation with their initialization (there is a downward trend in accuracy with increasing correlation), suggesting that they have forgotten what was learned in the first round of training. Conversely, a similar plot for logistic regression shows no such relationship.

3.2 How quickly is generalization damaged?

One surprising result in our investigation is that only a small amount of training is necessary to damage the validation performance of the warm-started model. Our hope was that warm-starting success might be achieved by switching from the 50% phase to the 100% phase before the first phase of training was completed. We fit a ResNet-18 model on 50% of the training data, as before, and checkpointed its parameters every five epochs.
We then took each of these checkpointed models and used them as an initialization for training on 100% of the data. As shown in Figure 4, generalization is damaged even when initializing from parameters obtained by training on incomplete data for only a few epochs.

Figure 5: A two-phase experiment like those in Sections 2 and 3, where a ResNet is trained on 50% of CIFAR-10 and is then given the remainder in the second round of training. Here we examine the average gradient norms corresponding separately to the initial 50% of the data and the second 50%, for models that are either warm-started or initialized with the shrink and perturb (SP) trick. Notice that in warm-started models, there is a drastic gap between these gradient norms. Our proposed trick balances these respective magnitudes while still allowing models to benefit from their first round of training; i.e., they fit the training data much more quickly than random initializations.

3.3 Is regularization helpful?

Figure 6: We fit a ResNet and an MLP (with and without bias nodes) to CIFAR-10 and measure performance as a function of the shrinkage parameter λ.

A common approach for improving generalization is to include a regularization penalty. Here we investigate three different approaches to regularization: 1) basic L2 weight penalties [22], 2) confidence-penalized training [23], and 3) adversarial training [24]. We again take a ResNet fitted to 50% of the available training data and use its parameters to warm-start learning on 100% of the data. We apply regularization in both rounds of training, and while it is helpful, regularization does not resolve the generalization gap induced by warm starting. Appendix Table 3 shows the result of these experiments for the indicated regularization penalty sizes. Our experiments show that applying the same amount of regularization to randomly-initialized models still produces a better-generalizing classifier.

4 Shrink, Perturb, Repeat

While the conventional approaches presented above do not remedy the warm-start problem, we have identified a remarkably simple trick that efficiently closes the generalization gap. At each round of training t, when new samples are appended to the training set, we propose initializing the network's parameters by shrinking the weights found in the previous round of optimization towards zero, then adding a small amount of parameter noise. Specifically, we initialize each learnable parameter $\theta_i^t$ at training round $t$ as $\theta_i^t \leftarrow \lambda \theta_i^{t-1} + p^t$, where $p^t \sim \mathcal{N}(0, \sigma^2)$ and $0 < \lambda < 1$.

Shrinking weights preserves hypotheses. For network layers that use ReLU nonlinearities, shrinking parameters preserves the relative activation at each layer. If bias terms and batch normalization are not used, the output of every layer is a scaled version of its non-shrunken counterpart. In the last layer, which usually consists of a linear transformation followed by a softmax nonlinearity, shrinking parameters can be interpreted as increasing the entropy of the output distribution, effectively diminishing the model's confidence. For no-bias, no-batchnorm ReLU models, while shrinking weights does not necessarily preserve the output f(x) they parametrize, it does preserve the learned hypothesis, i.e., argmax f(x); a simple proof is provided for completeness as Proposition 1 in the Appendix. For more sophisticated architectures, this property largely still holds: Figure 6 shows that for a ResNet, which includes batch normalization, only extreme amounts of shrinking are able to damage classifier performance.
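As a minimal sketch of this initialization rule (assuming PyTorch; the helper name and default values are our own, and the i.i.d. Gaussian noise follows the formula above, whereas later in this section we note that noise can instead come from a scaled, freshly initialized network):

```python
import torch
import torch.nn as nn

def shrink_perturb(model: nn.Module, lam: float = 0.6, sigma: float = 0.01) -> nn.Module:
    """Apply theta <- lam * theta + p, with p ~ N(0, sigma^2), to every parameter.

    Intended to be called once per round, after new data arrive and before
    training resumes; 0 < lam < 1 shrinks toward zero, sigma sets the noise.
    """
    with torch.no_grad():
        for param in model.parameters():
            param.mul_(lam).add_(torch.randn_like(param) * sigma)
    return model
```

In an online setting, one would call `shrink_perturb(model)` each time new samples are appended to the training set and then train to convergence as usual.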
This resilience arises because batch normalization's internal estimates of mean and variance can compensate for the rescaling caused by weight shrinking. Even for a ReLU MLP that includes bias nodes, performance is surprisingly resilient to shrinking; classifier damage occurs only for λ < 0.6 in Figure 6. Separately, note that when internal network layers instead use sigmoidal activations, shrinking parameters moves them further from saturating regions, allowing the model to more easily learn from new data.

Shrink-perturb balances gradients. Figure 5 shows a visualization of average gradients during the second phase of a two-phase training procedure for a ResNet on CIFAR-10, like those discussed in Sections 2 and 3. We plot the second phase of training, where gradient magnitudes are shown separately for the two halves of the dataset. For this experiment, models are optimized with SGD, using a small learning rate to zoom in on this effect. Outside of this plot, experiments in this section use the Adam optimizer.

Figure 8: Model performance as a function of λ and σ. Numbers indicate the average final performance and total train time for online learning experiments where ResNets are provided CIFAR-10 samples in sequence, 1,000 per round, and trained to convergence at each round. Note that the bottom left of this plot corresponds to pure random initializing while the top right corresponds to pure warm starting. Left: Validation accuracy tends to improve with more aggressive shrinking, and adding noise often improves generalization. Right: Model train times increase with decreasing values of λ. This is expected, as decreasing λ widens the gap between shrink-perturb parameters and warm-started parameters. Noise helps models train more quickly. Unlabeled boxes correspond to initializations too small for the model to reliably learn.

For warm-started models, gradients from new, unseen data tend to be of much larger magnitude than those from data the model has seen before. Such imbalanced gradient contributions are known to be problematic for optimization in multi-task learning scenarios [25], and suggest that under warm-started initializations the model does not learn in the same way as it would with randomly-initialized training [26]. We find that remedying this imbalance without damaging what the model has already learned is key to efficiently resolving the generalization gap studied in this article.

Shrinking the model's weights increases its loss, and correspondingly increases the magnitude of the gradient induced even by samples that have already been seen. Proposition 1 shows that in an L-layer ReLU network without bias nodes or batch normalization, shrinking weights by λ shrinks softmax inputs by $\lambda^L$, rapidly increasing the entropy of the softmax distribution and the cross-entropy loss. As shown in Figure 5, the loss increase caused by the shrink-perturb trick is able to balance gradient contributions between previously unseen samples and data on which the model has already been trained.

The success of the shrink and perturb trick lies in its ability to standardize gradients while preserving learned hypotheses. We could instead normalize gradient contributions by, for example, adding a significant amount of parameter noise, but this also damages the learned function. Consequently, that strategy drastically increases training time without fully closing the warm-start generalization gap (Appendix Table 4).
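The gradient imbalance visualized in Figure 5 can be probed with a simple measurement like the sketch below, which averages the full-parameter gradient norm over mini-batches of a chosen data subset (e.g., the original 50% versus the newly added 50%); this is our own rough instrumentation, not the authors' code.

```python
import torch

def mean_grad_norm(model, loader, loss_fn, device="cpu"):
    """Average L2 norm of the gradient over all parameters, one mini-batch at a time."""
    model.to(device)
    norms = []
    for x, y in loader:
        model.zero_grad()
        loss = loss_fn(model(x.to(device)), y.to(device))
        loss.backward()
        sq = sum((p.grad.detach() ** 2).sum()
                 for p in model.parameters() if p.grad is not None)
        norms.append(torch.sqrt(sq).item())
    return sum(norms) / max(len(norms), 1)
```

Comparing this quantity on the old and new halves of the data, before and after applying shrink-perturb, reproduces in spirit the comparison shown in Figure 5.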
As an alternative to shrinking all weights, we could try to increase the entropy of the output distribution by shrinking only the parameters in the last layer (Appendix Figure 14), or by regularizing the model's confidence while training (Appendix Table 3), but these are unable to resolve the warm-start problem. For sophisticated architectures especially, we find it is important to holistically modify parameters before training on new data.

Figure 7: An online learning experiment varying λ and keeping the noise scale fixed at 0.01. Note that λ = 1 corresponds to fully-warm-started initializations and λ = 0 corresponds to fully-random initializations. The proposed trick with λ = 0.6 performs identically to random initialization in terms of validation accuracy, but trains much more quickly. Interestingly, smaller values of λ are even able to outperform random initialization while still training faster.

The perturbation step, adding noise after shrinking, improves both training time and generalization performance. The trade-off between the relative values of λ and σ is studied in Figure 8. Note that in this figure, and in this section generally, we refer to the noise scale rather than to σ. In practice, we add noise by adding the parameters of a scaled, randomly-initialized network, to compensate for the fact that many random initialization schemes use different variances for different kinds of parameters.

Figure 7 demonstrates the effectiveness of this trick. As before, we present a passive online learning experiment where CIFAR-10 samples are supplied to a ResNet in sequence, 1,000 at a time. At each round we can either reinitialize network parameters from scratch or warm start, initializing them to those found in the previous round of optimization. As expected, we see that warm-started models train faster but generalize worse. However, if we instead initialize parameters using the shrink and perturb trick, we are able to both close this generalization gap and significantly speed up training. Appendix Sections 8.2.1-8.2.6 present extensive results varying λ and the noise scale, experimenting with dataset type, model architecture, and L2 regularization, all showing the same overall trend. Indeed, we notice that shrink-perturb settings that better balance gradient contributions also better remedy the warm-start problem. That said, we find that one does not need to shrink very aggressively to adequately correct gradients and efficiently close the warm-start generalization gap.

Figure 9: Pre-trained models fitted to a varying fraction of the indicated dataset. We compare these warm-started, pre-trained models to randomly-initialized and shrink-perturb-initialized counterparts, trained on the same fraction of target data. The relative performance of warm starting and random initialization varies, but shrink-perturb performs at least as well as the best strategy.

4.1 The shrink and perturb trick and regularization

Exercising the shrink and perturb trick at every step of SGD would be very similar to applying an aggressive, noisy L2 regularization. That is, shrink-perturbing every step of optimization yields the SGD update $\theta_i \leftarrow \lambda \left( \theta_i - \gamma \, \partial \mathcal{L} / \partial \theta_i \right) + p$ for loss $\mathcal{L}$, weight $\theta_i$, and learning rate $\gamma$, making the shrinkage term λ behave like a weight decay parameter. It is natural to ask, then: how does this trick compare with weight decay?
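To make the comparison concrete, the sketch below applies shrink-perturb at every SGD step, i.e., the noisy-weight-decay-like variant described by the update above, as opposed to the once-per-round application used elsewhere in the paper; the function and its default values are illustrative assumptions.

```python
import torch

def sgd_step_shrink_perturb(model, loss, lr=1e-3, lam=0.999, sigma=1e-4):
    """One step of theta <- lam * (theta - lr * dL/dtheta) + p, p ~ N(0, sigma^2).

    Applied at every step, lam acts like a weight decay factor and sigma injects
    noise; applied once per round of new data (as in Section 4), the same
    shrink-then-perturb operation is the proposed initialization.
    """
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            p.add_(p.grad, alpha=-lr)                      # gradient step
            p.mul_(lam).add_(torch.randn_like(p) * sigma)  # shrink, then perturb
    return model
```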
Appendix Figure 12 shows that in non-warm-started environments, where we just have a static dataset, the iterative application of the shrink-perturb trick results in marginally improved performance. These experiments fit a ResNet to convergence on 100% of the CIFAR-10 data, then shrink and perturb the weights before repeating the process, resulting in a modest performance improvement.

We can conclude that the shrink-perturb trick has two benefits. Most significantly, it allows us to quickly fit high-performing models in sequential environments without having to retrain from scratch. Separately, it offers a slight regularization benefit, which in tandem with the first property sometimes allows shrink-perturb models to generalize even better than randomly-initialized models. This L2 regularization benefit is not enough to explain the success of the shrink-perturb trick. As Appendix Table 3 demonstrates, L2-regularized models are still vulnerable to the warm-start generalization gap. Appendix Sections 8.2.5 and 8.2.6 show that we are able to mitigate this performance gap with the shrink and perturb trick even when models are being aggressively regularized with weight decay (regularization penalties any larger would prevent networks from fitting the training data).

4.2 The shrink and perturb trick and pre-training

Despite successes on a variety of tasks, deep neural networks still generally require large training sets to perform well. For problems where only limited data are available, it has become popular to warm-start learning using parameters from training on a different but related problem [14, 27]. Transfer and few-shot learning in this form have seen success in computer vision and NLP [28]. The experiments we perform here, however, imply that when the second problem is not data-limited, this transfer learning approach deteriorates model quality. That is, at some point, the pre-training transfer learning approach becomes similar to warm-starting under domain shift, and generalization should suffer.

We demonstrate this phenomenon by first training a ResNet-18 to convergence on one dataset, then using that solution to warm-start a model trained on a varying fraction of another dataset. When only a small portion of the target data is used, this is essentially the same as the pre-training transfer learning approach. As the proportion increases, the problem turns into what we have described here as warm starting. Figure 9 shows the result of this experiment, and it appears to support our intuition. Often, when the second dataset is small, warm starting is helpful, but there is frequently a crossover point beyond which better generalization would be achieved by training from scratch on that fraction of the target data. Sometimes, when source and target datasets are dissimilar, it would be better to randomly initialize regardless of the amount of target data available. The exact point at which this crossover occurs (and whether it happens at all) depends not just on model type but also on the statistical properties of the data in question; it cannot be easily predicted. We find that shrink-perturb initialization, however, allows us to avoid having to make such a prediction: shrink-perturbed models perform at least as well as warm-started models when pre-training is the most performant strategy, and as well as randomly-initialized models when it is better to learn from scratch. Figure 9 displays this effect for λ = 0.3 and noise scale 0.0001.
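A sketch of the transfer comparison just described, reusing the `shrink_perturb` helper sketched in Section 4 (with Gaussian noise standing in for the scaled fresh-network noise used in practice); `model_fn` and `train_fn` are the same illustrative placeholders as before.

```python
import copy

def transfer_comparison(pretrained, model_fn, target_loader, train_fn,
                        lam=0.3, sigma=1e-4):
    """Fit a fraction of the target data three ways, starting from a model
    pre-trained on a source dataset: warm start, shrink-perturb, random init."""
    warm = train_fn(copy.deepcopy(pretrained), target_loader)
    sp = train_fn(shrink_perturb(copy.deepcopy(pretrained), lam, sigma),
                  target_loader)
    fresh = train_fn(model_fn(), target_loader)
    return warm, sp, fresh
```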
Comprehensive shrink-perturb settings are presented for this scenario in Appendix Section 8.2.7, all showing similar results.

5 Discussion and Research Surrounding the Warm-Start Problem

Warm-starting and online learning are well understood for convex models like linear classifiers [29] and SVMs [30, 31]. Excluding the shrink-perturb trick, it does not appear that generally applicable techniques exist for deep neural networks that do not damage generalization, so models are typically retrained from scratch [6, 32]. There has been a variety of work in closely related areas, however. For example, in analyzing critical learning periods, researchers show that a network initially trained on blurry images and then on sharp images is unable to perform as well as one trained from scratch on sharp images, drawing a parallel between human vision and computer vision [26]. We show that this phenomenon is more general, with test performance damaged even when the first and second datasets are drawn from identical distributions.

Initialization. The problem of warm starting is closely related to the rich literature on initialization of neural network training "from scratch". Indeed, new insights into what makes an effective initialization have been critical to the revival of neural networks as machine learning models. While there have been several proposed methods for initialization [33, 34, 10, 35, 36], this body of literature primarily concerns itself with initializations that are high-quality in the sense that they allow for quick and reliable model training. That is, these methods are typically built with training performance in mind rather than generalization performance. Work relating initialization to generalization suggests that networks whose weights have moved far from their initialization are less likely to generalize well compared with ones that have remained relatively nearby [37]. Here we have shown with experimental results that warm-started networks that have less in common with their initializations seem to generalize better than those that have more (Appendix Figure 11). So while it is not surprising that there exist initializations that generalize poorly, it is surprising that warm starts are in that class. Still, before retraining, our proposed solution brings parameters closer to their initial values than they would be if just warm starting, suggesting some relationship between generalization and distance from initialization.

Generalization. The warm-start problem is fundamentally about generalization performance, which has been extensively studied both theoretically and empirically within the context of deep learning. These articles have investigated generalization by studying classifier margin [38, 39], loss geometry [40, 19, 41], and measurements of complexity [42, 43], sensitivity [44], or compressibility [45]. These approaches can be seen as attempting to measure the intricacy of the hypothesis learned by the network. If two models are both consistent for the same training data, the one with the less complicated concept is more likely to generalize well. We know that networks trained with SGD are implicitly regularized [20, 21], suggesting that standard training of neural networks incidentally finds low-complexity solutions. It is possible, then, that the initial round of training disqualifies solutions that would most naturally explain the general problem of interest. If so, by balancing gradient contributions, the shrink and perturb trick seems to make these solutions accessible again.
Pre-training. As previously discussed, the warm-start problem is very similar to the idea of unsupervised and supervised pre-training [46, 11, 10, 47]. Under that paradigm, learning where limited labeled data are available is aided by first training on related data. The warm-start problem, however, is not about limited labeled data in the second round of training. Instead, the goal of warm starting is to reduce the time required to fit a neural network by initializing from a similar supervised problem, without damaging generalization. Our results suggest that while warm-starting is beneficial when labeled data are limited, it actually damages generalization to warm-start in data-rich situations.

Concluding thoughts. This article presented the challenges of warm-starting neural network training and proposed a simple and powerful solution. While warm-starting is a problem that the community seems somewhat aware of anecdotally, it does not seem to have been directly studied. We believe that this is a major problem in important real-life tasks for which neural networks are used, and it speaks directly to the resources consumed by training such models.

6 Broader Impact

The shrink and perturb trick allows models to be efficiently updated without sacrificing generalization performance. In the absence of this method, achieving best-possible performance requires neural networks to be randomly initialized each time new data are appended to the training set. As mentioned earlier, this requirement can cost significant computational resources and, as a result, is partially responsible for the deleterious environmental ramifications studied in recent years [4, 5]. Additionally, the enormous computational expense of retraining models from scratch disproportionately burdens research groups without access to abundant computational resources. The shrink and perturb trick lowers this barrier, democratizing participation in online learning, active learning, and pre-training research with neural networks.

7 Funding Disclosure and Competing Interests

This work was partially funded by NSF IIS-2007278 and by a Siemens FutureMakers graduate student fellowship. RPA is on the board of directors at Cambridge Machines Ltd. and is a scientific advisor to Manifold Bio.

References

[1] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, et al. Practical lessons from predicting clicks on ads at Facebook. In Workshop on Data Mining for Online Advertising, pages 1-9. ACM, 2014.

[2] Badrish Chandramouli, Justin J Levandoski, Ahmed Eldawy, and Mohamed F Mokbel. StreamRec: a real-time recommender system. In International Conference on Management of Data, pages 1243-1246. ACM, 2011.

[3] Ludmila I Kuncheva. Classifier ensembles for detecting concept change in streaming data: Overview and perspectives. In 2nd Workshop SUEMA, volume 2008, pages 5-10, 2008.

[4] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in NLP. In Annual Meeting of the Association for Computational Linguistics, 2019.

[5] Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. Green AI. arXiv preprint arXiv:1907.10597, 2019.

[6] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.

[7] Jordan T Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal.
Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv preprint arXiv:1906.03671, 2019.

[8] Rostamiz. https://github.com/ej0cl6/deep-active-learning, 2017-2019.

[9] Kuan-Hao Huang. https://github.com/ej0cl6/deep-active-learning, 2018-2019.

[10] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625-660, 2010.

[11] Yoshua Bengio. Deep learning of representations for unsupervised and transfer learning. In International Conference on Unsupervised and Transfer Learning Workshop, pages 17-37. JMLR, 2011.

[12] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, and Daan Wierstra. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630-3638, 2016.

[13] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077-4087, 2017.

[14] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126-1135. JMLR.org, 2017.

[15] Jordan T Ash, Robert E Schapire, and Barbara E Engelhardt. Unsupervised domain adaptation using approximate label matching. arXiv preprint arXiv:1602.04889, 2016.

[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

[17] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2014.

[18] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[19] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. International Conference on Learning Representations, 2016.

[20] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference on Learning Theory, pages 2-47, 2018.

[21] Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pages 6151-6159, 2017.

[22] Anders Krogh and John A Hertz. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, pages 950-957, 1992.

[23] Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.

[24] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.

[25] Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. arXiv preprint arXiv:2001.06782, 2020.

[26] Alessandro Achille, Matteo Rovere, and Stefano Soatto. Critical learning periods in deep networks. In International Conference on Learning Representations, 2018.
[27] Alex Nichol and John Schulman. Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999, 2018.

[28] Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. How transferable are neural networks in NLP applications? In Empirical Methods in Natural Language Processing, pages 478-489, 2016.

[29] Bo-Yu Chu, Chia-Hua Ho, Cheng-Hao Tsai, Chieh-Yen Lin, and Chih-Jen Lin. Warm start for parameter selection of linear classifiers. In International Conference on Knowledge Discovery and Data Mining, pages 149-158. ACM, 2015.

[30] Dennis DeCoste and Kiri Wagstaff. Alpha seeding for support vector machines. In International Conference on Knowledge Discovery and Data Mining, pages 345-349. ACM, 2000.

[31] Zeyi Wen, Bin Li, Ramamohanarao Kotagiri, Jian Chen, Yawen Chen, and Rui Zhang. Improving efficiency of SVM k-fold cross-validation by alpha seeding. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[32] Pranav Shyam, Wojciech Jaśkowski, and Faustino Gomez. Model-based active exploration. In International Conference on Machine Learning, pages 5779-5788, 2018.

[33] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139-1147, 2013.

[34] Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In Advances in Neural Information Processing Systems, pages 2377-2385, 2015.

[35] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249-256, 2010.

[36] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In IEEE International Conference on Computer Vision, pages 1026-1034, 2015.

[37] Vaishnavh Nagarajan and J Zico Kolter. Generalization in deep networks: The role of distance from initialization. In Neural Information Processing Systems, 2017.

[38] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240-6249, 2017.

[39] Colin Wei, Jason D Lee, Qiang Liu, and Tengyu Ma. On the margin theory of feedforward neural networks. arXiv preprint arXiv:1810.05369, 2018.

[40] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1-42, 1997.

[41] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pages 6389-6399, 2018.

[42] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.

[43] Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. Fisher-Rao metric, geometry, and complexity of neural networks. pages 888-896, 2017.

[44] Roman Novak, Yasaman Bahri, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. In International Conference on Learning Representations, 2018.

[45] Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P Adams, and Peter Orbanz. Non-vacuous generalization bounds at the ImageNet scale: a PAC-Bayesian compression approach.
In International Conference on Learning Representations, 2019.

[46] Hengduo Li, Bharat Singh, Mahyar Najibi, Zuxuan Wu, and Larry S Davis. An analysis of pre-training on object detection. arXiv preprint arXiv:1904.05871, 2019.

[47] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, pages 153-160, 2007.