# When to retrain a machine learning model

Florence Regol 1 2, Leo Schwinn 3, Kyle Sprague 2, Mark Coates 1, Thomas Markovich 2

1 McGill University, Canada; 2 Block, Toronto, Canada; 3 Technical University of Munich, Germany. Correspondence to: Florence Regol.

A significant challenge in maintaining real-world machine learning models is responding to the continuous and unpredictable evolution of data. Most practitioners are faced with a difficult question: when should I retrain or update my machine learning model? This seemingly straightforward problem is particularly challenging for three reasons: 1) decisions must be made based on very limited information, as we usually have access to only a few examples; 2) the nature, extent, and impact of the distribution shift are unknown; and 3) it involves specifying a cost ratio between retraining and poor performance, which can be hard to characterize. Existing works address certain aspects of this problem, but none offer a comprehensive solution. Distribution shift detection falls short as it cannot account for the cost trade-off; the scarcity of the data, paired with its unusual structure, makes it a poor fit for existing offline reinforcement learning methods; and the online learning formulation overlooks key practical considerations. To address this, we present a principled formulation of the retraining problem and propose an uncertainty-based method that makes decisions by continually forecasting the evolution of model performance evaluated with a bounded metric. Our experiments addressing classification tasks show that the method consistently outperforms existing baselines on 7 datasets.

1. Introduction

In many industrial machine learning settings, data are continuously arriving and evolving (Gama et al., 2014). This means that a model, $f_\theta$, that was trained on a fixed dataset, $D$, will become outdated. This usually translates to a cost in
the form of a missed opportunity. However, retraining a new model, $f_{\theta'}$, on a more up-to-date dataset, $D'$, is also costly. Beyond the obvious costs of computational resources and energy (Strubell et al., 2020), there are human resource costs associated with assigning experts to deploy and maintain the model, as well as collecting and cleaning data. Deploying a new model also generally comes with a higher risk. Therefore, the optimal retraining schedule depends on this comprehensive cost of retraining, on the cost of making mistakes, and on future model performance. Figure 1 provides a visualization of the task.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Although this retraining problem is ubiquitous in industry (Gama et al., 2014), there are few works in the machine learning literature that tackle it directly. It has been framed as an application of the distribution shift detection problem (Bifet & Gavaldà, 2007), where the conventional strategy involves triggering retraining whenever a substantial shift is detected (Bifet & Gavaldà, 2007; Cerqueira et al., 2021; Pesaranghader & Viktor, 2016). However, this approach overlooks retraining costs. This can be particularly problematic when training is expensive, as demonstrated in our experiments. Others have reduced the need for retraining by incorporating robustness to distribution shifts (Schwinn et al., 2022) or adapting to them (Filos et al., 2020), but these methods have limits on the extent of the shift they can handle. Other related areas include online, adaptive, life-long, and transfer learning (Hoi et al., 2021), which aim to update models to new or evolving data distributions. However, these methods are primarily concerned with maximal model performance, while the goal of our work is to explicitly minimize the overall cumulative cost.
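To make this trade-off concrete, consider a toy example (entirely illustrative, not from the paper's experiments): suppose the per-step loss grows linearly with the time since the last retrain, and each retrain costs a fixed amount. A short sketch comparing fixed retraining periods:

```python
def schedule_cost(decay, alpha, T, period):
    """Toy illustration of the retraining trade-off: the per-step loss
    grows by `decay` for every step since the last retrain, and each
    retrain costs `alpha`. Returns the total cost over horizon T when
    retraining every `period` steps (period=0 means never retrain)."""
    total, since = 0.0, 0
    for t in range(1, T + 1):
        if period and t % period == 0:
            total += alpha   # pay the retraining cost
            since = 0        # performance resets to fresh-model level
        total += decay * since
        since += 1
    return total
```

With these illustrative numbers, retraining every 4 steps (`schedule_cost(0.1, 0.5, 12, 4)`) is cheaper than retraining every step or never, which is exactly the scheduling tension formalized below.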
In particular, continual learning approaches and the like cannot delay updates due to future cost considerations. Moreover, in practice, the cost of retraining can go beyond the number of gradient updates or sample complexity, as discussed above. Finally, because this is a sequential decision problem, it can be framed within the offline reinforcement learning framework (Levine et al., 2020). In theory, offline RL methods should be applicable, but few, if any, are designed for very low-data settings. They require substantial amounts of data for training and hyperparameter tuning, and are therefore largely unsuitable in this context.

A direct treatment of the cost consideration in the retraining problem is presented by Žliobaitė et al. (2015) and by Mahadevan & Mathioudakis (2024). The formulation by Mahadevan & Mathioudakis (2024) accounts for the trade-off between the cost of retraining and the cost of degraded performance. Their method, CARA, relies on approximating the performance of a model on new data, and the retraining decision is based on this value. However, this approach makes several limiting assumptions: 1) the relative cost objective assumes that the difficulty of the task remains constant; and 2) the performance approximation assumes the data distribution is almost stationary.

Figure 1. The Retraining Problem: The performance of a model trained on a dataset $D_i$ gradually decreases when evaluated on more recent datasets in the presence of distribution shift. The task is to determine when retraining is beneficial compared to keeping an older model. We must take into consideration the trade-off between potential accuracy gains and the costs associated with retraining. In the retraining schedule $\theta$ shown here, retraining occurs twice, at $t = 4$ and $t = 8$.
Instead, we consider a more general objective that combines both the retraining cost and the average performance over a specified horizon. We detail the relationship between our objective and CARA's objective in Appendix A.12. Our formulation is more general and does not depend on strong assumptions regarding the data distribution and its impact on performance. Additionally, our method can leverage new observations of the model's performance. Our proposed method involves forecasting the performance of both future and current models and making decisions based on the uncertainty of our predictions. There is no constraint on how the retrained model is obtained: it can be fine-tuned from a previous model, trained from scratch, or derived using any other procedure. We show the effectiveness of our approach on five real datasets and two synthetic datasets.

We make the following contributions:
- We introduce a principled formulation of a practical version of the retraining problem. We explain connections to existing formulations and offline reinforcement learning.
- We establish upper limits on the optimal number of retrains based on performance bounds, which can be used to determine whether retraining should be considered at all.
- We propose a novel retraining decision procedure based on performance forecasting: UPF.
- Our proposed algorithm outperforms existing baselines. It requires minimal data by fully leveraging the structure specific to the retraining problem, employing compact regression models, and balancing the uncertainty caused by data scarcity through an uncertainty-informed decision process.

2. Related Work

We discuss related work and fields relevant to the retraining problem. A more detailed literature review, including connections to other related fields, is provided in Appendix A.3.

Retraining problem. Few works explicitly target the retraining problem. Žliobaitė et al.
(2015) propose a return on investment (ROI) framework to monitor and assess the retraining decision process, but do not introduce a method for actually deciding when to retrain. Mahadevan & Mathioudakis (2024) develop a retraining decision algorithm, CARA, which integrates the cost of retraining and introduces a staleness cost for persisting with an old model. CARA approximates the staleness cost using offline data consisting of several trained models and their historical performance. Three versions of CARA are proposed: (i) retraining if the estimated staleness exceeds a threshold; (ii) retraining based on estimated cumulative staleness; or (iii) identifying an optimal retraining frequency. While providing promising results, CARA requires access to some of the data that will be used for retraining, is very computationally intensive, and does not adapt to data obtained during the online decision period. Hoffman et al. (2025) address a related problem: deciding whether to retain the current model (i.e., no retraining), fully retrain it, or refine it via fine-tuning. The authors formulate an objective that balances retraining cost, the impact of concept drift (ambiguity), the uncertainty associated with each option (risk), and the expected performance.

Distribution shift detection. The retraining problem is closely connected to distribution shift detection and mitigation (Wang et al., 2024a; Hendrycks & Gimpel, 2017; Rabanser et al., 2019). Some approaches decide to adapt a model after detection of a changed distribution (Sugiyama & Kawanabe, 2012; Zhang et al., 2023). Since the signal for these methods is designed to adapt a model rather than trigger a full retraining, they are not appropriate to be used as full retraining signals. Other approaches, however, directly treat the detection of a distribution shift as a cue for retraining.
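As a concrete illustration of such a cue, a Hoeffding-bound monitor over the stream of per-prediction correctness can be sketched as follows (in the spirit of the FHDDM-style detectors discussed below; the window size, confidence level, and reset convention are our illustrative choices, not the original implementation):

```python
import math
from collections import deque

class HoeffdingDriftMonitor:
    """Illustrative sliding-window drift monitor. Tracks the fraction of
    correct predictions over a window of size n and flags drift when the
    current fraction drops below the best fraction seen so far by more
    than the Hoeffding bound eps = sqrt(ln(1/delta) / (2 n))."""

    def __init__(self, n=100, delta=1e-4):
        self.n = n
        self.eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
        self.window = deque(maxlen=n)
        self.mu_max = 0.0

    def update(self, correct):
        """Feed one 0/1 outcome; return True if drift is detected."""
        self.window.append(1.0 if correct else 0.0)
        if len(self.window) < self.n:
            return False
        mu = sum(self.window) / self.n
        self.mu_max = max(self.mu_max, mu)
        if self.mu_max - mu > self.eps:
            # Reset internal state after signalling (one common convention).
            self.window.clear()
            self.mu_max = 0.0
            return True
        return False
```

A detection-triggered policy would simply retrain whenever `update` returns True, which is precisely the strategy whose cost-blindness is criticized above.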
ADWIN (Bifet & Gavaldà, 2007) uses statistical testing of the label or feature distribution. Another approach is to directly monitor the model's performance. FHDDM (Pesaranghader & Viktor, 2016) employs Hoeffding's inequality, while Raab et al. (2020) propose a method that relies on a Kolmogorov-Smirnov Windowing test. These approaches work well with low retraining costs, but perform poorly when retraining costs are high, as they tend to recommend retraining far too often. Additionally, they lack adaptability to varying costs, and it is difficult to determine the correct significance level to use for a given retraining-to-performance cost ratio.

Offline reinforcement learning. Lastly, we discuss connections to offline reinforcement learning (ORL), where an agent must learn a policy from a fixed dataset of rewards, actions, and states. This subset of RL is challenging, as the agent cannot explore and must rely on the dataset to infer underlying dynamics and handle distribution shifts. Levine et al. (2020) provide an extensive review. Q-learning and value function methods, which focus on predicting future action costs, have become the preferred approaches (Levine et al., 2020; Kalashnikov et al.; Hejna et al., 2023; Kostrikov et al., 2022). Some methods incorporate epistemic uncertainty into the Q-function to address distribution shifts of unseen actions (Kumar et al., 2020; Luis et al., 2023). If we view the states as encoding both time and the model in use, and actions as either retraining or maintaining the current model, we can frame our problem as ORL. However, most existing RL approaches focus on scaling to large state or action spaces, employ large models, and assume access to abundant data, making them unsuitable for our context. A more detailed discussion of the connections and limitations of ORL methods is included in Appendix A.11.

3. Problem Setting

We outline our formulation of the retraining problem.
We have access to a sequence of datasets, $D_{-w}, \dots, D_0, \dots, D_T$, with features and labels $x_{i,t} \in \mathcal{X}_t$, $y_{i,t} \in \mathcal{Y}_t$, $D_t = \{(x_{i,t}, y_{i,t})\}_{i=1}^{|D_t|}$, which are assumed to be drawn from a sequence of distributions $D_t \sim p_t$. In practice, this reflects the gradual distribution shifts that occur when collecting data over time, so we specifically cannot assume that $p_t = p_{t+1}$ (this would correspond to a special case of the problem, which we refer to as the no distribution shift case). The datasets are acquired at discrete times $t \in [-w, \dots, 0, \dots, T]$. The sequence is split into an offline period that spans $t \in [-w, \dots, 0]$, followed by an online period $t \in [1, \dots, T]$. At each time step $t$ of the online period, we are given the option to (re)train a model $f_t$, using the data acquired up until time $t$, for a retraining cost of $c_t$. Datasets and trained models can be formed and obtained through any means depending on the task; for example, $f_1$ could be fine-tuned from $f_0$ and $D_1$ could contain $D_0$. The complete sequence of decisions can be encoded as a binary vector $\theta \in \{0,1\}^T$, where $\theta_t = 1$ indicates that we retrain the model at time $t$. We introduce $r_\theta(t)$ as a mapping function that returns, at time $t$, the last training time, i.e., $r_\theta(t) = \max\{t' \in [1, \dots, t] \text{ s.t. } \theta_{t'} = 1\}$, or $r_\theta(t) = 0$ if no retraining has occurred up to time $t$. At each time step $t$, we are required to generate a certain number of predictions $N_t$ on a test set, which incurs a loss $\ell(\hat{y}, y)$, scaled by a cost $e_t$. This would correspond to actually using the model to make predictions; for example, in fraud detection, failing to detect a fraudulent transaction costs $e_t$, and approximately $N_t$ transactions are verified at time $t$. To make these predictions at time $t$, we use the most recently trained model, which we denote by $f_{r_\theta(t)}$. To ensure that there is always at least one model available during the online period, we always train the last offline model $f_0$.
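The decision encoding and the cost structure above can be sketched directly in a few lines (illustrative Python, not the paper's implementation; `pe` is assumed to be a known matrix of expected losses, and the cost is written up to a constant scaling):

```python
def last_train_time(theta, t):
    """r_theta(t): most recent (re)training time at or before t, or 0 if
    no retraining has occurred yet (the offline model f_0 is in use).
    theta[k-1] holds the decision made at online time k."""
    for k in range(t, 0, -1):
        if theta[k - 1] == 1:
            return k
    return 0

def retraining_cost(theta, pe, alpha):
    """Cumulative online cost up to a constant factor: the retraining
    term alpha * ||theta||_1 plus, at each step, the expected loss of the
    model currently in use. pe[i][j] is the expected loss of the model
    trained at time i when evaluated at time j (i <= j)."""
    T = len(theta)
    return alpha * sum(theta) + sum(
        pe[last_train_time(theta, t)][t] for t in range(1, T + 1))
```

For the schedule of Figure 1 (retrains at t = 4 and t = 8), `last_train_time` returns 0 for t < 4, then 4, then 8.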
The target cost is a function of $\theta$, which encodes the retraining decisions, and combines two costs: the cost associated with model performance, $\sum_{t=1}^{T} e_t \sum_{i=1}^{N_t} \ell(f_{r_\theta(t)}(x_{i,t}), y_{i,t})$, and the cost to retrain, $\theta_t c_t$:

$$C_\alpha(\theta) = \mathbb{E}\Big[\sum_{t=1}^{T} \Big( e_t \sum_{i=1}^{N_t} \ell(f_{r_\theta(t)}(x_{i,t}), y_{i,t}) + \theta_t c_t \Big)\Big]. \quad (1)$$

To make the expression more concise, we condense the expected loss into a scalar $pe_{i,j}$, where the two indices denote the model index and the timestep, respectively:

$$pe_{i,j} = \begin{cases} \mathbb{E}_{D_j}[\ell(f_i(X_j), Y_j)], & \text{if } i \le j, \\ 0, & \text{otherwise.} \end{cases} \quad (2)$$

We can simplify the problem by assuming a fixed cost of retraining, $c_t = c$, cost of loss, $e_t = e$, and number of predictions, $N_t = N$. The solutions we develop are easily extended to the case where these are varying, but known, quantities. Introducing a cost-to-performance ratio parameter $\alpha = \frac{c}{e N}$, we can compactly write the online objective as:

$$C_\alpha(\theta) = e N \Big( \alpha \|\theta\|_1 + \sum_{t=1}^{T} pe_{r_\theta(t), t} \Big). \quad (3)$$

3.1. Offline and Online data

The cost $C_\alpha(\theta)$ is only evaluated over the online period. We assume that we have access to all the datasets and trained models during the offline period. In practice, the number of models and datasets is typically limited to only a few (around 10 to 20 at most), which is why we characterize this problem as being in a low-data regime. We denote this data as $I_{\text{offline}} = (D_{-w}, \dots, D_0, f_{-w}, \dots, f_0)$. In the online mode, each decision at time $t$ can only rely on information available prior to that time, which we denote by $I_{<t}$. Minimizing $C_\alpha(\theta)$ online would require knowing the future performance values $pe_{i,j}$ (for $i > t$ or $j > t$). This is infeasible; however, we assume that there is a temporal correlation between the performances of different models trained at different times, which we aim to exploit to build a predictive model. We therefore propose to 1) model these future values as random variables and learn their distributions; and 2) base our decisions on the predicted distributions to construct our method, the Uncertainty-Performance Forecaster (UPF).

4.1.
Future Performance Forecaster

The first component of our algorithm involves learning a performance predictor to forecast unknown entries in $pe$, which are defined as $pe_{i,j} = \mathbb{E}_{D_j}[\ell(f_i(X_j), Y_j)]$ for $i \le j$ (see Eqn 2). In a classification setting where we consider the 0-1 loss $\ell(\hat{y}, y) = \mathbb{1}[\hat{y} \ne y]$, these are $1 - \text{accuracy}$. We introduce random variables $A_{i,j}$ and model the entries $pe_{i,j}$ as realizations of these. Although this prediction task may initially seem similar to the performance estimation (PE) problem (Garg et al., 2020), it is fundamentally different. PE focuses on estimating the performance of an existing model under distribution shift, whereas our task involves forecasting the future performance of models that do not yet exist. Crucially, PE lacks a temporal dimension, as it does not account for the evolution of models over time. Since the $A_{i,j}$ random variables are bounded, we model them (after appropriate scaling) as Beta distributed with parameters $\alpha(r_{i,j}), \beta(r_{i,j})$ that depend on some input feature $r_{i,j}$. These features contain information about the current state of the feature distribution as well as the timestamp (see the section Inputs in Appendix A.7 for full details). This forecasting formulation allows us to capture both covariate and concept drift. The choice of the Beta distribution is particularly appropriate when the performance metrics $A_{i,j}$ are accuracies, as in our experiments, since accuracy can be interpreted as a scaled sum of Bernoulli random variables. Of course, other distributions could also be considered. We also define their associated mean $\mu(r_{i,j})$ and variance $\sigma(r_{i,j})$. Given the parameters $\alpha(r_{i,j}), \beta(r_{i,j})$, we model the random variables as independent of each other:

$$P(A_{0,0}, \dots, A_{T,T} \mid \{\alpha(r_{i,j}), \beta(r_{i,j})\}_{i \le j}) \quad (8)$$
$$= \prod_{i \le j} \text{Beta}(\alpha(r_{i,j}), \beta(r_{i,j})), \quad (9)$$

where Beta() denotes the pdf of a Beta distribution.
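Under this independence assumption, the joint log-likelihood of observed accuracies factorizes entry-wise; a minimal sketch (function names are ours):

```python
import math

def beta_log_pdf(a, alpha, beta):
    """Log of the Beta pdf at a in (0, 1), via log-gamma functions."""
    log_norm = (math.lgamma(alpha + beta)
                - math.lgamma(alpha) - math.lgamma(beta))
    return (log_norm + (alpha - 1.0) * math.log(a)
            + (beta - 1.0) * math.log(1.0 - a))

def joint_log_likelihood(obs):
    """Log-likelihood under the factorized model of Eqns (8)-(9).
    obs: iterable of (a_ij, alpha_ij, beta_ij) triples for i <= j."""
    return sum(beta_log_pdf(a, al, be) for a, al, be in obs)
```

Maximizing this quantity over the parameters (as functions of the features $r_{i,j}$) is one way to fit the forecaster.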
We choose the input features $r_{i,j}$ to include the indices of the training and evaluation datasets ($i$ and $j$, respectively), along with additional features that capture the gap between the training and evaluation timesteps (the difference $j - i$) and summary statistics of the distribution shift $z_{\text{shift}}$ (see Appendix A.7 for details). The input features are thus given by $r_{i,j} = [i, j, j - i, z_{\text{shift}}]$. From the offline data, we have access to observations $a_{i,j} \sim A_{i,j}$, and can build a regression dataset to learn the parameters $\alpha(r_{i,j}), \beta(r_{i,j})$. We specify the learning task by constructing $(r_{i,j}, a_{i,j})$ pairs.

circles is a 2-dimensional synthetic dataset. The input features are uniformly generated as $X_t \sim U[0,1]^2$. The label is generated using a moving rule $y_t = \mathbb{1}[(r_1 - (0.2 + 0.02t))^2 + (r_2 - (0.2 + 0.02t))^2 \le 0.5]$.

iWild (Beery et al., 2020) is a multiclass dataset featuring images of animals captured in the wild at various locations. Originally used as a domain transfer benchmark, we adapted it into a standard classification dataset by including the location ID as a feature for the model. To obtain a long enough sequence of datasets $D_0, D_1, \dots$, we create the individual datasets $D_i$ using overlapping windows on the timeframe, i.e., half of the most recent images in $D_i$ are contained in $D_{i+1}$. We avoid data leakage by ensuring that the train/val/test splits are maintained.

A.6.1. BASE MODEL OF THE IWILD DATASET

To motivate our cost considerations, we present an experiment where the base model architecture is not fixed and is searched for across a list of potential model architectures. This could happen in practice for important applications; nothing forces a practitioner to use the same base model $f$ at each timestep. Our architecture involves using a pretrained vision model, with a new output layer added to match the correct number of classes for our task, which is then fine-tuned for up to 20 epochs.
The fine-tuning process uses the Adam optimizer with a fixed learning rate of $10^{-4}$ and a weight decay parameter of $10^{-5}$. Training was conducted using 4 H100 GPUs for 2 days. At each timestep $t$, we perform a random search over the pretrained vision models made available from timm, which includes 188 vision models of varying configurations and base architectures. We include the list in Appendix A.15. We also include in our search the option to early stop or not, using the validation set. The model used for $f_t$ is the one that obtains the best validation accuracy.

A.7. Performance forecaster

In this section, we provide additional details on the proposed algorithm to forecast the performance. To restate, instead of learning the $\alpha(r_{i,j}), \beta(r_{i,j})$ parameters, we learn the mean and variance parameters

$$\mu(r_{i,j}), \quad (49)$$
$$\sigma(r_{i,j}), \quad (50)$$

and convert the learned parameters to the parameters of a Beta distribution using the following relations (with appropriate clipping if needed):

$$\alpha = \mu \left( \frac{\mu(1-\mu)}{\sigma} - 1 \right), \quad (51)$$
$$\beta = (1-\mu) \left( \frac{\mu(1-\mu)}{\sigma} - 1 \right). \quad (52)$$

Inputs $r_{i,j}$. As stated, the input of our performance forecaster model contains the model index $i$, the timestep $j$, the time since retrain $j - i$, and summary statistics of the distribution shift $z_{\text{shift}}$. $z_{\text{shift}}$ is constructed by taking the average feature shift between the features of the most recently available subsequent datasets $D_t$ and $D_{t-1}$ (where $t$ denotes the time step of the most recent available dataset). We compute the mean features of each dimension for a given dataset, $\bar{x} = \frac{1}{|D_t|}\sum_{i=1}^{|D_t|} x_i$, and compute the $\ell_1$ distance between the mean feature vectors of the two subsequent datasets:

$$z_{\text{shift}} = \|\bar{x}_t - \bar{x}_{t-1}\|_1. \quad (53)$$

The input features are thus given by concatenating $r_{i,j} = [i, j, j - i, z_{\text{shift}}]$. Since our methodology involves forecasting the performance of future models on future datasets to be used by our decision algorithm, we assess the regression performance of our forecasting models and analyze how it impacts the overall performance of our UPF algorithm.
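The moment conversions and the shift statistic can be sketched as follows (illustrative; the clipping constants are our choices, and the log-Normal inversion corresponds to the unbounded-metric variant of Appendix A.8):

```python
import math

def beta_params_from_moments(mu, var, eps=1e-6):
    """Eqns (51)-(52): alpha = mu*k, beta = (1-mu)*k with
    k = mu(1-mu)/var - 1. mu is clipped into (0, 1) and var capped
    below mu(1-mu) so that both parameters stay positive."""
    mu = min(max(mu, eps), 1.0 - eps)
    var = min(max(var, eps * eps), mu * (1.0 - mu) - eps * eps)
    k = mu * (1.0 - mu) / var - 1.0
    return mu * k, (1.0 - mu) * k

def lognormal_params_from_moments(mu, var):
    """Log-Normal analogue (Appendix A.8): standard moment inversion
    v^2 = ln(1 + var / mu^2), m = ln(mu) - v^2 / 2."""
    v2 = math.log(1.0 + var / (mu * mu))
    return math.log(mu) - 0.5 * v2, math.sqrt(v2)

def feature_shift(ds_t, ds_prev):
    """z_shift of Eqn (53): l1 distance between the per-dimension
    feature means of two subsequent datasets (lists of tuples)."""
    def mean_vec(ds):
        return [sum(x[d] for x in ds) / len(ds) for d in range(len(ds[0]))]
    return sum(abs(a - b) for a, b in zip(mean_vec(ds_t), mean_vec(ds_prev)))
```

For example, a predicted mean 0.8 and variance 0.01 map to Beta parameters (12, 3), whose mean and variance recover the inputs.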
To do so, we construct two versions of our forecaster module $\mu_\phi(r_{i,j})$ that are designed to be less performant than our proposed method. UPF overfit: a baseline designed to overfit the training data. We use a Gaussian Process-based $\mu_\phi(r_{i,j})$ with no white noise kernel, using a single dot product kernel from scikit-learn. UPF overfit+noise: this variant further decreases performance by using the same overfitting model and adding random noise to the target values. We report two metrics: the average mean absolute error of our prediction $\mu$ and the average bias of our prediction $\mu_\phi(r_{i,j}) - a_{i,j}$ on the test set. We start by reporting the retraining performance of each baseline w.r.t. our base retraining metric, the AUC of cost values evaluated at different $\alpha$, in Table 5. As expected, the best-performing method on all datasets is our proposed UPF, which is expected to reach the best MAE on its performance prediction.

Table 5. AUC of the combined performance/retraining cost metric $\hat{C}_\alpha(\theta)$, computed over a range of $\alpha$ values, for all datasets. The bolded entries represent the best, and the underlined entries indicate the second best. The * denotes statistical significance with respect to the next best baseline, evaluated using a Wilcoxon test at the 5% significance level.

| | Gauss | circles | epicgames | electricity | yelp | airplanes |
|---|---|---|---|---|---|---|
| UPF overfit+noise | 0.3845 | 0.0722 | 0.3253 | 2.6389 | 0.1194 | 2.3767 |
| UPF overfit | 0.3849 | 0.0663 | 0.3224 | 2.6001 | 0.1194 | 2.3352 |
| UPF | 0.3836* | 0.0662* | 0.3203* | 2.5910* | 0.1175* | 2.3094* |

We then visualize the effect of the performance forecasting precision (measured with MAE and bias) on the decision algorithm's performance (measured by $\hat{C}_\alpha(\theta)$) in the following figures. Overall, we observe that the impact of poor performance depends on the difficulty of the underlying dataset.
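The two forecasting metrics used here can be stated precisely (a small sketch; the function name is ours):

```python
def forecast_metrics(preds, targets):
    """The two metrics reported for the forecaster: mean absolute error
    of the predicted means mu(r_ij) against the observed a_ij, and the
    mean signed bias of those predictions."""
    n = len(preds)
    mae = sum(abs(p - t) for p, t in zip(preds, targets)) / n
    bias = sum(p - t for p, t in zip(preds, targets)) / n
    return mae, bias
```

Note that the bias can be near zero even when the MAE is large, which is why both are reported.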
For the airplanes dataset, which is of standard difficulty, we can observe a gradual impact of the degradation in forecasting performance on the overall retraining metric in Figure 5. The best MAE leads to the best cost metric $\hat{C}_\alpha(\theta)$, and the performance gradually decreases as the MAE and bias worsen. The Epicgames dataset (Figure 6), which is more challenging due to its less regular performance trends, shows a different behavior. Here, the overall forecasting performance is worse (the best achievable MAE is higher), and we observe a less regular pattern where poorer MAE does not always result in a proportional increase in cost.

Figure 5. Airplanes. Cost $\hat{C}_\alpha(\theta)$ vs $\alpha$ with the forecasting performance metrics (MAE and bias).

Figure 6. Epicgames. Cost $\hat{C}_\alpha(\theta)$ vs $\alpha$ with the forecasting performance metrics (MAE and bias).

Similarly, when turning to the synthetic datasets, the circles dataset, which is constructed with concept drift (changing $p(Y|X)$), is more challenging than the Gauss dataset, which only exhibits feature drift (where $p(X)$ changes, but $p(Y|X)$ remains constant). This impacts the effect of poor forecasting performance. In Figure 7, for the circles dataset, we observe that a small decrease in MAE paired with stronger bias can have a more sudden and drastic effect on the decision policy. Conversely, in the Gauss dataset (Figure 8), the effect of poorer forecasting performance is less pronounced.

Figure 7. Circles. Cost $\hat{C}_\alpha(\theta)$ vs $\alpha$ with the forecasting performance metrics (MAE and bias).

Figure 8. Gauss.
Cost $\hat{C}_\alpha(\theta)$ vs $\alpha$ with the forecasting performance metrics (MAE and bias).

A.8. Extension to non-bounded metrics

In this section, we show how we can extend our methodology to model non-bounded metrics often used in regression tasks, such as the root mean square error (RMSE) or mean absolute error (MAE). To do so, we replace the Beta distribution with a log-Normal distribution to model our performance metric r.v. $A_{i,j}$. A log-Normal distribution is parameterized by a location $m$ and a scale parameter $v$. We can learn the mean and variance parameters using the same Gaussian approximation,

$$\text{LogNorm}(m(r_{i,j}), v(r_{i,j})) \approx \mathcal{N}(\mu(r_{i,j}), \sigma(r_{i,j})), \quad (54)$$

and recover the location and scale parameters using the standard moment relations:

$$v^2 = \ln\left(1 + \frac{\sigma}{\mu^2}\right), \quad (55)$$
$$m = \ln(\mu) - \frac{v^2}{2}. \quad (56)$$

A.8.1. BETA APPROXIMATION VS NORMAL

In our method, we approximate the Beta distribution with a Normal distribution to ease the learning process:

$$\text{Beta}(\alpha(r_{i,j}), \beta(r_{i,j})) \approx \mathcal{N}(\mu(r_{i,j}), \sigma(r_{i,j})). \quad (57)$$

We verify here that this approximation doesn't have too big an effect on the end performance. We compare the UPF method, which uses $A_{i,j} \sim \text{Beta}(\alpha(r_{i,j}), \beta(r_{i,j}))$, with UPF (Gaussian), which doesn't use the Beta distribution and instead uses a Gaussian with learned parameters to model the performance metric: $A_{i,j} \sim \mathcal{N}(\mu(r_{i,j}), \sigma(r_{i,j}))$. In Figures 9, 10, 11 and 12, we can see that this does not have too big an effect on the overall behavior and performance.

Figure 9. Gauss.

Figure 10. Electricity.

Figure 11. Circles.

Figure 12. Airplanes.

A.9. Training complexity

In this section, we compare the training complexity of each baseline. We report the average time required for offline training and online inference, and discuss runtime complexity. The CARA baseline comprises two computationally intensive components.
First, it constructs the C matrix, representing its performance estimation. This algorithm involves inferring, with a modified model, each point of the new dataset and reweighting each, which scales with $O(|D_{\text{new}}|)$. This needs to be done in both offline and online phases. Then, in the offline phase, it performs an annealing search over parameters to find the best value that minimizes this cost approximation, taking into account the retraining cost associated with each decision. In Table 6, we can see that this results in the highest runtime for both online and offline phases.

Table 6. Average runtime of the baselines on the circles dataset.

| | CARA cum. | CARA | CARA per. | UPF | ADWIN | FHDDM | KSWIN |
|---|---|---|---|---|---|---|---|
| Offline (ms) | 8.4871 | 8.6608 | 7.8461 | 0.0947 | 0.0274 | 0.0122 | 0.3392 |
| Online (one step, ms) | 1.5604 | 1.5046 | 1.5940 | 0.0247 | 0.0351 | 0.0103 | 0.3438 |

In comparison, our approach consists of fitting a linear model on a small dataset. The shift distribution features must be obtained, but they involve comparing two histograms, scaling as $O(w^2 |D_t|)$ rather than exponentially with $|D_t|$. The distribution shift baselines do not have an offline phase, as they monitor shifts in the underlying distribution continuously. Their runtime complexity is therefore very low, at $O(|D_t|)$, as reflected in Table 6.

A.10. Additional results

In this section, we include additional figures to visualize our results in Figures 13, 14, 15, 16, 17, 18, and 19. Overall, the results are generally consistent and exhibit a similar trend. The Epic Games dataset, however, is more challenging and presents greater difficulties for all baselines. In particular, UPF performs worse than other baselines at low values of the retraining cost ratio $\alpha$. For those operating points, UPF does reach the correct retraining frequency; however, it is unable to pinpoint the optimal moments to retrain, resulting in worse performance than baselines that retrain more frequently, as shown in the right panel of Figure 19.
Figure 13. Result on the electricity dataset. left) Cost $\hat{C}_\alpha(\theta)$ vs $\alpha$. right) Number of retrains vs $\alpha$. (Baselines shown: UPF, CARA, CARA cumul., CARA per., KSWIN-5%, KSWIN-50%, FHDDM-5%, FHDDM-50%, ADWIN-5%, ADWIN-50%.)

Figure 14. Result on the yelp dataset. left) Cost $\hat{C}_\alpha(\theta)$ vs $\alpha$. right) Number of retrains vs $\alpha$.

Figure 15. Result on the epicgames dataset. left) Cost $\hat{C}_\alpha(\theta)$ vs $\alpha$. right) Number of retrains vs $\alpha$.

Figure 16. Result on the Gauss dataset. left) Cost $\hat{C}_\alpha(\theta)$ vs $\alpha$. right) Number of retrains vs $\alpha$.

Figure 17. Result on the circles dataset. left) Cost $\hat{C}_\alpha(\theta)$ vs $\alpha$. right) Number of retrains vs $\alpha$.

Figure 18. Result on the airplanes dataset. left) Cost $\hat{C}_\alpha(\theta)$ vs $\alpha$. right) Number of retrains vs $\alpha$.

We additionally include results with the oracle baselines in Figure 19. We can see that the UPF baseline is reasonably close to the optimal algorithm on two of the datasets (circles and electricity), but struggles for the more challenging dataset, epicgames. Looking at the number of retrains, we can see that UPF more closely follows the retraining frequency of the oracle for all datasets.

Figure 19.
Result on the circles (left), electricity (middle) and epicgames (right) datasets. Top) Cost $\hat{C}_\alpha(\theta)$ vs $\alpha$. Bottom) Number of retrains vs $\alpha$.

A.11. Methodology as offline RL

We can frame the retraining problem as an offline RL task (Levine et al., 2020). We define a state space where each state is described by the index of the trained model and the timestep: $S \subseteq \{0, \dots, T\} \times \{0, \dots, T\}$. The action space is to either retrain or not, so $A = \{0, 1\}$. The state transitions are deterministic and known:

$$T(S_{t+1} \mid S_t = (i, t), A) = \begin{cases} 1 & \text{if } A = 0,\ S_{t+1} = (i, t+1), \\ 1 & \text{if } A = 1,\ S_{t+1} = (t+1, t+1), \\ 0 & \text{otherwise.} \end{cases} \quad (58)$$

Figure 20 provides a visualization of the MDP. Since the state transitions are deterministic, we can define the deterministic transition function:

$$s_{t+1} = t(a_t, s_t). \quad (59)$$

The reward function depends only on the end state (which describes the performance of a model $i$ evaluated at timestep $t$) and on the action. Using $pe_S$ to denote the performance at a state $S$ and reusing the trade-off parameter $\alpha$, we have the reward

$$r(a_t, s_{t+1}) = -\alpha a_t - pe_{s_{t+1}}. \quad (60)$$

Figure 20. Visualization of the MDP.

To match our setting, the discount factor has to be set to one, $\gamma = 1$. The goal is to learn a policy $\pi$ on offline data that generalizes to the online period. The offline dataset is given by $D_{\text{offline}} = \{s_n, a_n, r_n\}_{n=1}^{N}$. The objective is defined as:

$$J(\pi) = \mathbb{E}_{\tau \sim p_\pi(\tau)}\Big[\sum_{t=-w}^{T+w} r(s_t, a_t)\Big], \quad (61)$$

which is the same objective as we defined, with the added option of defining a random policy to make decisions, $p_\pi(\theta)$:

$$J(\pi) = \mathbb{E}_{\theta \sim p_\pi(\theta)}\Big[\sum_{t=-w}^{T+w} r(s_t, a_t)\Big] \quad (62)$$
$$= \mathbb{E}_{\theta \sim p_\pi(\theta)}\Big[\sum_{t=-w}^{T+w} -\alpha a_t - pe_{s_{t+1}}\Big] \quad (63)$$
$$= \mathbb{E}_{\theta \sim p_\pi(\theta)}[-C_\alpha(\theta)]. \quad (64)$$

Q-learning (approximate dynamic programming methods). The basic idea of Q-learning is to define a Q function and to derive a deterministic policy $\pi$ from it. The Q function is defined as follows:

$$Q^\pi(s_t, a_t) = \mathbb{E}_{\tau \sim p_\tau \mid s_t, a_t}\Big[\sum_{t'=t}^{T+w} r(s_{t'}, a_{t'})\Big], \quad (65)$$

and the policy is set to:

$$\pi(a_t \mid s_t) = \delta(a_t = \arg\max_a Q(s_t, a)).$$
(66)

Since the optimal policy π* should satisfy

Q*(s_t, a_t) = r(s_t, a_t) + E_{s_{t+1}∼T(s_{t+1}|s_t,a_t)}[ max_{a_{t+1}} Q*(s_{t+1}, a_{t+1}) ], (67)

one algorithm is to train Q_ϕ until that equation is satisfied. In our case, the transition is deterministic, so we can define s_{t+1} = t(s_t, a_t) and have

Q*(s_t, a_t) = r(s_t, a_t) + max_{a_{t+1}} Q*(t(s_t, a_t), a_{t+1}). (68)

The idea is then to parameterize Q_ϕ and minimize the following for all samples in the dataset using the Bellman update:

Σ_n ( Q_ϕ(s_n, a_n) − [ r(s_n, a_n) + max_{a'} Q_ϕ(s'_n, a') ] )². (69)

First we set the target:

y_n = r(s_n, a_n) + max_{a'} Q_ϕ(s'_n, a'), (70)

then we optimize ϕ:

min_ϕ Σ_n ( Q_ϕ(s_n, a_n) − y_n )², (71)

and the algorithm iterates between those two steps. We can therefore apply any Q-learning method to our problem, provided that it uses a standard Q_ϕ parameterization.

Connecting Q-learning to our UPF algorithm. In our setting, we have special knowledge of the structure of Q. First, there is no randomness in the state transition, so we know that:

y_n = r(s_n, a_n) + max_{a_{n+1}} Q_ϕ(t(s_n, a_n), a_{n+1}). (72)

By definition, we have that:

Q_ϕ(s_t, a_t) = −a_t α − pe_{s_{t+1}} + max_{a_{t+1}} Q_ϕ(t(s_t, a_t), a_{t+1}). (73)

While computing the Bellman update and setting the target, we can see that the Q function of one of the last states, Q_ϕ(s_{T,x}, ·), has to predict the end performance:

Q_ϕ(s_{T,x}, ·) = −pe_{s_{T,x}} (74)
               = −f_ϕ(s_{T,x}). (75)

By the DAG structure of the transition function, and since the α value is known, we can recursively parameterize all the Q_ϕ functions with shareable components:

Q_ϕ(s_{T−1,x}, a_{T−1,x}) = −α a_{T−1,x} − f_ϕ(s_{T−1,x}) + max( −α − f_ϕ(s_{T,T}), −f_ϕ(s_{T,x}) ), (76)

where each f_ϕ(s) models the performance pe_s at the corresponding state. The MSE objective that is traditionally applied (Eqn.
71) can then be decomposed into two terms, where one of the terms corresponds to our objective:

L = Σ_n ( Q_ϕ(s_n, a_n) − y_n )² (77)
  = Σ_n ( −α a_n − f_ϕ(s_n) + max( −α − f_ϕ(s_{T,T}), −f_ϕ(s_{T,x}) ) (78)
      − ( −a_n α − pe_{s_n} + max_{a_{n+1}} Q_ϕ(t(s_n, a_n), a_{n+1}) ) )² (79)
  = Σ_n ( −f_ϕ(s_n) + pe_{s_n} + max( −α − f_ϕ(s_{T,T}), −f_ϕ(s_{T,x}) ) − max_{a_{n+1}} Q_ϕ(t(s_n, a_n), a_{n+1}) )² (80)
L = Σ_n ( f_ϕ(s_n) − pe_{s_n} )² + C. (81)

The term (f_ϕ(s_n) − pe_{s_n})² in the loss function aligns with our objective, as A_{i,j} represents our model's approximation of the performance metric pe_{i,j}. Therefore, with this specific parameterization, we can establish a connection between Q-learning and our learning method. However, as noted in the main text, applying existing ORL methods to this problem would not be effective. The problem involves a deterministic transition matrix and a highly structured reward, both of which are uncommon in typical RL settings. Additionally, most RL methods prioritize scalability to large state or action spaces, use complex models, and assume access to plentiful data, making them ill-suited to our scenario. A key requirement for our approach is training efficiency, given our limited performance data and the need for online adaptation as more information becomes available. If the computational cost of deciding when to retrain is comparable to that of retraining itself, the approach becomes impractical.

A.11.1. OFFLINE RL BASELINES

In this section, we present results using an offline RL baseline that is appropriate for low-data settings: Least-Squares Policy Iteration (LSPI) (Lagoudakis & Parr, 2003). We follow the detailed RL formulation presented above. To implement LSPI, we use the model index i and timestep t as the state (following the formulation from the previous section). In LSPI, various approximation methods are introduced to solve the linear system, but these are unnecessary in our case, as we can solve it exactly due to the small size of our problem.
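Because the transitions in Eq. (58) are deterministic and the horizon is short, the exact Q function for this MDP can also be obtained by backward induction rather than a learned approximation. The sketch below illustrates the Bellman recursion of Eq. (68) on a toy instance; the performance-cost matrix pe and the trade-off parameter alpha are hypothetical inputs, and since in practice pe would have to be forecast (which is what UPF does), this corresponds to an oracle-style solution:

```python
import numpy as np

def optimal_retrain_schedule(pe, alpha):
    """Exact backward induction on the deterministic retraining MDP.

    pe[i, t] -- performance cost of the model trained at step i when used at
                step t (only i <= t is meaningful); hypothetical here, but
                forecast by UPF in practice.
    alpha    -- cost incurred each time the model is retrained.
    Returns (optimal total cost, list of retrain decisions a_t in {0, 1}).
    """
    T = pe.shape[1]
    # Q[i, t, a]: cost-to-go from state (model i, time t) under action a.
    # Q[:, T, :] stays 0 because no cost accrues beyond the horizon.
    Q = np.zeros((T, T + 1, 2))
    for t in range(T - 1, -1, -1):              # Bellman backups, last step first
        for i in range(t + 1):                  # model index never exceeds t
            Q[i, t, 0] = pe[i, t] + Q[i, t + 1].min()          # keep model i
            Q[i, t, 1] = alpha + pe[t, t] + Q[t, t + 1].min()  # retrain now
    # Greedy rollout of the optimal policy, accumulating the realized cost.
    i, actions, total = 0, [], 0.0
    for t in range(T):
        a = int(Q[i, t].argmin())
        actions.append(a)
        if a == 1:
            i = t
        total += alpha * a + pe[i, t]
    return total, actions
```

For instance, with a 3-step horizon where the deployed model degrades quickly, the rollout trades the retraining cost alpha against the accumulated performance cost exactly as in the cost objective above.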
We present several versions of this baseline by varying the λ parameter. In Table 7, we can see that this proposed baseline is not competitive. These initial results for this basic formulation of the offline RL problem indicate that more care and design are needed to appropriately solve this problem using offline RL, supporting our claim that existing RL methods, as they are, may not be well suited to it.

Table 7. AUC of the combined performance/retraining cost metric ˆCα(θ), computed over a range of α values, for all datasets. The bolded entries represent the best, and the underlined entries indicate the second best. The * denotes a statistically significant difference with respect to the next best baseline, evaluated using a Wilcoxon test at the 5% significance level.

                 electricity   Gauss     circles   airplanes   yelpCHI   epicgames   iWild
ADWIN-5%         2.8099        0.4533    0.0753    2.6353      0.1298    0.3217      3.7371
ADWIN-50%        2.8131        0.4848    0.0753    2.7147      0.1298    0.3238      4.2564
KSWIN-5%         3.8979        0.3975    0.0753    3.2300      0.1322    0.3420      4.4268
KSWIN-50%        4.0521        0.9530    0.0794    3.2042      0.1655    0.3537      4.4268
FHDDM-5%         3.1525        0.3893    0.0753    2.6577      0.1324    0.3298      4.4267
FHDDM-50%        3.4037        0.5918    0.0772    2.7077      0.1450    0.3389      4.4268
CARA cumul.      2.7147        0.3862    0.0731    2.2900      0.1299    0.3228      3.8922
CARA per.        2.8986        0.4678    0.0800    2.4061      0.1318    0.3260      3.7527
CARA             2.7198        0.3841    0.0726    2.2753*     0.1294    0.3202      3.9506
LSPI λ = 1       4.3820        1.0530    0.2412    3.7140      0.1493    0.3523      -
LSPI λ = 0.5     4.5260        1.0837    0.2455    3.6924      0.1442    0.3566      -
LSPI λ = 0.0     4.5317        1.0933    0.2478    3.5862      0.1378    0.3573      -
UPF (ours)       2.5782*       0.3829*   0.0668*   2.2865      0.1293*   0.3189*     3.0498*
oracle           2.4217        0.3724    0.0627    2.2298      0.1275    0.3170      2.4973

A.12. Relating our objective to the CARA formulation

Although Mahadevan & Mathioudakis (2024) also tackle the retraining problem, they formulate it differently.
Instead of using a binary vector to model the retraining decisions, they use a sequence of model indices S = [s_1, ..., s_T] with the constraint that s_t ∈ {0, ..., t}. If s_t = t, it signifies a retrain. The cost objective they consider is similar to ours; they sum over the timesteps to get the cumulative total cost. The cost per timestep is encoded in an upper-triangular matrix C:

C[t', t] = { Ψ_{t',t} if t' < t; κ if t' = t (cost of retraining) }. (82)

The total cost is defined as:

C_cara(S) = Σ_{t=1}^{T} C[s_t, t]. (83)

The staleness cost is defined as the cost of using a model f_1 to classify data from Q_2, approximated using a dataset D_3:

Ψ(Q_2, D_3, f_1) = Σ_{q∈Q_2} (1/|D_3|) Σ_{(x,y)∈D_3} sim(q, x) ℓ(f_1, x, y). (84)

The aim of this metric is to predict the performance of f_1 on the query points in Q_2 by computing the loss on a reference dataset D_3, weighting the loss at each sample of D_3 by how similar it is to the query samples in Q_2 (this is the role of sim(q, x)):

ℓ(f_1(q), y_q) ≈ (1/|D_3|) Σ_{(x,y)∈D_3} sim(q, x) ℓ(f_1, x, y), (85)

so that

Ψ(Q_2, D_3, f_1) ≈ Ne E_{Q_2}[ ℓ(f_1(X), Y) ] (86)
               ≈ Ne pe_{t_1,t_2}. (87)

The relative staleness cost is defined as the difference between staleness costs:

Ψ_{t',t} = Ψ(Q_t, D_t, f_{t'}) − Ψ(Q_t, D_t, f_t). (88)

This is intended to approximate the relative gap in performance:

Ψ_{t',t} ≈ Ne (pe_{t',t} − pe_{t,t}). (89)

In our experiments, we directly use Ψ(Q_t, D_t, f_{t'}) as an approximation of pe_{t',t} and apply the CARA algorithm directly on the staleness costs instead of the relative staleness costs.

Relating it to our formulation. Our objective is given by:

C(θ) = c ||θ||_1 + Ne Σ_{t=1}^{T} pe_{rθ,t}. (90)

To understand the connection with our formulation, we start by rewriting the CARA cost as:

C_cara(S) = Σ_{t=1}^{T} 1[s_t = t] κ + 1[s_t < t] Ψ_{s_t,t} (91)
         ≈ Σ_{t=1}^{T} 1[s_t = t] κ + Ne 1[s_t < t] (pe_{s_t,t} − pe_{t,t}) from (89) (93)
C_cara(θ) = κ ||θ||_1 + Ne Σ_{t=1}^{T} (pe_{rθ,t} − pe_{t,t}), switching to our notation with θ.
(94)

This reveals the assumptions required for the two solutions to coincide. First, the approximation for the loss of a future model f_{t'} should hold:

ℓ(f_{t'}(x_q), y_q) ≈ (1/|D_t|) Σ_{(x,y)∈D_t} sim(x_q, x) ℓ(f_{t'}, x, y). (95)

Second, in order to have

C(θ) = C_cara(θ), (96)

we need

κ = c + (Ne Σ_{t=1}^{T} pe_{t,t}) / ||θ||_1. (97)

Proof: We require that

c ||θ||_1 + Ne Σ_{t=1}^{T} pe_{rθ,t} = κ ||θ||_1 + Ne Σ_{t=1}^{T} (pe_{rθ,t} − pe_{t,t}). (98)

This implies that

c ||θ||_1 + Ne Σ_{t=1}^{T} pe_{rθ,t} = κ ||θ||_1 + Ne Σ_{t=1}^{T} pe_{rθ,t} − Ne Σ_{t=1}^{T} pe_{t,t}, (99)

and hence that

κ = c + (Ne Σ_{t=1}^{T} pe_{t,t}) / ||θ||_1. (100)

The cost of retraining κ in the CARA formulation must thus scale with the minimum performance cost that can be obtained by always using the most recent model, Ne Σ_{t=1}^{T} pe_{t,t}, divided by the number of retrains that have been made. It is of course not possible to set κ to this value, as it depends on θ, but it gives insight into how the formulations relate to each other.

A.13. Varying training data size

In this section, we provide experimental results where we assume access to fewer offline time steps and analyze how this impacts the results. We display the relative improvement of the best baseline over the competing baselines by reporting normalized AUC values in Tables 8, 9, and 10. Overall, our method remains effective in scenarios with reduced training data. It demonstrates greater robustness compared to the CARA baselines, which can be explained by the fact that it can adapt to new information received during the online process, which CARA cannot do. With very few training steps (w = 2), the CARA baselines suffer the most, reaching more than twice the error on some datasets. With more data (w = 4), the relative performance is more in line with larger datasets (w = 7), with UPF remaining the best.

Table 8. w = 2. Normalized AUC of the combined performance/retraining cost metric ˆCα(θ), computed over a range of α values, for all datasets. We normalize by dividing by the best value for each dataset.
The bolded entries represent the best. The * denotes statistical significance with respect to the next best baseline, evaluated using a Wilcoxon test at the 5% significance level.

w = 2        electricity   airplanes   yelpCHI   epicgames   Gauss     circles
CARA         1.0000        1.0101      1.0100    1.0282      2.6519    1.4792
CARA c.      1.0669        1.0680      0.0544    2.7437      4.0150    1.6872
CARA per.    2.1971        1.6703      0.0661    2.9131      10.6965   1.8901
UPF          1.0258        1.0000*     1.0000*   1.0000      1.0000*   1.0000*

A.14. Results on the Wild-Time temporal dataset

In this section, we present preliminary results on one dataset from the suite of temporal datasets of Yao et al. (2022): the yearbook dataset. To construct our sequence of datasets D_t, ..., we follow the construction from Yao et al. (2022). For training, we iteratively add more samples from each year, spanning 1930 to 2012. For testing, we evaluate only on samples from the most

Table 9. w = 4. Normalized AUC of the combined performance/retraining cost metric ˆCα(θ), computed over a range of α values, for all datasets. We normalize by dividing by the best value for each dataset. The bolded entries represent the best. The * denotes statistical significance with respect to the next best baseline, evaluated using a Wilcoxon test at the 5% significance level.

w = 4        electricity   airplanes   yelpCHI   epicgames   Gauss     circles
CARA         1.0093        1.0024      1.0000    1.0063      1.0049    1.0653
CARA per.    1.1029        1.0721      1.0017    1.0168      1.0984    1.0045
CARA c.      1.0153        1.0060      1.0025    1.0220      1.0042    1.0501
UPF          1.0000*       1.0000*     1.0008    1.0000*     1.0000*   1.0000*

Table 10. w = 7. Normalized AUC of the combined performance/retraining cost metric ˆCα(θ), computed over a range of α values, for all datasets. We normalize by dividing by the best value for each dataset. The bolded entries represent the best. The * denotes statistical significance with respect to the next best baseline, evaluated using a Wilcoxon test at the 5% significance level.
w = 7        electricity   airplanes   yelpCHI   epicgames   Gauss     circles
CARA c.      1.0530        1.0065      1.0046    1.0122      1.0086    1.0944
CARA per.    1.1244        1.0575      1.0193    1.0223      1.2219    1.1976
CARA         1.0549        1.0000*     1.0008    1.0041      1.0031    1.0868
UPF (ours)   1.0000*       1.0050      1.0000*   1.0000*     1.0000*   1.0000*

recent year. As for the model f_t, we use the ERM model from Yao et al. (2022) and follow their training procedure. We use a setup similar to the one followed in our experiments, setting the offline window size w = 7, evaluating over an online phase of T = 8 steps, and presenting results over 10 trials (see Table 11). Preliminary results for this dataset, which can be seen in Table 12, are in line with the results from the main paper.

Table 11. Dataset description. w denotes the number of timesteps of the offline phase, T denotes the number of timesteps of the online phase. Model describes the architecture used for each f_t.

Dataset    Model   αmax   w   M<0   T   Dataset size (|D|)   Num. features   Task
yearbook   ERM     0.5    7   21    8   (varies)             32×32×3         Binary

A.15. List of timm pretrained vision models

beit_base_patch16_224, beitv2_base_patch16_224, caformer_s18, cait_s24_224, cait_xxs24_224, cait_xxs36_224, coat_lite_mini, coat_lite_small, coat_lite_tiny, coat_mini, coat_tiny, coatnet_0_rw_224, coatnet_bn_0_rw_224, coatnet_nano_rw_224, coatnet_rmlp_1_rw_224, coatnet_rmlp_nano_rw_224,

Table 12. AUC of the combined performance/retraining cost metric ˆCα(θ), computed over a range of α values. The bolded entries represent the best, and the underlined entries indicate the second best. The * denotes a statistically significant difference with respect to the next best baseline, evaluated using a Wilcoxon test at the 5% significance level.

CARA cumul.   0.0351
CARA per.
0.0195
CARA          0.0322
UPF           0.0120*
Oracle        0.0105

coatnext_nano_rw_224, convformer_s18, convit_base, convit_small, convit_tiny, convmixer_1024_20_ks9_p14, convnext_atto, convnext_atto_ols, convnext_base, convnext_femto, convnext_femto_ols, convnext_nano, convnext_nano_ols, convnext_pico, convnext_pico_ols, convnext_small, convnext_tiny, convnext_tiny_hnf, convnextv2_atto, convnextv2_femto, convnextv2_nano, convnextv2_pico, convnextv2_tiny, crossvit_15_240, crossvit_15_dagger_240, crossvit_15_dagger_408, crossvit_18_240, crossvit_18_dagger_240, crossvit_9_240, crossvit_9_dagger_240, crossvit_base_240, crossvit_small_240, crossvit_tiny_240, cs3darknet_focus_l, cs3darknet_focus_m, cs3darknet_l, cs3darknet_m, cs3darknet_x, cs3edgenet_x, cs3se_edgenet_x, cs3sedarknet_l, cs3sedarknet_x, cspdarknet53, cspresnet50, cspresnext50, darknet53, darknetaa53, davit_base, davit_small, davit_tiny, deit3_base_patch16_224, deit3_medium_patch16_224, deit3_small_patch16_224, deit_base_distilled_patch16_224, deit_base_patch16_224, deit_small_distilled_patch16_224, deit_small_patch16_224, deit_tiny_distilled_patch16_224, deit_tiny_patch16_224, densenet121, densenet161, densenet169, densenet201, densenetblur121d, dla102, dla102x, dla102x2, dla169, dla34, dla46_c, dla46x_c, dla60, dla60_res2net, dla60_res2next, dla60x, dla60x_c, dm_nfnet_f0, dm_nfnet_f1, dpn68,
dpn68b, dpn92, dpn98, eca_nfnet_l0, eca_nfnet_l1, eca_nfnet_l2, eca_resnet33ts, eca_resnext26ts
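Finally, the equivalence derived in Appendix A.12 can be verified numerically. The following sketch instantiates our cost from Eq. (90), the κ of Eq. (100), and the rewritten CARA cost of Eq. (94), and checks that they coincide; all quantities (T, c, Ne, theta, and the pe matrix) are hypothetical, chosen only to exercise the algebra:

```python
import numpy as np

# Hypothetical numbers for illustration only.
T, c, Ne = 5, 1.0, 10.0
theta = np.array([1, 0, 1, 0, 0])  # retrain at steps 0 and 2, ||theta||_1 = 2
rng = np.random.default_rng(0)
pe = np.triu(rng.uniform(0.05, 0.4, (T, T)))  # pe[i, t]: cost of model i at step t

def r_theta(t):
    """Index of the deployed model at step t: the most recent retrain <= t."""
    retrains = [i for i in range(t + 1) if theta[i] == 1]
    return retrains[-1] if retrains else 0

# Our cost, Eq. (90): retraining cost plus accumulated performance cost.
C_ours = c * theta.sum() + Ne * sum(pe[r_theta(t), t] for t in range(T))

# kappa from Eq. (100), then the CARA cost in our notation, Eq. (94).
kappa = c + Ne * sum(pe[t, t] for t in range(T)) / theta.sum()
C_cara = kappa * theta.sum() + Ne * sum(
    pe[r_theta(t), t] - pe[t, t] for t in range(T)
)
assert np.isclose(C_ours, C_cara)  # the two formulations coincide
```

As the proof shows, the match is exact by construction: κ folds the always-fresh performance cost Ne Σ pe_{t,t} back into the per-retrain charge, so any other choice of κ breaks the equality.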