# When to retrain a machine learning model

Florence Regol 1 2, Leo Schwinn 3, Kyle Sprague 2, Mark Coates 1, Thomas Markovich 2

1 McGill University, Canada; 2 Block, Toronto, Canada; 3 Technical University of Munich, Germany. Correspondence to: Florence Regol.

A significant challenge in maintaining real-world machine learning models is responding to the continuous and unpredictable evolution of data. Most practitioners are faced with a difficult question: when should I retrain or update my machine learning model? This seemingly straightforward problem is particularly challenging for three reasons: 1) decisions must be made based on very limited information, as we usually have access to only a few examples; 2) the nature, extent, and impact of the distribution shift are unknown; and 3) it involves specifying a cost ratio between retraining and poor performance, which can be hard to characterize. Existing works address certain aspects of this problem, but none offer a comprehensive solution. Distribution shift detection falls short as it cannot account for the cost trade-off; the scarcity of the data, paired with its unusual structure, makes it a poor fit for existing offline reinforcement learning methods; and the online learning formulation overlooks key practical considerations. To address this, we present a principled formulation of the retraining problem and propose an uncertainty-based method that makes decisions by continually forecasting the evolution of model performance evaluated with a bounded metric. Our experiments addressing classification tasks show that the method consistently outperforms existing baselines on 7 datasets.

1. Introduction

In many industrial machine learning settings, data are continuously arriving and evolving (Gama et al., 2014). This means that a model, $f_\theta$, that was trained on a fixed dataset, $D$, will become outdated. This usually translates to a cost in
the form of a missed opportunity. However, retraining a new model, $f_{\theta'}$, on a more up-to-date dataset, $D'$, is also costly. Beyond the obvious costs of computational resources and energy (Strubell et al., 2020), there are human resource costs associated with assigning experts to deploy and maintain the model, as well as collecting and cleaning data. Deploying a new model also generally comes with a higher risk. Therefore, the optimal retraining schedule depends on this comprehensive cost of retraining, on the cost of making mistakes, and on future model performance. Figure 1 provides a visualization of the task.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Although this retraining problem is ubiquitous in industry (Gama et al., 2014), there are few works in the machine learning literature that tackle it directly. It has been framed as an application of the distribution shift detection problem (Bifet & Gavaldà, 2007), where the conventional strategy involves triggering retraining whenever a substantial shift is detected (Bifet & Gavaldà, 2007; Cerqueira et al., 2021; Pesaranghader & Viktor, 2016). However, this approach overlooks retraining costs. This can be particularly problematic when training is expensive, as demonstrated in our experiments. Others have reduced the need for retraining by incorporating robustness to distribution shifts (Schwinn et al., 2022) or adapting to them (Filos et al., 2020), but these methods have limits on the extent of the shift they can handle. Other related areas include online, adaptive, life-long, and transfer learning (Hoi et al., 2021), which aim to update models to new or evolving data distributions. However, these methods are primarily concerned with maximal model performance, while the goal of our work is to explicitly minimize the overall cumulative cost.
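To make this trade-off concrete, consider a toy example (entirely illustrative, not from the paper's experiments): suppose the per-step loss grows linearly with the time since the last retrain, and each retrain costs a fixed amount. A short sketch comparing fixed retraining periods:

```python
def schedule_cost(decay, alpha, T, period):
    """Toy illustration of the retraining trade-off: the per-step loss
    grows by `decay` for every step since the last retrain, and each
    retrain costs `alpha`. Returns the total cost over horizon T when
    retraining every `period` steps (period=0 means never retrain)."""
    total, since = 0.0, 0
    for t in range(1, T + 1):
        if period and t % period == 0:
            total += alpha   # pay the retraining cost
            since = 0        # performance resets to fresh-model level
        total += decay * since
        since += 1
    return total
```

With these illustrative numbers, retraining every 4 steps (`schedule_cost(0.1, 0.5, 12, 4)`) is cheaper than retraining every step or never, which is exactly the scheduling tension formalized below.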
In particular, continual learning approaches and the like cannot delay updates due to future cost considerations. Moreover, in practice, the cost of retraining can go beyond the number of gradient updates or sample complexity, as discussed above. Finally, because this is a sequential decision problem, it can be framed within the offline reinforcement learning framework (Levine et al., 2020). In theory, offline RL methods should be applicable, but few, if any, are designed for very low-data settings. They require substantial amounts of data for training and hyperparameter tuning, and are therefore largely unsuitable in this context.

A direct treatment of the cost consideration in the retraining problem is presented by Žliobaitė et al. (2015) and by Mahadevan & Mathioudakis (2024). The formulation by Mahadevan & Mathioudakis (2024) accounts for the trade-off between the cost of retraining and the cost of degraded performance. Their method, CARA, relies on approximating the performance of a model on new data, and the retraining decision is based on this value. However, this approach makes several limiting assumptions: 1) the relative cost objective assumes that the difficulty of the task remains constant; and 2) the performance approximation assumes the data distribution is almost stationary.

Figure 1. The Retraining Problem: The performance of a model trained on a dataset $D_i$ gradually decreases when evaluated on more recent datasets in the presence of distribution shift. The task is to determine when retraining is beneficial compared to keeping an older model. We must take into consideration the trade-off between potential accuracy gains and the costs associated with retraining. In the retraining schedule $\theta$ shown here, retraining occurs twice, at $t = 4$ and $t = 8$.
Instead, we consider a more general objective that combines both the retraining cost and the average performance over a specified horizon. We detail the relationship between our objective and CARA's objective in Appendix A.12. Our formulation is more general and does not depend on strong assumptions regarding the data distribution and its impact on performance. Additionally, our method can leverage new observations of the model's performance. Our proposed method involves forecasting the performance of both future and current models and making decisions based on the uncertainty of our predictions. There is no constraint on how the retrained model is obtained: it can be fine-tuned from a previous model, trained from scratch, or derived using any other procedure. We show the effectiveness of our approach on five real datasets and two synthetic datasets.

We make the following contributions:
- We introduce a principled formulation of a practical version of the retraining problem. We explain connections to existing formulations and offline reinforcement learning.
- We establish upper limits on the optimal number of retrains based on performance bounds, which can be used to determine whether retraining should be considered at all.
- We propose a novel retraining decision procedure based on performance forecasting: UPF.
- Our proposed algorithm outperforms existing baselines. It requires minimal data by fully leveraging the structure specific to the retraining problem, employing compact regression models, and balancing the uncertainty caused by data scarcity through an uncertainty-informed decision process.

2. Related Work

We discuss related work and fields relevant to the retraining problem. A more detailed literature review, including connections to other related fields, is provided in Appendix A.3.

Retraining problem. Few works explicitly target the retraining problem. Žliobaitė et al.
(2015) propose a return on investment (ROI) framework to monitor and assess the retraining decision process, but do not introduce a method for actually deciding when to retrain. Mahadevan & Mathioudakis (2024) develop a retraining decision algorithm, CARA, which integrates the cost of retraining and introduces a staleness cost for persisting with an old model. CARA approximates the staleness cost using offline data consisting of several trained models and their historical performance. Three versions of CARA are proposed: (i) retraining if the estimated staleness exceeds a threshold; (ii) retraining based on estimated cumulative staleness; or (iii) identifying an optimal retraining frequency. While providing promising results, CARA requires access to some of the data that will be used for retraining, is very computationally intensive, and does not adapt to data obtained during the online decision period. Hoffman et al. (2025) address a related problem: deciding whether to retain the current model (i.e., no retraining), fully retrain it, or refine it via fine-tuning. The authors formulate an objective that balances retraining cost, the impact of concept drift (ambiguity), the uncertainty associated with each option (risk), and the expected performance.

Distribution shift detection. The retraining problem is closely connected to distribution shift detection and mitigation (Wang et al., 2024a; Hendrycks & Gimpel, 2017; Rabanser et al., 2019). Some approaches decide to adapt a model after detection of a changed distribution (Sugiyama & Kawanabe, 2012; Zhang et al., 2023). Since the signal for these methods is designed to adapt a model rather than trigger a full retraining, they are not appropriate to be used as full retraining signals. Other approaches, however, directly treat the detection of a distribution shift as a cue for retraining.
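As a concrete illustration of such a cue, a Hoeffding-bound monitor over the stream of per-prediction correctness can be sketched as follows (in the spirit of the FHDDM-style detectors discussed below; the window size, confidence level, and reset convention are our illustrative choices, not the original implementation):

```python
import math
from collections import deque

class HoeffdingDriftMonitor:
    """Illustrative sliding-window drift monitor. Tracks the fraction of
    correct predictions over a window of size n and flags drift when the
    current fraction drops below the best fraction seen so far by more
    than the Hoeffding bound eps = sqrt(ln(1/delta) / (2 n))."""

    def __init__(self, n=100, delta=1e-4):
        self.n = n
        self.eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
        self.window = deque(maxlen=n)
        self.mu_max = 0.0

    def update(self, correct):
        """Feed one 0/1 outcome; return True if drift is detected."""
        self.window.append(1.0 if correct else 0.0)
        if len(self.window) < self.n:
            return False
        mu = sum(self.window) / self.n
        self.mu_max = max(self.mu_max, mu)
        if self.mu_max - mu > self.eps:
            # Reset internal state after signalling (one common convention).
            self.window.clear()
            self.mu_max = 0.0
            return True
        return False
```

A detection-triggered policy would simply retrain whenever `update` returns True, which is precisely the strategy whose cost-blindness is criticized above.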
ADWIN (Bifet & Gavaldà, 2007) uses statistical testing of the label or feature distribution. Another approach is to directly monitor the model's performance. FHDDM (Pesaranghader & Viktor, 2016) employs Hoeffding's inequality, while Raab et al. (2020) propose a method that relies on a Kolmogorov-Smirnov Windowing test. These approaches work well with low retraining costs, but perform poorly when retraining costs are high, as they tend to recommend retraining far too often. Additionally, they lack adaptability to varying costs, and it is difficult to determine the correct significance level to use for a given retraining-to-performance cost ratio.

Offline reinforcement learning. Lastly, we discuss connections to offline reinforcement learning (ORL), where an agent must learn a policy from a fixed dataset of rewards, actions, and states. This subset of RL is challenging, as the agent cannot explore and must rely on the dataset to infer underlying dynamics and handle distribution shifts. Levine et al. (2020) provide an extensive review. Q-learning and value function methods, which focus on predicting future action costs, have become the preferred approaches (Levine et al., 2020; Kalashnikov et al.; Hejna et al., 2023; Kostrikov et al., 2022). Some methods incorporate epistemic uncertainty into the Q-function to address distribution shifts of unseen actions (Kumar et al., 2020; Luis et al., 2023). If we view the states as encoding both time and the model in use, and actions as either retraining or maintaining the current model, we can frame our problem as ORL. However, most existing RL approaches focus on scaling to large state or action spaces, employ large models, and assume access to abundant data, making them unsuitable for our context. A more detailed discussion of the connections and limitations of ORL methods is included in Appendix A.11.

3. Problem Setting

We outline our formulation of the retraining problem.
We have access to a sequence of datasets, $D_{-w}, \dots, D_0, \dots, D_T$, with features and labels $x_{i,t} \in \mathcal{X}_t$, $y_{i,t} \in \mathcal{Y}_t$, $D_t = \{(x_{i,t}, y_{i,t})\}_{i=1}^{|D_t|}$, which are assumed to be drawn from a sequence of distributions $D_t \sim p_t$. In practice, this reflects the gradual distribution shifts that occur when collecting data over time, so we specifically cannot assume that $p_t = p_{t+1}$ (this would correspond to a special case of the problem, which we refer to as the no distribution shift case). The datasets are acquired at discrete times $t \in [-w, \dots, 0, \dots, T]$. The sequence is split into an offline period that spans $t \in [-w, \dots, 0]$, followed by an online period $t \in [1, \dots, T]$. At each time step $t$ of the online period, we are given the option to (re)train a model $f_t$, using the data acquired up until time $t$, for a retraining cost of $c_t$. Datasets and trained models can be formed and obtained through any means depending on the task; for example, $f_1$ could be fine-tuned from $f_0$ and $D_1$ could contain $D_0$. The complete sequence of decisions can be encoded as a binary vector $\theta \in \{0,1\}^T$, where $\theta_t = 1$ indicates that we retrain the model at time $t$. We introduce $r_\theta(t)$ as a mapping function that returns, at time $t$, the last training time, i.e., $r_\theta(t) = \max\{t' \in [1, \dots, t] \text{ s.t. } \theta_{t'} = 1\}$, or $r_\theta(t) = 0$ if no retraining has occurred up to time $t$. At each time step $t$, we are required to generate a certain number of predictions $N_t$ on a test set, which incurs a loss $\ell(\hat{y}, y)$, scaled by a cost $e_t$. This would correspond to actually using the model to make predictions; for example, in fraud detection, failing to detect a fraudulent transaction costs $e_t$, and approximately $N_t$ transactions are verified at time $t$. To make these predictions at time $t$, we use the most recently trained model, which we denote by $f_{r_\theta(t)}$. To ensure that there is always at least one model available during the online period, we always train the last offline model $f_0$.
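The decision encoding and the cost structure above can be sketched directly in a few lines (illustrative Python, not the paper's implementation; `pe` is assumed to be a known matrix of expected losses, and the cost is written up to a constant scaling):

```python
def last_train_time(theta, t):
    """r_theta(t): most recent (re)training time at or before t, or 0 if
    no retraining has occurred yet (the offline model f_0 is in use).
    theta[k-1] holds the decision made at online time k."""
    for k in range(t, 0, -1):
        if theta[k - 1] == 1:
            return k
    return 0

def retraining_cost(theta, pe, alpha):
    """Cumulative online cost up to a constant factor: the retraining
    term alpha * ||theta||_1 plus, at each step, the expected loss of the
    model currently in use. pe[i][j] is the expected loss of the model
    trained at time i when evaluated at time j (i <= j)."""
    T = len(theta)
    return alpha * sum(theta) + sum(
        pe[last_train_time(theta, t)][t] for t in range(1, T + 1))
```

For the schedule of Figure 1 (retrains at t = 4 and t = 8), `last_train_time` returns 0 for t < 4, then 4, then 8.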
The target cost is a function of $\theta$, which encodes the retraining decisions, and combines two costs: the cost associated with model performance, $\sum_{t=1}^{T} e_t \sum_{i=1}^{N_t} \ell(f_{r_\theta(t)}(x_{i,t}), y_{i,t})$, and the cost to retrain, $\theta_t c_t$:

$$C_\alpha(\theta) = \mathbb{E}\Big[\sum_{t=1}^{T} \Big( e_t \sum_{i=1}^{N_t} \ell(f_{r_\theta(t)}(x_{i,t}), y_{i,t}) + \theta_t c_t \Big)\Big]. \quad (1)$$

To make the expression more concise, we condense the expected loss into a scalar $pe_{i,j}$, where the two indices denote the model index and the timestep, respectively:

$$pe_{i,j} = \begin{cases} \mathbb{E}_{D_j}[\ell(f_i(X_j), Y_j)], & \text{if } i \le j, \\ 0, & \text{otherwise.} \end{cases} \quad (2)$$

We can simplify the problem by assuming a fixed cost of retraining, $c_t = c$, cost of loss, $e_t = e$, and number of predictions, $N_t = N$. The solutions we develop are easily extended to the case where these are varying, but known, quantities. Introducing a cost-to-performance ratio parameter $\alpha = \frac{c}{e N}$, we can compactly write the online objective as:

$$C_\alpha(\theta) = e N \Big( \alpha \|\theta\|_1 + \sum_{t=1}^{T} pe_{r_\theta(t), t} \Big). \quad (3)$$

3.1. Offline and Online data

The cost $C_\alpha(\theta)$ is only evaluated over the online period. We assume that we have access to all the datasets and trained models during the offline period. In practice, the number of models and datasets is typically limited to only a few (around 10 to 20 at most), which is why we characterize this problem as being in a low-data regime. We denote this data as $I_{\text{offline}} = (D_{-w}, \dots, D_0, f_{-w}, \dots, f_0)$. In the online mode, each decision at time $t$ can only rely on information available prior to that time, which we denote by $I_{<t}$. Minimizing $C_\alpha(\theta)$ online would require knowing the future performance values $pe_{i,j}$ (for $i > t$ or $j > t$). This is infeasible; however, we assume that there is a temporal correlation between the performances of different models trained at different times, which we aim to exploit to build a predictive model. We therefore propose to 1) model these future values as random variables and learn their distributions; and 2) base our decisions on the predicted distributions to construct our method, the Uncertainty-Performance Forecaster (UPF).

4.1.
Future Performance Forecaster

The first component of our algorithm involves learning a performance predictor to forecast unknown entries in $pe$, which are defined as $pe_{i,j} = \mathbb{E}_{D_j}[\ell(f_i(X_j), Y_j)]$ for $i \le j$ (see Eqn 2). In a classification setting where we consider the 0-1 loss $\ell(\hat{y}, y) = \mathbb{1}[\hat{y} \ne y]$, these are $1 - \text{accuracy}$. We introduce random variables $A_{i,j}$ and model the entries $pe_{i,j}$ as realizations of these. Although this prediction task may initially seem similar to the performance estimation (PE) problem (Garg et al., 2020), it is fundamentally different. PE focuses on estimating the performance of an existing model under distribution shift, whereas our task involves forecasting the future performance of models that do not yet exist. Crucially, PE lacks a temporal dimension, as it does not account for the evolution of models over time. Since the $A_{i,j}$ random variables are bounded, we model them (after appropriate scaling) as Beta distributed with parameters $\alpha(r_{i,j}), \beta(r_{i,j})$ that depend on some input feature $r_{i,j}$. These features contain information about the current state of the feature distribution as well as the timestamp (see the section Inputs in Appendix A.7 for full details). This forecasting formulation allows us to capture both covariate and concept drift. The choice of the Beta distribution is particularly appropriate when the performance metrics $A_{i,j}$ are accuracies, as in our experiments, since accuracy can be interpreted as a scaled sum of Bernoulli random variables. Of course, other distributions could also be considered. We also define their associated mean $\mu(r_{i,j})$ and variance $\sigma(r_{i,j})$. Given the parameters $\alpha(r_{i,j}), \beta(r_{i,j})$, we model the random variables as independent of each other:

$$P(A_{0,0}, \dots, A_{T,T} \mid \{\alpha(r_{i,j}), \beta(r_{i,j})\}_{i \le j}) \quad (8)$$
$$= \prod_{i \le j} \text{Beta}(\alpha(r_{i,j}), \beta(r_{i,j})), \quad (9)$$

where Beta() denotes the pdf of a Beta distribution.
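Under this independence assumption, the joint log-likelihood of observed accuracies factorizes entry-wise; a minimal sketch (function names are ours):

```python
import math

def beta_log_pdf(a, alpha, beta):
    """Log of the Beta pdf at a in (0, 1), via log-gamma functions."""
    log_norm = (math.lgamma(alpha + beta)
                - math.lgamma(alpha) - math.lgamma(beta))
    return (log_norm + (alpha - 1.0) * math.log(a)
            + (beta - 1.0) * math.log(1.0 - a))

def joint_log_likelihood(obs):
    """Log-likelihood under the factorized model of Eqns (8)-(9).
    obs: iterable of (a_ij, alpha_ij, beta_ij) triples for i <= j."""
    return sum(beta_log_pdf(a, al, be) for a, al, be in obs)
```

Maximizing this quantity over the parameters (as functions of the features $r_{i,j}$) is one way to fit the forecaster.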
We choose the input features $r_{i,j}$ to include the indices of the training and evaluation datasets ($i$ and $j$, respectively), along with additional features that capture the gap between the training and evaluation timesteps (the difference $j - i$) and summary statistics of the distribution shift $z_{\text{shift}}$ (see Appendix A.7 for details). The input features are thus given by $r_{i,j} = [i, j, j - i, z_{\text{shift}}]$. From the offline data, we have access to observations $a_{i,j} \sim A_{i,j}$, and can build a regression dataset to learn the parameters $\alpha(r_{i,j}), \beta(r_{i,j})$. We specify the learning task by constructing $(r_{i,j}, a_{i,j})$ pairs.

circles is a 2-dimensional synthetic dataset. The input features are uniformly generated as $X_t \sim U[0,1]^2$. The label is generated using a moving rule $y_t = \mathbb{1}[(r_1 - (0.2 + 0.02t))^2 + (r_2 - (0.2 + 0.02t))^2 \le 0.5]$.

iWild (Beery et al., 2020) is a multiclass dataset featuring images of animals captured in the wild at various locations. Originally used as a domain transfer benchmark, we adapted it into a standard classification dataset by including the location ID as a feature for the model. To obtain a long enough sequence of datasets $D_0, D_1, \dots$, we create the individual datasets $D_i$ using overlapping windows on the timeframe, i.e., half of the most recent images in $D_i$ are contained in $D_{i+1}$. We avoid data leakage by ensuring that the train/val/test splits are maintained.

A.6.1. BASE MODEL OF THE IWILD DATASET

To motivate our cost considerations, we present an experiment where the base model architecture is not fixed and is searched for across a list of potential model architectures. This could happen in practice for important applications; nothing forces a practitioner to use the same base model $f$ at each timestep. Our architecture involves using a pretrained vision model, with a new output layer added to match the correct number of classes for our task, which is then fine-tuned for up to 20 epochs.
The fine-tuning process uses the Adam optimizer with a fixed learning rate of $10^{-4}$ and a weight decay parameter of $10^{-5}$. Training was conducted using 4 H100 GPUs for 2 days. At each timestep $t$, we perform a random search over the pretrained vision models made available from timm, which includes 188 vision models of varying configurations and base architectures. We include the list in Appendix A.15. We also include in our search the option to early stop or not, using the validation set. The model used for $f_t$ is the one that obtains the best validation accuracy.

A.7. Performance forecaster

In this section, we provide additional details on the proposed algorithm to forecast the performance. To restate, instead of learning the $\alpha(r_{i,j}), \beta(r_{i,j})$ parameters, we learn the mean and variance parameters

$$\mu(r_{i,j}), \quad (49)$$
$$\sigma(r_{i,j}), \quad (50)$$

and convert the learned parameters to the parameters of a Beta distribution using the following relations (with appropriate clipping if needed):

$$\alpha = \mu \left( \frac{\mu(1-\mu)}{\sigma} - 1 \right), \quad (51)$$
$$\beta = (1-\mu) \left( \frac{\mu(1-\mu)}{\sigma} - 1 \right). \quad (52)$$

Inputs $r_{i,j}$. As stated, the input of our performance forecaster model contains the model index $i$, the timestep $j$, the time since retrain $j - i$, and summary statistics of the distribution shift $z_{\text{shift}}$. $z_{\text{shift}}$ is constructed by taking the average feature shift between the features of the most recently available subsequent datasets $D_t$ and $D_{t-1}$ (where $t$ denotes the time step of the most recent available dataset). We compute the mean features of each dimension for a given dataset, $\bar{x} = \frac{1}{|D_t|}\sum_{i=1}^{|D_t|} x_i$, and compute the $\ell_1$ distance between the mean feature vectors of the two subsequent datasets:

$$z_{\text{shift}} = \|\bar{x}_t - \bar{x}_{t-1}\|_1. \quad (53)$$

The input features are thus given by concatenating $r_{i,j} = [i, j, j - i, z_{\text{shift}}]$. Since our methodology involves forecasting the performance of future models on future datasets to be used by our decision algorithm, we assess the regression performance of our forecasting models and analyze how it impacts the overall performance of our UPF algorithm.
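The moment conversions and the shift statistic can be sketched as follows (illustrative; the clipping constants are our choices, and the log-Normal inversion corresponds to the unbounded-metric variant of Appendix A.8):

```python
import math

def beta_params_from_moments(mu, var, eps=1e-6):
    """Eqns (51)-(52): alpha = mu*k, beta = (1-mu)*k with
    k = mu(1-mu)/var - 1. mu is clipped into (0, 1) and var capped
    below mu(1-mu) so that both parameters stay positive."""
    mu = min(max(mu, eps), 1.0 - eps)
    var = min(max(var, eps * eps), mu * (1.0 - mu) - eps * eps)
    k = mu * (1.0 - mu) / var - 1.0
    return mu * k, (1.0 - mu) * k

def lognormal_params_from_moments(mu, var):
    """Log-Normal analogue (Appendix A.8): standard moment inversion
    v^2 = ln(1 + var / mu^2), m = ln(mu) - v^2 / 2."""
    v2 = math.log(1.0 + var / (mu * mu))
    return math.log(mu) - 0.5 * v2, math.sqrt(v2)

def feature_shift(ds_t, ds_prev):
    """z_shift of Eqn (53): l1 distance between the per-dimension
    feature means of two subsequent datasets (lists of tuples)."""
    def mean_vec(ds):
        return [sum(x[d] for x in ds) / len(ds) for d in range(len(ds[0]))]
    return sum(abs(a - b) for a, b in zip(mean_vec(ds_t), mean_vec(ds_prev)))
```

For example, a predicted mean 0.8 and variance 0.01 map to Beta parameters (12, 3), whose mean and variance recover the inputs.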
To do so, we construct two versions of our forecaster module $\mu_\phi(r_{i,j})$ that are designed to be less performant than our proposed method. UPF overfit: a baseline designed to overfit the training data. We use a Gaussian Process-based $\mu_\phi(r_{i,j})$ with no white noise kernel, using a single dot product kernel from scikit-learn. UPF overfit+noise: this variant further decreases performance by using the same overfitting model and adding random noise to the target values. We report two metrics: the average mean absolute error of our prediction $\mu$ and the average bias of our prediction $\mu_\phi(r_{i,j}) - a_{i,j}$ on the test set. We start by reporting the retraining performance of each baseline w.r.t. our base retraining metric, the AUC of cost values evaluated at different $\alpha$, in Table 5. As expected, the best-performing method on all datasets is our proposed UPF, which is expected to reach the best MAE on its performance prediction.

Table 5. AUC of the combined performance/retraining cost metric $\hat{C}_\alpha(\theta)$, computed over a range of $\alpha$ values, for all datasets. The bolded entries represent the best, and the underlined entries indicate the second best. The * denotes statistical significance with respect to the next best baseline, evaluated using a Wilcoxon test at the 5% significance level.

| | Gauss | circles | epicgames | electricity | yelp | airplanes |
|---|---|---|---|---|---|---|
| UPF overfit+noise | 0.3845 | 0.0722 | 0.3253 | 2.6389 | 0.1194 | 2.3767 |
| UPF overfit | 0.3849 | 0.0663 | 0.3224 | 2.6001 | 0.1194 | 2.3352 |
| UPF | 0.3836* | 0.0662* | 0.3203* | 2.5910* | 0.1175* | 2.3094* |

We then visualize the effect of the performance forecasting precision (measured with MAE and bias) on the decision algorithm's performance (measured by $\hat{C}_\alpha(\theta)$) in the following figures. Overall, we observe that the impact of poor performance depends on the difficulty of the underlying dataset.
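The two forecasting metrics used here can be stated precisely (a small sketch; the function name is ours):

```python
def forecast_metrics(preds, targets):
    """The two metrics reported for the forecaster: mean absolute error
    of the predicted means mu(r_ij) against the observed a_ij, and the
    mean signed bias of those predictions."""
    n = len(preds)
    mae = sum(abs(p - t) for p, t in zip(preds, targets)) / n
    bias = sum(p - t for p, t in zip(preds, targets)) / n
    return mae, bias
```

Note that the bias can be near zero even when the MAE is large, which is why both are reported.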
For the airplanes dataset, which is of standard difficulty, we can observe a gradual impact of the degradation in forecasting performance on the overall retraining metric in Figure 5. The best MAE leads to the best cost metric $\hat{C}_\alpha(\theta)$, and the performance gradually decreases as the MAE and bias worsen. The Epicgames dataset (Figure 6), which is more challenging due to its less regular performance trends, shows a different behavior. Here, the overall forecasting performance is worse (the best achievable MAE is higher), and we observe a less regular pattern where poorer MAE does not always result in a proportional increase in cost.

Figure 5. Airplanes. Cost $\hat{C}_\alpha(\theta)$ vs $\alpha$ with the forecasting performance metrics (MAE and bias).

Figure 6. Epicgames. Cost $\hat{C}_\alpha(\theta)$ vs $\alpha$ with the forecasting performance metrics (MAE and bias).

Similarly, when turning to the synthetic datasets, the circles dataset, which is constructed with concept drift (changing $p(Y|X)$), is more challenging than the Gauss dataset, which only exhibits feature drift (where $p(X)$ changes, but $p(Y|X)$ remains constant). This impacts the effect of poor forecasting performance. In Figure 7, for the circles dataset, we observe that a small decrease in MAE paired with stronger bias can have a more sudden and drastic effect on the decision policy. Conversely, in the Gauss dataset (Figure 8), the effect of poorer forecasting performance is less pronounced.

Figure 7. Circles. Cost $\hat{C}_\alpha(\theta)$ vs $\alpha$ with the forecasting performance metrics (MAE and bias).

Figure 8. Gauss.
Cost $\hat{C}_\alpha(\theta)$ vs $\alpha$ with the forecasting performance metrics (MAE and bias).

A.8. Extension to non-bounded metrics

In this section, we show how we can extend our methodology to model non-bounded metrics often used in regression tasks, such as the root mean square error (RMSE) or mean absolute error (MAE). To do so, we replace the Beta distribution with a log-Normal distribution to model our performance metric r.v. $A_{i,j}$. A log-Normal distribution is parameterized by a location $m$ and a scale parameter $v$. We can learn the mean and variance parameters using the same Gaussian approximation,

$$\text{LogNorm}(m(r_{i,j}), v(r_{i,j})) \approx \mathcal{N}(\mu(r_{i,j}), \sigma(r_{i,j})), \quad (54)$$

and recover the location and scale parameters using the standard moment relations:

$$v^2 = \ln\left(1 + \frac{\sigma}{\mu^2}\right), \quad (55)$$
$$m = \ln(\mu) - \frac{v^2}{2}. \quad (56)$$

A.8.1. BETA APPROXIMATION VS NORMAL

In our method, we approximate the Beta distribution with a Normal distribution to ease the learning process:

$$\text{Beta}(\alpha(r_{i,j}), \beta(r_{i,j})) \approx \mathcal{N}(\mu(r_{i,j}), \sigma(r_{i,j})). \quad (57)$$

We verify here that this approximation doesn't have too big an effect on the end performance. We compare the UPF method, which uses $A_{i,j} \sim \text{Beta}(\alpha(r_{i,j}), \beta(r_{i,j}))$, with UPF (Gaussian), which doesn't use the Beta distribution and instead uses a Gaussian with learned parameters to model the performance metric: $A_{i,j} \sim \mathcal{N}(\mu(r_{i,j}), \sigma(r_{i,j}))$. In Figures 9, 10, 11 and 12, we can see that this does not have too big an effect on the overall behavior and performance.

Figure 9. Gauss.

Figure 10. Electricity.

Figure 11. Circles.

Figure 12. Airplanes.

A.9. Training complexity

In this section, we compare the training complexity of each baseline. We report the average time required for offline training and online inference, and discuss runtime complexity. The CARA baseline comprises two computationally intensive components.
First, it constructs the C matrix, representing its performance estimation. This algorithm involves inferring, with a modified model, each point of the new dataset and reweighting each, which scales with $O(|D_{\text{new}}|)$. This needs to be done in both offline and online phases. Then, in the offline phase, it performs an annealing search over parameters to find the best value that minimizes this cost approximation, taking into account the retraining cost associated with each decision. In Table 6, we can see that this results in the highest runtime for both online and offline phases.

Table 6. Average runtime of the baselines on the circles dataset.

| | CARA cum. | CARA | CARA per. | UPF | ADWIN | FHDDM | KSWIN |
|---|---|---|---|---|---|---|---|
| Offline (ms) | 8.4871 | 8.6608 | 7.8461 | 0.0947 | 0.0274 | 0.0122 | 0.3392 |
| Online (one step, ms) | 1.5604 | 1.5046 | 1.5940 | 0.0247 | 0.0351 | 0.0103 | 0.3438 |

In comparison, our approach consists of fitting a linear model on a small dataset. The shift distribution features must be obtained, but they involve comparing two histograms, scaling as $O(w^2 |D_t|)$ rather than exponentially with $|D_t|$. The distribution shift baselines do not have an offline phase, as they monitor shifts in the underlying distribution continuously. Their runtime complexity is therefore very low, at $O(|D_t|)$, as reflected in Table 6.

A.10. Additional results

In this section, we include additional figures to visualize our results in Figures 13, 14, 15, 16, 17, 18, and 19. Overall, the results are generally consistent and exhibit a similar trend. The Epic Games dataset, however, is more challenging and presents greater difficulties for all baselines. In particular, UPF performs worse than other baselines at low values of the retraining cost ratio $\alpha$. For those operating points, UPF does reach the correct retraining frequency; however, it is unable to pinpoint the optimal moments to retrain, resulting in worse performance than baselines that retrain more frequently, as shown in the right panel of Figure 19.
Figure 13. Result on the electricity dataset. left) Cost $\hat{C}_\alpha(\theta)$ vs $\alpha$. right) Number of retrains vs $\alpha$. (Baselines shown: UPF, CARA, CARA cumul., CARA per., KSWIN-5%, KSWIN-50%, FHDDM-5%, FHDDM-50%, ADWIN-5%, ADWIN-50%.)

Figure 14. Result on the yelp dataset. left) Cost $\hat{C}_\alpha(\theta)$ vs $\alpha$. right) Number of retrains vs $\alpha$.

Figure 15. Result on the epicgames dataset. left) Cost $\hat{C}_\alpha(\theta)$ vs $\alpha$. right) Number of retrains vs $\alpha$.

Figure 16. Result on the Gauss dataset. left) Cost $\hat{C}_\alpha(\theta)$ vs $\alpha$. right) Number of retrains vs $\alpha$.

Figure 17. Result on the circles dataset. left) Cost $\hat{C}_\alpha(\theta)$ vs $\alpha$. right) Number of retrains vs $\alpha$.

Figure 18. Result on the airplanes dataset. left) Cost $\hat{C}_\alpha(\theta)$ vs $\alpha$. right) Number of retrains vs $\alpha$.

We additionally include results with the oracle baselines in Figure 19. We can see that the UPF baseline is reasonably close to the optimal algorithm on two of the datasets (circles and electricity), but struggles for the more challenging dataset, epicgames. Looking at the number of retrains, we can see that UPF more closely follows the retraining frequency of the oracle for all datasets.

Figure 19.
Result on the circles (left), electricity (middle) and epicgames (right) datasets. Top) Cost $\hat{C}_\alpha(\theta)$ vs $\alpha$. Bottom) Number of retrains vs $\alpha$.

A.11. Methodology as offline RL

We can frame the retraining problem as an offline RL task (Levine et al., 2020). We define a state space where each state is described by the index of the trained model and the timestep: $S \subseteq \{0, \dots, T\} \times \{0, \dots, T\}$. The action space is to either retrain or not, so $A = \{0, 1\}$. The state transitions are deterministic and known:

$$T(S_{t+1} \mid S_t = (i, t), A) = \begin{cases} 1 & \text{if } A = 0,\ S_{t+1} = (i, t+1), \\ 1 & \text{if } A = 1,\ S_{t+1} = (t+1, t+1), \\ 0 & \text{otherwise.} \end{cases} \quad (58)$$

Figure 20 provides a visualization of the MDP. Since the state transitions are deterministic, we can define the deterministic transition function:

$$s_{t+1} = t(a_t, s_t). \quad (59)$$

The reward function depends only on the end state (which describes the performance of a model $i$ evaluated at timestep $t$) and on the action. Using $pe_S$ to denote the performance at a state $S$ and reusing the trade-off parameter $\alpha$, we have the reward

$$r(a_t, s_{t+1}) = -\alpha a_t - pe_{s_{t+1}}. \quad (60)$$

Figure 20. Visualization of the MDP.

To match our setting, the discount factor has to be set to one, $\gamma = 1$. The goal is to learn a policy $\pi$ on offline data that generalizes to the online period. The offline dataset is given by $D_{\text{offline}} = \{s_n, a_n, r_n\}_{n=1}^{N}$. The objective is defined as:

$$J(\pi) = \mathbb{E}_{\tau \sim p_\pi(\tau)}\Big[\sum_{t=-w}^{T+w} r(s_t, a_t)\Big], \quad (61)$$

which is the same objective as we defined, with the added option of defining a random policy to make decisions, $p_\pi(\theta)$:

$$J(\pi) = \mathbb{E}_{\theta \sim p_\pi(\theta)}\Big[\sum_{t=-w}^{T+w} r(s_t, a_t)\Big] \quad (62)$$
$$= \mathbb{E}_{\theta \sim p_\pi(\theta)}\Big[\sum_{t=-w}^{T+w} -\alpha a_t - pe_{s_{t+1}}\Big] \quad (63)$$
$$= \mathbb{E}_{\theta \sim p_\pi(\theta)}[-C_\alpha(\theta)]. \quad (64)$$

Q-learning (approximate dynamic programming methods). The basic idea of Q-learning is to define a Q function and to derive a deterministic policy $\pi$ from it. The Q function is defined as follows:

$$Q^\pi(s_t, a_t) = \mathbb{E}_{\tau \sim p_\tau \mid s_t, a_t}\Big[\sum_{t'=t}^{T+w} r(s_{t'}, a_{t'})\Big], \quad (65)$$

and the policy is set to:

$$\pi(a_t \mid s_t) = \delta(a_t = \arg\max_a Q(s_t, a)).$$
(66)

Since the optimal policy π* should satisfy

Q*(s_t, a_t) = r(s_t, a_t) + E_{s_{t+1}∼T(s_{t+1}|s_t,a_t)}[ max_{a_{t+1}} Q*(s_{t+1}, a_{t+1}) ], (67)

one algorithm is to train Q_ϕ until that equation is satisfied. In our case, the transition is deterministic, so we can define s_{t+1} = t(s_t, a_t) and have

Q*(s_t, a_t) = r(s_t, a_t) + max_{a_{t+1}} Q*(t(s_t, a_t), a_{t+1}). (68)

The idea is then to parameterize Q_ϕ and minimize the following for all samples in the dataset using the Bellman update:

Σ_n ( Q_ϕ(s_n, a_n) − [ r(s_n, a_n) + max_{a'} Q_ϕ(s'_n, a') ] )². (69)

First we set the target:

y_n = r(s_n, a_n) + max_{a'} Q_ϕ(s'_n, a'), (70)

then we optimize ϕ:

min_ϕ Σ_n ( Q_ϕ(s_n, a_n) − y_n )², (71)

and the algorithm iterates between those two steps. We can therefore apply any Q-learning method to our problem, provided that it uses a standard Q_ϕ parameterization.

Connecting Q-learning to our UPF algorithm. In our setting, we have special knowledge of the structure of Q. First, there is no randomness in the state transition, so we know that:

y_n = r(s_n, a_n) + max_{a_{n+1}} Q_ϕ(t(s_n, a_n), a_{n+1}). (72)

By definition, we have that:

Q_ϕ(s_t, a_t) = −a_t α − pe_{s_{t+1}} + max_{a_{t+1}} Q_ϕ(t(s_t, a_t), a_{t+1}). (73)

While computing the Bellman update and setting the target, we can see that the Q function of one of the last states, Q_ϕ(s_{T,x}, ·), has to predict the end performance:

Q_ϕ(s_{T,x}, ·) = −pe_{s_{T,x}} (74)
               = −f_ϕ(s_{T,x}). (75)

By the DAG structure of the transition function, and since the α value is known, we can recursively parameterize all the Q_ϕ functions with shareable components:

Q_ϕ(s_{T−1,x}, a_{T−1,x}) = −α a_{T−1,x} − f_ϕ(s_{T−1,x}) + max( −α − f_ϕ(s_{T,T}), −f_ϕ(s_{T,x}) ), (76)

where each f_ϕ(s) models the performance pe_s at the corresponding state. The MSE objective that is traditionally applied (Eqn.
71) can then be decomposed into two terms, where one of the terms corresponds to our objective:

L = Σ_n ( Q_ϕ(s_n, a_n) − y_n )² (77)
  = Σ_n ( −α a_n − f_ϕ(s_n) + max( −α − f_ϕ(s_{T,T}), −f_ϕ(s_{T,x}) ) (78)
      − ( −a_n α − pe_{s_n} + max_{a_{n+1}} Q_ϕ(t(s_n, a_n), a_{n+1}) ) )² (79)
  = Σ_n ( −f_ϕ(s_n) + pe_{s_n} + max( −α − f_ϕ(s_{T,T}), −f_ϕ(s_{T,x}) ) − max_{a_{n+1}} Q_ϕ(t(s_n, a_n), a_{n+1}) )² (80)
L = Σ_n ( f_ϕ(s_n) − pe_{s_n} )² + C. (81)

The term (f_ϕ(s_n) − pe_{s_n})² in the loss function aligns with our objective, as A_{i,j} represents our model's approximation of the performance metric pe_{i,j}. Therefore, with this specific parameterization, we can establish a connection between Q-learning and our learning method. However, as noted in the main text, applying existing ORL methods to this problem would not be effective. The problem involves a deterministic transition matrix and a highly structured reward, both of which are uncommon in typical RL settings. Additionally, most RL methods prioritize scalability to large state or action spaces, use complex models, and assume access to plentiful data, making them ill-suited to our scenario. A key requirement for our approach is training efficiency, given our limited performance data and the need for online adaptation as more information becomes available. If the computational cost of deciding when to retrain is comparable to that of retraining itself, the approach becomes impractical.

A.11.1. OFFLINE RL BASELINES

In this section, we present results using an offline RL baseline that is appropriate for low-data settings: Least-Squares Policy Iteration (LSPI) (Lagoudakis & Parr, 2003). We follow the detailed RL formulation presented above. To implement LSPI, we use the model index i and timestep t as the state (following the formulation from the previous section). In LSPI, various approximation methods are introduced to solve the linear system, but these are unnecessary in our case, as we can solve it exactly due to the small size of our problem.
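Because the transitions in Eq. (58) are deterministic and the horizon is short, the exact Q function for this MDP can also be obtained by backward induction rather than a learned approximation. The sketch below illustrates the Bellman recursion of Eq. (68) on a toy instance; the performance-cost matrix pe and the trade-off parameter alpha are hypothetical inputs, and since in practice pe would have to be forecast (which is what UPF does), this corresponds to an oracle-style solution:

```python
import numpy as np

def optimal_retrain_schedule(pe, alpha):
    """Exact backward induction on the deterministic retraining MDP.

    pe[i, t] -- performance cost of the model trained at step i when used at
                step t (only i <= t is meaningful); hypothetical here, but
                forecast by UPF in practice.
    alpha    -- cost incurred each time the model is retrained.
    Returns (optimal total cost, list of retrain decisions a_t in {0, 1}).
    """
    T = pe.shape[1]
    # Q[i, t, a]: cost-to-go from state (model i, time t) under action a.
    # Q[:, T, :] stays 0 because no cost accrues beyond the horizon.
    Q = np.zeros((T, T + 1, 2))
    for t in range(T - 1, -1, -1):              # Bellman backups, last step first
        for i in range(t + 1):                  # model index never exceeds t
            Q[i, t, 0] = pe[i, t] + Q[i, t + 1].min()          # keep model i
            Q[i, t, 1] = alpha + pe[t, t] + Q[t, t + 1].min()  # retrain now
    # Greedy rollout of the optimal policy, accumulating the realized cost.
    i, actions, total = 0, [], 0.0
    for t in range(T):
        a = int(Q[i, t].argmin())
        actions.append(a)
        if a == 1:
            i = t
        total += alpha * a + pe[i, t]
    return total, actions
```

For instance, with a 3-step horizon where the deployed model degrades quickly, the rollout trades the retraining cost alpha against the accumulated performance cost exactly as in the cost objective above.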
We present several versions of this baseline by varying the λ parameter. In Table 7, we can see that this proposed baseline is not competitive. These initial results for this basic formulation of the offline RL problem indicate that more care and design are needed to appropriately solve this problem using offline RL, supporting our claim that existing RL methods, as they are, may not be well suited to it.

Table 7. AUC of the combined performance/retraining cost metric ˆCα(θ), computed over a range of α values, for all datasets. The bolded entries represent the best, and the underlined entries indicate the second best. The * denotes a statistically significant difference with respect to the next best baseline, evaluated using a Wilcoxon test at the 5% significance level.

                 electricity   Gauss     circles   airplanes   yelpCHI   epicgames   iWild
ADWIN-5%         2.8099        0.4533    0.0753    2.6353      0.1298    0.3217      3.7371
ADWIN-50%        2.8131        0.4848    0.0753    2.7147      0.1298    0.3238      4.2564
KSWIN-5%         3.8979        0.3975    0.0753    3.2300      0.1322    0.3420      4.4268
KSWIN-50%        4.0521        0.9530    0.0794    3.2042      0.1655    0.3537      4.4268
FHDDM-5%         3.1525        0.3893    0.0753    2.6577      0.1324    0.3298      4.4267
FHDDM-50%        3.4037        0.5918    0.0772    2.7077      0.1450    0.3389      4.4268
CARA cumul.      2.7147        0.3862    0.0731    2.2900      0.1299    0.3228      3.8922
CARA per.        2.8986        0.4678    0.0800    2.4061      0.1318    0.3260      3.7527
CARA             2.7198        0.3841    0.0726    2.2753*     0.1294    0.3202      3.9506
LSPI λ = 1       4.3820        1.0530    0.2412    3.7140      0.1493    0.3523      -
LSPI λ = 0.5     4.5260        1.0837    0.2455    3.6924      0.1442    0.3566      -
LSPI λ = 0.0     4.5317        1.0933    0.2478    3.5862      0.1378    0.3573      -
UPF (ours)       2.5782*       0.3829*   0.0668*   2.2865      0.1293*   0.3189*     3.0498*
oracle           2.4217        0.3724    0.0627    2.2298      0.1275    0.3170      2.4973

A.12. Relating our objective to the CARA formulation

Although Mahadevan & Mathioudakis (2024) also tackle the retraining problem, they formulate it differently.
Instead of using a binary vector to model the retraining decisions, they use a sequence of model indices S = [s_1, ..., s_T] with the constraint that s_t ∈ {0, ..., t}. If s_t = t, it signifies a retrain. The cost objective they consider is similar to ours; they sum over the timesteps to get the cumulative total cost. The cost per timestep is encoded in an upper-triangular matrix C:

C[t', t] = { Ψ_{t',t} if t' < t; κ if t' = t (cost of retraining) }. (82)

The total cost is defined as:

C_cara(S) = Σ_{t=1}^{T} C[s_t, t]. (83)

The staleness cost is defined as the cost of using a model f_1 to classify data from Q_2, approximated using a dataset D_3:

Ψ(Q_2, D_3, f_1) = Σ_{q∈Q_2} (1/|D_3|) Σ_{(x,y)∈D_3} sim(q, x) ℓ(f_1, x, y). (84)

The aim of this metric is to predict the performance of f_1 on the query points in Q_2 by computing the loss on a reference dataset D_3, weighting the loss at each sample of D_3 by how similar it is to the query samples in Q_2 (this is the role of sim(q, x)):

ℓ(f_1(q), y_q) ≈ (1/|D_3|) Σ_{(x,y)∈D_3} sim(q, x) ℓ(f_1, x, y), (85)

so that

Ψ(Q_2, D_3, f_1) ≈ Ne E_{Q_2}[ ℓ(f_1(X), Y) ] (86)
               ≈ Ne pe_{t_1,t_2}. (87)

The relative staleness cost is defined as the difference between staleness costs:

Ψ_{t',t} = Ψ(Q_t, D_t, f_{t'}) − Ψ(Q_t, D_t, f_t). (88)

This is intended to approximate the relative gap in performance:

Ψ_{t',t} ≈ Ne (pe_{t',t} − pe_{t,t}). (89)

In our experiments, we directly use Ψ(Q_t, D_t, f_{t'}) as an approximation of pe_{t',t} and apply the CARA algorithm directly on the staleness costs instead of the relative staleness costs.

Relating it to our formulation. Our objective is given by:

C(θ) = c ||θ||_1 + Ne Σ_{t=1}^{T} pe_{rθ,t}. (90)

To understand the connection with our formulation, we start by rewriting the CARA cost as:

C_cara(S) = Σ_{t=1}^{T} 1[s_t = t] κ + 1[s_t < t] Ψ_{s_t,t} (91)
         ≈ Σ_{t=1}^{T} 1[s_t = t] κ + Ne 1[s_t < t] (pe_{s_t,t} − pe_{t,t}) from (89) (93)
C_cara(θ) = κ ||θ||_1 + Ne Σ_{t=1}^{T} (pe_{rθ,t} − pe_{t,t}), switching to our notation with θ.
(94)

This reveals the assumptions required for the two solutions to coincide. First, the approximation for the loss of a future model f_{t'} should hold:

ℓ(f_{t'}(x_q), y_q) ≈ (1/|D_t|) Σ_{(x,y)∈D_t} sim(x_q, x) ℓ(f_{t'}, x, y). (95)

Second, in order to have

C(θ) = C_cara(θ), (96)

we need

κ = c + (Ne Σ_{t=1}^{T} pe_{t,t}) / ||θ||_1. (97)

Proof: We require that

c ||θ||_1 + Ne Σ_{t=1}^{T} pe_{rθ,t} = κ ||θ||_1 + Ne Σ_{t=1}^{T} (pe_{rθ,t} − pe_{t,t}). (98)

This implies that

c ||θ||_1 + Ne Σ_{t=1}^{T} pe_{rθ,t} = κ ||θ||_1 + Ne Σ_{t=1}^{T} pe_{rθ,t} − Ne Σ_{t=1}^{T} pe_{t,t}, (99)

and hence that

κ = c + (Ne Σ_{t=1}^{T} pe_{t,t}) / ||θ||_1. (100)

The cost of retraining κ in the CARA formulation must thus scale with the minimum performance cost that can be obtained by always using the most recent model, Ne Σ_{t=1}^{T} pe_{t,t}, divided by the number of retrains that have been made. It is of course not possible to set κ to this value, as it depends on θ, but it gives insight into how the formulations relate to each other.

A.13. Varying training data size

In this section, we provide experimental results where we assume access to fewer offline time steps and analyze how this impacts the results. We display the relative improvement of the best baseline over the competing baselines by reporting normalized AUC values in Tables 8, 9, and 10. Overall, our method remains effective in scenarios with reduced training data. It demonstrates greater robustness compared to the CARA baselines, which can be explained by the fact that it can adapt to new information received during the online process, which CARA cannot do. With very few training steps (w = 2), the CARA baselines suffer the most, reaching more than twice the error on some datasets. With more data (w = 4), the relative performance is more in line with larger datasets (w = 7), with UPF remaining the best.

Table 8. w = 2. Normalized AUC of the combined performance/retraining cost metric ˆCα(θ), computed over a range of α values, for all datasets. We normalize by dividing by the best value for each dataset.
The bolded entries represent the best. The * denotes statistical significance with respect to the next best baseline, evaluated using a Wilcoxon test at the 5% significance level.

w = 2        electricity   airplanes   yelpCHI   epicgames   Gauss     circles
CARA         1.0000        1.0101      1.0100    1.0282      2.6519    1.4792
CARA c.      1.0669        1.0680      0.0544    2.7437      4.0150    1.6872
CARA per.    2.1971        1.6703      0.0661    2.9131      10.6965   1.8901
UPF          1.0258        1.0000*     1.0000*   1.0000      1.0000*   1.0000*

A.14. Results on the Wild-Time temporal dataset

In this section, we present preliminary results on one dataset from the suite of temporal datasets of Yao et al. (2022): the yearbook dataset. To construct our sequence of datasets D_t, ..., we follow the construction from Yao et al. (2022). For training, we iteratively add more samples from each year, spanning 1930 to 2012. For testing, we evaluate only on samples from the most

Table 9. w = 4. Normalized AUC of the combined performance/retraining cost metric ˆCα(θ), computed over a range of α values, for all datasets. We normalize by dividing by the best value for each dataset. The bolded entries represent the best. The * denotes statistical significance with respect to the next best baseline, evaluated using a Wilcoxon test at the 5% significance level.

w = 4        electricity   airplanes   yelpCHI   epicgames   Gauss     circles
CARA         1.0093        1.0024      1.0000    1.0063      1.0049    1.0653
CARA per.    1.1029        1.0721      1.0017    1.0168      1.0984    1.0045
CARA c.      1.0153        1.0060      1.0025    1.0220      1.0042    1.0501
UPF          1.0000*       1.0000*     1.0008    1.0000*     1.0000*   1.0000*

Table 10. w = 7. Normalized AUC of the combined performance/retraining cost metric ˆCα(θ), computed over a range of α values, for all datasets. We normalize by dividing by the best value for each dataset. The bolded entries represent the best. The * denotes statistical significance with respect to the next best baseline, evaluated using a Wilcoxon test at the 5% significance level.
w = 7        electricity   airplanes   yelpCHI   epicgames   Gauss     circles
CARA c.      1.0530        1.0065      1.0046    1.0122      1.0086    1.0944
CARA per.    1.1244        1.0575      1.0193    1.0223      1.2219    1.1976
CARA         1.0549        1.0000*     1.0008    1.0041      1.0031    1.0868
UPF (ours)   1.0000*       1.0050      1.0000*   1.0000*     1.0000*   1.0000*

recent year. As for the model f_t, we use the ERM model from Yao et al. (2022) and follow their training procedure. We use a setup similar to the one followed in our experiments, setting the offline window size w = 7, evaluating over an online phase of T = 8 steps, and presenting results over 10 trials (see Table 11). Preliminary results for this dataset, which can be seen in Table 12, are in line with the results from the main paper.

Table 11. Dataset description. w denotes the number of timesteps of the offline phase, T denotes the number of timesteps of the online phase. Model describes the architecture used for each f_t.

Dataset    Model   αmax   w   M<0   T   Dataset size (|D|)   Num. features   Task
yearbook   ERM     0.5    7   21    8   (varies)             32×32×3         Binary

A.15. List of timm pretrained vision models

beit_base_patch16_224, beitv2_base_patch16_224, caformer_s18, cait_s24_224, cait_xxs24_224, cait_xxs36_224, coat_lite_mini, coat_lite_small, coat_lite_tiny, coat_mini, coat_tiny, coatnet_0_rw_224, coatnet_bn_0_rw_224, coatnet_nano_rw_224, coatnet_rmlp_1_rw_224, coatnet_rmlp_nano_rw_224,

Table 12. AUC of the combined performance/retraining cost metric ˆCα(θ), computed over a range of α values. The bolded entries represent the best, and the underlined entries indicate the second best. The * denotes a statistically significant difference with respect to the next best baseline, evaluated using a Wilcoxon test at the 5% significance level.

CARA cumul.   0.0351
CARA per.
0.0195
CARA          0.0322
UPF           0.0120*
Oracle        0.0105

coatnext_nano_rw_224, convformer_s18, convit_base, convit_small, convit_tiny, convmixer_1024_20_ks9_p14, convnext_atto, convnext_atto_ols, convnext_base, convnext_femto, convnext_femto_ols, convnext_nano, convnext_nano_ols, convnext_pico, convnext_pico_ols, convnext_small, convnext_tiny, convnext_tiny_hnf, convnextv2_atto, convnextv2_femto, convnextv2_nano, convnextv2_pico, convnextv2_tiny, crossvit_15_240, crossvit_15_dagger_240, crossvit_15_dagger_408, crossvit_18_240, crossvit_18_dagger_240, crossvit_9_240, crossvit_9_dagger_240, crossvit_base_240, crossvit_small_240, crossvit_tiny_240, cs3darknet_focus_l, cs3darknet_focus_m, cs3darknet_l, cs3darknet_m, cs3darknet_x, cs3edgenet_x, cs3se_edgenet_x, cs3sedarknet_l, cs3sedarknet_x, cspdarknet53, cspresnet50, cspresnext50, darknet53, darknetaa53, davit_base, davit_small, davit_tiny, deit3_base_patch16_224, deit3_medium_patch16_224, deit3_small_patch16_224, deit_base_distilled_patch16_224, deit_base_patch16_224, deit_small_distilled_patch16_224, deit_small_patch16_224, deit_tiny_distilled_patch16_224, deit_tiny_patch16_224, densenet121, densenet161, densenet169, densenet201, densenetblur121d, dla102, dla102x, dla102x2, dla169, dla34, dla46_c, dla46x_c, dla60, dla60_res2net, dla60_res2next, dla60x, dla60x_c, dm_nfnet_f0, dm_nfnet_f1, dpn68,
dpn68b, dpn92, dpn98, eca_nfnet_l0, eca_nfnet_l1, eca_nfnet_l2, eca_resnet33ts, eca_resnext26ts
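Finally, the equivalence derived in Appendix A.12 can be verified numerically. The following sketch instantiates our cost from Eq. (90), the κ of Eq. (100), and the rewritten CARA cost of Eq. (94), and checks that they coincide; all quantities (T, c, Ne, theta, and the pe matrix) are hypothetical, chosen only to exercise the algebra:

```python
import numpy as np

# Hypothetical numbers for illustration only.
T, c, Ne = 5, 1.0, 10.0
theta = np.array([1, 0, 1, 0, 0])  # retrain at steps 0 and 2, ||theta||_1 = 2
rng = np.random.default_rng(0)
pe = np.triu(rng.uniform(0.05, 0.4, (T, T)))  # pe[i, t]: cost of model i at step t

def r_theta(t):
    """Index of the deployed model at step t: the most recent retrain <= t."""
    retrains = [i for i in range(t + 1) if theta[i] == 1]
    return retrains[-1] if retrains else 0

# Our cost, Eq. (90): retraining cost plus accumulated performance cost.
C_ours = c * theta.sum() + Ne * sum(pe[r_theta(t), t] for t in range(T))

# kappa from Eq. (100), then the CARA cost in our notation, Eq. (94).
kappa = c + Ne * sum(pe[t, t] for t in range(T)) / theta.sum()
C_cara = kappa * theta.sum() + Ne * sum(
    pe[r_theta(t), t] - pe[t, t] for t in range(T)
)
assert np.isclose(C_ours, C_cara)  # the two formulations coincide
```

As the proof shows, the match is exact by construction: κ folds the always-fresh performance cost Ne Σ pe_{t,t} back into the per-retrain charge, so any other choice of κ breaks the equality.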