Published in Transactions on Machine Learning Research (10/2025)

Temporal Test-Time Adaptation with State-Space Models

Mona Schirmer m.c.schirmer@uva.nl UvA-Bosch Delta Lab, University of Amsterdam
Dan Zhang dan.zhang2@de.bosch.com Bosch Center for AI
Eric Nalisnick nalisnick@jhu.edu Johns Hopkins University

Reviewed on OpenReview: https://openreview.net/forum?id=HFETOmUtrV

Distribution shifts between training and test data are inevitable over the lifecycle of a deployed model, leading to performance decay. Adapting a model on test samples can help mitigate this drop in performance. However, most test-time adaptation methods have focused on synthetic corruption shifts, leaving a variety of distribution shifts underexplored. In this paper, we focus on distribution shifts that evolve gradually over time, which are common in the wild but, as we show, challenging for existing methods. To address this, we propose STAD, a Bayesian filtering method that adapts a deployed model to temporal distribution shifts by learning the time-varying dynamics in the last set of hidden features. Without requiring labels, our model infers time-evolving class prototypes that act as a dynamic classification head. Through experiments on real-world temporal distribution shifts, we show that our method excels in handling small batch sizes and label shift.

1 Introduction

Predictive models often have an expiration date. Real-world applications tend to exhibit distribution shift, meaning that the data points seen at test time are drawn from a distribution that is different from the training data's. Moreover, the test distribution usually becomes more unlike the training distribution as time goes on. An example is recommendation systems: trends change, new products are released, old products are discontinued, etc.
Unless a model is updated, its ability to make accurate predictions will expire, requiring the model to be taken offline and re-trained. Every iteration of this model life-cycle can be expensive and time-consuming. Allowing models to remain fresh for as long as possible is thus an open and consequential problem. Test-time adaptation (TTA) (Liang et al., 2024; Yu et al., 2023) has emerged as a powerful paradigm to preserve model performance under a shifting test distribution. TTA performs online adaptation of a model's parameters using only test-time batches of features. By requiring neither access to labels nor source data, TTA algorithms can be employed in resource-constrained environments, whereas related approaches such as domain generalization, domain adaptation and test-time training cannot. Most TTA methods operate by minimizing an entropy objective (Wang et al., 2021) or updating normalization parameters (Schneider et al., 2020; Nado et al., 2020; Niu et al., 2023). Synthetically corrupted images (e.g. CIFAR-10-C) are by far the most commonly used benchmark for assessing progress on TTA, despite concerns about benchmark diversity (Zhao et al., 2023b). These shifts increase the degree of information loss over time, and well-performing TTA methods must learn to preserve a static underlying signal. In this work, we focus on an omnipresent distribution shift of quite a different nature: temporal distribution shifts encode structural change, not just information loss. Gradual structural change over time is relevant for any deployed model that is operating continuously. While related subfields like gradual domain adaptation (GDA) (Kumar et al., 2020) and temporal domain generalization (TDG) (Bai et al., 2023) are dedicated to these shifts, they have received little attention in TTA.
As we will demonstrate using datasets like the Functional Map of the World (FMoW), which classifies land use over time (e.g., rural-to-urban development), the setting of temporal test-time adaptation (TempTTA) presents significant challenges for existing TTA methods. To address this gap, we propose State-space Test-time Adaptation (STAD), a method that builds on the power of probabilistic state-space models (SSMs) to represent non-stationary data distributions over time. STAD dynamically adapts a model's final layer to accommodate an evolving test distribution. Specifically, we employ a probabilistic SSM based on Bayesian filtering to model the evolution of the weight vectors in the final layer, where each vector represents a class, as distribution shift occurs. For generating predictions on newly acquired test batches, we use the SSM's posterior cluster means as the new parameters. STAD leverages Bayesian updating and does not rely upon normalization mechanisms. As a consequence, STAD excels in scenarios where many TTA methods collapse (Niu et al., 2023), such as adapting with very few samples and under label shift. Our contributions are the following:

- In Sec. 2, we detail the setting of TempTTA, which aims to cope with shifts that gradually evolve due to variation in the application domain. Despite being ubiquitous in real-world scenarios, these shifts are understudied in the TTA literature and pose significant challenges to established methods, as we demonstrate in Sec. 5.1.
- In Sec. 3, we propose STAD, a novel method for TempTTA. It adapts to temporal distribution shifts by modeling their dynamics in representation space. No previous work has explicitly modeled these dynamics, which we demonstrate is crucial via an ablation study (Sec. 5.3).
- In Sec. 5, we conduct a comprehensive evaluation of STAD and prominent TTA baselines under authentic temporal shifts. Our results show that STAD excels in this setting (Tab. 2), yet is applicable beyond temporal shifts, providing performance gains on some reproduction datasets (Tab. 3), synthetic corruptions (Tabs. 4 and 12) and domain adaptation benchmarks (Tabs. 8 and 9).

2 Problem Setting

Data & Model. We focus on the traditional setting of multi-class classification, where X ⊆ R^D denotes the input (feature) space and Y = {1, ..., K} the label space. Let x and y be random variables with P(x, y) = P(x) P(y|x) the unknown source data distribution; we write x ∈ X and y ∈ Y for their realisations. The goal of classification is to find a mapping fθ : X → Y, with parameters θ, from the input space to the label space. Fitting the classifier fθ is usually accomplished by minimizing an appropriate loss function (e.g. the log loss). Yet our method is agnostic to how fθ is trained and is therefore easy to use with, for instance, a pre-trained model downloaded from the web.

Temporal Test-Time Adaptation (TempTTA). We are interested in adapting a model at test time to a test distribution that evolves with time. More formally, let T = {1, ..., T} be a set of T time indices. At test time, let the data at time t ∈ T be sampled from a distribution Qt(x, y) = Qt(x) Qt(y|x). The test distributions differ from the source distribution, Qt(x, y) ≠ P(x, y) ∀ t > 0, and are non-stationary, meaning Qt(x, y) ≠ Qt′(x, y) for t ≠ t′. As in standard TTA, we do not observe labels at test time, and hence we observe only a batch of features Xt = {x1,t, ..., xN,t}, where xn,t ∼ Qt(x) (i.i.d.). Given the t-th batch of features Xt, the goal is to adapt fθ, forming a new set of parameters θt such that fθt has better predictive performance on Xt than fθ would have. Since we can only observe features, we assume that the distribution shift must at least take the form of covariate shift: Qt(x) ≠ P(x) ∀ t > 0. In addition, a label shift may occur, which poses an additional challenge: Qt(y) ≠ P(y) ∀ t > 0.
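To make the TempTTA data protocol concrete, the sketch below (our own illustration, not code from the paper; `drifting_stream` and all drift parameters are made up) simulates a stream of unlabeled test batches Xt drawn from a gradually shifting Qt(x), where the class-conditional means move a little at every time step and labels stay hidden from the adapter:

```python
import numpy as np

def drifting_stream(T=10, n=32, d=2, drift=0.3, seed=0):
    """Yield unlabeled test batches X_t ~ Q_t(x) whose class means
    drift a little at every step: covariate shift that grows with t."""
    rng = np.random.default_rng(seed)
    means = np.array([[-2.0, 0.0], [2.0, 0.0]])  # two class prototypes
    for t in range(T):
        means = means + drift * rng.standard_normal(means.shape)  # gradual shift
        labels = rng.integers(0, 2, size=n)      # hidden from the adapter
        X = means[labels] + rng.standard_normal((n, d))
        yield X                                  # only features are observed

batches = list(drifting_stream())
```

An adapter in this setting sees `batches[t]` one at a time and must update its parameters without ever seeing `labels`.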
Temporal shifts, as described above, have been the focus of temporal domain generalization (Bai et al., 2023) and gradual domain adaptation (Abnar et al., 2021). However, both paradigms operate during training, whereas TempTTA is applicable at test time. In Tab. 1, we contrast TempTTA with adjacent fields, highlighting subfields that address temporal shifts. Notably, TempTTA can be seen as a special case of continual TTA (CTTA) (Wang et al., 2022b), with the important distinction that the domain index t is inherently temporal. This is opposed to a categorical domain index (e.g. different corruption types) as is mostly studied in CTTA.

Table 1: Comparison of TempTTA with related fields

| Field | Adaptation stage | Available samples | Non-stationary test distribution | Time-ordered |
|---|---|---|---|---|
| Domain generalization (DG) | train | P(x, y) | | |
| Temporal DG | train | P1(x, y), ..., PTS(x, y) | ✓ | ✓ |
| Domain adaptation (DA) | train | P(x, y), Q(x) | | |
| Gradual DA | train | P(x, y), Q1(x), ..., QT(x) | ✓ | ✓ |
| Test-time training | train, test | P(x, y), Q(x) | | |
| Test-time adaptation (TTA) | test | Q(x) | | |
| Continual TTA | test | Q1(x), ..., QT(x) | ✓ | |
| TempTTA | test | Q1(x), ..., QT(x) | ✓ | ✓ |

3 Tracking the Dynamics of Temporal Shifts

We now present our method. The core idea is that adaptation to temporal distribution shifts can be done by tracking their gradual change in the model's representations. We employ linear SSMs to capture how test points evolve and drift. The SSM's cluster representations then serve as an adaptive classification head that evolves with the non-stationarity of the distribution shift. Fig. 1 illustrates our method. In Sec. 3.2, we first introduce the general model and then, in Sec. 3.3, we propose an efficient implementation that leverages the von Mises-Fisher distribution to model spherical features.
3.1 Adaptation in Representation Space

Figure 1: STAD adapts to distribution shifts by inferring dynamic class prototypes wt,k for each class k (different colors) at each test time point. It operates on the representation space of the penultimate layer.

Following previous work (Iwasawa & Matsuo, 2021; Boudiaf et al., 2022), we adapt only the last layer of the source model. This lightweight approach is reasonable for TempTTA: since the distribution shifts gradually over time, only constrained adaptation is needed. From a practical perspective, this circumvents backpropagation through potentially large networks such as foundation models and allows adaptation even when only embeddings are provided, e.g. by an API. More formally, let the classifier fθ be a neural network with L total layers. We treat the first L−1 layers, denoted f^{L−1}_θ, as a black box that transforms the original feature vector x into a new (lower-dimensional) representation, which we denote as h. The original classifier then maps these representations to the classes as E[y|h] = softmax_y(W0^T h), where softmax_y(·) denotes the dimension of the softmax's output corresponding to the y-th label index and W0 are the last-layer weights. As W0 will only be valid for representations that are similar to the training data, we discard these parameters when performing TempTTA, learning new parameters Wt for the t-th time step. These new parameters are used to generate the adapted predictions through the same link function: E[y|h] = softmax_y(Wt^T h). In the setting of TempTTA, we observe a batch of features Xt. Passing them through the model yields corresponding representations Ht, and this will be the data used for the probabilistic model described below. Specifically, we next model how the representations change from Ht to Ht+1.

3.2 A Probabilistic Model of Shift Dynamics

We now describe our general method for a time-evolving adaptive classification head.
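As a minimal sketch of the adapted head described above (our own toy code, not the paper's implementation), prediction with time-varying prototypes Wt reduces to a softmax over prototype-feature inner products:

```python
import numpy as np

def softmax(z):
    """Numerically stable row-wise softmax."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_with_prototypes(H, W):
    """Adapted head of Sec. 3.1: H is (N, D) penultimate-layer features,
    W is (D, K) class prototypes; returns softmax(W^T h) per row."""
    return softmax(H @ W)

H = np.array([[1.0, 0.0], [0.0, 1.0]])  # two toy representations
W = np.eye(2)                           # prototype k = k-th basis vector
probs = predict_with_prototypes(H, W)
preds = probs.argmax(axis=1)
```

Swapping in a new `W` at each time step is all that adaptation requires here; the backbone producing `H` stays frozen.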
We assume that, while the representations Ht change gradually over time, they still maintain some class structure in the form of clusters. Our model seeks to track this structure as it evolves. For the intuition behind the approach, see Fig. 1. The blue, red, and green clusters represent classes of a classification problem. As the distribution shifts from time step t = 1 to t = 3, the class clusters shift in representation space. Using latent variables wt,k for the cluster centers, we assume each representation is drawn conditioned on K latent vectors: ht,n ∼ p(ht | wt,1, ..., wt,K), where K is equal to the number of classes in the prediction task. After fitting the unsupervised model, the K latent vectors are stacked to create Wt, the last-layer weights of the adapted predictive model (as introduced in Sec. 3.1). We now move on to a technical description.

Notation and Variables. Let Ht = (ht,1, ..., ht,Nt) ∈ R^{D×Nt} denote the neural representations for Nt data points at test time t. Let Wt = (wt,1, ..., wt,K) ∈ R^{D×K} denote the K weight vectors at test time t. As discussed above, the weight vector wt,k can be thought of as a latent prototype for class k at time t. We denote with Ct = (ct,1, ..., ct,Nt) ∈ {0, 1}^{K×Nt} the Nt one-hot encoded latent class assignment vectors ct,n ∈ {0, 1}^K at time t. The k-th position of ct,n is denoted ct,n,k and is 1 if ht,n belongs to class k and 0 otherwise. As in standard (static) mixture models, the prior over the latent class assignments is a categorical distribution, p(ct,n) = Cat(πt) with πt = (πt,1, ..., πt,K) ∈ [0, 1]^K and Σ_{k=1}^K πt,k = 1. The mixing coefficient πt,k gives the a priori probability of class k at time t and can be interpreted as the class proportions. Next, we formally describe how we model the temporal drift of class prototypes.

Dynamics Model. We model the evolution of the K prototypes Wt = (wt,1, ...
, wt,K) with K independent Markov processes. The resulting transition model is

p(Wt | Wt−1, ψtrans) = ∏_{k=1}^K p(wt,k | wt−1,k, ψtrans),   (1)

where ψtrans denotes the parameters of the transition density. At each time step, the feature vectors Ht are generated by a mixture distribution over the K classes,

p(Ht | Wt, ψems) = ∏_{n=1}^{Nt} Σ_{k=1}^K πt,k p(ht,n | wt,k, ψems),   (2)

where ψems are the emission parameters. We thus assume at each time step a standard mixture model over the K classes, where the class prototype wt,k defines the latent class center and πt,k the mixture weight for class k. The joint distribution of representations, prototypes and class assignments can be factorised as follows,

p(H1:T, W1:T, C1:T) = p(W1) ∏_{t=2}^T p(Wt | Wt−1, ψtrans) ∏_{t=1}^T p(Ct) p(Ht | Wt, Ct, ψems).   (3)

We use the notation H1:T = {Ht}_{t=1}^T to denote the representation vectors Ht for all time steps T, and analogously for W1:T and C1:T. A plate diagram of the probabilistic model is depicted in App. B. We next outline how we infer the latent class prototypes W1:T.

Posterior Inference & Adapted Predictions. The primary goal is to update the class prototypes Wt with the information obtained from the Nt representations at test time t. At each test time t, we are thus interested in the posterior distribution of the prototypes p(Wt | H1:t). Once p(Wt | H1:t) is known, we can update the classification weights with the new posterior mean. We can fit the probabilistic model and infer the posterior distribution over the class weights Wt and class assignments Ct with the Expectation-Maximization (EM) algorithm. In the E-step, we compute the posterior p(W1:T, C1:T | H1:T). In the M-step, we compute the expectation of the complete-data log-likelihood (Eqn.
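The generative story of Eqns. (1)-(3) can be sketched by ancestral sampling. The snippet below is an illustrative toy version that picks a Gaussian random-walk transition and isotropic Gaussian emissions (one admissible choice of ψtrans and ψems), with uniform mixing weights for simplicity; all names and parameter values are our own:

```python
import numpy as np

def sample_ssm(T=5, K=3, D=2, N=20, q=0.1, r=0.3, seed=0):
    """Ancestral sampling: prototypes follow a random walk (Eqn. (1)),
    each batch is a mixture around the current prototypes (Eqn. (2))."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((K, D))                 # W_1 ~ p(W_1)
    prototypes, batches = [], []
    for t in range(T):
        if t > 0:                                    # transition step
            W = W + q * rng.standard_normal((K, D))
        c = rng.integers(0, K, size=N)               # c_{t,n} ~ Cat(pi_t), uniform here
        H = W[c] + r * rng.standard_normal((N, D))   # emission around prototypes
        prototypes.append(W.copy())
        batches.append(H)
    return prototypes, batches

prototypes, batches = sample_ssm()
```

Inference in STAD runs this story in reverse: given only `batches`, it recovers a posterior over `prototypes`.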
(3)) with respect to this posterior and then maximize the resulting expression with respect to the model parameters ϕ:

ϕ* = arg max_ϕ E_{p(W,C|H)}[log p(H1:T, W1:T, C1:T)],   (4)

where ϕ comprises the parameters of the transition and emission densities as well as the mixing coefficients, ϕ = {ψtrans, ψems, π1:T}, and we abbreviate the posterior distribution by p(W, C | H) := p(W1:T, C1:T | H1:T). After one optimization step, we collect the K class prototypes into a matrix Wt. Using the same hidden representations used to fit Wt, we generate the predictions via the model's softmax parameterization,

yt,n ∼ Cat(yt,n; softmax(Wt^T ht,n)),   (5)

where yt,n denotes a prediction sampled for the representation vector ht,n. Note that adaptation can be performed online by optimizing Eqn. (4) incrementally, considering only data up to time t. To avoid computing the complete-data log-likelihood for an ever-growing sequence as time goes on, we employ a sliding-window approach. Algorithm 1 outlines the overall procedure of our method. The specific EM updates depend on the chosen parametric form of the SSM. We consider two instances: a Gaussian model and a hyperspherical model based on the von Mises-Fisher distribution. The corresponding EM steps for these two cases are detailed in Algorithm 2 and Algorithm 3, respectively.

Algorithm 1 STAD
1: Input: source model fθ, test batches X1:T, sliding window size s
2: Initialize: mixing coefficients πt, weights Wt, transition and emission parameters ψtrans, ψems
3: for t ∈ T do
4:   Define sliding window: St = {τ | max(1, t − s) ≤ τ ≤ t}
5:   Compute representations: Ht = f^{L−1}_θ(Xt)
6:   Fit probabilistic SSM via EM: Wt, Ct, πt, ψtrans, ψems = EM({Hτ, Wτ, Cτ, πτ}_{τ∈St}, ψtrans, ψems)
7:   Predict: yt,n ∼ Cat(yt,n; softmax(Wt^T ht,n))

Gaussian Model. The simplest parametric choice for the transition and emission models is Gaussian.
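To illustrate the flavor of Algorithm 1, here is a deliberately simplified, self-contained EM loop (our own sketch, not the paper's implementation): isotropic Gaussian emissions with a crude transition prior that shrinks each new prototype towards its previous value, run online over a drifting toy stream. `tau` (assumed emission precision) and `rho` (assumed prior weight) are hyperparameters of this sketch only:

```python
import numpy as np

def em_step(H, W, tau=1.0, rho=0.5):
    """One E/M step: soft assignments to current prototypes (E-step),
    then prototype means blended with their previous values via a
    simple transition prior (M-step)."""
    logits = -tau * ((H[:, None, :] - W[None, :, :]) ** 2).sum(-1)  # (N, K)
    logits -= logits.max(axis=1, keepdims=True)
    R = np.exp(logits)
    R /= R.sum(axis=1, keepdims=True)                # responsibilities
    Nk = R.sum(axis=0)[:, None]
    W_new = (R.T @ H + rho * W) / (Nk + rho)         # data term + prior pull
    return W_new, R

# Online adaptation over a stream whose cluster means drift rightward.
rng = np.random.default_rng(0)
W = np.array([[-2.0, 0.0], [2.0, 0.0]])              # source-model prototypes
for t in range(5):
    shift = 0.2 * (t + 1)
    H = np.concatenate([rng.normal([-2 + shift, 0], 0.3, (16, 2)),
                        rng.normal([2 + shift, 0], 0.3, (16, 2))])
    for _ in range(3):                               # a few EM iterations per batch
        W, R = em_step(H, W)
```

After the loop, both prototypes have tracked the rightward drift of their clusters, which is exactly the behavior the SSM formalizes with proper transition and emission densities.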
The resulting probabilistic SSM can be interpreted as a mixture of K Kalman filters (KFs) (Kalman, 1960). Owing to the linear-Gaussian structure, the posterior expectation E_{p(W,C|H)}[·] in Eqn. (4) can be computed in closed form using the standard KF prediction, update, and smoothing equations (Calabrese & Paninski, 2011; Bishop & Nasrabadi, 2006). The complete model specification is provided in App. B.1, and the corresponding EM updates are summarized in Algorithm 2. However, these closed-form computations come at a cost: they require matrix inversions of size D × D and the storage of K·D² parameters. This makes the Gaussian formulation expensive for high-dimensional features and impractical in low-resource settings. Next, we present a model for spherical features that circumvents these limitations.

3.3 Von Mises-Fisher Model for Hyperspherical Features

Figure 2: STAD-vMF: Representations lie on the unit sphere. STAD adapts to the distribution shift induced by changing demographics and styles by directing the last-layer weights wt,k towards the representations Ht.

Choosing Gaussian densities for the transition and emission models, as discussed above, assumes the representation space follows a Euclidean geometry. However, prior work has shown that assuming the hidden representations lie on the unit hypersphere results in a better inductive bias for OOD generalization (Mettes et al., 2019; Bai et al., 2024). This is because the norms of the representations are biased by in-domain information such as class balance, making angular distances a more reliable signal of class membership in the presence of distribution shift (Mettes et al., 2019; Bai et al., 2024).
We too employ the hyperspherical assumption by normalizing the hidden representations such that ||h||2 = 1 and model them with the von Mises-Fisher (vMF) distribution (Mardia & Jupp, 2009),

vMF(h; µk, κ) = CD(κ) exp(κ µk^T h),   (6)

where µk ∈ R^D with ||µk||2 = 1 denotes the mean direction of class k, κ ∈ R+ the concentration parameter, and CD(κ) the normalization constant. High values of κ imply larger concentration around µk. The vMF distribution is proportional to a Gaussian distribution with isotropic variance restricted to unit-norm vectors. While previous work (Mettes et al., 2019; Ming et al., 2023; Bai et al., 2024) has explored training objectives that encourage representations to be vMF-distributed, we apply Eqn. (6) to model the evolving representations.

Hyperspherical State-Space Model. Returning to the SSM given above (Eqn. (3)), we specify both the transition (Eqn. (1)) and emission models (Eqn. (2)) as vMF distributions, resulting in a hyperspherical transition model, p(Wt | Wt−1) = ∏_{k=1}^K vMF(wt,k | wt−1,k, κtrans), and a hyperspherical emission model, p(Ht | Wt) = ∏_{n=1}^{Nt} Σ_{k=1}^K πt,k vMF(ht,n | wt,k, κems). The parameter count of the vMF formulation scales linearly with the feature dimension, O(DK), compared to the Gaussian case's O(D²K). Notably, the noise parameters κtrans and κems simplify to scalar values, which reduces memory substantially. Fig. 2 illustrates this STAD-vMF variant.

Posterior Inference. In contrast to the linear-Gaussian case, the vMF distribution is not closed under marginalization. As a result, the posterior distribution required for computing the expectation in Eqn. (4) (the E-step of the EM algorithm) cannot be expressed in closed form. To address this, we adopt a variational EM approach, approximating the posterior p(W, C | H) with a mean-field variational distribution q(W, C) following Gopal & Yang (2014):

q(wt,k) = vMF(·; ρt,k, γt,k), q(ct,n) = Cat(·; λt,n).   (7)
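Because CD(κ) does not depend on µ, the vMF log-density in Eqn. (6) is, up to an additive constant, just a scaled cosine similarity. A two-line sketch of this (our own illustration; the function name is made up):

```python
import numpy as np

def vmf_log_density_unnorm(h, mu, kappa):
    """log vMF(h; mu, kappa) up to the constant log C_D(kappa):
    the density depends on h only through the cosine mu^T h
    (both h and mu are assumed unit-norm)."""
    return kappa * float(mu @ h)

h = np.array([0.6, 0.8])           # unit norm: 0.36 + 0.64 = 1
mu_close = np.array([0.6, 0.8])    # aligned mean direction
mu_far = np.array([-1.0, 0.0])     # nearly opposite direction
lp_close = vmf_log_density_unnorm(h, mu_close, kappa=10.0)
lp_far = vmf_log_density_unnorm(h, mu_far, kappa=10.0)
```

The concentration `kappa` plays the role an inverse variance plays for a Gaussian: larger `kappa` sharpens the density around the mean direction.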
The variational distribution q(W, C) in Eqn. (7) factorizes over t, n and k, and the objective from Eqn. (4) becomes arg max_ϕ E_{q(W,C)}[log p(H1:T, W1:T, C1:T)]. The optimal variational parameters are given by

λt,n,k = πt,k CD(κems) exp(κems E_q[wt,k]^T ht,n) / Σ_{j=1}^K πt,j CD(κems) exp(κems E_q[wt,j]^T ht,n),
γt,k = ||βt,k||, ρt,k = βt,k / γt,k,   (8)

with βt,k = κtrans 1{t>1} E_q[wt−1,k] + κems Σ_{n=1}^{Nt} E_q[ct,n,k] ht,n + κtrans 1{t<T} E_q[wt+1,k].

14: Σtrans = updateSigmaTrans({E[wτ,k wτ,k^T], E[wτ,k wτ−1,k^T], E[wτ−1,k wτ−1,k^T], Ak}_{τ∈St, τ>min St})
15: Σems = updateSigmaEms({hτ,n, E[cτ,n,k], E[wτ,k], E[wτ,k wτ,k^T]}_{τ,n,k})
16: return Wt, Ct, πt, Σtrans, Σems

B.2 Details on STAD-vMF

Complete-data log likelihood. Using the von Mises-Fisher distribution as the hyperspherical transition and emission model, the log of the complete-data likelihood in Eqn. (3) becomes

log p(H1:T, W1:T, C1:T) = Σ_{k=1}^K log p(w1,k)   (25)
  + Σ_{t=1}^T Σ_{n=1}^{Nt} [ log p(ct,n) + Σ_{k=1}^K ct,n,k log p(ht,n | wt,k, κems) ]   (26)
  + Σ_{t=2}^T Σ_{k=1}^K log p(wt,k | wt−1,k, κtrans)   (27)
= Σ_{k=1}^K [ log CD(κ0,k) + κ0,k µ0,k^T w1,k ]   (28)
  + Σ_{t=1}^T Σ_{n=1}^{Nt} Σ_{k=1}^K ct,n,k [ log πt,k + log CD(κems) + κems wt,k^T ht,n ]   (29)
  + Σ_{t=2}^T Σ_{k=1}^K [ log CD(κtrans) + κtrans wt−1,k^T wt,k ]   (30)

where κ0,k and µ0,k denote the prior parameters for t = 1. In practice, we set µ0,k to the source weights and κ0,k = 100 (see App. C).

Variational EM objective. As described in Sec. 3.3, we approximate the posterior p(W1:T, C1:T | H1:T) with a variational distribution q(W1:T, C1:T) of the factorized form

q(W1:T, C1:T) = ∏_{t=1}^T ∏_{k=1}^K q(wt,k) ∏_{n=1}^{Nt} q(ct,n),   (31)

where we parameterize q(wt,k) and q(ct,n) as

q(wt,k) = vMF(·; ρt,k, γt,k), q(ct,n) = Cat(·; λt,n), ∀ t, n, k.   (32)

We obtain the variational EM objective

arg max_ϕ E_q[log p(H1:T, W1:T, C1:T)],   (33)

where E_{q(W1:T,C1:T)} is abbreviated E_q to reduce clutter.

E-step. Taking the expectation of the complete-data log likelihood (Eqn. (25)) with respect to the variational distribution (Eqn.
(31)) gives

E_q[log p(H1:T, W1:T, C1:T)] = Σ_{k=1}^K [ log CD(κ0,k) + κ0,k µ0,k^T E_q[w1,k] ]
  + Σ_{t=1}^T Σ_{n=1}^{Nt} Σ_{k=1}^K E_q[ct,n,k] [ log πt,k + log CD(κems) + κems E_q[wt,k]^T ht,n ]
  + Σ_{t=2}^T Σ_{k=1}^K [ log CD(κtrans) + κtrans E_q[wt−1,k]^T E_q[wt,k] ]   (34)

Solving for the variational parameters, we obtain

λt,n,k = πt,k CD(κems) exp(κems E_q[wt,k]^T ht,n) / Σ_{j=1}^K πt,j CD(κems) exp(κems E_q[wt,j]^T ht,n),   (35)

γt,k = ||βt,k||, ρt,k = βt,k / γt,k,   (36)

βt,k = κtrans 1{t>1} E_q[wt−1,k] + κems Σ_{n=1}^{Nt} E_q[ct,n,k] ht,n + κtrans 1{t<T} E_q[wt+1,k].   (37)
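The E-step updates above can be sketched in a few lines of code. The snippet below is an illustrative single-time-step version (our own function name and concentration values, not the paper's code): the CD(κems) factors cancel between numerator and denominator of λ, so the responsibilities are an exact softmax, and β collects the data term plus the messages from the neighboring time steps.

```python
import numpy as np

def vmf_e_step(H, E_w, pi, w_prev=None, w_next=None, k_ems=10.0, k_trans=50.0):
    """Mean-field update for one time step: responsibilities lam,
    then natural parameter beta, concentration gamma = ||beta|| and
    mean direction rho = beta / gamma. w_prev / w_next stand for
    E_q[w_{t-1,k}] / E_q[w_{t+1,k}] (None at the sequence ends)."""
    logits = k_ems * (H @ E_w.T) + np.log(pi)        # C_D(kappa) cancels
    logits -= logits.max(axis=1, keepdims=True)
    lam = np.exp(logits)
    lam /= lam.sum(axis=1, keepdims=True)            # responsibilities
    beta = k_ems * (lam.T @ H)                       # data term
    if w_prev is not None:
        beta += k_trans * w_prev                     # message from t-1
    if w_next is not None:
        beta += k_trans * w_next                     # message from t+1
    gamma = np.linalg.norm(beta, axis=1, keepdims=True)
    rho = beta / gamma                               # unit mean directions
    return lam, rho, gamma

H = np.array([[0.99, 0.141], [0.1, 0.995]])
H /= np.linalg.norm(H, axis=1, keepdims=True)        # unit-norm features
E_w = np.eye(2)                                      # current E_q[w_k]
pi = np.array([0.5, 0.5])
lam, rho, gamma = vmf_e_step(H, E_w, pi)
```

Each feature is softly assigned to the prototype it is most aligned with, and each updated mean direction is pulled towards its responsible features and temporal neighbors.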